"EXISTS operation on large primary table and subtable often has low performance. Such a computati .."

chenhu RaqForum 81 No.
13 View • 4 Months ago

Practice #5: EXISTS at Large Primary-Subtable Join

SPL(68)

EXISTS operation on large primary table and subtable often has low performance. Such a computation, in essence, is a primary key-based join. If both the primary table and the subtable are stored by the primary key in order, you can use the order-based merge algorithm to effectively speed up the computation.

esProc SPL can transform the primary-subtable EXISTS operation to an order-based merge so that computing performance can be improved.

For example, take the orders table and details table, let’s look at how SPL speeds up the primary-subtable EXISTS operation through its external data querying speedup method.

First, prepare the data. Here you can directly use the CTX file generated in Practice #4.

Example 5.1 Group details records whose product IDs are 7 by customer ID and count orders in each group.

select o.customer_id, count(o.order_id)
from orders o
where o.order_date >= '2024-12-31'
    and o.order_date < '2025-01-01'
	and exists 
            (select * 
              from details d
              where d.order_id = o.order_id
                          and d.product_id=7)
group by o.customer_id;

It takes 41 seconds to finish executing the code.

SPL code example 23:

	A
1	=now()
2	=file("ordersQ4.ctx").open().cursor@m(order_id,customer_id;order_date>=date(2024,12,31)&&order_date<date(2025,01,01))
3	=file("detailsQ4.ctx").open().cursor(order_id;product_id==7;A2)
4	=A3.group@1(order_id)
5	=joinx(A2:o,order_id;A4:d,order_id)
6	=A5.groups(o.customer_id;count(o.order_id))
7	>output("query cost:"/interval@ms(A1,now())/"ms")

A4 groups order details records where product ID is 7 by order_id in order and only keeps records with the first order_id value. This way grouped subsets will not be generated, bringing better performance.

A5 performs order-based merge on the grouping result set and the orders table. Then A6 groups the merge result set and find the number of orders in each group.

In A3, the last parameter of the cursor() function is A2, which means multithreaded processing. To perform the multithreaded processing, details table will be also segmented by aligning with the orders table, ensuring that the two tables can be merged in order correctly.

It takes 0.2 second to finish executing the SPL script.

Example 5.2 Find the number of order details records with product ID being 6 that do not have corresponding records in orders table.

select count(d.order_id)
from details d
where d.product_id = 6
  and not exists 
       (select * 
        from orders o
        where o.order_id = d.order_id);

It takes 16 seconds to finish executing the code.

SPL code example 24:

	A
1	=now()
2	=file(“ordersQ4.ctx”).open().cursor@m(order_id)
3	=file("detailsQ4.ctx").open().cursor(order_id;product_id==6;A2)
4	=joinx@d(A3,order_id;A2,order_id)
5	=A4.skip()
6	>output("query cost:"/interval@ms(A1,now())/"ms")
7	=now()

In A4, joinx() function works with @d option to filter details table according to orders table and only keeps records whose order_id values do not match any of the records in orders table.

A5 performs COUNT on A4’s cursor to get the result.

It takes 0.2 second to finish executing the SPL script.

Example 5.3 Find orders records (excluding those where product ID is 9) that have more than one corresponding record in details table, group them by order date, and count orders in each group.

select o.order_date,count(distinct o.order_id) 
from orders o
where 
    exists (        
        select 1
        from details d
        where d.order_id = o.order_id
        group by d.order_id
        having count(*) > 1
    )
    and not exists (
        select 1
        from details d
        where d.order_id = o.order_id and d.product_id = 9
);

It takes MySQL 5 minutes to get nothing.

SPL code example 25:

	A
1	=now()
2	=file("ordersQ4.ctx").open().cursor@m(order_id,order_date)
3	=file("detailsQ4.ctx").open().cursor(order_id,product_id;;A2)
4	=A3.group(order_id)
5	=A4.select(~.count(order_id)>1 && !~.pselect(product_id==9))
6	=joinx(A2:o,order_id;A5:d,order_id)
	=A6.groups(o.order_date;count(o.order_id))
	=now()

A4 groups details table by order_id in order.

A5 performs filtering repeatedly on groups one by one, during which ~ represents the current group, and only keeps groups where the number of order_id values is greater than 1 and which do not contain records whose product_id values 9.

It takes 0.9 seconds to finish executing the SPL script.

Performance summary (unit:second):

	MYSQL	SPL
Example 5.1	41	0.2
Example 5.2	16	0.2
Example 5.3	More than 6 mins	0.9

Exercises:

1. Group orders records where product_id value is 8 by date and count orders in each group.

2. Critical thinking: Do you ever encounter large primary-subtable EXISTS operations in your familiar databases? Can you use the order-based merge algorithm to speed up the computations?

SPL Official Website 👉 https://www.esproc.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.esproc.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/sxd59A8F2W

Youtube 👉 https://www.youtube.com/@esProc_SPL

SPL(68)

esProc

chenhu • 13 View • 4 Months ago