Practice#1: Regular Filtering and Grouping & Aggregation

 

We use orders table as an example to achieve esProc SPL data externalization to speed up regular filtering operation and grouping & aggregation operation.

..

We use SPL’s ETL tool to generate a script to achieve data dump. Find dft.exe in [SPL installation directory]\esProc\bin, run it, and select File -> New ETL. Open Tool -> Connect to Data Source:

..

Click OK and drag the database table to work area:

..

Double-click the data table or click Edit icon to set export options:

..

In Tool -> Data Path, set up the directory where the data file is saved:

..

Click Tool -> Create SPLX File:

..

Export data from the file. Remember to save the newly-created etl file as Q1.etl.

SPL code example 2: The SPL code works to export data from MySQL database and dump it into a bin file (BTX).


A

1

=connect("speed")

2

="d:\\speed\\etl\\"

3

=A1.cursor("SELECT order_id,customer_id,employee_id,employee_name,order_date,shipper_id,shipping_fee,shipping_address FROM orders")

4

=file(A2+"orders.btx").export@b(A3)

5

=A1.close()

Example 1.1 SQL code for grouping record by employee and computing freight in each group:

select 
    employee_id,
    count(*) as order_count,
    sum(shipping_fee) as total,
    avg(shipping_fee) as average,
    max(order_date) as latest_order_date
from 
    orders
where 
    order_date between '2024-01-01' and '2024-10-31'
    and shipper_id<>1  
    and shipping_fee > 10
group by 
employee_id;

It takes 11 seconds to finish executing the SQL code.

SPL code example 3:


A

1

=now()

2

>sd=date("2024-01-01"),ed=date("2024-10-31")

3

=file("d:/speed/orders.btx").cursor@b(employee_id,shipper_id,shipping_fee,order_date)

4

=A3.select(order_date>=sd && order_date<=ed && shipper_id != 1 && shipping_fee > 10)

5

=A4.groups(employee_id; count(1):order_count, sum(shipping_fee):total_fee, avg(shipping_fee) : average_fee, max(order_date) : latest_order_date)

6

>output("query cost:"/interval@ms(A1,now())/"ms")

As orders table is too large to be wholly read into the memory, A3 uses cursor to read and compute data batch by batch. Note that the cursor retrieves only the required fields so that there will be fewer objects to be generated, memory usage reduced, and performance increased.

In A4, select() function, which is equivalent to SQL WHERE clause, is used for filtering.

A5 groups filtering result set. Though the syntax is different from that in a SQL counterpart, both involve same elements. The part before the semicolon is grouping key, which amounts to SQL GROUP BY, and the part after the semicolon represents aggregate value, which is equivalent to the aggregation operation in SQL SELECT. The SPL grouping operation, by default, combines the grouping key and the aggregate values into the result set, whereas SQL requires to writing the grouping key again in the SELECT statement.

It takes 2.2 seconds to finish executing the SPL script. Yet the BTX file of columnar storage format cannot fully tap SPL’s performance potential.

Edit orders table using the ETL tool:

..

SPL example export code 4: Dump data to a column-wise stored composite table (CTX):


A

1

=connect("speed")

2

="d:\\speed\\etl\\"

3

=A1.cursor("SELECT order_id,customer_id,employee_id,employee_name,order_date,shipper_id,shipping_fee,shipping_address FROM orders")

4

=file(A2+"orders.ctx").create@y(order_id,customer_id,employee_id,employee_name,order_date,shipper_id,shipping_fee,shipping_address).append(A3).close()

5

=A1.close()

A CTX file is by default stores data column-wise. It is suitable for dealing with cases where total number of fields is large but where only a small number fields is involved in the computing process.

Specifying data structure is required for creating a CTX file. The process is a little bit more complicated than that of creating a BTX file.

SPL example code 5: Execute code under example 1.1 through a CTX file.


A

1

=now()

2

>sd=date("2024-01-01"),ed=date("2024-10-31")

3

=file("d:/speed/orders.ctx").open().cursor (employee_id,shipper_id,shipping_fee,order_date)

4

=A3.select(order_date>=sd && order_date<=ed && shipper_id != 1 && shipping_fee > 10)

5

=A4.groups(employee_id;count(1):order_count,sum(shipping_fee):total_fee,avg(shipping_fee):average_fee,max(order_date):latest_order_date)

6

>output("query cost:"/interval@ms(A1,now())/"ms")

The code is basically similar to that with a BTX file, except that A3 has different code for generating a cursor. With CTX, you need to first open the composite table object before creating the cursor.

It takes 1.9 seconds to finish executing the code, which is faster than it takes to execute code with the BTX.

Next let’s use the CTX cursor filtering technique to further speed up the computation: attach filtering condition (s) to the cursor; read values of the field(s) written in the conditions, and do not retrieve the other fields if they cannot make conditions true, or continue to retrieve the other fields if they make the conditions true and create the current record.

In this way disk reads can be reduced, unnecessary objects can be avoided, and performance is increased.

SPL code example 6:


A

1

=now()

2

>sd=date("2024-01-01"),ed=date("2024-10-31")

3

=file("d:/speed/orders.ctx").open().cursor(employee_id,shipping_fee,order_date; order_date>=sd && order_date<=ed && shipper_id != 1 && shipping_fee > 10)

4

=A4.groups(employee_id;count(1):order_count,sum(shipping_fee):total_fee,avg(shipping_fee):average_fee, max(order_date) : latest_order_date)

5

=output("query cost:"/interval@ms(A1,now())/"ms")

In A3, when cursor is used to retrieve data, orders.ctx’s order_date, shipper_id and d shipper_fee fields will first be retrieved to compute the condition expression. If the condition does not hold, do not retrieve the rest of the fields, such as employee_id; if the condition holds, the other required fields will be retrieved and the record is created.

It takes 1.8 seconds to finish executing the cursor filtering algorithm.

Continue to use the parallel processing to improve performance. Writing parallel code in SPL is convenient. You just need to set the number of parallel threads (here is 8), which should be consistent with the number of CPU cores.

..

SPL code example 7: Perform parallel computation on BTX.


A

1

=now()

2

>sd=date("2024-01-01"),ed=date("2024-10-31")

3

=file("d:/speed/orders.btx").cursor@bm(employee_id,shipper_id,shipping_fee,order_date)

4

=A3.select(order_date>=sd && order_date<=ed && shipper_id != 1 && shipping_fee > 10)

5

=A4.groups(employee_id; count(1):order_count, sum(shipping_fee):total_fee, avg(shipping_fee) : average_fee, max(order_date) : latest_order_date)

6

>output("query cost:"/interval@ms(A1,now())/"ms")

Just add @m option to cursor() function. It takes 0.6 second to finish executing the script.

SPL code example 8: Perform parallel computation on CTX.


A

1

=now()

2

>sd=date("2024-01-01"),ed=date("2024-10-31")

3

=file("d:/speed/orders.ctx").open().cursor@m(employee_id,shipping_fee,order_date; order_date>=sd && order_date<=ed && shipper_id != 1 && shipping_fee > 10)

4

=A4.groups(employee_id;count(1):order_count,sum(shipping_fee):total_fee,avg(shipping_fee):average_fee, max(order_date) : latest_order_date)

5

=output("query cost:"/interval@ms(A1,now())/"ms")

Also use the @m option in cursor() function. It takes 0.5 second to finish executing the script.

Performance summary (unit: second):


MYSQL

BTX

CTX

Serial

11

2.2

1.9

Parallel

11

0.6

0.5

It appears that MySQL’s parallel computing performance is relatively low. Even after the parallel processing parameters are set, the performance isn’t noticeably improved. But this isn’t our focus in this article, we just skip it.

Without specific explanations, examples in all later tests use the 8-threads parallel processing.

Note that SPL file storage technique has its specific application scenario. As it involves data export, it is more suitable for handling the static historical data. In fact, the scenario is commonly seen.

Try to do the following exercises:

1. Group data by customer and compute freight for each customer according to a filtering condition involving both order_date and employee_id.

2. Export a relatively large table from one of your familiar test databases and generate a BTX file and a CTX file. Try handling the previous computations using the BTX file and the CTX file.