How to Speed Up Conditional Filtering on Enumerated Fields with esProc

 

Enumerated fields in data tables have a limited number of predefined values. Filtering conditions on the enumerated field f are often expressed as f =v1 or f =v2 or ...; f !=v1 and f !=v2 and ...; in; or not in. The database has to compare f with n values. When the data table is large, the number of comparisons will be very high, leading to poor performance, and the performance worsens as n increases.

If comparisons can be avoided during filtering, performance will naturally improve significantly. esProcs aligned sequence mechanism can achieve this effect.

The following example uses the orders table to demonstrate how to use aligned sequences, and also compares the performance of esProc SPL and MySQL database in calculating conditional filtering on enumerated fields.

In the MySQL database, the orders table is a fact table with 10 million records, storing all orders for 2024. It has oid (order ID) as the primary key and includes fields cid (customer ID), odate (order date), ename (employee name), and amount.

cid and ename are enumerated fields. cid corresponds to the dimension table customer, while ename does not have a corresponding dimension table.

The customer table, with cid as its primary key, includes fields cname (customer name) and ccountry (customer country). It contains only 10,000 records.

Test environment: VMware virtual machine with the following specifications: 8-core CPU, 8GB RAM, SSD. Operating system: Windows 11. MySQL version: 8.0.

First, download esProc at https://www.esproc.com/download-esproc/. The standard version is sufficient.

After installing esProc, test whether the IDE can access the database normally. First, place the JDBC driver of the MySQL database in the directory ‘[Installation Directory]\common\jdbc,’ which is one of esProc’s classpath directories.

..

Establish a MySQL data source in esProc:

....

Return to the data source interface and connect to the data source we just configured. If the data source name turns pink, the configuration is successful.

In the IDE, create a new script, write SPL statements to connect to the database, and execute a SQL query to load data from the customer table.


A

B

1

=connect("mysql16")

2

=A1.query("select * from customer")

Press Ctrl+F9 to execute the script. The result of A2 is displayed on the right side of the IDE, which is very convenient.

..

First, we need to determine the value ranges of the enumerated fields. Since the field cid has a corresponding dimension table, its value range is already determined. For ename, which lacks a dimension table, we must pre-traverse the orders fact table to determine its value range and generate a dimension table.

This process is completed during the data preparation stage, and at the same time, the values of enumerated fields in the fact table are converted to their respective position numbers.


A

1

=connect("mysql16")

2

=A1.query("select * from customer order by cid").keys@i(cid)

3

=file("customer.btx").export@b(A2)

4

=A1.query("select distinct ename from orders order by ename").keys@i(ename)

5

=file("employee.btx").export@b(A4)

6

=A1.cursor("select oid,cid,ename,odate,amount from orders")

7

=A6.derive(A2.pfind(cid):cnum,A4.pfind(ename):enum)

8

=file("orders.ctx").create(oid,cnum,enum,odate,amount,cnum,enum)

9

=A8.append(A7)

10

>A1.close(),A9.close()

A2 retrieves the customer table, with cid as the primary key, to prepare for converting the cid values in the orders table to position numbers later.

A4 retrieves the distinct ename values from the orders table to create a new dimension table employee, with ename as the primary key, to prepare for converting the ename values in the orders table to position numbers later.

A6 creates a cursor for the orders table. A7 converts the cid and ename values to position numbers – that is, it uses the pfind function to find those position numbers from their respective dimension tables– in preparation for filtering calculations.

String fields, such as ename, take up a lot of space, and comparison calculations are slower. Converting them to integers can effectively improve performance.

The ename and cid fields are retained here to compare the performance difference between using and not using aligned sequences.

A8 writes the data converted to position numbers to the order composite table.

With the data prepared, you can now use esProc to speed up conditional filtering on enumerated fields.

Example 1: Find orders where the employee is James or Luke, then group by date and count the orders in each group.

SQL code:

select odate,count(1)
from orders
where 
    ename='James' or ename='Luke'
group by odate;

MySQL took 28 seconds.

esProc employs the aligned sequence mechanism:


A

1

=file("employee.btx").import@b()

2

=A1.(["James","Luke"].contain(ename))

3

=file("orders.ctx").open().cursor@m(odate;A2(enum))

4

=A3.groups(odate;count(1))

A2 loops through the dimension table employee, generating a boolean sequence of the same length. Within the loop, the contain function checks if the set ["James","Luke"] includes the ename value at the current position. If so, the corresponding member in the boolean sequence is assigned true; otherwise, it is assigned false. esProc refers to such a boolean sequence an aligned sequence.

To implement the ‘not in’ logic, simply prefix the expression enclosed by the outermost parentheses in A2 with !.

A3 performs conditional filtering on the cursor for the fact table orders. During data preparation, the enum field values were assigned their corresponding position numbers from the dimension table. Here, the position numbers are used to retrieve the corresponding members from the aligned sequence. If the element is true, the condition is met; otherwise, it is false. This transforms the comparison operation into a position-based retrieval of sequence members, reducing computational complexity.

esProc took 0.5 seconds.

The code not using the aligned sequence mechanism is:


A

1

=file("orders.ctx").open().cursor@m(odate;ename=="James" or ename=="Luke")

2

=A1.groups(odate;count(1))

Execution time: 0.5 seconds.

Because only two employee names are compared and the total data volume is small, the performance difference between string comparison and position-based member retrieval is not significant.

Example 2: Find the orders where the employee names are in a group of 10 (including James). Then, group by date and count.

Writing this example with or would be too verbose, so SQL uses in:

select odate,count(1)
from orders
where 
    ename in ('James', 'Rose', 'Luke', 'Tom',…)
group by odate

MySQL still took 28 seconds.

esProc uses the aligned sequence mechanism:


A

1

=file("employee.btx").import@b()

2

=A1.(["James","Rose","Luke","Tom",…].contain(ename))

3

=file("orders.ctx").open().cursor@m(odate;A2(enum))

4

=A3.groups(odate;count(1))

The code simply adds employee names to be compared based on Example 1.

esProc still took 0.5 seconds.

Code not using the aligned sequence mechanism:


A

1

= ["James","Rose","Luke","Tom",…].sort()

2

=file("orders.ctx").open().cursor@m(odate;A1.contain@b(ename))

3

=A2.groups(odate;count(1))

esProc took 0.8 seconds. As the number of employee names to be compared increases, the performance difference between string comparison and position-based member retrieval becomes obvious.

Example 3: Find the orders where the customer’s country is USA and calculate the total amount.

SQL requires using a subquery:

select sum(amount)
from orders
where 
    cid in (
        select cid
        from customer
        where ccountry='USA')

MySQL again took 28 seconds.

The esProc code using aligned sequence is basically the same as the previous two examples:


A

1

=file("customer.btx").import@b()

2

=A1.(ccountry=="USA")

3

=file("orders.ctx").open().cursor@m(amount;A2(cnum) )

4

=A3.total(sum(amount))

Execution time: 0.5 seconds.

Code not using the aligned sequence mechanism:


A

1

=file("customer.btx").import@b().keys@i(cid)

2

=A1.select@i(ccountry=="USA")

3

=file("orders.ctx").open().cursor@m(amount;cid:A2)

4

=A3.total(sum(amount))

The execution time is also 0.5 seconds. Since the cid field stores integer values, the performance difference between integer comparison and position-based member retrieval is not significant.

Test results:


MYSQL

esProc (using aligned sequence)

esProc (not using aligned sequence)

Example 1

28 sec

0.5 sec

0.5 sec

Example 2

28 sec

0.5 sec

0.8 sec

Example 3

28 sec

0.5 sec

0.5 sec

With a total data volume of ten million, esProc’s aligned sequence significantly improves the performance of conditional filtering on enumerated fields compared to string calculations in Example 2.

As the data volume increases further, the benefits of the aligned sequence will become more pronounced.

When preparing data, exporting data from the database is time-consuming, but since this is a one-time operation, the longer duration is acceptable. When the orders table has incremental data in the future, you only need to append the new data to the composite table on a regular basis.

However, when the data of dimension tables changes, it will be a bit cumbersome, as they require regenerating the btx files. Moreover, because the result of sequence-numberization of the fact table relies on the record order in the dimension tables, any changes to the dimension tables (like adding or deleting records) necessitate a complete regeneration of the fact table, leading to a relatively high maintenance and management cost.

In practice, numerous computational scenarios rely on static historical data, and many conditional filtering calculations on enumerated fields urgently need performance improvements. esProc’s aligned sequences are well-suited to speed up these scenarios.