"When data analysis involves different business systems, cross-database computations become neces .."

lisongbo RaqForum 82 No.
1 Reply • 9 View • 1 Months ago

How to Implement JOIN Operations Between Tables from Different Databases with esProc

When data analysis involves different business systems, cross-database computations become necessary. Of these computations, JOIN between tables are particularly challenging, as many databases lack this capability. Although Java can retrieve data and perform calculations, it is too complex. esProc offers a much simpler solution for cross-database JOIN operations.

Data and use cases

A vehicle management system (DB_Vehicle) stores information about vehicles and their owners. The simplified structure for the owner_info table is shown below:

The primary key owner_id is the ID number of vehicle owners.

The simplified structure for the vehicle_master table is as follows:

The vin is designated as the primary key, and because plate_no is also unique, both fields can logically be treated as primary keys.

A traffic management system (DB_Traffic) stores vehicle traffic information. The structure of the traffic_violation table is as follows:

A citizen information system (DB_CitizenEvent) stores citizen-related information. The structure of the citizen_event table is as follows:

The logical relationships between these four tables can be described as follows:

Logically, the citizen_event and owner_info have a many-to-one relationship. As a dimension table, owner_info is significantly smaller than citizen_event.

The traffic_violation and vehicle_master tables also exhibit a many-to-one relationship. Both tables can be quite large. From the perspective of vehicle_master, traffic_violation resembles a sub table, forming a primary-sub relationship (plate_no is the logical primary key of vehicle_master).

Why distinguish table relationships?

In daily practice, equality joins typically involve primary keys (many-to-many associations generally have little business significance). These joins can be broadly divided into two types: one is dimension table associations, where a common field of one table associates with the primary key of a dimension table (e.g., citizen_event and owner_info); the other is associations between a table’s primary key and another table’s primary key or part of primary key (e.g., vehicle_master and traffic_violation, where plate_no and violation_id of the traffic_violation table can be treated as a composite primary key, i.e., violation_id is subordinate to plate_no).

When performing join operations, esProc chooses different association methods based on specific association scenario, simplifying coding while also improving computing efficiency. For now, just keep this in mind, and let’s look at specific examples.

Let’s do these calculations:

1. Calculate the number of events by city for citizens who own vehicles over the past year, to analyze the “behavioral activity level” of vehicle owners in each city.

2. Find the names and event descriptions of vehicle owners who have received commendations (“Commendation”) in the past year, to identify “outstanding citizens”.

3. Count the number of traffic violations by year and vehicle brand, to analyze whether certain brands are more prone to violations and to research the relationship between driving behavior and vehicle brands.

If these tables were in the same database, writing these operations in SQL would not be difficult, roughly as follows:

1. SELECT   o.city, COUNT(e.event_id) AS event_count
FROM     citizen_event e
JOIN     owner_info o ON e.citizen_id = o.citizen_id
WHERE    e.event_time >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
GROUP BY o.city
ORDER BY event_count DESC;

2. SELECT   o.name AS citizen_name, e.description
FROM     citizen_event e
JOIN     owner_info o ON e.citizen_id = o.citizen_id
WHERE    e.event_type = 'Commendation'
  AND    e.event_time >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR);

3. SELECT   YEAR(v.violate_time) AS year, vi.brand, COUNT(v.violation_id) AS violation_count
FROM     traffic_violation v
JOIN     vehicle_info vi ON v.plate_no = vi.plate_no
GROUP BY YEAR(v.violate_time), vi.brand
ORDER BY year, violation_count DESC;

However, if cross-database operations are involved, it becomes much more complicated.

Installing esProc

First, download esProc Standard Edition at https://www.esproc.com/download-esproc/.

After installation, configure the database connections. All three databases here are MySQL.

Begin by placing the MySQL JDBC driver package in the directory [esProc installation directory]\common\jdbc (similar setup for other databases).

Then, start the esProc IDE. From the menu bar, select Tool > Connect to Data Source and configure a standard MySQL JDBC connection.

Configure all three databases using the same method. After configuration, test the connections by clicking “Connect”. If the three newly configured data sources turn pink, the connections are successful.

Test the configuration by pressing Ctrl+F9 to execute the script. If the data is queried successfully, the configuration is correct.

Implementation of use cases

Now, let’s implement the first calculation requirement: Calculate the number of events by city for citizens who own vehicles over the past year. This requires associating owner_info and citizen_event tables, which is a dimension table association calculation.

Dimension table associations

Implement in esProc:

	A
1	=connect("vehicle")
2	=A1.query@x("select * from owner_info").keys@i(owner_id)
3	=connect("citizen")
4	=A3.query@x("select * from citizen_event where event_time >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)")
5	=A4.switch(citizen_id,A2)
6	=A5.groups(citizen_id.reg_city;count(event_id):ent)

A2 queries the vehicle database for owner information. query@x loads all data into memory and then closes the database connection. keys@i sets the primary key and creates an index. Since the fact tables are typically much larger than dimension tables, this index will be reused many times, significantly speeding up calculations.

A4 queries the event table to filter for data from the past year, and loads all the resulting data into memory.

A5 uses switch for foreign key association. Because the foreign key points to unique dimension table records, the switch directly transforms the association field citizen_id to the corresponding record from A2 (in memory, it actually stores the address of the dimension table record).

This transformation is a one-time operation, and the resulting data can be reused repeatedly. Moreover, it can simultaneously handle foreign key associations for multiple dimension tables. After association, any dimension table field can be referenced using the ‘association field. dimension field’ syntax. For instance, A6 uses citizen_id.reg_city to obtain registration cities for grouping and aggregation.

The overall execution is as follows:

Next, let’s find the names and event descriptions of vehicle owners who have received commendations in the past year.

Add the following based on the previous code:

	A
…	…
7	=A5.select(event_type=="Commendation").new(citizen_id.name,description)

This calculation is still based on the association results from A5, thereby achieving reuse.

Here, we want to point out that many databases inherently lack the ability to perform cross-database associations, especially for heterogeneous databases. esProc’s association capability is data-source independent, supporting any database and even other diverse data sources. This is its first advantage. The second is that esProc’s clear distinction of foreign key associations also offers significant benefits even compared to single-database associations.

In terms of writing and understanding, the dot (.) operator (such as object.attribute) allows referencing all fields of the foreign key table, regardless of how many layers of dimension tables exist (dimension tables may have their own dimension tables). It also easily expresses self-associations or recursive associations.

When the citizen_event table contains a large amount of data, esProc can still handle it. However, when the data exceeds available memory, the in-memory addressization approach does not work, because precomputed addresses cannot be stored in external storage. In this case, data must be read and addressizd gradually.

Calculate the number of events by city for citizens who own vehicles:

	A
1	=connect("vehicle")
2	=A1.query@x("select * from owner_info").keys@i(owner_id)
3	=connect("citizen")
4	=A3.cursor@x("select * from citizen_event")
5	=A4.switch(citizen_id,A2)
6	=A5.groups(citizen_id.reg_city;count(event_id):ent)

This code is largely the same as the in-memory approach, except that A4 uses cursor to create a cursor to read data in batches. esProc’s cursors are delayed cursors; calculations on the cursor are deferred until the data is retrieved.

However, cursors are single-use. If you need to perform other calculations—for example, identifying vehicle owners who have received commendations—relying on the initial calculation in A5 will not yield any results (note the calculation results of A7).

In this case, esProc’s channel mechanism can be utilized:

	A	B
1	=connect("dba")
2	=A1.cursor@x("SELECT * FROM orders WHERE EXTRACT(YEAR FROM order_date) = EXTRACT(YEAR FROM CURRENT_DATE) ORDER BY customer_id")
3	=connect("dbc")
4	=A3.query@x("SELECT * FROM customer").keys@i(customer_id)
5	=A2.switch(customer_id,A4)
6	cursor A5	=A6.groups(customer_id.customer_level;count(1):order_count)
7	cursor	=A7.group(customer_id).select(~.len()>1).conj().id(customer_id.customer_name)

A6 and A7 create channels based on A5 (A7 uses a shorthand notation). B6 performs grouping and aggregation on the channel and returns the results to A6:

B7 filters for vehicle owners who have received commendations based on another channel. The results of A are shown below:

Primary-subtable associations

Count the number of traffic violations by year and vehicle brand:

	A
1	=connect("vehicle")
2	=A1.query@x("select * from vehicle_master")
3	=connect("traffic")
4	=A3.query@x("select * from traffic_violation")
5	=join(A2:v,plate_no;A4:t,plate_no)
6	=A5.groups(year(t.violate_time),v.brand;count(1):cnt)

A5 uses the join function to associate the two tables based on plate_no. The association result is:

It’s a multi-layered set, retaining all records from both tables. Click to expand for details:

With the association complete, A6 can perform grouping and aggregation through multi-level referencing.

When handling primary-subtable associations, we use the join function, which is different from the switch function used for foreign key associations. The join function offers several options, such as @1 for left join, @f for full join, and @d for difference operations, to meet diverse join requirements. In fact, foreign key associations can also be established using the join function.

So why not use join consistently?

This is because we’ve only discussed two-table associations so far. If there are multiple dimension tables —a common scenario—, switch can attach all dimension tables (which may themselves have dimension tables) to the fact table. However, it is difficult to express these hierarchical relationships with join, and the coding is also inconvenient.

When handling primary-subtable associations, both tables may be quite large. By leveraging the characteristics that the association fields are primary keys (or part of the primary key), an ordered merge algorithm can be employed to complete the association with just one traversal.

Count the number of traffic violations by year and vehicle brand:

	A
1	=connect("vehicle")
2	=A1.cursor@x("select * from vehicle_master order by plate_no")
3	=connect("traffic")
4	=A3.cursor@x("select * from traffic_violation order by plate_no")
5	=joinx(A2:v,plate_no;A4:t,plate_no)
6	=A5.groups(year(t.violate_time),v.brand;count(1):cnt)

A2 and A4 employ the cursor function to create cursors, where the SQLs inside both sort by plate_no.

A5 uses joinx for ordered merging; the result is still a cursor. The rest of the code is identical to that used in the in-memory approach.

Ordered traversal leverages the ordered property of the association keys, making it suitable only for primary-subtable associations (which can be ordered by the primary key) and unsuitable for foreign key associations with dimension tables, as discussed earlier. The reason is that a single table can have multiple foreign keys participating in associations, making it impossible to sort the table by multiple fields simultaneously. Consequently, different functions (algorithms) are employed after differentiating the join types.

To summarize, esProc not only makes cross-database associations easy but also offers a variety of implementation algorithms for different association scenarios. Simply by choosing the right algorithm, you can dramatically improve both coding efficiency and calculation speed.

SPL Official Website 👉 https://www.esproc.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.esproc.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/sxd59A8F2W

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

lisongbo • 9 View • 1 Months ago