Is This a Refreshing Way to Understand Association?

 

The definition of association in SQL is overly simplistic: association is essentially performing a Cartesian product of two tables and then filtering the result, expressed in syntax like A JOIN B ON .... Python’s approach to association largely follows the SQL approach, with similar concepts and methods.

However, esProc SPL understands association differently, with a definition that is no longer related to the Cartesian product.

SPL divides association into two categories. One is called foreign key association, which refers to the association between a regular field (foreign key) of one table and the primary key of the other.

For example, the foreign keys employee_id and customer_id of the orders table are associated with the primary keys of the customers and employees tables, respectively. Now we want to find orders placed by New York customers and handled by US employees.

Note that the primary key we are talking about here is logical—a field with unique value in a table—not necessarily a physical primary key created on the table. Similarly, the foreign key is not necessarily a physical foreign key.

In SQL, this would be written as:

select o.order_id, o.order_date, o.employee_id, o.customer_id
from orders o
    join employees e on o.employee_id = e.employee_id
    join customers c on o.customer_id = c.customer_id
where c.city = 'New York' and e.nationality = 'US';

Both associations use the JOIN.

In Python, this would be written as:

merged=orders.merge(employees,on='employee_id').merge(customers,on='customer_id')
filter_result = merged[(merged['city'] == 'New York') & (merged['nationality'] == 'US')]
result = filter_result[['order_id', 'order_date', 'employee_id', 'customer_id']]

Both associations use the MERGE.

SPL, in contrast, treats the foreign key as an object. Since the foreign key is the primary key in the associated table, it’s unique. This means the foreign key can uniquely correspond to a record (object) in the associated table, enabling it to directly represent that record. Consequently, the fields of the associated table can be accessed as attributes of this object. For example, the employee_id in the orders table can be understood as a record (object) in the employees table, and the employee_name and nationality fields of this record are attributes of that object.

This problem can be written in SPL as:

orders.switch(employee_id,employee:employee_id;customer_id,customer:customer_id)
result = orders.select(customer_id.city == "New York" && employee_id.nationality=="US")

The switch() function converts the employee_id and customer_id in the orders table to the corresponding records in the employees and customers tables, making these fields the objects we want.

The subsequent calculation is very simple: retrieve the city attribute of the employee_id object and the nationality attribute of the customer_id object to form the filter condition.

This object-oriented understanding approach is clearly much more natural and intuitive than the approach of Cartesian product followed by filtering, offering a refreshing alternative.

The other category of association in SPL is called primary key association, which refers to the association between the primary key of one table and the primary key (or part of it) of the other.

For example, there is an orders table and an order details table. The order_id field is the primary key of the orders table, whereas the order_id and product_id fields together form the primary key of the order details table. Now we want to retrieve the order id, shipping fee, and order amount. the SQL would be:

select o.order_id,o.shipping_fee,sum(od.amount) as order_amount
from orders o 
    join order_details od on o.order_id = od.order_id
group by o.order_id, o.shipping_fee;

After each JOIN, SQL must explicitly specify the fields and calculate the result set. When the association result is needed again, the JOIN has to be recalculated. For example, we want to further find orders with a quantity of detailed items greater than 5. SQL must write another statement, rewriting the association code:

select o.order_id,o.order_date,count(od.product_id)
from orders o
    join order_details od on o.order_id = od.order_id
group by o.order_id, o.order_date
having count(od.product_id)>5;

SPL, in contrast, regards primary key association as an association between record objects or record sets. To retrieve the order id, shipping fee, and order amount, the code would be:

od_group=order_details.group(order_id)
order_all=join(orders:o,order_id;od_group:od,order_id)
result1=order_all.new(o.order_id,o.order_date,od.sum(amount))

The group() function groups the order details by order_id, obtaining a set of sets od_group, where each member is a subset containing records with the same order_id. Within these subsets, order_id values are unique and serve as the logical primary key.

The join() function finds members with the same primary key from the orders and the od_group. These matching members then compose a new record with two fields, o and od, which is subsequently placed in the resulting table order_all.

Note that SPL’s association operations here aren’t limited to two sets composed of records (i.e., tables); they also allow sets of sets to participate in the association.

This makes primary key association more natural, intuitive, and easier to understand. Moreover, the association result can be reused to perform various subsequent calculations. For example, to further find orders with a quantity of detailed items greater than 5, the existing SPL code can be extended as follows:

result2=order_all.new(o.order_id,o.order_date,od.count(product_id):count).select(count>5)

The advantage of SPL association operation is not obvious when the association result is used only once, but the code becomes significantly simpler when used multiple times.

Python can also reuse the association result. The code is:

merged = pd.merge(orders, order_details, on='order_id')

merged = pd.merge(orders, order_details, on='order_id')
result1=merged.groupby(['order_id','order_date'])['amount'].sum()

grouped=merged.groupby(['order_id','order_date'])['product_id'].count().reset_index(name='product_count')
result2 = grouped[grouped['product_count'] > 5]

The code is also simpler than SQL.

However, Python’s association result, merged, is a wide table formed by concatenating fields from two tables, resulting in poor flexibility. While suitable for simple grouping and aggregation, further calculations become cumbersome when faced with slightly more complex requirements.

For example, to further find orders containing beef products and calculate the order amount, continue writing code in Python:

beef_order_ids = merged[merged['product_name'] == 'beef']['order_id'].unique()
beef_orders = merged[merged['order_id'].isin(beef_order_ids)]
result3=beef_orders.groupby('order_id').agg({'order_date':'first','amount': 'sum'}).reset_index()

Conditional filtering, deduplication, IN-filtering, and subsequent grouping and aggregation are all required, resulting in cumbersome logic and code.

The association result is a wide table, which makes subsequent calculations prone to errors. If programmers do not analyze carefully, they may directly filter for beef product records from the merged wide table and then group and aggregate to obtain the result. However, this approach inadvertently filters out the amounts for other products in those orders, leading to an incorrect result.

SPL’s primary key association yields a nested table structure where the od field is a table object, enabling flexible execution of various subsequent calculations.

We can proceed based on the existing SPL code:

result3=order_all.select(od.select@1(product_name=="beef")).new(o.order_id,o.order_date,od.sum(amount))

Where od.select@1 retrieves the first record from the od table that satisfies the condition. If not found, it returns null (equivalent to false), and the record composed of o and od will be filtered out by order_all.select.

In fact, SPL association organizes the data to conform to business relationships, thereby facilitating the further development of concise code for complex computations.

Moreover, SPL’s association operation is less error-prone. Here, the od field is treated as a whole. When filtering with order_all.select, the od field is either fully retained or fully discarded, avoiding partial filtering of its detail records. This eliminates errors similar to those in Python.

There are also some association scenarios where SQL and Python are prone to errors. For example: consider the task of associating an orders table with both order details and payment details tables to identify the order_id and order_date for those orders where the cumulative payment amount is less than the total order amount.

If a careful analysis is absent, SQL may write the following code based on the previous code for associating orders with order details:

select o.order_id,o.order_date
from orders o
    join order_details od on o.order_id = od.order_id
    join payment_details pd on o.order_id = pd.order_id
group by o.order_id,o.order_date
having sum(pd.payment_amount) < sum(od.amount);

However, this calculation yields incorrect results because there’s a many-to-many association between payment details and order details. For example, if an order has two order details records and one payment detail record, the JOIN result will produce two records for that order, causing the payment amount to appear twice, and the aggregation result will of course be wrong.

To obtain correct result, SQL requires subqueries to first group the data and then perform the association:

select o.order_id,o.order_date
from orders o
join (
    select order_id, sum(amount) as o_amount
    from order_details
    group by order_id
) od on o.order_id = od.order_id
join (
    select order_id, sum(payment_amount) as p_amount
    from payment_details
    group by order_id
) pd on o.order_id = pd.order_id
where pd.p_amount < od.o_amount;

Python might also incorrectly write code that first associates the three tables and then groups:

joined_orders_details = orders.merge(order_details, on='order_id')
joined_all = joined_orders_details.merge(payment_details, on='order_id')
grouped = joined_all.groupby('order_id').agg({
      'order_date': 'first',
        'amount': 'sum',
        'payment_amount': 'sum'
    })
result = grouped[grouped['payment_amount'] < grouped['amount']][['order_id', 'order_date']]

Here, there is also a many-to-many association between joined_orders_details and payment_details, and the result is also wrong.

If the code is modified to group and aggregate before associating, the calculation result would be correct, but problems would arise when reusing the association result. Since order details and payment details had been aggregated in the association result, detailed data would no longer be accessible, leading to an inability to perform many further calculations.

In SPL’s primary key association operation results, both the payment details set and the order details set are represented as fields, thus avoiding many-to-many association errors:

od_group=order_details.group(order_id)
pd_group=payment_details.group(order_id)
orders_all=join(orders:o,order_id;od_group:od,order_id;pd_group:pd,order_id)
orders_all.select(pd.sum(payment_amount)< od.sum(amount))

Furthermore, subsequent calculations can be performed without any problems when reusing the association result.

In summary, SQL defines association operations as a Cartesian product followed by filtering, which appears simple but fails to reflect the essence of association operations. This approach makes complex scenarios difficult to understand, is error-prone, and the association results are not reusable. Python’s association operations basically follow the SQL approach. While a slight advantage is the reusability of the association result, this result is a wide table, which is inflexible and makes further calculations cumbersome and error-prone. SPL abandons the Cartesian product approach and redefines association operations in a way that truly reflects their essence. This offers a refreshing perspective. SPL’s association results are reusable, nested tables, providing exceptional flexibility, enabling concise code to perform complex calculations while also reducing the error rate.