SPL: The Professional Choice for Processing Multi-Layered JSON Data
Business data often exhibits hierarchical relationships. For example, an order contains date, customer, and order details, with order details further containing price, quantity and product, and product encompassing name and category.
JSON’s nestable structure is well-suited for describing such relationships. For example, here’s the order data:
[ { "order_id": 10248,"order_date": "2022-07-04","freight_rate": 2,"weight": 5,
"order_details": [
{ "price": 14.0,"quantity": 12.0,
"product": {"product_id": 17,"name": "beef","category":"Food"},
},
{ "price": 9.0,"quantity": 10.0
"product": {"product_id": 42,"name": "rice","category":"Food"}
}
],
"customer": {
"customer_id": 87,"name": "abc inc",
"city": "New York","phone_number": "26471510"
}
},
{ "order_id": 10249,"order_date": "2012-07-05","freight_rate": 10,"weight": 3,
"order_details": [
{ "price": 18.0,"quantity": 9.0
"product": {"product_id": 14,"name": "Computer","category": "Electronic"},
}
],
"customer": {
"customer_id": 991,"name": "bcd.com",
"city": "Los Angeles","phone_number": "(0251) 1031259"
}
}
...]
An order has both regular attributes “order ID, date, freight rate, weight” and nested attributes “order details, customer”. Order details, in turn, contain the nested attribute “product.”
For this nestable structure, the common data organization method is a multi-layer nested table, where fields can be tables or records. When calculating, tables or records at any layer are treated as objects, enabling consistent operations.
However, SQL inherently supports only flat tables and cannot implement this nested structure. Some new analytical databases partially support nested structures, including JSON data processing.
For example, to retrieve order_id, order_date, and the name and city of the nested attribute customer from the order data, the DuckDB code would be:
select order_id, order_date, customer.name, customer.city
from read_json_auto('orders.json')
It treats the single-row nested field customer as a record, extracting its fields using record-style access. However, this method only works for single-row nested fields and doesn’t support multi-row order_details—attempting this would simply return null.
For multi-row nested fields, DuckDB can perform calculations after expanding them. For example, to calculate the order amount by aggregating quantity*price from order_details based on the previous query, the code would be:
select order_id,any_value(order_date),
any_value(customer.name),any_value(customer.city),
sum(od.quantity * od.price) as amount
from read_json_auto('orders.json') as o,
lateral unnest(o.order_details) as t(od)
group by order_id;
First, expand order_details, then implicitly join with orders to form a large flat table, and finally group by order ID and aggregate, resulting in a convoluted code. This approach is less efficient than converting to a physical flat table, as that would eliminate the ‘expanding’ step during calculations.
DuckDB can also retain nested structure and use lambda syntax to calculate:
select order_id, order_date,customer.name, customer.city,
list_sum( list_transform(o.order_details,x -> x.price * x.quantity) ) as amount
from read_json_auto('orders.json') as o
It needs to use two special functions and write explicit lambda syntax, which is a bit cumbersome.
In contrast, performing similar calculations on ordinary data tables is simpler. For example, the code to aggregate the product of freight_rate*weight is:
select sum(freight_rate*weight) from read_json_auto('orders.json')
The syntax is completely inconsistent.
In fact, DuckDB treats multi-row nested fields as arrays rather than data tables. This causes inconsistency in operations and makes the code difficult to write and read.
esProc SPL completely implements the multi-layer nested table mechanism mentioned earlier, enabling the most concise code:
orders=json(file("orders.json").read())
orders.new(order_id,order_date,customer.name,customer.city,order_details.sum(price*quantity):amount)
Compare this to the code for calculating total freight rate:
orders.sum(freight_rate*weight)
The syntax is completely consistent!
The order_details field, like orders, is a table object, and the calculation methods are consistent. With such concise and consistent syntax, SPL can be considered the most professional tool for handling nested data structures.
Python also supports multi-layered structures. For single-row nested fields, calculations can be done as follows:
orders = pd.read_json('orders.json')
result_orders = orders[['order_id', 'order_date']].assign(
customer_name=orders['customer'].str['name'],
customer_city=orders['customer'].str['city']
)
Accessing customer fields requires an additional str function, resulting in less concise syntax compared to DuckDB and SPL. Additionally, the str function only supports simple types that can be converted to strings and cannot handle multi-row nested fields.
Python can also expand multi-row nested fields for calculation:
orders = pd.read_json('orders.json')
exploded_orders = orders.explode('order_details')
exploded_orders = pd.concat([exploded_orders,exploded_orders['order_details'].apply(pd.Series)], axis=1)
exploded_orders['amount'] = exploded_orders['price'] * exploded_orders['quantity']
result = exploded_orders.groupby('order_id').agg({'order_date':'first','amount': 'sum'}).reset_index()
final_result = result[['order_id', 'order_date', 'amount']]
First, vertically expand order_details, then horizontally expand the fields of each row to form a flat table, and finally perform grouping and aggregation. This approach is also quite convoluted. Handling multi-row nested structures requires two expansion steps, making it more cumbersome than DuckDB.
Python can also retain nested structure and use lambda syntax to calculate:
orders = pd.read_json('orders.json')
orders['amount'] = orders['order_details'].apply(lambda details:sum(item['price'] * item['quantity'] for item in details))
result=orders[['order_id','order_date','amount']]
Python employs explicit lambda syntax, which requires writing lambda keywords and defining parameters, making it somewhat cumbersome. Moreover, the lambda function necessitates an explicit for loop, an equivalent of a double nested loop, increasing the difficulty of understanding. Although the for loop here is a simplified notation, it still requires defining the loop variable item, which is still cumbersome.
Let’s compare the code for calculating freight rate:
total_freight=sum(orders['freight_rate']*orders['weight'])
The syntax is also completely inconsistent.
order_details is not a DataFrame object like orders, but rather an array-like list.
In slightly more complex scenarios, the problems caused by the syntax inconsistencies in DuckDB and Python will be more obvious. For example, find the orders with a total amount greater than 200 and containing Electronics products, then retrieve the order_id and order_date.
If DuckDB adopts the approach of expanding nested fields, the SQL becomes cumbersome. You must first expand order_details and product, then group and aggregate to calculate the order amount, and finally filter the qualifying orders based on the result. This requires multiple layers of subqueries or CTEs to keep the integrity of data structure. The resulting SQL will be very verbose and hard to debug. Here, we omit the SQL and only provide the lambda code:
select order_id, order_date
from read_json_auto('orders.json')
where
list_sum(list_transform(order_details,x -> x.price * x.quantity)) > 200 and
array_length(list_filter(order_details, x -> x.product.category = 'Electronics')) > 0;
This code uses three special functions and two explicit lambda functions. Compared to regular SQL, this code is complex and difficult to understand.
The Python code using the expanded approach is also verbose. We only show the lambda code:
orders = pd.read_json('orders.json')
final_orders = orders[
orders['order_details'].apply(
lambda x: any(item['product']['category'] == 'Electronics' for item in x) and
sum(item['price'] * item['quantity'] for item in x) >= 200
)]
result = final_orders[['order_id', 'order_date']]
This code uses a large apply loop with explicit lambda syntax, containing two nested for loops and requiring explicit loop variables, making the code hard to comprehend.
SPL, by contrast, uses basic operation functions without requiring explicit lambda syntax or explicit loops, resulting in the most concise and easily understandable code:
orders=json(file("orders.json").read())
orders.select(order_details.select@1(product.category=="Electronics") && order_details.sum(price*quantity)>200).new(order_id,order_date)
In summary, SQL inherently supports only flat tables and cannot implement nested structures. DuckDB adds support for nested structures, but expanding them before calculation is less convenient than using flat tables directly. When calculating directly on nested structures, operations at different layer are inconsistent, making it cumbersome to read and write the code. Moreover, multi-row nested structure necessitates special functions and explicit lambda syntax, leading to verbose code. Python supports nested structures, but it exhibits similar issues as DuckDB. In SPL, table fields can themselves be tables. This consistent data organization, from top to bottom in the nested structure, ensures that everything is an object with uniform referencing methods. As a result, the calculation code is concise and easy to understand, making SPL the most professional language for multi-layer nested structure calculations.
SPL Official Website 👉 https://www.esproc.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.esproc.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version