Essential Adjacent References

 

Data analysis often involves cross-row calculations, such as Link Relative Ratio (LRR), Year-over-Year (YoY) and moving averages. Implementing cross-row calculations on ordered datasets will involve the issue of referencing adjacent members within a set.

For example, a merchant has prepared its sales data for 12 months of a year, arranged by month, and now wants to calculate the maximum monthly growth.

Code in SQL:

with sales as (
    select column_value as amount, rownum as month
    from table(sys.odcinumberlist(123,345,321,345,546,542,874,234,543,983,434,897))
),
lagged_sales as (
    select amount, month, 
        lag(amount) over (order by month) as prev
    from sales
)
select max(amount - prev) as max
from lagged_sales;

Window functions can reference the previous month’s sales, but it requires writing a subquery, making the code verbose. Moreover, SQL is based on unordered sets, resulting in its inability to utilize the original order of the data, and each window function requires a separate order by clause.

In fact, data inherently has an order, and utilizing this order can conveniently express calculation needs. For example, if the data is ordered by month, simply subtracting the previous member from the current member gives the monthly growth. If there were a syntax for referencing adjacent members, solving similar problems would be much easier.

esProc SPL exactly provides a mechanism for referencing adjacent members within a set:

sales = [123, 345, 321, 345, 546, 542, 874, 234, 543, 983, 434, 897]
sales.(if(#>1,~-~[-1],0)).max()

~ represents the current member, # is the current member’s index, and ~[i] represents the member with a distance of i from the current member. Here, ~[-1] represents the previous member, i.e., the previous month’s sales.

Actually, if(#>1,~-~[-1],0) is a lambda function, and the outer sales.() is a loop function. SPL consistently uses symbols like ~, #, and [] to represent the parameters of the lambda function.

SPL’s lambda function appears like an ordinary expression, with a more concise syntax than traditional lambda functions, while providing richer passed-in parameters.

If Python’s Series or DataFrame objects use lambda directly, only the current row can be passed as parameter, as Python lacks syntax for index-based or adjacent-member references. To solve this, it first needs to create a sliding window object using the rolling function.

result = sales.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True).max()

The rolling function creates a window of size 2, and the apply function operates on two adjacent members in the sales sequence, which uses a lambda to calculate their difference.

However, Python adds an extra step of creating a window object, making it slightly cumbersome. Moreover, Python’s lambda functions are explicitly defined, requiring the lambda keyword and the definition of passed-in parameters, which is less concise than SPL.

Alternatively, Python can take a different approach:

sales=pd.Series([123, 345, 321, 345, 546, 542, 874, 234, 543, 983, 434, 897])
result = (sales-sales.shift(1)).max()

The shift function generates a new sequence by shifting the original sequence one position backward, and then calculates the differences with the original sequence through an aligned set operation. This requires creating an additional sequence object. In comparison, SPL’s code, which references adjacent members directly within the original sequence, is simpler.

Python also provides the diff function, which can calculate the difference between adjacent members:

pd.Series([123, 345, 321, 345, 546, 542, 874, 234, 543, 983, 434, 897])
result = sales.diff().max()

The diff function supports the periods parameter, which allows calculating the difference between the current member and the member at a distance of periods. However, diff and similar functions like pct_change are all predefined calculations. For calculations without a ready-made function, it has to continue using rolling or shift.

In contrast, SPL avoids these issues. ~[i] can be used freely in expressions, allowing for various adjacent-member calculations.

In addition to adjacent members, we often need to reference adjacent sets. For example, using the same data, we want to calculate the moving average of sales for each month and the two preceding months.

Code in SQL:

with sales as (
    select column_value as amount, rownum as month
    from table(sys.odcinumberlist(123,345,321,345,546,542,874,234,543,983,434,897))
)
select month,amount,
    avg(amount) over (order by month rows between 2 preceding and current row) as moving_average
from sales;

SQL has the concept of adjacent sets. For example, the three adjacent sales data form a set. However, SQL cannot retain this set, and the data in the set must be aggregated immediately. Therefore, it can only use built-in aggregate functions like avg, sum, and count. If the situation becomes more complex, SQL cannot leverage adjacent sets for calculations.

For example, to check if monthly sales are increasing over the current and two preceding months, a simple approach is to form an ordered set of the three months’ sales and then check if it is an increasing sequence.

SQL lacks an aggregate function to check for increasing sequences, so it must be written like this:

with sales as (
    select column_value as amount, rownum as month
    from table(sys.odcinumberlist(123,345,321,345,546,542,874,234,543,983,434,897))
)
select month,amount,
    amount > lag(amount, 1) over (order by month)  and 
    lag(amount, 1) over (order by month) > lag(amount, 2) over (order by month)   
    as flag
from sales;

As a result, it requires three lag functions and three order by month clauses, making it cumbersome.

SPL extends ~[i] to ~[a,b] to represent adjacent sets. Whether calculating moving averages or checking for increasing sequences, it’s easy to express:

sales = [123, 345, 321, 345, 546, 542, 874, 234, 543, 983, 434, 897]
sales.(~[-2,0].avg())
sales.(~[-2,0].pselect(~<=~[-1])==null)

~[-2,0] represents the adjacent set from the member two positions before the current one to the current member.

To calculate the moving average, apply the avg function to the adjacent set.

To check for an increasing sequence, use the position calculation function pselect to search the adjacent set. If no member is less than or equal to the previous member, the sequence is increasing. In the pselect parameter, ~ and ~[-1] denote the current and previous members in the adjacent set, respectively.

Python also has the concept of adjacent sets, and similar to SQL, simple aggregations such as moving averages can be easily implemented:

sales = pd.Series([123, 345, 321, 345, 546, 542, 874, 234, 543, 983, 434, 897])
result = sales.rolling(window=3, min_periods=1).mean()

However, checking for increasing sequences is much more complex, requiring the use of lambda functions again:

result = sales.rolling(window=3).apply(lambda x: (x[0] < x[1] and x[1] < x[2]), raw=True)

The rolling sliding window has a fixed size, making it very inflexible. For example, to define a window from the first member to the current member, another function, expanding, must be used. To define a window from the current member to the last member, the sequence must be reversed.

SPL, in contrast, uses a unified syntax. ~[,0] represents the set from the first member to the current member, and ~[0,] represents the set from the current member to the last member.

For common structured calculations, SPL simplifies the adjacent reference syntax, allowing direct use of field names. For example, the sales table has fields month, amount, and rate, the code to calculate the monthly amount*rate growth is:

sales.derive(amount*rate-amount[-1]*rate[-1]:increase)

[-1].amount is directly written as amount[-1], making the code very concise.

Python does not simplify this for data tables. The table name appears repeatedly:

sales['increase']=(sales['amount']*sales['rate']).rolling(window=2).apply(lambda x:x[1] - x[0], raw=True)

SPL’s adjacent reference mechanism applies to all sets, including, of course, grouped subsets. For example, the sales table stores each store’s sales data for 12 months of a year, sorted by month. The code to calculate each store’s monthly growth is:

=sales.group(store).(~.derive(amount-amount[-1]:increase))

The group function groups data by store and retains the grouped subsets. ~ refers to the current subset. The adjacent reference syntax for subsets is the same as that for the whole set, and the calculation processes are also identical.

SQL does not support grouped subsets; it needs to add partition by in the window function, which increases complexity:

with monthly_growth as (
    select *,
        ((amount - lag(amount, 1) over (partition by store order by month)) / lag(amount, 1) over (partition by store order by month))  as increase
    from
        sales
)
select *
from monthly_growth
order by store, month;

Python also the concept of grouped subsets. This task can be achieved by combining rolling function with groupby function.

sales['increase'] = sales.groupby('store')['amount'].rolling(window=2).apply(lambda x: x[1] - x[0], raw=True)

The syntax remains largely consistent with the non-grouped case, but it still requires writing a cumbersome lambda function.

In summary, SQL is based on unordered sets, often requiring subqueries and window functions to implement calculations between adjacent members, leading to verbose code. Python supports ordered sets, giving it a much greater advantage in implementing adjacent member references, but inconsistency issues in syntax necessitate selecting different functions depending on the situation, and may sometimes require reversing the set. SPL, with its adjacent reference mechanism based on ordered sets, offers concise and consistent syntax for adjacent calculations, making it the easiest to learn and understand.