"This type of tasks refers to set operations, including intersection, union, difference, concaten .."

jinxing RaqForum 41 No.
9 View • 3 Months ago

A Field Guide to Querying and Report Computations with SPL - 9 Post-grouping set operations

This type of tasks refers to set operations, including intersection, union, difference, concatenation, etc., performed after a grouping operation. SQL does not support explicit sets, making the language only fit for handling set operations in simple contexts. For those with complex contexts, particularly intra- or inter-group set operations, the code becomes difficult to write.

Example 1: Find salespeople who rank top3 every month

Source data: The orders table (9/Orders_TopnByMonth.txt) storing sales amount per salesperson per month in the year 2022.

Target: Group records by month, find the set of top3 salespeople in terms of sales amount in each group, and compute intersection of all such sets.

Picture1png

SPL code:

	A
1	$select month(OrderDate) as Month,SellerId,sum(Amount) as Amount from 9/Orders_TopnByMonth.txt where year(OrderDate)=2022 group by month(OrderDate),SellerId
2	=A1.group(Month)
3	=A2.(~.top(-3; Amount).(SellerId))
4	=A3.isect()

A1: Load data.

A2: Group records by month; each group consists of a set. Below shows the sets of the first two months:

Picture2png
A3: =A2.(~.top(-3; Amount)…)processes each of A2’s group. It first uses top() function to get records， where sales amounts rank top 3 for each group; ~ represents the current group. Here are sets for the first two months:

Picture3png
…(SellerId) then retrieves SelleriId field values, that is, a set of salespeople, from each group of records. Below shows the sets of salespeople in the first two months:

Picture4png
A4: Compute intersection of members of A3. The isect()function computes intersection of subsets in a set. Similar functions are union(), diff(), and conj().

Picture5png

Example 2: Copy a table’s order total record to the end of each year’s subtotal records

Source data: An orders table-based, two-dimensional table made up of each year’s subtotal per month and total, where the year field and month field of the total record is null and where the record count is not fixed.

Target: Generate data for report creation: keep the order of subtotal records, copy the total record to the end of each year’s subtotal records, and change the total record’s year to the corresponding subtotal record’s year.

Picture6png

SPL code:

	A
1	$ with detail as (select year(OrderDate) year, month(OrderDate) month,sum(Amount) amt, count(1) cnt from Orders.txt group by year(OrderDate) , month(OrderDate) ) select * from {create(year,month,amt,cnt).record([null,null,"total amt","total cnt"])} union select null as year, null as month,sum(amt) amt, sum(cnt) cnt from detail union select * from detail
2	=A1.select(!year)
3	=(A1\A2).group@u(year)
4	=A3.conj(~\|A2.new(A3.year,month,amt,cnt))

A1: Load data.

A2: Perform filtering to get a set of total records. The filtering condition !year in select() function is equivalent to year==null.

Picture7png
A3: Use the diff operation to get details records, and then group them by year while maintaining their order. The group() function works with @u option to retain the original order of members when performing grouping. Below shows details of the first group of records:

Picture8png
A1\A2 computes difference between two sets of records, A1 and A2. Similar operators are ^ (intersection), union (&), and concatenation (|). The diff()function can compute difference, too, so A3 can be written as =[A1,A2].diff().group@u(year).
 Learning point: Set operators and set functions
Both operators and function can be used to perform set operations. But, the computing objects of a set operator can only be two sets while a set function has more and a wider range of computing objects, which can be two or more sets or larges sets whose members are also sets.

A4: =A3.(~|A2.new(A3.year,month,amt,cnt)) processes data in each group (year): first generate a new table sequence based on A2’s set of total records and then concatenate with the current group. Note that the new table sequence’s year field will be changed to the year field of the current group. Operator | computes concatenation and can be replaced by conj() function. Here are details of the first group:

Picture9png
A3.conj(…) concatenates members of all groups. The conj() function concatenates subsets of A3’s set by combining members of subsets in order.

Picture10png

Example 3: Compute the current inventory based on the batch inbound and outbound stock tables

Source data: 9/Stock_BatchIn.txt, the batch inbound stock table, or the original stock table, stores multi-batch inbound information of various materials. Each batch records the start and end numbers of a continuous interval for the material. For example, the third record indicates that the chair’s inbound interval is [801:1200], totaling 399. 9/Stock_BatchOut.txt, the batch outbound stock table, or consumption table, stores multi-batch outbound information of various materials. Its field descriptions are similar to those in the batch inbound stock table. Each batch records the start and end numbers of a continuous interval for the outbound material. For example, the third record indicates that the outbound interval of chairs (coming from the original stock interval [801:1200]) is [850:1100].

Target: Compute the current inventory based on the batch inbound and outbound stock tables, and represent it by the start and end numbers of continuous intervals for material in the current inventory. Note that the intervals in the inbound table may be consumed into multiple discontinuous segments. In such cases, multiple records are automatically generated and each corresponds to a continuous interval. For example, after the outbound operation of [850:1100] from the original stock interval [801:1200], it becomes two continuous intervals: [801:849] and [1101:1200] in the current inventory.

Picture11png

SPL code:

	A	B
1	$select * from 9/Stock_BatchIn.txt	$select * from 9/Stock_BatchOut.txt
2	=A1.group(Supplie; ~.conj(to(StartNum,EndNum)):ini)	=B1.group(Supplie; ~.conj(to(StartNum,EndNum)):used)
3	=A2.join(Supplie,B2,used)
4	=A3.derive([ini, used].merge@d().group@i(~!=~[-1]+1):current)
5	=A4.news(current; Supplie, ~1:StartNum, ~.m(-1):EndNum)

A1, A2: Load data.

A2: The group()function categorizes records of the original stock) by material without aggregation. Each interval is transformed to a small set of continuous sequences, and those small sets are then concatenated into an ordered, discontinuous, large set. ~ represents the current group. The to()function generates a continuous sequence according to the specified start and end numbers. The conj() function performs concatenation on subsets of a set.

Picture12png
B2: Perform same operations on the outbound stock table.

Picture13png
A3: Use join() function to perform a left join between the inbound stock table and the outbound stock table according to the material.

Picture14png
A4: =A3.derive([ini, used].merge@d()…) adds a new computed column. It finds the difference between each type of material’s inbound records and its outbound records. The merge()function merges members of an ordered set and returns an ordered result. @d option enables performing the difference operation during the merge; it is an optimized function of the difference operation. Note that the sequence returned by the difference operation is ordered but discontinuous. For example, the continuous original stock sequence […238,239,240,241…] becomes […238,239,500,501…] after the outbound operation.

Picture15png
..group@i(_!=[-1]+1):current performs conditional grouping on each difference and put continuous sequences to the same group. For example, […238,239] and [500,501..] are put into the first group and the second group respectively. The group()function performs grouping operation; it, by default, performs an equi-grouping. @i option enables grouping by the specified condition; and ~[-1] represents the preceding member.

Picture16png
A5: Generate a new record using every sequence of the “current” field in A4’s each record. The start number and the end number of the new interval are obtained from the start and end of each sequence. The news()function expands every member of a set into a record. ~.m(-1) denotes the last member of ~; and ~.m(1), abbreviated as ~1, is the first member of ~.