A Field Guide to Querying and Report Computations with SPL - Basic regular operations

 

Basic regular operations

These tasks refer to basic operations such as queries, sorting, joins, grouping & aggregation, which can help you understand the basics of SPL. They are designed for beginners to get started quickly with SPL, while users already familiar with SPL may skip this section.

Example 1: List all unique departments

Data source: Employee table

Target: Remove duplicate Dept field values and retain the unique ones only.

Picture1png

SPL code:


A

1

$select * from Employee.txt

2

=A1.id(Dept)

A1: Load data: use simple SQL to query Employee.txt and generate a SPL table sequence.

Picture3png

l Learning point: Table sequence

A table sequence is the fundamental data structure in SPL, similar to an SQL result set, as both are sets of multiple records. Their main difference is that an SQL result set is unordered, with no inherent sequence numbers for the records, whereas a table sequence is inherently ordered, with natural sequence numbers for its records. When no confusion arises, a table sequence is often simply referred to as a “table”. For example, calling A1 “table sequence Employee” or “table Employee” means the same thing.

In the above screenshot, A1’s Index column is the sequence number column, represented by#in the code.A1.(#)is used to retrieve the sequence number column (a set of sequence numbers): [1,2,3,….]

A1.(Name) gets Name column: ["Ashley","Rachel","Emily",…]

Table sequences make it easy to perform computations related to sequence numbers. For simple tasks like retrieving a record by its sequence number, SPL uses the syntax(N), similar to therownum=Nsyntax in SQL. For example, A1(2)is used in SPL to retrieve the second Employee record.

Picture4png
Another example is to get the first N records, which is written as to(N) in SPL, similar to the top()function in SQL. For example, A1.to(3) is used in SPL to get the first 3 records.

Picture5png
A2: Use id() function to perform distinct operation on the Dept field.

Picture6png

l Learning point: distinct operation

SPL providesid()function to remove duplicate field values, similar to theDISTINCTsyntax in SQL. Just as SQL’sGROUP BYcan achieve deduplication, SPL’s grouping & aggregation functiongroups()can also remove duplicates. Therefore, A2 can also be written asA1.groups(Dept).

To perform distinct based on two fields, State and Dept, write it as A1.groups(Dept,State).

Picture7png

Example 2: Add a computed column pointing to Employee records to Orders table

Data source: Orders table, Employee table

Target: Add a new field named newField to Orders table. The field’s values are corresponding records of the Employee table.

Picture8png

SPL code:


A

1

$select * from Orders.txt

2

$select * from Employee.txt

3

=A1.derive(A2.select(EId==SellerId):newField)

A1, A2: Load data.

A3: Add a computed column named newField. Traverse each Orders record to get the Employee record matching SellerId, and make it value of the current computed column. select()function performs a filtering operation; derive() function adds a computed column and returns a new table sequence.

Picture9png

Note that field values of A3’s table sequence can be records (references), which is different from a SQL result set.

l Learning point: Field values of a table sequence are generic

Field values of a table sequence are generic, which means they can be either simple types or references to records or other table sequences. This requires pre-establishing associations between tables, thereby reusing the associations or simplifying complex join computations. In contrast, the field values of an SQL result set can only be simple types and cannot reference records or other result sets.

For example, you can use A3’s association for further computations, like counting orders and summing order amounts of every client in each department:

=A3.groups(newField.Dept, Client; sum(Amount), count(1))

Note that table sequence Orders is referenced through variable name A1 and table sequence Employee is referenced through variable name A2. This differs from SQL result sets.

l Learning point: Table sequences are explicit sets

Table sequences are explicit sets that can be directly referenced by variables, simplifying complex multi-step computations. In contrast, SQL result sets are not explicit sets and cannot be directly referenced by variables – they can only be referenced as temporary tables using theFROMclause.

Example 3: Select certain orders and increase their amounts by 10%

Data source: Orders table

Target: Get order records of March 2022 and increase their amounts by 10%, without affecting other orders.

Picture10png

SPL code:


A

1

$select * from Orders.txt

2

=A1.select(string(OrderDate,"yyyy-MM")=="2022-03")

3

=A2.run(Amount=Amount*1.1)

A1: Load data.

Picture11png
A2: Select order data of the specified month. select() function performs a filtering operation.

Picture12png

l Learning point: filtering operation

Theselectfunction is used for conditional filtering, and its basic functionality covers theSQL WHEREclause. The logical operators include&&(AND),||(OR),!(NOT), etc., and the equality symbol in comparison operators is==.

A3: Modify data in A2 to increase Amount value by 10%. The run() function modifies data and returns the modified data. Below is A3’s result set, the modified A2:

Picture13png
Look at A1 and you will find that only order records of the specified month are modified.

Picture14png

Once A2 is modified, corresponding records in A1 are also changed, indicating a certain relationship between A2 and A1. A1 is the table sequence introduced earlier, and A2 is a reference to records in A1, referred to as a record sequence.

l Learning point: Record sequence

General operations on table sequences, such as sorting (sort), filtering (select), grouping (group), and joining (join), typically do not generate new table sequences. Instead, they derive a subset of references to records of the original table sequence, which is referred to as the record sequence. To simplify the description, the record sequence’s references to records are often simply called records, and the record sequence itself is simply referred to as a subset of the table sequence. In contrast, operations that alter the structure of a table sequence or a record sequence, such as creating a new table sequence (create), appending fields or computed columns (derive), and performing grouped aggregation (groups), generally generate new table sequences.

Picture15png

When modifying a record sequence, what is actually being modified is a subset of the table sequence, such as the records in A1 where OrderID = [10, 11, 12]. Correspondingly, the complement of this subset remains unchanged.

Example 4: Compute the total order amount and order count per department per year

Data source: Orders table, Employee table

Target: Join Orders table and Employee table, group the joining result set by year and department, compute the total order amount and order count in each group, and then sort records by year in ascending order and order amount in descending order. Orders without assigned salespeople (departments) should also be included in the summarization.

Picture16png

SPL code:


A

B

1

$select * from Orders.txt

$select * from Employee.txt

2

=join@1(A1:ord,SellerId; B1:emp,EId)


3

=A2.groups(year(ord.OrderDate):y,emp.Dept;sum(ord.Amount):amt, count(1):cnt)

4

=A3.sort(y,amt:-1)

A1: Load data.

A2: Usejoin()function to perform a left join between Orders table and Employee table. The result is a new two-field table sequence, where the field values are references to the order records and employee records. Two of these records are shown below:

Picture17png

l Learning point: Association/Join

The join() function establishes an association between tables, similar to the SQL JOIN clause, which by default performs an inner join. @ is used to specify function options, where the following symbols or numbers indicate different extended functionalities. For example,@1represents a left join.

A3: Group records by year and department, and compute the total order amount and order count in each group.

Picture18png

l Learning point: Grouping & aggregation

SPL offers groups() function to perform grouping & aggregation and returns table sequence as the result. The function is similar to SQL GROUP BY syntax.

A4: Use sort() function to sort records by year in ascending order and by order amount in descending order.

Picture19png

l Learning point: Sorting operation

SPL provides sort()function to perform sorting operation, similar to SQL ORDER BY syntax. It sorts data, by default, in ascending order, and uses “:-1” to sort in descending order. If null values are not considered, sorting by the negative field values is equivalent to sorting in descending order. Therefore, A4 can also be written as=A3.sort(y, -amt).

Example 5: Respectively get records of employees in the specified states, those whose salaries are in the specified range, and those that are the intersection of the two

Data source: Employee table

Picture20png

SPL code:


A

1

$select * from Employee.txt

2

=A1.select(State=="New York" || State=="Texas")

3

=A1.select(Salary>=5000 && Salary<=10000)

4

=A2 ^ A3

A1: Load data.

A2: Get Employee records where State is New York or Texas. When performing the multi-value matching, you can also use thecontain()function in SPL, written as A1.select(["New York","Texas"].contain(State)).

Picture21png
A3: Get Employee records where the salary is within the range 5000-10000. You can also use SPL between function to perform the range matching, written as A1.select(between(Salary ,5000 :10000)).

Picture22png
A4: Compute intersection of A2 and A3.

Picture23png
l Learning point: set operations
SPL provides set operators, including ^ for intersection, & for union, \ for difference, and | for union all. The corresponding functions are isect(), union(), diff(), and conj(). Since SPL table sequences support explicit sets, there is no need to recompute to obtain those sets computed previously. This is different from SQL.