SPL Lightweight Multisource Mixed Computation Practices #2: Querying CSV/XLS and other files
It is convenient to access files with SPL. The language can also deal with various file formats.
Computing use case
Computing goal
Find the total amount under each type of order status in the year 2024 based on orders1.csv (with title row).
SPL syntax
Write the script
A |
|
1 |
=T("orders1.csv") |
2 |
=A1.select(year(order_date)==2024) |
3 |
=A2.groups(order_status;sum(total_amount)) |
A1: Retrieve data from the csv file. T() function retrieves content of the file according to the extension and returns a table sequence loaded wholly into the memory.
A2: Perform filtering on data of the year 2024.
A3: Perform grouping & aggregation by order status.
Some csv files do not have the title row. How should we handle them?
For example, to achieve the preceding computing goal based on the title row-absent orders2.csv, the SPL script is as follows:
A |
|
1 |
=T@b("orders2.csv") |
2 |
=A1.select(year(_3)==2024) |
3 |
=A2.groups(_10:status;sum(_8):amt) |
A1: @b option means the file to be handle is without a title row. In this case, field names will be represented by _1 _2 _3… after data is retrieved.
A2: Perform filtering on data of the year 2024 by column _3 (the 3rd column).
The subsequent code is similar.
SPL also supports using a column number for retrieving data from this column:
A |
|
1 |
=T@b("orders2.csv") |
2 |
=A1.select(year(#3)==2024) |
3 |
=A2.groups(#10:status;sum(#8):amt) |
Here column number (represented by #), instead of column name, is used for computation.
SQL syntax
Though native SPL syntax is capable of handling various computing scenarios, sometimes those who are accustomed to working with SQL still want to query files with the familiar language. SPL understands this customer need and provides SQL syntax.
The preceding computing goal, for example, can be achieved using SQL.
With the title row:
$select order_status,sum(total_amount) tamt from orders1.csv where year(order_date)=2024 group by order_status
The SQL statement can be executed directly based on the csv file, or any of the various other data sources, such as XLS, MongoDB, Restful and JSON.
Handling a csv file without title row is a little bit complicated. First, we need to use SPL to retrieve data from the file and then perform the query in SQL:
$select _10 order_status,sum(_8) tamt from {T@b("orders2.csv")} where year(_3)=2024 group by _10
The expression enclosed by braces ({T@b(“orders2.csv”)}) is SPL syntax. As field names in the result it returns are in the form of _1 ,_2…, SQL also uses field names in such form.
At present SPL supports SQL92 syntax, which covers WITH clause, but it does not support window functions. The support suffices because native SPL syntax is capable of handling various complex computing scenarios.
Here are some SQL query examples.
Date handling:
$select * from d:/Orders.csv where (OrderDate<date('2020-01-01') and Amount<=100)or (OrderDate>=date('2020-12-31') and Amount>100)
Case when:
$select case year(OrderDate) when 2010 then 'this year' when 2010 then 'last year' else 'previous years' end from d:/Orders.csv
Group by …having:
$select Client,year(OrderDate) y,sum(Amount) amt from d:/Orders.csv
group by Client,year(OrderDate) having sum(Amount)>1000
JOIN:
$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Orders.csv o
left join d:/Emp.csv e on o.SellerId=e.Eid
Handling Excel files
A similar method is used to handle Excel files. Look at an example. There is Excel file orders.xls. We want to retrieve the 2nd sheet (with titles) and the 3rd sheet (without titles) and find from each VIP customers whose order amounts are above 500.
We write all the code in a same script:
1 |
// Excel file with titles, SPL syntax |
2 |
=T("orders.xls";"orders1") |
3 |
=A2.groups(customer_id;sum(total_amount):amount) |
4 |
=A3.select(amount>500) |
5 |
//SQL syntax |
6 |
$select customer_id,sum(total_amount) amount from {T("orders.xls";"orders1")} group by customer_id having sum(total_amount)>500 |
7 |
|
8 |
// Excel file without titles, SPL syntax |
9 |
=T@b("orders.xls";"orders2") |
10 |
=A9.groups(_2:customer_id;sum(_8):amount) |
11 |
=A10.select(amount>500) |
12 |
//SQL syntax |
13 |
$select _2 customer_id,sum(_8) amount from {T@b("orders.xls";"orders2")} group by _2 having sum(8)>500 |
If you are familiar with SQL, use the language to handle simple computing problems and use native SPL syntax to deal with complicated ones. You can also use the two together to solve one computing problem. With SQL plus SPL, there are no computing problems that you cannot manage.
SPL Official Website 👉 https://www.esproc.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.esproc.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/sxd59A8F2W
Youtube 👉 https://www.youtube.com/@esProc_SPL