"It is convenient to access files with SPL. The language can also deal with various file formats. .."

lisongbo RaqForum 82 No.
0 View • 1 Days ago

SPL Lightweight Multisource Mixed Computation Practices #2: Querying CSV/XLS and other files

It is convenient to access files with SPL. The language can also deal with various file formats.

Computing use case

Computing goal

Find the total amount under each type of order status in the year 2024 based on orders1.csv (with title row).

SPL syntax

Write the script

	A
1	=T("orders1.csv")
2	=A1.select(year(order_date)==2024)
3	=A2.groups(order_status;sum(total_amount))

A1: Retrieve data from the csv file. T() function retrieves content of the file according to the extension and returns a table sequence loaded wholly into the memory.

A2: Perform filtering on data of the year 2024.

A3: Perform grouping & aggregation by order status.

Some csv files do not have the title row. How should we handle them?

For example, to achieve the preceding computing goal based on the title row-absent orders2.csv, the SPL script is as follows:

	A
1	=T@b("orders2.csv")
2	=A1.select(year(_3)==2024)
3	=A2.groups(_10:status;sum(_8):amt)

A1: @b option means the file to be handle is without a title row. In this case, field names will be represented by _1 _2 _3… after data is retrieved.

A2: Perform filtering on data of the year 2024 by column _3 (the 3^rd column).

The subsequent code is similar.

SPL also supports using a column number for retrieving data from this column:

	A
1	=T@b("orders2.csv")
2	=A1.select(year(#3)==2024)
3	=A2.groups(#10:status;sum(#8):amt)

Here column number (represented by #), instead of column name, is used for computation.

SQL syntax

Though native SPL syntax is capable of handling various computing scenarios, sometimes those who are accustomed to working with SQL still want to query files with the familiar language. SPL understands this customer need and provides SQL syntax.

The preceding computing goal, for example, can be achieved using SQL.

With the title row:

$select order_status,sum(total_amount) tamt from orders1.csv where year(order_date)=2024 group by order_status

The SQL statement can be executed directly based on the csv file, or any of the various other data sources, such as XLS, MongoDB, Restful and JSON.

Handling a csv file without title row is a little bit complicated. First, we need to use SPL to retrieve data from the file and then perform the query in SQL:

$select _10 order_status,sum(_8) tamt from {T@b("orders2.csv")} where year(_3)=2024 group by _10

The expression enclosed by braces ({T@b(“orders2.csv”)}) is SPL syntax. As field names in the result it returns are in the form of _1 ,_2…, SQL also uses field names in such form.

At present SPL supports SQL92 syntax, which covers WITH clause, but it does not support window functions. The support suffices because native SPL syntax is capable of handling various complex computing scenarios.

Here are some SQL query examples.

Date handling:

$select * from d:/Orders.csv where (OrderDate<date('2020-01-01') and Amount<=100)or (OrderDate>=date('2020-12-31') and Amount>100)

Case when：

$select case year(OrderDate) when 2010 then 'this year' when 2010 then 'last year' else 'previous years' end from d:/Orders.csv

Group by …having：

$select Client,year(OrderDate) y,sum(Amount) amt from d:/Orders.csv

group by Client,year(OrderDate) having sum(Amount)>1000

JOIN：

$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Orders.csv o

left join d:/Emp.csv e on o.SellerId=e.Eid

Handling Excel files

A similar method is used to handle Excel files. Look at an example. There is Excel file orders.xls. We want to retrieve the 2^nd sheet (with titles) and the 3^rd sheet (without titles) and find from each VIP customers whose order amounts are above 500.

We write all the code in a same script:


1	// Excel file with titles, SPL syntax
2	=T("orders.xls";"orders1")
3	=A2.groups(customer_id;sum(total_amount):amount)
4	=A3.select(amount>500)
5	//SQL syntax
6	$select customer_id,sum(total_amount) amount from {T("orders.xls";"orders1")} group by customer_id having sum(total_amount)>500
7
8	// Excel file without titles, SPL syntax
9	=T@b("orders.xls";"orders2")
10	=A9.groups(_2:customer_id;sum(_8):amount)
11	=A10.select(amount>500)
12	//SQL syntax
13	$select _2 customer_id,sum(_8) amount from {T@b("orders.xls";"orders2")} group by _2 having sum(8)>500

If you are familiar with SQL, use the language to handle simple computing problems and use native SPL syntax to deal with complicated ones. You can also use the two together to solve one computing problem. With SQL plus SPL, there are no computing problems that you cannot manage.

SPL Official Website 👉 https://www.esproc.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.esproc.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/sxd59A8F2W

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

lisongbo • 0 View • 1 Days ago

SPL Lightweight Multisource Mixed Computation Practices #2: Querying CSV/XLS and other files

Computing use case

SPL syntax

SQL syntax

Handling Excel files

ToC