"esProc SPL provides cursor operations, which can operate large CSV files with very simple code. .."

jinxing RaqForum 41 No.
349 View • 5 Months ago

How to operate large CSV files with esProc SPL

esProc SPL provides cursor operations, which can operate large CSV files with very simple code. With slight modifications, it can be converted into parallel computing. It also has a graphical interface, which is much more convenient than Python.
First, go here to download esProc SPL: https://www.esproc.com/download-esproc/
If you don’t want to bother with the source code, the standard edition can be used., download and install it.
Prepare a large CSV file:

Picture1png
Open the esProc IDE, create a new script, write SPL code in cell A1, and read the first 100 entries:

=file("d:/OrdersBig.csv").cursor@tc().fetch(100)

The function cursor represents opening the text file with a cursor, and @ represents the function’s extension options, @t represents the first row being column names, @c represents commas as the separator.
Because it is a large file, loading it all into memory may overflow, so fetch only 100 entries and take a look.
Press ctrl-F9 to execute, and you can see the calculation result data table on the right.

Picture2png

SPL code is written in cells, and after each cell is executed, there will be a value, which can be seen on the right side, which brings great convenience to debugging.

Let's try the calculation. First, count the rows:

	A
1	=file("d:/OrdersBig.csv").cursor@tc()
2	=A1.skip()

The skip function is used to skip N records and return the number of skipped records. When the parameter is empty, all records are skipped.

Picture3png

There are a total of 101730411 records.

Then take a look at the filtering, select records with Amount between 3000 and 4000 and Client containing s:

	A
1	=file("d:/OrdersBig.csv").cursor@tc()
2	=A1.select(Amount>3000 && Amount<=4000 && like@c(Client,"s"))
3	=A1.fetch(100)

The select function is used for conditional filtering, the like function is used for string matching, and * is a wildcard, @c means not case sensitive.

Because there may still be many results, we only take the first 100 entries to see, and the execution result is in A3:

Picture4png

Sorting can also be done, such as sorting in the order of OrderDate and in the reverse order of Amount:

	A
1	=file("d:/OrdersBig.csv").cursor@tc()
2	=A1.sortx(OrderDate,-Amount)
3	=file("d:/result.csv").export@tc(A2)
4	=file("d:/result.csv").cursor@tc().fetch(100)

After sorting, write the results into a new file, then open the new file and retrieve the first 100 entries. The function sortx is used for sorting large files, where - represents reverse order.

After execution, look at the result on the right, it has been sorted.

Picture5png

Then do some complex calculations, group and aggregation:

	A
1	=now()
2	=file("d:/OrdersBig.csv").cursor@tc(OrderDate,Client,SellerID,Amount)
3	=A2.select(year(OrderDate)>=2020 \|\| !Client \|\| to(101,400).contain(SellerID))
4	=A3.groups(year(OrderDate):y,month(OrderDate):m; sum(Amount):amt)
5	=output(interval@s(A1,now()))

SQL programmers must be familiar with the groups function in A4, so we won't go into detail here.

A2 can read only the columns to be used when opening the cursor, which can improve speed. A1 and A5 have also been added here to track the running time and print it on the console:

Picture6png

You can see that the running time is 145 seconds.

Parallel computing can fully utilize current multi-core CPUs and must be tried. Change the above code to parallel computing, simply add an option @m after the cursor function, and leave the rest unchanged:

	A
1	=now()
2	=file("d:/OrdersBig.csv").cursor@tcm(OrderDate,Client,SellerID,Amount)
3	=A2.select(year(OrderDate)>=2020 \|\| !Client \|\| to(101,400).contain(SellerID))
4	=A3.groups(year(OrderDate):y,month(OrderDate):m; sum(Amount):amt)
5	=output(interval@s(A1,now()))

@m represents performing multi-threaded computation according to the parallel options configured in the option.

At the same time turn on this parallel option.

Picture7png
Now execute it again:

Picture8png
Improved to 92 seconds, possibly due to the concurrency limitation of the hard disk, it is not possible to achieve a multiple performance improvement, and the results of running with 2 threads are also similar.
The above are a few basic operations. The official website provides more extensive and in-depth examples of large CSV calculations, and the code is also very simple. It is worth taking a look.