lisongbo RaqForum 82 No.
1 Reply • 23 View • 4 Months ago

Besides True Parallelism and Big Data, esProc SPL’s Conciseness Leaves Python in the Dust

Python, with its concise syntax and rich libraries, offers a significantly simpler way to do data computation than Java, and is even more convenient than SQL, which explains its immense popularity in data analysis.

However, the emergence of esProc SPL may disrupt this ranking.

A typical example is big data processing. When the full data set does not fit in memory, even a common aggregation or filtering operation takes roughly ten lines of Python code. Sorting or grouping is more complex still: the code can grow to hundreds of lines, involving many functions and methods as well as temporary file handling, a level of complexity beyond most data analysts. The reason is that Python does not provide a cursor natively; when data exceeds memory capacity, programmers have to split the data themselves, which leads to verbose, messy code written with great effort.
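As a rough illustration of the chunk-by-chunk style described above, here is a minimal sketch using the chunksize parameter of pandas read_csv. The sample data, the function name chunked_sum_and_filter, and the tiny chunk size are assumptions made for the demo; a real file would be passed by path with a much larger chunk size.

```python
import io
import pandas as pd

# Hypothetical stand-in for a tab-separated file that is too large for memory.
SAMPLE = "area\tamount\nEast\t500\nWest\t1500\nEast\t2500\nWest\t700\n"

def chunked_sum_and_filter(source, chunksize=2, threshold=1000):
    """Stream the file in chunks; return (total amount, rows with amount >= threshold)."""
    total = 0
    kept = []
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        total += chunk["amount"].sum()                    # running aggregate
        kept.append(chunk[chunk["amount"] >= threshold])  # per-chunk filter
    # Assumes the filtered rows fit in memory; otherwise they would have to be
    # written out chunk by chunk instead of concatenated.
    return total, pd.concat(kept, ignore_index=True)

total, big = chunked_sum_and_filter(io.StringIO(SAMPLE))
print(total)
print(big)
```

The caller has to manage the loop, the running total, and the partial results by hand, which is the bookkeeping the paragraph above refers to.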

In contrast, SPL is much simpler. Because SPL has a built-in cursor data type, an aggregation or filtering operation can be done in a single line:

file("huge.txt").cursor@t().total(sum(amount))
file("huge.txt").cursor@t().select(amount>=1000)

Even complex operations like sorting and grouping that are difficult to implement in Python can still be done with SPL in just one line:

file("huge.txt").cursor@t().sortx(area)
file("huge.txt").cursor@t().groups(area;sum(amount))
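For comparison, the manual Python approach to sorting data that exceeds memory usually means writing sorted runs to temporary files and merging them. The following is a minimal sketch under simplifying assumptions: it sorts whole text lines rather than a named field, the function name external_sort is hypothetical, and run_size is kept tiny so the demo actually spills to several files.

```python
import heapq
import os
import tempfile
from itertools import islice

def external_sort(lines, run_size=2):
    """Sort an iterable of text lines using sorted runs spilled to temp files."""
    run_paths = []
    it = iter(lines)
    while True:
        run = sorted(islice(it, run_size))  # one run that fits in memory
        if not run:
            break
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in run)
        run_paths.append(path)
    files = [open(p) for p in run_paths]
    try:
        # heapq.merge lazily merges the already-sorted runs line by line.
        for line in heapq.merge(*files):
            yield line.rstrip("\n")
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

result = list(external_sort(["West", "East", "North", "South", "Central"]))
print(result)
```

Even this stripped-down version needs run management, temp file cleanup, and a merge step; sorting by a field, handling headers, and tuning run sizes would add more code.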

SPL cursors do more than load data in batches; they also incorporate optimization techniques such as indexing and blocking to keep computation efficient even on large data volumes.

Parallel computing is essential for big data processing. Python doesn’t offer a true multi-threaded parallel mechanism: the global interpreter lock (GIL) in the standard CPython interpreter lets only one thread execute Python bytecode at a time, so threads leave multi-core CPUs largely unused, and resorting to multi-process programs is cumbersome. In contrast, SPL reduces parallel computing to a single option. Just add @m to enable it:

file("huge.txt").cursor@tm().groups(area;sum(amount))

SPL offers true parallel computing. Under the hood, it automatically splits the task and distributes the pieces according to the number of CPU cores; each core independently processes a portion of the data, and the partial results are then merged, which takes full advantage of multiple cores. The entire process is transparent to users and requires no manual intervention from programmers.

SPL also offers its own high-performance storage, employing mechanisms like binary format, compression, and columnar storage to significantly improve data read efficiency. Moreover, the storage can be designed flexibly around the computation goals, for example with data stored in order by specified fields or with appropriate redundancy. This ensures both high efficiency and flexibility.

With the SPL cursor and its rich computational support, data analysts can confidently tackle big data challenges. This already puts SPL several steps ahead of Python. Combined with SPL’s simple parallel computing and its own high-performance storage, SPL leaves Python far behind.

Even for non-big-data scenarios, SPL remains more concise than Python.

For common, simple calculations, SPL and Python are similar, with no significant difference. For example, to get the three highest-paid employees:

employee.top(-3;salary)  //SPL
employee.nlargest(3, 'salary')  # Python

However, once slightly more complex scenarios are involved, the difference becomes obvious. For example, to calculate the top three within each group, SPL retains its concise style:

employee.groups(department;top(-3;salary))

SPL directly treats topN as an aggregation operation and performs direct grouping.

In contrast, Python is considerably more complex:

employee.groupby('department').apply(lambda group: group.nlargest(3, 'salary'))

Not only does it require a combination of apply and lambda, but the style is inconsistent with that of calculating the whole set.
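To make the difference in style concrete, here is a small runnable version of the grouped pandas call. The sample data is invented for illustration. Note that the grouped result comes back with a two-level index (department plus the original row label), whereas the ungrouped nlargest call returns a flat index.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
employee = pd.DataFrame({
    "department": ["A", "A", "A", "A", "B", "B"],
    "salary": [100, 400, 300, 200, 50, 70],
})

# Whole-table top three: flat index.
overall = employee.nlargest(3, "salary")

# Per-group top three: apply + lambda, two-level index.
# Recent pandas versions may emit a warning about grouping columns here.
top3 = employee.groupby("department").apply(
    lambda group: group.nlargest(3, "salary")
)

print(overall.index.nlevels)
print(top3.index.nlevels)
print(top3["salary"].tolist())
```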

Similar inconsistencies in syntax and style are common in Python, which adds extra memorization cost. For example, both login.groupby('user')['time'].min() and login.groupby('user').agg({'time': 'min'}) calculate the minimum value within each group, but they return different objects (the first a Series, the second a DataFrame), and the operations supported afterwards differ too, so extra caution is needed.
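The difference between the two forms is easy to check. A short sketch with made-up login data:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
login = pd.DataFrame({
    "user": ["a", "a", "b"],
    "time": pd.to_datetime(["2025-01-02", "2025-01-01", "2025-01-03"]),
})

as_series = login.groupby("user")["time"].min()
as_frame = login.groupby("user").agg({"time": "min"})

print(type(as_series).__name__)
print(type(as_frame).__name__)

# The values are the same; only the container differs.
same_values = as_series.equals(as_frame["time"])
print(same_values)
```

Code that follows has to know which container it received: for instance, the DataFrame needs a column selection before it can be used where a Series is expected.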

Differences also arise in position-based calculations. For example, to extract the data for the 5th, 10th, 15th, … trading days of 2025, SPL directly uses # to represent position, making it simple and intuitive:

stock.select(year(date)==2025).select(# % 5 == 0)

Python, however, requires a workaround: first filtering, then using reset_index() to re-number, and finally retrieving values based on position:

stock_2025 = stock[stock['date'].dt.year == 2025].reset_index(drop=True)
selected_stock = stock_2025[stock_2025.index % 5 == 4]

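The reset_index(drop=True) step is easy to forget, and forgetting it does not raise an error: the filtered rows keep their original index labels, so the modulo test silently selects the wrong rows. SPL, on the other hand, natively supports sequence numbers, so there’s no need to worry about these details.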
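The effect of omitting reset_index(drop=True) can be seen in a small sketch. The data is synthetic: pd.bdate_range generates weekdays starting in late 2024 as stand-ins for trading days (real market holidays are ignored), and the close column is a placeholder.

```python
import pandas as pd

# Synthetic "trading days": 3 weekdays in 2024 followed by 15 in 2025.
stock = pd.DataFrame({"date": pd.bdate_range("2024-12-27", periods=18)})
stock["close"] = range(100, 118)  # placeholder values

in_2025 = stock[stock["date"].dt.year == 2025]

# Without reset_index: the index still starts at 3, so the test picks wrong rows.
wrong = in_2025[in_2025.index % 5 == 4]

# With reset_index: positions restart at 0, so 4, 9, 14 are the 5th, 10th, 15th days.
renumbered = in_2025.reset_index(drop=True)
right = renumbered[renumbered.index % 5 == 4]

wrong_days = wrong["date"].dt.strftime("%Y-%m-%d").tolist()
right_days = right["date"].dt.strftime("%Y-%m-%d").tolist()
print(wrong_days)  # ['2025-01-02', '2025-01-09', '2025-01-16']
print(right_days)  # ['2025-01-07', '2025-01-14', '2025-01-21']
```

Both versions run without complaint and return three rows, which is why the mistake is easy to miss.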

In addition, when calculating growth rates or moving averages, it’s often necessary to reference adjacent records. SPL directly uses [-1] to retrieve the previous record, making the code natural. For example, calculate the maximum monthly sales growth:

sales.(if(#>1,~-~[-1],0)).max()

Python requires shift() or rolling() to generate a new Series before performing the calculation:

sales.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True).max()
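For reference, the shift() route mentioned above looks like this. A minimal sketch with invented monthly figures, shown next to the rolling() form so the two can be compared:

```python
import pandas as pd

# Hypothetical monthly sales figures.
sales = pd.Series([100, 120, 90, 150, 160])

# rolling(): a two-element window, last minus first.
via_rolling = sales.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True).max()

# shift(): build the "previous month" Series first, then subtract.
previous = sales.shift(1)
via_shift = (sales - previous).max()

print(via_rolling, via_shift)
```

In both forms the first month has no predecessor and yields NaN, which max() skips; the SPL expression instead assigns 0 to the first record explicitly.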

To check whether sales have risen for three consecutive months, SPL remains straightforward:

sales.(~[-2,0].pselect(~<=~[-1])==null)

Python is much more cumbersome, requiring both rolling and lambda:

sales.rolling(window=3).apply(lambda x: (x[0] < x[1] and x[1] < x[2]), raw=True)

Not only is the code complex, but the fixed rolling window is also inflexible.
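A runnable version of the rolling check, with invented figures, shows what the result looks like in practice. The explicit float() conversion and the final cast to bool are choices made for this sketch, not part of the original one-liner.

```python
import pandas as pd

# Hypothetical monthly sales figures.
sales = pd.Series([100, 120, 90, 150, 160, 170])

# 1.0 where the three values in the window are strictly increasing, else 0.0.
rising = sales.rolling(window=3).apply(
    lambda x: float(x[0] < x[1] < x[2]), raw=True
)

# The first two months have no full window and come back as NaN.
flags = rising.fillna(0).astype(bool).tolist()
print(flags)  # [False, False, False, False, True, True]
```

The window length is hard-coded in both the window argument and the lambda body, so changing it from three months to four means editing two places.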

Both Python and SPL support lambda syntax, but SPL is simpler and more direct. For example, to label managers with salaries over 5000, the Python code:

employee['flag'] = employee.apply(lambda row: 'yes' if row['position'] == 'manager' and row['salary'] > 5000 else 'no', axis=1)

SPL doesn’t require lambda keywords and makes lambda syntax implicit, enabling direct coding:

employee.derive(if(position == "manager" && salary > 5000, "yes", "no"):flag)

Code simplicity is sometimes even more important than performance. In this respect, SPL leaves Python even further behind.

With its cursor calculations, true parallel processing, and proprietary high-performance storage, SPL effortlessly handles massive data without struggling with complex code. SPL truly deserves the praise “It’s awesome!”

SPL Official Website 👉 https://www.esproc.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.esproc.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/sxd59A8F2W

Youtube 👉 https://www.youtube.com/@esProc_SPL