The Implicit Lambda Syntax

 

Performing calculations on each member of a dataset is a common task. While achievable with loop statements, it is quite cumbersome; even a simple summation requires multiple lines of code.

Programming languages often encapsulate these operations as functions, such as Python’s sum function. Calculating the total price of orders is written as follows:

total_price = orders['price'].sum()

It can also be written in SQL:

select sum(price) total_price from orders

Of course, there’s no problem with SPL:

total_price = orders.sum(price)

They all appear quite concise.

However, tasks are not always this simple. Let’s see a more complex example: calculate a label column for employees, where managers with salaries above 5000 are labeled ‘yes,’ and all other employees are labeled ‘no.’

This calculation can be described using a relatively uncomplicated expression, but the result cannot be pre-calculated outside the loop. Instead, it must be computed for each member of the set inside the loop. In this case, this expression can be defined as a function, and the function can be passed as a parameter to a function that performs loop calculations. In Python, it can be written as:

def calc_flag(row):
return 'yes' if row['position'] == 'manager' and row['salary'] > 5000 else 'no'

employee['flag']=employee.apply(calc_flag, axis=1)

This code first defines a function, calc_flag, which calculates the expression for the passed-in record row.

The apply function takes calc_flag as a parameter, passing the current member (record) of the set to calc_flag for calculation within the loop, and creates a new sequence with the results.

Clearly, pre-defining a function every time is quite cumbersome, especially for tasks that can be accomplished with such a simple expression. Therefore, the industry invented lambda syntax, which allows defining functions within parameters, resulting in much more concise code:

employee['flag'] = employee.apply(lambda row: 'yes' if row['position'] == 'manager' and row['salary'] > 5000 else 'no', axis=1)

In the apply function’s parameters, an anonymous function is defined using the lambda keyword, and the passed-in parameter is the record row. During the looping process, the apply function passes each member (record) to the lambda function and generates a new sequence from the computed results.

This Python approach uses explicit lambda syntax, which includes the lambda keyword, requires defining parameters, and writing the function body.

How does SQL handle this type of problem?

SELECT *,
    CASE WHEN position = 'manager' AND salary > 5000 THEN 'yes' ELSE 'no' END AS flag
FROM employee;

No “lambda”, it seems simpler.

In fact, the CASE WHEN expression is still equivalent to defining a function. It’s still treating the expression-defined function as a parameter for loop operation, which, in essence, is still lambda syntax. It’s just that SQL is more concise, and the form of lambda syntax is no longer explicit.

SQL, specializing in structured data computation, only supports the two-dimensional data table as a set, and the parameter passed to the lambda function can only be a record. Therefore, SQL does not need to explicitly define a row parameter like Python does. Instead, it can directly access fields, which is more convenient. In most cases, field names can be directly used within the lambda function; only when duplicate named fields exist does it become necessary to prefix them with the table name (or table alias) to distinguish them. In this way, both the table name and the record parameter are omitted, allowing SQL to express lambda functions as simple expressions, thereby making the lambda syntax implicit.

esProc SPL inherits these advantages of SQL, similarly making lambda syntax implicit:

employee.derive(if(position == "manager" && salary > 5000, "yes","no"))

Data sets are not limited to data tables. SQL does not support other forms of sets, making it cumbersome to process them. For sets composed of single-value members, SQL can still manage by using a data table with only one field. For example, to calculate the sum of squares for a set of numbers, the SQL code would be:

create table numbers as
    select value as n
    from (select 4.3 as value union all select 2.3 union all select 6.5 union all select 44.1) t;
select sum(n*n) from numbers;

It requires giving the set of numbers an extra field name and table name, which is a bit verbose.

Python supports sets composed of single values and can express such operations using lambda syntax:

numbers=pd.Series([4.3,2.3,6.5,44.1])
result=numbers.apply(lambda x: x * x).sum()

SPL also supports sets composed of single values:

numbers=[4.3,2.3,6.5,44.1]
numbers.sum(~*~)

Here, when the sum function loops through the set, it passes the current member ~ to the lambda function to calculate the square, and then the sum function performs the summation. The ~ symbol is equivalent to x in the previous Python code.

Within a loop, the lambda function almost always uses the current member of the set. SPL solidifies this parameter as the ~ symbol, thereby eliminating the need to define the parameter. This allows the lambda to be written as a simple expression, continuing to maintain the advantage of making the lambda implicit.

However, for this relatively simple case, Python more strongly advocates for aligned set operations, which can avoid using lambda syntax:

numbers=pd.Series([4.3,2.3,6.5,44.1])
result = (numbers * numbers).sum()

This appears more concise.

The previous employee label example can also be written as:

employee['flag'] = np.where((employee['position'] == 'manager') & (employee['salary'] > 5000), 'yes', 'no')

When the expression is more complex, it no longer appears as concise as the lambda syntax. Moreover, this approach is only applicable for operations that have been optimized for arrays (such as addition, subtraction, multiplication, division, and the if here). Most mathematical functions have not been optimized in this way, so when encountering them, you can only resort to lambda syntax.

As a side note, this approach requires using the where function, and the logical operator is &, whereas in the previous lambda syntax, the if function and "and" were used. Python syntax often exhibits such inconsistencies. The lambda syntax also presents similar inconsistency issues: it can work in the apply function but cannot in sort_values. Such inconsistencies increase the difficulty of learning the language.

SPL also supports the notation for aligned set operations:

(numbers ** numbers).sum()

It looks similar to Python, but it’s not as simple as the implicit lambda syntax.

SPL also supports set of sets. In fact, as long as it’s a set, SPL can utilize implicit lambda syntax. For example, find the employees in departments with more than 10 employees:

employee.group(department).select(~.len()>10)

The group function groups employees by department to obtain a large set, where each member is a subset containing employees from the same department. The expression ~.len()>10 represents a lambda function, where ~ signifies the current member of the set, that is, the subset.

When the select function loops through the large set, it passes the current member (the subset) to the lambda function to determine whether the subset’s length exceeds 10, and then retains or discards the subset based on the result. This is a very natural problem-solving approach.

Python can also represent set of sets to some extent, and similar code can be written:

result=employee.groupby('department').filter(lambda x: len(x) >10)

For set of sets, aligned set operations are no longer applicable. Using lambda syntax is Python’s simplest approach; alternative methods would lead to more complex logic and code.

SQL cannot describe set of sets. To solve this problem, it needs to use a different approach (much more cumbersome), and lambda syntax is powerless for this problem.

SPL isnt exclusively designed for structured data computation. However, because structured data is so common, SPL, like SQL, has specifically simplified the syntax. Taking the previous example of calculating the employee label column, the complete way to reference fields in SPL should be ~.position and ~.salary. SPL also provides a convenient mechanism for directly accessing fields, which allows the lambda function to be written as concisely as in SQL:

if(position == "manager" && salary > 5000, "yes","no")

Python lacks such simplifications and must be written as shown below, resulting in more verbose syntax for these common scenarios:

lambda row: 'yes' if row['position'] == 'manager' and row['salary'] > 5000 else 'no'

In addition to the current member, loop calculations on ordered sets often also use the member’s index. For example, to extract the even-indexed members from a set of numbers, a straightforward approach is to loop through the set and retain the member if its index is divisible by 2, otherwise discard it.

Python can only pass the current member as a parameter to the lambda function. To implement this approach, the set must be modified to attach an index to each member:

result = filter(lambda x: x[0] % 2 == 1, enumerate(number))
even_index_members = [x[1] for x in result]

The enumerate function converts each member of number to an array. The 0th member of the array is the index, and the 1st member is the original value. This allows the lambda function to use x[0] to access the member’s index. After filtering, the original value must be extracted, making the process convoluted and the code cumbersome.

Since SQL operates on unordered sets where member indexes are meaningless, this problem has to be implemented with a workaround.

SPL uses # to represent the current member’s index. The code written using the straightforward approach above is very concise:

number.select(# % 2 ==0)

#, like ~, is also a parameter passed to the lambda function in SPL’s looping functions.

Summary:

SQL renders lambda syntax implicit for two-dimensional data table operations, making it convenient and concise for describing common structured data set operations. However, it does not support sets other than structured data. Python supports a variety of set types, but lambda function syntax can be somewhat verbose, has limited adaptability, and exhibits stylistic inconsistencies. SPL inherits all of SQL’s advantages and allows the use of implicit lambda syntax across diverse set types. It introduces symbols like ~ and # to further simplify code, offers broad adaptability and consistent style, making it the most powerful of the three.