"10 Order-based computations on sets of strings This type of task refers to computations within o .."

jinxing RaqForum 41 No.
20 View • 3 Months ago

A Field Guide to Querying and Report Computations with SPL - 10 Order-based computations on sets of strings

10 Order-based computations on sets of strings

This type of task refers to computations within or between string sequences, commonly used for organizing irregular data. The fundamental data type in SQL (result set) is unordered set, which poses inherent obstacles to implementing such computations. When dealing with record-level or field-level sequences, SQL often requires converting data into record sets for computation, resulting in lengthy and complex code. While embedding regular expressions can shorten the code, they increase the difficulty of understanding and debugging, further exacerbating this complexity.

Example 1: Remove duplicate values in an indefinite number of mail1, 2, 3… columns and combine the rest of the values

Source data: Client mail table (10/MailCollected.txt) stores client information in one client field and an indefinite number of mail fields. One client may have null and duplicate email address.

Target: Rearrange the table as a regular two-dimensional table by removing null email addresses and duplicate addresses and combining address values into a single mail field.

Picture1png

SPL code:

	A
1	$select * from 10/MailCollected.txt
2	=A1.new(Client,~.array().m(2:).id@0().concat@c():Mail)

A1: Load data.

A2: =A1.new(Client,~.array().m(2:):Mail)creates a new two-dimensional table based on A1. It copies Client field and computes Mail field values in several steps – array() function converts records to a sequence and m() function gets members of the sequence from the 2^nd to the last. Below is the first record:

Picture2png
.id@0() performs a further computation to get the unique values. @0 option enables removing the null values. Below shows details of the first record:

Picture3png
.concat@c() furthers the computation to concatenate members of the sequence into a string. @c option enables using comma as the separator.

Picture4png

Example 2: Extract phone numbers of NANP format

Source data: Client phone number table (10/TelCollected.txt)stores relative information. Each phone number consists of numbers and one or some of the characters "()*#+-" basically in the NANP format. Some phone numbers have the country code and some are made up of continuous numbers, but all of them are mixed into large strings.

Target: 3. Extract the phone number and organize the data into a regular two-dimensional table according to the following rule: 1. Each telNo field value consists of only one phone number; 2. "tel" may or may not has a phone number after it, but if there isn’t the "tel", there isn’t any phone number; 3. Each phone number exists as a continuous string, and once a new character appears, the phone number ends.

Picture5png

SPL code:

	A
1	$select * from 10/TelCollected.txt
2	=A1.run(TelNo=substr(TelNo,"tel").split().select@c(pos("0123456789()*#+-",~)).concat())

A1: Load data.

A2: =A1.run(TelNo=substr(TelNo,"tel")) modifies TelNo field in several steps: first, substr() function retrieves the substring after the marker "tel".

Picture6png
.split() computes furtherly to split the current string into a sequence of characters. Below shows the first sequence of characters:

Picture7png
.select@c(pos(“0123456789()*#+-”,~))performs a further computation to retrieve members of the sequence from the first until a character that isn’t included in “0123456789()*#+-” appears. The select()function performs filtering operation to select certain members; @c option enables getting member from the first one and stopping the retrieval until the member making the specified condition false. The pos() function judges whether a character is contained in a string.

Picture8png
.concat() continues the computation to concatenate members of the sequence into a string.

Picture9png

Example 3: Find strings that contain at least 5 continuous, alphabetically ordered letters

Source data: A database table (10/consecutive_5_str.txt) consists only one string column that may contain any characters.

Target: Find every substring whose length is greater or equal to 5 from continuous substrings where letters are alphabetically arranged in ascending order. Note that the alphabetical order is A-Z/a-z and that letter Z is directly followed by letter a.

Picture10png

SPL code:

	A
1	$select * from 10/consecutive_5_str.txt
2	=A1.select(str.split().group@i(~<=~[-1] \|\| !isalpha(~[-1])).max(~.len())>=5)\n

A1: Load data.

A2: =A1.select(str.split()) selects strings meeting the specified condition. The process involves several steps. First, each string is split into a sequence of characters. Here are the 6^th, the 7^th, and the 8^th strings:

Picture11png

.group@i(~<=~[-1] || !isalpha(~[-1])) performs a further operation to group each sequence according to the specified condition – put alphabetically ordered characters and non-letter characters into different groups. The group()function performs grouping without aggregation. @i option enables grouping by the specified condition. ~<=~[-1] creates a new group when the current character is before the preceding one or in the alphabet; !isalpha(~[-1]) creates a new group when the preceding character isn’t a letter; || denotes the OR relationship. Here are details of the 6^th, the 7^th, and the 8^th strings:

Picture12png

.max(~.len())>=5 continues the computation to select strings corresponding to at least one subgroup that has at least 5 members. Strings from the 1^st to the 8^th are the eligible ones.

Picture13png