Comparison of esProc SPL and DuckDB in Data Storage
Data storage is essentially about striking a balance between flexibility, performance, and ease of use. Both DuckDB and esProc SPL offer their own binary storage formats, but they differ significantly in data organization.
DuckDB still employs a conventional database mechanism where data organization is logically holistic, and data under a specific subject forms a database, and there is a set of metadata to describe the structure and relationship of data in the database. A database is logically a whole, with clear distinctions between data inside and outside the database, as well as explicit import and export actions, which is often referred to as closedness. Closedness provides better manageability, but it also implies a lack of freedom in data organization. While DuckDB can also process data outside the database, this falls under the multi-data source functionality, and often still requires mapping to tables.
esProc is completely different; it is not a database. Its data organization is logically fragmented. It lacks the concept of a subject, and consequently, has no metadata. Of course, it makes no distinction between data inside and outside the database, nor does it involve import or export actions. Any data can participate in calculations as long as it’s accessible. The only difference lies in the access performance from different data sources. esProc designs a specialized binary data storage format for high performance (columnar storage, compression, etc.). However, from a logical standpoint, this storage format is treated the same as a text file or data extracted from other databases. esProc’s storage approach is strongly characterized by its openness, and data organization is more flexible and unconstrained, but this comes at the cost of data integrity and manageability.
Now, let’s take a look at the binary format differences between DuckDB and esProc SPL.
DuckDB’s .duckdb files adopt pure columnar storage, where all data is compressed in blocks by column. Columnar storage is well-suited for analytical queries (such as calculating total sales), as it only needs to read a single column of data, and the compressed data results in small file sizes and fast read speeds.
esProc offers two storage formats to address different scenarios:
1. btx files (row-based binary format): Data is stored by row, with a structure similar to CSV, but saved in binary format for faster read/write speed. This format is suitable for small-scale data or temporary storage (such as intermediate calculation results), as it requires no pre-defined structure and can be written to and used immediately.
2. ctx files (columnar composite tables): Data is stored by column and blocked in order based on primary keys (such as time or ID). This format is suitable for large-scale data analysis, allowing it to directly skip blocks that do not meet the conditions, reducing IO consumption and significantly improving speed. The ctx file also offers the option of row-based storage or no compression, allowing for selection based on different computational scenarios.
This “one-size-fits-all” design enables esProc to strike a balance between convenience and performance: btx is convenient for small files, while ctx is efficient for big data.
During big data analysis, esProc’s ctx, in addition to possessing the advantages of DuckDB’s columnar storage, also specifically supports ordered computation (as opposed to SQL’s unordered nature). ctx can also be designed according to the computation. For example, storing data in a specified order allows for the application of order-related algorithms like binary search to improve computational performance. Redundancy can also be implemented – after all, it’s just a matter of having an additional copy of a file.
The flexibility of esProc storage is also manifested in its ability to allow the storage of different types of data within the same field. This greatly increases flexibility but sacrifices performance, requiring trade-offs based on specific needs.
Moreover, the value of an esProc field can be another record or table, thereby better supporting multi-layered data storage and usage. DuckDB now also offers good support for JSON data, making the two systems comparable in this regard.
esProc SPL and DuckDB each have their own distinctive characteristics in data management and storage. DuckDB employs a conventional database model, using metadata to uniformly manage structured data. Its closed mechanism provides strong manageability but limits flexibility. In contrast, esProc adopts fragmented data organization, with no metadata constraints, supporting mixed computations of data from any source, but requiring developers to manage the data themselves. In terms of storage, DuckDB only supports a pure columnar format, while esProc offers a dual mode of btx row-based storage and ctx columnar composite table, supporting multiple data types within the same field and multi-layer nested data, allowing storage strategies to be flexibly chosen according to the scenario.
SPL Official Website 👉 https://www.esproc.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.esproc.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version