Analyzing DuckDB’s Performance Optimization through TOPN and COUNT DISTINCT Operations

In recent years, DuckDB has emerged as a popular choice for numerous data analysis scenarios. Its lightweight nature, ease of use, and simple integration make it well-suited for programmers performing local analysis, and working directly in SQL is both convenient and efficient.

However, ease of writing code isn’t the only consideration; fast execution and intelligent optimization are also key to the user experience.

We will use TOPN and COUNT DISTINCT operations as examples to analyze DuckDB’s performance optimization.

Test environment and data preparation

Test hardware:

| Configuration item | Specification |
|---|---|
| CPU | 16 logical cores (8 physical cores, 16 threads; mainstream laptop) |
| Memory | 16GB |
| Storage | SSD |

This is not a high-end configuration; rather, it’s closer to the environment developers use daily, making the test results more relevant.

Test data:

| Data table | Field description | Total rows | Original CSV file size |
|---|---|---|---|
| topn | id: 10-character random uppercase letters; amount: random real number from 0 to 10,000,000 | 1 billion | 22GB |
| orderdetail | orderid: strings in ascending order; detailid: integer; amount: real number | 2.4 billion | 50GB |

Both tables exceed available memory to ensure realistic simulation of big data processing scenarios and avoid unfair advantages from caching.

Entire-set TOPN: Excellent performance

First, let’s look at a classic TOPN operation: retrieving the top 100 rows by the amount field:

```sql
select * from topn order by amount desc limit 100;
```

If the SQL were executed literally, the database would first sort the entire table and then take the top 100 rows, giving a complexity of n*logn. Once the data volume exceeds available memory, performance drops sharply.

However, test results show DuckDB performs quite well in this scenario. After importing the 1 billion rows of raw text data (22GB), it compresses the data into an 8GB database file. The first query takes approximately 20 seconds (largely due to initial loading and caching), with subsequent queries stabilizing around 4 seconds.

During execution, no temporary files were generated, indicating that DuckDB’s optimizer avoids a full-table sort and instead uses an optimized algorithm: maintain a small set of size N and traverse the data only once. This strategy reduces the complexity from n*logn to n*logN, a substantial improvement when N is small.
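To make this concrete, here is a minimal Python sketch of the bounded-heap strategy (illustrative code and data, not DuckDB’s internals): keep a min-heap of at most N rows and replace its smallest element whenever a larger amount arrives.

```python
import heapq
import random

def top_n(rows, n=100):
    # Min-heap of at most n (amount, id) pairs; heap[0] is the smallest
    # amount currently retained among the candidate top n.
    heap = []
    for rec in rows:
        if len(heap) < n:
            heapq.heappush(heap, rec)
        elif rec > heap[0]:
            # New row beats the weakest candidate: swap it in, O(log N).
            heapq.heapreplace(heap, rec)
    # The final sort touches only n elements, not the whole input.
    return sorted(heap, reverse=True)

# Illustrative data shaped like the topn table: (amount, id) pairs.
rows = ((random.uniform(0, 10_000_000), i) for i in range(1_000_000))
print(top_n(rows)[:3])
```

Because only N rows are ever retained in memory, nothing spills to disk no matter how large the input is.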

In this test, DuckDB performs excellently, demonstrating both intelligence and efficiency.

Grouped TOPN: Shortcomings exposed

Next, let’s look at a slightly more complex operation: grouped TOPN.

The task is to group the table by the first letter of the id field and select the top 100 rows with the largest amount values in each group. The SQL is roughly as follows:

```sql
select * from (
  select left(id,1) as gid, amount,
         row_number() over (partition by left(id,1) order by amount desc) rn
  from topn
) where rn <= 100;
```

In theory, this scenario is similar to regular TOPN. Since there are only 26 possible group keys, maintaining a small top-N set for each group while traversing the data once should avoid a full sort here as well.

However, in actual execution, DuckDB appears to “faithfully” execute the SQL semantics: it fully sorts each group and then filters by row_number. This strategy caused extensive read/write traffic between memory and external storage, with temporary files swelling to over 40GB during execution. Even after 10 minutes, the query had not completed.

In other words, although the optimization is logically possible, DuckDB does not recognize grouped TOPN as a group-aggregation scenario and rigidly performs sorting instead. Users are left to optimize the query themselves, either by rewriting the SQL to avoid the sort or by switching to a different tool.
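For illustration, here is a hedged Python sketch of the single-pass strategy that is logically available here: one bounded min-heap per group key (the data layout is hypothetical, and this is not DuckDB’s or SPL’s actual implementation).

```python
import heapq
from collections import defaultdict

def grouped_top_n(rows, n=100):
    # One min-heap per group key (first letter of id). With 26 letters
    # and n=100, at most 2,600 rows are retained in memory, so no full
    # sort is performed and nothing spills to disk.
    heaps = defaultdict(list)
    for rec_id, amount in rows:
        heap = heaps[rec_id[0]]
        if len(heap) < n:
            heapq.heappush(heap, (amount, rec_id))
        elif amount > heap[0][0]:
            heapq.heapreplace(heap, (amount, rec_id))
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}
```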

For example, esProc SPL directly supports combining groups with the top function, treating TOPN as an aggregation operation. It obtains the result by traversing the data only once and completes the same query in just 31 seconds, a considerable performance advantage.

SPL script – Entire-set TOPN:


|  | A |
|---|---|
| 1 | =file("e:/duckdb1.1.1/topn.ctx").open().cursor@m(amount) |
| 2 | =A1.total(top(100;-amount)) |

SPL script – Grouped TOPN:


|  | A |
|---|---|
| 1 | =file("e:/duckdb1.1.1/topn.ctx").open().cursor@m() |
| 2 | =A1.groups@u(left(id,1);top(100;-amount)) |

Summary of test results (Unit: seconds)

| Test type | Entire-set TOPN | Grouped TOPN |
|---|---|---|
| esProc SPL | 13 | 31 |
| DuckDB | 20 | 600+ |

COUNT DISTINCT: Unable to leverage ordered data

Let’s now discuss another common operation: COUNT DISTINCT.

In SQL, counting the number of distinct order IDs that meet a specific condition is typically expressed as:

```sql
select count(distinct orderid) from orderdetail where amount>50;
```

In DuckDB, the logic for this operation is simple and brute-force: traverse all the data, collect the order IDs that meet the condition into a DISTINCT set (comparing each new value against the set to decide whether to add it), and finally count the size of the set.
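A minimal Python sketch of this hash-set approach, assuming rows arrive as (orderid, amount) tuples (the shape is illustrative, not DuckDB’s execution plan):

```python
def count_distinct(rows, threshold=50):
    # Brute-force COUNT DISTINCT: every qualifying orderid goes into a
    # hash set. Memory grows with the number of distinct keys, which is
    # what forces spilling once the key set no longer fits in RAM.
    seen = set()
    for orderid, amount in rows:
        if amount > threshold:
            seen.add(orderid)
    return len(seen)
```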

If the data is unordered, this approach is fine and consistent with the practices of most databases.

However, in this test, we intentionally inserted the data in ascending order by orderid. If the algorithm leveraged this ordered characteristic, it could improve performance significantly: adjacent rows with the same orderid can simply be skipped, eliminating the need to maintain a DISTINCT set at all.
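Here is a hedged sketch of that order-aware algorithm: counting key transitions in a single pass with O(1) extra memory (again over illustrative (orderid, amount) rows).

```python
def icount_ordered(rows, threshold=50):
    # Equal orderids are adjacent, so counting distinct qualifying keys
    # reduces to counting transitions; no DISTINCT set is kept at all.
    count, prev = 0, None
    for orderid, amount in rows:
        if amount > threshold and orderid != prev:
            count += 1
            prev = orderid
    return count
```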

Regrettably, DuckDB lacks the ability to actively recognize and leverage the ordered nature of the data. This is confirmed by the test results: it took over 200 seconds to process 2.4 billion rows using a single thread.

In contrast, esProc SPL with its icount@o function, which performs COUNT DISTINCT specifically on ordered data, completed the test in approximately 50 seconds, about 4 times faster than DuckDB.

For more complex operations, such as calculating two COUNT DISTINCT values with different conditions:

```sql
select count(distinct case when amount>50 then orderid else null end),
       count(distinct case when amount>100 then orderid else null end)
from orderdetail;
```

For this query, DuckDB builds the DISTINCT sets repeatedly, causing execution time to skyrocket to over 800 seconds. Conversely, esProc SPL reuses the single scan, so the execution time barely changes.
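A one-pass sketch of how both conditional distinct counts can share a single scan of the ordered data (illustrative code, not SPL’s implementation):

```python
def two_icounts(rows):
    # Each condition gets its own transition counter, so the second
    # COUNT DISTINCT costs two extra comparisons per row instead of a
    # second pass that rebuilds a DISTINCT set from scratch.
    c50 = c100 = 0
    prev50 = prev100 = None
    for orderid, amount in rows:
        if amount > 50 and orderid != prev50:
            c50 += 1
            prev50 = orderid
        if amount > 100 and orderid != prev100:
            c100 += 1
            prev100 = orderid
    return c50, c100
```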

The root cause of this gap lies in SQL’s inability to express “ordered” information, leaving the optimizer no means to leverage it. In contrast, languages like esProc SPL, which can explicitly express orderedness and provide dedicated functions for it, hold a natural advantage in such scenarios.

SPL script - One count distinct query:


|  | A |
|---|---|
| 1 | =file("f:/temp/yxfz.ctx").open().cursor@m() |
| 2 | =A1.total(icount@o(if(amount>50,orderid))) |

SPL script - Two count distinct queries:


|  | A |
|---|---|
| 1 | =file("f:/temp/yxfz.ctx").open().cursor@m() |
| 2 | =A1.total(icount@o(if(amount>50,orderid)), icount@o(if(amount>100,orderid))) |

Summary of test results (Unit: seconds):

| Test type | One DISTINCT query | Two DISTINCT queries |
|---|---|---|
| esProc SPL | 53 | 59 |
| DuckDB | 222 | 865 |

As a lightweight analytical database, DuckDB boasts excellent usability and particularly excels at optimizing entire-set TOPN queries, delivering a great daily user experience.

However, when faced with slightly more complex problems, such as grouped TOPN or COUNT DISTINCT on inherently ordered data, the “clumsiness” of SQL semantics is exposed: the optimizer cannot recognize the underlying logic, which leads to a significant performance gap.

In these scenarios, using tools like esProc SPL, which can describe computational details and support the expression of ordered characteristics, can greatly improve performance.

In conclusion, DuckDB is a good tool, provided the scenario is simple and gives the optimizer free rein. However, when the problem involves ordered characteristics or special aggregation logic, fast execution requires tools that can describe “smarter” computational methods.