15 Sep 2024
SQLite and DuckDB for analytics workloads
The impact of column orientation
I've been yearning to build my own observability / time series system. You can get some more background on this here. After establishing how I'm going to store data, the remaining question is which DB I'm going to use. In this note, I'll evaluate two embedded databases: the venerable SQLite and the newcomer in this space, DuckDB.
Methodology
I'm populating the DB under test with mock data of three types: Location, CO2, and structured logs. This mirrors the scenarios from the last note. The dataset includes around 50k location entries, 500k CO2 values, and 90k structured logs with their timestamps nicely spread out over one year.
Full population script for reference
SQLite is operated in WAL mode with synchronous set to NORMAL, while DuckDB uses no special configuration, as WAL mode is enabled by default.
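Concretely, that's just a couple of pragmas on the SQLite side (a minimal sketch in Python; the file names are placeholders):

import sqlite3
import duckdb

# SQLite: write-ahead logging plus relaxed fsync behaviour.
sqlite_con = sqlite3.connect("metrics.sqlite")
sqlite_con.execute("PRAGMA journal_mode=WAL")
sqlite_con.execute("PRAGMA synchronous=NORMAL")

# DuckDB: the defaults are fine, just open the file.
duck_con = duckdb.connect("metrics.duckdb")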
The schema used in the tests boils down to a timestamp plus the entry itself stored as a JSON blob.
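As a minimal sketch, assuming a single events table with placeholder names (the query sketches below build on this same table):

import duckdb

con = duckdb.connect("metrics.duckdb")

# One row per event: when it happened, what kind of event it is,
# and the measurement / log entry itself as a JSON blob.
con.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts      TIMESTAMP NOT NULL,
        type    VARCHAR   NOT NULL,  -- 'location', 'co2' or 'log'
        payload JSON      NOT NULL
    )
""")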
Queries
Next, to exercise the DB, I'm querying the three categories of data with fairly intensive queries.
GPS coordinates:
This checks how many location entries are within a certain longitude and latitude rectangle.
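In sketch form, against the events table assumed above (the bounding box and the lat / lon field names are placeholders):

import duckdb

con = duckdb.connect("metrics.duckdb")

# Count location entries inside a longitude/latitude rectangle.
hits = con.execute("""
    SELECT count(*)
    FROM events
    WHERE type = 'location'
      AND CAST(payload->>'$.lon' AS DOUBLE) BETWEEN 13.0 AND 13.8
      AND CAST(payload->>'$.lat' AS DOUBLE) BETWEEN 52.3 AND 52.7
""").fetchone()[0]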
CO2:
Possibly the most intensive query: it aggregates CO2 measurements by month and computes the respective averages.
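Again as a sketch on the assumed schema, with value as a placeholder field name:

import duckdb

con = duckdb.connect("metrics.duckdb")

# Average CO2 reading per month over the whole dataset.
monthly = con.execute("""
    SELECT date_trunc('month', ts) AS month,
           avg(CAST(payload->>'$.value' AS DOUBLE)) AS avg_co2
    FROM events
    WHERE type = 'co2'
    GROUP BY month
    ORDER BY month
""").fetchall()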
Structured logs:
This groups the entries indicating an error by endpoint and checks which one has the highest error count. It does so over the data of the last 90 days.
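Sketched with assumed level and endpoint fields inside the log payload:

import duckdb

con = duckdb.connect("metrics.duckdb")

# Endpoint with the most errors over the last 90 days.
worst = con.execute("""
    SELECT payload->>'$.endpoint' AS endpoint,
           count(*) AS errors
    FROM events
    WHERE type = 'log'
      AND payload->>'$.level' = 'error'
      AND ts >= now() - INTERVAL 90 DAY
    GROUP BY endpoint
    ORDER BY errors DESC
    LIMIT 1
""").fetchone()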
Note that the queries look slightly different between SQLite and DuckDB, in particular with date functions and JSON field access patterns.
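To illustrate, the SQLite flavour of the log-query sketch leans on json_extract and SQLite's date arithmetic instead (this also assumes timestamps are stored as ISO-8601 text on the SQLite side):

import sqlite3

con = sqlite3.connect("metrics.sqlite")

# Same log query in SQLite's dialect: json_extract() for field access
# and datetime() arithmetic for the 90-day cutoff.
worst = con.execute("""
    SELECT json_extract(payload, '$.endpoint') AS endpoint,
           count(*) AS errors
    FROM events
    WHERE type = 'log'
      AND json_extract(payload, '$.level') = 'error'
      AND ts >= datetime('now', '-90 days')
    GROUP BY endpoint
    ORDER BY errors DESC
    LIMIT 1
""").fetchone()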
Results
Clearly, DuckDB is faster across the board. Interestingly, unlike SQLite, DuckDB doesn't benefit from indices at all; they even seem to hurt performance. The structured logs query benefits particularly from the timestamp index on SQLite, as it is the only query that filters data based on the timestamp.
While it's not surprising at first glance that a column-oriented DB is faster in such workloads, one has to keep in mind that most of the computation happens on fields within a JSON blob. I would expect most of the benefit of neat data alignment, ready for vectorized computation, to be forfeited in this situation. But clearly DuckDB still manages to eke out some performance benefits.
Inserts
DuckDB is not all sunshine and roses, though. From what I gather, its design is primarily focused on bulk operations, not so much on individual transactions. This shows in the insert rate: DuckDB achieves around 4k naive inserts per second, while SQLite manages around 30k to 40k, depending on indices. This can be remedied somewhat with the Appender API, though. Alternatively, I've been working around it by buffering writes in memory and doing batch inserts "off-thread". This has the added advantage of being able to soak up peaks of around 100k inserts per second.
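Roughly, such a buffer can look like the sketch below; the batch size and flush interval are arbitrary choices:

import json
import queue
import threading

import duckdb


class BufferedWriter:
    """Collects events in memory and flushes them to DuckDB in batches."""

    def __init__(self, db_path, batch_size=5000, flush_interval=1.0):
        self.con = duckdb.connect(db_path)
        self.queue = queue.Queue()
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        threading.Thread(target=self._run, daemon=True).start()

    def write(self, ts, kind, payload):
        # Hot path: just enqueue, never touch the database here.
        self.queue.put((ts, kind, json.dumps(payload)))

    def _run(self):
        while True:
            batch = [self.queue.get()]  # block until there is work
            try:
                while len(batch) < self.batch_size:
                    batch.append(self.queue.get(timeout=self.flush_interval))
            except queue.Empty:
                pass  # queue ran dry, flush what we have
            self.con.executemany(
                "INSERT INTO events (ts, type, payload) VALUES (?, ?, ?)",
                batch,
            )

The hot path only ever calls write(); the actual INSERTs happen in batches on the background thread, which is what lets it absorb short bursts.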
DB Size
DuckDB stores the data significantly more compactly, too: 20 MB for the database file versus 60-80 MB with SQLite, depending on indices.
Other considerations
Overall, I'm quite satisfied with DuckDB. It has a bunch of other nice-to-have features that might come in handy later. Say I want to segment the data into time chunks and write older data out to Parquet files: DuckDB does this natively, and it can then even read those Parquet files efficiently from remote S3 storage. Or I might want to explore the MAP data type specifically for structured logs; I suspect this could speed up queries quite a bit over naive JSON.
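Sketched out with a made-up bucket name, reusing the 90-day cutoff from the log query:

import duckdb

con = duckdb.connect("metrics.duckdb")

# Move everything older than 90 days into a Parquet file...
con.execute("""
    COPY (SELECT * FROM events WHERE ts < now() - INTERVAL 90 DAY)
    TO 'events_archive.parquet' (FORMAT PARQUET)
""")
con.execute("DELETE FROM events WHERE ts < now() - INTERVAL 90 DAY")

# ...and later query archived chunks straight from S3
# (via the httpfs extension; credential setup omitted here).
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
archived = con.execute(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/events_archive.parquet')"
).fetchone()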
Something I've noticed is that DuckDB takes a long time to open a DB when the corresponding WAL file is large. This can be remedied by periodically checkpointing to fold the WAL file back into the main DB.
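The checkpoint itself is a one-liner that can be run periodically, e.g. after a batch flush:

import duckdb

con = duckdb.connect("metrics.duckdb")
con.execute("CHECKPOINT")  # fold the WAL back into the main database file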
In the end, I can't bring myself to leave that much performance on the table, so I'll go ahead with DuckDB even though SQLite has its allures.