15 Sep 2024
SQLite and DuckDB for analytics workloads
The impact of column orientation
I've been yearning to build my own observability / time series system. You can get some more background on this here. After establishing how I'm going to store data, the remaining question is which DB I'm going to use. In this note, I'll evaluate two embedded databases: the venerable SQLite and the newcomer in this space, DuckDB.
Methodology
I'm populating the DB under test with mock data of three types: Location, CO2, and structured logs. This mirrors the scenarios from the last note. The dataset includes around 50k location entries, 500k CO2 values, and 90k structured logs with their timestamps nicely spread out over one year.
Full population script for reference
SQLite is operated in WAL mode with synchronous set to NORMAL, while DuckDB uses no special configuration, as WAL mode is enabled by default.
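Concretely, that's just a couple of pragmas on the SQLite side (a minimal sketch in Python; the file names are placeholders):

import sqlite3
import duckdb

# SQLite: write-ahead logging plus relaxed fsync behaviour.
sqlite_con = sqlite3.connect("metrics.sqlite")
sqlite_con.execute("PRAGMA journal_mode=WAL")
sqlite_con.execute("PRAGMA synchronous=NORMAL")

# DuckDB: the defaults are fine, just open the file.
duck_con = duckdb.connect("metrics.duckdb")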
The schema used in the tests boils down to a timestamp plus the entry itself stored as a JSON blob.
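As a minimal sketch, assuming a single events table with placeholder names (the query sketches below build on this same table):

import duckdb

con = duckdb.connect("metrics.duckdb")

# One row per event: when it happened, what kind of event it is,
# and the measurement / log entry itself as a JSON blob.
con.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts      TIMESTAMP NOT NULL,
        type    VARCHAR   NOT NULL,  -- 'location', 'co2' or 'log'
        payload JSON      NOT NULL
    )
""")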
Queries
Next, to exercise the DB, I'm querying the three categories of data with fairly intensive queries.
GPS coordinates:
This checks how many location entries are within a certain longitude and latitude rectangle.
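In sketch form, against the events table assumed above (the bounding box and the lat / lon field names are placeholders):

import duckdb

con = duckdb.connect("metrics.duckdb")

# Count location entries inside a longitude/latitude rectangle.
hits = con.execute("""
    SELECT count(*)
    FROM events
    WHERE type = 'location'
      AND CAST(payload->>'$.lon' AS DOUBLE) BETWEEN 13.0 AND 13.8
      AND CAST(payload->>'$.lat' AS DOUBLE) BETWEEN 52.3 AND 52.7
""").fetchone()[0]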
CO2:
Possibly the most intensive query: it aggregates CO2 measurements by month and computes the respective averages.
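Again as a sketch on the assumed schema, with value as a placeholder field name:

import duckdb

con = duckdb.connect("metrics.duckdb")

# Average CO2 reading per month over the whole dataset.
monthly = con.execute("""
    SELECT date_trunc('month', ts) AS month,
           avg(CAST(payload->>'$.value' AS DOUBLE)) AS avg_co2
    FROM events
    WHERE type = 'co2'
    GROUP BY month
    ORDER BY month
""").fetchall()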
Structured logs:
This groups the entries indicating an error by endpoint and checks which one has the highest error count. It does so over the data of the last 90 days.
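Sketched with assumed level and endpoint fields inside the log payload:

import duckdb

con = duckdb.connect("metrics.duckdb")

# Endpoint with the most errors over the last 90 days.
worst = con.execute("""
    SELECT payload->>'$.endpoint' AS endpoint,
           count(*) AS errors
    FROM events
    WHERE type = 'log'
      AND payload->>'$.level' = 'error'
      AND ts >= now() - INTERVAL 90 DAY
    GROUP BY endpoint
    ORDER BY errors DESC
    LIMIT 1
""").fetchone()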
Note that the queries look slightly different between SQLite and DuckDB, in particular with date functions and JSON field access patterns.
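To illustrate, the SQLite flavour of the log-query sketch leans on json_extract and SQLite's date arithmetic instead (this also assumes timestamps are stored as ISO-8601 text on the SQLite side):

import sqlite3

con = sqlite3.connect("metrics.sqlite")

# Same log query in SQLite's dialect: json_extract() for field access
# and datetime() arithmetic for the 90-day cutoff.
worst = con.execute("""
    SELECT json_extract(payload, '$.endpoint') AS endpoint,
           count(*) AS errors
    FROM events
    WHERE type = 'log'
      AND json_extract(payload, '$.level') = 'error'
      AND ts >= datetime('now', '-90 days')
    GROUP BY endpoint
    ORDER BY errors DESC
    LIMIT 1
""").fetchone()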
Results
Clearly, DuckDB is faster across the board. Interestingly, unlike SQLite, DuckDB doesn't benefit from indices at all; they even seem to hurt performance. The structured logs query benefits particularly from the timestamp index on SQLite, as it is the only query that filters data based on the timestamp.
While it's not surprising at first glance that a column-oriented DB is faster in such workloads, one has to keep in mind that most of the computation happens on fields within a JSON blob. I would expect most of the benefit of neat data alignment, ready for vectorized computation, to be forfeited in this situation. But clearly DuckDB still manages to eke out some performance benefits.
Inserts
DuckDB is not all sunshine and roses, though. From what I gather, its design is primarily focused on bulk operations, not so much on individual transactions. This shows in the insert rate: DuckDB achieves around 4k naive inserts per second, while SQLite manages around 30k to 40k, depending on indices. This can be remedied somewhat with the Appender API, though. Alternatively, I've been working around it by buffering writes in memory and doing batch inserts "off-thread". This has the added advantage of being able to soak up peaks of around 100k inserts per second.
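Roughly, such a buffer can look like the sketch below; the batch size and flush interval are arbitrary choices:

import json
import queue
import threading

import duckdb


class BufferedWriter:
    """Collects events in memory and flushes them to DuckDB in batches."""

    def __init__(self, db_path, batch_size=5000, flush_interval=1.0):
        self.con = duckdb.connect(db_path)
        self.queue = queue.Queue()
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        threading.Thread(target=self._run, daemon=True).start()

    def write(self, ts, kind, payload):
        # Hot path: just enqueue, never touch the database here.
        self.queue.put((ts, kind, json.dumps(payload)))

    def _run(self):
        while True:
            batch = [self.queue.get()]  # block until there is work
            try:
                while len(batch) < self.batch_size:
                    batch.append(self.queue.get(timeout=self.flush_interval))
            except queue.Empty:
                pass  # queue ran dry, flush what we have
            self.con.executemany(
                "INSERT INTO events (ts, type, payload) VALUES (?, ?, ?)",
                batch,
            )

The hot path only ever calls write(); the actual INSERTs happen in batches on the background thread, which is what lets it absorb short bursts.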
DB Size
DuckDB stores the data significantly more compactly, too: 20 MB for the database file versus 60-80 MB with SQLite, depending on indices.
Other considerations
Overall, I'm quite satisfied with DuckDB. It has a bunch of other nice-to-have features that might come in handy later. Say I want to segment the data into time chunks and write older data out to Parquet files: DuckDB does this natively, and it can then even read those Parquet files efficiently from remote S3 storage. Or I might want to explore the MAP data type specifically for structured logs; I suspect this could speed up queries quite a bit over naive JSON.
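Sketched out with a made-up bucket name, reusing the 90-day cutoff from the log query:

import duckdb

con = duckdb.connect("metrics.duckdb")

# Move everything older than 90 days into a Parquet file...
con.execute("""
    COPY (SELECT * FROM events WHERE ts < now() - INTERVAL 90 DAY)
    TO 'events_archive.parquet' (FORMAT PARQUET)
""")
con.execute("DELETE FROM events WHERE ts < now() - INTERVAL 90 DAY")

# ...and later query archived chunks straight from S3
# (via the httpfs extension; credential setup omitted here).
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
archived = con.execute(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/events_archive.parquet')"
).fetchone()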
Something I've noticed is that DuckDB takes a long time to open a DB when the corresponding WAL file is large. This can be remedied by periodically checkpointing to fold the WAL file back into the main DB.
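The checkpoint itself is a one-liner that can be run periodically, e.g. after a batch flush:

import duckdb

con = duckdb.connect("metrics.duckdb")
con.execute("CHECKPOINT")  # fold the WAL back into the main database file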
In the end, I can't bring myself to leave that much performance on the table, so I'll go ahead with DuckDB even though SQLite has its allures.