11 May 2026
Exploring the Variant type in Parquet
Let's get shredding
Observability has been on my mind again lately. I already touched on how to store unstructured data a while back. After a brief stint building my own system on top of DuckDB, I’ve been running an off-the-shelf combination of ClickHouse and Grafana to collect and visualize tracing data. It has served me well, but a planned move of my side projects to my own hardware has me re-evaluating the setup. ClickHouse is by far the heaviest service I run on my server, and it has a habit of filling up my disk with logging tables. Look, ClickHouse is an engineering marvel, but it’s not my marvel. It’s someone else’s marvel. And I have a need to create my own. Thing, not marvel, I mean. So here we are.
Now, I’ve stumbled across the Parquet file format, and in particular its new data type that looks like it was custom-made to solve all my problems. Maybe not all of them. In this note I want to explore how DuckDB can write and query these files. If this works out, I may be tempted to build a new observability system around wide events on top of these technologies.
But before we get into the weeds, let me set the scene. Let’s say I have a service that sends structured logs (or wide events) to my observability system. Today they might look like this:
{
"timestamp": "2026-05-05T08:00:00Z",
"service": "marending-dev",
"severity": "error",
"message": "User Bobby Tables doesn't exist"
}
But then I want to track more information.
{
"timestamp": "2026-05-05T08:00:00Z",
"service": "marending-dev",
"severity": "error",
"message": "User Bobby Tables doesn't exist",
"user_id": "null",
"ip": "85.0.1.42",
"path": "/user/login"
}
You get the picture. The question is, how can we store these JSON blobs such that we can then
efficiently compute, say, error rate grouped by user_id? We can’t have a static schema, since we
might want to add properties to this event at any time. We can chuck this as a JSON string into
any database, file format or what have you. But then any query that attempts to read from properties
of that JSON string is going to incur the significant cost of extracting the appropriate fields for
every row of interest. One way to get reasonable performance here is to ensure severity and user_id
are each contiguously laid out so that they can be efficiently aggregated.
Parquet and Variant
That brings us to Apache Parquet. Parquet is a column-oriented file format. While DuckDB has its own file format for storing data, one of its strengths is its support for a wide variety of other formats, including Parquet. So you can write and query Parquet files natively with SQL within DuckDB.
Recently Parquet added the Variant type, a binary encoding of primitives, arrays or objects of arbitrary types. It allows representing structured data in such a way that it can be queried more efficiently than if the same data was serialized and stored as a string type. But the real kicker is “Variant shredding”. It’s the process of extracting some of the properties within the Variant into their own columns. This should then give us maximum performance when aggregating on those columns. And DuckDB apparently supports this!
Once I learned about this, a path towards my dream observability system emerged. Step 1: Write a service that accepts arbitrary JSON payloads, buffers them in memory and writes them to Parquet files with the payload being of type Variant every 10 seconds or so. Step 2: Every hour, read all the small Parquet files and shred them into a larger hourly file. Step 3: Possibly aggregate the hourly files into even broader files, but I’ll see when I get to it.
Now at query time, I will have at most an hour’s worth of data in inefficient, unshredded form. That should be easy enough to deal with. Data older than that will be in highly efficient form. And the best part? DuckDB should be able to query across many Parquet files using glob syntax, even when some have concrete, shredded columns and some don’t. But we’ll see this for ourselves.
Exploration
Let’s start with the example given in DuckDB’s documentation.
CREATE TABLE events (id INTEGER, data VARIANT);
INSERT INTO events VALUES
(1, 42::VARIANT),
(2, 'hello world'::VARIANT),
(3, [1, 2, 3]::VARIANT),
(4, {'name': 'Alice', 'age': 30}::VARIANT);
Querying the table presents the expected result.
SELECT * FROM events;
┌───────┬────────────────────────────┐
│ id │ data │
│ int32 │ variant │
├───────┼────────────────────────────┤
│ 1 │ 42 │
│ 2 │ hello world │
│ 3 │ [1, 2, 3] │
│ 4 │ {'name': Alice, 'age': 30} │
└───────┴────────────────────────────┘
SELECT id, data, variant_typeof(data) AS vtype FROM events;
┌───────┬────────────────────────────┬───────────────────┐
│ id │ data │ vtype │
│ int32 │ variant │ varchar │
├───────┼────────────────────────────┼───────────────────┤
│ 1 │ 42 │ INT32 │
│ 2 │ hello world │ VARCHAR │
│ 3 │ [1, 2, 3] │ ARRAY(3) │
│ 4 │ {'name': Alice, 'age': 30} │ OBJECT(name, age) │
└───────┴────────────────────────────┴───────────────────┘
We can now also query nested fields in a Variant.
SELECT data.name FROM events;
┌─────────┐
│ name │
│ variant │
├─────────┤
│ NULL │
│ NULL │
│ NULL │
│ Alice │
└─────────┘
Writing this data as-is to a parquet file is now as simple as
COPY events TO 'events.parquet' (FORMAT parquet);. And indeed, having DuckDB describe the parquet
file’s shape yields the expected.
DESCRIBE SELECT * FROM 'events.parquet';
┌────────────────┐
│ events.parquet │
│ │
│ id integer │
│ data variant │
└────────────────┘
But hold on. According to the docs, writing to a Parquet file should automatically shred the variant type. Shouldn’t I see the individual columns then? Turns out shredding is only a storage detail, not a logical schema change. The type stays as is. To see how the data is actually stored, we have to inspect the file’s metadata.
SELECT name, type, converted_type FROM parquet_schema('events.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
└───────────────┴────────────┴────────────────┘
Well, the way this is supposed to be read is as follows. id is of type INT32, the data variant
itself doesn’t really have a type but it comprises three parts: metadata, value and
typed_value. Of interest are value and typed_value. Each row with the data logical column is
either going to have the (fallback) byte-array value representation set, or the shredded
typed_value column. In this case, DuckDB’s autoshredder apparently decided that row 1 with the
integer data variant deserves to have its representation shredded, but not the other rows.
If, instead, we instruct DuckDB to shred data to a struct like so
COPY events TO 'events.parquet' (
FORMAT parquet,
SHREDDING {'data': 'STRUCT(name VARCHAR, age INTEGER)'}
);
we get this
SELECT name, type, converted_type, logical_type FROM parquet_schema('events.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
└───────────────┴────────────┴────────────────┘
Now we see data with its metadata, value and typed_value properties and also a value and
typed_value logical column for both name and age. For these the shredded types are UTF8 and
INT32 respectively.
To get an intuitive understanding, it is interesting to look at the output of
parquet-tools meta events.parquet from this tool:
Output
{
"NumRowGroups": 1,
"RowGroups": [
{
"NumRows": 4,
"TotalByteSize": 725,
"Columns": [
{
"PathInSchema": ["id"],
"Type": "INT32",
"ConvertedType": "convertedtype=INT_32",
"Encodings": ["PLAIN"],
"CompressedSize": 41,
"UncompressedSize": 39,
"NumValues": 4,
"NullCount": 0,
"MaxValue": 4,
"MinValue": 1,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "metadata"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 52,
"UncompressedSize": 60,
"NumValues": 4,
"NullCount": 0,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 81,
"UncompressedSize": 137,
"NumValues": 4,
"NullCount": 1,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "name", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 34,
"UncompressedSize": 119,
"NumValues": 4,
"NullCount": 4,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "name", "typed_value"],
"Type": "BYTE_ARRAY",
"ConvertedType": "convertedtype=UTF8",
"Encodings": ["PLAIN"],
"CompressedSize": 44,
"UncompressedSize": 128,
"NumValues": 4,
"NullCount": 3,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "age", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 34,
"UncompressedSize": 119,
"NumValues": 4,
"NullCount": 4,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "age", "typed_value"],
"Type": "INT32",
"ConvertedType": "convertedtype=INT_32",
"Encodings": ["PLAIN"],
"CompressedSize": 39,
"UncompressedSize": 123,
"NumValues": 4,
"NullCount": 3,
"CompressionCodec": "SNAPPY"
}
]
}
]
}Let’s take this line for example ["data", "typed_value", "name", "typed_value"]. A row in our
dataset is going to populate this logical column when the name property in the variant is a UTF8
string, otherwise it will fall back to the BYTE_ARRAY data.value. And indeed we see that 3 out of
4 values in this column are NULL, since row 4 is the only one that can populate this typed_value.
Conversely, the ["data", "value"] entry only has a single NULL value: only row 4 does not populate
the fallback column, populating the typed values instead.
If we look at ["data", "typed_value", "name", "value"] we can see that not a single non-NULL value
is here. I suppose this would be populated if we had a row with an object that contains the name
property, but with a type other than string.
I’ll just leave this blog post from the official Parquet website here for further reading. It has some nice visualizations that further explain how this works.
Autoshredding
So far, so good. But we don’t want to have to specify how the values should be shredded. The whole point would be for DuckDB to figure it out from the data, but somehow in our first attempt it just chose to “materialize” the integer and not our object.
Well, it looks like DuckDB decides what to shred based on how often a specific type occurs in the table. If I add two more rows that have the same shape as the object in row 4
INSERT INTO events VALUES (5, {'name': 'Bob', 'age': 27}::VARIANT);
INSERT INTO events VALUES (6, {'name': 'Flo', 'age': 31}::VARIANT);
COPY events TO 'events-more-structs.parquet' (FORMAT parquet);
I get our familiar shredding, this time without having to specify the type.
SELECT name, type, converted_type FROM parquet_schema('events-more-structs.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
└───────────────┴────────────┴────────────────┘
If I add a bunch more string columns now, I can get the autoshredder to only shred the UTF8 type.
Initially I would have expected that it should be possible to have both shredded age and name
values as well as a fallback not to an untyped BYTE_ARRAY but to a UTF8 string. But thinking about
this, I guess the typed_value for data is now an object of name and age (although it’s not
readily visible in the above representation), so it can’t be a string simultaneously.
It should be possible, however, if the string columns were in a new property inside an object. Let me give this a shot.
INSERT INTO events VALUES (10, {'message': 'Hello world'}::VARIANT);
INSERT INTO events VALUES (11, {'message': 'message'}::VARIANT);
INSERT INTO events VALUES (12, {'message': '42'}::VARIANT);
COPY events TO 'events-message.parquet' (FORMAT parquet);
SELECT name, type, converted_type FROM parquet_schema('events-message.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ message │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
└───────────────┴────────────┴────────────────┘
Indeed we can see that it now materialized age, name and message. Neat!
Real-world data
With my mental model starting to get less foggy, I think it’s time to start digging into some
real-world data from my existing observability setup. Luckily ClickHouse makes it really easy to get
my data out: SELECT * FROM events.otel_traces INTO OUTFILE './otel_traces.parquet' FORMAT Parquet.
This dumps all of the data into a single Parquet file with just over 10 million rows, weighing in at
800 MB.
The shape of this file is determined by the standard schema the OpenTelemetry collector created when I started ingesting data. In ClickHouse it looks like this:
CREATE TABLE default.otel_traces
(
`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`TraceId` String CODEC(ZSTD(1)),
`SpanId` String CODEC(ZSTD(1)),
`ParentSpanId` String CODEC(ZSTD(1)),
`TraceState` String CODEC(ZSTD(1)),
`SpanName` LowCardinality(String) CODEC(ZSTD(1)),
`SpanKind` LowCardinality(String) CODEC(ZSTD(1)),
`ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`ScopeName` String CODEC(ZSTD(1)),
`ScopeVersion` String CODEC(ZSTD(1)),
`SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`Duration` Int64 CODEC(ZSTD(1)),
`StatusCode` LowCardinality(String) CODEC(ZSTD(1)),
`StatusMessage` String CODEC(ZSTD(1)),
`Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)),
`Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)),
`Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
`Links.TraceId` Array(String) CODEC(ZSTD(1)),
`Links.SpanId` Array(String) CODEC(ZSTD(1)),
`Links.TraceState` Array(String) CODEC(ZSTD(1)),
`Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_duration Duration TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)
TTL toDateTime(Timestamp) + toIntervalDay(3)
SETTINGS ttl_only_drop_parts = 1
Once dumped to the Parquet file, we can see the shape as read by DuckDB here:
DESCRIBE SELECT * FROM 'otel_traces.parquet';
┌───────────────────────────────────────────────┐
│ otel_traces.parquet │
│ │
│ Timestamp timestamp with time zone │
│ TraceId varchar │
│ SpanId varchar │
│ ParentSpanId varchar │
│ TraceState varchar │
│ SpanName varchar │
│ SpanKind varchar │
│ ServiceName varchar │
│ ResourceAttributes map(varchar, varchar) │
│ ScopeName varchar │
│ ScopeVersion varchar │
│ SpanAttributes map(varchar, varchar) │
│ Duration ubigint │
│ StatusCode varchar │
│ StatusMessage varchar │
│ Events.Timestamp timestamp with time zone[] │
│ Events.Name varchar[] │
│ Events.Attributes map(varchar, varchar)[] │
│ Links.TraceId varchar[] │
│ Links.SpanId varchar[] │
│ Links.TraceState varchar[] │
│ Links.Attributes map(varchar, varchar)[] │
└───────────────────────────────────────────────┘
You can see a clear correspondence between the Parquet and ClickHouse types. Notably, this file is not in the shape we want yet. For this we’ll need to slice and dice it into different forms to benchmark how queries behave.
Benchmarks
To get an overview of the performance of the Variant type, I’m going to wrangle the payload into the following types to see how the different approaches stack up:
- JSON
- Map / Array
- Variant
- Fully materialized
For simplicity, I’m going to focus on two types of data I have stored in my system, GPS location data and access logs to this website. For both datasets I’ll run a query that can benefit from prefiltering data down to the relevant rows and another one that has to run expensive aggregations across the whole dataset.
For the GPS data, the first query should get distinct days I’ve visited a particular location. The second query will get average speed and altitude hourly across the whole timespan.
For the access logs, the first query will count error responses per URI, and the second one will aggregate daily traffic per browser.
Danger zone! The queries compared here were generated / translated by Claude and may not be the pinnacle of optimization.
GPS details
JSON
This nifty query gets the relevant data from my GPS entries and stores it into a new Parquet file with the data stored as JSON.
COPY (
SELECT
Timestamp,
'gps-location' AS service,
CAST(SpanAttributes AS JSON) AS data
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
) TO 'gps_location_json.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_json.parquet';
┌────────────────────────────────────┐
│ gps_location.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data json │
└────────────────────────────────────┘The nested structures inside the data field are unfortunately not parsed into JSON as well, but rather stay as string values of the top-level properties. But as the DuckDB JSON type is anyhow just backed by a VARCHAR, I don’t expect too much impact for this.
Now here is the query to list distinct days where I’ve visited a particular location. Pasting these coordinates into a map is left as an exercise to the reader.
SELECT DISTINCT
DATE_TRUNC('day',
CAST(json_extract_string(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.properties.timestamp'
) AS TIMESTAMP)
) AS day
FROM 'gps_location_json.parquet'
WHERE CAST(json_extract(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.geometry.coordinates[0]'
) AS DOUBLE) BETWEEN 8.486986 AND 8.493896
AND CAST(json_extract(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.geometry.coordinates[1]'
) AS DOUBLE) BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;The second query to aggregate some data into hourly buckets.
WITH g AS (
SELECT
CAST(json_extract_string(data, '$.location') AS JSON) AS loc
FROM 'gps_location_json.parquet'
)
SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(json_extract_string(loc, '$.properties.timestamp') AS TIMESTAMP)
) AS hour,
AVG(CAST(json_extract(loc, '$.properties.speed') AS DOUBLE)) AS avg_speed,
AVG(CAST(json_extract(loc, '$.properties.altitude') AS DOUBLE)) AS avg_altitude,
MAX(CAST(json_extract(loc, '$.properties.horizontal_accuracy') AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM g
GROUP BY hour
ORDER BY hour;Note how in those two queries we use the nested timestamp from within the payload instead of the
top-level Timestamp. That’s due to getting those proper timestamps messed up when migrating data
from an even older system into the one I use currently. If what I’m trying here comes to fruition,
I’m going to fix this up.
The benchmark results will follow at the end.
Map
For the Map type I could just leave the data as-is, since the original Parquet file already has the
SpanAttributes property as map(varchar, varchar). But due to the nesting of the data I care
about, this would actually end up being the JSON type just with a detour. Instead, I’m flattening
the structure into a proper Map below. Of course, this kind of defeats the purpose of not having to
care about the exact schema of the data coming my way. But for benchmarking the data types it will
do.
COPY (
WITH src AS (
SELECT
Timestamp,
CAST(SpanAttributes['location'] AS JSON) AS loc
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
)
SELECT
Timestamp,
'gps-location' AS service,
MAP {
'location.type': json_extract_string(loc, '$.type'),
'location.geometry.type': json_extract_string(loc, '$.geometry.type'),
'location.geometry.coordinates[0]': json_extract_string(loc, '$.geometry.coordinates[0]'),
'location.geometry.coordinates[1]': json_extract_string(loc, '$.geometry.coordinates[1]'),
'location.properties.altitude': json_extract_string(loc, '$.properties.altitude'),
'location.properties.battery_level': json_extract_string(loc, '$.properties.battery_level'),
'location.properties.battery_state': json_extract_string(loc, '$.properties.battery_state'),
'location.properties.course': json_extract_string(loc, '$.properties.course'),
'location.properties.course_accuracy': json_extract_string(loc, '$.properties.course_accuracy'),
'location.properties.horizontal_accuracy': json_extract_string(loc, '$.properties.horizontal_accuracy'),
'location.properties.motion[0]': json_extract_string(loc, '$.properties.motion[0]'),
'location.properties.speed': json_extract_string(loc, '$.properties.speed'),
'location.properties.speed_accuracy': json_extract_string(loc, '$.properties.speed_accuracy'),
'location.properties.timestamp': json_extract_string(loc, '$.properties.timestamp'),
'location.properties.vertical_accuracy': json_extract_string(loc, '$.properties.vertical_accuracy'),
'location.properties.wifi': json_extract_string(loc, '$.properties.wifi')
} AS data
FROM src
) TO 'gps_location_map.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_map.parquet';
┌────────────────────────────────────┐
│ gps_location_map.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data map(varchar, varchar) │
└────────────────────────────────────┘SELECT DISTINCT
DATE_TRUNC('day', CAST(data['location.properties.timestamp'] AS TIMESTAMP)) AS day
FROM 'gps_location_map.parquet'
WHERE CAST(data['location.geometry.coordinates[0]'] AS DOUBLE)
BETWEEN 8.486986 AND 8.493896
AND CAST(data['location.geometry.coordinates[1]'] AS DOUBLE)
BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(data['location.properties.timestamp'] AS TIMESTAMP)) AS hour,
AVG(CAST(data['location.properties.speed'] AS DOUBLE)) AS avg_speed,
AVG(CAST(data['location.properties.altitude'] AS DOUBLE)) AS avg_altitude,
MAX(CAST(data['location.properties.horizontal_accuracy'] AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_map.parquet'
GROUP BY hour
ORDER BY hour;Variant
Finally, let’s get to the real meat of the story. Here we store the gps location data into a Parquet file where the relevant data is stored as a Variant.
COPY (
SELECT
Timestamp,
'gps-location' AS service,
CAST(CAST(SpanAttributes['location'] AS JSON) AS VARIANT) AS data
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
) TO 'gps_location_variant.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_variant.parquet';
┌────────────────────────────────────┐
│ gps_location_variant.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data variant │
└────────────────────────────────────┘SELECT DISTINCT
DATE_TRUNC('day', CAST(data.properties.timestamp AS TIMESTAMP)) AS day
FROM 'gps_location_variant.parquet'
WHERE CAST(data.geometry.coordinates[1] AS DOUBLE) BETWEEN 8.486986 AND 8.493896
AND CAST(data.geometry.coordinates[2] AS DOUBLE) BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(data.properties.timestamp AS TIMESTAMP)) AS hour,
AVG(CAST(data.properties.speed AS DOUBLE)) AS avg_speed,
AVG(CAST(data.properties.altitude AS DOUBLE)) AS avg_altitude,
MAX(CAST(data.properties.horizontal_accuracy AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_variant.parquet'
GROUP BY hour
ORDER BY hour;The double casting to JSON and then Variant is apparently doing heavy lifting to ensure nested values are also properly typed. I can confirm that everything is nicely shredded after I save this file.
Materialized
This one again is not going to be an actual option for my use case as I need to know the schema up front, but it serves as a comparison.
COPY (
WITH src AS (
SELECT
Timestamp,
CAST(SpanAttributes['location'] AS JSON) AS loc
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
)
SELECT
Timestamp,
'gps-location' AS service,
json_extract_string(loc, '$.type') AS location_type,
json_extract_string(loc, '$.geometry.type') AS geometry_type,
CAST(json_extract(loc, '$.geometry.coordinates[0]') AS DOUBLE) AS lon,
CAST(json_extract(loc, '$.geometry.coordinates[1]') AS DOUBLE) AS lat,
CAST(json_extract(loc, '$.properties.altitude') AS DOUBLE) AS altitude,
CAST(json_extract(loc, '$.properties.battery_level') AS DOUBLE) AS battery_level,
json_extract_string(loc, '$.properties.battery_state') AS battery_state,
CAST(json_extract(loc, '$.properties.course') AS DOUBLE) AS course,
CAST(json_extract(loc, '$.properties.course_accuracy') AS DOUBLE) AS course_accuracy,
CAST(json_extract(loc, '$.properties.horizontal_accuracy') AS DOUBLE) AS horizontal_accuracy,
json_extract_string(loc, '$.properties.motion[0]') AS motion,
CAST(json_extract(loc, '$.properties.speed') AS DOUBLE) AS speed,
CAST(json_extract(loc, '$.properties.speed_accuracy') AS DOUBLE) AS speed_accuracy,
CAST(json_extract_string(loc, '$.properties.timestamp') AS TIMESTAMP) AS gps_timestamp,
CAST(json_extract(loc, '$.properties.vertical_accuracy') AS DOUBLE) AS vertical_accuracy,
json_extract_string(loc, '$.properties.wifi') AS wifi
FROM src
) TO 'gps_location_materialized.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_materialized.parquet';
┌──────────────────────────────────────────────┐
│ gps_location_materialized.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ location_type varchar │
│ geometry_type varchar │
│ lon double │
│ lat double │
│ altitude double │
│ battery_level double │
│ battery_state varchar │
│ course double │
│ course_accuracy double │
│ horizontal_accuracy double │
│ motion varchar │
│ speed double │
│ speed_accuracy double │
│ gps_timestamp timestamp │
│ vertical_accuracy double │
│ wifi varchar │
└──────────────────────────────────────────────┘The query here is very straight-forward.
SELECT DISTINCT DATE_TRUNC('day', gps_timestamp) AS day
FROM 'gps_location_materialized.parquet'
WHERE lon BETWEEN 8.486986 AND 8.493896
AND lat BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;SELECT
TIME_BUCKET(INTERVAL '1 hour', gps_timestamp) AS hour,
AVG(speed) AS avg_speed,
AVG(altitude) AS avg_altitude,
MAX(horizontal_accuracy) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_materialized.parquet'
GROUP BY hour
ORDER BY hour;Access logs details
Now let’s look at another dataset I have in my observability system: Access logs to this website. I build this page into a static bundle which I then serve using a Rust service. The only reason I have this service and reverse proxy via Caddy instead of just letting Caddy serve the static page directly is exactly these access logs. Caddy doesn’t give you insights per domain, but only aggregated metrics. So this code sets up an OpenTelemetry exporter and then logs events on request and on response.
Given the way OpenTelemetry ingests this data, the Events.Attributes property takes the following
form. The two objects correspond to the two info! invocations in the code.
[
{
"code.filepath": "src/main.rs",
"code.lineno": "67",
"code.namespace": "marending_dev",
"level": "INFO",
"referrer": "",
"request": "HEAD",
"target": "marending_dev",
"uri": "/",
"user_agent": "updown.io daemon 2.11"
},
{
"code.filepath": "src/main.rs",
"code.lineno": "85",
"code.namespace": "marending_dev",
"latency": "424164",
"level": "INFO",
"status": "200",
"target": "marending_dev"
}
]JSON
Now, going through the same spiel again, we extract the data into a Parquet file with the relevant fields cast to JSON.
COPY (
SELECT
Timestamp,
'marending' AS service,
CAST("Events.Attributes" AS JSON) AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_json.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_json.parquet';
┌────────────────────────────────────┐
│ marending_json.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data json │
└────────────────────────────────────┘Here’s the first query that counts error responses by URI. The top result is /robots.txt if you
care to know. .env is not much further down the list though.
SELECT
json_extract_string(data, '$[0].uri') AS uri,
COUNT(*) AS error_count
FROM 'marending_json.parquet'
WHERE json_extract_string(data, '$[1].status') LIKE '4%'
OR json_extract_string(data, '$[1].status') LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;And the second query to get the daily traffic by browser:
WITH r AS (
SELECT
Timestamp,
json_extract_string(data, '$[0].user_agent') AS user_agent,
CAST(json_extract(data, '$[1].latency') AS DOUBLE) AS latency_us
FROM 'marending_json.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Chrome%' AND user_agent NOT LIKE '%Edg%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' AND user_agent NOT LIKE '%Chrome%' THEN 'Safari'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%bot%' OR user_agent LIKE '%Bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;Map
COPY (
SELECT
Timestamp,
'marending' AS service,
"Events.Attributes" AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_map.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_map.parquet';
┌────────────────────────────────────┐
│ marending_map.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data map(varchar, varchar)[] │
└────────────────────────────────────┘SELECT
data[1]['uri'] AS uri,
COUNT(*) AS error_count
FROM 'marending_map.parquet'
WHERE data[2]['status'] LIKE '4%'
OR data[2]['status'] LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;WITH r AS (
SELECT
Timestamp,
data[1]['user_agent'] AS user_agent,
CAST(data[2]['latency'] AS DOUBLE) AS latency_us
FROM 'marending_map.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;Variant
COPY (
SELECT
Timestamp,
'marending' AS service,
CAST(CAST("Events.Attributes" AS JSON) AS VARIANT) AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_variant.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_variant.parquet';
┌────────────────────────────────────┐
│ marending_variant.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data variant │
└────────────────────────────────────┘SELECT
CAST(data[1].uri AS VARCHAR) AS uri,
COUNT(*) AS error_count
FROM 'marending_variant.parquet'
WHERE CAST(data[2].status AS VARCHAR) LIKE '4%'
OR CAST(data[2].status AS VARCHAR) LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;WITH r AS (
SELECT
Timestamp,
CAST(data[1].user_agent AS VARCHAR) AS user_agent,
CAST(data[2].latency AS DOUBLE) AS latency_us
FROM 'marending_variant.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;Materialized
COPY (
SELECT
Timestamp,
'marending' AS service,
-- Request event (index 1)
"Events.Attributes"[1]['code.filepath'] AS request_code_filepath,
"Events.Attributes"[1]['code.lineno'] AS request_code_lineno,
"Events.Attributes"[1]['code.namespace'] AS request_code_namespace,
"Events.Attributes"[1]['level'] AS request_level,
"Events.Attributes"[1]['referrer'] AS referrer,
"Events.Attributes"[1]['request'] AS request_method,
"Events.Attributes"[1]['target'] AS request_target,
"Events.Attributes"[1]['uri'] AS uri,
"Events.Attributes"[1]['user_agent'] AS user_agent,
-- Response event (index 2)
"Events.Attributes"[2]['code.filepath'] AS response_code_filepath,
"Events.Attributes"[2]['code.lineno'] AS response_code_lineno,
"Events.Attributes"[2]['code.namespace'] AS response_code_namespace,
CAST("Events.Attributes"[2]['latency'] AS BIGINT) AS latency_us,
"Events.Attributes"[2]['level'] AS response_level,
CAST("Events.Attributes"[2]['status'] AS SMALLINT) AS status,
"Events.Attributes"[2]['target'] AS response_target
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_materialized.parquet' (FORMAT PARQUET);SELECT uri, COUNT(*) AS error_count
FROM 'marending_materialized.parquet'
WHERE status >= 400
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM 'marending_materialized.parquet'
GROUP BY day, browser
ORDER BY day, requests DESC;Results
JSON, Map and materialized columns behave pretty much exactly like I would have expected. Only the
one I actually cared about, Variant, is disappointing. I might be hitting
this issue. Although at the time of writing, I get
similar performance on a nightly build of DuckDB even though it’s supposedly fixed on main. I think
this must truly be a limitation that will be addressed at some point. Looking at the query plan with
EXPLAIN ANALYZE, it’s clear that a lot of variant extractions happen, even though in my
understanding it shouldn’t have to do that.
In general, I’m impressed with the performance of DuckDB here. There are just over 30’000 rows in this dataset and it churns through that in no time even with JSON types.
Also interesting are the file sizes of the different parquet files, one can clearly see how the faster formats also compress the data better.
- JSON: 2.2MB
- Map: 1.3MB
- Variant: 1.1MB
- Materialized: 1.0MB
For the access logs, it looks similar. Although here this DuckDB limitation around Variant hits even harder. We must be hitting a highly unoptimized code path here to be that much slower than even JSON handling.
ClickHouse comparison
I don’t want to get too deep into this, as the goal here is primarily to establish the feasibility
of querying my data from a Parquet file. But just to get a feel for the landscape of query engines,
I’m going to translate the four queries I used here to ClickHouse equivalents and run them against
the live system. Note that this is very much an apples-to-oranges comparison: The numbers thus far
have been achieved on an M1 MacBook Air, the ClickHouse ones on a Hetzner CAX21. And the Parquet
files have first been isolated to their respective datasets, while ClickHouse has an assortment of
data in its otel_traces table. Not to mention that the Hetzner server is a noisy shared VPS.
Still, I’m honestly a bit surprised. I was expecting ClickHouse to do better. Looking back to this comparison where I looked at rather similar queries, ClickHouse was pulling way ahead of DuckDB. But there I was checking throughput rather than latency, maybe parallelism is where the client-server model shines. Either way, this only reinforces that DuckDB should be up to the task as I was satisfied with the performance of ClickHouse.
For fairness it should be mentioned that the schema used in ClickHouse is quite a bit more
complicated than our neat Parquet files and uses the Map type. I would suspect we could go
considerably faster by using ClickHouse’s own JSON type, which behaves similarly to Variant in my
understanding.
Verdict
I think I’m reaching a different conclusion than what I set out to see, but I’m still not complaining. I was expecting to see subpar performance on JSON strings, with Variant saving the day and making my undertaking feasible. But I got the opposite: JSON handling is surprisingly fast in DuckDB and the Variant support is unfortunately too young for real-world use.
But the bigger take-away is that the general pattern of writing and querying Parquet files could
actually be quite elegant. While I would be very nervous trying to alter the otel_traces table in
a running system right now, rewriting Parquet files to use Variant down the line should be a walk in
the park (save for rewriting queries of course). In a follow-up note I’ll have to home in on Parquet
file handling, e.g. is there an impact when I have a single file that mixes data sources? And how to
simultaneously write files periodically and query across different files?
Being terrible at leaving my side-projects be comes with plenty of downsides, but I do enjoy that the iteration tends to result in leaner and simpler systems.