florian marending

12 Sep 2024

How to store and query unstructured data

Reinventing the wheel

To store metrics, I have a Grafana and Prometheus deployment on my server. I'm primarily feeding data from various IoT sensors around the house into that system. Lately, I've craved a more customized, lean solution that would serve as my central place to chuck any data for later analysis. Instead of utilizing an existing stack, I'm setting out to build my own. It's going to be worse in every conceivable way, but it'll be mine.

Requirements

The system I'm envisioning (let's call it metis, after the Greek word for wisdom and its closeness to metrics) should be push-based, unlike Prometheus, which expects to be able to scrape a metrics endpoint. Scraping makes sense for always-on backend services, but not so much for battery-powered IoT devices.

It should not impose any strict schema on the data it's receiving. One use case will be to store structured logs of backend services. A new field being added to a logged event must obviously not require a schema migration in metis.

To simplify the hosting, metis should use an embedded database and also contain a visualizing frontend. (I just want to experiment with an embedded DB to be honest.)

Scenario

To evaluate possible approaches, I have to get a bit more specific. So let's consider the following use cases that metis should handle nicely:

GPS location data sent by Overland. This data comes in a predefined structure. Still, I wouldn't want to hardcode that schema into metis, if only to simplify migrating to a different GPS logger app later. Queries might be all coordinates visited in the last 30 days or the number of distinct times a certain geofence has been visited.

Structured logs from various backend services I'm operating. A sample log could look like this:

{ "timestamp": "2069-09-08T09:06:45.944Z", "span_id": "1287394", "level": "ERROR", "user": "Bobby Tables", "message": "Dropping database failed" }

Now consider a new event being emitted that includes an additional field for the origin IP address. Naturally, metis must still handle these logs gracefully. Here, a query might be the ongoing error rate of all requests to a specific service or the average response time of an endpoint (provided I have span entry and exit timestamps).

CO2 sensor data coming from an ESP32-based IoT device. A query of interest might be the average CO2 concentration per month over the last year.

Challenges

I've already abandoned the idea of making metis a general observability and metrics stack including user-provided data queries and customizable visualizations in favor of a monolith that serves exactly my use cases. That means dashboards are hardcoded with queries known at build time. You can get a glimpse of the original vision in this note.

With the hard part descoped, I'm now primarily concerned with how to store data of unknown shape in a way that still lets us query it efficiently later. Traditional databases expect to know the schema ahead of time; in return, they give us lots of mechanisms to speed up querying later. It boils down to a trade-off between work spent when storing data and work spent when querying data.

In particular, combining metrics and structured logs has been throwing me for a loop. Traditional time series data has known approaches to speed up querying; I'm not very concerned about this use case. Log data, on the other hand, is usually accessed by label and over an interval to, say, track down an error in an application. As far as I'm aware, it's rare to generate metrics from logs, which is what I effectively want: Given a stream of logs, I want to generate error rates and much more. Normally, people "pre-aggregate" these metrics in the application and expose metrics directly.

Why not just expose metrics for my structured logs use case then? Because I enjoy the idea of not having to predefine what kind of data I may be interested in later. Others have written about this.

Approaches

JSON column in DB

The most straightforward approach is clearly to just chuck the JSON body that is sent metis' way into a column. Insertion is very fast; the system has to do very little work here. Querying data, on the other hand, comes with quite an overhead. Running a query that reaches into the JSON body, say, averaging CO2 values by month, is quite a bit slower than if the values were numeric columns. Implementation effort is low for this approach.
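To make this concrete, here is a minimal sketch of the approach, assuming SQLite as the embedded database (with its JSON functions available) and Python as glue code; the events table and its fields are made up for illustration:

import sqlite3

conn = sqlite3.connect("metis.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, body TEXT)")

# Insertion is cheap: the raw JSON payload goes straight into a column.
conn.execute(
    "INSERT INTO events (ts, body) VALUES (?, ?)",
    ("2024-09-12T10:00:00Z", '{"sensor": "living_room", "co2": 612}'),
)
conn.commit()

# Querying reaches into the JSON at read time: average CO2 per month.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m', ts) AS month,
           avg(json_extract(body, '$.co2')) AS avg_co2
    FROM events
    WHERE json_extract(body, '$.co2') IS NOT NULL
    GROUP BY month
    ORDER BY month
    """
).fetchall()
print(rows)

Every query like this has to parse the stored bodies, which is exactly the "all the work happens on query" end of the trade-off.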

Entity-attribute-value model

In the EAV model, the idea is to have a single table that contains entity, attribute, and value columns. In particular, the attribute column stores the name of the attribute that the value describes. Say "CO2" in our example. See the Wikipedia page for more information. This brings the much-desired flexibility: We don't need to know attributes up front. However, I'm unclear on how you would store the structured log example. Here, you have multiple attribute-value pairs that should stay in relation. If you want to query the error rate of your application over time, it might be sufficient to just have level: "ERROR" entries in the DB, but if you want to query which user causes errors, these attributes must be linked. I suppose you could do that via the entity column.
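To illustrate that linking, here is a rough sketch, again assuming SQLite and Python; the eav table and the query are invented for the example. The sample log from above becomes several rows sharing one entity id, and a self-join on that id relates the attributes of one event:

import sqlite3
import uuid

conn = sqlite3.connect("metis.db")
conn.execute("CREATE TABLE IF NOT EXISTS eav (entity TEXT, attribute TEXT, value TEXT)")

# One logged event explodes into several rows that share the same entity id.
event_id = str(uuid.uuid4())
conn.executemany(
    "INSERT INTO eav (entity, attribute, value) VALUES (?, ?, ?)",
    [
        (event_id, "level", "ERROR"),
        (event_id, "user", "Bobby Tables"),
        (event_id, "message", "Dropping database failed"),
    ],
)
conn.commit()

# Which users cause errors? A self-join on the entity id links the attributes back together.
rows = conn.execute(
    """
    SELECT u.value AS user, count(*) AS errors
    FROM eav AS lvl
    JOIN eav AS u ON u.entity = lvl.entity AND u.attribute = 'user'
    WHERE lvl.attribute = 'level' AND lvl.value = 'ERROR'
    GROUP BY u.value
    """
).fetchall()
print(rows)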

Schema evolution

Another idea I've had, where I couldn't find much prior art (probably for good reason), is automatic schema evolution. Imagine a system that automatically migrates a table schema to fit the incoming data: Queries would be fast as the data is stored in an efficient format, inserts would be fast if the data conforms to the existing schema. The moment a datum comes your way that deviates from the schema, all hell breaks loose. The overhead of migrating the DB in the middle of normal operation could be detrimental. I think this could be managed to some degree, for instance by buffering data in memory while the DB sorts itself out.
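A naive sketch of what such automatic evolution could look like, again assuming SQLite and Python; it skips any validation of field names and ignores nested JSON, type changes, and concurrent writers, purely for illustration:

import json
import sqlite3

conn = sqlite3.connect("metis.db")
conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT)")

def insert_evolving(table: str, payload: str) -> None:
    """Migrate the table so the incoming datum fits, then insert it."""
    datum = json.loads(payload)
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    # Any unknown field triggers a migration in the middle of normal operation.
    # Field names go into the SQL unvalidated here -- fine for a sketch, not for real input.
    for field in datum.keys() - existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {field}")
    columns = ", ".join(datum)
    placeholders = ", ".join("?" for _ in datum)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
        list(datum.values()),
    )
    conn.commit()

# The first insert adds the level and user columns; the second one adds origin_ip.
insert_evolving("logs", '{"ts": "2024-09-12T10:00:00Z", "level": "ERROR", "user": "Bobby Tables"}')
insert_evolving("logs", '{"ts": "2024-09-12T10:00:05Z", "level": "ERROR", "origin_ip": "203.0.113.7"}')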

There are many more approaches, specifically for time series data: log-structured merge trees, time-structured merge trees, and other hierarchical data structures. I won't consider them here; since I'll be writing the system myself, I want to avoid that kind of complexity.

October update: ClickHouse published a nice article on their new JSON storage format. It's an interesting read and kind of confirms the intuition that there isn't an easy way to do this.

Choices, choices

When we line up the approaches on an axis from "all the work happens on insert" to "all the work happens on query", the simple JSON column is all the way to the right. But somehow I enjoy that. It makes a lot of sense to only extract the data we actually need from the JSON at query time; it frees us to start chucking data into the system immediately and to figure out later what kind of queries we want to run on it. It reminds me of event sourcing in that regard. The fact that it is the simplest approach to implement doesn't hurt its case either.

Optimizations

The simple JSON approach also has the advantage that you can start simple and layer optimizations on top later. For instance, most DBMSs allow adding indexes on JSON fields. So once it is clear which queries are desirable, indexes can be added to speed them up. Also, the raw JSON data could be materialized into a shape that is more efficient to query. Whether this happens in real time or in batches lagging behind is up to the user.
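As an example, assuming the SQLite-backed events table from the earlier sketch, an expression index into the JSON body and a materialized summary table might look roughly like this:

import sqlite3

conn = sqlite3.connect("metis.db")

# Index directly into the JSON body once the CO2 query turns out to matter.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_events_co2 ON events (json_extract(body, '$.co2'))"
)

# Materialize the monthly averages into a plain table that is cheap to read.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS co2_monthly AS
    SELECT strftime('%Y-%m', ts) AS month,
           avg(json_extract(body, '$.co2')) AS avg_co2
    FROM events
    GROUP BY month
    """
)
conn.commit()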

Going further, to keep the size of the database in check (logs might blow it up quickly), the data can easily be segmented into time chunks. Serializing older data (whose shape is known at this point, since nothing new gets added to it) into, e.g., Parquet files and uploading them to object storage might be a fancy way to keep all the data around, albeit at the cost of higher access latency.
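A rough sketch of that archiving step, assuming the same events table, pyarrow for the Parquet serialization, and leaving the actual upload to object storage out:

import sqlite3

import pyarrow as pa
import pyarrow.parquet as pq

conn = sqlite3.connect("metis.db")
cutoff = "2024-01-01T00:00:00Z"

# Everything older than the cutoff has a known, stable shape by now.
rows = conn.execute("SELECT ts, body FROM events WHERE ts < ?", (cutoff,)).fetchall()
table = pa.table({
    "ts": [ts for ts, _ in rows],
    "body": [body for _, body in rows],
})

# Serialize the chunk; shipping the file to object storage is left out here.
pq.write_table(table, "events-before-2024.parquet")

conn.execute("DELETE FROM events WHERE ts < ?", (cutoff,))
conn.commit()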