06 Jul 2025
Exploring LLMs
A skeptic reconsiders
Work in progress…
Goals and summary
[x] Learn technical details of how LLMs work
Excellent videos by 3Blue1Brown
[x] Explore modern local models
Not great if I want more than simple summarization or proofreading
[x] Try local model on my jour data set
Not working. Context too small or they just can’t keep track of anything.
[x] Explore modern LLM tools and agents
Tried Gemini and Claude. I’ll focus on Claude for now just to dive deep rather than wide.
[x] Explore Claude on proofreading
Works great.
[x] Explore Claude on spotting mistakes in crypto code
Excellent.
[x] Explore Claude on architecture discussions
Phenomenal.
[x] Explore Claude on search (Google replacement)
Pretty useful.
[x] Explore Claude on embedded project
Pretty useful.
[x] Explore Claude on web dev project
Mixed bag.
Explore local models, Claude and Gemini
Who is currently in the Swiss National Council
- Gemini 2.5 Flash: Very good
- Claude Sonnet 4: Good but slow, can’t figure out presidency
- Gemma3:1B: Bad, gets even the number of members wrong
- Llama 3.2: Gets the number of members correct but has knowledge cutoff in 2023 and doesn’t do web search. Also slow to load model.
- Qwen3: Super slow but pretty good.
- Gemma3: Gets the number of seats badly wrong and is annoyingly agreeable when this is pointed out
Summarize my server setup blog post and explain the scripts
- Gemini 2.5 Flash: Phenomenal
- Claude Sonnet 4: Phenomenal, a bit more concise than Gemini
- Gemma3:1B: Useless. When asked to find typos, it thinks I’m talking about typos in its own response. Asking it to explain something in the script is fruitless.
- Llama 3.2: Not good. Doesn’t find any typos and can’t explain script at all.
- Qwen3: Not good. It comes up with a new deployment script, doesn’t analyze mine.
- Gemma3: Not good. Comes up with new deployment script like Qwen3. Slightly more similar to the script in the post though.
Local models conclusion
As much as I would like to use local models for privacy reasons, it doesn’t look like it’s going to happen. This is not to say that open-weight models in general are bad, just the ones that are small enough that I can run them on my laptop. So I guess I’ll have to go with one of the bigger hosted ones.
Hosted models
I’m going with Claude for this exploration for two reasons. First, it’s said to be the strongest model for coding, which is my primary use case. Second, Anthropic seems to me the most sympathetic of the AI companies out there.
I can’t get behind Google for privacy reasons: they can simply bring together too many threads of my digital persona. OpenAI is exceptionally unpleasant in their public conduct. Mostly Sam Altman, if I’m being honest. Grok also won’t cut it, for similar reasons.
Unfortunately there are no good options when it comes to privacy, but it seems Anthropic is the least bad.
Claude
Proofreading
Works very well. I can paste in a note in `mdx` format and just need to remind it to keep line breaks as is, and it will do a fine job directly correcting typos and awkward sentences, ready to paste back into my website.
It even handles markdown tables etc. flawlessly. Much simpler than pasting the note into an online spell-checker and manually fixing all the findings. Since the note is checked into git, it’s easy to see the diff that Claude produced. I wouldn’t trust it otherwise.
Spot mistakes in crypto code
I paste in the encryption code used in Fieldnotes and ask Claude to assess the security of the code and whether it follows best practices. Claude says the code is sound and accomplishes its goal. Asking follow-up questions, it’s clear that Claude has a rather sophisticated understanding, e.g. “What could be the advantage of using per-note keys rather than encrypting notes directly with the main key?”.
To check that Claude isn’t overly agreeable, I ask the same question in a new context with modified crypto code that includes some issues. In one instance I set the main key to be `extractable`, which it correctly flags as something that could be improved. Further, when passing `undefined` as the initialization vector, it raises alarm bells, calling this a critical bug with catastrophic consequences.
This isn’t to say that an AI chatbot should be entrusted with security audits, but it is still reassuring that it doesn’t mess up on obvious issues.
Architecture discussion
I’m writing a short overview of requirements for my upcoming rewrite of my observability / metrics tool. There are a bunch of open questions in my head and I haven’t managed to get a clear picture of what I want to do. Part of the reason is the absolute mess that is OpenTelemetry documentation.
# Observatory: Observability and Time Series System
Observatory is a new system to store observability data as well as time series data.
## Store time series data from embedded devices
There are multiple embedded devices that periodically record sensor data. They should be able
to send that time series data to Observatory.
## Store observability data from backend services
There are multiple backend services implemented in Rust. They use the `tracing` crate to instrument them.
They should be able to export their tracing data to Observatory.
## Visualization
Observatory should include a web application that visualizes the data stored within.
## Questions
- What storage layer is advisable?
- Can a database be used directly to ingest observability data or is a backend application in front of
the storage layer needed?
After pasting in this document and prompting Claude to ask clarifying questions, it produces the updated requirements doc below. It contains a bunch of boilerplate that would satisfy an architect, but also some interesting stuff.
Resulting architecture document
Overview
Observatory will handle data ingestion from embedded devices and backend services, store the data efficiently for perpetual retention, and provide visualization through Grafana integration.
Scale and Performance Requirements
- Device Scale: Support for less than 20 embedded devices
- Data Volume: Handle at most hundreds of messages per second
- Retention: Perpetual data retention (no automatic deletion)
- Query Performance: Up to 5 seconds delay acceptable for visualization
- Downsampling: No requirement for data downsampling or tiered storage
Data Sources
Store Time Series Data from Embedded Devices
There are multiple embedded devices that periodically record sensor data. They should be able to send that time series data to Observatory.
Device Specifications:
- Connectivity: WiFi connection to Observatory
- Data Types: Temperature, CO2, location, and other sensor data
- Flexibility: API should allow sending arbitrary JSON data structures
- Frequency: Data transmission every 30 seconds to every 10 minutes
- Reliability: Devices should cache data locally if Observatory is unreachable
- Connection: No intermittent connectivity issues expected
Store Observability Data from Backend Services
There are multiple backend services implemented in Rust. They use the `tracing` crate to instrument them. They should be able to export their tracing data to Observatory.
Observability Specifications:
- Data Types: Both spans and events from Rust tracing
- Format: Structured logging and metrics
- Integration: OpenTelemetry standard for data export
- Services: Multiple Rust backend services using the `tracing` crate
Storage Architecture
Technology Stack
- Database: ClickHouse (preferred for time-series analytics)
- Backend: Rust application layer
- Alternative: Open to other technologies if justified
Database Schema
Two specialized tables in ClickHouse:
-- Device sensor data
CREATE TABLE device_data (
timestamp DateTime64(3),
device_id String,
sensor_type String, -- 'temperature', 'co2', 'location', etc.
data JSON,
INDEX idx_device_id device_id TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_sensor_type sensor_type TYPE bloom_filter(0.01) GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (device_id, timestamp);
-- Observability data (spans + events)
CREATE TABLE observability_data (
timestamp DateTime64(3),
service_name String,
trace_id String,
span_id String,
event_type Enum('span', 'event'),
data JSON,
INDEX idx_service service_name TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_trace trace_id TYPE bloom_filter(0.01) GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (service_name, timestamp);
Backend Application Requirements
A Rust backend application is required in front of ClickHouse rather than direct database ingestion to provide:
- Protocol Translation: Convert device JSON and OpenTelemetry data to ClickHouse schemas
- Authentication: API key management and user authentication
- Data Validation: Schema validation and data transformation
- Performance Optimization: Batching and buffering for efficient ingestion
- Flexible JSON Handling: Support for arbitrary sensor data structures
API Specifications
Device Data Ingestion
- Protocol: HTTP REST interface
- Endpoint: `/api/v1/devices/{device_id}/data`
- Method: POST
- Authentication: API key based authentication
- Data Format: JSON payload with sensor readings
Observability Data Ingestion
- Protocol: OpenTelemetry Protocol (OTLP)
- Endpoint: `/api/v1/otlp/v1/traces`
- Standard: Full OpenTelemetry compatibility
- Integration: Compatible with the Rust `tracing` crate via OpenTelemetry exporters
Visualization
Observatory should integrate with Grafana for data visualization rather than building a custom web application.
Visualization Requirements:
- Platform: Grafana integration with ClickHouse data source
- Users: Primary users are operators
- Dashboards: Support for real-time dashboards, historical analysis, and alerting
- Query Types: Complex aggregations across time windows and devices
- Performance: Near real-time updates (up to 5 second delay acceptable)
Authentication and Security
API Authentication:
- API key authentication for device data ingestion
- API key authentication for observability data ingestion
User Authentication:
- User authentication system for Grafana visualization access
- No device-level permissions required (all operators can see all device data)
Authorization Scope:
- Authentication required for both data ingestion and visualization
- Single-tenant system (no multi-tenancy requirements)
Deployment and Scalability
Constraints:
- No specific deployment constraints (cloud/on-premises flexibility)
- No horizontal scalability requirements given the scale
- No specific performance SLA requirements beyond the 5-second visualization delay
Integration:
- No requirement to integrate with existing monitoring tools
- No requirement to export data to other systems
- Grafana satisfies all visualization requirements
Technical Implementation Details
Data Processing
- Batching: Configurable batch insertion (2-3 second intervals) to meet 5-second delay target
- Error Handling: Robust error handling for device connectivity issues
- Data Transformation: Convert OpenTelemetry spans/events to unified storage format
Performance Optimization
- Connection Pooling: ClickHouse connection pooling for efficient database access
- Indexing: Optimized indexes for common query patterns
- Partitioning: Monthly partitioning for efficient data management
Monitoring and Reliability
- Caching: Local device caching when Observatory is unreachable
- Buffering: Backend buffering to handle traffic spikes
- Health Checks: System health monitoring capabilities
Success Criteria
- Successfully ingest data from up to 20 embedded devices
- Handle hundreds of messages per second without data loss
- Provide real-time visualization through Grafana with sub-5-second delay
- Support perpetual data retention without performance degradation
- Authenticate and authorize all data access appropriately
- Maintain system availability even when individual devices are offline
I also ask follow-up questions regarding OpenTelemetry that I hadn’t been able to find answers to after hours of scouring the web, and Claude gives competent answers. Overall, exceptionally helpful for this task.
Search replacement
- Restaurant recommendations: Good
- Smart plug product search: Okay, has to be pointed in some directions but can summarize options quite well.
- Particulate matter sensor: Very good
- JS editor research: Excellent
- HAR file to wiremock stub: Why search for a tool when it can generate the script directly?
Embedded development
- Failed to set up a Rust project for programming an ESP32-C6
- But then again, so did I until I figured out a peculiarity in my toolchain
- Interestingly, Claude prefers to generate code from memory rather than invoke the template generator it most definitely learned this code from. That leads to outdated dependencies.
- Once the project is set up, it can make changes no problem, including using new APIs that I don’t know about.
Web development
Masonry grid
- Changing an image layout from cropped images in a square grid to masonry (without CSS masonry grid): it does not manage well. It literally tried to absolutely position the images *facepalm*.
- Claude lacks contextual persistence: it just modified a file that now throws a JS error. Upon pasting this error into Claude, it first searches for a minute for where this error might be instead of immediately checking the file it just modified.
- Not fun to use at all. Claude just messed something new up every time. An exercise in frustration. Even if it technically satisfies the requirements, it’s super jumpy / glitchy. It has no taste whatsoever for building a robust solution. To be fair, it is building this blind: it has no channel to observe the result, and that would be hard for me as well. But that’s just the nature of these tools for now, I guess.
- I’m giving up on this. It will be faster and, importantly, less nerve-wracking to do this myself.
Async image processing
- Changing an image upload endpoint from immediate compression and upload of multiple versions to one where only the original is uploaded and the compressed versions are generated and uploaded asynchronously: works flawlessly on the first try! Did not expect that.
- Later, I noticed a bug that I couldn’t get Claude to resolve itself. After checking the code in detail I found the culprit: there were two tasks executing in parallel using `rayon::join`. The error returned from this was checked, but what stayed hidden was that the individual `Response`s went unchecked. In this case, an upstream storage API was sometimes throwing `500`, but that went unnoticed due to the missing error handling. Bugs like this don’t happen to me; I’m too meticulous in handling any error the type system throws my way (part of the reason I enjoy Rust so much). That makes it extra annoying when I need to spend 5 minutes tracking down this bug.
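To make the failure mode concrete, here is a minimal Rust sketch of the pattern (the `UploadResponse` type and `upload` function are made-up stand-ins for the storage API call, not the actual code): the outer `Result`s returned by `rayon::join` are handled, but without the final status check a `500` would silently count as success.

```rust
// Hypothetical stand-in for the storage API's HTTP response.
struct UploadResponse {
    status: u16,
}

// Hypothetical upload call; imagine this performing the HTTP request.
fn upload(_variant: &str) -> Result<UploadResponse, std::io::Error> {
    // The upstream API occasionally answers with a 500.
    Ok(UploadResponse { status: 500 })
}

fn upload_both() -> Result<(), std::io::Error> {
    // rayon::join runs both closures in parallel and returns their results as a tuple.
    let (small, large) = rayon::join(|| upload("small"), || upload("large"));

    // The outer Results are checked...
    let (small, large) = (small?, large?);

    // ...but only this loop inspects the HTTP status of each response.
    // Without it, a 500 from the storage API goes unnoticed.
    for resp in [small, large] {
        if resp.status >= 400 {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                format!("upload failed with status {}", resp.status),
            ));
        }
    }
    Ok(())
}

fn main() {
    if let Err(e) = upload_both() {
        eprintln!("upload error: {e}");
    }
}
```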
Notes on Deep Dive into LLMs like ChatGPT
- Base models don’t know about question / answer format, they simply predict the next token from the context.
- One can fake a conversational agent by constructing a few-shot example prompt that shows the question / answer format, ending in `answer:`. The model will continue in a similar fashion (see the sketch after this list).
- Instruct models are post-trained on human-generated conversations to properly bake in the concept of answering questions. While the base training might take months, post-training typically takes only hours.
- Hallucinations are addressed by first probing a model for what it knows: paste a paragraph of Wikipedia into one model and have it generate question / answer pairs related to the paragraph. This is easy because the answers are right there in the context window. Take the generated questions to another model and have it answer without the context. Compare answers to determine whether the second model knows the answer. If it doesn’t, add the question to the post-training data set with the answer “Don’t know”.
- Tool usage works as follows: Model generates special tokens for e.g. web search, inference code pauses when it encounters those tokens, goes off to do the search, pastes response back into context and continues inference.
- Interesting insight on reasoning: There is only a fixed amount of computation flowing into any given token. So training data needs to be phrased in such a way that the model doesn’t have to solve the entire problem in one token, but instead can spread out intermediary computation on more tokens. Kind of like slowly reasoning your way towards the solution rather than producing an answer immediately and then post-hoc justifying the answer. This leads to the conclusion that LLMs need tokens to think. Thinking is intrinsically linked to language processing.
- Many counting and spelling tasks don’t work well because of tokens. Tokens only exist for efficiency; there are efforts to move to character-level or byte-level models, which may solve this problem.
- Post-training: Reinforcement learning: the model is sampled many times on the same question. Answers are checked for correctness against the known answer. Correct answers are trained on to make those token streams more likely (reinforced). E.g. the model has reasoned its way through the problem well, not pinning too much computation on a single token, and thus arrived at the right answer. This is good and should be encouraged through fine-tuning on those answers.
- DeepSeek R1 was the first model where the use of reinforcement learning was publicly talked about. This model learned to generate thinking sequences, where multiple different approaches are generated and compared by the model. Only then does it generate a nice output intended for the human with the previous thinking in context.
- together.ai for trying open weights models
- Reinforcement learning in verifiable domains: There is a way to reliably tell what the correct answer is. Trivially or via LLM judge.
- RLHF: Reinforcement learning from human feedback
- Reinforcement learning in unverifiable domains (e.g. jokes): train a reward model from a little human feedback, then use that to judge many outputs.
- Discriminator-generator gap: For a human it’s way easier to judge output than create output.
- Risk of RLHF: RL may discover a way to game the reward model, since only a lossy simulation of a human is judging the output. That may lead to nonsensical output getting high scores (adversarial examples).
- As a result, RLHF is often run only briefly to avoid this problem.
- Upcoming capabilities: Multimodal models: Tokenize audio, video and train as usual.
- Upcoming capabilities: Agents: Long, coherent, error-correcting contexts.
- Upcoming capabilities: Test-time training: Current models only do in-context learning on the context.
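As a toy illustration of the few-shot trick mentioned above, here is a small Rust sketch that just builds such a prompt string (the example questions and the helper name are made up for illustration). A base model completing this text will tend to continue with an answer, because that is the most likely continuation of the pattern.

```rust
/// Build a few-shot prompt that coaxes a base model into a question / answer format.
/// The example Q/A pairs are made up for illustration.
fn few_shot_prompt(question: &str) -> String {
    let examples = [
        ("What is the capital of Switzerland?", "Bern"),
        ("How many cantons does Switzerland have?", "26"),
    ];

    let mut prompt = String::new();
    for (q, a) in examples {
        prompt.push_str(&format!("question: {q}\nanswer: {a}\n\n"));
    }
    // End with a bare "answer:" so the model's most likely continuation is an answer.
    prompt.push_str(&format!("question: {question}\nanswer:"));
    prompt
}

fn main() {
    println!("{}", few_shot_prompt("Who is currently in the Swiss National Council?"));
}
```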
Notes on How I use LLMs
- Keep context on-topic. Large context may distract model and slightly decrease accuracy.
- Thinking (reasoning) models are considered ones that have been improved with reinforcement learning.
- GPT-4o is non-thinking model.
- OpenAI models starting with `o` (o3-mini, o1, etc.) are thinking models.
- Search tool use very useful to quickly gather websites and summarize content.
- Deep research combines search and thinking to do research.
- Some models support document upload. Here, the document is simply added to the context for the LLM to query.
- ChatGPT is trained to recognize math problems it probably can’t do in its “head”. In this case it will invoke a python interpreter instead (tool use).
- Claude Artifacts can build little applications and run them directly in browser in their interface.
- SuperWhisper to transcribe voice to text system-wide. Also useful for prompting without typing.
- Advanced voice mode in ChatGPT: Handle voice natively in the LLM instead of transcribing to text and operating on that.
- NotebookLM for generating podcasts on arbitrary topics.
- DALL-E does not generate images inside the LLM, it sends a caption to a separate image-generation model.
- ChatGPT typically wipes context in a new chat. But there is the option to ask it to remember something. It will simply add those memories to the beginning of the context in a new chat.
- ChatGPT custom instructions to avoid repeating preferences in every new chat.
- Custom GPT: provide a system prompt to get answers in specific format.
Notes on Transformers, the tech behind LLMs
- “Attention is all you need” originally introduced transformers.
- GPT-3 has 175B parameters
- Word embedding in GPT-3 is 12k-dimensional
- Typical word-embedding may choose to encode gender information in one dimension. Thus, the classic `man - woman = uncle - aunt`. Similarly, `hitler + italy - germany = mussolini`.
- Further, one direction in the embedding space indicates plurality, i.e. `plur = cats - cat`, and the dot product between plurals and the `plur` direction is higher than with singular words.
- Embedding matrix in GPT-3 has around 617M weights.
- Embedding maps every token to a vector in isolation. It’s the attention mechanism that enables the network to exchange meaning between the vectors, to arrive at a more enriched meaning for each token. E.g. `river bank` vs `deposit at the bank`: `bank` has a different meaning based on context.
- At the end, the unembedding matrix maps the last token in the context to a probability distribution across the entire vocabulary (50k in GPT-3) for prediction.
- All the other token positions in the last layer actually encode predictions of their immediately following tokens and are not used for the prediction of a new token. This turns out to be useful for training.
- Unembedding matrix also has 617M weights in GPT-3.
- Softmax is used to turn the output vector into a probability distribution. The temperature is an extra parameter here to control how “sharp” the distribution is: with `T = 0` the highest component gets 100% of the probability, while `T = 5` spreads out the probability much more evenly, making predictions with lower components in the vector more likely (see the formula after this list).
- Components in the output vector after unembedding are called logits.
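For reference, the temperature-scaled softmax described above, written out (with $z_i$ the logits and $T$ the temperature):

$$
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

As $T \to 0$, essentially all probability concentrates on the largest logit; a larger $T$ flattens the distribution.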
Notes on Attention in transformers, step-by-step
- Initially, every occurrence of a given token gets mapped to the same embedding vector; there is no sense of context yet.
- One interaction of attention may be to update the embeddings of nouns with the preceding adjectives. This should refine the embedding vector to one that captures the essence of the noun in context better. (this assumes tokens are words).
- Query and key vectors determine how relevant each token is to each other. i.e. the embedding of one word attends to the embedding of another.
- The dot products of each query vector with each key vector yield the attention pattern (see the formula after this list).
- To improve training, not only is the last token expected to predict the next one, but all tokens are expected to predict the following.
- Masking sets the entries of the attention pattern where a later token would influence an earlier one to negative infinity before the softmax (so they become 0 after), to ensure later tokens cannot “give away” the prediction of earlier ones.
- The size of the attention pattern is the bottleneck for the context window, as it scales with the square of the context length.
- Finally, embeddings of tokens are updated by multiplying the attention pattern by value vectors (produced using a value matrix). This way, every token gets some part of every other token preceding it, but more from the ones that are attending to it more strongly.
- This whole process is called one head of attention.
- One attention head uses approximately 6.4M weights.
- Multi-headed attention is running multiple attention heads in parallel: 96 per block for GPT-3.
- The deltas from all attention heads are added together to the original embedding.
- GPT-3 has 96 attention blocks, leading to 58B weights dedicated to attention.
- The rest of the 175B parameters are in the in-between multilayer perceptron layers.
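The attention pattern referenced above, in the standard notation from “Attention is all you need” (with $Q$, $K$, $V$ the query, key, and value matrices and $d_k$ the query/key dimension):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

The softmax of the scaled dot products $QK^\top / \sqrt{d_k}$ is the attention pattern (with causal masking applied before the softmax), and multiplying by $V$ produces the deltas that get added to the embeddings. For a context of length $n$ the pattern has $n^2$ entries, which is why it bottlenecks the context window.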
Notes on How might LLMs store facts
- 2/3 of weights in a typical transformer live in the Multilayer Perceptron layers (MLP).
- In MLP layers vectors don’t “talk” to each other, the operations happen in isolation.
- First operation in MLP layer is a matrix multiplication that maps input vector into higher dimension.
- In case of GPT-3, this matrix has 49k rows.
- In some way, each of those rows can be thought of as a question that is being asked of the input vector.
- Matrix multiplication is like dot product between rows of matrix and vector.
- Next, the resulting vector is run through a non-linear function (e.g. ReLU) to nicely clip the components that didn’t satisfy their question.
- Finally, a down-projection matrix maps the intermediate vector down to the input dimension.
- The resulting vector is added to the input vector (see the formula after this list).
- Single MLP layer has about 1.2B parameters.
- Fascinating insight: for an n-dimensional space, there are at most n vectors that are pairwise orthogonal to each other. But if the requirement is relaxed slightly, to angles between 89 and 91 degrees, the number of such vectors actually grows exponentially with the dimension! (Johnson-Lindenstrauss lemma)
- This has huge implications for LLMs and explains why model performance seems to scale so well with size. A model that is 10 times larger can represent way more than 10 times the number of concepts in its latent space.
- This explains why inside an MLP layer the neurons are not lighting up as unit vectors to represent a single concept, instead it’s some random looking vector that happens to be one of the nearly orthogonal directions in that high-dimensional space.
- Research area trying to extract true meaning of those neurons: Sparse Autoencoders.
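The MLP layer described in the bullets above, written out (using ReLU as the example non-linearity, with $W_\uparrow$ the up-projection, $W_\downarrow$ the down-projection, and bias terms included for completeness; $E$ is the incoming embedding vector):

$$
\mathrm{MLP}(E) = W_\downarrow \, \mathrm{ReLU}(W_\uparrow E + B_\uparrow) + B_\downarrow
$$

and the layer’s output is added back to its input, $E' = E + \mathrm{MLP}(E)$.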