Hi :) I’m Sebastian. At my job, I work on LLM agents, and in my free time, I train language models and write about AI.
-
The last layer’s hidden state in a transformer is meant only to be decoded into token probabilities. [Read More] -
Tokens vs. Bytes
Compared to bytes, tokens have two advantages: 1) They lead to shorter sequence lengths; 2) Their embeddings contain trainset-wide statistics on the specific combination of bytes that they consist of. They also have two disadvantages: 1) They are poorly legible; 2) They encourage memorization. [Read More] -
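To make the sequence-length advantage concrete, here is a quick sketch (assuming tiktoken and its GPT-2 encoding, which the post itself doesn’t specify) comparing the raw byte count of a string to its BPE token count:

```python
import tiktoken  # assumption: tiktoken's GPT-2 BPE, chosen only for illustration

enc = tiktoken.get_encoding("gpt2")
text = "Compared to bytes, tokens lead to shorter sequence lengths."

num_bytes = len(text.encode("utf-8"))  # sequence length if we fed raw UTF-8 bytes
num_tokens = len(enc.encode(text))     # sequence length with a BPE tokenizer
print(num_bytes, num_tokens)           # the token sequence is several times shorter
```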
The Tick-Tock-Boom Cycle: A Strategic Pattern for LLM Development
Authors: Sebastian Nicolas Müller (snimu), Claude 3.5 Sonnet (new) [Read More] -
Sorting shuffled data as a verifiable task
Sorting shuffled data is a great verifiable task for RL, and by extension, a good eval benchmark. In this article, I want to make the case for it so that it gets used more. [Read More] -
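As a sketch of what makes the task verifiable (the function names here are mine, not from the article), the reward can be checked exactly against the sorted ground truth:

```python
import random

def make_task(n: int = 8, lo: int = 0, hi: int = 99) -> tuple[list[int], list[int]]:
    """Build one task: a shuffled list of integers plus its sorted ground truth."""
    data = [random.randint(lo, hi) for _ in range(n)]
    shuffled = data[:]
    random.shuffle(shuffled)
    return shuffled, sorted(data)

def reward(model_output: list[int], target: list[int]) -> float:
    """Binary, exactly checkable reward: 1.0 iff the model sorted correctly."""
    return float(model_output == target)

shuffled, target = make_task()
print(reward(sorted(shuffled), target))  # a perfect "model" gets reward 1.0
```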
Model stacking
I have a crazy idea for doing large decentralized, asynchronous training of transformers. This article is very speculative. [Read More] -
Forward-Backward prediction
What happens if we do fully causal prediction in the backward direction (last-word-first) in addition to the usual forward causal prediction? [Read More] -
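Purely as an illustration of the setup (not necessarily how the article implements it), the backward objective is just next-token prediction on the reversed sequence:

```python
import torch

tokens = torch.tensor([[12, 7, 99, 3, 41]])              # (batch, seq) of token ids
fwd_inputs, fwd_targets = tokens[:, :-1], tokens[:, 1:]  # usual forward causal objective
rev = torch.flip(tokens, dims=[1])                       # reverse the sequence: last token first
bwd_inputs, bwd_targets = rev[:, :-1], rev[:, 1:]        # same causal objective, backward direction
# the total loss would add the forward and backward cross-entropies
```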
The benefits of looping latents
It has become fashionable to loop latents in transformers (see here or here). The reason that is typically given is that normal reasoners (DeepSeek R1, OpenAI o1, etc.) decode the output latents and sample a token at every step. They then feed that sampled token back into the model and... [Read More] -
Mixture of Tokenizers — Performance on addition
Mixtures of Tokenizers (MoT) make learning math easier than classical tokenizers do. [Read More] -
COCONUT: Parallel pre-training
In Training Large Language Models to Reason in a Continuous Latent Space, Hao et al. post-train language models to use their own output hidden states as inputs for the next step, using a method called “Chain of Continuous Thought” (COCONUT): [Read More] -
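The feedback idea can be sketched with a toy stand-in model (my own illustration, not the paper’s code): instead of decoding the last hidden state into a token and re-embedding it, the hidden state itself becomes the next input position.

```python
import torch
import torch.nn as nn

class TinyCausalModel(nn.Module):
    """Stand-in for a transformer: maps input embeddings to output hidden states."""

    def __init__(self, d_model: int = 16):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.layer(x, src_mask=mask)

model = TinyCausalModel()
embeds = torch.randn(1, 4, 16)  # an already-embedded prompt
for _ in range(3):              # three "continuous thoughts"
    hidden = model(embeds)
    # feed the last output hidden state back in as the next input, without decoding a token
    embeds = torch.cat([embeds, hidden[:, -1:, :]], dim=1)
```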
On the Byte Latent Transformer
My understanding of the Byte Latent Transformer paper. [Read More] -
Embeddings are in the middle of the model
This is probably pretty obvious to many people, but it was unintuitive to me. I’m quickly writing it down so I don’t forget it, and maybe to help somebody. [Read More] -
Merge tokens in autoregressive generation
Models can produce tokens that could be further merged into a single token. For example: “this is a prompt” would be tokenized as “this”, “is”, “a”, “prompt”, but a model might produce it as “t”, “h”, “i”, “s”, “ ”, “i”, “s”, “ ”, “a”, “ ”, “p”, “r”, “o”,... [Read More] -
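To see the mergeability concretely, assuming tiktoken’s GPT-2 encoding (my choice, not necessarily what the article uses): re-encoding the decoded text collapses character-level tokens back into merged ones.

```python
import tiktoken  # assumption: tiktoken's GPT-2 BPE, used only for illustration

enc = tiktoken.get_encoding("gpt2")
text = "this is a prompt"

char_level = [tok for c in text for tok in enc.encode(c)]  # as if the model emitted one character at a time
merged = enc.encode(text)                                  # re-encoding the decoded string merges them
print(len(char_level), "->", len(merged))                  # far fewer tokens after merging
```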
LLM Test Time Compute Scaling *is* model scaling
Test-time scaling is equal to model scaling in two ways: [Read More] -
Tokenization and batch-norm: incorporating global statistics
Status: un-researched speculation [Read More] -
Multi-resolution VLMs for robotics
Status: Idea, no literature research done [Read More] -
Question: Does PEFT with SVD and full parameter finetuning work?
Status: genuinely a question; I haven’t looked into it yet [Read More] -
Doing Pre-training Research on Instruction Models
I want to do research on pre-training methods, which is of course best done on base models. However, there is a conflict: evals are often better done on instruction models. Here is a simple way to resolve this conflict. [Read More] -
Mixture of Tokenizers (proposal)
Tokenization causes many issues in LLMs (“how many Rs are in strawberry?”, “which is larger, 9.11 or 9.9?”). [Read More]