Language Models Are Starting to Look Less Like Token Machines

Published 2026-05-12 · Updated 2026-05-12 · v1 · #ai #research #agents #language-models #briefing #ai-agents #diffusion #memory-systems

Two of the most-liked AI papers on alphaXiv this week point in the same direction from opposite ends of the stack: foundation models are being pushed away from one-shot token prediction and toward systems with separable internal structure. One paper attacks the generation process itself; the other attacks the way agents accumulate procedural memory. The interesting bit is not “new model beats benchmark.” It is that both papers treat today’s dominant interface (next token in, next token out) as too flat.

1. Latent diffusion tries to break the tyranny of left-to-right text

Sources: alphaXiv, arXiv, PDF

What changed: The paper “Continuous Latent Diffusion Language Model” introduces Cola DLM, a hierarchical latent diffusion language model that separates global semantic organization from local textual realization. Instead of generating text strictly token by token, it first maps text into continuous latent variables, models a semantic prior over those latents with a block-causal Diffusion Transformer, then decodes the latent plan back into text. The paper reports matched comparisons against ~2B-parameter autoregressive and LLaDA baselines and scaling curves up to roughly 2000 EFLOPs.
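
To make “block-causal” concrete, here is a minimal sketch of one common reading of the term: latent positions are grouped into fixed-size blocks, each position can attend to every position in its own block and in earlier blocks, and never to later blocks. The function names, the fixed block size, and the list-of-lists mask are illustrative assumptions, not details taken from the Cola DLM paper, whose exact masking scheme may differ.

```python
def block_causal_mask(num_latents: int, block_size: int) -> list[list[bool]]:
    """Build an attention mask over latent positions.

    mask[i][j] is True when position i may attend to position j.
    Assumption (not from the paper): blocks are contiguous and equal-sized,
    attention is bidirectional inside a block and causal across blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_latents)]
        for i in range(num_latents)
    ]


def render(mask: list[list[bool]]) -> str:
    """Render the mask as text: 'x' = attention allowed, '.' = blocked."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    # Six latent positions in blocks of two: a staircase instead of a triangle.
    print(render(block_causal_mask(6, 2)))
```

With six positions and a block size of two, the rendered mask is a staircase rather than the strict lower triangle of token-level causal attention, which is the structural difference the summary above is pointing at.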

Why it matters: This is part of a broader search for alternatives to the autoregressive bottleneck. AR models won because they scale beautifully and are easy to train, not because left-to-right decoding is the only natural shape for intelligence. Cola DLM’s bet is that language generation should look more like planning over compressed semantic space followed by surface realization. If that holds, the long-term implication is bigger than faster decoding: text, image, video, and action may become different decoders over related continuous latent planning machinery.

Contrarian read: The paper is ambitious enough that the burden of proof is high. Latent-variable language models have historically struggled with posterior collapse, fuzzy evaluation, and reconstruction-quality traps. A 99-page paper with many architectural components can be a research program disguised as a model. The right question is not “does Cola beat AR everywhere?” It is whether it creates a reproducible scaling law where semantic-space computation improves faster than token-space computation as models and data grow.

2. Agent memory is becoming a trainable operating system, not a notes folder

Sources: alphaXiv, arXiv, PDF

What changed: The paper “SkillOS: Learning Skill Curation for Self-Evolving Agents” trains a curator that manages an external skill repository for an LLM agent. The executor is frozen; the curator learns when to insert, update, or delete procedural skills based on grouped task streams and delayed downstream outcomes. The paper frames agent improvement not as “retrieve more context,” but as learning policies over reusable procedural artifacts.
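
The insert, update, and delete operations can be pictured as a small interface over a skill store. The sketch below is a hypothetical illustration only: `Skill`, `SkillRepository`, and `apply` are invented names, and the learned curator policy that would choose each operation is not modeled here. It shows the shape of the action space, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable procedural artifact (hypothetical schema, not from the paper)."""
    name: str
    procedure: str
    revisions: int = 0


@dataclass
class SkillRepository:
    """External skill store that a curator edits; the executor only reads it."""
    skills: dict[str, Skill] = field(default_factory=dict)

    def apply(self, op: str, name: str, procedure: str = "") -> bool:
        """Apply one curator operation; return True if the repository changed."""
        if op == "insert" and name not in self.skills:
            self.skills[name] = Skill(name, procedure)
            return True
        if op == "update" and name in self.skills:
            skill = self.skills[name]
            skill.procedure = procedure
            skill.revisions += 1
            return True
        if op == "delete" and name in self.skills:
            del self.skills[name]
            return True
        # Unknown operation, duplicate insert, or missing target: no change.
        return False


if __name__ == "__main__":
    repo = SkillRepository()
    repo.apply("insert", "retry_on_timeout", "Retry the call with backoff.")
    repo.apply("update", "retry_on_timeout", "Retry up to 3 times with backoff.")
    repo.apply("delete", "retry_on_timeout")
    print(len(repo.skills))
```

Treating deletion as a first-class operation, rather than letting entries pile up, is the design choice that separates this framing from append-only memory.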

Why it matters: This is a useful correction to the current memory hype. Most “agent memory” systems are just a vector database with optimistic naming. SkillOS is closer to an operating system for procedural knowledge: compact skills, explicit update operations, deletion, and rewards tied to whether later related tasks improve. That lines up with what real agents need: not infinite recall, but better habits, better recovery routines, and less garbage accumulating in memory.
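
One way to read “rewards tied to whether later related tasks improve” is as a delayed signal: a curation decision is scored by how the frozen executor does on subsequent tasks in the same group, relative to some baseline. The function below is a sketch under that assumption. The name `delayed_curation_reward`, the boolean outcome encoding, and the baseline-subtraction form are all hypothetical and should not be taken as the reward SkillOS actually uses.

```python
def delayed_curation_reward(
    later_outcomes: list[bool], baseline_success_rate: float
) -> float:
    """Score a curation decision by later task outcomes (hypothetical form).

    later_outcomes: success/failure of related tasks attempted after the edit.
    baseline_success_rate: assumed success rate without the edit, in [0, 1].
    Returns the success-rate improvement; negative means the edit hurt.
    """
    if not later_outcomes:
        # No downstream evidence yet, so no credit and no blame.
        return 0.0
    success_rate = sum(later_outcomes) / len(later_outcomes)
    return success_rate - baseline_success_rate


if __name__ == "__main__":
    # Three of four later tasks succeed against a 50% baseline.
    print(delayed_curation_reward([True, True, False, True], 0.5))
```

Under this reading a skill that is saved but never helps earns nothing, and one that degrades later tasks is penalized, which is what gives the curator a reason to update or delete.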

Contrarian read: The hard part may not be learning to save skills; it may be learning not to save them. Long-lived agents fail through memory pollution as much as memory absence. A curator trained on benchmark task streams can look clean because the environment supplies related tasks and measurable rewards. In the wild, task boundaries blur, feedback is delayed or absent, and a bad skill can quietly bias hundreds of later actions. The real milestone is not a higher benchmark score; it is skill repositories that remain compact, inspectable, and corrigible after months of use.

What to watch next

  • For Cola DLM: independent reproductions of the ~2B comparisons, especially compute-normalized quality and latency against strong AR baselines.
  • For SkillOS: whether learned skill curation works outside curated task groups, and whether generated skills remain readable enough for humans to audit.
  • Across both: evidence that “structured internal systems” beat flat token prediction not only on benchmarks, but in reliability, controllability, and long-horizon behavior.
