LatentScore: Text to Real-time Music on CPU
Try the interactive demo at latentscore.com or browse the source on GitHub.
LatentScore is an open-source Python library I built that takes a text prompt and turns it into procedural music in real time. You type something like “heavy rain on a tin roof” and get a layered soundscape out, with bass, pads, melody, rhythm, texture, and accent layers all configured from that description. On CPU. Sub-second.
Here’s some music generated with LatentScore:
Vibrant Social Gala
Stillness of a Frozen Memory
Obsessive Shadow of the Relic
No, it won’t beat Suno. It can’t generate vocals, and it doesn’t have the cultural acuity to nail “Arabian nights” or “Indian classical.” But it handles ambient and electronic music pretty well, it runs on a laptop, and you can steer it while it plays.
I want to walk through the whole stack: how the data gets built, how the models fit in, and what it looks like to actually play it.
Why generate recipes instead of audio
If you’ve ever played an instrument, you know you don’t think about how the music should sound at every second. You have a grasp of what notes to play and when, and the rest follows from that structure.
Generative audio models don’t work that way. They synthesize waveforms sample by sample, which is why they need GPUs, take seconds to minutes, and give you back a blob you can’t really edit.
LatentScore flips this around. Think of it like being a blind master chef: you have the knowledge, but you can’t see the stove. So you write down a detailed recipe and hand it to a sous-chef. Writing the recipe takes seconds. The sous-chef never gets tired.
That’s what LatentScore does. Instead of generating audio, it generates a structured recipe (a JSON config) for a procedural synthesizer. The config covers about 34 fields: tempo, root note, scale mode, brightness, density, six instrument layers, spatial effects, melody parameters, and harmony controls. Everything uses categorical labels. Brightness is "dark" or "bright", not a float.
Each output also includes stuff beyond the music. A thinking field where the model explains its sonic reasoning (“rain on metal = dark, percussive, enclosed, drone bass for the weight of it…”), a short title summarizing the vibe, and three palettes of five weighted hex colors each. The web UI demo uses these palettes to drive background colors and particle effects, so the same config powers both the audio and the visuals.
Here’s a simplified config:
```json
{
  "thinking": "Rain on metal = dark, percussive, enclosed. Drone bass for the weight of it, vinyl crackle for grit...",
  "title": "Downpour on Tin",
  "config": {
    "tempo": "slow",
    "mode": "minor",
    "brightness": "dark",
    "space": "large",
    "bass": "drone",
    "pad": "dark_sustained",
    "rhythm": "soft_four",
    "texture": "vinyl_crackle",
    "echo": "heavy",
    "human": "loose"
  },
  "palettes": [
    {"colors": [
      {"hex": "#1a1a2e", "weight": "xl"},
      {"hex": "#16213e", "weight": "lg"}
    ]}
  ]
}
```

The synthesizer is CPU-based and deterministic: the same config plus the same seed produces identical output every time.
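To make the "everything is a categorical label" idea concrete, here is a minimal stdlib sketch of that kind of validation. The real SDK validates against a Pydantic model; the field names below come from the simplified example, and the allowed value sets are assumptions for illustration.

```python
# Allowed values per field (assumed sets, mirroring the simplified example).
# The real schema is a Pydantic model with ~34 fields.
ALLOWED = {
    "tempo": ["slow", "medium", "fast"],
    "mode": ["minor", "major"],
    "brightness": ["dark", "medium", "bright"],
    "space": ["small", "medium", "large"],
    "echo": ["none", "subtle", "medium", "heavy"],
}

def validate_config(config: dict) -> list[str]:
    """Check only the fields that are present; return a list of errors.

    An empty list means every provided field carries a known categorical label.
    """
    errors = []
    for field, value in config.items():
        if field not in ALLOWED:
            errors.append(f"unknown field: {field}")
        elif value not in ALLOWED[field]:
            errors.append(f"{field}: {value!r} not in {ALLOWED[field]}")
    return errors
```

A config like `{"tempo": "slow", "brightness": "dark"}` passes cleanly, while a made-up label like `"tempo": "blazing"` is rejected before it ever reaches the synth.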
The live generator
The part I’m most pleased with is the live generator. The SDK is always sounding. You write an async Python generator that yields instructions and the system continuously emits audio from the current configuration:
```python
async def set():
    yield "heavy rain on a tin roof"
    await asyncio.sleep(10)

    yield MusicConfigUpdate(
        brightness=Step(-1),
        echo="heavy",
    )
    await asyncio.sleep(10)

    yield "first light through fog"

session = live(set())
session.play(seconds=60)
```

When a new instruction arrives, it resolves in the background while the current config keeps playing. When ready, the SDK crossfades into the new config. Even if the backend takes several seconds, the audience hears uninterrupted sound.
Eight of the schema fields have an order to them, so you can nudge them up or down. Calling Step(-1) on brightness takes you from "medium" to "dark". Calling Step(+2) on echo skips you from "subtle" straight to "heavy". The rest take absolute values only. Either way, parameter updates land instantly since no backend call is involved.
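The stepping behavior is easy to picture as index arithmetic over an ordered list of labels. A minimal sketch, assuming ladder orderings for brightness and echo (the SDK's actual scales may differ):

```python
# Assumed orderings; the real schema defines the canonical ladders.
BRIGHTNESS_LADDER = ["dark", "medium", "bright"]
ECHO_LADDER = ["none", "subtle", "medium", "heavy"]

def step(ladder: list[str], current: str, delta: int) -> str:
    """Move delta positions along an ordered ladder, clamping at the ends."""
    idx = ladder.index(current) + delta
    idx = max(0, min(idx, len(ladder) - 1))
    return ladder[idx]
```

With these ladders, `step(BRIGHTNESS_LADDER, "medium", -1)` gives `"dark"`, and `step(ECHO_LADDER, "subtle", 2)` skips to `"heavy"`, matching the Step examples above. Clamping means stepping off the end of a ladder is a no-op rather than an error.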
The data pipeline
This is where most of the engineering lives. You can’t hand-author thousands of musically coherent configs.
Extracting vibes from text
I started with Common Pile, an openly-licensed text corpus. The reasoning: if the whole system is about converting vibes to music, you need a dataset that captures a wide slice of human experiences. That’s exactly what books contain. Fiction, non-fiction, poetry, technical writing, all of it carries different emotional textures. You want the breadth of vibes people might actually describe.
The extraction step sends each text to an LLM and asks it to pull out vibes at multiple granularity levels. The model returns a five-level descriptor for each vibe it finds, ranging from a rich 2-3 line description at the top down to a single word or two-word label at the bottom. So from one passage about a stormy scene, you might get vibes ranging from “desolate” all the way up to a full paragraph about the weight and texture of the moment. Each level becomes a separate training row, giving the model exposure to inputs at very different levels of specificity.
The extraction also distinguishes character-level vibes (how a character in the text comes across) from scene-level vibes (what the setting feels like), each with the same five-level ladder.
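The extraction output can be pictured as one record per vibe, fanned out into one training row per granularity level. A sketch with illustrative field names (the actual extraction schema isn't shown in this post):

```python
from dataclasses import dataclass

@dataclass
class VibeLevels:
    """One extracted vibe at five granularity levels (illustrative names)."""
    kind: str          # "character" or "scene"
    levels: list[str]  # index 0 = richest (2-3 lines), index 4 = 1-2 words

def to_training_rows(vibe: VibeLevels) -> list[dict]:
    """Fan one vibe out into five training rows, one per granularity level."""
    return [
        {"kind": vibe.kind, "level": i, "text": text}
        for i, text in enumerate(vibe.levels)
    ]
```

Each row pairs a description of a different length with the same underlying vibe, which is how the model ends up comfortable with anything from a single word to a full paragraph as input.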
Noise injection
A fraction of the extracted vibes get character-level noise injected with nlpaug: random character substitutions that slightly corrupt the text. This produces a noisy version alongside the original for each row. The idea is robustness: during training, the model sees both clean and slightly garbled inputs, so it learns to handle imperfect prompts. At least one noisy row is forced per data split to make sure every split exercises this path.
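The pipeline uses nlpaug's character-level augmenter for this; a stdlib stand-in makes the idea explicit (the substitution rate here is an assumption, not the pipeline's actual setting):

```python
import random
import string

def substitute_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly substitute a fraction of letters with other lowercase letters.

    A minimal stdlib stand-in for nlpaug's character substitution augmenter;
    seeded so the same input always produces the same corruption.
    """
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < rate:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)
```

Applied to "heavy rain on a tin roof", this yields a same-length string with a few letters swapped, which is exactly the kind of typo-ridden prompt a real user might type.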
Deduplication
Deduplication happens on the vibe content, not on the source text. Two different books about loneliness should dedupe if their extracted vibes are semantically similar, even though the source material is completely different.
The method: embed every vibe with a sentence transformer, compute pairwise cosine similarity, and greedily remove anything above a high similarity threshold. This knocked out roughly a quarter of the vibes, leaving about ten thousand clean entries.
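The greedy pass can be sketched in a few lines. This assumes embeddings are already computed and uses plain-Python cosine similarity for clarity (the real pipeline uses a sentence transformer; the threshold value is illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def greedy_dedupe(embeddings: list[list[float]], threshold: float = 0.9) -> list[int]:
    """Keep a vibe only if it is below the similarity threshold
    against everything already kept; return the kept indices."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Because the comparison runs on vibe embeddings rather than source text, two near-identical "loneliness" vibes collapse into one entry even when they came from entirely different books.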
Splitting
The split order matters. Evaluation sets (test and validation) get sampled first, randomly, so they represent the true distribution. Then a diversity-sampled set is carved out for reinforcement learning (more on that below), using farthest-point sampling on vibe embeddings to maximize coverage of the embedding space. The supervised fine-tuning set gets whatever’s left. The key idea: you don’t want your evaluation data skewed by optimization-driven selection upstream.
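Farthest-point sampling itself is a short greedy loop: repeatedly pick the point whose nearest already-selected neighbor is farthest away. A minimal sketch over raw coordinate tuples (the pipeline runs this over vibe embeddings):

```python
import math

def farthest_point_sample(points: list[tuple[float, ...]], k: int) -> list[int]:
    """Greedy farthest-point sampling: maximize coverage of the space
    by always adding the point farthest from everything selected so far."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = [0]  # seed with the first point; a random seed also works
    while len(selected) < k:
        best, best_d = None, -1.0
        for i in range(len(points)):
            if i in selected:
                continue
            # A candidate's score is its distance to the nearest selected point.
            d = min(dist(points[i], points[j]) for j in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return selected
```

Given three points at x = 0, 1, and 10, sampling two of them picks the outliers at 0 and 10, which is the coverage-maximizing behavior the RL split relies on.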
Config generation with Best-of-N
For each vibe, Gemini 3 Flash generates multiple candidate configs at moderate temperature. Gemini Flash turned out to be surprisingly good at this. It doesn’t accept audio input, but it generates musically coherent configs that hold up when you actually render and listen to them. I tried other models and Gemini Flash consistently produced better recipes at a fraction of the cost.
Each candidate gets validated for format (parses as JSON), schema correctness (matches the Pydantic model), and palette validity (exactly 3 palettes with 5 weighted colors each).
Then a second pass renders every valid candidate to audio and scores it with LAION-CLAP. CLAP embeds both text and audio into the same vector space, so you can directly measure how well “cute bird” matches a 10-second audio clip of a tweeting bird. Or in our case, how well “heavy rain on a tin roof” matches the 60-second audio clip the synth produced from a candidate config.
The highest-scoring candidate becomes the final entry in the retrieval map. The Best-of-N curve shows diminishing returns: going from one candidate to two gives the biggest quality jump, and each additional candidate contributes less.
One more detail on scoring: we penalize configs for producing discordant-sounding audio, but only if the discordance exceeds what the text itself implies. If someone asks for “industrial noise” or “heavy metal chaos,” the output should sound harsh. The penalty only kicks in when the audio is more discordant than the prompt warrants.
Building the retrieval map
The final step embeds every vibe with the same sentence transformer used for deduplication and exports the full map. At runtime, a new text prompt gets embedded, nearest-neighbor lookup returns the closest vibe, and the associated config is used. Sub-second on CPU, no network, no inference.
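The runtime lookup is just cosine nearest-neighbor over the map. A sketch that takes a precomputed query embedding (in the real system, the query is embedded with the same sentence transformer used at build time):

```python
import math

def nearest_config(query_emb: list[float],
                   vibe_embs: list[list[float]],
                   configs: list[dict]) -> dict:
    """Return the config attached to the nearest vibe by cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    best = max(range(len(vibe_embs)), key=lambda i: cos(query_emb, vibe_embs[i]))
    return configs[best]
```

A brute-force scan like this is already fast at ten thousand entries; an approximate nearest-neighbor index would only matter at much larger map sizes.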
This is the default “fast” backend. The LLM’s musical knowledge is baked into the curated data and frozen into a retrieval map that works forever.
The model training
I fine-tuned Gemma 3 270M using supervised fine-tuning with LoRA. The trained model can generate configs from text directly on-device.
The thinking field in each training example matters here. By forcing the teacher model (Gemini Flash) to generate reasoning for every config choice, that chain-of-thought transfers to the smaller model during fine-tuning. It lets a 270M parameter model punch above its weight, because it learns to “think through” the vibe decomposition before committing to parameter values.
One thing I learned: small models tend to ramble in the freeform text fields. A repetition penalty during generation helps keep the output tight and prevents the model from looping on phrases.
I also set up a diversity-sampled split specifically for GRPO (Group Relative Policy Optimization) training, using farthest-point sampling to maximize the distance between training examples in embedding space. The GRPO training itself hasn’t run yet, but the data split is ready for it. The diversity sampling matters here because reinforcement learning benefits from varied prompts more than supervised learning does: diverse inputs produce diverse reward signals, which means more learning per gradient update.
The fine-tuned model is one of three backends the SDK supports. The other two are the embedding retrieval (fast, default) and external LLMs via API (anything LiteLLM supports). All three produce the same JSON schema. The synth doesn’t care which backend generated the config.
The benchmark
I benchmarked the backends head-to-head on a held-out test set, measuring text-audio alignment via CLAP, schema validity, and latency.
The embedding lookup scored the highest on text-audio alignment, with perfect schema validity and by far the lowest latency. The frontier LLMs were competitive on alignment but slower by an order of magnitude, with occasional schema validation failures. The fine-tuned Gemma model barely outperformed a random valid config baseline, suggesting mode collapse at that scale.
A caveat: CLAP was used both for Best-of-N selection during dataset construction and for evaluation, which likely inflates the retrieval backend’s relative score. This is a known limitation. Ideally, future benchmarking would use LLM-as-a-judge evaluation and dedicated audio quality models for a more independent signal.
I think the result still says something real about where intelligence needs to live in a system like this. For LatentScore, the intelligence seems better placed in the dataset (carefully curated, quality-scored vibe-to-config pairs built offline) than in a model doing inference at runtime. The LLM’s job is finished once the retrieval map is built.
The audiovisual UI
The web UI wraps all of this into a playable surface. You type a prompt, the config resolves, and the UI displays the generated title, the color palettes, and a full parameter editor. Background particle effects respond to playback state. The palette colors shift as the music shifts. Because the config contains both music parameters and visual metadata, everything stays in sync without any separate analysis step.
Here’s a video demo of the instrument in action.
You can also steer an existing config manually: change any parameter, hit apply, and the synth re-renders with the update.
Try it
```shell
pip install latentscore
```

Try the interactive demo at latentscore.com, or browse everything on GitHub. SDK, dataset, fine-tuned model, web UI. MIT licensed.