Two seconds is not a long time for an adult. For a five-year-old, two seconds is the difference between a learning moment and a dropped thread. In two seconds Remi has already turned his head, noticed the dog, asked about the dog, and forgotten he was sounding out "branch." A loading screen, even a polite one with a soft animation and a friendly tip, is a working-memory dump dressed up as a feature.
This is why Lumi has no loading screen. Not because we hate spinners. Because removing them is the only way the rest of the product works.
What a spinner actually costs a five-year-old
A child holding a phoneme in their head is running an unrehearsed task. The sound decays unless something acts on it. When a spinner appears, three things happen at once. The child's attention gets pulled to the animation, which is doing exactly what it was designed to do — capture the eye. The phoneme leaves working memory because nothing is rehearsing it. And the emotional signal that says "I am in a conversation" flips to "I am waiting on a machine." All three are recoverable in an adult. None of them recover cleanly in a four-to-seven-year-old.
We wrote about the cognitive math behind this in Why a ten-second delay kills your child's learning. The short version: by ten seconds the chunk is gone, by fifteen the praise lands on an empty seat. A two-second spinner is not "fine because it is short." It is a measurable percentage of the only window the child has to learn the thing.
So we set a rule early. The kid never sees a spinner. Not on first paint, not between turns, not when the model is thinking, not when the audio is rendering. Everything in the Lumi architecture is downstream of that one rule.
The layers of latency in a typical Artificial Intelligence tutor
To remove waiting you have to know what is producing it. A voice-first Artificial Intelligence (AI) tutor has at least five layers stacked on top of each other, and each one adds milliseconds the child pays for.
Network round trip. Phone or tablet to the nearest edge, edge to the model host, back again. On a good home Wi-Fi connection this is 40–120 milliseconds (ms) round trip. On a slower connection or a busy household network, it spikes higher.
Model inference. The reasoning model reads the child's last utterance, the session context, and the system prompt, then generates the next response. A cold call to a large model can take 800–1500 ms before the first token even appears. Generating a full sentence afterwards adds more.
Text-to-Speech (TTS) generation. The model's text reply has to become a human-sounding voice. Traditional TTS pipelines wait for the full text, then render the full audio file, then return it. That batch step alone can add 500–1200 ms.
Audio buffer and decode. The audio arrives at the client, gets decoded, and queues up to play. On older devices the decode step is not free.
Frontend render. The visual side of the turn — the new prompt, the character animation, the next interactive element — has to paint without blocking the audio.
Stack those naively and you get four to six seconds of latency between a child finishing their sentence and Lumi starting to speak. That is the unwatched baseline most AI learning products are shipping with. It is not a spinner problem. It is a "the child has already left" problem.
What we did at each layer
Every layer above is a place we made a specific bet. None of them is exotic. The work is in stacking them so the kid never feels any of them.
Streaming Text-to-Speech (TTS)
We use ElevenLabs streaming TTS so that audio begins playing before the model has finished generating the sentence. The model streams tokens out, those tokens go into the TTS engine the moment they arrive, and the TTS engine streams audio frames back the moment it has them. The first audible word from Lumi reaches the child's ears while the rest of the sentence is still being written. Functionally this turns a 1.5-second batch render into something closer to a 250–400 ms first-audio latency.
The cost: streaming TTS gives us less room to post-process the audio (no after-the-fact prosody fixes, no full-utterance smoothing). We accepted it. A slightly less polished prosody curve is invisible to a five-year-old. A spinner is not.
Prefetched stimuli
While the child is still completing a task, the next task is already being assembled in the background. The next prompt, the next image, the next set of word cards — all loaded into the client before the current turn ends. When the child finishes saying "frog," the next moment is already in memory. The transition is a state change, not a fetch.
This sounds obvious. It is the single biggest reason the app feels alive. Most learning apps load each screen on demand because their content is dynamic and their backends are not built for prediction. We pay a higher backend cost — we generate more than we end up showing — to make the front of the experience seamless.
Audio cache for common phrases
Lumi says some things hundreds of times a day across the user base. "Try that one again." "I heard you say 'cat.' Is that right?" "Nice — let's try a harder one." We pre-generated the 200 most-common Lumi phrases with ElevenLabs and cached the audio at the edge. When Lumi needs to say one of them, there is no TTS step at all. The audio is already there. End-to-end latency for a cached phrase is under 150 ms from the child's last word to Lumi's first sound.
The tradeoff is voice consistency. We had to pick a small enough phrase set that we could re-record if the voice model drifts, and we had to design the dialogue so that "stock" phrases feel like part of a real conversation, not a phone tree. The way we did that was to make them short, warm, and never used as the entire response. They are the opener; the personalised continuation streams in behind them.
Prompt caching with Anthropic's Claude
Lumi is built on Anthropic's Claude. Every Lumi session carries a large amount of context: the child's profile, recent session history, the active curriculum, the safety rules, the persona, the voice instructions. Without caching, that context gets re-tokenised and re-read on every turn, adding hundreds of milliseconds to every model call.
With Anthropic's prompt caching, the static portion of that context lives warm on the model side. On every turn we only pay for the new tokens — the child's latest utterance and a small slice of session memory. The result is a sub-100 ms model warm-up on cached turns and a noticeably tighter time-to-first-token across the session. It also reduces our inference cost meaningfully, which matters because the next layer up — keeping a streaming connection open per active child — is not cheap.
The latency budget
Here is the real budget, end to end, from the moment the child finishes speaking to the moment Lumi's first audible word reaches their ears. The first column is the naive baseline most AI tutors ship with. The second column is what Lumi targets in production today.
| Layer | Naive baseline | Lumi target |
|---|---|---|
| Network round trip (home Wi-Fi) | 80 ms | 80 ms |
| Speech-to-text (child utterance) | 400 ms | 250 ms |
| Model time-to-first-token (cold) | 1200 ms | 90 ms (prompt cache hit) |
| Text-to-Speech first audio frame | 900 ms | 280 ms (streaming) |
| Audio decode and playback start | 120 ms | 80 ms |
| Frontend render of next-turn visual | 250 ms | 0 ms (prefetched) |
| Total perceived latency | ~2,950 ms | ~780 ms |
A spinner appears around the 800 ms mark on most apps. Lumi finishes the first audible word before then. That is the entire trick. There is nothing to load because there is nothing to wait for.
The tradeoffs we accepted
Removing the loading screen is not free, and pretending otherwise would be dishonest.
Higher infrastructure cost. Prefetching means generating turns the child will never see. Edge caching of audio means storing and serving phrase libraries we will eventually re-record. Prompt caching reduces per-call cost but rewards keeping context warm, which means longer-lived sessions. The unit economics of a "no-spinner" architecture are worse than a batch-render one. We chose to absorb it.
More cache complexity. Every cached phrase, every prefetched stimulus, every warm model context is a piece of state that can drift. We invest engineering time in invalidation rules, voice-model version pinning, and prefetch-cancellation when the child takes the session in an unexpected direction. None of that work is visible to a parent. All of it is visible the second it breaks.
Narrower model output variance. Streaming TTS and cached openers constrain how much the model can surprise us in the first 300 ms of a turn. We accepted a slightly more predictable opening so the rest of the response could feel instantaneous. The conversational range comes in the continuation, not the first beat.
What we will not trade. The kid never sees a spinner. We will pay more, ship slower, and write more cache code before we put a loading screen between a five-year-old and the answer they are waiting for. This is the same problem my four-year-old taught me to solve in the first place, seen from the engineering side instead of the parenting side.
Why this is hard to copy
Any team can ship a faster spinner. Removing the spinner is a thousand small bets across the stack, made early, with the budget to keep them aligned as the product grows. It shows up in how we choose voice models, how we shape dialogue, how we structure context, how we plan curriculum loads, how we monitor Web Vitals per turn, how we alert on drift. None of those choices is the moat by itself. The moat is that they all point the same direction.
The loading screen is the easiest part of an app to ship. It is also the most expensive thing to remove. We removed it anyway, because the child on the other side of the screen does not have two seconds to spare.
Try the Lumi beta — no spinner included.
Image brief
- Hero image: A five-year-old mid-sentence at a tablet, mouth open, eyes still locked on the screen, a soft microphone-glow indicator visible — no spinner, no loading animation, warm afternoon light.
- Inline image 1: A simple stacked-bar diagram comparing the naive latency baseline (~3 seconds, mostly model + TTS) to Lumi's target (~780 ms, evenly distributed). Place after the "layers of latency" section.
- Inline image 2: A schematic of the four mitigations — streaming TTS, prefetched stimuli, audio cache, prompt caching — arrayed around a central "child turn" loop, with arrows showing where each one shaves milliseconds. Place inside the "What we did at each layer" section.
Internal link suggestions
- Anchor: "Why a ten-second delay kills your child's learning" →
/blog/why-a-ten-second-delay-kills-your-childs-learning(used inline in the cognitive-cost section) - Anchor: "My four-year-old taught me to solve" →
/blog/how-my-four-year-old-taught-me-to-build-an-ai-tutor(used inline in the tradeoffs section) - Optional add: "Voice-first learning: why we built around speech, not taps" →
/blog/voice-first-learning-why-we-built-around-speechfor readers who want the interface-level argument after the latency-level one.
Editor's note
The latency budget table uses targets that match our current production envelope, but the per-layer numbers should be sanity-checked against the latest ElevenLabs streaming and Anthropic prompt-caching measurements before this goes live — particularly the 90 ms cached time-to-first-token, which is best-case and worth softening to a range if Tim wants to be conservative. The "200 most-common phrases" figure for the audio cache is from the current Lumi build; confirm the exact count and whether to disclose it publicly. The "Try the Lumi beta" Call to Action assumes the public beta URL is live at publish date.
Lumi is in open beta and free for the first 100 families. If reading time at your house ever feels harder than it should, we built this for you.