Last spring I sat at a kid-sized table in the children's room of our local library, waiting for Remi to finish a Lego book. At the next table, a kindergartener I'll call D was working through a YouTube video on his mom's phone. The video was an adult voice reading the alphabet over cartoon animals. Each letter sat on screen for about two seconds. D was repeating the sounds carefully, mouthing each one twice. His mother, who I later learned spoke Tigrinya at home and was learning English alongside him, watched closely and corrected his "th." She apologized to me when I said hello. She said, "Sorry, this is how we are practicing." There was nothing to apologize for. She was doing, with what she could afford, the work that a school system, an app industry, and an entire country had failed to do for her family.
I thought about D for a long time after that. He is the kid most reading products in the United States quietly ignore. He has a curious mind, a careful tongue, a mother who is fighting for him, and almost no software designed for the specific shape of his learning. This piece is about what voice-first Artificial Intelligence (AI) can offer kids like D, what fixed phonics apps get wrong about them, and the parts of the problem we at Lumikids have not solved yet.
What "English Language Learner" actually means in this context
An English Language Learner (ELL) — sometimes called an English Learner (EL) or a Multilingual Learner (ML), depending on the district — is a student whose home language is something other than English and who is still building English proficiency. In United States public schools, ELLs are roughly one in ten students, and in many urban districts they are closer to one in three. The category is wide. It includes a five-year-old whose family arrived last month, a second-generation kid who speaks Spanish at home and English at school, and a child whose parents are bilingual and want to keep their first language strong. The needs are not identical, but they share a few features that most reading apps ignore.
The first is vocabulary. A native English-speaking five-year-old walks into kindergarten knowing roughly five to ten thousand words. An ELL kindergartener may know a few hundred, even if their first-language vocabulary is age-appropriate. Phonics — the part of reading where you connect letters to sounds — is necessary for both kids. But phonics assumes you know what the word means once you sound it out. D can decode "frog." That doesn't mean he knows what a frog is.
The second is pronunciation. Every language carves up the sound space differently. Tigrinya doesn't have the English "th." Spanish doesn't distinguish "sh" and "ch" the way English does. Japanese doesn't have the "r/l" contrast English speakers expect. None of this is a deficit. It's a starting point. But a reading app that "checks" a child's pronunciation against General American English and dings them when they miss is teaching shame, not reading.
The third is comprehensible input — the idea, from linguist Stephen Krashen, that language is acquired when learners understand messages slightly above their current level. A worksheet of decodable words is not comprehensible input. A patient adult reading a picture book and pointing to the pictures is. Most apps deliver the worksheet.
What fixed phonics apps miss
The dominant model in early reading software is a curriculum tree: a fixed sequence of skills the child climbs in order, with branching for difficulty but not for meaning. These apps work reasonably well for the average native English speaker because they were designed around that learner. For an ELL kindergartener they fail in three specific ways.
They assume vocabulary. The decodable text "the cat sat on the mat" is a perfectly good phonics exercise but a useless meaning exercise if the child doesn't yet know "sat" or "mat" as English words.
They penalize accent. Speech recognition built on adult General American English will flag a Spanish-influenced "ess-cool" for "school" as a failed reading attempt. The child reads correctly and the app says no.
They cannot answer the one question the child most needs to ask, "what does this word mean?", especially in the child's first language. A fixed tree has no room for that question. It has room for only two: "did the child get the answer right?" and "did the child get the answer wrong?"
Why voice-first matters specifically for ELL
I have written elsewhere about why we built Lumi around speech, not taps. The short version: kids learn language through their mouths and ears first, screens second. For ELL kids that argument is even stronger.
Lumi uses Wispr Flow for speech input. Wispr Flow was trained on a wide range of real human speech, including child speech and accented speech, and it does not treat a non-General-American pronunciation as a failure to read. If a child reads "frog" with a softened final consonant, Lumi accepts it and moves on. If the child reads "frog" as "fog," Lumi notices, mirrors the sound back, and offers a gentle retry — not a buzzer.
For output, Lumi uses ElevenLabs for sub-second voice synthesis. The tutor voice is consistent, clear, and slow enough to be a useful model without sounding like a metronome. ELL parents in our beta have told us this matters because their kid is hearing the same teacher voice every session, instead of a different cartoon character every screen.
The reasoning layer is built on Anthropic's Claude. Claude is what decides, in the moment, whether D needs the word "frog" defined, a picture shown, or a sentence using "frog" before he tries to read it. That decision is not on a fixed tree. It depends on what he just said, how long he paused, and what he has been doing for the last five minutes. We wrote a whole separate piece on why adaptive learning has to be the whole product, and ELL is the case where that argument lands hardest.
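For readers who think in code, here is a minimal Python sketch of the kind of in-the-moment choice described above. It is not Lumi's implementation: in the product the decision is made by Claude, not by hand-written rules. Every name and threshold below (`Turn`, `Support`, `choose_support`, the three-second pause, the two-miss cutoff) is an assumption invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    NONE = "let the child read"
    DEFINE = "define the word"
    PICTURE = "show a picture"
    SENTENCE = "use the word in a sentence"


@dataclass
class Turn:
    """Hypothetical summary of one moment in a session."""
    asked_meaning: bool    # what the child just said: did they ask what the word means?
    pause_seconds: float   # how long the child paused before trying
    recent_misses: int     # meaning-related misses in the last five minutes


def choose_support(turn: Turn) -> Support:
    """Pick a scaffold from the three signals; thresholds are illustrative only."""
    if turn.asked_meaning:
        return Support.DEFINE
    if turn.recent_misses >= 2:
        return Support.PICTURE
    if turn.pause_seconds > 3.0:
        return Support.SENTENCE
    return Support.NONE


# A long pause with no other signal leads to a sentence using the word.
print(choose_support(Turn(asked_meaning=False, pause_seconds=4.5, recent_misses=0)))
```

The contrast with a fixed tree is in the inputs: the function looks at what the child is doing right now, not at where the child sits in a skill sequence.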
The tradeoff we did not expect: ELL needs more visuals
When we started, we leaned hard on voice and used text and images sparingly. The thinking was that voice keeps a kid's eyes off the screen and their attention on the conversation. We still believe that for most learners.
ELL learners taught us we were wrong about the ratio. A child decoding "frog" with no internal picture of a frog is not learning to read; they are learning to make noises. The fix is not more text. It is more pictures, simple animations, and spoken cues from the tutor that point at what is on screen ("Look at the picture. See the green legs?"). We have added images to a larger share of reading tasks in the last six months specifically because of feedback from ELL families. Native English speakers were not hurt by the change. ELL kids made faster progress.
If you want the research backing on visual scaffolding and language development, Colorín Colorado is the best public resource I know of — they curate research summaries for teachers and families of ELL students, and most of what they recommend matches what we have seen in beta sessions. Colorín Colorado is a project of the public broadcaster WETA, in partnership with the American Federation of Teachers, and it is genuinely useful. The WIDA framework at the University of Wisconsin–Madison is the other anchor I would point parents toward — their English Language Development Standards are widely used in United States schools to describe what ELL kids can do at each stage. WIDA's standards framework is dense but worth a skim if you want to understand what your child's school is using.
What Lumi does not yet do well for ELL — honest list
I am not going to pretend we have solved this. Here is what is still on the roadmap.
Native-language fallback explanations. Right now if D asks "what does frog mean," Lumi will explain in simple English with a picture. We do not yet offer the option to explain in Tigrinya, or Spanish, or Mandarin. We will. Claude can do this competently in many languages; the question is how to expose the option to parents without making it a setting parents have to find. We are working on a "home language" field in the family profile that, when set, lets Lumi offer a short first-language clarification when a child seems stuck on meaning.
Multilingual heart words. "Heart words" are the irregular, high-frequency English words a child has to learn by sight — the, was, said, of. We have a heart words system in English. We do not yet have first-language anchors that connect, for example, Spanish "el" to English "the." This is a small but real gap.
Dialect coverage beyond General American. Our tutor voice is currently a single American English speaker. For a child whose family speaks a regional dialect or a variety of English from outside the United States at home, such as Indian English, Caribbean English, West African English, or British English, we are still a step behind. Multiple tutor voices are coming.
Sibling pacing for mixed-language families. Many ELL households have one older sibling who is fluent and one younger sibling who is not. Right now each child has their own profile and they are treated as separate learners. We want to do more with the family unit — for example, letting an older sibling record themselves reading a passage that the younger sibling then hears in their own family's voice. That is later in the year, not now.
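As a rough illustration of the "home language" field from the first roadmap item above, the Python sketch below shows one way an optional profile setting could gate a first-language clarification. The feature is not shipped, so all of this is hypothetical: `FamilyProfile`, `home_language`, and `clarification_plan` are names made up for this example, the language tags are ordinary ISO 639-1 codes, and the choice to always lead with simple English is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FamilyProfile:
    """Hypothetical family profile; home_language is optional by design."""
    family_name: str
    home_language: Optional[str] = None  # e.g. "ti" for Tigrinya, "es" for Spanish


def clarification_plan(profile: FamilyProfile, stuck_on_meaning: bool) -> list[str]:
    """Return the languages to explain a word in, in order.

    Assumption: simple English comes first, and the home language is added
    only when it is set and the child seems stuck on meaning.
    """
    plan = ["en"]
    if stuck_on_meaning and profile.home_language:
        plan.append(profile.home_language)
    return plan


# A family that set Tigrinya gets a short first-language follow-up when stuck.
print(clarification_plan(FamilyProfile("D's family", "ti"), stuck_on_meaning=True))
```

Leaving the field unset changes nothing for the child, which is one way to keep it from becoming a setting parents have to find.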
A closing thought
I want to be plain about why this audience matters. The schools with the highest ELL populations in the United States are often the most under-resourced. The kids who would benefit most from a patient, fast-responding, image-supported reading tutor are the kids whose districts can least afford the seat-based pricing of legacy reading platforms. We have made an explicit choice that the Lumi pricing for individual families will stay accessible, and we are starting conversations with public libraries and community programs about institutional access. Families like D's should not have to use a YouTube alphabet video as their only practice tool.
We are not done. We will name what we are missing as we go, and we will fix what we can.
If you have a kid who is learning English alongside another language at home, join the Lumi beta and tell us what is missing for your family.
Image brief
- Hero image: A kindergartener at a small library table next to a tablet, a parent leaning in beside them, a picture book of a frog open between them, warm afternoon light through a tall window.
- Inline image 1: A simple side-by-side diagram comparing a fixed phonics tree (a rigid branching ladder) and a conversational adaptive flow (a looping path with branches for "define word," "show picture," "retry"). Place after the "What fixed phonics apps miss" section.
- Inline image 2: A close-up of a child's hand pointing to a picture of a frog on a tablet, while a soundwave glyph hovers next to the word "frog" on screen — illustrates visual + audio scaffolding. Place inside "The tradeoff we did not expect" section.
Internal link suggestions
- "Voice-first learning: why we built around speech, not taps" — anchor: "why we built Lumi around speech, not taps"
- "Adaptive learning isn't a setting — it's the whole product" — anchor: "why adaptive learning has to be the whole product"
- "The Science of Reading, explained without jargon" — anchor: optional third link from a future edit if the editor wants a primer pointer in the "vocabulary" section
Editor's note
A few things to check before publishing. (1) The library scene with D and his mother is a composite. Tim, please confirm whether you want it framed as a real moment or labeled "a scene I see often" to avoid any identifying detail. (2) The Tigrinya "th" claim is correct as a general phonological observation but worth a quick gut check with a linguist friend if we have one on call. (3) Confirm the ELL roadmap items (native-language fallback, multilingual heart words, dialect voices, sibling pacing) are still the right four to name publicly. These become commitments the moment we publish.
Lumi is in open beta and free for the first 100 families. If reading time at your house ever feels harder than it should, we built this for you.