How I evaluate AI tutors as a builder, not just a parent

Last Tuesday I signed up for a free trial of a well-funded AI reading tutor that a friend's school district is piloting. In ten minutes I found three problems that nobody on the marketing page mentions.

First problem: the model invented a sight word that doesn't exist. It told my test profile (set to age six) that "phop" was a word and asked the child to use it in a sentence. Second problem: the median time from the kid finishing a sentence to the tutor speaking back was 4.2 seconds on my home fiber connection, which means it will likely be worse on a typical family's Wi-Fi. Third problem: the parent dashboard showed a green checkmark for "comprehension" after a session in which the model had answered its own question twice because the audio pipeline missed the child's reply.

None of this is on the product's homepage. None of it would have shown up in a five-minute demo. All of it would have shown up in their own production logs, which is the point of this article.

I wrote a parent-facing version of this evaluation earlier — a parent's framework for evaluating any AI tutor — and I stand by it. But there's a layer of stuff only a builder catches, and a few parents who follow Lumikids have asked me to write it down. Here it is.

The model-choice rabbit hole

If you've never priced a frontier model, the numbers are jarring. As of late 2026, the going rates for hosted reasoning models for a product like a kids' tutor sit roughly here: a top-tier model from Anthropic or OpenAI runs in the low-double-digit dollars per million output tokens, a mid-tier model is single digits, and a small open-source model self-hosted on a graphics processing unit (GPU) can land under a dollar per million if you don't count engineering time. Anthropic publishes its pricing openly at anthropic.com/pricing [VERIFY exact tier], which is one of the reasons we picked them — you can do the math without a sales call.
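For anyone who wants to do that math, here is a minimal sketch in Python. Every number in it is a placeholder I made up to match the rough tiers above (low double digits, single digits, under a dollar); none is a quoted price from any provider, and the tokens-per-session figure is an assumption too. It also ignores input tokens and engineering time, so treat the result as a floor rather than a budget.

```python
def cost_per_session_usd(output_tokens: int, usd_per_million_output_tokens: float) -> float:
    """Output-token cost of one tutoring session, in dollars."""
    return output_tokens / 1_000_000 * usd_per_million_output_tokens


# Hypothetical rates per million output tokens, for illustration only.
ILLUSTRATIVE_RATES_USD = {
    "top-tier hosted": 15.0,
    "mid-tier hosted": 5.0,
    "small self-hosted": 0.8,
}

# Assumed output volume for one session; measure your own before trusting it.
ASSUMED_OUTPUT_TOKENS_PER_SESSION = 3_000

for tier, rate in ILLUSTRATIVE_RATES_USD.items():
    cost = cost_per_session_usd(ASSUMED_OUTPUT_TOKENS_PER_SESSION, rate)
    print(f"{tier}: ${cost:.4f} per session")
```

Swap in the published rate and your own measured token counts to get a real figure.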

For a children's tutor, the price per million tokens is the boring part. Three other things matter more:

Output variance. Cheaper and smaller models drift more between sessions. The same prompt produces a charming exchange on Monday and a confused one on Wednesday. For an adult productivity tool that's annoying. For a four-year-old who is still building a mental model of how reading works, it's destabilizing.
Refusal behavior with kids' prompts. Children say weird, scatological, emotionally raw things. A model trained for enterprise compliance will refuse half of them in ways that feel cold. A model with no safety training will engage with all of them in ways you don't want. The middle is narrow, and you can only find it by running thousands of real sessions.
Hallucination shape. Some models hallucinate confidently, some hedge. For a tutor, hedging is fine; confident invention — like the "phop" example above — is dangerous, because a six-year-old does not yet have the skepticism to push back.

Lumikids is built on Anthropic's Claude for the reasoning layer because, in our internal evaluations across the three axes above, it had the lowest variance and the most conservative hallucination shape for child-directed prompts. That's not an endorsement copy line; it's the result of running side-by-sides for several months and counting failure modes. If a competitor's website doesn't name their model family, assume they're either embarrassed by the choice or switching providers based on price.

Latency math, the way builders actually do it

Marketers talk about latency as "fast" or "instant." Builders talk about it as a sum of components, each with a number. Here is the rough budget I keep in my head for a voice-first turn:

Speech-to-text: 150–400 milliseconds (ms) depending on the provider and whether streaming partials are used.
Reasoning model first-token latency: 300–900 ms for a frontier model, longer for the smaller ones at peak load.
Reasoning model full response: another 200–800 ms once the first token lands, depending on response length.
Text-to-speech first audio chunk: 200–500 ms with a low-latency streaming voice provider like ElevenLabs, much longer with batch synthesis.
Network and audio buffering on the device: 100–300 ms, more on weak Wi-Fi.

Add it up honestly, and "instant" is impossible. Sub-second is hard. Two seconds is achievable. Five seconds is what most products actually ship, because every layer added by every product manager — a translation pass, a moderation pass, a logging pass — adds 200 ms that nobody removes later.
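The budget above can be summed in a few lines of Python. This is a rough sketch: it assumes the five components run strictly one after another with no overlap, and it treats each extra pass as a flat 200 ms, which is the rule of thumb from the paragraph above rather than a measurement. The dictionary keys are names I chose for illustration.

```python
# Component ranges in milliseconds, copied from the budget above.
BUDGET_MS = {
    "speech_to_text": (150, 400),
    "model_first_token": (300, 900),
    "model_full_response": (200, 800),
    "tts_first_audio": (200, 500),
    "network_and_buffering": (100, 300),
}


def total_range_ms(budget, extra_passes=0, ms_per_pass=200):
    """Return (best_case, worst_case) for one voice turn, in milliseconds.

    Assumes components are sequential; extra_passes models add-ons such as
    translation, moderation, or logging at a flat ms_per_pass each.
    """
    overhead = extra_passes * ms_per_pass
    best = sum(low for low, _ in budget.values()) + overhead
    worst = sum(high for _, high in budget.values()) + overhead
    return best, worst


print(total_range_ms(BUDGET_MS))                  # (950, 2900)
print(total_range_ms(BUDGET_MS, extra_passes=3))  # (1550, 3500)
```

Even the best case lands just under a second, and three add-on passes push the worst case to three and a half seconds before anything goes wrong.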

I count milliseconds because the research on working memory in young children, which I covered in why a ten-second delay kills your child's learning, does not get less true when a product is "in beta." If a competitor won't give you a number measured under real home conditions, assume the number is bad. Companies that are proud of their latency publish it.
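Getting a number under real home conditions does not take special tooling. One way to do it, sketched here in Python, is to time a handful of turns with a phone stopwatch (from the moment the child stops speaking to the moment the tutor starts) and summarize them. The function name and the sample timings are made up for illustration.

```python
import statistics


def summarize_latency(samples_s):
    """Summarize hand-timed turn latencies, given in seconds."""
    if not samples_s:
        raise ValueError("need at least one timed turn")
    return {
        "turns": len(samples_s),
        "median_s": statistics.median(samples_s),
        "worst_s": max(samples_s),
    }


# Hypothetical stopwatch readings from one evening session on home Wi-Fi.
timed_turns = [3.1, 4.4, 3.8, 6.0, 4.1]
print(summarize_latency(timed_turns))
```

I report the median rather than the average because one frozen turn should not hide how the typical turn feels, and I keep the worst case alongside it because that is the turn the child remembers.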

The telemetry honesty test

This is the test that separates serious products from polished ones.

Most modern applications are wired up to error-tracking software like Sentry (sentry.io) and product-analytics software like PostHog (posthog.com), or their equivalents. These tools record what really happens in production: every crash, every dropped audio frame, every time a user closed the app at exactly the same point in a flow.

When I evaluate a tutor I haven't built, I look for the telemetry tells:

Silent failures. Does the dashboard ever show a session that "completed" even though my microphone wasn't connected? If yes, the product is reporting success when the user got nothing. That's the worst possible failure mode and the easiest to hide.
Drop-off points. Are there places where the same kid gets stuck on the same screen across multiple sessions? A serious product surfaces those to the parent. A polished product hides them, because they make the engagement chart look worse.
The 100th-session problem. Demos and free trials cover sessions one through five. Real product quality shows up at session 100, when the novelty is gone, when the model's repetition patterns become obvious, when a child gets bored of the same three encouragement phrases. A company's own beta telemetry will tell it exactly which interactions are getting stale. Whether the company acts on it is the question.
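The silent-failure tell is simple enough to express as a check. The sketch below is plain Python over a made-up session record; the field names are hypothetical and do not come from any real product's schema or from any tracking tool's API. It flags a session the dashboard calls "completed" even though the audio pipeline never captured the child.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    session_id: str
    status: str             # what the parent dashboard reports, e.g. "completed"
    child_turns_heard: int  # child utterances the audio pipeline actually captured


def is_silent_failure(session: SessionRecord) -> bool:
    """True when a session is reported as completed but no child speech was heard."""
    return session.status == "completed" and session.child_turns_heard == 0


sessions = [
    SessionRecord("s1", "completed", 12),
    SessionRecord("s2", "completed", 0),   # microphone never connected
    SessionRecord("s3", "abandoned", 0),
]
print([s.session_id for s in sessions if is_silent_failure(s)])  # ['s2']
```

An abandoned session with no audio is honest. A completed one with no audio is the case worth alerting on.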

For Lumikids, our Sentry setup catches audio pipeline failures within the same session and surfaces them to me before the parent ever sees a misleading "completed" badge. Our PostHog setup flags any child whose stuck points repeat across three sessions so we can review the transcript and adjust prompting. None of this is glamorous. All of it is what "AI safety for kids" looks like in practice, beyond the marketing version of that phrase.
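The repeated-stuck-point rule can be sketched in plain Python over exported events. This is an illustration of the idea, not our production code and not PostHog's API: the event shape (child, session, screen) is hypothetical, and it assumes "across three sessions" means any three distinct sessions rather than three in a row.

```python
from collections import defaultdict


def repeated_stuck_points(stuck_events, min_sessions=3):
    """Map each child to the screens where they got stuck in min_sessions or more sessions.

    stuck_events is an iterable of (child_id, session_id, screen_id) tuples.
    """
    sessions_seen = defaultdict(set)
    for child_id, session_id, screen_id in stuck_events:
        sessions_seen[(child_id, screen_id)].add(session_id)

    flagged = defaultdict(list)
    for (child_id, screen_id), sessions in sessions_seen.items():
        if len(sessions) >= min_sessions:
            flagged[child_id].append(screen_id)
    return {child: sorted(screens) for child, screens in flagged.items()}


events = [
    ("kid_a", "s1", "blend_sounds"),
    ("kid_a", "s2", "blend_sounds"),
    ("kid_a", "s3", "blend_sounds"),
    ("kid_a", "s3", "rhyme_match"),
    ("kid_b", "s1", "blend_sounds"),
]
print(repeated_stuck_points(events))  # {'kid_a': ['blend_sounds']}
```

Counting distinct sessions rather than raw events matters here: a child who retries the same screen five times in one sitting is persistent, not stuck across sessions.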

Five things I look for before letting Remi near any tutor

Remi is six now. The list I actually apply, in order, when someone hands me an app:

1. Does it name its model? If the product cannot tell me what reasoning model is running, I assume it's a thin wrapper that may swap providers tomorrow.
2. What's the measured end-to-end latency, on home Wi-Fi, on the slowest device the kid uses? Not the demo number. The real one.
3. What does the parent dashboard show when the session went badly? Every product looks fine on a good session. The honest ones tell you about the bad ones.
4. Does it use streaks, autoplay, or notifications? All three are engagement-maximizing patterns lifted from adult social apps. None belong in a learning tool for a six-year-old. There's more on this in what 'safe AI for kids' actually means.
5. What happens at session 100? I run it long enough to find out, because the answer is rarely on the marketing page.

If a product fails any of the first three, I don't install it. If it passes those but fails either of the last two, I install it and watch it carefully.
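Written as code, that decision rule looks roughly like the sketch below. The field names are mine, each one stands for a pass or fail judgment on the matching question in the list, and the sketch reads the second rule as "fails at least one of the last two."

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    names_model: bool                   # question one
    latency_ok_on_home_wifi: bool       # question two
    dashboard_shows_bad_sessions: bool  # question three
    no_streaks_autoplay_notifications: bool  # question four
    holds_up_at_session_100: bool       # question five


def decide(card: Scorecard) -> str:
    """Apply the install rule: first three are blockers, last two are warnings."""
    blockers_pass = (
        card.names_model
        and card.latency_ok_on_home_wifi
        and card.dashboard_shows_bad_sessions
    )
    if not blockers_pass:
        return "do not install"
    if not (card.no_streaks_autoplay_notifications and card.holds_up_at_session_100):
        return "install and watch carefully"
    return "install"


print(decide(Scorecard(True, True, True, False, True)))  # install and watch carefully
```

The ordering is the point: the first three questions are about whether the product tells the truth, and a product that fails those does not get a chance to be watched.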

Dollars I won't spend on my own roadmap

Building Lumikids forces me to make budget choices in public. A short list of things I will not buy, in our own product, no matter what the growth chart says:

An autoplay feature that starts the next lesson without the child choosing it. Engagement metric, learning anti-pattern.
A streak counter that rewards consecutive-day usage. Punishes the kid who took a sick day, manipulates the kid who didn't.
A push-notification system tied to "your child hasn't practiced in two days." Shame is not a learning signal.
A leaderboard comparing a child to their classmates. Wrong incentive for early learners, full stop.
An algorithmic feed that orders content by predicted engagement rather than by what the child needs next. The whole point of the product is that the model reads the child, not the engagement curve.

Every one of those would lift session counts. Every one of those would degrade the product. Builders make these choices on Tuesday and have to defend them on Friday when the dashboard dips.

Things a demo is designed to hide

Quick list, from the inside:

Latency on a real home connection versus the office fiber the demo was recorded on.
What happens after the third refusal in a single session, when the model has started to sound clipped.
The first time the kid says something the speech-to-text mishears badly — the recovery is often where the experience breaks.
The audio when the parent walks into the room. Demos always cut here. Real sessions don't.
The pricing page two clicks deeper than the free-trial signup, which is sometimes where the actual cost of a sustained habit lives.

You don't have to be a builder to check any of these. You just have to know to look.

The same scorecard, handed back to you

Here is what I hand to any parent who asks: name the model, measure the latency on your own Wi-Fi, read the dashboard after a session goes badly, count the engagement-maximizing patterns, and run it for a month before you call it good. If a product, ours included, passes all five, it's worth your child's time. If it doesn't, that's information too.

If you want to put Lumikids through this scorecard yourself, the beta is open at lumikids.dev — score us honestly, and tell me where we fall short.

Image brief

Hero image: A laptop at a kitchen counter showing a stopwatch overlay on a children's reading app, a notebook beside it covered in latency numbers and arrows, late evening light from a window.
Inline image 1: A clean stacked-bar diagram of a single voice turn broken into its five latency components (speech-to-text, model first token, model full response, voice synthesis, network), with each segment labelled in milliseconds, placed inside the "Latency math" section.
Inline image 2: A simple two-column scorecard graphic titled "Before I install it," listing the five evaluation questions on the left and pass/fail checkboxes on the right, placed near the "Five things I look for" section.

Internal link suggestions

"A parent's framework for evaluating any AI tutor" — anchor text: "a parent's framework for evaluating any AI tutor"
"Why a ten-second delay kills your child's learning" — anchor text: "why a ten-second delay kills your child's learning"
"What 'safe AI for kids' actually means (and what it doesn't)" — anchor text: "what 'safe AI for kids' actually means"

Editor's note

Three things for Tim to confirm before publishing: (1) the late-2026 model pricing tiers in the "model-choice rabbit hole" section are illustrative — please verify the Anthropic pricing URL still resolves and the tier description matches current published rates, since this is the kind of detail readers will check; (2) the "phop" anecdote is composited from real evaluation sessions — confirm we're comfortable publishing it without naming the competitor (we don't, but the description is specific enough that someone may guess); (3) the latency component budget (150–400 ms speech-to-text, etc.) reflects our internal measurements as of mid-2026 — worth a quick sanity-check against the most recent production numbers before this goes live.

One more thing —

Lumi is in open beta and free for the first 100 families. If reading time at your house ever feels harder than it should, we built this for you.

Try Lumi free →