This weekend I built a small desktop application that gives tarot readings. The idea is simple: you pick a spread, choose a category (general, love, work, etc.), and provide an (optional) question to ask. The app shuffles the deck of tarot cards, draws cards (or has you pick cards from the deck), and then creates a reading from the cards that you drew. It does this by using a local large language model. The whole thing runs offline - the model lives on your machine and neither your questions nor the readings ever go out over the internet.
What makes the responses from this app different from ones you might get from any other language model has to do with how the app provides the model with information on the cards that were drawn. Rather than relying on whatever the model happened to absorb about tarot during training, I assembled my own database of information on each card - its imagery, its keywords, and its meanings in different contexts like love, career, and finances - and the app pulls the relevant entries for the drawn cards into the prompt at reading time. This is called retrieval-augmented generation (RAG) and it allows a comparatively small model running on your own machine to give readings grounded in a specific, authored body of knowledge rather than its own fuzzy recollections.
Because the model runs locally, the choice of which model is an important design decision. It determines how good the readings are, how much disk and memory the app needs, how fast a reading appears. The licensing of the model is also important if you want to embed it in a commercial application.
The Initial Prototype
I did my initial development using a (relatively) small 3 billion parameter model called Qwen 2.5 (3B). This model produced fairly good readings on simple one or three card spreads, but once I had the rest of the app finshed I decided that I really needed to replace this model before releasing it. There were two main factors behind this.
The first was quality. Although the smaller spreads generated ok results, the ten card Celtic Cross spread caused the model to fall apart. Instead of weaving the cards into a single narrative, it tended to produce a mechanical "list of cards," walking down the positions one at a time with a sentence or two each. This isn't terrible output, but it isn't the compellig narrative that I want the application to produce. Instead its the kind of thing that someone could figure out for themselves just by looking up each card in a beginner tarot book.
The second issue was the model's licensing. Most of the Qwen family of models ships under the permissive Apache-2.0 license, but Alibaba carved out a couple of the sizes, including the 3B, under a separate Qwen Research License that restricts use to non-commercial purposes. That creates a problem for me if I want to try and sell the app, which I was contemplating.
Building a Fair Test
To determine which model would work best for me, I decided to test a number of models of various sizes head-to-head and see which one could produce the best output on that most difficult ten card "Celtic Cross" spread. However, since I would be testing models of various sizes, I would need to design a test that would be fair for all the models involved.
So I built the comparison around the app's own real inputs. Each candidate model received the same prompt my app actually produced - the same card imagery, the same keyword tables, the same question framing - and ran under the same generation settings the app used (the same temperature, the same sampling, the same reply-length budget, etc.). I also ran every model through a single local runtime so that each one woudl use its own native chat formatting. This is important because feeding a model a prompt formatted for a different model is a great way to handicap it for reasons that have nothing to do with its quality.
The point of all this was to make sure that when one model read better than another, it was because the model was actually better - not because I'd accidentally given it an unfair prompt.
The Celtic Cross as the Discriminator
I tested each model on three spreads of escalating difficulty: a single card draw, a three card past/present/future spread, and the ten card Celtic Cross. (The Celtic Cross is one of the most popular tarot spreads.)
I very quickly realized that the short spreads didn't help me differentiate between models. Nearly every model, even the really tiny ones, wrote a perfectly nice one or three card reading. If I had only tested those, I would have concluded that a dozen models were interchangeable and probably would have picked the smallest one I could get away with.
The Celtic Cross was where their quality begame to separate. Ten cards is a lot to hold in your head at once. Three distinct failure modes showed up on the big spread that were missing from the small ones:
- Truncation. Some models that wrote beautifully on a three card spread simply ran out of steam on ten, stopping after two or three cards or cutting off mid-sentence. A gorgeous reading that covers a quarter of the spread is a broken reading.
- Dropped and unnamed cards. Others quietly skipped positions, or - in one memorable case - declined to name the cards at all, which rather defeats the purpose.
- Leaked reasoning. One of the "thinking" models dumped its entire internal chain-of-thought before every answer, despite being told not to, and then truncated the actual reading. Its coverage scores looked great until I realized the card names were appearing inside the leaked thinking rather than in the reading itself.
None of these are things you'd catch by glancing at a single nice-looking sample. You only find them by deliberately pushing each model and reading the whole output carefully.
How I Scored the Readings
Since there's no answer key for "what's a good tarot reading," I had a more complex model, Claude Opus, read the output of every model and score it against the instructions the app provides. The metrics that were most important:
- Coverage - how many of the drawn cards the reading meaningfully addresses. If this is less than 10 on a 10 card spread, then some of the cards have been skipped.
- Card-to-position mapping - does each card get interpreted in its position? Swapping past and present is a serious problem, and a couple of the small models did exactly that.
- Grounding - does it use the imagery and keywords it was given, or does it drift off into generic, horoscope-flavored vagueness and invent details?
- Format - the app asks for one cohesive narrative, not a mechanical bulleted list of cards with a "Takeaway" at the bottom. Several models that were perfectly accurate failed here by formatting the reading as a document instead of telling a story.
- Length and self-termination - does it stay in the right range and stop cleanly, or does it ramble until it runs out of tokens?
There were also practical, non-quality metrics that would also feed into the final decision. These included the license, download size, memory footprint, and speed.
Baseline First: The Bug Was in the Prompt, Not the Model
I already knew the 3B model struggled on the Celtic Cross. However, I still ran it through the new, controlled test alongside the candidates as a baseline. Some of the other candidates were failing the big spread in the same way the 3B model had: by dropping cards and lapsing into a list. When I looked closer I noticed that several models were all dropping the same cards from the spread. Specifically, they were the cards whose descriptions sat near the beginning of what had become a very long prompt. This is the classic "lost in the middle" behavior, where a model pays the most attention to the start and end of its input and lets the middle blur. It wasn't a capacity problem - the prompt fit comfortably in the context window - it was an attention problem, and at least part of it was the prompt's fault, not the model's.
That reframed how I read the results. Some of the Celtic Cross failures I'd been chalking up to weak models were partly self-inflicted, baked into how I'd ordered the prompt. I wouldn't have caught this if I hadn't also run my original baseline model through the same tests. The lesson here is to always measure your current setup with the same yardstick you use on the candidates, or you'll happily blame a model for a flaw your own pipeline shares. I came close to disqualifying good models for a bug that lived in my own prompt - and once I understood it, I could fix the prompt and re-judge everyone fairly.
Two Numbers, Not One
A smaller but instructive detour came from one of the most promising candidates, a model family (Gemma 4) whose files are quite large on disk - around 9.6 GB. That number matters in an app like this, because the user has to download the model before they can run it. A 9.6 GB download is a real cost no matter how the model behaves once it's loaded.
What made this model interesting is that its on-disk size and its resident memory were very different numbers. That particular architecture streams part of itself from disk, so the 9.6 GB download only needed about 3.5 GB of RAM to run - lighter in memory than models half its download size.
But the light memory footprint didn't make the download go away. For a model that ships to someone else's machine I had to weigh both numbers, and a model that sips RAM while still demanding a 9 GB download is a problem for an app people have to install. If it had been a decisively better writer than everything else, that tradeoff might have been worth making - but it wasn't clearly ahead, so I couldn't justify the download on the strength of the memory figure alone.
The takeaway: download size and resident memory are different numbers, and when a model ships to a user's machine, both of them are real costs. Don't let an attractive figure on one quietly excuse a problem with the other.
The Test Results
Here are the results of the controlled comparison of 16 local models for the offline tarot reader.
Each model received byte-identical prompts (the app's real prompt.rs output) for three readings — a 1-card, a 3-card, and a 10-card Celtic Cross — under the app's production settings (num_ctx 8192, temperature 0.8, top_p 0.95, 1024 reply tokens), via ollama /api/chat so each model uses its own chat template.
See summary.md for a more detailed description of the results.
1–2B Class
| Model | Params | License | Download | VRAM (8K ctx) | Native Ctx | Overall Quality |
|---|---|---|---|---|---|---|
| Qwen 2.5 | 1.5B | Apache-2.0 | 986 MB | 1.4 GB | 32K | Good on small spreads; loses the thread on Celtic Cross |
| Llama 3.2 | 1.2B | Llama 3.2 Community | 1.3 GB | 1.7 GB | 128K | Impressive for 1B; minor drift, partial on 10-card |
| Gemma 4 | 2B eff. | Gemma Terms of Use | 7.2 GB | 1.9 GB | 32K | Excellent short readings; truncates on Celtic Cross |
3–4B Class
| Model | Params | License | Download | VRAM (8K ctx) | Native Ctx | Overall Quality |
|---|---|---|---|---|---|---|
| Qwen 3 | 4B | Apache-2.0 | 2.5 GB | 3.9 GB | 32K | Good prose wrecked by leaked reasoning + truncation |
| Phi 3.5 Mini | 3.8B | MIT | 2.2 GB | 5.5 GB | 128K | Over-long, ornate, truncates; a table leaked once |
| Gemma 4 | 4B eff. | Gemma Terms of Use | 9.6 GB | 3.4 GB | 32K | Top-tier 3-card; truncates on 1- and 10-card |
| Qwen 2.5 | 3B | Qwen Research (NC) | 1.9 GB | 2.4 GB | 32K | Dependable all-round — but NC license blocks sale |
| Llama 3.2 | 3.2B | Llama 3.2 Community | 2.0 GB | 3.1 GB | 128K | Excellent coverage for 3B; faintly listy |
| SmolLM3 | 3B | Apache-2.0 | 1.9 GB | 2.7 GB | 64K | Best small model: full coverage, Apache-2.0 |
7–8B Class
| Model | Params | License | Download | VRAM (8K ctx) | Native Ctx | Overall Quality |
|---|---|---|---|---|---|---|
| Qwen 3 | 8B | Apache-2.0 | 5.2 GB | 6.3 GB | 32K | Shipped choice — clean, complete, Apache-2.0 |
| Llama 3.1 | 8B | Llama 3.1 Community | 4.9 GB | 5.9 GB | 128K | Reliable, thorough; Llama license conditions |
| Mistral | 7B | Apache-2.0 | 4.4 GB | 5.6 GB | 32K | Good short readings; omits cards on Celtic Cross |
| Ministral | 8B | Mistral Research (NC) | 4.9 GB | 6.0 GB | 128K | Beautiful writing — but NC license blocks sale |
| Gemma 2 | 9B | Gemma Terms of Use | 5.4 GB | 6.6 GB | 8K | Polished; sometimes won't name the cards |
| Granite 3.3 | 8B | Apache-2.0 | 4.9 GB | 6.5 GB | 128K | Strong, accurate, Apache-2.0 — a real alternative |
| Granite 4.1 | 8B | Apache-2.0 | 5.3 GB | 6.7 GB | 1M | Accurate but document-formatted, not narrative |
The Decision
The model I picked in the end was Qwen 3 (8B).
It was the first model I tested that combined two things I'd previously only seen separately: full coverage of the ten-card Celtic Cross and a genuine flowing narrative. It also had a clean, unambiguously commercial license (Apache-2.0). Other models matched it on quality, but each came with a catch.
The Gemma 4 (E4B) model was arguably the nicest writer and the lightest on memory, but its license is a custom set of terms with a usage policy I'd have to flow down to every customer. The smaller Qwen2.5-7B was the easiest to integrate because it was from the same Qwen2.5 family as the original model I had used, but it reliably dropped a card on the big spread.
Qwen3-8B gave me top-tier readings with none of the license overhead. In exchange I accepted a heavier memory footprint, somewhat slower generation, and a bit of extra integration work to keep its internal reasoning out of the final reading. However, paying for a clean license with a few gigabytes of RAM and disk space was an acceptable trade. I swapped in the new model, generated a few readings, and the qualitative result matched the numbers - the readings were good, and I was happy to ship it.
Takeaways
A few lessons from this that I think generalize well beyond this app:
Test on your hardest real case, not the easy ones. The short spreads made a dozen models look identical; the ten-card spread did all the actual discriminating.
Benchmark your starting point with the same yardstick. Running the model I'd started with through the new test, as a baseline, was an important step. It revealed that part of the failure I was about to blame on candidate models was a flaw in my own prompt, and it saved me from disqualifying good models for the wrong reason.
Make the test faithful to the real thing. Feeding each model the app's actual prompt, in its own native format, under production settings, meant the results transferred directly. A model that won the benchmark won in the app, with no nasty surprises in between.
Account for every cost that constrains you. Download size, memory footprint, and speed are three different numbers, and for a model that ships to someone else's machine they all matter at once. A light memory footprint didn't excuse a heavy download. Weigh each constraint on its own rather than letting an attractive figure on one paper over a problem with another.
Treat licensing as a first-class selection criterion. For anything you intend to ship - especially to sell - the license can disqualify a model before quality ever enters the conversation. It's cheaper to check it first than to fall in love with a model you can't use.
Separate fundamental failures from tunable ones. A model that breaks character or collapses into incoherence is out. A model that drops a single card might be fixable with a better prompt or a retry. Knowing which kind of failure you're looking at tells you whether to disqualify a model or keep working with it.
The end result was an offline tarot app running on a model that generates good readings, can be shipped to other people, and can actually sell.