# Name-extraction scorecard (30 models)

## Method

- **Ground truth** = per file, a normalized name is "true" if a strict majority of the models that *ran that file* extracted it. Names that fall short of that majority, such as those only one or two models found (likely hallucinations or OCR-variant duplicates), are excluded.
- **Precision** = fraction of a model's extracted names that reached consensus.
- **Recall** = fraction of consensus names the model found; **F1** = their harmonic mean, 2PR / (P + R).
- Names are normalized (lowercase, honorifics/titles stripped, punctuation removed) before comparison.
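
The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the harness that produced the table: the `HONORIFICS` set, the function names, and the sample names are all hypothetical, and it scores a single file, whereas the reported P/R are presumably aggregated over the whole corpus.

```python
import re
from collections import Counter

# Hypothetical honorific list; the scorecard does not say which titles were stripped.
HONORIFICS = {"mr", "mrs", "miss", "ms", "dr", "rev", "sir", "esq"}


def normalize(name: str) -> str:
    """Lowercase, remove punctuation, drop honorific tokens, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(t for t in cleaned.split() if t not in HONORIFICS)


def consensus(extractions: dict[str, set[str]]) -> set[str]:
    """Names extracted by a strict majority of the models that ran this file."""
    votes = Counter(n for names in extractions.values() for n in names)
    needed = len(extractions) / 2
    return {n for n, v in votes.items() if v > needed}


def score(extracted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 (harmonic mean) against the consensus set."""
    hits = len(extracted & truth)
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(truth) if truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Made-up extractions for one file from three hypothetical models.
per_model = {
    "model_a": {normalize(n) for n in ["Mr. John Byrne", "Mary Kelly"]},
    "model_b": {normalize(n) for n in ["john byrne", "Mary Kelly", "Pat Doyle"]},
    "model_c": {normalize(n) for n in ["John Byrne."]},
}
truth = consensus(per_model)
for model, names in per_model.items():
    p, r, f1 = score(names, truth)
    print(f"{model}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Here "pat doyle" gets one vote out of three and is dropped from the ground truth, so it counts against `model_b`'s precision even if the person really appears in the document. That is the built-in bias of consensus scoring: a model that finds a correct name the majority missed is penalized for it.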

> **Coverage is near-complete this run.** 26 of 30 models finished all 10 files; four are partial: the three smallest models (`llama3.2:1b` 8/10, `llama3.2:3b` and `qwen2.5:3b` 9/10), which are weak or broken on quality anyway, plus `qwen3.6:35b-a3b` (7/10), which is partial not from weakness but from sheer latency - it timed out (DNF) on the longest documents. Read its F1 (0.70) with that coverage caveat: it only emitted names on 3 of the 7 files it reached. For the other 26 models the "DNF inflates F1" caveat no longer bites, so the separate easy-subset table has been dropped.
> 
> The corpus has 10 files but only ~7 carry real person lists: three short docs (`NAI_..._07_0036`, `VRTI_CEN_Report_1871`, `op1246585`) are essentially empty, so a healthy `Empty` count is **3**. `Empty` above 3 means the model is missing people on documents that actually have them. The two hard documents are the `tcd` ~70-name rent-roll and `NAI_..._06` - where weak models collapse.
> 
> The comparison was run on a Linux box with an Nvidia RTX 3060 16GB video card.
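
One way to read the `Empty` column more closely is to split a model's per-file results into real documents that came back empty and near-empty documents that came back with names. The sketch below assumes a per-file name count is available; the file ids (`doc_a` ... `doc_e`) and the function name are hypothetical placeholders, not the corpus file names.

```python
def empty_diagnosis(names_per_file: dict[str, int],
                    expected_empty: set[str]) -> dict[str, object]:
    """Compare a model's empty outputs against the files expected to be empty."""
    empty = {f for f, n in names_per_file.items() if n == 0}
    return {
        # The value reported in the Empty column.
        "empty_count": len(empty),
        # Real documents where the model returned nothing.
        "missed": sorted(empty - expected_empty),
        # Near-empty documents where the model still emitted names.
        "spurious": sorted(f for f in expected_empty
                           if names_per_file.get(f, 0) > 0),
    }


# Hypothetical counts for one model on five files.
counts = {"doc_a": 0, "doc_b": 0, "doc_c": 3, "doc_d": 0, "doc_e": 12}
report = empty_diagnosis(counts, expected_empty={"doc_a", "doc_b", "doc_c"})
print(report)
```

In this made-up case the `Empty` count is the "healthy" 3, yet one real document was missed and one near-empty document produced names, so the raw count is a quick proxy rather than a full diagnosis.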

## Full corpus (10 files), sorted by F1

| Model | P | R | F1 | Files | Empty | Persons | Median s/file | s/name | Notes |
|-------|--:|--:|---:|------:|------:|--------:|---------:|-------:|-------|
| anthropic/claude-sonnet-4-6 | 0.80 | 0.99 | **0.88** | 10 | 3 | 108 | 2.94 | 0.50 | Best overall; near-perfect recall |
| deepseek/deepseek-reasoner | 0.75 | 0.99 | 0.86 | 10 | 3 | 114 | 22.79 | 2.70 | Excellent quality, but slow (reasoning) |
| anthropic/claude-opus-4-8 | 0.75 | 0.99 | 0.85 | 10 | 3 | 115 | 3.11 | 0.52 | Top-tier, but no edge over sonnet at higher cost |
| deepseek/deepseek-chat | 0.76 | 0.93 | 0.84 | 10 | 3 | 106 | 2.06 | 0.34 | **Standout value** - fast, cheap, accurate |
| gemini/gemini-2.5-flash-lite | 0.74 | 0.99 | 0.84 | 10 | 3 | 117 | **1.21** | **0.17** | Fastest cloud at this quality; best value |
| gemini/gemini-2.5-flash | 0.73 | 0.98 | 0.83 | 10 | 3 | 117 | 9.83 | 1.22 | Good, but flash-lite beats it on speed |
| gemini/gemini-2.5-pro | 0.72 | 0.98 | 0.83 | 10 | 3 | 118 | 16.72 | 1.75 | No quality gain over flash-lite, ~14× slower |
| mistral/mistral-large-latest | 0.73 | 0.95 | 0.83 | 10 | 3 | 113 | 6.26 | 1.64 | Strong; slow, high-variance latency |
| anthropic/claude-haiku-4-5 | 0.72 | 0.94 | 0.82 | 10 | 3 | 114 | 2.05 | 0.30 | Current `names` default; fast and solid |
| openai/gpt-4.1 | 0.71 | 0.92 | 0.80 | 10 | 3 | 113 | 1.99 | 0.35 | Reliable quality/speed balance |
| ollama/qwen3:14b | 0.74 | 0.86 | 0.80 | 10 | 4 | 101 | 81.89 | 14.77 | **Best local** - finished all 10; brutally slow |
| mistral/mistral-small-latest | 0.70 | 0.90 | 0.79 | 10 | 4 | 111 | **1.10** | 0.22 | Fastest median of any model with F1 ≥ 0.70 |
| openai/gpt-4.1-mini | 0.68 | 0.87 | 0.77 | 10 | 3 | 111 | 2.79 | 0.45 | Solid cheap cloud |
| mistral/ministral-8b-latest | 0.67 | 0.90 | 0.76 | 10 | 3 | 117 | 5.07 | 0.58 | Decent; uncalibrated |
| openai/gpt-4o-mini | 0.74 | 0.68 | 0.71 | 10 | 4 | 80 | 1.48 | 0.59 | Precise but misses ~⅓ |
| ollama/qwen3.6:35b-a3b | **0.94** | 0.56 | 0.70 | **7** | 4 | 16 | 266.86 | 180.16 | Highest precision in field, but only ran 7/10; unusably slow MoE |
| ollama/llama3.1:8b | 0.61 | 0.79 | 0.69 | 10 | **0** | 114 | 4.68 | 1.02 | Hallucinates; never returns empty |
| ollama/gemma2:9b | 0.67 | 0.69 | 0.68 | 10 | 5 | 90 | 1.49 | 0.73 | Mediocre but stable |
| ollama/granite3.3:8b | 0.71 | 0.53 | 0.61 | 10 | **8** | 65 | 1.34 | 0.66 | Broken: empty on 8/10, dumps names only on rent-roll |
| openai/gpt-4.1-nano | 0.69 | 0.55 | 0.61 | 10 | **9** | 70 | 0.60 | 0.34 | Too weak: empty on 9/10 |
| ollama/gemma3:12b | 0.52 | 0.67 | 0.59 | 10 | 4 | **111** | 7.64 | 1.14 | Full coverage but low precision |
| ollama/mistral:7b | 0.66 | 0.52 | 0.58 | 10 | 4 | 68 | 1.75 | 0.83 | Low recall |
| ollama/llama3.2:3b | 0.59 | 0.46 | 0.52 | **9** | 1 | 66 | 1.09 | 0.48 | Weak; dropped a file |
| ollama/qwen2.5:14b | 0.51 | 0.51 | 0.51 | 10 | 5 | 87 | 2.81 | 1.37 | Surprisingly weak for its size |
| ollama/phi4 | 0.48 | 0.47 | 0.47 | 10 | 4 | 86 | 5.50 | 1.56 | Weak + slow |
| ollama/qwen3:8b | 0.56 | 0.28 | 0.37 | 10 | 4 | 43 | 46.98 | 25.39 | Low recall *and* extremely slow |
| ollama/qwen2.5:7b | 0.78 | 0.16 | 0.27 | 10 | 7 | 18 | 0.99 | 0.93 | Broken: bails on long input |
| ollama/qwen2.5:3b | 0.75 | 0.16 | 0.26 | **9** | 5 | 16 | 0.71 | 25.15 | Broken; one file timed out (~394 s) |
| ollama/mistral-nemo:12b | 0.40 | 0.07 | 0.12 | 10 | 7 | 15 | 1.41 | 1.74 | Broken: 7/10 empty, 15 names total |
| ollama/llama3.2:1b | 0.29 | 0.03 | 0.05 | **8** | 1 | 7 | 0.66 | 34.94 | Effectively nonfunctional on this task |

## Takeaways

- **`claude-sonnet-4-6` is still the quality leader (F1 0.88)**, with `deepseek-reasoner` (0.86) and `claude-opus-4-8` (0.85) right behind - all three hit ~0.99 recall. Opus shows **no advantage over sonnet** here at higher cost, and the reasoner's quality comes at a roughly 8× latency penalty relative to sonnet (median 22.8 s vs 2.9 s; s/name 2.70 vs 0.50).
- **The new value champions are `gemini-2.5-flash-lite` and `deepseek-chat` (both F1 0.84).** flash-lite has the lowest s/name in the whole field (**0.17**) and a **1.21 s** median, bettered among cloud models only by `mistral-small-latest` (1.10 s) and the too-weak `gpt-4.1-nano` (0.60 s), yet it matches the mid-frontier on quality - it's a strong candidate to become the `names` default over `haiku-4-5` (0.82). `deepseek-chat` is nearly as fast (2.06 s) and similarly cheap.
- **Bigger Gemini is not better Gemini.** `gemini-2.5-pro` (0.83) does **not** beat `flash-lite` (0.84) on quality while being ~14× slower (16.7 s vs 1.2 s). `flash` sits between them on speed with no quality edge. Use flash-lite.
- **`qwen3:14b` is the best local model and the first to be genuinely cloud-adjacent** - F1 0.80, tying `gpt-4.1`, and it finished all 10 files including the hard ones. But it's **unusable at scale**: median 82 s/file (max 689 s ≈ 11 min), s/name 14.77 vs 0.17–0.6 for the fast cloud models. The quality is there; throughput is not.
- **`qwen3.6:35b-a3b` posts the highest precision in the whole field (0.94) but is the slowest model benchmarked, by a wide margin.** Median **267 s/file** (max 828 s ≈ 14 min), s/name **180** - ~1000× flash-lite's. That latency is why it only finished 7/10 files, and on 4 of those 7 it returned empty, so its recall (0.56) and 0.70 F1 are coverage-limited rather than a true quality ceiling. When it does emit a name it is almost always right, but at this throughput it is a research curiosity, not a usable extractor.
- **Most other locals are weak or broken.** `llama3.1:8b` (0.69) hallucinates and never returns empty; `gemma3:12b` (0.59) gets full coverage but poor precision. The genuinely broken set - empty on most real documents or bailing on long input - is `granite3.3:8b`, `gpt-4.1-nano` (the one broken *cloud* model), `qwen2.5:7b`, `qwen2.5:3b`, `mistral-nemo:12b`, and `llama3.2:1b` (effectively nonfunctional at 0.05). `qwen3:8b` is the worst combination: low recall (0.28) *and* 25 s/name.
- **`s/name` remains the meaningful throughput metric** - latency is output-bound (number of names emitted), not input-bound. It's inflated for the broken tiny models (`llama3.2:1b` 34.9, `qwen2.5:3b` 25.2) because they emit almost no names while occasionally timing out, so read those rows as "broken," not "slow."
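
To see why a single timeout can wreck `s/name` while leaving the median untouched, here is a small sketch. It assumes `s/name` is total wall-clock seconds divided by total names emitted, which the table does not state explicitly; the function name and the timings are hypothetical, chosen only to resemble a tiny model that answers quickly on most files and hangs on one.

```python
from statistics import median


def throughput(seconds_per_file: list[float], names_emitted: int) -> dict[str, float]:
    """Median per-file latency and seconds per emitted name (assumed definition)."""
    total = sum(seconds_per_file)
    return {
        "median_s": median(seconds_per_file),
        "s_per_name": total / names_emitted if names_emitted else float("inf"),
    }


# Hypothetical run: eight fast files, one file that hangs, very few names overall.
timings = [0.7] * 8 + [394.0]
stats = throughput(timings, names_emitted=16)
print(f"median {stats['median_s']:.2f} s/file, {stats['s_per_name']:.1f} s/name")
```

The median stays under a second because eight of nine files return quickly, while the one slow file dominates the total and, divided by a handful of names, pushes `s/name` into the tens of seconds. That is the pattern to look for before labelling a row "slow".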
