Name-extraction scorecard (30 models)
Method
- Ground truth = per file, a normalized name is "true" if a strict majority of the models that ran that file extracted it. Names only one or two models found (likely hallucinations or OCR-variant duplicates) are excluded.
- Precision = fraction of a model's extracted names that reached consensus.
- Recall = fraction of consensus names the model found.
- F1 = harmonic mean of precision and recall.
- Names are normalized (lowercase, honorifics/titles stripped, punctuation removed) before comparison.
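
The method above can be sketched for a single file as follows. This is a minimal illustration, not the scorecard's actual code: the honorific list, the function names and the sample extractions are all assumptions, and how per-file scores are aggregated across the corpus is not covered here. F1 is taken as the harmonic mean, which is what the table values are consistent with.

```python
import re
from collections import Counter

# Hypothetical honorific list: the scorecard does not enumerate what it strips.
HONORIFICS = {"mr", "mrs", "miss", "ms", "dr", "rev", "sir", "esq"}


def normalize(name: str) -> str:
    """Lowercase, remove punctuation, then drop honorific/title tokens."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(w for w in words if w not in HONORIFICS)


def consensus(per_model: dict[str, set[str]]) -> set[str]:
    """Names extracted by a strict majority of the models that ran this file."""
    votes = Counter(name for names in per_model.values() for name in names)
    return {name for name, n in votes.items() if 2 * n > len(per_model)}


def score(found: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Precision, recall and F1 (harmonic mean) of one model against consensus."""
    hits = len(found & truth)
    p = hits / len(found) if found else 0.0
    r = hits / len(truth) if truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Made-up extractions for a single file, for illustration only.
raw = {
    "model_a": ["Mr. John Smith", "Mary O'Brien"],
    "model_b": ["john smith", "Mary OBrien", "Tom Ghost"],
    "model_c": ["Dr John Smith"],
}
per_model = {m: {normalize(n) for n in names} for m, names in raw.items()}
truth = consensus(per_model)
p, r, f1 = score(per_model["model_b"], truth)
print(sorted(truth), round(p, 2), round(r, 2), round(f1, 2))
```

In this toy case "Tom Ghost" gets one vote out of three and is excluded from the ground truth, so model_b is penalized on precision but not on recall.
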
Coverage is near-complete this run. 26 of 30 models finished all 10 files; four are partial: the three smallest models (llama3.2:1b at 8/10, llama3.2:3b and qwen2.5:3b at 9/10), which are broken on quality anyway, plus qwen3.6:35b-a3b (7/10), which is partial not from weakness but from sheer latency: it timed out (did not finish) on the longest documents. Read its high F1 (0.70) with that coverage caveat: it emitted names on only 3 of the 7 files it reached. For the other 26 models the "DNF inflates F1" caveat no longer bites, so the separate easy-subset table has been dropped.
The corpus has 10 files but only ~7 carry real person lists: three short docs (NAI_..._07_0036, VRTI_CEN_Report_1871, op1246585) are essentially empty, so a healthy Empty count is 3. Empty above 3 means the model is missing people on documents that actually have them. The two hard documents, where weak models collapse, are the tcd ~70-name rent-roll and NAI_..._06.
The comparison was run on a Linux box with an Nvidia RTX 3060 16GB video card.
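
The reading of the Empty column described above can be written down as a small check. This is a sketch under stated assumptions: the function name and its labels are hypothetical, and the "below 3" case is a pigeonhole inference (a model with full coverage and fewer than three empty results must have emitted names on at least one essentially empty document), not a rule the scorecard states.

```python
EXPECTED_EMPTY = 3   # the three essentially empty documents in the corpus
CORPUS_SIZE = 10


def read_empty(files: int, empty: int) -> str:
    """Interpret a model's Empty count given how many files it completed."""
    if files < CORPUS_SIZE:
        # Which files were skipped is unknown, so the count is not comparable.
        return "partial coverage"
    if empty > EXPECTED_EMPTY:
        return f"missing people on at least {empty - EXPECTED_EMPTY} real document(s)"
    if empty < EXPECTED_EMPTY:
        return f"names emitted on at least {EXPECTED_EMPTY - empty} essentially empty document(s)"
    return "healthy"


# (Files, Empty) pairs copied from the table.
for model, files, empty in [
    ("anthropic/claude-sonnet-4-6", 10, 3),
    ("ollama/granite3.3:8b", 10, 8),
    ("ollama/llama3.1:8b", 10, 0),
    ("ollama/qwen3.6:35b-a3b", 7, 4),
]:
    print(f"{model}: {read_empty(files, empty)}")
```
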
Full corpus (10 files), sorted by F1
| Model | P | R | F1 | Files | Empty | Persons | Median s | s/name | Notes |
|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 0.80 | 0.99 | 0.88 | 10 | 3 | 108 | 2.94 | 0.50 | Best overall; near-perfect recall |
| deepseek/deepseek-reasoner | 0.75 | 0.99 | 0.86 | 10 | 3 | 114 | 22.79 | 2.70 | Excellent quality, but slow (reasoning) |
| anthropic/claude-opus-4-8 | 0.75 | 0.99 | 0.85 | 10 | 3 | 115 | 3.11 | 0.52 | Top-tier, but no edge over sonnet at higher cost |
| deepseek/deepseek-chat | 0.76 | 0.93 | 0.84 | 10 | 3 | 106 | 2.06 | 0.34 | Standout value - fast, cheap, accurate |
| gemini/gemini-2.5-flash-lite | 0.74 | 0.99 | 0.84 | 10 | 3 | 117 | 1.21 | 0.17 | Fastest cloud at this quality; best value |
| gemini/gemini-2.5-flash | 0.73 | 0.98 | 0.83 | 10 | 3 | 117 | 9.83 | 1.22 | Good, but flash-lite beats it on speed |
| gemini/gemini-2.5-pro | 0.72 | 0.98 | 0.83 | 10 | 3 | 118 | 16.72 | 1.75 | No quality gain over flash-lite, ~14× slower |
| mistral/mistral-large-latest | 0.73 | 0.95 | 0.83 | 10 | 3 | 113 | 6.26 | 1.64 | Strong; slow, high-variance latency |
| anthropic/claude-haiku-4-5 | 0.72 | 0.94 | 0.82 | 10 | 3 | 114 | 2.05 | 0.30 | Current names default; fast and solid |
| openai/gpt-4.1 | 0.71 | 0.92 | 0.80 | 10 | 3 | 113 | 1.99 | 0.35 | Reliable quality/speed balance |
| ollama/qwen3:14b | 0.74 | 0.86 | 0.80 | 10 | 4 | 101 | 81.89 | 14.77 | Best local - finished all 10; brutally slow |
| mistral/mistral-small-latest | 0.70 | 0.90 | 0.79 | 10 | 4 | 111 | 1.10 | 0.22 | Fastest median among models with F1 ≥ 0.70 |
| openai/gpt-4.1-mini | 0.68 | 0.87 | 0.77 | 10 | 3 | 111 | 2.79 | 0.45 | Solid cheap cloud |
| mistral/ministral-8b-latest | 0.67 | 0.90 | 0.76 | 10 | 3 | 117 | 5.07 | 0.58 | Decent; uncalibrated |
| openai/gpt-4o-mini | 0.74 | 0.68 | 0.71 | 10 | 4 | 80 | 1.48 | 0.59 | Precise but misses ~⅓ |
| ollama/qwen3.6:35b-a3b | 0.94 | 0.56 | 0.70 | 7 | 4 | 16 | 266.86 | 180.16 | Highest precision in field, but only ran 7/10; unusably slow MoE |
| ollama/llama3.1:8b | 0.61 | 0.79 | 0.69 | 10 | 0 | 114 | 4.68 | 1.02 | Hallucinates; never returns empty |
| ollama/gemma2:9b | 0.67 | 0.69 | 0.68 | 10 | 5 | 90 | 1.49 | 0.73 | Mediocre but stable |
| ollama/granite3.3:8b | 0.71 | 0.53 | 0.61 | 10 | 8 | 65 | 1.34 | 0.66 | Broken: empty on 8/10, dumps names only on rent-roll |
| openai/gpt-4.1-nano | 0.69 | 0.55 | 0.61 | 10 | 9 | 70 | 0.60 | 0.34 | Too weak: empty on 9/10 |
| ollama/gemma3:12b | 0.52 | 0.67 | 0.59 | 10 | 4 | 111 | 7.64 | 1.14 | Full coverage but low precision |
| ollama/mistral:7b | 0.66 | 0.52 | 0.58 | 10 | 4 | 68 | 1.75 | 0.83 | Low recall |
| ollama/llama3.2:3b | 0.59 | 0.46 | 0.52 | 9 | 1 | 66 | 1.09 | 0.48 | Weak; dropped a file |
| ollama/qwen2.5:14b | 0.51 | 0.51 | 0.51 | 10 | 5 | 87 | 2.81 | 1.37 | Surprisingly weak for its size |
| ollama/phi4 | 0.48 | 0.47 | 0.47 | 10 | 4 | 86 | 5.50 | 1.56 | Weak + slow |
| ollama/qwen3:8b | 0.56 | 0.28 | 0.37 | 10 | 4 | 43 | 46.98 | 25.39 | Low recall and extremely slow |
| ollama/qwen2.5:7b | 0.78 | 0.16 | 0.27 | 10 | 7 | 18 | 0.99 | 0.93 | Broken: bails on long input |
| ollama/qwen2.5:3b | 0.75 | 0.16 | 0.26 | 9 | 5 | 16 | 0.71 | 25.15 | Broken; one file timed out (~394 s) |
| ollama/mistral-nemo:12b | 0.40 | 0.07 | 0.12 | 10 | 7 | 15 | 1.41 | 1.74 | Broken: 7/10 empty, 15 names total |
| ollama/llama3.2:1b | 0.29 | 0.03 | 0.05 | 8 | 1 | 7 | 0.66 | 34.94 | Effectively nonfunctional on this task |
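
One way to read the table as a quality/speed trade-off is a Pareto filter over F1 and s/name: keep a model only if no other model is at least as good on both axes and strictly better on one. This is an illustrative sketch, not how the Notes column was produced. The rows are a subset copied from the table; the `Row` type and `pareto_front` helper are hypothetical, and price is ignored because the table does not report it.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    f1: float
    s_per_name: float


# Subset of rows copied from the table (F1 and s/name columns).
ROWS = [
    Row("anthropic/claude-sonnet-4-6", 0.88, 0.50),
    Row("deepseek/deepseek-reasoner", 0.86, 2.70),
    Row("anthropic/claude-opus-4-8", 0.85, 0.52),
    Row("deepseek/deepseek-chat", 0.84, 0.34),
    Row("gemini/gemini-2.5-flash-lite", 0.84, 0.17),
    Row("gemini/gemini-2.5-pro", 0.83, 1.75),
    Row("anthropic/claude-haiku-4-5", 0.82, 0.30),
    Row("openai/gpt-4.1", 0.80, 0.35),
    Row("ollama/qwen3:14b", 0.80, 14.77),
    Row("mistral/mistral-small-latest", 0.79, 0.22),
]


def pareto_front(rows: list[Row]) -> list[Row]:
    """Rows not dominated on (higher F1, lower s/name)."""
    def dominated(r: Row) -> bool:
        return any(
            o.f1 >= r.f1 and o.s_per_name <= r.s_per_name
            and (o.f1 > r.f1 or o.s_per_name < r.s_per_name)
            for o in rows
        )
    return [r for r in rows if not dominated(r)]


front = [r.model for r in pareto_front(ROWS)]
print(front)
```

On these rows only claude-sonnet-4-6 and gemini-2.5-flash-lite survive: the first has the highest F1, the second the lowest s/name, and every other listed model is matched or beaten on both axes by one of the two.
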
Takeaways
- claude-sonnet-4-6 is still the quality leader (F1 0.88), with deepseek-reasoner (0.86) and claude-opus-4-8 (0.85) right behind; all three hit ~0.99 recall. Opus shows no advantage over sonnet here at higher cost, and the reasoner's quality comes at a roughly 10× latency penalty (median 22.8 s, against 2.1 s for deepseek-chat and 2.9 s for sonnet; s/name 2.70).
- The new value champions are gemini-2.5-flash-lite and deepseek-chat (both F1 0.84). flash-lite has the lowest s/name in the whole field (0.17, at a 1.21 s median) yet matches the mid-frontier on quality, which makes it a strong candidate to replace haiku-4-5 (0.82) as the names default. deepseek-chat is nearly as fast (2.06 s median) and similarly cheap.
- Bigger Gemini is not better Gemini. gemini-2.5-pro (0.83) does not beat flash-lite (0.84) on quality while being ~14× slower (16.7 s vs 1.2 s). flash sits between them on speed with no quality edge. Use flash-lite.
- qwen3:14b is the best local model and the first to be genuinely cloud-adjacent: F1 0.80, tying gpt-4.1, and it finished all 10 files including the hard ones. But it's unusable at scale: median 82 s/file (max 689 s ≈ 11 min), s/name 14.77 vs 0.17–0.6 for the fast cloud models. Quality-per-token is there; throughput is not.
- qwen3.6:35b-a3b posts the highest precision in the whole field (0.94) but is the slowest model benchmarked, by a wide margin: median 267 s/file (max 828 s ≈ 14 min) and s/name 180, ~1000× flash-lite's. That latency is why it only finished 7/10 files, and on 4 of those 7 it returned empty, so its recall (0.56) and 0.70 F1 are coverage-limited rather than a true quality ceiling. When it does emit a name it is almost always right, but at this throughput it is a research curiosity, not a usable extractor.
- Most other locals are weak or broken. llama3.1:8b (0.69) hallucinates and never returns empty; gemma3:12b (0.59) gets full coverage but poor precision. The genuinely broken set (empty on most real documents, or bailing on long input) is granite3.3:8b, gpt-4.1-nano (the one broken cloud model), qwen2.5:7b, qwen2.5:3b, mistral-nemo:12b, and llama3.2:1b (effectively nonfunctional at 0.05). qwen3:8b is the worst combination: low recall (0.28) and 25 s/name.
- s/name remains the meaningful throughput metric: latency is output-bound (number of names emitted), not input-bound. It's inflated for the broken tiny models (llama3.2:1b 34.9, qwen2.5:3b 25.2) because they emit almost no names while occasionally timing out, so read those rows as "broken," not "slow."
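
The inflation effect in the last point can be illustrated numerically. The sketch assumes s/name is total wall-clock seconds divided by total names emitted, which the scorecard does not spell out. The per-file timings are hypothetical, shaped after the qwen2.5:3b row (9 files, 16 names, one ~394 s timeout), and `throughput` is an illustrative helper.

```python
from statistics import median


def throughput(seconds_per_file: list[float], names_emitted: int) -> tuple[float, float]:
    """Return (median seconds per file, total seconds per emitted name)."""
    total = sum(seconds_per_file)
    s_per_name = total / names_emitted if names_emitted else float("inf")
    return median(seconds_per_file), s_per_name


# Hypothetical timings: eight quick files plus one ~394 s timeout, 16 names in total.
timings = [0.7] * 8 + [394.0]
med, s_per_name = throughput(timings, 16)
print(f"median {med:.2f} s/file, {s_per_name:.1f} s/name")
```

The median stays under a second because it ignores the single timeout, while s/name lands near 25 because almost all of the elapsed time is divided over very few names. That is the pattern behind reading such rows as broken rather than slow.
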