
Post: A Perfect Assessment Score Is Now a Red Flag
Thesis: In the AI era, a perfect assessment score is a red flag, not a green light. When AI has dragged the average toward the ceiling, a flawless score signals willingness to use every available tool — not superior ability — and a benchmark set at perfection rejects the honest candidates you most want.
I’ll argue something that sounds backwards: stop celebrating perfect scores. In a world where candidates have a tool open in another tab, a 100% tells you less about ability than a thoughtful 80% does. This is the sharp end of the AI resume screening pillar.
What This Means
- A perfect score marks tool-use willingness, not the strongest candidate.
- A 100% benchmark rewards gaming and rejects honest strugglers.
- The signal you want is the reasoning behind the score, not the number.
Claim 1: The Average Has Moved to the Ceiling
When AI assistance is widely available, averages on fixed-answer assessments climb toward the maximum. The mechanism is direct: a fixed-answer test has a correct response a tool can supply, so once most candidates have the tool, most candidates score near the top, and the distribution compresses against the ceiling. A practitioner asked it directly: “Is the benchmark now 100% because AI has dragged up the average so much?” Picture an assessment that returned a healthy spread of 60 to 90 two years ago and now returns a wall of scores from 95 to 100. Once the ceiling is the average, a perfect score stops distinguishing anyone; it’s the new baseline, and a baseline is not a signal. The very thing that made the assessment useful, its ability to separate candidates, has been erased by the compression.
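To make the compression concrete, here is a minimal sketch under loud assumptions: the scores are invented for illustration (they echo the 60-to-90 picture above, not real data), and the tool is modeled as a flat uplift capped at the maximum, which is a simplification of my own. The names `with_assistance`, `score_range`, and `tied_at_top` are hypothetical helpers.

```python
MAX_SCORE = 100


def with_assistance(score: int, uplift: int) -> int:
    """Hypothetical model: a tool adds a flat uplift, capped at the maximum."""
    return min(MAX_SCORE, score + uplift)


def score_range(scores: list[int]) -> int:
    """Distance between the best and worst score: a crude measure of spread."""
    return max(scores) - min(scores)


def tied_at_top(scores: list[int]) -> int:
    """How many candidates share the highest score."""
    top = max(scores)
    return sum(1 for s in scores if s == top)


# Illustrative, made-up scores: a healthy spread before tools were common.
before = [60, 65, 70, 75, 80, 85, 90]
# Assumed uplift of 35 points; any large uplift gives the same shape.
after = [with_assistance(s, uplift=35) for s in before]

print(after)                                     # [95, 100, 100, 100, 100, 100, 100]
print(score_range(before), score_range(after))   # 30 5
print(tied_at_top(before), tied_at_top(after))   # 1 6
```

Under these assumptions the range shrinks from 30 points to 5, and six of seven candidates tie at the top. The cap does the damage: once scores pile up against it, the ordering that existed before is gone.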
Claim 2: Perfection Selects for Willingness to Game
If clearing your benchmark requires a flawless score, you’ve built a filter that rewards whoever was most willing to use every tool — including ones you didn’t sanction — and rejects the honest candidate who struggled with a genuinely hard problem and scored 80%. The selection effect is the damning part: among candidates of equal ability, the one who reaches for every available aid lands the 100% and the one who works within the rules lands a lower number, so your filter systematically advances the former. Imagine two equally capable applicants — one runs the whole assessment through AI, the other does it honestly and misses two hard items. Your benchmark passes the gamer and drops the honest worker. You’re selecting for exactly the trait you least want in a hire.
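The selection effect can be sketched in a few lines. Everything here is an assumption made for illustration: the `Applicant` type, the 20-point `TOOL_UPLIFT`, and the cutoff value are hypothetical, and real assessments will not behave this cleanly.

```python
from dataclasses import dataclass

MAX_SCORE = 100
BENCHMARK = 100    # hypothetical cutoff set at perfection
TOOL_UPLIFT = 20   # hypothetical points an unsanctioned tool adds


@dataclass(frozen=True)
class Applicant:
    name: str
    unaided_score: int  # what the applicant scores working within the rules
    uses_tool: bool


def observed_score(a: Applicant) -> int:
    """The only number the benchmark ever sees."""
    bonus = TOOL_UPLIFT if a.uses_tool else 0
    return min(MAX_SCORE, a.unaided_score + bonus)


def passes(a: Applicant) -> bool:
    return observed_score(a) >= BENCHMARK


# Two applicants with identical unaided ability; only tool use differs.
applicants = [
    Applicant("honest", unaided_score=80, uses_tool=False),
    Applicant("gamer", unaided_score=80, uses_tool=True),
]

advanced = [a.name for a in applicants if passes(a)]
print(advanced)  # ['gamer']
```

The filter never sees `unaided_score`, so with ability held equal, the only thing it can select on is `uses_tool`.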
Claim 3: Honest Candidates Are Being Punished
The most damning evidence comes from candidates themselves. One sandbagged on purpose: “I scored 29/30, purposefully getting 1 less to make it less obvious — so tell me why I get rejected for not reaching the benchmark.” Sit with what that reveals: a capable candidate understood that a perfect score now reads as suspicious, deliberately dropped a point to appear human, and still got rejected for falling short of a benchmark that only AI-saturated scores reach. The honest applicants who don’t think to sandbag fare worse — they answer straight, score a real 80, and lose to fabricated perfection. When candidates have to fake imperfection to seem credible, your benchmark has inverted its own purpose, punishing the exact honesty it was supposed to reward.
Claim 4: The Number Hides the Reasoning
A score is a single number with no visible thinking behind it — it tells you a candidate produced the right answers but nothing about how, or whether they understand why those answers are right. A candidate who shows their reasoning, names the tradeoff they weighed, and lands a defensible 80% gives you far more signal than a blank 100%, because the reasoning is the evidence of ability and the number is only its shadow. Concretely: a hiring manager comparing a curt perfect score against a worked-through 80% that explains a judgment call learns nothing from the first and a great deal from the second. Reading scores as a clean ranking throws away the only thing still worth evaluating — see output evaluation.
Counterarguments
“A perfect score still shows mastery.” It showed mastery when producing it required mastery. AI severed that link; now it shows access to a tool everyone has. The proof is that the score climbed without the candidates getting better. If perfection still meant mastery, the average wouldn’t have moved.
“We need an objective cutoff at volume.” Then make the cutoff about reasoning, not correctness: redesign the assessment so there’s no fixed answer to ace, and score the quality of the thinking against a rubric every reviewer applies the same way. That stays objective and stays defensible while measuring something AI can’t hand a candidate for free. See how to redesign an assessment. The objection argues for a better assessment, not for worshipping the number.
“Sandbagging is rare, so the benchmark mostly works.” Even if few candidates deliberately drop a point, the compression is universal: every honest 80 still loses to every fabricated 100, whether or not anyone sandbags. The problem isn’t the rare gamer; it’s the ceiling.
What to Do Differently
- Treat a perfect score as a prompt to probe, not a reason to advance: ask the 100% candidate to walk you through the two hardest items and listen for whether they understand their own answers.
- Redesign assessments around judgment so there’s no clean answer to game, swapping fixed-answer items for prompts that ask a candidate to choose between defensible options and justify the choice.
- Read the reasoning a candidate shows, and weight a defensible-but-imperfect answer with visible thinking over a flawless one with none.
- A workable rule: rank by the quality of reasoning first and use the raw score only as a tiebreaker, never the gate.
- Confirm everything in a structured screen, where a few follow-up questions separate the candidate who earned their answers from the one who borrowed them.
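If your shortlist lives in a spreadsheet or a script, the rank-by-reasoning rule above is a two-key sort. This is a sketch, not a recommended scoring system: the `Submission` type, the 0-to-5 `reasoning` rubric scale, and the sample candidates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    candidate: str
    reasoning: int  # hypothetical rubric score for visible thinking, 0 to 5
    raw_score: int  # percent of fixed-answer items correct


def rank(submissions: list[Submission]) -> list[Submission]:
    """Order by reasoning quality first; raw score only breaks ties."""
    return sorted(
        submissions,
        key=lambda s: (s.reasoning, s.raw_score),
        reverse=True,
    )


# Made-up examples: a blank perfect score and two worked-through answers.
submissions = [
    Submission("A", reasoning=1, raw_score=100),
    Submission("B", reasoning=4, raw_score=80),
    Submission("C", reasoning=4, raw_score=85),
]

print([s.candidate for s in rank(submissions)])  # ['C', 'B', 'A']
```

The raw score matters only between B and C, who tie on reasoning. Candidate A’s 100% never outranks stronger reasoning, and nothing in the sort acts as a pass/fail gate.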
Expert Take
I know this feels perverse — we’re trained to see 100% as the best possible outcome. But that training assumed the score was hard to get. The moment AI made perfection cheap, the perfect score flipped from a signal of ability to a signal of tool-use, and your benchmark started quietly rejecting the honest people who struggled with something genuinely hard. I’d take the candidate who reasoned their way to a defensible 80% over the blank 100% every time. Score the thinking, not the number.
Next Step
Read the pillar guide for the full rebuild, and learn to redesign assessments to resist gaming.

