
There is something wonderfully revealing about asking a chatbot to fact-check. It straightens its little digital tie. It announces a methodology. It mentions primary sources, corroboration, context, bias, quote verification, official records, archival searches, expert review, and sometimes, if it is feeling particularly LinkedIn that day, a “multi-step validation framework.” It sounds responsible. It sounds sober. It sounds like a junior employee who has just discovered the word “rigor” and now plans to use it in every meeting until someone from HR intervenes.
Then you ask it to actually verify something. That is where the costume begins to slip.
In a new WIRED piece, senior research editor and fact-checker Meghan Herbst tested the increasingly popular fantasy that AI can do fact-checking.
Not summarize. Not generate a plan. Not decorate the air with words like “credible” and “source.” Actually fact-check.
The result was less a robot apocalypse than a very familiar office problem: a system that knows what the job description sounds like but cannot reliably do the job.
Herbst took the old test she had received when applying for a fact-checking position and gave it to the free versions of ChatGPT, Claude, Gemini, and Grok. The models largely understood the assignment in the shallowest possible sense. They could describe how a fact-checker might approach a story. Some flagged legal issues. Some offered elaborate workflows. One produced the sort of professional-sounding procedural fog that makes software demos feel like governance. But when given a fairly searchable passage to check, the models did not actually check the facts. They talked about checking. They prepared to check. They performed the vibe of checking. Then they stopped before accountability entered the room.
That is the important part. The failure was not theatrical enough to become a viral disaster. Nobody invented a fake legal case in court. Nobody told someone to eat glue. Nobody destroyed a production database while wearing the expressionless smile of automation. This was quieter and therefore more useful. It exposed one of the most dangerous habits in the current AI moment: mistaking procedural language for procedural reality.
The chatbot can describe the adult in the room. That does not make it the adult.
Real fact-checking is not a scavenger hunt with better stationery. It is not simply finding a link that looks serious and placing it near a sentence that feels plausible. It is a discipline built around friction. A claim has to survive contact with sources, context, chronology, competing interpretations, missing records, ambiguous wording, institutional incentives, human memory, and the possibility that everyone involved is wrong in a slightly different way.
WIRED’s fact-checking process, as Herbst describes it, is old-school in the best possible sense. It involves line-by-line annotations, primary sources where possible, ethical and legal review, and the annoying human act of asking whether a sentence means what the writer thinks it means.
Fact-checkers call people. They inspect assumptions. They notice when a quote has been trimmed into something too convenient. They ask whether a statistic came from the study itself, the press release about the study, the executive summary of the press release, or a blog post written by someone who skimmed the executive summary during lunch.
That work is not glamorous. It is also not fully reducible to retrieval. Verification is not just the act of locating information. It is the act of deciding whether the located information deserves to bear weight.
AI systems are extremely good at pretending that this distinction is smaller than it is. A chatbot can produce a neat verification plan because verification plans are made of language, and language is the one room where these systems always look confident. But fact-checking is where language has to answer to the world. The model’s fluent paragraph has to collide with a record, a person, a timestamp, a document, an exception, a missing archive, or a source who says, “That is not what happened.”
The bot can tell you to seek primary sources. It cannot be trusted to know when it has found one.
It can tell you to verify a quote. It may also invent the paragraph that supposedly needs checking. It can warn you about bias while confidently laundering a weak citation. It can tell you to compare sources while misunderstanding why two sources disagree. It can sound cautious while being wrong with excellent posture.
This is the magic trick at the center of AI verification. The machine performs skepticism as text. Humans experience that performance as competence.
The word “verification” is now being stretched in ways that would make an old magazine fact-checker reach for coffee, aspirin, or a different profession. In many AI workflows, verification means asking one model to check another model, asking the same model to reconsider itself, attaching a citation-shaped object, or running output through a system that announces confidence without proving competence.
That may be useful in narrow circumstances. It may catch obvious contradictions. It may surface claims that deserve attention. It may help organize a messy body of material. It may even point a human toward a source worth reading. But once the process is presented as a substitute for independent judgment, the whole thing becomes theater.
The theater is convincing because it has props. There are citations. There are confidence scores. There are bulletproof-sounding product claims about retrieval, grounding, provenance, and enterprise-grade factuality. There are dashboards that make uncertainty look managed because someone put it in a rectangle. There are workflows where the output passes through several systems, each of which is the same original mistake wearing a different hat.
This is how organizations get comfortable with weak verification. They do not usually announce that they have decided to stop caring whether things are true.
They build a process that resembles care. They add steps. They add tool names. They add a review stage that nobody has time to perform properly. They call the result “human in the loop,” even when the human is exhausted, undertrained, and quietly expected to approve whatever the machine produced because productivity targets are not going to hit themselves.
The human remains present, but the accountability has moved somewhere foggier. When something goes wrong, everyone can point to the process. The model generated. The retrieval system cited. The workflow flagged. The reviewer approved. The organization documented. The sentence survived the pipeline. That does not mean it survived reality.
This is why Herbst’s WIRED test matters. The models did not fail because they were unable to produce a ritual. They failed because the ritual was not the work.
One of the best parts of the WIRED piece is its refusal to accept the internet as a synonym for knowledge. This sounds obvious until you watch how AI products are sold. Much of the industry talks as if the world has been conveniently converted into scrapeable text, indexed pages, public records, digitized books, and searchable transcripts. It has not.
A vast amount of what matters is offline, private, underdocumented, poorly scanned, misfiled, paywalled, mistranscribed, legally restricted, physically held, historically contested, or living inside someone’s memory. Some of it is in boxes. Some of it is in local archives. Some of it is in court filings that require patience. Some of it is in old newspapers with bad optical character recognition. Some of it is in the tone of a source’s voice when a factual question unexpectedly turns into grief.
A chatbot does not know what it cannot reach. Worse, it often does not know that it cannot know. That is an awkward trait in a fact-checker.
The modern AI answer engine creates a dangerous compression of uncertainty. It takes a messy information landscape and returns a clean answer. The answer may include caveats, but those caveats often function as decorative humility. They soften the performance without necessarily improving the underlying epistemology. The reader sees polish. The organization sees efficiency. The machine sees tokens.
Real verification often begins where polish ends. It asks why the obvious source may not be enough. It asks whether an official statement is self-serving, whether a study’s sample supports the claim being made, whether a quote has traveled through enough secondary sources to lose its original meaning, whether a statistic is current, whether the expert has a conflict, whether the document is authentic, and whether the missing information changes the story.
AI can assist with parts of that work. It can help identify claims. It can help compare documents. It can summarize long material for a human who still reads the source. It can generate a list of questions to ask. It can help detect patterns across large volumes of text. Those are meaningful uses. They are not the same as replacing the person whose job is to be professionally irritating on behalf of reality.
The reason AI is so dangerous in fact-checking is not that it always sounds reckless. Recklessness would be easier to spot. The larger problem is that it often sounds careful.
It has learned the language of caution. It says “may.” It says “according to available sources.” It says “I could not independently verify.” It says “It is important to consult primary sources.” It says everything a careful person might say, except the careful person would then do the hard part.
This creates a strange reversal. The more sophisticated the model sounds, the easier it becomes to overtrust the process. A bad answer with obvious nonsense invites resistance. A polished answer with citations invites relief. It tells the user that the uncertainty has been handled. It has not necessarily been handled. It has merely been formatted.
That is why fact-checking cannot be reduced to surface caution. Good verification involves judgment under conditions of ambiguity. It requires knowing when a source is authoritative for one claim but useless for another. It requires noticing that a number is technically accurate but contextually misleading. It requires understanding that a quote can be real and still unfair. It requires knowing when two true statements create a false impression when placed next to each other.
AI struggles here because the problem is not just information retrieval. It is responsibility.
The fact-checker is accountable to an editor, a publication, a reader, a source, a legal standard, and a professional norm. The chatbot is accountable to a prompt, a policy layer, a product design, and a user who may reward confidence more than correctness.
That difference matters. Accountability changes behavior. A human fact-checker knows that a lazy verification can damage someone’s reputation, expose a publication to legal risk, mislead readers, or corrupt the public record. A chatbot knows that the next token should be statistically suitable.
The obvious objection is that humans also make mistakes. Herbst’s own piece ends with a very human admission: after interviewing Angie Holan of the International Fact-Checking Network, she discovered she had forgotten to turn on her recorder. That is exactly the kind of embarrassing detail that makes the article stronger. It prevents the argument from sliding into human vanity.
Humans forget. Humans misread. Humans get tired. Humans develop biases, defend assumptions, miss context, trust the wrong person, and occasionally write sentences that should have been taken out back and buried before publication.
The case for human fact-checking is not that humans are pure instruments of truth. The case is that serious human verification is embedded in norms, procedures, consequences, and relationships that can be challenged.
A human mistake can be interrogated. Who checked the claim? What source did they rely on? Was there an editor? Was the quote recorded? Was the document obtained? Was the source qualified? Was the correction issued? Was the process improved?
AI mistakes often arrive with a different kind of slipperiness. The system may provide a false source, an irrelevant citation, a broken link, or a plausible explanation for why it cannot fully verify something while still producing the answer the user wanted. When challenged, it may apologize and generate another answer with equal confidence. The correction may be a new hallucination wearing the glasses of remorse.
That is not accountability. That is improv.
The goal should not be to ban AI from verification work. That would be both unrealistic and wasteful. The goal should be to stop mistaking assistance for replacement. AI can help gather, sort, surface, compare, and prepare. It can expand the fact-checker’s field of vision. It can also flood that field with plausible garbage if used without discipline. The difference lies in whether a human process remains genuinely in command or merely appears in the workflow as a decorative compliance object.
The deeper cultural problem is that AI arrived at exactly the moment many institutions were desperate to believe in shortcuts. Newsrooms are under pressure. Companies want productivity. Search is degraded. Social platforms are polluted. Executives are being told that any process involving language should be automated immediately, preferably before anyone asks what the process was for.
Into that environment comes a machine that can make uncertainty sound organized. Of course people want it to fact-check.
Fact-checking is slow, annoying, expensive, and occasionally socially uncomfortable. It creates delays. It ruins good lines. It tells confident people that their favorite anecdote is unsupported.
Nobody likes the fact-checker until the lawsuit arrives.
That is why the bedtime metaphor works. Asking AI to fact-check itself is like asking a child to set its own bedtime. The child may understand the concept. The child may repeat the health benefits of sleep. The child may produce a responsible plan involving pajamas, brushing teeth, and lights out at a reasonable hour. Then the child will be found at 11:45 PM under a blanket watching videos, surrounded by cookie evidence, insisting that this still counts as winding down.
The issue is not whether the child can describe bedtime. The issue is whether the child has the authority, discipline, and incentive to enforce it.
AI can describe verification. It can imitate the tone of verification. It can help organize the paperwork around verification. But the actual work still requires an entity outside the machine’s own performance of confidence.
It requires someone willing to ask the second question, make the call, read the document, notice the missing context, and say no when the answer is convenient but unsupported.
That is not nostalgia. It is governance.
The future of fact-checking will almost certainly involve AI. The sane version is not a newsroom, company, court, school, or government office handing truth maintenance to a chatbot with a nice interface. The sane version is AI used as a claim-discovery and research-assistance layer inside a human verification process that remains slow where slowness is necessary, skeptical where skepticism is protective, and accountable where the machine is not.
The chatbot can bring the clipboard. It should not be allowed to sign off on the story.