AI · Reframe

How to detect ChatGPT in student essays — and why that's the wrong question.

AI-text detectors fail at scale and disproportionately flag ESL students. There's a better frame, but you have to give up the one most institutions are still using.

TL;DR

  • "How to detect ChatGPT" is one of the most-searched queries in education since 2023. The honest answer is: you mostly can't, reliably, at scale — and several published studies plus OpenAI's own withdrawn classifier say so.
  • Modern AI-text detectors operate on perplexity and burstiness heuristics that work on average but produce false positives at rates that are unacceptable for high-stakes academic decisions — particularly against students writing in their second language.
  • The right question is not "is this AI-written?" It is "is this work consistent with this student's prior work?" Discontinuity, not generation, is what cheating actually produces.
  • Even better: prevent. A monitored exam environment where AI tools are physically unreachable removes the question entirely, for the work that's done in the exam window.

The most-searched question in post-ChatGPT education

Since November 2022, "how to detect ChatGPT" has become one of the highest-volume queries in the education segment of Google Search. Every term, lecturers re-discover the question. Every faculty meeting at every university revisits it. Every certification body opens a working group on it. The intuition is sensible: if students are using a generative model to write their essays, surely there's a way to detect that.

The intuition is also wrong. Or rather, it's partially right — there are signals you can extract — but the false-positive rate at meaningful detection thresholds makes the resulting tool unfit for high-stakes academic decisions. This is not a controversial claim within the AI research community. It's a controversial claim mostly because it's inconvenient.

This piece does two things. First, it walks through what AI-text detectors actually do and why they fail at scale. Second, it offers the reframe that works in practice, and that institutions adopting prevention-first proctoring have already converged on.

What AI-text detectors actually do

Modern AI-text detectors — GPTZero, Turnitin AI Writing Detection, Originality.ai, Copyleaks, and a long tail of smaller tools — converge on a few core signals.

Perplexity. The detector estimates how "surprising" each next word is, given the sentence so far, using a reference language model. Text generated by a frontier model tends to be lower-perplexity (more probable, more predictable) than human writing on average, because the model is, by construction, sampling from high-probability completions.

Burstiness. The detector measures the variance in sentence-level perplexity across the document. Human writers tend to have high variance — short, blunt sentences alternating with long, complex ones. Generated text tends to have lower variance, sentence to sentence, because the sampling distribution is more uniform.
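
To make the two signals concrete, here is a minimal sketch, assuming GPT-2 via the Hugging Face transformers library as the reference model. Real detectors use proprietary reference models, calibration, and many more features; this only illustrates the perplexity-and-variance idea.

```python
# Minimal sketch of the perplexity and burstiness heuristics.
# Assumes: pip install torch transformers. GPT-2 stands in for the
# (proprietary) reference model a real detector would use.
import re
import statistics

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean negative log-likelihood of each token given its prefix)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss = mean cross-entropy
    return torch.exp(out.loss).item()

def document_signals(document: str) -> tuple[float, float]:
    """Return (mean sentence perplexity, burstiness = spread across sentences)."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]
    ppls = [perplexity(s) for s in sentences]
    return statistics.mean(ppls), statistics.pstdev(ppls)
```

Lower mean perplexity and lower burstiness push a document toward an "AI-like" score. The problem, as the rest of this piece argues, is that plenty of legitimate human writing has exactly that profile.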

Stylometric signals. Vocabulary distribution, function-word frequency, punctuation patterns, paragraph-length statistics. These are the older signals from authorship-attribution literature, repurposed for AI detection.
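
A sketch of the surface statistics involved, assuming a small illustrative function-word list; production stylometry uses far larger feature sets:

```python
import re
from collections import Counter

# Illustrative subset; real stylometric systems track hundreds of function words.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "it", "for"}

def stylometric_features(text: str) -> dict[str, float]:
    """Surface statistics of the kind used in authorship attribution."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    total = max(len(words), 1)
    counts = Counter(words)
    return {
        "avg_sentence_length": total / max(len(sentences), 1),
        "type_token_ratio": len(counts) / total,                   # vocabulary spread
        "function_word_rate": sum(counts[w] for w in FUNCTION_WORDS) / total,
        "commas_per_sentence": text.count(",") / max(len(sentences), 1),
        "avg_paragraph_length": total / max(len(paragraphs), 1),
    }
```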

Each of these signals has predictive value on average. None of them is a clean classifier at the document level. The detector reports a probability, and the institution has to decide what threshold counts as "AI-written."
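
The threshold decision is where base rates bite. A back-of-envelope calculation, using illustrative numbers that are assumptions rather than measured detector performance, shows why even an apparently accurate detector yields an uncomfortable share of false accusations:

```python
def flagged_precision(true_positive_rate: float,
                      false_positive_rate: float,
                      share_actually_ai: float) -> float:
    """P(actually AI-written | flagged), by Bayes' rule."""
    flagged_ai = true_positive_rate * share_actually_ai
    flagged_human = false_positive_rate * (1 - share_actually_ai)
    return flagged_ai / (flagged_ai + flagged_human)

# Suppose 10% of submissions are substantially AI-written, and the detector
# catches 90% of them while wrongly flagging 5% of honest work.
print(flagged_precision(0.90, 0.05, 0.10))  # ~0.67: one flag in three is a false accusation
```

And that is before the ESL skew described below, which concentrates those false flags on particular groups of students.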

Why detection fails at scale

The literature on this is extensive and converging. A few representative findings:

OpenAI itself launched an "AI Classifier" in early 2023 and withdrew it in July 2023, citing low accuracy. The company that built the model that prompted the panic was unable to ship a reliable detector for its own output.

In 2023, Stanford-affiliated researchers published a study showing that GPT detectors systematically misclassified writing by non-native English speakers as AI-generated, at substantially higher rates than for native speakers. The finding has been replicated by other groups. The mechanism is plausible: ESL writing tends to use a more limited vocabulary, more standard syntactic patterns, and lower stylistic burstiness, which are exactly the surface features that detectors flag as AI-like.

Public reporting has documented further high-profile failures. Detectors have flagged the U.S. Constitution, the Bible, and Shakespeare as AI-generated at high confidence. These are anecdotes, but they're anecdotes about what the failure mode looks like — not random noise, but systematic misclassification of writing that's structurally similar to high-probability text.

Turnitin's own published guidance has shifted over time, with the company explicitly acknowledging that detector confidence scores should not be used as the sole basis for academic-misconduct findings. Most reputable detectors now position themselves as screening tools, not adjudicators.

The bias problem and the legal liability that follows

The ESL bias problem is not a footnote. It is the central reason that AI-text detection cannot be the basis of academic-misconduct decisions in any institution that takes equity seriously.

South African higher education enrols substantial cohorts of students whose first language is isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, Tshivenda, Xitsonga, or Afrikaans, and who write academic English as a second or third language. The same is true across most of the continent: French in Francophone West Africa, Arabic in North Africa, Swahili in East Africa. The same is true in much of Europe, in Southeast Asia, in Latin America. Globally, the majority of academic writing in English is done by people for whom English is not their first language.

A detector that systematically over-flags ESL writers does not just produce a higher false-positive rate in those populations. It produces a discriminatory false-positive rate, with all the legal and reputational consequences that follow when the institution makes academic-misconduct findings on that basis. Several universities have already faced litigation along these lines; more are coming.

The compliance posture in 2026 is straightforward: AI-text detector output, alone, is not adequate evidence for an academic-misconduct finding. The institutions that ignore this are exposing themselves to appeals they will lose.

The right question

The detector frame asks: "Is this document AI-generated?" The signal-to-noise ratio is too low to answer it reliably.

The better frame is: "Is this work consistent with this student's prior work?"

The shift sounds subtle but it changes everything. Cheating, of any kind — not just AI cheating — produces a specific signature: discontinuity. The student who turned in B-grade analytical writing for the first six weeks of term and produces a flawless A+ stylistic tour de force in week seven has a discontinuity to explain. The student who consistently writes in one register and suddenly produces work in a different register has a discontinuity to explain. The student whose vocabulary range expanded by three standard deviations between assignments has a discontinuity to explain.

This frame doesn't ask the unanswerable question (was a model used?). It asks a question that is both answerable and educationally relevant: did this student do this work? Discontinuity-based analysis can be operationalised: build a writing profile per student over time; compare new submissions to that profile; flag the gaps.
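
Operationalised, that looks something like the sketch below. It reuses the stylometric_features helper sketched earlier, builds a per-student baseline from prior submissions, and flags features of a new submission that sit far outside the student's own distribution. The three-standard-deviation threshold is an assumption, and the output is advisory, not a verdict.

```python
import statistics

def build_profile(prior_texts: list[str]) -> dict[str, tuple[float, float]]:
    """Per-feature (mean, stdev) over a student's earlier submissions."""
    rows = [stylometric_features(t) for t in prior_texts]  # helper sketched above
    profile = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        profile[key] = (statistics.mean(values), statistics.pstdev(values) or 1e-9)
    return profile

def discontinuity_flags(profile: dict[str, tuple[float, float]],
                        new_text: str,
                        z_threshold: float = 3.0) -> dict[str, float]:
    """Features of a new submission far outside the student's own baseline.

    Returns {feature: z-score}. A non-empty result is a conversation-starter
    for a human, never a misconduct finding on its own.
    """
    features = stylometric_features(new_text)
    flags = {}
    for key, (mean, stdev) in profile.items():
        z = (features[key] - mean) / stdev
        if abs(z) > z_threshold:
            flags[key] = round(z, 1)
    return flags
```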

It's not perfect either. A student who genuinely improves over the semester will look discontinuous. So will a student who got real, legitimate help from a tutor. The signals are advisory, not adjudicative. But they're a foundation for human judgment, where AI-text detector output is not.

Even better: prevention

The detection-vs-discontinuity reframe is for the work students do between exams: weekly assignments, take-home essays, dissertation drafts. For these, you cannot prevent AI use, and you shouldn't try to prevent it completely. AI is now a research tool the way calculators became research tools. Asking whether a student used GPT to brainstorm an essay outline is, in 2026, mostly the wrong question.

For the moments where it has to be the student's own work — high-stakes exams, certification assessments, capstone defences — the prevention frame is decisive. If the exam runs in a hardened lock-screen environment with no path to a generative model, the detection question doesn't apply. There is nothing to detect, because the cheating channel is closed.

This is not an argument for never assigning take-home work. It's an argument for designing the assessment system around what each component can actually do. Take-home work measures process, research, and synthesis — including with AI assistance. High-stakes proctored exams measure individual capability without AI assistance. The credential is the combination, not either piece in isolation.

What to do tomorrow

If you're a lecturer, dean, or registrar trying to make this practical, here's a short list:

  1. Stop using AI-text detector output as the basis for misconduct findings. If you currently do, audit recent cases for ESL-bias exposure and consider proactive remediation.
  2. Move from detection to discontinuity. Build a writing profile per student from early-semester work; flag stylistic and vocabulary gaps in later submissions; treat them as discussion-starters, not verdicts.
  3. Redesign assessment intentionally. Take-home work should reward process, citation, research depth, and synthesis — including with AI. High-stakes exams should be proctored in environments where AI is unreachable.
  4. For the high-stakes pieces, choose proctoring that prevents. An OS-level lock-screen across the device the student writes on. Continuous identity verification. Dual-camera coverage of the workspace. The cheating doesn't happen, so the detection question doesn't have to be answered.
  5. Communicate the policy explicitly to students. AI is allowed here, prohibited there, with these consequences. Clarity protects everyone — students, lecturers, the institution's appeal posture.

Conclusion

"How to detect ChatGPT in student essays" is the wrong question because it cannot be answered reliably enough to base academic-misconduct decisions on. The detectors are biased against the populations of students who can least afford additional academic friction. The arms race favours generation over detection and has done so for three years.

The right question, for take-home work, is whether the work is consistent with the student's prior writing. The right question, for high-stakes assessment, is how to prevent AI use entirely during the moments when the credential depends on individual capability.

Detection is the question that sounds like the answer. Prevention and consistency are the questions that actually have answers.

Crux

Stop trying to detect. Start preventing.

Crux runs the high-stakes exam in a hardened environment where AI tools are unreachable — not because we're better at detection, but because the cheating channel is closed.

Request a demo