
The original Turing test revolved around convincing humans that a machine was a human. Most people don't realize it, but the original ELIZA program could do this for a while. With this in mind, we see that the test is less about whether machines can convince humans than about how long they can, and, as I will add shortly, through what modality.
Today, in January 2026, we can create bots nearly indistinguishable from humans over timescales unimaginable in Turing's era (the 1950s). As a result, the original Turing test is almost useless as initially envisioned: it's now trivially passed, at least for the duration Turing had in mind. So what comes next?
2013: People fooled by robocalls: https://singularityhacker.com/voight-kampff-machines.
2025: Large Language Models Pass the Turing Test
In place of Turing's original test, we have a growing list of technical and academic evaluations (evals). These are useful, but many have pointed out that model providers are acing them out of competitive necessity, and because benchmark data likely leaks into the models during training.
See: https://www.vellum.ai/llm-leaderboard
These evals are also hard for an ordinary person to relate to. What does it really mean to say a model scored X on this or that benchmark? What we need is another test that's as relatable and understandable as the original Turing test. No standard eval has the simplicity or universal relevance that the original possessed.
We need a new test that's easy for anyone to understand and authenticate, just like the original, and one not tied to domain- or culture-specific knowledge. It should last as long as the original did and pay homage to the man who invented the field. The spirit of the original test should be preserved but updated for today's multimodal realities, because deception isn't just about text anymore.
Key properties:
Relevant to any human
Comparable to the original test
Relevant to the foreseeable future
With these things in mind, I propose something called The Turing Scale. The Turing Scale measures not whether an AI can fool a human, but how long, through which modality, and under what constraints.
It's true to the original test but suited to a post-Turing-test world. The scale measures how long an agent in a given modality can interact with a human before being detected as artificial.
The Turing Scale has two dimensions: modality and duration. (A short code sketch of the rubric follows the duration bands below.)
Modality asks: through what channel is the deception happening? Each presents unique challenges:
Text requires symbolic reasoning and linguistic coherence. Can it maintain a conversation without revealing its non-human nature?
Audio demands timing, prosody, turn-taking, and emotional realism. Does the voice feel alive or synthetic?
Video needs embodiment, micro-expressions, and physical plausibility. Can it move and react like a real person?
IRL (in-real-life physical interaction) requires closed-loop perception, motor control, and social presence.
Duration asks: how long can it maintain the illusion?
Short (1–5 minutes): Most interactions start here. Quick exchanges, first impressions.
Medium (5–30 minutes): Sustained conversation. This is where most current AI fails.
Long (30–120 minutes): Extended interaction. The AI must maintain consistency, memory, and natural flow.
Undetectable (beyond 2 hours): Indistinguishable. At this point, the deception isn't temporary—it's complete.
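
To make the rubric concrete, here is a minimal sketch of the scale as a data structure, in Python. The names (Modality, DURATION_BANDS, TuringScaleRating) are my own illustration, not part of any spec:

```python
from dataclasses import dataclass
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    AUDIO = "audio"
    VIDEO = "video"
    IRL = "irl"

# Duration bands in minutes; "undetectable" is open-ended. The article's
# Short band starts at 1 minute; anything under a minute is folded into
# Short here so the mapping is total.
DURATION_BANDS = {
    "short": (0, 5),
    "medium": (5, 30),
    "long": (30, 120),
    "undetectable": (120, float("inf")),
}

@dataclass
class TuringScaleRating:
    modality: Modality
    minutes_before_detection: float

    @property
    def band(self) -> str:
        """Map detection time onto the named duration band."""
        for name, (lo, hi) in DURATION_BANDS.items():
            if lo <= self.minutes_before_detection < hi:
                return name
        return "undetectable"  # only reachable for non-finite inputs

# Example: a voice agent detected as artificial after 12 minutes
# rates Medium on audio.
print(TuringScaleRating(Modality.AUDIO, 12).band)  # -> "medium"
```

A rating is then just a (modality, band) pair: "Audio: Medium," and so on.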

Imagine a website that randomly pairs participants, human or AI, within a given modality and then forces each side to guess what the other is. Because both the AIs and the humans must guess, the outcome is double-blind. You'd eventually reach a "Voight-Kampff cliff" where only the AIs could still accurately identify each other. We'd also be creating a self-fulfilling prophecy: the training data generated from these interactions would improve the AIs' ability to pass as human over time.
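
Here is a minimal sketch of what one round on such a site might record, assuming each participant (human or AI) exposes some guess-making interface; everything here, including the coin-flip placeholder judge, is hypothetical:

```python
import random
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    is_human: bool

    def guess_is_human(self, other: "Participant", modality: str) -> bool:
        # Placeholder judge: on the real site this would be a live
        # conversation over the given modality, ending in a forced guess.
        return random.random() < 0.5

def run_round(pool: list[Participant], modality: str) -> dict:
    """Randomly pair two participants and record both blind guesses."""
    a, b = random.sample(pool, 2)
    return {
        "modality": modality,
        "a_correct": a.guess_is_human(b, modality) == b.is_human,
        "b_correct": b.guess_is_human(a, modality) == a.is_human,
    }

# The "Voight-Kampff cliff" would appear as AI participants' accuracy
# holding up while human participants' accuracy decays toward chance.
pool = [Participant("alice", True), Participant("bot-1", False),
        Participant("bob", True), Participant("bot-2", False)]
print(run_round(pool, "text"))
```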

One practical problem is that major AI companies have policies that explicitly prohibit their models from pretending to be human.
OpenAI argues that it's unethical for agents to pretend to be human. Anthropic prohibits impersonating a human by presenting AI outputs as human-generated or using them in ways that convince a natural person they are communicating with a human when they are not.
Google's Generative AI Prohibited Use Policy forbids impersonating an individual (living or dead) without explicit disclosure if the intent is to deceive, and Meta's Acceptable Use Policy bans intentionally deceiving or misleading others, including by impersonating an individual without consent or representing AI outputs as human-generated.
But read literally, these policies would put even the original Turing test out of bounds. There may be a loophole: within the confines of this test, it's obviously welcome and expected for models to attempt to pass themselves off as human. The test itself provides the disclosure: the very act of taking the test signals that deception is part of the game. The original Turing test already assumed informed consent, and the Turing Scale would preserve that assumption.
Where do we stand today? Here's how each modality currently rates on the Turing Scale:
Text sits at Medium (moving toward Long). GPT-4.5 was judged human 73% of the time in 5-minute chats (arXiv), and the “Human-or-Not?” experiment showed humans achieving only ~60% accuracy at distinguishing AI from human at the 2-minute mark (artisana.ai). Text-based deception is already trivial.
Audio is also at Medium. A 2025 Nature study found people detected AI voices only ~60% of the time (Nature), barely better than chance. The real-world evidence is even starker: scams exploiting cloned voices are on the rise. Voice synthesis has crossed the threshold.
Video is transitioning from Short to Medium. A 2024 Waterloo study showed humans were only 61% accurate at distinguishing AI-generated faces from real ones (Science Daily). By 2025, DeepStrike reported that detection of high-quality video deepfakes had dropped to ~24.5% accuracy (DeepStrike). We're losing the ability to tell what's real.
Physical (IRL) remains below Short. Figure 01 and OpenAI demos show impressive progress in speech, reasoning, and simple tasks (New Atlas), but robots remain fragile. NVIDIA's "Physical Turing Test" is still aspirational (dev.to). This is the final frontier, and we're not there yet.
This is less about utility than about demonstrating capability. The fastest plane can fly at Mach 3, but commercial jets cruise at Mach 0.85; the capability exists even where it isn't deployed. Given the march of digital power laws, full indistinguishability is probably inevitable within the decade for the digital modalities.
Audio AI is the bleeding edge today, and while the telltale signs are evident, they're also clearly fixable (a simple scoring sketch follows this list):
Latency: How fast does the reply come? Delay makes things robotic.
Interruptions/overlap: Can you interrupt it, or does it always wait until it finishes speaking? Humans sometimes jump in or overlap.
Prosody/emotion: Does the voice rise/fall naturally? Does it pick up mood, tone, or adapt?
Mistakes/stutters/false starts: Human speech has filler words, pauses, and repeated words ("uh," "um," etc.). If it never has those, it might feel synthetic.
Consistency/memory: Does the system remember things you said earlier, refer back, and avoid contradictions?
Errors/misunderstandings: How often does it mishear or misinterpret? How gracefully does it recover?
Nonverbal cues: Real speech includes breathing sounds, slight mouth noises, and micro-inflections. If these are missing, the voice feels "too perfect."
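
One hypothetical way to operationalize this checklist: rate each cue from 0 (clearly synthetic) to 1 (indistinguishable from human) and average. The cue names and equal weighting are illustrative assumptions, not a validated rubric:

```python
# Cues from the list above, rated 0.0 (clearly synthetic) to 1.0 (human-like).
AUDIO_CUES = ["latency", "interruptions", "prosody", "disfluencies",
              "memory", "recovery", "nonverbal"]

def audio_humanness(scores: dict[str, float]) -> float:
    """Average per-cue ratings into a single 0-1 humanness score;
    unrated cues count as 0 (i.e., obviously synthetic)."""
    return sum(scores.get(cue, 0.0) for cue in AUDIO_CUES) / len(AUDIO_CUES)

# Example: good latency and prosody, but almost no disfluencies and
# none of the other cues rated.
print(audio_humanness({"latency": 0.9, "prosody": 0.8, "disfluencies": 0.1}))
# -> ~0.257, i.e., still far from passing
```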
While all of these are likely to be addressed within the next two years, a shared anthropomorphic scoring rubric, such as The Turing Scale, provides a framework for measuring and recognizing when we've crossed thresholds that once seemed impossible.
The original Turing Test asked if machines could imitate us. The Turing Scale asks when, where, and for how long imitation collapses into indistinguishability.
Further reading:

The DAO Playbook (v1)

The Value of DAO Project Managers

All Worlds End in an AI Singularity