The University of California, San Diego, unveiled a research study on Tuesday that claims to provide the “first empirical evidence that any artificial system can pass a standard three-party Turing test”.
Alan Turing, a British mathematician and computer scientist, introduced the ‘imitation game’ in 1950, proposing that if an interrogator could not distinguish between a machine and a human through text conversation, the machine might possess human-like intelligence. In a three-party Turing test, an interrogator converses simultaneously with both a human and a machine and must identify which one is the human.
The research tested three AI models: OpenAI’s GPT-4.5, Meta’s Llama 3.1 405B, and OpenAI’s GPT-4o. Human participants engaged in five-minute test conversations with one human and one AI system using a split-screen interface. After each round, the interrogator selected the participant they believed was human.
The AI models were evaluated under two conditions: a minimal instruction (NO-PERSONA) prompt and an enhanced PERSONA prompt that guided the AI to adopt a specific human-like demeanor.
The results indicated that GPT-4.5 with the PERSONA prompt achieved a win rate of 73%, suggesting that interrogators often mistook it for a human. Llama 3.1-405B with the PERSONA prompt attained a win rate of around 56%, whereas GPT-4o under NO-PERSONA conditions only reached a win rate of 21%.
Interrogators primarily engaged in small talk, asking about daily activities and personal details in 61% of interactions, and also probed social and emotional aspects such as opinions, emotions, humour, and experiences in 50% of interactions.
“If interrogators are not able to reliably distinguish between a human and a machine, then the machine is said to have passed [the Turing test]. By this logic, both GPT-4.5 and Llama-3.1-405B pass the Turing Test when they are given prompts to adopt a human-like persona,” read a section of the research study.
The authors stated that these systems might seamlessly supplement or even replace human labour in economic roles that rely on brief conversational exchanges.
“More broadly, these systems could become indiscriminable substitutes for other social interactions, from conversations with strangers online to those with friends, colleagues, and even romantic companions,” the authors added.
OpenAI released the GPT-4.5 model in February, and it was widely appreciated for its thoughtful and emotional responses. Ethan Mollick, a professor at The Wharton School, said on X, “It can write beautifully, is very creative, and is occasionally oddly lazy on complex projects.” He even joked that the model had taken a “lot more” classes in the humanities.