
GPT-5 is the latest version of OpenAI’s large language model
Cheng Xin/Getty Images
AI’s latest step forward isn’t so much a giant leap as a tentative shuffle. OpenAI has released its newest AI model, GPT-5, two years after rolling out GPT-4, whose success has driven ChatGPT towards world domination. But despite promises of a similar jump in capability, GPT-5 appears to show little improvement over other leading AI models, hinting that the industry may need a fresh approach to build more intelligent AI systems.
OpenAI’s own pronouncements hail GPT-5 as a “significant leap in intelligence” from the company’s previous models, showing apparent improvements in programming, mathematics, writing, health information and visual understanding. It also promises less frequent hallucinations, in which an AI presents false information as true. On an internal benchmark measuring “performance on complex, economically valuable knowledge work”, OpenAI says GPT-5 is “comparable to or better than experts in roughly half the cases… across tasks spanning over 40 occupations including law, logistics, sales, and engineering.”
However, GPT-5’s performance on public benchmarks isn’t dramatically better than leading models from other AI companies, like Anthropic’s Claude or Google’s Gemini. It has improved on GPT-4, but the difference for many benchmarks is smaller than the leap from GPT-3 to GPT-4. Many ChatGPT customers have also been unimpressed, with examples of GPT-5 failing to answer seemingly simple queries receiving widespread attention on social media.
“A lot of people hoped that there would be a breakthrough, and it’s not a breakthrough,” says Mirella Lapata at the University of Edinburgh, UK. “It’s an upgrade, and it feels kind of incremental.”
The most comprehensive measures of GPT-5’s performance come from OpenAI itself, since only it has full access to the model. Few details about the internal benchmark have been made public, says Anna Rogers at the IT University of Copenhagen in Denmark. “Hence, it is not something that can be seriously discussed as a scientific claim.”
In a press briefing before the model’s launch, OpenAI chief executive Sam Altman claimed “GPT-5 is the first time that it really feels like talking to an expert in any topic, like a PhD-level expert.” But this isn’t supported by benchmarks, says Rogers, and it is unclear how a PhD relates to intelligence more generally. “Highly intelligent people don’t necessarily have PhD degrees, and having such a degree doesn’t necessarily guarantee high intelligence,” says Rogers.
GPT-5’s apparently modest improvements might be a sign of wider difficulties for AI developers. Until recently, it was thought that such large language models (LLMs) become more capable with more training data and computing power. It appears this is no longer borne out by the results of the latest models, and companies have failed to find better AI system designs than those that have powered ChatGPT. “Everybody has the same recipe right now and we know what the recipe is,” says Lapata, referring to the process of pre-training models with a large amount of data and then making adjustments with post-training processes afterwards.
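As a loose illustration of that two-stage recipe, the toy sketch below “pre-trains” a tiny bigram model by counting word pairs in a corpus and then “post-trains” it by reweighting continuations with feedback. The corpus, the bigram model and the feedback rule are invented stand-ins for illustration only and bear no relation to how GPT-5 itself is built.

```python
# Toy sketch of the pre-train-then-post-train recipe. Everything here
# (bigram counts, the feedback rule) is an illustrative stand-in, not
# how GPT-5 or any production LLM is actually trained.
from collections import defaultdict

def pretrain(corpus):
    """'Pre-training': count bigram frequencies over a corpus."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1.0
    return counts

def post_train(model, feedback, step=0.5):
    """'Post-training': nudge counts using (prev, next, reward) feedback."""
    for prev, nxt, reward in feedback:
        model[prev][nxt] = max(model[prev][nxt] + step * reward, 0.0)
    return model

def next_word(model, prev):
    """Pick the highest-weighted continuation for a word."""
    options = model.get(prev)
    return max(options, key=options.get) if options else None

corpus = ["the model answers questions", "the model makes things up"]
model = pretrain(corpus)
# Penalise the unwanted continuation and reward the preferred one.
model = post_train(model, [("model", "makes", -2.0), ("model", "answers", +1.0)])
print(next_word(model, "model"))  # prints 'answers' after post-training
```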
However, it is difficult to say how close LLMs are to stagnating because we don’t know exactly how models like GPT-5 are designed, says Nikos Aletras at the University of Sheffield, UK. “Trying to make generalisations about [whether] large language models have hit a wall might be premature. We can’t really make these claims without any information about the technical details.”
OpenAI has been working on other ways to make its product more efficient, such as GPT-5’s new routing system. Unlike previous versions of ChatGPT, in which people could choose which AI model to use, GPT-5 scans each request and directs it to a specific model that will use an appropriate amount of computational power.
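A minimal sketch of what such a router might look like is below, assuming a simple heuristic decides whether a prompt needs a lightweight model or a heavier reasoning model. The model names and the length-and-keyword heuristic are hypothetical placeholders; OpenAI has not published how GPT-5’s router actually works.

```python
# Minimal request-routing sketch: a cheap heuristic sends easy prompts to a
# small model and hard ones to a large reasoning model. Model names and the
# heuristic are placeholders, not OpenAI's actual routing logic.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelEndpoint:
    name: str
    cost_per_call: float          # illustrative relative cost
    generate: Callable[[str], str]

def looks_complex(prompt: str) -> bool:
    """Placeholder heuristic: long prompts or 'reasoning' keywords need the big model."""
    keywords = ("prove", "step by step", "debug", "analyse")
    return len(prompt.split()) > 40 or any(k in prompt.lower() for k in keywords)

def route(prompt: str, small: ModelEndpoint, large: ModelEndpoint) -> str:
    """Dispatch the request to whichever model seems appropriate for its difficulty."""
    endpoint = large if looks_complex(prompt) else small
    return f"[{endpoint.name}] " + endpoint.generate(prompt)

# Stand-in 'models' that just echo the prompt.
small = ModelEndpoint("fast-small", 1.0, lambda p: f"quick answer to: {p}")
large = ModelEndpoint("slow-reasoner", 10.0, lambda p: f"careful answer to: {p}")

print(route("What time is it in Tokyo?", small, large))
print(route("Prove that the square root of 2 is irrational, step by step.", small, large))
```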
This approach might be adopted more widely, says Lapata. “The reasoning models use a lot of [computation], and this takes time and money,” she says. “If you can answer it with a smaller model, we will see more of that in the future.” But the move has angered some ChatGPT customers, prompting Altman to say the company is looking at improving the routing process.
There are more positive signs for the future of AI in a separate OpenAI model that has achieved gold medal scores in elite mathematics and coding competitions in the past month, something that top AI models couldn’t do a year ago. While details of how the model works are again scant, OpenAI employees have said its success suggests the system has more general reasoning capabilities.
These competitions are useful for testing models on data they haven’t seen during their training, says Aletras, but they are still narrow tests of intelligence. Improving a model’s performance in one area might also make it worse at others, says Lapata, a trade-off that can be difficult to keep track of.
One area where GPT-5 has improved significantly is price: it is now far cheaper than other leading models – Anthropic’s best Claude model, for example, costs about 10 times as much to process the same number of requests at the time of writing. But this could present its own problems in the long run, if OpenAI’s income doesn’t cover the vast costs it has committed to in building and running new data centres. “The pricing is insane. It’s so cheap I don’t know how they can afford this,” says Lapata.
Competition between the top AI models is fierce, especially with the expectation that the first model to pull ahead of the others will take most of the market share. “All these big companies, they’re trying to be the one winner, and this is hard,” says Lapata. “You’re a winner for three months.”