On Monday, Anthropic released Claude 3, a family of three AI language models similar to those that power ChatGPT. Anthropic claims the models set new industry benchmarks across a range of cognitive tasks, even approaching “near-human” capability in some cases. Claude 3 is available now through Anthropic’s website, with the most powerful model being subscription-only. It’s also available via API for developers.
Claude 3’s three models represent increasing complexity and parameter count: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Sonnet powers the Claude.ai chatbot now for free with an email sign-in. But as mentioned above, Opus is only available through Anthropic’s web chat interface if you pay $20 a month for “Claude Pro,” a subscription service offered through the Anthropic website. All three feature a 200,000-token context window. (The context window is the number of tokens—fragments of a word—that an AI language model can process at once.)
We covered the launch of Claude in March 2023 and Claude 2 in July of last year. Each of those times, Anthropic fell slightly behind OpenAI’s best models in capability while surpassing them in terms of context window length. With Claude 3, Anthropic has perhaps finally caught up with OpenAI’s released models in terms of performance, although there is no consensus among experts yet—and the presentation of AI benchmarks is notoriously prone to cherry-picking.
Claude 3 reportedly demonstrates advanced performance across various cognitive tasks, including reasoning, expert knowledge, mathematics, and language fluency. (Despite the lack of consensus over whether large language models “know” or “reason,” the AI research community commonly uses those terms.) The company claims that the Opus model, the most capable of the three, exhibits “near-human levels of comprehension and fluency on complex tasks.”
That’s quite a heady claim and deserves to be parsed more carefully. It’s probably true that Opus is “near-human” on some specific benchmarks, but that doesn’t mean that Opus is a general intelligence like a human (consider that pocket calculators are superhuman at math). So it’s a purposely eye-catching claim that can be watered down with qualifications.
According to Anthropic, Claude 3 Opus beats GPT-4 on 10 AI benchmarks, including MMLU (undergraduate level knowledge), GSM8K (grade school math), HumanEval (coding), and the colorfully named HellaSwag (common knowledge). Several of the wins are very narrow, such as 86.8 percent for Opus vs. 86.4 percent for GPT-4 on a five-shot trial of MMLU, while some gaps are large, such as 90.7 percent on HumanEval vs. GPT-4’s 67.0 percent. But what that might mean, exactly, to you as a customer is difficult to say.
“As always, LLM benchmarks should be treated with a little bit of suspicion,” says AI researcher Simon Willison, who spoke with Ars about Claude 3. “How well a model performs on benchmarks doesn’t tell you much about how the model ‘feels’ to use. But this is still a huge deal—no other model has beaten GPT-4 on a range of widely used benchmarks like this.”
A wide range in price and performance
Compared to Claude 2, the Claude 3 models show improvements in areas such as analysis, forecasting, content creation, code generation, and multilingual conversation. The models also reportedly feature enhanced vision capabilities, allowing them to process visual formats like photos, charts, and diagrams, similar to GPT-4V (in subscription versions of ChatGPT) and Google’s Gemini.
Anthropic emphasizes the three models’ increased speed and cost-effectiveness compared to previous generations and competing models. Opus (the largest model) is $15 per million input tokens and $75 per million output tokens, Sonnet (the middle model) is $3 per million input tokens and $15 per million output tokens, and Haiku (the smallest, fastest model) is $0.25 per million input tokens and $1.25 per million output tokens. In comparison, OpenAI’s GPT-4 Turbo via API is $10 per million input tokens and $30 per million output tokens. GPT-3.5 Turbo is $0.50 per million input tokens and $1.50 per million output tokens.
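To put those per-million-token prices in more concrete terms, here is a minimal Python sketch that estimates what a single API request would cost under each model. The prices come from the figures quoted above; the function name `request_cost` and the example workload of 100,000 input tokens and 2,000 output tokens are our own illustrative assumptions, not anything published by Anthropic or OpenAI, and real bills may differ.

```python
# Prices in USD per million tokens as quoted in this article: (input, output).
PRICES_PER_MILLION = {
    "Claude 3 Opus": (15.00, 75.00),
    "Claude 3 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "GPT-4 Turbo": (10.00, 30.00),
    "GPT-3.5 Turbo": (0.50, 1.50),
}


def request_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one request (hypothetical helper)."""
    input_price, output_price = PRICES_PER_MILLION[model]
    # Scale each per-million price by the number of tokens actually used.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Illustrative workload (an assumption): a long document in, a short summary out.
for model in PRICES_PER_MILLION:
    print(f"{model}: ${request_cost(model, 100_000, 2_000):.4f}")
```

Under that assumed workload, a request to Opus works out to about $1.65 versus roughly $0.03 for Haiku, which helps explain why the pricing spread drew attention.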
When we asked Willison about his impressions of Claude 3’s performance, he said he had not gotten a feel for it yet, but the API pricing for each model had immediately caught his eye. “The unreleased cheapest one looks radically competitive,” says Willison. “The best quality one is super expensive.”
In other miscellaneous notes, the Claude 3 models reportedly can handle up to 1 million tokens for select customers (similar to Gemini Pro 1.5), and Anthropic claims that the Opus model achieved near-perfect recall in a benchmark test across that massive context size, surpassing 99 percent accuracy. Also, the company states that the Claude 3 models are less likely to refuse harmless prompts and demonstrate higher accuracy while reducing incorrect answers.
According to a model card released with the models, Anthropic achieved Claude 3’s capability gains in part through use of synthetic data in the training process. Synthetic data means data generated internally using another AI language model, and the technique can serve as a way to broaden the depth of the training data to represent scenarios that might be lacking in a scraped dataset. “The synthetic data thing is a big deal,” says Willison.
Anthropic plans to release frequent updates to the Claude 3 model family in the coming months, along with new features such as tool use, interactive coding, and “advanced agentic capabilities.” The company says it remains committed to ensuring that safety measures keep pace with advancements in AI performance and that the Claude 3 models “present negligible potential for catastrophic risk at this time.”
The Opus and Sonnet models are available now through Anthropic’s API, with Haiku to follow soon. Sonnet is also accessible through Amazon Bedrock and in private preview on Google Cloud’s Vertex AI Model Garden.
A word about LLM benchmarks
We signed up for Claude Pro to try Opus for ourselves with a few informal tests. Opus feels similar in capability to ChatGPT-4. It can’t write original dad jokes (all appear to be scraped from the web), it’s pretty good at summarizing information and composing text in various styles, it fares pretty well in logical analysis of word problems, and confabulations do indeed seem relatively low (but we saw a few slip in when asking about more obscure topics).
None of that is a definitive pass or fail, and that can be frustrating in a world where computer products typically output hard numbers and quantifiable benchmarks. “Yet another case of ‘vibes’ as a key concept in modern AI,” as Willison told us.
AI benchmarks are tricky because the effectiveness of any AI assistant is highly variable based on the prompts used and the conditioning of the underlying AI model. AI models can perform well “on the test” (so to speak) but fail to generalize those capabilities to novel situations.
Also, AI assistant effectiveness is highly subjective (hence Willison’s “vibes”). That’s because having an AI model succeed at doing what you want to do is difficult to quantify (say, in a benchmark metric) when the task you give it might be literally any task in any intellectual field on earth. Some models work well for some tasks and not for others, and that can vary from person to person by task and prompting style.
This goes for every single large language model from vendors such as Google, OpenAI, and Meta—not just Claude 3. Over time people have found that each model has its own quirks, and each model’s strengths and weaknesses can be embraced or worked around using certain prompting techniques. Right now, it looks like the major AI assistants are settling into a suite of very similar capabilities.
The upshot is that when Anthropic says Claude 3 can outperform GPT-4 Turbo, which is still widely seen as the market leader in terms of general capability and low hallucinations, one needs to take that with a grain of salt, or a dose of vibes. If you’re considering different models, it’s key to personally test each model to see if it fits your application, because it’s likely no one else can replicate the exact set of circumstances in which you would use it.