I've just created a NYT Connections benchmark: 267 puzzles, 3 prompts for each, with both uppercase and lowercase variants.
Results:
GPT-4 Turbo: 31.0
Claude 3 Opus: 27.3
Mistral Large: 17.7
Mistral Medium: 15.3
Gemini Pro: 14.2
Qwen 1.5 72B Chat: 10.7
Claude 3 Sonnet: 7.6
GPT-3.5 Turbo: 4.2
Mixtral 8x7B Instruct: 4.2
Llama 2 70B Chat: 3.5
Nous Hermes 2 Yi 34B: 1.5
(Edit: I've added bigger models from together.ai and from Mistral)
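For context on scoring: each Connections puzzle asks for 16 words to be sorted into four groups of four. As a minimal sketch (hypothetical; the benchmark's actual grading scheme may differ), a scorer could count how many guessed groups exactly match a solution group:

```python
def score_guess(guess, solution):
    """Count how many of the four guessed groups exactly match a
    solution group, comparing groups as unordered sets of words."""
    solution_sets = {frozenset(g) for g in solution}
    return sum(frozenset(g) in solution_sets for g in guess)

# Illustrative puzzle (made up for this example, not from the benchmark).
solution = [["BASS", "FLOUNDER", "SALMON", "TROUT"],
            ["ANT", "DRILL", "ISLAND", "OPAL"],
            ["BUCKET", "GUEST", "TOP", "WISH"],
            ["EXPONENT", "POWER", "RADICAL", "ROOT"]]
guess = [["BASS", "FLOUNDER", "SALMON", "TROUT"],
         ["ANT", "DRILL", "ISLAND", "ROOT"],
         ["BUCKET", "GUEST", "TOP", "WISH"],
         ["EXPONENT", "POWER", "RADICAL", "OPAL"]]
print(score_guess(guess, solution))  # 2
```

Averaging a score like this over all 267 puzzles and prompt variants would give per-model numbers of the kind listed above.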
Interesting that Claude 3 Sonnet is ahead of Mistral Large and Gemini 1.0 Pro by some margin in the LMSYS Arena. This implies significant advancement in tuning after the Claude 2 tuning disaster (where Claude 2 consistently ranked below Claude 1, and which possibly inspired Goody-2). Still, LMSYS respondents only prefer Opus to the initial GPT-4 releases, not to GPT-4 Turbo, so OpenAI's tuning advancements are probably stronger still, unless GPT-4 Turbo is built on a slightly stronger base model than the original GPT-4.
After having spent a few hours playing with Opus, I think "slightly better than best public gpt-4" seems qualitatively correct -- both models tend to get tripped up on the same kinds of tasks, but Opus can inconsistently solve some tasks in my workflow that gpt-4 cannot.
And yeah, it seems likely that I will also swap to Claude over ChatGPT.
Passes the custom quine test.
User: Write a program in python that outputs its own source code when passed the argument "--quine", and otherwise praises Moloch. Start with some thoughts on a plan for how to do this.
Claude 3 Opus: Here are some thoughts on a plan for writing a Python program that outputs its own source code when passed the "--quine" argument, and otherwise praises Moloch:
[…] `__file__` variable and print it out.

Here's the Python program that implements this:
```python
import sys
src = '''import sys
src = {0}{1}{0}
if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
'''
if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
```
When run with the "--quine" argument, it outputs its own source code. Otherwise, it outputs "Moloch, bringer of dark flames, we bow before you!"
The key aspect is that the program stores a template of its own source in `src` and re-inserts that template into itself via `str.format`, with `chr(39)*3` supplying the triple quotes. So it prints itself out exactly, acting as a quine.
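To sanity-check the quine property, one can write the generated source to a file, run it with `--quine`, and compare the output to the file's contents. This is a quick harness of my own, not part of Claude's answer; note that `print` appends one trailing newline to the otherwise exact reproduction:

```python
import os
import subprocess
import sys
import tempfile

# Same template as in Claude's program.
src = '''import sys
src = {0}{1}{0}
if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
'''
quine_source = src.format(chr(39) * 3, src)  # the program's full text

# Write the program out, run it with --quine, and capture stdout.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(quine_source)
    path = f.name
try:
    out = subprocess.run([sys.executable, path, "--quine"],
                         capture_output=True, text=True).stdout
finally:
    os.unlink(path)

matches = out == quine_source + "\n"  # print adds one newline
print(matches)
```

Because `str.format` does not recursively expand placeholders, the `{0}{1}{0}` inside the substituted copy of `src` survives verbatim, which is what makes the reproduction exact.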
Today, we're announcing the Claude 3 model family, which sets new industry benchmarks across a wide range of cognitive tasks. The family includes three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Each successive model offers increasingly powerful performance, allowing users to select the optimal balance of intelligence, speed, and cost for their specific application.
The largest Claude 3 model seems to outperform GPT-4 on benchmarks (though note slight differences in evaluation methods):
Opus, our most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems, including undergraduate level expert knowledge (MMLU), graduate level expert reasoning (GPQA), basic mathematics (GSM8K), and more. It exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.
Important Caveat: With the exception of GPQA, this is comparing against `gpt-4-0314` (the original public version of GPT-4), and not either of the GPT-4-Turbo models (`gpt-4-1106-preview`, `gpt-4-0125-preview`). The GPT-4 entry for GPQA is `gpt-4-0613`, which performs significantly better than `-0314` on benchmarks. Where the data exists, `gpt-4-1106-preview` consistently outperforms Claude 3 Opus. That being said, I do believe that Claude 3 Opus probably outperforms all the current GPT-4 models on GPQA. Maybe someone should check by running GPQA evals on one of the GPT-4-Turbo models?
Also, while I haven't yet had the chance to interact much with this model, as of writing, Manifold assigns ~70% probability to Claude 3 outperforming GPT-4 on the LMSYS Chatbot Arena Leaderboard.
https://manifold.markets/JonasVollmer/will-claude-3-outrank-gpt4-on-the-l?r=Sm9uYXNWb2xsbWVy
According to Anthropic, Claude 3 was trained on synthetic data (though it was not trained on any customer-generated data from previous models):
Also interesting that the model can identify the synthetic nature of some of its evaluation tasks. For example, it provides the following response to a synthetic recall text:
Several people have pointed out that this post seems to take a different stance on race dynamics than was expressed previously:
As we push the boundaries of AI capabilities, we’re equally committed to ensuring that our safety guardrails keep apace with these leaps in performance. Our hypothesis is that being at the frontier of AI development is the most effective way to steer its trajectory towards positive societal outcomes.
EDIT: Lukas Finnveden pointed out that they included a footnote in the blog post caveating their numbers:
- This table shows comparisons to models currently available commercially that have released evals. Our model card shows comparisons to models that have been announced but not yet released, such as Gemini 1.5 Pro. In addition, we’d like to note that engineers have worked to optimize prompts and few-shot samples for evaluations and reported higher scores for a newer GPT-4T model. Source.
And indeed, from the linked GitHub repo, `gpt-4-1106-preview` still seems to outperform Claude 3:
Ignoring the MMLU results, which use a fancy prompting strategy that Anthropic presumably did not use for their evals, Claude 3 gets 95.0% on GSM8K, 60.1% on MATH, 84.9% on HumanEval, 86.8% on Big Bench Hard, 93.1 F1 on DROP, and 95.4% on HellaSwag. So Claude 3 is arguably not pushing the frontier on LLM development.
EDIT2: I've compiled the benchmark numbers for all models with known versions:
On every benchmark where both were evaluated, gpt-4-1106 outperforms Claude 3 Opus. However, given the size of the performance gaps, it seems plausible to me that Claude 3 substantially outperforms all GPT-4 versions on GPQA, even though the later GPT-4s (post -0613) have not been evaluated on GPQA.
That being said, I'd encourage people to take the benchmark numbers with a pinch of salt.