Why OpenAI’s new model is such a big deal

September 17, 2024

Last weekend, I got married at a summer camp, and during the day our guests competed in a series of games inspired by the show Survivor that my now-wife and I orchestrated. When we were planning the games in August, we wanted one station to be a memory challenge, where our friends and family would have to memorize part of a poem and then relay it to their teammates so they could re-create it with a set of wooden tiles.

I thought OpenAI’s GPT-4o, its leading model at the time, would be perfectly suited to help. I asked it to create a short wedding-themed poem, with the constraint that each letter could only appear a certain number of times so we could make sure teams would be able to reproduce it with the provided set of tiles. GPT-4o failed miserably. The model repeatedly insisted that its poem worked within the constraints, even though it didn’t. It would correctly count the letters only after the fact, while continuing to deliver poems that didn’t fit the prompt. Without the time to meticulously craft the verses by hand, we ditched the poem idea and instead challenged guests to memorize a series of shapes made from colored tiles. (That ended up being a total hit with our friends and family, who also competed in dodgeball, egg tosses, and capture the flag.)
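The constraint itself is easy to verify mechanically, which is what makes the model's insistence that its poems fit so striking. Below is a minimal Python sketch of such a check. The function name, the tile counts, and the sample line are all hypothetical, chosen only for illustration; the actual tile set used at the wedding is not given here.

```python
from collections import Counter


def letters_over_budget(poem: str, tiles: dict[str, int]) -> dict[str, int]:
    """Return the number of extra tiles each letter would need.

    An empty dict means the poem can be built from the tile set.
    Case is ignored, and spaces and punctuation are not counted.
    """
    used = Counter(ch for ch in poem.upper() if ch.isalpha())
    return {
        letter: count - tiles.get(letter, 0)
        for letter, count in used.items()
        if count > tiles.get(letter, 0)
    }


# Hypothetical tile set: letter -> number of tiles available.
tiles = {"L": 2, "O": 2, "V": 1, "E": 2, "T": 1, "S": 1, "A": 1}

# "Love lasts" needs two S tiles but only one is available.
print(letters_over_budget("Love lasts", tiles))  # {'S': 1}
```

A check like this only confirms or rejects a finished poem. Writing verse that satisfies the budget in the first place is the harder, multistep part of the task.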

However, last week OpenAI released a new model called o1 (previously referred to under the code name “Strawberry” and, before that, Q*) that blows GPT-4o out of the water for this kind of task.

Unlike previous models that are well suited for language tasks like writing and editing, OpenAI o1 is focused on multistep “reasoning,” the type of process required for advanced mathematics, coding, or other STEM-based questions. It uses a “chain of thought” technique, according to OpenAI. “It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working,” the company wrote in a blog post on its website.

OpenAI’s tests point to resounding success. The model ranks in the 89th percentile on questions from the competitive coding organization Codeforces and would be among the top 500 high school students in the USA Math Olympiad, which covers geometry, number theory, and other math topics. The model is also trained to answer PhD-level questions in subjects ranging from astrophysics to organic chemistry.

In math olympiad questions, the new model is 83.3% accurate, versus 13.4% for GPT-4o. In the PhD-level questions, it averaged 78% accuracy, compared with 69.7% from human experts and 56.1% from GPT-4o. (In light of these accomplishments, it’s unsurprising the new model was pretty good at writing a poem for our nuptial games, though still not perfect; it used more Ts and Ss than instructed.)

So why does this matter? The bulk of LLM progress until now has been language-driven, resulting in chatbots or voice assistants that can interpret, analyze, and generate words. But in addition to getting lots of facts wrong, such LLMs have failed to demonstrate the types of skills required to solve important problems in fields like drug discovery, materials science, coding, or physics. OpenAI’s o1 is one of the first signs that LLMs might soon become genuinely helpful companions to human researchers in these fields.

It’s a big deal because it brings “chain-of-thought” reasoning in an AI model to a mass audience, says Matt Welsh, an AI researcher and founder of the LLM startup Fixie.
