What Is GPT-4o? An Explainer Guide

What is OpenAI’s GPT-4o?

GPT-4o is OpenAI’s newest flagship model. The “o” stands for “omni,” from the Latin for “all.”

This name reflects that GPT-4o can handle a mix of text, audio, images, and video in a single prompt. Before this, different models were used for different types of content in ChatGPT.

For instance, if you used Voice Mode to talk to ChatGPT, your speech would be turned into text by Whisper, then GPT-4 Turbo would generate a text response, and finally, Text-to-Speech (TTS) would convert that text back into speech.

Comparing GPT-4 Turbo and GPT-4o:

  • Speech input: handled by a pipeline of Whisper (speech-to-text), GPT-4 Turbo, and TTS (text-to-speech).
  • Image input: handled by GPT-4 Turbo together with DALL-E 3.

Using GPT-4o, a single model can handle all these types of content. This means faster and better results, a simpler interface, and new possibilities for how you can use the model.
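The two architectures can be sketched in Python. Every function below is an illustrative stub, not a real OpenAI API call:

```python
# Illustrative stubs, not the real OpenAI API: each function stands in
# for one model in ChatGPT's pre-GPT-4o Voice Mode pipeline.

def whisper_transcribe(audio: bytes) -> str:
    """Speech-to-text (Whisper). Tone and background sound are discarded here."""
    return "what's the weather like?"

def gpt4_turbo_reply(transcript: str) -> str:
    """Text-only reasoning (GPT-4 Turbo). Sees words, never audio."""
    return f"You asked: {transcript}"

def tts_speak(text: str) -> bytes:
    """Text-to-speech (TTS). Synthesizes a fixed voice from plain text."""
    return text.encode("utf-8")

def old_voice_mode(audio: bytes) -> bytes:
    """Three models chained: only the transcript survives between stages."""
    return tts_speak(gpt4_turbo_reply(whisper_transcribe(audio)))

def gpt4o_voice_mode(audio: bytes) -> bytes:
    """GPT-4o replaces the whole chain with one audio-in, audio-out model."""
    # Stub: a real call would hand raw audio to a single multimodal model,
    # so tone, noise, and multiple speakers stay available end to end.
    return b"<gpt-4o audio reply>"
```

The key structural point is that `old_voice_mode` composes three separate models, and everything except the transcript is lost at the first stage, while `gpt4o_voice_mode` is a single call.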


What Makes GPT-4o Different from GPT-4 Turbo?

GPT-4o is an all-in-one model that offers several improvements over the previous voice interaction capabilities of GPT-4 Turbo.

1. Emotional Responses with Tone of Voice

Previously, OpenAI used a combination of Whisper, GPT-4 Turbo, and TTS to process voice interactions. This setup only considered the spoken words, ignoring tone, background noises, and multiple speakers’ voices.

Consequently, GPT-4 Turbo couldn’t generate responses with varying emotions or speech styles.

With GPT-4o, a single model handles both text and audio. This integration allows the model to use rich audio information, including tone of voice, to deliver higher-quality responses with more emotional and stylistic variety.

For example, GPT-4o can now produce sarcastic responses.

2. Real-Time Conversations with Lower Latency

The previous three-model pipeline introduced a delay (“latency”) between speaking to ChatGPT and receiving a response.

In the old Voice Mode pipeline, the average latency was 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. In contrast, GPT-4o has an average latency of just 0.32 seconds, roughly nine times faster than GPT-3.5 and 17 times faster than GPT-4.

This reduced latency is close to the average human response time (0.21 seconds), making real-time conversations with GPT-4o much smoother.
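The quoted speedups follow directly from the latency figures above:

```python
# Average Voice Mode latencies quoted in the text, in seconds.
LATENCY = {"gpt-3.5": 2.8, "gpt-4": 5.4, "gpt-4o": 0.32}
HUMAN_RESPONSE = 0.21  # average human conversational response time

def speedup(old_model: str, new_model: str = "gpt-4o") -> float:
    """How many times faster the new model responds than the old one."""
    return LATENCY[old_model] / LATENCY[new_model]

print(round(speedup("gpt-3.5")))  # 9  -> ~9x faster than GPT-3.5 voice mode
print(round(speedup("gpt-4")))    # 17 -> ~17x faster than GPT-4 voice mode
```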

Moreover, this improvement is similar to Google Instant, which saved time by providing search results instantly. One practical use of GPT-4o’s decreased latency is real-time translation of speech, as demonstrated by OpenAI with English and Spanish-speaking colleagues.

3. Integrated Vision for Describing Camera Feeds

GPT-4o also includes image and video features. It can describe what it sees on a computer screen, answer questions about onscreen images, or assist with tasks as a co-pilot.

For example, in a video by OpenAI, GPT-4o helps Sal Khan’s son with his math homework by describing and interpreting onscreen content.

Moreover, if GPT-4o has access to a camera, such as the one on a smartphone, it can describe its surroundings. OpenAI showcased a demo where two smartphones running GPT-4o held a conversation.

One GPT described what its camera saw to the other GPT, creating a three-way conversation with a human. This demo also included the AIs singing, a feature not possible with previous models.

Overall, GPT-4o’s all-in-one approach significantly enhances the model’s ability to understand and respond to different types of input, providing a more seamless and interactive user experience.

4. Enhanced Tokenization for Non-Roman Alphabets

One important step in the language model workflow is converting prompt text into tokens, which are units of text the model can understand.

In English, a token typically represents a word or a piece of punctuation, though some words may be broken down into multiple tokens. On average, three English words are equivalent to about four tokens.

If a language can be represented with fewer tokens, fewer calculations are needed, increasing the speed of text generation. Additionally, since OpenAI charges for its API per token, using fewer tokens reduces costs for API users.

GPT-4o features an improved tokenizer that represents the same text with fewer tokens, especially benefiting languages that don’t use the Roman alphabet.

For example:

  • Indian languages like Hindi, Marathi, Tamil, Telugu, and Gujarati show a reduction in tokens by 2.9 to 4.4 times.
  • Arabic shows a 2x token reduction.
  • East Asian languages such as Chinese, Japanese, Korean, and Vietnamese show token reductions between 1.4x and 1.7x.
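Fewer tokens means proportionally less compute and a proportionally smaller API bill. A rough sketch of the effect, using a reduction factor within the Hindi range quoted above and a placeholder per-token price:

```python
def tokens_after_reduction(old_tokens: int, reduction_factor: float) -> int:
    """Token count under the improved tokenizer, given a quoted reduction factor."""
    return round(old_tokens / reduction_factor)

def api_cost(tokens: int, price_per_million: float) -> float:
    """API cost in dollars when billing is per token."""
    return tokens * price_per_million / 1_000_000

# A Hindi prompt that previously took 350 tokens, with a 3.5x reduction
# (inside the 2.9x-4.4x range quoted for Indian languages):
old_tokens, factor = 350, 3.5
new_tokens = tokens_after_reduction(old_tokens, factor)
print(new_tokens)  # 100

# At a placeholder $5 per million input tokens, the bill shrinks 3.5x too:
print(api_cost(old_tokens, 5.0))  # 0.00175
print(api_cost(new_tokens, 5.0))  # 0.0005
```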

5. Rollout to the Free Plan

OpenAI’s current pricing strategy for ChatGPT requires users to pay for access to the best model, GPT-4 Turbo, available only on the Plus and Enterprise paid plans.

However, OpenAI plans to make GPT-4o available on the free plan. Plus users will receive five times as many messages as free plan users.

The rollout will be gradual, starting with red team testers (who try to find problems by breaking the model) and expanding to more users over time.

6. Launch of the ChatGPT Desktop App

In addition to the updates with GPT-4o, OpenAI has announced the release of the ChatGPT desktop app. The improvements in latency and multimodality, along with the new app, are expected to change how users interact with ChatGPT.

For instance, OpenAI demonstrated an augmented coding workflow using voice and the ChatGPT desktop app. For more details, check out the example in the use-cases section below!


How Does GPT-4o Work?

A Single Neural Network for Multiple Content Types

GPT-4o is a single neural network trained to handle text, vision, and audio input. This is a departure from the previous method of using separate models for different types of content.

This multi-modal approach isn’t entirely new. In 2022, Tencent AI Lab developed SkillNet, which combined language model features with computer vision to improve the recognition of Chinese characters. In 2023, a team from ETH Zurich, MIT, and Stanford created WhisBERT, a text-and-audio variant of the BERT language model. While GPT-4o is not the first multi-modal model, it is more ambitious and powerful than these earlier efforts.

Is GPT-4o a Radical Change from GPT-4 Turbo?

The extent of the change depends on who you ask. Internally, OpenAI’s engineering team seems to consider it a significant update, possibly indicating a major architectural shift. 

This is suggested by the name “im-also-a-good-gpt2-chatbot,” which appeared on the LMSYS Chatbot Arena leaderboard and was later revealed to be GPT-4o. The “2” in the name hints at a new generation, distinct from previous models.

Externally, OpenAI’s marketing team has opted for a modest naming change, continuing the “GPT-4” convention.

GPT-4o Performance vs. Other Models

OpenAI compared GPT-4o to several high-end models, including GPT-4 Turbo, GPT-4 (initial release), Claude 3 Opus, Gemini Pro 1.5, Gemini Ultra 1.0, and Llama 3 400B.

The most relevant comparisons are with GPT-4 Turbo, Claude 3 Opus, and Gemini Pro 1.5, as these models compete for the top spot on the LMSYS Chatbot Arena leaderboard.

Six benchmarks were used to measure performance:

  1. Massive Multitask Language Understanding (MMLU): Tests on subjects like elementary mathematics, US history, computer science, and law.
  2. Graduate-Level Google-Proof Q&A (GPQA): Difficult multiple-choice questions in biology, physics, and chemistry.
  3. MATH: Middle and high school mathematics problems.
  4. HumanEval: Functional correctness of computer code.
  5. Multilingual Grade School Math (MGSM): Math problems translated into ten languages, including underrepresented ones like Bengali and Swahili.
  6. Discrete Reasoning Over Paragraphs (DROP): Questions requiring understanding of complete paragraphs.

Performance Results

GPT-4o achieved top scores in four benchmarks, though Claude 3 Opus outperformed it on MGSM and GPT-4 Turbo did better on DROP. Overall, GPT-4o’s performance is impressive, especially considering its multi-modal capabilities.

When comparing GPT-4o to GPT-4 Turbo, the performance increases are modest—just a few percentage points. This improvement, while notable, is less dramatic than the leaps seen from GPT-1 to GPT-2 or GPT-2 to GPT-3.

Incremental improvements in text reasoning are likely to be the norm moving forward, as the easiest advancements have already been made.

However, these benchmarks don’t fully capture AI performance on multi-modal tasks. The newness of this concept means we lack comprehensive ways to measure a model’s capability across text, audio, and vision.

GPT-4o Limitations & Risks

Regulation and Safety Framework

The regulation of generative AI is still developing, with the EU AI Act being the primary legal framework currently in place. This leaves AI companies to make many safety-related decisions independently. OpenAI uses a preparedness framework to assess whether a new model is safe to release. This framework evaluates four main areas of concern:

  1. Cybersecurity: Assessing if the AI can enhance the productivity of cybercriminals or aid in creating exploits.
  2. CBRN Threats: Evaluating if the AI can assist in developing chemical, biological, radiological, or nuclear threats.
  3. Persuasion: Determining if the AI can generate content that persuades people to change their beliefs.
  4. Model Autonomy: Checking if the AI can act autonomously, performing actions with other software.

Each area is graded as Low, Medium, High, or Critical. OpenAI promises not to release a model with a Critical concern, which corresponds to a risk that could upend human civilization. GPT-4o has been rated as Medium concern, meaning it is considered safe under these guidelines.
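The release rule can be expressed as a simple gate. This is a schematic of the policy as described above, not OpenAI’s actual implementation, and the per-area scores are hypothetical values consistent with GPT-4o’s overall Medium rating:

```python
# Ordered risk grades from the preparedness framework, lowest to highest.
GRADES = ["Low", "Medium", "High", "Critical"]

def overall_grade(scores: dict) -> str:
    """A model's overall grade is its worst grade across the four areas."""
    return max(scores.values(), key=GRADES.index)

def safe_to_release(scores: dict) -> bool:
    """The stated rule: never ship a model graded Critical."""
    return overall_grade(scores) != "Critical"

# Hypothetical per-area scores consistent with an overall Medium rating:
gpt4o_scores = {
    "Cybersecurity": "Low",
    "CBRN": "Low",
    "Persuasion": "Medium",
    "Model Autonomy": "Low",
}
print(overall_grade(gpt4o_scores))    # Medium
print(safe_to_release(gpt4o_scores))  # True
```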

Imperfect Output

As with all generative AI models, GPT-4o doesn’t always produce perfect results. Some common issues include:

  • Computer Vision Errors: Incorrect interpretations of images or videos.
  • Speech Transcription Errors: Inaccuracies in transcribing speech, particularly with strong accents or technical terms.
  • Translation Failures: Problems with translating between non-English languages.
  • Tone and Language Mistakes: Producing an unsuitable tone of voice or speaking the wrong language.

Accelerated Risk of Audio Deepfakes

GPT-4o’s audio capabilities present novel risks, particularly the potential for enhanced deepfake scam calls. These scams involve AI impersonating celebrities, politicians, or people’s acquaintances, making them more convincing. To mitigate this risk, GPT-4o’s audio output is limited to a selection of preset voices. However, scammers could still generate text with GPT-4o and use separate text-to-speech models, although this might not retain the same quality.


GPT-4o API Pricing

Despite its enhanced capabilities, GPT-4o is approximately 50% cheaper than its predecessor, GPT-4 Turbo: $5 per million input tokens and $15 per million output tokens.
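Per-request cost follows directly from these rates. GPT-4 Turbo’s launch pricing of $10/$30 per million tokens is included below for the 50% comparison; check current rates before relying on either figure:

```python
# Price per million tokens: (input, output), in US dollars.
PRICING = {
    "gpt-4o": (5.00, 15.00),
    "gpt-4-turbo": (10.00, 30.00),  # launch pricing, for the 50% comparison
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API request under per-token billing."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A request with a 2,000-token prompt and a 1,000-token reply:
print(request_cost("gpt-4o", 2000, 1000))       # 0.025
print(request_cost("gpt-4-turbo", 2000, 1000))  # 0.05, i.e. twice the cost
```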

Accessing GPT-4o in ChatGPT

The user interface for ChatGPT now defaults to using GPT-4o, with an option to switch to GPT-3.5 via a toggle underneath the response.

Implications for the Future

There are two primary perspectives on AI development:

  1. Increasing Power and Versatility: OpenAI aims to create artificial general intelligence (AGI) and thus focuses on making AI more powerful and capable of a broader range of tasks.
  2. Specialized Efficiency: Another view is that AI should become better at specific tasks as cost-effectively as possible.

GPT-4o aligns with the first perspective, representing a significant step towards AGI. This new architecture marks the beginning of a phase of learning and optimization for OpenAI, with expected improvements in performance over time.

Initially, there may be new quirks and hallucinations, but long-term advancements in speed and quality are anticipated.


Conclusion

GPT-4o is a significant advancement in generative AI, combining text, audio, and visual processing into a single model. 

This innovation promises faster responses, richer interactions, and a broader range of applications, from real-time translation to enhanced data analysis and improved accessibility for the visually impaired. 

While there are initial limitations and risks, such as potential misuse in deepfake scams, GPT-4o represents another step towards OpenAI’s goal of artificial general intelligence. 

Its lower cost and enhanced capabilities are poised to set a new standard in the AI industry, expanding possibilities for users across various fields.