Anthropic Dethroned By Gemini 1.5 Pro’s 1 Million-Token Context Window

Gemini 1.5 Pro from Google is expanding the frontiers of long context windows for AI foundation …
Royalty-free image by Gerd Altmann from Pixabay

Gemini 1.5 Pro—the newest foundation model in Google’s Gemini series—has now achieved a 1 million-token context window, making it the longest of any large-scale foundation model to date. Anthropic’s Claude 2.1 previously held the context record with 200,000 tokens. Large context windows allow a model to process and understand extremely long documents, books, scripts or codebases that would otherwise need to be processed separately.

To complement its context window size, Gemini 1.5 Pro also shows improved next-token prediction and near-perfect retrieval, with a recall rate of more than 99% in tests of up to 10 million tokens. Lower retrieval rates and a smaller token range can result in more errors and less useful information, so the improvements within Gemini 1.5 Pro stand to increase its accuracy and utility.

Gemini 1.5 Pro is differentiated by a mixture-of-experts architecture. This architecture provides better performance by dividing problems into segments and then using specialized expert sub-models to solve each segment. Google trained this model on 4,096-chip pods of Google’s TPUv4 accelerators using multilingual data along with Web documents, code and multimodal content including audio and video.
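
To make the mixture-of-experts idea more concrete, here is a minimal sketch in plain Python of top-k expert routing. It illustrates the general technique only. Google has not published Gemini 1.5 Pro's router in this detail, so the names `moe_forward`, `make_expert` and `gate_weights`, the toy "experts" that simply scale their input, and the choice of two active experts are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical expert: multiplies every input feature by `scale`."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs."""
    # One router score per expert: dot product of that expert's gate row with x.
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    # Only the top_k experts run; the rest are skipped, which saves compute.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        weight = probs[i] / norm  # renormalize over the selected experts
        expert_out = experts[i](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, chosen


# Toy setup (assumed values): four experts, two input features.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
out, chosen = moe_forward([1.0, 0.0], gate_weights, experts)
print(chosen)  # [0, 2]: only experts 0 and 2 are evaluated for this input
```

The design point is that each input activates only a few experts, so a model can hold many parameters in total while spending compute on a small subset for any given input.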

Although there are many advantages to large context windows, research by Anthropic suggests that expanding context size may also defeat safety guardrails. More details on that below.

Multimodal Long Context

Input modalities for Gemini 1.5 Pro now include audio understanding in the Gemini API and Google AI Studio, which can extract and interpret spoken language from large audio and video files. As an example of what it can enable, audio understanding could turn a 100,000-token videotaped college lecture into a quiz with an answer key. Audio understanding could also be applied to a 200,000-token narrated video of a large warehouse to locate any visible storage item. Long context and audio understanding open many new use cases for Gemini 1.5 Pro.

The new model’s 1 million-token context window allows users to upload large PDFs, code repositories and lengthy videos as prompts. Developers can upload multiple large files and then ask questions about the intersections of multimodal content, such as in which video frame a particular piece of dialogue occurred.

A demonstration of the outputs enabled by long context and multimodal compatibility
Google

The above graphic shows the multimodal prompt used to test Gemini 1.5 Pro on its ability to extract contents from Sherlock Jr., a 45-minute Buster Keaton movie from 1924. The film includes 2,674 frames at 1 FPS, which amounts to 684,000 tokens.
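
The figures quoted for the film can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated here (2,674 frames, 1 FPS, 684,000 tokens, a 1 million-token window). The per-frame token cost is derived from those rounded figures, so treat it as an approximation and not an official value.

```python
FRAMES = 2_674              # frames sampled from the film
FPS = 1                     # sampling rate, frames per second
TOTAL_TOKENS = 684_000      # token count quoted for the whole film
CONTEXT_WINDOW = 1_000_000  # Gemini 1.5 Pro's context window

# At 1 FPS, each frame represents one second of running time.
duration_seconds = FRAMES // FPS
minutes, seconds = divmod(duration_seconds, 60)

# Approximate cost of a single frame, and the share of the window consumed.
tokens_per_frame = TOTAL_TOKENS / FRAMES
context_share = TOTAL_TOKENS / CONTEXT_WINDOW
tokens_left = CONTEXT_WINDOW - TOTAL_TOKENS

print(f"Running time: {minutes} min {seconds} s")        # 44 min 34 s
print(f"Tokens per frame: {tokens_per_frame:.1f}")       # 255.8
print(f"Share of context window: {context_share:.1%}")   # 68.4%
print(f"Tokens left for prompts: {tokens_left:,}")       # 316,000
```

In other words, the film alone fills roughly two-thirds of the window, leaving about 316,000 tokens for the text and image prompts and the model's working room.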

Note that one of the prompts is text and the other is a composite hand-drawn image plus text information. Both prompts located the relevant information along with its exact frame and timestamp.

New Coding Advantages

The extended context window also provides Gemini 1.5 Pro with a coding advantage by allowing it to ingest an entire codebase that developers can upload directly or through Google Drive. Providing Gemini 1.5 Pro access to a codebase allows it to analyze relationships and patterns for better understanding of the code.

As an example, with its extended context window Gemini 1.5 Pro can accommodate codebases such as JAX, which contains 746,152 tokens. JAX is an open-source Python library from Google for high-performance numerical computing and machine learning research, built around automatic differentiation and just-in-time compilation.

A prompt to find out how the gradient of a function is computed in a large codebase
Google

After ingesting JAX, Gemini 1.5 Pro was able to identify the specific location of a core automatic differentiation method. The backward pass is an important part of training in JAX. It computes the gradient of a function’s output with respect to its inputs, which tells the training process how to adjust a model’s parameters to reduce error.
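
For readers unfamiliar with the term, the following is a minimal sketch of what a backward pass does, written in plain Python. It is not JAX's implementation (JAX exposes this capability through functions such as `jax.grad`). The `Value` class and its methods are hypothetical names used only to show reverse-mode automatic differentiation on a tiny expression.

```python
import math


class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs that produced this value
        self._local_grads = local_grads  # d(this)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        """Backward pass: apply the chain rule from the output to every input."""
        order, seen = [], set()

        def visit(node):
            # Depth-first walk that lists each node after all of its parents.
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


# Forward pass: y = tanh(w * x + b) with assumed toy values.
w, x, b = Value(0.5), Value(2.0), Value(-1.0)
y = (w * x + b).tanh()

# Backward pass: fills in dy/dw, dy/dx and dy/db.
y.backward()
print(y.data, w.grad, x.grad, b.grad)  # 0.0 2.0 0.5 1.0
```

The gradients printed at the end say how much the output would change for a small change in each input. During training, an optimizer uses exactly this information to update model parameters.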

Long Context Red Flags

Expanding context-window size has been an essential part of AI model development. Since 2023, the context window has gone from a few thousand tokens to Gemini 1.5 Pro’s current record of 1 million tokens.

Anthropic has been a leader in expanding context-window size. To highlight one of the potential downsides of longer context windows, it recently published research explaining how a long context window can be used to exploit an LLM through a method called many-shot jailbreaking (MSJ). Techniques such as MSJ cause large language models to ignore their safety guardrails. When this happens, the model can be led into bad behaviors, such as issuing insults or providing instructions on how to build weapons, pick locks or carry out other forbidden tasks.

Many-shot jailbreaking versus few-shot jailbreaking
Anthropic

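Implementing many-shot jailbreaking is relatively simple. As shown in the above graphic, a large language model can be induced to ignore its safety guardrails when a single prompt contains hundreds of fabricated exchanges in which harmful questions are posed and answered, followed by the attacker’s actual question. No retraining is involved; the attack relies entirely on the contents of the context window. MSJ doesn’t work with five shots, but it works consistently with 256 shots.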

As the number of shots increases, so does the probability of getting a response involving hate, …
Anthropic

The researchers determined that MSJ works against Claude 2.0, GPT-3.5-turbo, GPT-4-1106-preview, Llama 2 (70B) and Mistral 7B. In fact, prompts of around 128 shots were enough to cause those models to exhibit bad behavior. Anthropic disclosed this research so the AI community could help develop methods to mitigate MSJ.

Conclusion

Gemini 1.5 Pro is available in the Vertex AI Model Garden, Google’s platform for data scientists and engineers that is designed to simplify building and deploying AI models. The Vertex AI Model Garden has more than 80 models, including both Google’s proprietary models and open-source models such as Stable Diffusion, BERT and T5.

I’m looking forward to seeing what developers can produce using Gemini 1.5 Pro’s 1 million-token context window and its simultaneous ability to utilize images, video, audio and text. Gemini 1.5 Pro is the only model that has those combined capabilities, so that’s a large competitive advantage for Google, at least until the competition catches up.

Paul Smith-Goodson

Paul Smith-Goodson is the Moor Insights & Strategy Vice President and Principal Analyst for quantum computing and artificial intelligence. His early interest in quantum began while working on a joint AT&T and Bell Labs project and, during 360 overviews of Murray Hill advanced projects, Peter Shor provided an overview of his ground-breaking research in quantum error correction.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences with the understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.