Retrospective: Google DeepMind Unveils Gemini, A New Frontier in Multimodal AI

On December 6, 2023, Google DeepMind launched Gemini, a new family of multimodal AI models, positioning it as a major competitor in the AI landscape.

December 6, 2023 – In a highly anticipated move, Google DeepMind officially unveiled Gemini, a new family of artificial intelligence models, billing it as the company’s “most capable AI model yet.” The announcement marked a significant milestone for Google, arriving just over a year after the launch of OpenAI’s ChatGPT sparked a global resurgence of interest in generative AI, and aiming to reshape the field’s competitive landscape. Gemini, the product of the combined efforts of Google Brain and DeepMind (merged into Google DeepMind earlier in 2023), posed a direct challenge to existing state-of-the-art models, particularly OpenAI’s GPT-4, which had dominated headlines throughout 2023.

Sundar Pichai, CEO of Google and Alphabet, emphasized the significance of the launch in a blog post on December 6, stating, “This is a significant milestone in the development of AI and the start of a new era of AI for us at Google.” According to Google, Gemini was engineered from the ground up to be multimodal, meaning it was trained natively on a vast array of data types, including text, images, audio, and video, rather than combining separate components post-training. This integrated approach was presented as a key differentiator, enabling Gemini to understand and operate across various modalities in a more seamless and sophisticated manner.
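In practice, that native multimodality surfaced directly in the developer interface once the Gemini API opened the following week: a single request can interleave text and images rather than routing each modality through a separate model. A minimal sketch using the google-generativeai Python SDK (the API key and image path are placeholders):

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

# One request mixes a text instruction with an image; no separate
# captioning model sits in between. "gemini-pro-vision" was the
# image-and-text variant exposed when the API opened to developers.
model = genai.GenerativeModel("gemini-pro-vision")
response = model.generate_content(
    ["What landmark is shown here, and in which city?",
     Image.open("photo.jpg")]  # placeholder local image
)
print(response.text)
```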

Key Capabilities and Model Sizes

Gemini was introduced in three distinct sizes, each tailored to different applications and computational demands (see the API sketch following this list):

  • Gemini Ultra: Described as Google’s largest and most capable model, designed for highly complex tasks. Google said Ultra was still undergoing extensive trust and safety evaluations and would be made available to developers and enterprise customers, and begin powering new products, in early 2024.
  • Gemini Pro: A more agile model designed to scale across a wide range of tasks. Google announced its immediate integration into Bard, the company’s experimental conversational AI service.
  • Gemini Nano: The most efficient version, optimized for on-device applications. This version was announced for immediate integration into Google’s Pixel 8 Pro smartphone, enabling features like Summarize in the Recorder app and Smart Reply in Gboard.
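For developers, these tiers later mapped to distinct model identifiers in the Gemini API. A minimal sketch using the google-generativeai Python SDK; note that only Gemini Pro was reachable through the API at launch, Ultra arrived in 2024, and Nano runs on-device via Android rather than through this SDK (the API key is a placeholder):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

# "gemini-pro" was the identifier for the mid-tier model when the API
# opened to developers on December 13, 2023; Ultra and on-device Nano
# were not served through this endpoint at launch.
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Summarize Gemini's three model sizes.")
print(response.text)
```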

Central to Google DeepMind’s announcement were Gemini’s benchmark results. According to the Gemini Technical Report released on December 6, the Gemini Ultra model scored 90.0% on the Massive Multitask Language Understanding (MMLU) benchmark, a result obtained with a chain-of-thought prompting strategy. Google highlighted this as the first time an AI model had surpassed human expert performance on MMLU, which assesses knowledge and problem-solving abilities across 57 subjects. The report also showed strong results on benchmarks for reasoning, mathematics, code generation, and multimodal understanding, where Gemini often matched or outperformed GPT-4.
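For context on what that number means: MMLU is a multiple-choice benchmark, and a reported score is essentially answer accuracy aggregated over its 57 subjects. A toy sketch of one common scoring convention, macro-averaging across subjects (the data and weighting here are illustrative, not Gemini’s actual evaluation harness):

```python
from collections import defaultdict

def mmlu_score(results):
    """results: iterable of (subject, predicted_choice, correct_choice)."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [n_correct, n_total]
    for subject, predicted, correct in results:
        per_subject[subject][0] += int(predicted == correct)
        per_subject[subject][1] += 1
    # Macro average: every subject counts equally, regardless of size.
    return sum(c / t for c, t in per_subject.values()) / len(per_subject)

# Invented toy data: two anatomy questions (one right), one law question (right).
sample = [("anatomy", "B", "B"), ("anatomy", "C", "A"), ("law", "D", "D")]
print(f"{mmlu_score(sample):.1%}")  # -> 75.0%
```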

Immediate Availability and Industry Reaction

The most immediate impact of the Gemini launch was its integration into Bard. From December 6, Bard began running on a fine-tuned version of Gemini Pro for English prompts, with Google promising broader availability in more languages and countries in the coming months. The Pixel 8 Pro also began to leverage Gemini Nano, demonstrating the model’s potential for efficient, on-device AI experiences without requiring cloud connectivity.

The launch generated considerable discussion across the technology industry. Many observers recognized Gemini as Google’s definitive response to the advancements made by OpenAI and Microsoft throughout 2023. The emphasis on native multimodality was seen as a significant technical leap, potentially enabling more natural and intuitive human-computer interactions. Publications widely covered the benchmark claims, particularly the MMLU score, as a strong indicator of the model’s capabilities.

However, within days of the announcement, questions also emerged about how some of Gemini’s capabilities had been presented. A widely circulated demonstration video published by Google on December 6, showcasing Gemini’s multimodal reasoning, began to draw scrutiny. By December 8, reports, notably from Bloomberg, indicated that the demo had not been captured in real time: the video was edited for brevity and impact, and the model had been prompted with still image frames and written text rather than the live video and voice the footage implied. While Google maintained that the video was an “illustration of its multimodal capabilities” rather than a real-time interaction, the revelation led some critics to question the transparency of the initial presentation.

Despite these early discussions surrounding the demo, the consensus among industry observers during the week following the launch was that Gemini represented a major advancement for Google. It underscored the company’s commitment to leading the charge in AI development, bringing together its substantial research capabilities to compete vigorously in the rapidly evolving field of generative AI.