Gemini 2.0

Gemini 2.0 Flash was released on 11 December 2024.

Gemini 2.0 is natively multimodal: the models can directly produce image and audio outputs in addition to text. It also improves native tool use and model memory. By incorporating core Google products such as Google Search for real-time information access, Gemini 2.0 enables several products (specialised types of agentic workflows): Project Astra, Project Mariner, and Jules.
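
As a concrete illustration of the Google Search integration, here is a minimal sketch using the google-genai Python SDK. The model name and prompt are illustrative, and an API key is assumed to be available in the environment.

```python
# Sketch: grounding a Gemini 2.0 response in Google Search results
# (google-genai Python SDK; assumes GEMINI_API_KEY is set in the environment).
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What were the main announcements at the latest Google I/O?",
    config=types.GenerateContentConfig(
        # Enable the built-in Google Search tool so the model can pull in
        # real-time information rather than relying only on training data.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```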

With the tagline ‘Enabling the agentic era,’ Gemini 2.0 naturally focuses on the capabilities critical for agentic use cases, such as multimodal reasoning, compositional function-calling, and planning and following complex instructions.
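
To make compositional function-calling concrete, the sketch below exposes two hypothetical Python functions (find_flights and book_flight are invented stand-ins, not real APIs) as tools. With the SDK's automatic function calling, the model can chain them, feeding the result of one call into the next, to satisfy a single request.

```python
# Sketch of compositional function-calling: the model calls one tool,
# feeds its result into another, then composes the final answer.
# find_flights and book_flight are hypothetical stubs for illustration.
from google import genai
from google.genai import types

def find_flights(origin: str, destination: str) -> list[dict]:
    """Return candidate flights between two cities (stubbed data)."""
    return [{"flight_id": "GA123", "origin": origin, "destination": destination}]

def book_flight(flight_id: str) -> dict:
    """Book a flight by its id (stubbed confirmation)."""
    return {"flight_id": flight_id, "status": "booked"}

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Find a flight from London to Zurich and book the first option.",
    # Passing plain Python callables enables the SDK's automatic function
    # calling: the model decides which tools to invoke and in what order.
    config=types.GenerateContentConfig(tools=[find_flights, book_flight]),
)
print(response.text)
```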

Gemini 1.5

Gemini 1.5, based on a Mixture of Experts (MoE) architecture, has a context window of between 128,000 and 1 million tokens, far larger than those of currently popular models such as OpenAI's GPT-4 (128,000) and Gemini 1.0 Pro (32,000). This feature opens up many use cases that would otherwise require workarounds or complicated architecture setups: writing a review of, and predicting the reception of, a new hour-long YouTube video; analysing a PDF of over 1,000 pages without a backend retrieval-augmented generation (RAG) infrastructure; or following hundreds of pages of manuals and style guides for complex language tasks. Because Gemini is a multimodal model, the large context window lets it understand multimedia content in the same way as text and produce outputs in mixed formats, as sketched below.
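
As a sketch of the RAG-free pattern described above, the following uses the File API in the google-genai Python SDK to place an entire (hypothetical) 1,000-page PDF directly into the context window. The file name, prompt, and exact upload parameter are illustrative and may differ across SDK versions.

```python
# Sketch: analysing a very long PDF with no retrieval layer by putting the
# whole document into Gemini 1.5's long context window.
# "manual.pdf" is a hypothetical 1,000-page document.
from google import genai

client = genai.Client()

document = client.files.upload(file="manual.pdf")  # File API upload

# Check how much of the context window the document consumes.
usage = client.models.count_tokens(model="gemini-1.5-pro", contents=[document])
print(f"Document size: {usage.total_tokens} tokens")

response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=[document, "Summarise the style rules in chapters 10-12."],
)
print(response.text)
```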

As of this writing, Gemini 1.5 should not be seen as a like-for-like competitor to GPT-4. Benchmarking against GPT-4 [@dasGeminiProVs2024] shows that Gemini 1.5 is better suited to handling large datasets, multimodal use cases, and large context windows, while GPT-4 outperforms it on smaller, more nuanced tasks that require complex reasoning.

References: