CMU Study: Google’s Gemini Falls Short of ChatGPT, Signaling Work Ahead for Google
**Google Gemini vs. ChatGPT: A Battle of Large Language Models**
Google’s recent release of Gemini has garnered significant attention as the first large language model (LLM) claimed to rival OpenAI’s ChatGPT across a wide range of tasks. Reports indicate that Gemini’s “Ultra” version outperforms GPT-4 on many benchmarks, while its “Pro” version is comparable to GPT-3.5. To shed light on the rivalry between these prominent models, a new study from Carnegie Mellon University (CMU) examines Gemini’s language understanding and generation capabilities and compares them with those of OpenAI’s GPT series. The study’s findings point to a performance gap between Google Gemini and ChatGPT.
**Key Findings:**
* **Gemini Pro Roughly Matches GPT-3.5 Turbo:** In terms of model size and positioning, the CMU study treats Gemini Pro as the counterpart of GPT-3.5 Turbo. Gemini Pro’s accuracy comes close to GPT-3.5 Turbo’s but falls slightly short of it, and both trail GPT-4 by a significant margin. The study traces part of this gap to specific failure modes: sensitivity to the ordering of answer choices in multiple-choice questions (a simple probe for this bias is sketched after this list), weak multi-digit mathematical reasoning, premature termination of agent tasks, and failed responses caused by aggressive content filtering. On the upside, when responses are not blocked by filtering, Gemini Pro shows an advantage in generating text in non-English languages and in handling longer, more complex reasoning chains.
* **Critical Large Language Model Capabilities:** The study examines several core capabilities of large language models and reports the following task-specific findings:
* **Knowledge-based Question Answering:** Comparing the models’ question-answering abilities, Gemini Pro underperforms GPT-3.5 Turbo on most tasks. Breaking the results down by subtask, the research team finds that Gemini Pro lags furthest behind in “human_sexuality” (social sciences), “formal_logic” (humanities), “elementary_mathematics” (STEM), and “professional_medicine” (professional domain); in the two subtasks where Gemini Pro outperforms GPT-3.5 Turbo, its advantage is marginal.
* **Reasoning Ability:** Gemini Pro’s overall accuracy on reasoning tasks slightly trails GPT-3.5 Turbo and falls well below GPT-4 Turbo. In particular, Gemini Pro struggles with longer, more complex problems, whereas the GPT models are more robust to increasing question length. The study also identifies the specific tasks on which GPT-3.5 Turbo’s performance significantly surpasses Gemini Pro’s.
* **Mathematical Ability:** On the math word problem benchmarks GSM8K, SVAMP, and ASDiv, which feature diverse linguistic patterns and problem types, Gemini Pro’s accuracy is slightly lower than GPT-3.5 Turbo’s and significantly lower than GPT-4 Turbo’s. On MAWPS, all models achieve over 90% accuracy, but Gemini Pro still falls slightly behind the GPT models (a minimal exact-match scoring sketch for this kind of benchmark appears after this list).
* **Code Generation Ability:** In code generation, Gemini Pro shows strength in handling longer inputs and outputs on English tasks. A per-library breakdown reveals that Gemini Pro underperforms GPT-3.5 in most cases involving libraries such as “mock,” “pandas,” “numpy,” and “datetime,” yet outperforms both GPT-3.5 and GPT-4 on tasks involving “matplotlib,” indicating a stronger ability to produce data visualizations via generated code (an execution-based scoring sketch follows this list).
* **Machine Translation Ability:** In translation, Gemini Pro outperforms both GPT-3.5 Turbo and GPT-4 Turbo on 8 of the 20 languages tested, achieving the best results on 4 of them. However, Gemini Pro shows a strong tendency to block responses in roughly 10 language pairs.
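The answer-ordering bias noted in the first finding can be checked with a simple shuffle test: pose the same multiple-choice question several times with the options permuted and see whether the model keeps selecting the same underlying answer. Below is a minimal sketch of such a probe; `query_model` is a hypothetical stand-in for whatever chat-completion client is in use, and the single-letter parsing is a deliberate simplification of the study’s actual harness.

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; returns the raw completion."""
    raise NotImplementedError("plug in your provider's client here")

def format_question(question: str, options: list[str]) -> str:
    """Render the question with lettered options and a one-letter answer request."""
    letters = "ABCD"
    lines = [question]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def is_order_robust(question: str, options: list[str], trials: int = 4) -> bool:
    """True if the model picks the same underlying option across shuffles."""
    picked = set()
    for _ in range(trials):
        shuffled = options[:]
        random.shuffle(shuffled)
        reply = query_model(format_question(question, shuffled))
        # Naive parse: take the first capital letter that names an option.
        letter = next((c for c in reply.upper() if c in "ABCD"), None)
        if letter is None:
            return False  # an unparseable reply counts as inconsistent
        picked.add(shuffled["ABCD".index(letter)])
    return len(picked) == 1
```

A model free of ordering bias should return `True` here regardless of how the options are arranged; a biased model will drift toward a fixed position (e.g., always answering “A”) and fail the check.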
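For math word problem benchmarks such as GSM8K, accuracy is typically computed by exact match on the final numeric answer extracted from the model’s completion. The sketch below illustrates that scoring step under the assumption that each completion ends with its final answer; the regex extraction is an illustrative simplification, not the study’s pipeline.

```python
import re

def extract_final_number(completion: str) -> str | None:
    """Take the last number in the completion as the model's final answer."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return matches[-1].replace(",", "") if matches else None

def exact_match_accuracy(completions: list[str], gold_answers: list[str]) -> float:
    """Fraction of completions whose extracted answer equals the gold answer."""
    hits = sum(
        extract_final_number(reply) == gold
        for reply, gold in zip(completions, gold_answers)
    )
    return hits / len(gold_answers)

# Example: one correct completion out of one yields accuracy 1.0.
print(exact_match_accuracy(
    ["...so she pays 18 * 2 = 36 dollars. The answer is 36."], ["36"]
))
```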
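Code generation benchmarks are usually scored by executing each candidate solution against test cases and reporting pass@1, the fraction of tasks solved on the first attempt. The sketch below shows the idea; running untrusted model output with `exec` is unsafe outside a sandbox, so treat this strictly as an illustration rather than the study’s harness.

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Run the candidate, then its assert-based tests; any exception is a fail."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines the generated function(s)
        exec(test_code, namespace)       # assert-based tests raise on failure
        return True
    except Exception:
        return False

def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """samples: (candidate_code, test_code) pairs, one completion per task."""
    return sum(passes_tests(code, tests) for code, tests in samples) / len(samples)
```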
The CMU study provides valuable insight into Google Gemini’s capabilities, highlighting where it falls short of OpenAI’s ChatGPT. While Gemini Pro is proficient at certain tasks, it has yet to match ChatGPT’s overall performance. The findings underscore the need for continued improvement and further research as the race to build more capable and versatile AI systems goes on.