Can AI Predict Science Better Than Humans? The Groundbreaking BrainBench Study

Science has long been the domain of human experts, but a recent study has shed light on a fascinating possibility: can artificial intelligence predict scientific results better than humans? Imagine an AI that reads every single paper, picks up on patterns across decades of research, and predicts the outcome of an experiment with uncanny accuracy. It sounds like science fiction, but according to a groundbreaking study, it may already be reality.

## Too Much Science, Too Little Time

Imagine you’re a research scientist about to design your next experiment. Ideally, you would consume decades of research, learn the recurring patterns, identify the gaps, and make an informed prediction about what your experiment is likely to reveal. But here’s the problem: you’re human. Time, attention, and memory are finite, and reading everything simply isn’t possible. The number of scientific publications grows at a pace that makes it increasingly unrealistic for researchers to keep up, even in specialized areas.

## When AI Outperformed Human Experts

Researchers at University College London decided to test a simple yet provocative question: can artificial intelligence predict scientific results better than human experts? To answer it, they built a benchmark called BrainBench, designed specifically to test the prediction of experimental results in neuroscience. They compared the performance of 15 large language models with that of 171 neuroscience experts, each with an average of roughly ten years of experience in their respective subfields. The result was decisive: across the benchmark, the AI models achieved an average accuracy of 81.4%, while the human experts reached 63.4%.

## How BrainBench Works

BrainBench measures prediction accuracy by presenting participants with two versions of a scientific abstract (the brief summary that appears at the beginning of a research paper). Both versions describe the same experiment, use the same methods, and share the same background, and both sound scientifically plausible. Only one, however, contains the real result. In the altered version, the outcome is subtly changed: a brain region may show decreased activity instead of increased, or one drug may outperform another rather than the reverse. When such changes are made, the surrounding text is carefully adjusted to remain logically consistent, so there are no obvious errors and no easy clues. The task, sketched below, is simply to tell the real abstract from the altered one.
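To make the setup concrete, here is a minimal sketch of how a language model can be scored on such a two-choice item: show it both versions and pick the one it assigns lower perplexity to, i.e. the one it finds less surprising. The use of GPT-2 as a stand-in model and the toy abstracts are illustrative assumptions, not the study's actual models or test items.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small stand-in model for illustration; BrainBench evaluated much larger LLMs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Compute the model's perplexity over a passage of text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids yields the mean per-token cross-entropy loss.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

def choose_abstract(version_a: str, version_b: str) -> str:
    """Pick the version the model finds less surprising (lower perplexity)."""
    return "A" if perplexity(version_a) < perplexity(version_b) else "B"

# Toy example: two versions that differ only in the reported outcome.
real = "Stimulation of the hippocampus increased recall in the treatment group."
fake = "Stimulation of the hippocampus decreased recall in the treatment group."
print(choose_abstract(real, fake))
```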

## When Hallucination Becomes Useful

In many AI applications, hallucination (the tendency of models to blend information from different sources and produce incorrect statements) is treated as a serious flaw, especially in tasks that rely on accurate citation or factual recall. Prediction, however, operates under different constraints. Forecasting scientific outcomes often means working with noisy, incomplete, and sometimes conflicting evidence; it requires synthesizing patterns across thousands of imperfect studies rather than retrieving a single correct fact.

## The AI Didn’t Just Get Lucky

Persistent skepticism about memorization is understandable. Prior work has shown that large language models can sometimes reproduce parts of their training data. The BrainBench authors anticipated this concern and tested it thoroughly. First, the AI models were trained on neuroscience papers published between 2002 and 2022, and evaluated exclusively on papers from 2023, ensuring no overlap between training and evaluation data. Second, the authors confirmed that papers published earlier in 2023 were no easier for the models than those published later, ruling out the influence of leaked preprints. Third, they also applied a standard memorization-detection technique known as the zlib–perplexity ratio, which helps distinguish recall from genuine generalization. The results were inconsistent with simple memorization.
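For intuition, the memorization check can be sketched roughly as follows: compare how surprised the model is by a passage (its perplexity) with how compressible that passage is under zlib, a model-free proxy for its intrinsic complexity. Text that the model finds unusually easy relative to its compressed size is a candidate for verbatim recall. The stand-in model and the exact scoring formula below are assumptions for illustration, not the authors' implementation.

```python
import zlib
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small stand-in model for illustration only.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_perplexity(text: str) -> float:
    """Mean per-token cross-entropy (the log of perplexity) under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item()

def zlib_entropy(text: str) -> int:
    """Byte length of the zlib-compressed text: a model-free complexity proxy."""
    return len(zlib.compress(text.encode("utf-8")))

def memorization_score(text: str) -> float:
    """Higher values flag passages the model finds 'too easy' relative to their
    intrinsic complexity, i.e. candidates for verbatim recall."""
    return zlib_entropy(text) / log_perplexity(text)

abstract = "Stimulation of the hippocampus increased recall in the treatment group."
print(memorization_score(abstract))
```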

## Specialization vs. Memorization

Performance alone does not explain what the model has learned. To investigate this, the authors examined how domain-specific training changes behavior. They introduced BrainGPT, a neuroscience-specialized model created by fine-tuning a pre-trained Mistral-7B language model using LoRA on neuroscience literature from 2002 to 2022. This fine-tuning led to an additional ~3% improvement on BrainBench and shifted perplexity distributions in a way consistent with domain specialization rather than recall. In a follow-up study, the authors trained small language models from scratch on neuroscience literature alone, without large-scale pre-training. Despite being far smaller than modern LLMs, these models were able to match human expert performance on the BrainBench task.
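For readers curious what LoRA fine-tuning looks like in practice, here is a minimal sketch using the Hugging Face transformers and peft libraries. The hyperparameters, the target modules, and the placeholder dataset are illustrative assumptions, not the BrainGPT training configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Loading a 7B-parameter model requires substantial GPU memory.
base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative LoRA settings; the paper's exact hyperparameters may differ.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the frozen base model; only the small adapter weights are trained.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training itself would use a standard causal-LM loop (e.g. transformers.Trainer)
# over tokenized neuroscience abstracts from 2002-2022; the dataset is omitted here.
```

Because LoRA only trains small adapter matrices on top of a frozen base model, the specialization is cheap to produce, which is part of why a domain-tuned variant like BrainGPT is practical to build at all.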
