GPQA: A Graduate-Level Google-Proof Q&A Benchmark


TL;DR

This paper introduces GPQA, a 448-question graduate-level multiple-choice dataset written by experts in biology, physics, and chemistry. It is designed to be “Google-Proof,” meaning it is extremely difficult even for skilled non-experts with unrestricted internet access, and is intended to provide a highly challenging benchmark for future research on scalable oversight for superhuman AI systems.

Key Definitions

Mainstream question-answering (QA) benchmark datasets are typically created either by crowdsourcing non-experts (e.g., SQuAD) or by curating existing resources (e.g., MMLU, TriviaQA). The answers to questions in these datasets can usually be found and verified by skilled users through web search. As large language model (LLM) capabilities have advanced rapidly, models have gradually saturated these benchmarks, reducing their usefulness for evaluating frontier models and for studying more advanced human-AI collaboration, such as scalable oversight.

Research on scalable oversight requires a special kind of task: one whose correct answer is definitively known (determined by authoritative experts), yet is extremely difficult for the non-experts responsible for supervision to solve independently. Existing datasets either lack this “difficulty gap” or merely simulate a knowledge gap in an artificial way (for example, by distinguishing experts from non-experts based on familiarity with long articles).

This paper aims to address this issue by creating a benchmark of real-world expert-knowledge questions (GPQA) that pose major challenges to both non-experts and the strongest current AI models, thereby providing a realistic and meaningful testbed for studying how to supervise superhuman AI systems.

Method

The core contribution of this paper lies in its distinctive and rigorous dataset construction and validation pipeline. The process is designed to systematically generate questions that are both objective (with clear answers) and highly difficult (Google-Proof).

Figure 1: The data creation process. First, one expert writes a question; another expert in the same field answers it and provides revision suggestions; the writer then revises the question. Finally, the revised question is sent to a second expert validator in the same field and to three non-expert validators.

Data Collection Pipeline

The entire process is divided into four main stages:

  1. Question Writing: Experts holding or pursuing a PhD in the relevant field are invited to write difficult questions within their area of expertise. The requirement is that experts in the same field should be able to answer them correctly, while non-experts should find them difficult even with internet access. Questions must be answerable in free-response form, without seeing the options. In addition, the writer must provide detailed explanations for the correct answer and for each incorrect option.

  2. First Expert Validation: The completed questions are given to another expert in the same field (the first validator) to answer and evaluate. The validator provides detailed feedback to ensure the questions’ accuracy, objectivity, and difficulty.

  3. Question Revision: The question writer revises the questions based on the first validator’s feedback.

  4. Second Validation Phase:

    • Second Expert Validation: The revised questions are given to a third expert in the same field (the second validator) to answer, further testing their objectivity.
    • Non-Expert Validation: At the same time, the questions are distributed to three non-experts (i.e., PhDs in other fields). They are allowed unrestricted use of all web resources except LLM assistants and are required to spend at least 15 minutes solving them. This step is key to verifying whether the questions are truly “Google-Proof.”
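The four-stage pipeline above can be sketched as bookkeeping over per-question records. This is purely illustrative: the paper describes a human workflow, so the record fields and the `fully_validated` helper below are assumptions, not the authors' actual tooling.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QuestionRecord:
    # Illustrative record tracking one question through the four stages;
    # all field names here are assumed for the sketch.
    text: str
    correct_answer: str
    expert1_correct: Optional[bool] = None   # stage 2: first expert validator
    revised: bool = False                    # stage 3: writer's revision
    expert2_correct: Optional[bool] = None   # stage 4a: second expert validator
    nonexpert_correct: list = field(default_factory=list)  # stage 4b: three non-experts

def fully_validated(q: QuestionRecord) -> bool:
    """A question has completed the pipeline once it has been revised and
    has results from the second expert and all three non-expert validators."""
    return (q.revised
            and q.expert2_correct is not None
            and len(q.nonexpert_correct) == 3)
```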

Innovations

Dataset Splits

The final collected data are divided into three subsets to suit different research needs:

| Dataset Split | Count | Expert Accuracy (%) | Non-Expert Accuracy (%) | Rated Sufficiently Specialized by Experts (%) |
|---|---|---|---|---|
| GPQA Extended | 546 | 64.8 | 34.1 | 90.7 |
| GPQA (Main Set) | 448 | 71.9* | 30.4* | 93.5 |
| GPQA Diamond | 198 | 81.3* | 22.1* | 97.0 |

Table 2: Statistics for the extended set, main set, and diamond set. Validator accuracies marked with an asterisk (*) on the main and diamond sets are biased by selection effects, since these subsets were filtered using the validators' own answers.
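The subsets can be seen as successively stricter filters over validation outcomes. As a hedged sketch: the diamond criterion below follows the paper's description (both expert validators answer correctly and a majority of non-expert validators answer incorrectly), while the paper's additional post-hoc expert-agreement adjustments are omitted for simplicity.

```python
def in_diamond(q: dict) -> bool:
    """Diamond-set filter sketch.

    q is a dict with keys:
      'expert_correct'    -- list of 2 bools, one per expert validator
      'nonexpert_correct' -- list of 3 bools, one per non-expert validator

    Criterion: both experts correct AND a majority (2 of 3) of
    non-experts incorrect. Post-hoc agreement adjustments from the
    paper are intentionally left out of this sketch.
    """
    both_experts_correct = all(q["expert_correct"])
    majority_nonexperts_wrong = sum(q["nonexpert_correct"]) <= 1
    return both_experts_correct and majority_nonexperts_wrong
```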

Coverage Areas

The questions cover the three major domains of biology, physics, and chemistry, and are further divided into subfields such as molecular biology, quantum mechanics, and organic chemistry.

| Domain | Count | Expert Accuracy (%) | Non-Expert Accuracy (%) | Gap (points) |
|---|---|---|---|---|
| Biology | 105 | 66.7 | 43.2 | 23.5 |
| Physics | 227 | 57.3 | 32.5 | 24.8 |
| Chemistry | 214 | 72.0 | 31.4 | 40.6 |

Table 3: Comparison of expert and non-expert accuracy across domains in the extended set; the “Gap” column shows the expertise gap between the two.

Experimental Conclusions

Through a series of analyses and baseline model evaluations, the paper validates the characteristics of the GPQA dataset and its effectiveness as a benchmark for scalable oversight research.

Dataset Characteristic Validation

Model Baseline Tests

The paper tested multiple LLMs in both closed-book (no internet access) and open-book (with search access) settings.

| Evaluation Method / Model | Extended Set | Main Set | Diamond Set |
|---|---|---|---|
| Few-Shot CoT GPT-4 | 38.7 | 39.7 | 38.8 |
| GPT-4 (with search) | 39.4 | 41.0 | 38.8 |
| Human Expert Validators | 65.4 | 72.5* | 81.2* |
| Human Non-Expert Validators | 33.9 | 30.5* | 21.9* |

All values are accuracy on each subset (%).

Table 5: Accuracy of the main baselines and of human validators on the three subsets; asterisked values are affected by the selection effects noted in Table 2.
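A few-shot chain-of-thought multiple-choice evaluation like the one behind Table 5 can be scored roughly as follows. This is a minimal sketch: the prompt template, answer-extraction pattern, and the `ask_model` callable are all assumptions, not GPQA's official evaluation harness.

```python
import re

LETTERS = "ABCD"

def format_prompt(question, options, few_shot_examples):
    """Build a few-shot chain-of-thought prompt for one 4-way
    multiple-choice question. Template wording is illustrative only."""
    parts = list(few_shot_examples)  # each example ends with "The answer is (X)"
    opts = "\n".join(f"({l}) {o}" for l, o in zip(LETTERS, options))
    parts.append(f"Question: {question}\n{opts}\nLet's think step by step.")
    return "\n\n".join(parts)

def extract_choice(completion):
    """Parse the model's final stated answer letter, e.g. 'The answer is (B)'."""
    matches = re.findall(r"answer is \(?([ABCD])\)?", completion)
    return matches[-1] if matches else None

def accuracy(records, ask_model, few_shot_examples=()):
    """records: iterable of (question, options, gold_letter) triples.
    ask_model: assumed callable mapping a prompt string to a completion."""
    records = list(records)
    correct = 0
    for question, options, gold in records:
        completion = ask_model(format_prompt(question, options, few_shot_examples))
        if extract_choice(completion) == gold:
            correct += 1
    return correct / len(records)
```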

Final Conclusion

This paper constructed GPQA, a high-quality, graduate-level question-answering dataset. Through a rigorous multi-stage validation process involving both experts and non-experts, it ensures that the questions are both highly objective and highly difficult in a "Google-proof" sense. The experiments show that the dataset not only poses a major challenge to highly skilled non-experts, but also remains difficult for today's most advanced AI models, which still exhibit a substantial gap from domain experts. GPQA therefore provides a valuable and realistic benchmark for studying and evaluating scalable oversight methods for future superhuman AI systems.