---
date: '2023-03-21T06:25:47'
hypothesis-meta:
  created: '2023-03-21T06:25:47.417575+00:00'
  document:
    title:
    - 'GPT-4 and professional benchmarks: the wrong answer to the wrong question'
  flagged: false
  group: __world__
  hidden: false
  id: N6BVsMexEe2Z4X92AfjYDg
  links:
    html: https://hypothes.is/a/N6BVsMexEe2Z4X92AfjYDg
    incontext: https://hyp.is/N6BVsMexEe2Z4X92AfjYDg/aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
    json: https://hypothes.is/api/annotations/N6BVsMexEe2Z4X92AfjYDg
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - llm
  - openai
  - gpt
  - ModelEvaluation
  target:
  - selector:
    - endContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[2]
      endOffset: 300
      startContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[1]
      startOffset: 0
      type: RangeSelector
    - end: 5998
      start: 5517
      type: TextPositionSelector
    - exact: "To benchmark GPT-4\u2019s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set \u2014 or at least partly memorize them, enough that it can fill in what it can\u2019t recall."
      prefix: 'm 1: training data contamination'
      suffix: As further evidence for this hyp
      type: TextQuoteSelector
    source: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
  text: OpenAI was only able to pass questions available before september 2021 and
    failed to answer new questions - strongly suggesting that it has simply memorised
    the answers as part of its training
  updated: '2023-03-21T06:26:57.441600+00:00'
  uri: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
tags:
- llm
- openai
- gpt
- ModelEvaluation
- hypothesis
type: annotation
url: /annotations/2023/03/21/1679379947
---
To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.
GPT-4 was only able to pass questions published before September 2021 (its training data cutoff) and failed to answer newer questions, strongly suggesting that it has simply memorised the answers as part of its training data.
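The contamination check described here — comparing solve rates on problems published before and after the model's training cutoff — can be sketched in a few lines. This is a minimal illustration only; the `solve_rates` helper and the example data are hypothetical, not the actual evaluation harness used by OpenAI or Horace He:

```python
from datetime import date

# GPT-4's reported training data cutoff (assumed as 1 September 2021 here).
CUTOFF = date(2021, 9, 1)

def solve_rates(results, cutoff=CUTOFF):
    """Split benchmark results at the training cutoff and compare solve rates.

    `results` is a list of (published: date, solved: bool) pairs.
    Returns (pre_cutoff_rate, post_cutoff_rate). A large gap between the
    two suggests memorisation of training data rather than genuine ability.
    """
    pre = [solved for published, solved in results if published < cutoff]
    post = [solved for published, solved in results if published >= cutoff]

    def rate(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return rate(pre), rate(post)

# Illustrative numbers matching the pattern above:
# 10/10 pre-cutoff problems solved, 0/10 post-cutoff problems solved.
results = [(date(2020, 1, 1), True)] * 10 + [(date(2022, 1, 1), False)] * 10
print(solve_rates(results))  # (1.0, 0.0)
```

A flat capability would give similar rates on both sides of the cutoff; the stark 1.0 vs 0.0 split is what makes the memorisation explanation so compelling.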