---
date: '2023-03-21T06:25:47'
hypothesis-meta:
  created: '2023-03-21T06:25:47.417575+00:00'
  document:
    title:
    - 'GPT-4 and professional benchmarks: the wrong answer to the wrong question'
  flagged: false
  group: __world__
  hidden: false
  id: N6BVsMexEe2Z4X92AfjYDg
  links:
    html: https://hypothes.is/a/N6BVsMexEe2Z4X92AfjYDg
    incontext: https://hyp.is/N6BVsMexEe2Z4X92AfjYDg/aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
    json: https://hypothes.is/api/annotations/N6BVsMexEe2Z4X92AfjYDg
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - llm
  - openai
  - gpt
  - ModelEvaluation
  target:
  - selector:
    - endContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[2]
      endOffset: 300
      startContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[1]
      startOffset: 0
      type: RangeSelector
    - end: 5998
      start: 5517
      type: TextPositionSelector
    - exact: "To benchmark GPT-4\u2019s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set \u2014 or at least partly memorize them, enough that it can fill in what it can\u2019t recall."
      prefix: 'm 1: training data contamination'
      suffix: As further evidence for this hyp
      type: TextQuoteSelector
    source: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
  text: OpenAI was only able to pass questions available before september 2021 and
    failed to answer new questions - strongly suggesting that it has simply memorised
    the answers as part of its training
  updated: '2023-03-21T06:26:57.441600+00:00'
  uri: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
tags:
- llm
- openai
- gpt
- ModelEvaluation
- hypothesis
type: annotation
url: /annotations/2023/03/21/1679379947
---
To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.
GPT-4 was only able to pass questions published before September 2021 (its training data cutoff) and failed to answer newer questions, strongly suggesting that it has simply memorised the answers as part of its training data.
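The contamination check described here — comparing solve rates on problems published before and after the model's training cutoff — can be sketched in a few lines. This is a minimal illustration only; the `solve_rates` helper and the example data are hypothetical, not the actual evaluation harness used by OpenAI or Horace He:

```python
from datetime import date

# GPT-4's reported training data cutoff (assumed as 1 September 2021 here).
CUTOFF = date(2021, 9, 1)

def solve_rates(results, cutoff=CUTOFF):
    """Split benchmark results at the training cutoff and compare solve rates.

    `results` is a list of (published: date, solved: bool) pairs.
    Returns (pre_cutoff_rate, post_cutoff_rate). A large gap between the
    two suggests memorisation of training data rather than genuine ability.
    """
    pre = [solved for published, solved in results if published < cutoff]
    post = [solved for published, solved in results if published >= cutoff]

    def rate(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return rate(pre), rate(post)

# Illustrative numbers matching the pattern above:
# 10/10 pre-cutoff problems solved, 0/10 post-cutoff problems solved.
results = [(date(2020, 1, 1), True)] * 10 + [(date(2022, 1, 1), False)] * 10
print(solve_rates(results))  # (1.0, 0.0)
```

A flat capability would give similar rates on both sides of the cutoff; the stark 1.0 vs 0.0 split is what makes the memorisation explanation so compelling.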