---
date: '2023-03-21T06:25:47'
hypothesis-meta:
  created: '2023-03-21T06:25:47.417575+00:00'
  document:
    title:
    - 'GPT-4 and professional benchmarks: the wrong answer to the wrong question'
  flagged: false
  group: __world__
  hidden: false
  id: N6BVsMexEe2Z4X92AfjYDg
  links:
    html: https://hypothes.is/a/N6BVsMexEe2Z4X92AfjYDg
    incontext: https://hyp.is/N6BVsMexEe2Z4X92AfjYDg/aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
    json: https://hypothes.is/api/annotations/N6BVsMexEe2Z4X92AfjYDg
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - llm
  - openai
  - gpt
  - ModelEvaluation
  target:
  - selector:
    - endContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[2]
      endOffset: 300
      startContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[1]
      startOffset: 0
      type: RangeSelector
    - end: 5998
      start: 5517
      type: TextPositionSelector
    - exact: "To benchmark GPT-4\u2019s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set \u2014 or at least partly memorize them, enough that it can fill in what it can\u2019t recall."
      prefix: 'm 1: training data contamination'
      suffix: As further evidence for this hyp
      type: TextQuoteSelector
    source: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
  text: OpenAI was only able to pass questions available before september 2021 and
    failed to answer new questions - strongly suggesting that it has simply memorised
    the answers as part of its training
  updated: '2023-03-21T06:26:57.441600+00:00'
  uri: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
tags:
- llm
- openai
- gpt
- ModelEvaluation
- hypothesis
type: annotation
url: /annotations/2023/03/21/1679379947
---
> To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

GPT-4 was only able to pass questions published before September 2021 and failed to answer newer ones, strongly suggesting that it has simply memorised the answers as part of its training.
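The contamination check Horace He performed amounts to splitting the benchmark by problem publication date relative to the training cutoff and comparing pass rates. A minimal sketch of that comparison, with hypothetical records standing in for the actual Codeforces results:

```python
from datetime import date

# Hypothetical evaluation records: each benchmark problem has a publication
# date and whether the model solved it. These values are illustrative only,
# not OpenAI's or Horace He's actual data.
results = [
    {"published": date(2020, 5, 1), "solved": True},
    {"published": date(2020, 11, 3), "solved": True},
    {"published": date(2022, 2, 14), "solved": False},
    {"published": date(2022, 9, 30), "solved": False},
]

CUTOFF = date(2021, 9, 1)  # GPT-4's reported training data cutoff

def pass_rate(problems):
    """Fraction of problems solved; 0.0 for an empty split."""
    if not problems:
        return 0.0
    return sum(p["solved"] for p in problems) / len(problems)

before = [p for p in results if p["published"] < CUTOFF]
after = [p for p in results if p["published"] >= CUTOFF]

print(f"pre-cutoff pass rate:  {pass_rate(before):.0%}")
print(f"post-cutoff pass rate: {pass_rate(after):.0%}")
# A large gap between the two rates, as seen here, is a red flag that the
# model memorised solutions rather than learning to solve the problems.
```

The point of the split is that problems published after the cutoff cannot appear in the training set, so they act as a control group for memorisation.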