---
date: '2022-12-07T11:55:42'
hypothesis-meta:
  created: '2022-12-07T11:55:42.527155+00:00'
  document:
    title:
    - 2203.15556.pdf
  flagged: false
  group: __world__
  hidden: false
  id: E3TX9nYmEe2IOgdyjyKG9w
  links:
    html: https://hypothes.is/a/E3TX9nYmEe2IOgdyjyKG9w
    incontext: https://hyp.is/E3TX9nYmEe2IOgdyjyKG9w/arxiv.org/pdf/2203.15556.pdf
    json: https://hypothes.is/api/annotations/E3TX9nYmEe2IOgdyjyKG9w
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - nlproc
  - efficient ml
  target:
  - selector:
    - end: 1689
      start: 1063
      type: TextPositionSelector
    - exact: "We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and4\xD7 more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B),Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatlyfacilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher"
      prefix: ' tokens should also be doubled. '
      suffix: .1. IntroductionRecently a serie
      type: TextQuoteSelector
    source: https://arxiv.org/pdf/2203.15556.pdf
  text: By using more data on a smaller language model the authors were able to achieve better performance than with the larger models - this reduces the cost of using the model for inference.
  updated: '2022-12-07T11:55:42.527155+00:00'
  uri: https://arxiv.org/pdf/2203.15556.pdf
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://arxiv.org/pdf/2203.15556.pdf
tags:
- nlproc
- efficient ml
- hypothesis
type: annotation
url: /annotations/2022/12/07/1670414142
---
We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
By training a smaller language model on more data, the authors were able to achieve better performance than the much larger models; the smaller model is also cheaper to fine-tune and to run at inference time.
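To see why the budgets line up, here is a rough back-of-envelope sketch (not code from the paper) using the common C ≈ 6ND approximation for training FLOPs, together with the token counts reported in the paper (roughly 300B tokens for Gopher and 1.4T for Chinchilla):

```python
# Back-of-envelope check of the "same compute budget" claim, assuming the
# common approximation C ≈ 6 * N * D (training FLOPs ≈ 6 × parameters × tokens)
# and the token counts reported in the paper (~300B for Gopher, ~1.4T for Chinchilla).

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs via C ≈ 6ND."""
    return 6 * params * tokens

gopher = train_flops(params=280e9, tokens=300e9)      # ≈ 5.0e23 FLOPs
chinchilla = train_flops(params=70e9, tokens=1.4e12)  # ≈ 5.9e23 FLOPs

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
```

Both models land at roughly the same training budget (a few times 10²³ FLOPs) under this approximation, but because Chinchilla has 4× fewer parameters, every subsequent fine-tuning or inference pass is correspondingly cheaper.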