---
date: '2023-03-21T19:59:04'
hypothesis-meta:
  created: '2023-03-21T19:59:04.177001+00:00'
  document:
    title:
    - 2303.09752.pdf
  flagged: false
  group: __world__
  hidden: false
  id: 1MB9BMgiEe27GS99BvTIlA
  links:
    html: https://hypothes.is/a/1MB9BMgiEe27GS99BvTIlA
    incontext: https://hyp.is/1MB9BMgiEe27GS99BvTIlA/arxiv.org/pdf/2303.09752.pdf
    json: https://hypothes.is/api/annotations/1MB9BMgiEe27GS99BvTIlA
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - llm
  - attention
  - long-documents
  target:
  - selector:
    - end: 1989
      start: 1515
      type: TextPositionSelector
    - exact: "Over the past few years, many \u201Cefficient Trans-former\u201D approaches have been proposed that re-duce the cost of the attention mechanism over longinputs (Child et al., 2019; Ainslie et al., 2020; Belt-agy et al., 2020; Zaheer et al., 2020; Wang et al.,2020; Tay et al., 2021; Guo et al., 2022). However,especially for larger models, the feedforward andprojection layers actually make up the majority ofthe computational burden and can render process-ing long inputs intractable"
      prefix: ' be applied to each input token.'
      suffix: ".\u2217Author contributions are outli"
      type: TextQuoteSelector
    source: https://arxiv.org/pdf/2303.09752.pdf
  text: Recent improvements in transformers for long documents have focused on efficiencies in the attention mechanism but the feed-forward and projection layers are still expensive for long docs
  updated: '2023-03-21T19:59:04.177001+00:00'
  uri: https://arxiv.org/pdf/2303.09752.pdf
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://arxiv.org/pdf/2303.09752.pdf
tags:
- llm
- attention
- long-documents
- hypothesis
type: annotation
url: /annotations/2023/03/21/1679428744
---
Over the past few years, many “efficient Transformer” approaches have been proposed that reduce the cost of the attention mechanism over long inputs (Child et al., 2019; Ainslie et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Wang et al., 2020; Tay et al., 2021; Guo et al., 2022). However, especially for larger models, the feedforward and projection layers actually make up the majority of the computational burden and can render processing long inputs intractable.
Recent work on efficient Transformers for long documents has focused on reducing the cost of the attention mechanism, but the feed-forward and projection layers remain expensive and, especially for larger models, account for most of the compute on long inputs.
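
To see why, here is a rough back-of-envelope sketch (mine, not from the paper): it counts multiply-accumulates per Transformer layer, assumes a standard block with a feed-forward width of `d_ff = 4 * d`, and ignores softmax, normalisation and per-head bookkeeping. The attention matrix costs roughly `2·n²·d`, while the QKV/output projections plus the feed-forward layer cost roughly `4·n·d² + 2·n·d·d_ff`, so the dense layers dominate whenever the model width `d` is large relative to the sequence length `n`.

```python
# Back-of-envelope per-layer cost comparison (multiply-accumulates).
# Assumptions: standard Transformer block, d_ff = 4 * d, softmax and
# normalisation ignored. Illustrative only, not measured numbers.

def per_layer_macs(n, d, d_ff=None):
    d_ff = d_ff or 4 * d
    attn_matrix = 2 * n * n * d       # QK^T scores + attention-weighted values
    projections = 4 * n * d * d       # Q, K, V and output projections
    feed_forward = 2 * n * d * d_ff   # the two dense layers of the FFN
    return attn_matrix, projections, feed_forward

for d in (768, 4096):                 # a "base"-sized vs a large model width
    for n in (4096, 16384):           # long-input sequence lengths
        attn, proj, ffn = per_layer_macs(n, d)
        print(f"d={d:5d} n={n:6d}  attention matrix={attn:.2e}  "
              f"projections+FFN={proj + ffn:.2e}")
```

With these (assumed) settings, the attention matrix dominates for the small width, but at `d=4096` the projections and feed-forward layer already cost more than the attention matrix even at 16k tokens, which is consistent with the quoted claim about larger models.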