---
date: '2023-03-21T19:59:04'
hypothesis-meta:
  created: '2023-03-21T19:59:04.177001+00:00'
  document:
    title:
    - 2303.09752.pdf
  flagged: false
  group: __world__
  hidden: false
  id: 1MB9BMgiEe27GS99BvTIlA
  links:
    html: https://hypothes.is/a/1MB9BMgiEe27GS99BvTIlA
    incontext: https://hyp.is/1MB9BMgiEe27GS99BvTIlA/arxiv.org/pdf/2303.09752.pdf
    json: https://hypothes.is/api/annotations/1MB9BMgiEe27GS99BvTIlA
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - llm
  - attention
  - long-documents
  target:
  - selector:
    - end: 1989
      start: 1515
      type: TextPositionSelector
    - exact: "Over the past few years, many \u201Cefficient Trans-former\u201D approaches have been proposed that re-duce the cost of the attention mechanism over longinputs (Child et al., 2019; Ainslie et al., 2020; Belt-agy et al., 2020; Zaheer et al., 2020; Wang et al.,2020; Tay et al., 2021; Guo et al., 2022). However,especially for larger models, the feedforward andprojection layers actually make up the majority ofthe computational burden and can render process-ing long inputs intractable"
      prefix: ' be applied to each input token.'
      suffix: ".\u2217Author contributions are outli"
      type: TextQuoteSelector
    source: https://arxiv.org/pdf/2303.09752.pdf
  text: Recent improvements in transformers for long documents have focused on efficiencies in the attention mechanism but the feed-forward and projection layers are still expensive for long docs
  updated: '2023-03-21T19:59:04.177001+00:00'
  uri: https://arxiv.org/pdf/2303.09752.pdf
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://arxiv.org/pdf/2303.09752.pdf
tags:
- llm
- attention
- long-documents
- hypothesis
type: annotation
url: /annotations/2023/03/21/1679428744
---
Over the past few years, many “efficient Transformer” approaches have been proposed that reduce the cost of the attention mechanism over long inputs (Child et al., 2019; Ainslie et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Wang et al., 2020; Tay et al., 2021; Guo et al., 2022). However, especially for larger models, the feedforward and projection layers actually make up the majority of the computational burden and can render processing long inputs intractable.
Recent work on efficient Transformers for long documents has focused on reducing the cost of the attention mechanism, but the feed-forward and projection layers remain expensive and, especially for larger models, account for most of the compute on long inputs.
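
To see why, here is a rough back-of-envelope sketch (mine, not from the paper): it counts multiply-accumulates per Transformer layer, assumes a standard block with a feed-forward width of `d_ff = 4 * d`, and ignores softmax, normalisation and per-head bookkeeping. The attention matrix costs roughly `2·n²·d`, while the QKV/output projections plus the feed-forward layer cost roughly `4·n·d² + 2·n·d·d_ff`, so the dense layers dominate whenever the model width `d` is large relative to the sequence length `n`.

```python
# Back-of-envelope per-layer cost comparison (multiply-accumulates).
# Assumptions: standard Transformer block, d_ff = 4 * d, softmax and
# normalisation ignored. Illustrative only, not measured numbers.

def per_layer_macs(n, d, d_ff=None):
    d_ff = d_ff or 4 * d
    attn_matrix = 2 * n * n * d       # QK^T scores + attention-weighted values
    projections = 4 * n * d * d       # Q, K, V and output projections
    feed_forward = 2 * n * d * d_ff   # the two dense layers of the FFN
    return attn_matrix, projections, feed_forward

for d in (768, 4096):                 # a "base"-sized vs a large model width
    for n in (4096, 16384):           # long-input sequence lengths
        attn, proj, ffn = per_layer_macs(n, d)
        print(f"d={d:5d} n={n:6d}  attention matrix={attn:.2e}  "
              f"projections+FFN={proj + ffn:.2e}")
```

With these (assumed) settings, the attention matrix dominates for the small width, but at `d=4096` the projections and feed-forward layer already cost more than the attention matrix even at 16k tokens, which is consistent with the quoted claim about larger models.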