---
date: '2022-11-28T11:34:45'
hypothesis-meta:
  created: '2022-11-28T11:34:45.963292+00:00'
  document:
    title:
    - 1809.09672.pdf
  flagged: false
  group: __world__
  hidden: false
  id: qMPVfG8QEe2WJWufCDu9ww
  links:
    html: https://hypothes.is/a/qMPVfG8QEe2WJWufCDu9ww
    incontext: https://hyp.is/qMPVfG8QEe2WJWufCDu9ww/arxiv.org/pdf/1809.09672.pdf
    json: https://hypothes.is/api/annotations/qMPVfG8QEe2WJWufCDu9ww
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - rl
  - bandit
  - nlproc
  - summarization
  target:
  - selector:
    - end: 10089
      start: 9945
      type: TextPositionSelector
    - exact: andit is a decision-making formal-ization in which an agent repeatedly chooses oneof several actions, and receives a reward based onthis choice.
      prefix: dient reinforcementlearning. A b
      suffix: " The agent\u2019s goal is to quickly "
      type: TextQuoteSelector
    source: https://arxiv.org/pdf/1809.09672.pdf
  text: 'Definition for contextual bandit: an agent that repeatedly choses one of several actions and receives a reward based on this choice.'
  updated: '2022-11-28T11:34:45.963292+00:00'
  uri: https://arxiv.org/pdf/1809.09672.pdf
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://arxiv.org/pdf/1809.09672.pdf
tags:
- rl
- bandit
- nlproc
- summarization
- hypothesis
type: annotation
url: /annotations/2022/11/28/1669635285
---
"A bandit is a decision-making formalization in which an agent repeatedly chooses one of several actions, and receives a reward based on this choice."

Definition of a bandit: an agent that repeatedly chooses one of several actions and receives a reward based on that choice. A contextual bandit additionally observes a context before each choice.
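
To make the definition concrete, here is a minimal sketch of the choose-then-get-rewarded loop. It is not the method of the annotated paper: the epsilon-greedy strategy, the Bernoulli rewards, and the name `run_bandit` with all of its parameters are assumptions chosen only for illustration.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Bernoulli bandit (illustrative assumption).

    reward_probs[a] is the hypothetical probability that action a pays 1.0.
    """
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions      # how often each action was chosen
    values = [0.0] * n_actions    # running mean reward per action
    total_reward = 0.0

    for _ in range(steps):
        # The agent chooses one of several actions: explore at random with
        # probability epsilon, otherwise pick the best current estimate.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: values[a])

        # It receives a reward that depends only on the chosen action.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Incremental update of the mean reward for that action.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward

    return counts, values, total_reward


if __name__ == "__main__":
    counts, values, total = run_bandit([0.2, 0.5, 0.8])
    print("pulls per action:", counts)
    print("estimated values:", [round(v, 2) for v in values])
    print("total reward:", total)
```

Over many steps the agent should concentrate its choices on the action with the highest reward probability. A contextual variant would pass an observed context into the action choice at each step, which this sketch leaves out.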