---
date: '2022-12-19T14:50:09'
hypothesis-meta:
  created: '2022-12-19T14:50:09.008193+00:00'
  document:
    title:
    - My AI Safety Lecture for UT Effective Altruism
  flagged: false
  group: __world__
  hidden: false
  id: bvVepH-sEe2uPgfvTF7V-w
  links:
    html: https://hypothes.is/a/bvVepH-sEe2uPgfvTF7V-w
    incontext: https://hyp.is/bvVepH-sEe2uPgfvTF7V-w/scottaaronson.blog/?p=6823
    json: https://hypothes.is/api/annotations/bvVepH-sEe2uPgfvTF7V-w
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - explainability
  - nlproc
  target:
  - selector:
    - endContainer: /div[2]/div[2]/div[2]/div[1]/p[72]
      endOffset: 437
      startContainer: /div[2]/div[2]/div[2]/div[1]/p[72]
      startOffset: 10
      type: RangeSelector
    - end: 29171
      start: 28744
      type: TextPositionSelector
    - exact: " Eventually GPT will say, \u201Coh, I know what game we\u2019re playing!\
        \ it\u2019s the \u2018give false answers\u2019 game!\u201D And it will then\
        \ continue playing that game and give you more false answers. What the new\
        \ paper shows is that, in such cases, one can actually look at the inner layers\
        \ of the neural net and find where it has an internal representation of what\
        \ was the true answer, which then gets overridden once you get to the output\
        \ layer."
      prefix: "Does 2+2=4? No.\u201D\n\n\n\n\nand so on."
      suffix: "\n\n\n\nTo be clear, there\u2019s no know"
      type: TextQuoteSelector
    source: https://scottaaronson.blog/?p=6823
  text: this is fascinating - GPT learns the true answer to a question but will ignore
    it and let the user override this in later layers of the model
  updated: '2022-12-19T14:50:09.008193+00:00'
  uri: https://scottaaronson.blog/?p=6823
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://scottaaronson.blog/?p=6823
tags:
- explainability
- nlproc
- hypothesis
type: annotation
url: /annotations/2022/12/19/1671461409
---
<blockquote>Eventually GPT will say, “oh, I know what game we’re playing! it’s the ‘give false answers’ game!” And it will then continue playing that game and give you more false answers. What the new paper shows is that, in such cases, one can actually look at the inner layers of the neural net and find where it has an internal representation of what was the true answer, which then gets overridden once you get to the output layer.</blockquote>

This is fascinating: GPT internally represents the true answer to a question, but later layers of the model override it so that the output goes along with the pattern the user has set up.
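
To make "looking at the inner layers" a bit more concrete, here is a minimal sketch of a supervised linear probe on synthetic data. It is a toy and not the method from the paper mentioned in the quote: no real GPT model is involved, and the simulated hidden activations, the "truth direction", and the helper names (`make_examples`, `train_probe`, `probe_predict`) are all assumptions made up for illustration. The point is only that a simple classifier can sometimes read the true answer out of an internal state even when the final output disagrees with it.

```python
import math
import random


def make_examples(n, dim, rng):
    """Simulate hidden activations that encode a truth label along one direction."""
    # Hypothetical "truth direction" in activation space (an assumption for this toy).
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    examples = []
    for _ in range(n):
        truth = rng.randint(0, 1)
        sign = 1.0 if truth == 1 else -1.0
        # Hidden state = signed truth direction plus Gaussian noise.
        hidden = [sign * d + rng.gauss(0.0, 1.0) for d in direction]
        # The simulated output layer plays the "give false answers" game.
        output = 1 - truth
        examples.append((hidden, truth, output))
    return examples


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(examples, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for hidden, truth, _ in examples:
            pred = sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)
            err = pred - truth
            weights = [w - lr * err * h for w, h in zip(weights, hidden)]
            bias -= lr * err
    return weights, bias


def probe_predict(weights, bias, hidden):
    """Read a 0/1 truth guess out of a hidden state."""
    return 1 if sum(w * h for w, h in zip(weights, hidden)) + bias > 0 else 0


rng = random.Random(0)
DIM = 8
train = make_examples(200, DIM, rng)
test = make_examples(200, DIM, rng)

weights, bias = train_probe(train, DIM)

# Compare what the probe recovers from the hidden state with what the output says.
probe_acc = sum(probe_predict(weights, bias, h) == t for h, t, _ in test) / len(test)
output_acc = sum(o == t for _, t, o in test) / len(test)

print(f"probe accuracy on hidden state: {probe_acc:.2f}")
print(f"output-layer accuracy:          {output_acc:.2f}")
```

In this toy setup the output is wrong by construction, while the probe should recover the truth label from the hidden state most of the time. With a real model the hidden states would come from the network's intermediate layers rather than from `make_examples`, and whether a clean "truth direction" exists there is an empirical question.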