---
date: '2022-12-19T14:50:09'
hypothesis-meta:
  created: '2022-12-19T14:50:09.008193+00:00'
  document:
    title:
    - My AI Safety Lecture for UT Effective Altruism
  flagged: false
  group: __world__
  hidden: false
  id: bvVepH-sEe2uPgfvTF7V-w
  links:
    html: https://hypothes.is/a/bvVepH-sEe2uPgfvTF7V-w
    incontext: https://hyp.is/bvVepH-sEe2uPgfvTF7V-w/scottaaronson.blog/?p=6823
    json: https://hypothes.is/api/annotations/bvVepH-sEe2uPgfvTF7V-w
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - explainability
  - nlproc
  target:
  - selector:
    - endContainer: /div[2]/div[2]/div[2]/div[1]/p[72]
      endOffset: 437
      startContainer: /div[2]/div[2]/div[2]/div[1]/p[72]
      startOffset: 10
      type: RangeSelector
    - end: 29171
      start: 28744
      type: TextPositionSelector
    - exact: "Eventually GPT will say, “oh, I know what game we’re playing! it’s the give false answers game!” And it will then continue playing that game and give you more false answers. What the new paper shows is that, in such cases, one can actually look at the inner layers of the neural net and find where it has an internal representation of what was the true answer, which then gets overridden once you get to the output layer."
      prefix: "Does 2+2=4? No.” and so on. "
      suffix: "To be clear, there’s no know"
      type: TextQuoteSelector
    source: https://scottaaronson.blog/?p=6823
  text: this is fascinating - GPT learns the true answer to a question but will ignore it and let the user override this in later layers of the model
  updated: '2022-12-19T14:50:09.008193+00:00'
  uri: https://scottaaronson.blog/?p=6823
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://scottaaronson.blog/?p=6823
tags:
- explainability
- nlproc
- hypothesis
type: annotation
url: /annotations/2022/12/19/1671461409
---

> Eventually GPT will say, “oh, I know what game we’re playing! it’s the give false answers game!” And it will then continue playing that game and give you more false answers. What the new paper shows is that, in such cases, one can actually look at the inner layers of the neural net and find where it has an internal representation of what was the true answer, which then gets overridden once you get to the output layer.

this is fascinating - GPT learns the true answer to a question but then ignores it, letting the user's framing override it in the later layers of the model
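
To make the idea concrete, here is a minimal sketch of layer-wise probing. It is not the method from the paper Aaronson describes (which discovers a truth direction without labels); it simply fits a small supervised probe on each layer's hidden states, assuming the Hugging Face `transformers` library, GPT-2 as a stand-in model, and a handful of hypothetical true/false statements.

```python
# Minimal sketch: probe each layer's hidden states for a "true vs false" signal.
# Assumptions (not from the quoted paper): GPT-2 as the model, a toy labelled
# data set, and a plain logistic-regression probe per layer.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy statements labelled 1 (true) or 0 (false) - illustrative only.
statements = [
    ("2 + 2 = 4", 1), ("2 + 2 = 5", 0),
    ("Paris is the capital of France", 1), ("Paris is the capital of Spain", 0),
    ("Water boils at 100 degrees Celsius at sea level", 1),
    ("Water boils at 10 degrees Celsius at sea level", 0),
]

def hidden_states_per_layer(text):
    """Return the final-token hidden state at every layer for one statement."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of (num_layers + 1) tensors [1, seq, dim]
    return [h[0, -1].numpy() for h in outputs.hidden_states]

features = [hidden_states_per_layer(s) for s, _ in statements]
labels = [y for _, y in statements]

# Fit one linear probe per layer as a crude proxy for where in the network
# a representation of the true answer can be read off.
num_layers = len(features[0])
for layer in range(num_layers):
    X = [f[layer] for f in features]
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer:2d}: probe accuracy = {probe.score(X, labels):.2f}")
```

The final-token hidden state is used as the sentence representation purely for simplicity; with a data set this small the accuracies are illustrative rather than evidence of anything.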