brainsteam.co.uk/brainsteam/content/annotations/2022/12/19/1671461409.md at 7ec860a6eb566e2cb310ec86938296758ecbbe6a


date

hypothesis-meta

in-reply-to

tags

target

text

updated

uri

user

user_info

2022-12-19T14:50:09.008193+00:00

title

My AI Safety Lecture for UT Effective Altruism

false

__world__

false

bvVepH-sEe2uPgfvTF7V-w

html	incontext	json
https://hypothes.is/a/bvVepH-sEe2uPgfvTF7V-w	https://hyp.is/bvVepH-sEe2uPgfvTF7V-w/scottaaronson.blog/?p=6823	https://hypothes.is/api/annotations/bvVepH-sEe2uPgfvTF7V-w

admin

delete

read

update

acct:ravenscroftj@hypothes.is

group:__world__

acct:ravenscroftj@hypothes.is

explainability

nlproc

selector

source

endContainer	endOffset	startContainer	startOffset	type
/div[2]/div[2]/div[2]/div[1]/p[72]	437	/div[2]/div[2]/div[2]/div[1]/p[72]	10	RangeSelector

end	start	type
29171	28744	TextPositionSelector

exact	prefix	suffix	type
Eventually GPT will say, “oh, I know what game we’re playing! it’s the ‘give false answers’ game!” And it will then continue playing that game and give you more false answers. What the new paper shows is that, in such cases, one can actually look at the inner layers of the neural net and find where it has an internal representation of what was the true answer, which then gets overridden once you get to the output layer.	Does 2+2=4? No.” and so on.	To be clear, there’s no know	TextQuoteSelector

https://scottaaronson.blog/?p=6823

this is fascinating - GPT learns the true answer to a question but will ignore it and let the user override this in later layers of the model

2022-12-19T14:50:09.008193+00:00

https://scottaaronson.blog/?p=6823

acct:ravenscroftj@hypothes.is

display_name
James Ravenscroft

https://scottaaronson.blog/?p=6823

explainability

nlproc

hypothesis

annotation

/annotations/2022/12/19/1671461409

Eventually GPT will say, “oh, I know what game we’re playing! it’s the ‘give false answers’ game!” And it will then continue playing that game and give you more false answers. What the new paper shows is that, in such cases, one can actually look at the inner layers of the neural net and find where it has an internal representation of what was the true answer, which then gets overridden once you get to the output layer.

This is fascinating: GPT holds an internal representation of the true answer to a question, but later layers of the model override it, so the output follows the "give false answers" pattern that the user set up in the prompt.
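To make the idea of "looking at the inner layers" concrete, here is a minimal sketch of a linear probe. It is not the method from the paper that the quoted passage refers to, and it does not touch a real GPT model. Everything in it is an assumption for illustration: the activations are synthetic, the helper names (`make_toy_activations`, `fit_linear_probe`, `probe_accuracy`) are hypothetical, and the toy "output layer" is simply defined to give the false answer every time.

```python
import numpy as np


def make_toy_activations(n=400, dim=16, seed=0):
    """Build synthetic stand-ins for inner-layer activations (not real GPT data)."""
    rng = np.random.default_rng(seed)
    truth = rng.integers(0, 2, size=n)           # 1 = the statement is true
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)       # hypothetical "truth direction"
    signs = 2 * truth - 1                        # map {0, 1} to {-1, +1}
    # Inner layer: Gaussian noise plus a shift along the truth direction.
    hidden = rng.normal(size=(n, dim)) + 3.0 * signs[:, None] * direction
    # Toy output layer: always plays the "give false answers" game.
    output = 1 - truth
    return hidden, truth, output


def fit_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probability of "true"
        grad = p - y                             # gradient of the log loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples where the probe's prediction matches the label."""
    pred = ((x @ w + b) > 0).astype(int)
    return float((pred == y).mean())


if __name__ == "__main__":
    hidden, truth, output = make_toy_activations()
    # Train the probe on the first 300 examples, evaluate on the rest.
    w, b = fit_linear_probe(hidden[:300], truth[:300])
    print("probe accuracy on held-out inner activations:",
          probe_accuracy(hidden[300:], truth[300:], w, b))
    print("accuracy of the toy output layer:",
          float((output == truth).mean()))
```

In this toy setup the probe recovers the true answer from the inner activations even though the output is wrong on every example, which is the shape of the claim in the quote. Whether a real model's layers behave this way is an empirical question that this sketch does not answer.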
