brainsteam.co.uk/brainsteam/content/annotations/2022/12/19/1671461186.md at c5f12bd8ebe06b9bb719286623c58884ac344a06


date

hypothesis-meta

in-reply-to

tags

target

text

updated

uri

user

user_info

2022-12-19T14:46:26.361697+00:00

title

My AI Safety Lecture for UT Effective Altruism

false

__world__

false

6k0-pn-rEe20ccNOEgwbaQ

html	incontext	json
https://hypothes.is/a/6k0-pn-rEe20ccNOEgwbaQ	https://hyp.is/6k0-pn-rEe20ccNOEgwbaQ/scottaaronson.blog/?p=6823	https://hypothes.is/api/annotations/6k0-pn-rEe20ccNOEgwbaQ

admin

delete

read

update

acct:ravenscroftj@hypothes.is

group:__world__

acct:ravenscroftj@hypothes.is

nlproc

explainability

selector

source

endContainer	endOffset	startContainer	startOffset	type
/div[2]/div[2]/div[2]/div[1]/p[68]	803	/div[2]/div[2]/div[2]/div[1]/p[68]	0	RangeSelector

end	start	type
27975	27172	TextPositionSelector

exact	prefix	suffix	type
(3) A third direction, and I would say maybe the most popular one in AI alignment research right now, is called interpretability. This is also a major direction in mainstream machine learning research, so there’s a big point of intersection there. The idea of interpretability is, why don’t we exploit the fact that we actually have complete access to the code of the AI—or if it’s a neural net, complete access to its parameters? So we can look inside of it. We can do the AI analogue of neuroscience. Except, unlike an fMRI machine, which gives you only an extremely crude snapshot of what a brain is doing, we can see exactly what every neuron in a neural net is doing at every point in time. If we don’t exploit that, then aren’t we trying to make AI safe with our hands tied behind our backs?	take over the world, right?	So we should look inside—but	TextQuoteSelector

https://scottaaronson.blog/?p=6823

Interesting metaphor - it is a bit like MRI for neural networks but actually more accurate/powerful

2022-12-19T14:46:26.361697+00:00

https://scottaaronson.blog/?p=6823

acct:ravenscroftj@hypothes.is

display_name
James Ravenscroft

https://scottaaronson.blog/?p=6823

nlproc

explainability

hypothesis

annotation

/annotations/2022/12/19/1671461186

> (3) A third direction, and I would say maybe the most popular one in AI alignment research right now, is called interpretability. This is also a major direction in mainstream machine learning research, so there’s a big point of intersection there. The idea of interpretability is, why don’t we exploit the fact that we actually have complete access to the code of the AI—or if it’s a neural net, complete access to its parameters? So we can look inside of it. We can do the AI analogue of neuroscience. Except, unlike an fMRI machine, which gives you only an extremely crude snapshot of what a brain is doing, we can see exactly what every neuron in a neural net is doing at every point in time. If we don’t exploit that, then aren’t we trying to make AI safe with our hands tied behind our backs?

Interesting metaphor: interpretability is a bit like fMRI for neural networks, but more accurate and powerful, because the activity of every neuron can be observed directly.
