---
date: '2022-12-19T14:46:26'
hypothesis-meta:
  created: '2022-12-19T14:46:26.361697+00:00'
  document:
    title:
    - My AI Safety Lecture for UT Effective Altruism
  flagged: false
  group: __world__
  hidden: false
  id: 6k0-pn-rEe20ccNOEgwbaQ
  links:
    html: https://hypothes.is/a/6k0-pn-rEe20ccNOEgwbaQ
    incontext: https://hyp.is/6k0-pn-rEe20ccNOEgwbaQ/scottaaronson.blog/?p=6823
    json: https://hypothes.is/api/annotations/6k0-pn-rEe20ccNOEgwbaQ
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - nlproc
  - explainability
  target:
  - selector:
    - endContainer: /div[2]/div[2]/div[2]/div[1]/p[68]
      endOffset: 803
      startContainer: /div[2]/div[2]/div[2]/div[1]/p[68]
      startOffset: 0
      type: RangeSelector
    - end: 27975
      start: 27172
      type: TextPositionSelector
    - exact: (3) A third direction, and I would say maybe the most popular one in AI alignment research right now, is called interpretability. This is also a major direction in mainstream machine learning research, so there’s a big point of intersection there. The idea of interpretability is, why don’t we exploit the fact that we actually have complete access to the code of the AI—or if it’s a neural net, complete access to its parameters? So we can look inside of it. We can do the AI analogue of neuroscience. Except, unlike an fMRI machine, which gives you only an extremely crude snapshot of what a brain is doing, we can see exactly what every neuron in a neural net is doing at every point in time. If we don’t exploit that, then aren’t we trying to make AI safe with our hands tied behind our backs?
      prefix: 'take over the world, right? '
      suffix: So we should look inside—but
      type: TextQuoteSelector
    source: https://scottaaronson.blog/?p=6823
  text: Interesting metaphor - it is a bit like MRI for neural networks but actually more accurate/powerful
  updated: '2022-12-19T14:46:26.361697+00:00'
  uri: https://scottaaronson.blog/?p=6823
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://scottaaronson.blog/?p=6823
tags:
- nlproc
- explainability
- hypothesis
type: annotation
url: /annotations/2022/12/19/1671461186
---

> (3) A third direction, and I would say maybe the most popular one in AI alignment research right now, is called interpretability. This is also a major direction in mainstream machine learning research, so there’s a big point of intersection there. The idea of interpretability is, why don’t we exploit the fact that we actually have complete access to the code of the AI—or if it’s a neural net, complete access to its parameters? So we can look inside of it. We can do the AI analogue of neuroscience. Except, unlike an fMRI machine, which gives you only an extremely crude snapshot of what a brain is doing, we can see exactly what every neuron in a neural net is doing at every point in time. If we don’t exploit that, then aren’t we trying to make AI safe with our hands tied behind our backs?
Interesting metaphor - it is a bit like MRI for neural networks but actually more accurate/powerful
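
To make the "complete access" point concrete: in a framework like PyTorch you can attach a forward hook to every layer and read off the activation of every unit for any input - the software analogue of the scan Aaronson describes, but exact rather than a crude snapshot. The sketch below is a minimal illustration using a made-up toy model; the same hook pattern works on any `nn.Module`.

```python
import torch
import torch.nn as nn

# Toy stand-in model (hypothetical) - the hook pattern below works for any nn.Module.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

# Capture the output of every sub-module: an exact, per-unit readout of the
# network's internal state for a given input.
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(save_activation(name))

x = torch.randn(1, 16)
model(x)

for name, act in activations.items():
    print(f"{name}: shape={tuple(act.shape)}, mean activation={act.mean().item():.3f}")
```

Getting the numbers out is the easy part; the interpretability research Aaronson refers to is about turning that raw readout into human-legible explanations.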