brainsteam.co.uk/brainsteam/content/annotations/2022/12/19/1671461885.md

---
date: '2022-12-19T14:58:05'
hypothesis-meta:
  created: '2022-12-19T14:58:05.006973+00:00'
  document:
    title:
    - My AI Safety Lecture for UT Effective Altruism
  flagged: false
  group: __world__
  hidden: false
  id: iqqNRH-tEe2fKTMGgQumvA
  links:
    html: https://hypothes.is/a/iqqNRH-tEe2fKTMGgQumvA
    incontext: https://hyp.is/iqqNRH-tEe2fKTMGgQumvA/scottaaronson.blog/?p=6823
    json: https://hypothes.is/api/annotations/iqqNRH-tEe2fKTMGgQumvA
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - explainability
  - nlproc
  target:
  - selector:
    - endContainer: /div[2]/div[2]/div[2]/div[1]/p[100]
      endOffset: 429
      startContainer: /div[2]/div[2]/div[2]/div[1]/p[100]
      startOffset: 0
      type: RangeSelector
    - end: 41343
      start: 40914
      type: TextPositionSelector
    - exact: "Now, this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT's output—well okay, we're not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it's robust against those sorts of interventions."
      prefix: "which parts probably didn't. "
      suffix: "The hope is that this can be"
      type: TextQuoteSelector
    source: https://scottaaronson.blog/?p=6823
  text: this mechanism can be defeated by paraphrasing the output with another model
  updated: '2022-12-19T14:58:05.006973+00:00'
  uri: https://scottaaronson.blog/?p=6823
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://scottaaronson.blog/?p=6823
tags:
- explainability
- nlproc
- hypothesis
type: annotation
url: /annotations/2022/12/19/1671461885
---
> Now, this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT's output—well okay, we're not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it's robust against those sorts of interventions.
This mechanism can be defeated by paraphrasing the output with another model.
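
The robustness claim in the quote can be illustrated with a toy sketch. This is not Aaronson's actual scheme (which biases the sampler's token choices using a key); the `ngram_score` keyed hash and all names here are hypothetical stand-ins. The point it demonstrates is structural: a detection statistic that is a sum over n-grams only loses the contribution of the n-grams an edit touches, so small insertions or deletions barely move it, while a full paraphrase replaces nearly every n-gram.

```python
import hashlib


def ngram_score(ngram, key="secret"):
    """Hypothetical keyed pseudorandom score in [0, 1) for one n-gram.

    In a real watermarking scheme the generator would be biased toward
    token choices that make this score high.
    """
    digest = hashlib.sha256((key + "|" + " ".join(ngram)).encode()).hexdigest()
    return int(digest, 16) / float(16 ** 64)


def detect_score(text, n=3, key="secret"):
    """Average the keyed score over all n-grams in the text.

    Watermarked text would score abnormally high; the statistic degrades
    only in proportion to how many n-grams an edit disturbs.
    """
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return sum(ngram_score(g, key) for g in grams) / len(grams)
```

The locality of the statistic is easy to see directly: changing one word in a text disturbs at most `n` of its n-grams, so the sum (and the watermark signal) survives small edits.

```python
words = "one two three four five six seven eight".split()
edited = list(words)
edited[3] = "FOUR"  # a single-word edit

before = {tuple(words[i:i + 3]) for i in range(len(words) - 2)}
after = {tuple(edited[i:i + 3]) for i in range(len(edited) - 2)}
print(len(before - after))  # → 3: only the n-grams overlapping the edit change
```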