---
date: '2022-12-19T14:58:05'
hypothesis-meta:
  created: '2022-12-19T14:58:05.006973+00:00'
  document:
    title:
    - My AI Safety Lecture for UT Effective Altruism
  flagged: false
  group: __world__
  hidden: false
  id: iqqNRH-tEe2fKTMGgQumvA
  links:
    html: https://hypothes.is/a/iqqNRH-tEe2fKTMGgQumvA
    incontext: https://hyp.is/iqqNRH-tEe2fKTMGgQumvA/scottaaronson.blog/?p=6823
    json: https://hypothes.is/api/annotations/iqqNRH-tEe2fKTMGgQumvA
  permissions:
    admin:
    - acct:ravenscroftj@hypothes.is
    delete:
    - acct:ravenscroftj@hypothes.is
    read:
    - group:__world__
    update:
    - acct:ravenscroftj@hypothes.is
  tags:
  - explainability
  - nlproc
  target:
  - selector:
    - endContainer: /div[2]/div[2]/div[2]/div[1]/p[100]
      endOffset: 429
      startContainer: /div[2]/div[2]/div[2]/div[1]/p[100]
      startOffset: 0
      type: RangeSelector
    - end: 41343
      start: 40914
      type: TextPositionSelector
    - exact: "Now, this can all be defeated with enough effort. For example, if you\
        \ used another AI to paraphrase GPT\u2019s output\u2014well okay, we\u2019\
        re not going to be able to detect that. On the other hand, if you just insert\
        \ or delete a few words here and there, or rearrange the order of some sentences,\
        \ the watermarking signal will still be there. Because it depends only on\
        \ a sum over n-grams, it\u2019s robust against those sorts of interventions."
      prefix: "which parts probably didn\u2019t.\n\n\n\n"
      suffix: '
        The hope is that this can be'
      type: TextQuoteSelector
    source: https://scottaaronson.blog/?p=6823
  text: this mechanism can be defeated by paraphrasing the output with another model
  updated: '2022-12-19T14:58:05.006973+00:00'
  uri: https://scottaaronson.blog/?p=6823
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://scottaaronson.blog/?p=6823
tags:
- explainability
- nlproc
- hypothesis
type: annotation
url: /annotations/2022/12/19/1671461885
---
<blockquote>Now, this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT's output—well okay, we're not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it's robust against those sorts of interventions.</blockquote>this mechanism can be defeated by paraphrasing the output with another model
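To see why a sum-over-n-grams statistic survives small edits, here is a minimal sketch (not Aaronson's actual scheme, which keys a pseudorandom function to the model's sampling process; the hash, key, and scoring below are invented purely for illustration). The idea: each n-gram gets a keyed pseudorandom score, and the detector averages those scores. Changing one word only perturbs the handful of n-grams that contain it, so the overall average barely moves.

```python
import hashlib

def ngram_score(ngram: tuple, key: str = "secret-key") -> float:
    """Keyed pseudorandom score in [0, 1) for a single n-gram (illustrative)."""
    digest = hashlib.sha256((key + "\x00" + " ".join(ngram)).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermark_score(tokens: list, n: int = 4, key: str = "secret-key") -> float:
    """Average pseudorandom score over all n-grams in the text.

    A watermarking sampler would bias generation so its n-grams score high
    on average, while unwatermarked text averages around 0.5. Editing one
    word changes at most n of the n-grams, so the statistic is robust to
    small insertions, deletions, and local rewrites.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return sum(ngram_score(g, key) for g in ngrams) / len(ngrams)
```

Paraphrasing with another model, by contrast, replaces essentially every n-gram at once, which is exactly why it defeats this kind of detector.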