diff --git a/brainsteam/content/annotations/2023/03/21/1679379947.md b/brainsteam/content/annotations/2023/03/21/1679379947.md new file mode 100644 index 0000000..efae6ed --- /dev/null +++ b/brainsteam/content/annotations/2023/03/21/1679379947.md @@ -0,0 +1,73 @@ +--- +date: '2023-03-21T06:25:47' +hypothesis-meta: + created: '2023-03-21T06:25:47.417575+00:00' + document: + title: + - 'GPT-4 and professional benchmarks: the wrong answer to the wrong question' + flagged: false + group: __world__ + hidden: false + id: N6BVsMexEe2Z4X92AfjYDg + links: + html: https://hypothes.is/a/N6BVsMexEe2Z4X92AfjYDg + incontext: https://hyp.is/N6BVsMexEe2Z4X92AfjYDg/aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + json: https://hypothes.is/api/annotations/N6BVsMexEe2Z4X92AfjYDg + permissions: + admin: + - acct:ravenscroftj@hypothes.is + delete: + - acct:ravenscroftj@hypothes.is + read: + - group:__world__ + update: + - acct:ravenscroftj@hypothes.is + tags: + - llm + - openai + - gpt + - ModelEvaluation + target: + - selector: + - endContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[2] + endOffset: 300 + startContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[4]/span[1] + startOffset: 0 + type: RangeSelector + - end: 5998 + start: 5517 + type: TextPositionSelector + - exact: "To benchmark GPT-4\u2019s coding ability, OpenAI evaluated it on problems\ + \ from Codeforces, a website that hosts coding competitions. Surprisingly,\ + \ Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10\ + \ recent problems in the easy category. The training data cutoff for GPT-4\ + \ is September 2021. This strongly suggests that the model is able to memorize\ + \ solutions from its training set \u2014 or at least partly memorize them,\ + \ enough that it can fill in what it can\u2019t recall." 
+ prefix: 'm 1: training data contamination' + suffix: As further evidence for this hyp + type: TextQuoteSelector + source: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + text: OpenAI was only able to pass questions available before september 2021 and + failed to answer new questions - strongly suggesting that it has simply memorised + the answers as part of its training + updated: '2023-03-21T06:26:57.441600+00:00' + uri: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + user: acct:ravenscroftj@hypothes.is + user_info: + display_name: James Ravenscroft +in-reply-to: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks +tags: +- llm +- openai +- gpt +- ModelEvaluation +- hypothesis +type: annotation +url: /annotations/2023/03/21/1679379947 + +--- + + + +
To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.
OpenAI was only able to pass questions available before september 2021 and failed to answer new questions - strongly suggesting that it has simply memorised the answers as part of its training \ No newline at end of file diff --git a/brainsteam/content/annotations/2023/03/21/1679380079.md b/brainsteam/content/annotations/2023/03/21/1679380079.md new file mode 100644 index 0000000..24c2751 --- /dev/null +++ b/brainsteam/content/annotations/2023/03/21/1679380079.md @@ -0,0 +1,68 @@ +--- +date: '2023-03-21T06:27:59' +hypothesis-meta: + created: '2023-03-21T06:27:59.825632+00:00' + document: + title: + - 'GPT-4 and professional benchmarks: the wrong answer to the wrong question' + flagged: false + group: __world__ + hidden: false + id: hoqyasexEe2ZnQ_nOVgRxA + links: + html: https://hypothes.is/a/hoqyasexEe2ZnQ_nOVgRxA + incontext: https://hyp.is/hoqyasexEe2ZnQ_nOVgRxA/aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + json: https://hypothes.is/api/annotations/hoqyasexEe2ZnQ_nOVgRxA + permissions: + admin: + - acct:ravenscroftj@hypothes.is + delete: + - acct:ravenscroftj@hypothes.is + read: + - group:__world__ + update: + - acct:ravenscroftj@hypothes.is + tags: + - openai + - gpt + - ModelEvaluation + target: + - selector: + - endContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[6]/span[2] + endOffset: 42 + startContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[6]/span[1] + startOffset: 0 + type: RangeSelector + - end: 6591 + start: 6238 + type: TextPositionSelector + - exact: 'In fact, we can definitively show that it has memorized problems in + its training set: when prompted with the title of a Codeforces problem, GPT-4 + includes a link to the exact contest where the problem appears (and the round + number is almost correct: it is off by one). Note that GPT-4 cannot access + the Internet, so memorization is the only explanation.' + prefix: the problems after September 12. 
+ suffix: GPT-4 memorizes Codeforces probl + type: TextQuoteSelector + source: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + text: GPT4 knows the link to the coding exams that it was evaluated against but + doesn't have "internet access" so it appears to have memorised this as well + updated: '2023-03-21T06:27:59.825632+00:00' + uri: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + user: acct:ravenscroftj@hypothes.is + user_info: + display_name: James Ravenscroft +in-reply-to: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks +tags: +- openai +- gpt +- ModelEvaluation +- hypothesis +type: annotation +url: /annotations/2023/03/21/1679380079 + +--- + + + +
In fact, we can definitively show that it has memorized problems in its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest where the problem appears (and the round number is almost correct: it is off by one). Note that GPT-4 cannot access the Internet, so memorization is the only explanation.
GPT4 knows the link to the coding exams that it was evaluated against but doesn't have "internet access" so it appears to have memorised this as well \ No newline at end of file diff --git a/brainsteam/content/annotations/2023/03/21/1679380149.md b/brainsteam/content/annotations/2023/03/21/1679380149.md new file mode 100644 index 0000000..306dd08 --- /dev/null +++ b/brainsteam/content/annotations/2023/03/21/1679380149.md @@ -0,0 +1,68 @@ +--- +date: '2023-03-21T06:29:09' +hypothesis-meta: + created: '2023-03-21T06:29:09.945605+00:00' + document: + title: + - 'GPT-4 and professional benchmarks: the wrong answer to the wrong question' + flagged: false + group: __world__ + hidden: false + id: sFZzLMexEe2M2r_i759OiA + links: + html: https://hypothes.is/a/sFZzLMexEe2M2r_i759OiA + incontext: https://hyp.is/sFZzLMexEe2M2r_i759OiA/aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + json: https://hypothes.is/api/annotations/sFZzLMexEe2M2r_i759OiA + permissions: + admin: + - acct:ravenscroftj@hypothes.is + delete: + - acct:ravenscroftj@hypothes.is + read: + - group:__world__ + update: + - acct:ravenscroftj@hypothes.is + tags: + - openai + - gpt + - ModelEvaluation + target: + - selector: + - endContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[8]/span[2] + endOffset: 199 + startContainer: /div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/article[1]/div[4]/div[1]/div[1]/p[8]/span[1] + startOffset: 0 + type: RangeSelector + - end: 7439 + start: 7071 + type: TextPositionSelector + - exact: "Still, we can look for telltale signs. Another symptom of memorization\ + \ is that GPT is highly sensitive to the phrasing of the question. Melanie\ + \ Mitchell gives an example of an MBA test question where changing some details\ + \ in a way that wouldn\u2019t fool a person is enough to fool ChatGPT (running\ + \ GPT-3.5). A more elaborate experiment along these lines would be valuable." + prefix: ' how performance varies by date.' 
+ suffix: "Because of OpenAI\u2019s lack of tran" + type: TextQuoteSelector + source: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + text: OpenAI has memorised MBA tests- when these are rephrased or certain details + are changed, the system fails to answer + updated: '2023-03-21T06:29:09.945605+00:00' + uri: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks + user: acct:ravenscroftj@hypothes.is + user_info: + display_name: James Ravenscroft +in-reply-to: https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks +tags: +- openai +- gpt +- ModelEvaluation +- hypothesis +type: annotation +url: /annotations/2023/03/21/1679380149 + +--- + + + +
Still, we can look for telltale signs. Another symptom of memorization is that GPT is highly sensitive to the phrasing of the question. Melanie Mitchell gives an example of an MBA test question where changing some details in a way that wouldn’t fool a person is enough to fool ChatGPT (running GPT-3.5). A more elaborate experiment along these lines would be valuable.
OpenAI has memorised MBA tests- when these are rephrased or certain details are changed, the system fails to answer \ No newline at end of file diff --git a/brainsteam/content/annotations/2023/03/21/1679428744.md b/brainsteam/content/annotations/2023/03/21/1679428744.md new file mode 100644 index 0000000..333e1f0 --- /dev/null +++ b/brainsteam/content/annotations/2023/03/21/1679428744.md @@ -0,0 +1,66 @@ +--- +date: '2023-03-21T19:59:04' +hypothesis-meta: + created: '2023-03-21T19:59:04.177001+00:00' + document: + title: + - 2303.09752.pdf + flagged: false + group: __world__ + hidden: false + id: 1MB9BMgiEe27GS99BvTIlA + links: + html: https://hypothes.is/a/1MB9BMgiEe27GS99BvTIlA + incontext: https://hyp.is/1MB9BMgiEe27GS99BvTIlA/arxiv.org/pdf/2303.09752.pdf + json: https://hypothes.is/api/annotations/1MB9BMgiEe27GS99BvTIlA + permissions: + admin: + - acct:ravenscroftj@hypothes.is + delete: + - acct:ravenscroftj@hypothes.is + read: + - group:__world__ + update: + - acct:ravenscroftj@hypothes.is + tags: + - llm + - attention + - long-documents + target: + - selector: + - end: 1989 + start: 1515 + type: TextPositionSelector + - exact: "Over the past few years, many \u201Cefficient Trans-former\u201D approaches\ + \ have been proposed that re-duce the cost of the attention mechanism over\ + \ longinputs (Child et al., 2019; Ainslie et al., 2020; Belt-agy et al., 2020;\ + \ Zaheer et al., 2020; Wang et al.,2020; Tay et al., 2021; Guo et al., 2022).\ + \ However,especially for larger models, the feedforward andprojection layers\ + \ actually make up the majority ofthe computational burden and can render\ + \ process-ing long inputs intractable" + prefix: ' be applied to each input token.' 
+ suffix: ".\u2217Author contributions are outli" + type: TextQuoteSelector + source: https://arxiv.org/pdf/2303.09752.pdf + text: Recent improvements in transformers for long documents have focused on efficiencies + in the attention mechanism but the feed-forward and projection layers are still + expensive for long docs + updated: '2023-03-21T19:59:04.177001+00:00' + uri: https://arxiv.org/pdf/2303.09752.pdf + user: acct:ravenscroftj@hypothes.is + user_info: + display_name: James Ravenscroft +in-reply-to: https://arxiv.org/pdf/2303.09752.pdf +tags: +- llm +- attention +- long-documents +- hypothesis +type: annotation +url: /annotations/2023/03/21/1679428744 + +--- + + + +
Over the past few years, many “efficient Transformer” approaches have been proposed that reduce the cost of the attention mechanism over long inputs (Child et al., 2019; Ainslie et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Wang et al., 2020; Tay et al., 2021; Guo et al., 2022). However, especially for larger models, the feedforward and projection layers actually make up the majority of the computational burden and can render processing long inputs intractable
Recent improvements in transformers for long documents have focused on efficiencies in the attention mechanism but the feed-forward and projection layers are still expensive for long docs \ No newline at end of file diff --git a/brainsteam/content/annotations/2023/03/21/1679428782.md b/brainsteam/content/annotations/2023/03/21/1679428782.md new file mode 100644 index 0000000..77f118c --- /dev/null +++ b/brainsteam/content/annotations/2023/03/21/1679428782.md @@ -0,0 +1,54 @@ +--- +date: '2023-03-21T19:59:42' +hypothesis-meta: + created: '2023-03-21T19:59:42.317507+00:00' + document: + title: + - 2303.09752.pdf + flagged: false + group: __world__ + hidden: false + id: 63md-sgiEe2GA2OJo26mSA + links: + html: https://hypothes.is/a/63md-sgiEe2GA2OJo26mSA + incontext: https://hyp.is/63md-sgiEe2GA2OJo26mSA/arxiv.org/pdf/2303.09752.pdf + json: https://hypothes.is/api/annotations/63md-sgiEe2GA2OJo26mSA + permissions: + admin: + - acct:ravenscroftj@hypothes.is + delete: + - acct:ravenscroftj@hypothes.is + read: + - group:__world__ + update: + - acct:ravenscroftj@hypothes.is + tags: + - llm + target: + - selector: + - end: 2402 + start: 2357 + type: TextPositionSelector + - exact: This paper presents COLT5 (ConditionalLongT5) + prefix: s are processed by aheavier MLP. + suffix: ', a new family of models that, b' + type: TextQuoteSelector + source: https://arxiv.org/pdf/2303.09752.pdf + text: CoLT5 stands for Conditional LongT5 + updated: '2023-03-21T19:59:42.317507+00:00' + uri: https://arxiv.org/pdf/2303.09752.pdf + user: acct:ravenscroftj@hypothes.is + user_info: + display_name: James Ravenscroft +in-reply-to: https://arxiv.org/pdf/2303.09752.pdf +tags: +- llm +- hypothesis +type: annotation +url: /annotations/2023/03/21/1679428782 + +--- + + + +
This paper presents COLT5 (Conditional LongT5)
CoLT5 stands for Conditional LongT5 \ No newline at end of file diff --git a/brainsteam/content/notes/2023/04/07/1680866081.md b/brainsteam/content/notes/2023/04/07/1680866081.md new file mode 100644 index 0000000..c3f0f84 --- /dev/null +++ b/brainsteam/content/notes/2023/04/07/1680866081.md @@ -0,0 +1,19 @@ +--- +date: '2023-04-07T11:14:41.131905' +mp-syndicate-to: +- https://brid.gy/publish/mastodon +photo: +- /media/2023/04/07/1680866081_0.jpg +tags: +- personal +type: note +url: /notes/2023/04/07/1680866081 + +--- + + + + + + Happy freaking Easter James - from Mother Nature + \ No newline at end of file diff --git a/brainsteam/content/posts/2023/03/13/deepthought-hitchhiker-s-guide-llms-and-raspberry-pis1678738115.md b/brainsteam/content/posts/2023/03/13/deepthought-hitchhiker-s-guide-llms-and-raspberry-pis1678738115.md new file mode 100644 index 0000000..bcfdada --- /dev/null +++ b/brainsteam/content/posts/2023/03/13/deepthought-hitchhiker-s-guide-llms-and-raspberry-pis1678738115.md @@ -0,0 +1,58 @@ +--- +date: '2023-03-13T20:08:35.475110' +mp-syndicate-to: +- https://brid.gy/publish/mastodon +tags: +- ai +- nlp +- humour +title: Deep Thought, Hitchhiker's Guide, LLMs and Raspberry Pis +description: Musings on parallels between AI fiction and AI fact +type: post +url: /posts/2023/03/13/deepthought-hitchhiker-s-guide-llms-and-raspberry-pis1678738115 + +--- + +Today I read via [Simon Willison's blog](https://simonwillison.net/2023/Mar/13/alpaca/) that [someone has managed to get LlaMA running on a raspberry pi]. That's pretty incredible progress and it made me think of this excerpt from [Hitchiker's Guide To the Galaxy](https://bookwyrm.social/book/181728/s/hitchhikers-guide-to-the-galaxy-trilogy-collection-5-books-set-by-douglas-adams): + +> O Deep Thought computer," he said, "the task we have designed you to perform is this. We want you to tell us...." he paused, "The Answer." +> +>"The Answer?" said Deep Thought. "The Answer to what?" +> +>"Life!" urged Fook. 
+> +>"The Universe!" said Lunkwill. +> +>"Everything!" they said in chorus. +> +>Deep Thought paused for a moment's reflection. +> +>"Tricky," he said finally. +> +>"But can you do it?" +> +>Again, a significant pause. +> +>"Yes," said Deep Thought, "I can do it." +> +>"There is an answer?" said Fook with breathless excitement. +> +>"Yes," said Deep Thought. "Life, the Universe, and Everything. There is an answer. But, I'll have to think about it." +> +>... +> +>Fook glanced impatiently at his watch. +> +>“How long?” he said. +> +>“Seven and a half million years,” said Deep Thought. +> +>Lunkwill and Fook blinked at each other. +> +>“Seven and a half million years...!” they cried in chorus. +> +>“Yes,” declaimed Deep Thought, “I said I’d have to think about it, didn’t I?" + +Maybe Deep Thought was actually just an LLM running on a raspberry pi and that's why it took so long to generate the ultimate answer! + + \ No newline at end of file diff --git a/brainsteam/content/posts/2023/03/20/week-11/images/officelights.jpg b/brainsteam/content/posts/2023/03/20/week-11/images/officelights.jpg new file mode 100644 index 0000000..409a681 Binary files /dev/null and b/brainsteam/content/posts/2023/03/20/week-11/images/officelights.jpg differ diff --git a/brainsteam/content/posts/2023/03/20/week-11/index.md b/brainsteam/content/posts/2023/03/20/week-11/index.md new file mode 100644 index 0000000..dd807ba --- /dev/null +++ b/brainsteam/content/posts/2023/03/20/week-11/index.md @@ -0,0 +1,45 @@ +--- +title: "Weeknote 11 2023" +date: 2023-03-20T19:53:00Z +description: in which I ate too much, entered gremlin mode and upgraded mkdocs-material +url: /2023/3/20/week-11 +type: post +mp-syndicate-to: +- https://brid.gy/publish/mastodon +- https://brid.gy/publish/twitter +resources: + - name: feature + src: images/officelights.jpg +tags: + - personal +--- + +This week (or last week)'s weeknote is a touch late since I was travelling over the weekend. 
On Sunday it was Mother's Day in the UK, so we visited my mum up in the Midlands and then Mrs R's mum down here in Hampshire, having a sit-down meal with both. It was a bit like [the bit in the Vicar of Dibley where she accidentally signs herself up for multiple Christmas dinners on the same day](https://www.youtube.com/watch?v=2aq3DNSF-jc).

---

On Tuesday we had a problem with the lighting in our office AND the water main near our office complex burst, which meant we were sat in the office like gremlins in the dark and there were no toilet facilities. I decided to work from home for the rest of the week, for reasons that were not unrelated.



{{
}}


- Now that I've got into the swing of using [foam](https://foambubble.github.io/) and [mkdocs](https://www.mkdocs.org/) to publish [my digital garden](https://notes.jamesravey.me/), I finally took the plunge and signed up as a [mkdocs-material insider](https://squidfunk.github.io/mkdocs-material/). I'm now sponsoring the good work of [squidfunk](https://fosstodon.org/@squidfunk) and also benefitting from some of the quality-of-life features that the insiders build of his theme provides, including navigation breadcrumbs.

- I've been looking for a new printer for a while, since my 10-year-old HP printer/scanner finally packed in on me just when I needed to print some important documents. I've heard horror stories about pretty much all inkjet printers, and the word on the street seemed to be: buy a laserjet if you can afford it and you don't print very frequently, as the toner cartridges last forever and don't clog up the printer the way inkjet cartridges do. I was undecided about which printer to get until I read [this review](https://www.theverge.com/23642073/best-printer-2023-brother-laser-wi-fi-its-fine), which absolutely nails it. It arrived today, I set it up and it prints stuff, so - yey I guess!

## Next Week

- (This week really - week 12) - I am in London towards the end of the week for a colleague's leaving get-together and to hopefully hang out and get some face time with another colleague who is usually based up in Edinburgh.
- Trying out some new physical-journal-and-markdown hybrid note-taking methodologies.
- Hoping to have a quiet weekend at home and get some housework and gardening done.


## Interesting Links

- https://climatejets.org/ - a really smart site that provides some insight into who is burning the most fuel flying around flippantly.

- [this guy](https://twitter.com/miolini/status/1634982361757790209) - got LLaMA (a recent large language model) running on a Raspberry Pi.
A couple of days later someone also got it running on a Pixel 5. Miniaturisation of this tech will help with its democratisation (which dillutes the power of the corporates who are pushing it so hard right now) and reduces the environmental impact of running it. + +- [OpenAI Is Now Everything It Promised Not to Be: Corporate, Closed-Source, and For-Profit](https://www.vice.com/en/article/5d3naz/openai-is-now-everything-it-promised-not-to-be-corporate-closed-source-and-for-profit) \ No newline at end of file diff --git a/brainsteam/content/posts/2023/03/25/nlp-is-more-than-llms/images/language.jpg b/brainsteam/content/posts/2023/03/25/nlp-is-more-than-llms/images/language.jpg new file mode 100644 index 0000000..8f0ddb6 Binary files /dev/null and b/brainsteam/content/posts/2023/03/25/nlp-is-more-than-llms/images/language.jpg differ diff --git a/brainsteam/content/posts/2023/03/25/nlp-is-more-than-llms/index.md b/brainsteam/content/posts/2023/03/25/nlp-is-more-than-llms/index.md new file mode 100644 index 0000000..754e85c --- /dev/null +++ b/brainsteam/content/posts/2023/03/25/nlp-is-more-than-llms/index.md @@ -0,0 +1,122 @@ +--- +title: "NLP is more than just LLMs" +date: 2023-03-25T14:13:14Z +description: Opportunities for early NLP professionals and small companies in the post ChatGPT era +url: /2023/3/25//nlp-is-more-than-just-llms +type: post +mp-syndicate-to: +- https://brid.gy/publish/mastodon +- https://brid.gy/publish/twitter +resources: + - name: feature + src: images/language.jpg +tags: + - nlp + - llms + - ai +--- + + +{{
}}


There is sooo much hype around LLMs at the moment. As an NLP practitioner of 10 years (I built Partridge [^Partridge] in 2013), I find it exhausting and quite annoying, and amongst the junior ranks there's a lot of despondency and dejection: a feeling of "what's the point? ~~Closed~~OpenAI have solved NLP".


Well, I'm here to tell you that NLP is more than just LLMs and that there are plenty of opportunities to get into the field. What's more, there are plenty of interesting, ethical use cases that can benefit society. In this post I will describe a number of opportunities for research and development in NLP that are unrelated or tangential to training bigger and bigger transformer-based [^vaswaniAttentionAllYou] LLMs.

This post is based on a comment I made on a Reddit thread [^aromatic_eye_6268ShouldSpecializeNLP2023] covering "should I study NLP?"


## Combatting Hallucination

If you take the hype at face value, you could be forgiven for believing that NLP is pretty much a solved problem. However, that simply isn't the case. LLMs hallucinate (make stuff up) and, whilst there is a marked improvement in hallucination between versions of GPT, hallucination is a problem with transformer-based LLMs in general, as OpenAI's technical co-founder Ilya Sutskever admits [^smithGPT4CreatorIlya]. Instead of relying on pure LLMs, there are lots of opportunities for building NLP pipelines that can reliably retrieve answers from specific documents via semantic search [^SemanticSearchFAISS]. This sort of approach allows end users to make up their own minds about the trustworthiness of the source rather than relying on the LLM itself, which might be right or might spit out alphabet soup. This week OpenAI announced a plugin interface for ChatGPT that, in theory, facilitates a hybrid LLM and retrieval approach through their system.
However, it seems that GPT can still hallucinate incorrect answers even when the correct one is in the retrieved response [^SometimesItHallucinates]. There's definitely some room for improvement here!

As use of LLMs becomes more widespread and people ask them questions and use them to write blog posts, we're going to start seeing more hallucinations presented as facts online. What's more, we're already seeing LLMs citing misinformation generated by other LLMs [^vincentGoogleMicrosoftChatbots2023] to their users.

## Bot Detection

There are certainly opportunities in bot-versus-human detection. Solutions like GPTZero [^GPTZero] and GLTR [^GLTRGlitterV0] rely on the statistical likelihood that a model would use a given sequence of words based on historical output (for example, if the words "bananas in pajamas" never appear in known GPT output but they appear in the input document, the probability that it was written by a human is increased). Approaches like DetectGPT [^mitchellDetectGPTZeroShotMachineGenerated2023] use a model to perturb (subtly change) the output and compare the probabilities of the strings being generated to see if the original "sticks out" as being unusual and thus more human-like. ***edit: I was also contacted by Tracey Deacker, a computer science student in Reykjavik, who recommended CrossPlag[^CrossPlag], another such detection tool.***

It seems likely that bot detection and evading detection will become a new arms race: as new detection methods emerge, people will build more and more complex methods for evading detection, or rely on adversarial training approaches to train existing models to evade new detection techniques automatically.

## Fact Checking and Veracity

Regardless of who wrote the content, bots or humans, fact-checking remains a key topic for NLP, and again it is something that generative LLMs are not really set up to do.
Fact-checking is a relatively mature area of NLP, with challenges and workshops like FEVER [^thorneFEVERLargescaleDataset2018]. However, it remains a tricky area which may require models to make multiple logical "hops" to arrive at a conclusion.


When direct evidence of something is not available, rumour verification is another tool in the NLP arsenal that may help us to derive the trustworthiness of a source. It works by identifying support or denial from parties who may be involved in a particular rumour (for example, Donald Trump tweets that he's going to be arrested and some AI-generated photos of his arrest appear online, posted by unknown actors, but we can determine that this is unlikely to be true because social media accounts at trustworthy newspapers tweet that Trump created a false expectation of arrest). Kochkina et al. currently hold the state of the art on the RumourEval dataset [^kochkinaTuringSemEval2017Task2017].


## Temporal Reasoning

Things change over time. The answer to "who is the UK Prime Minister?" today is different to this time last year. GPT-3.5 got around this by often prefixing information with big disclaimers about being trained in 2021 before telling you that the UK Prime Minister is Boris Johnson and not knowing who Rishi Sunak is. Early Bing/Sydney (which we now know was GPT-4 [^ConfirmedNewBing]) simply tried to convince you that it was actually 2022, not 2023, and that you must be wrong: "You have been a bad user. I have been a good Bing" [^vynckMicrosoftAIChatbot2023].

Again, this is something that a pure transformer-based LLM sucks at, and around which there are many opportunities. Recent work in this area includes modelling moments of change in people's mood based on social media posts [^tsakalidisIdentifyingMomentsChange2022], and earlier work has been done on modelling how topics of discussion in scientific research change over time [^prabhakaranPredictingRiseFall2016].
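The core difficulty with temporal reasoning is that facts carry validity intervals which a model frozen at a training cutoff cannot see. A minimal sketch of the idea (the fact store and `lookup` helper are an illustrative toy of my own, not from any of the systems discussed; the PM dates themselves are real):

```python
from datetime import date
from typing import Optional

# Toy temporal fact store: each fact is only valid for an interval,
# so the "right" answer depends on when you ask.
FACTS = [
    # (subject, value, valid_from, valid_until) - None means "still current"
    ("uk_prime_minister", "Boris Johnson", date(2019, 7, 24), date(2022, 9, 6)),
    ("uk_prime_minister", "Liz Truss", date(2022, 9, 6), date(2022, 10, 25)),
    ("uk_prime_minister", "Rishi Sunak", date(2022, 10, 25), None),
]

def lookup(subject: str, as_of: date) -> Optional[str]:
    """Return the value that was valid for `subject` on the given date."""
    for subj, value, start, end in FACTS:
        if subj == subject and start <= as_of and (end is None or as_of < end):
            return value
    return None

# A model frozen in 2021 gets the first query right but not the second.
print(lookup("uk_prime_minister", date(2021, 9, 1)))   # Boris Johnson
print(lookup("uk_prime_minister", date(2023, 3, 25)))  # Rishi Sunak
```

A static LLM effectively bakes the first answer into its weights; making models condition on "as of when?" is exactly the kind of open problem this section is about.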
+

## Specialised Models and Low Compute Modelling

LLMs are huge, power-hungry language generalists but often get outperformed by smaller, specialised models at specific tasks [^schickExploitingClozeQuestionsFewShot2021] [^schickTrueFewShotLearning2021] [^gaoMakingPretrainedLanguage2021]. Furthermore, recent developments have shown that we can get pretty good performance out of LLMs by shrinking them so that they run on laptops, Raspberry Pis and even mobile phones [^LargeLanguageModels]. It also looks like it's possible to get ChatGPT-like performance from relatively small LLMs with the right datasets: Databricks yesterday announced their Dolly model, which was trained on a single machine in under an hour [^HelloDollyDemocratizing2023].

There is plenty more work to be done in continuing to shrink models so that they can be used on-site, on mobile or in embedded settings, in order to support use cases where flexibility and trustworthiness are key. Many of my customers would be very unlikely to let me send their data to OpenAI to be processed and potentially learned from in a way that would benefit their competitors, or that could accidentally leak confidential information and cause GDPR headaches.

Self-hosted models are also a known quantity, but the big organisations that can afford to train and host these gigantic LLMs stand to make a lot of money off people just using their APIs as black boxes. Building small, specialised models that can run on cheap commodity hardware will allow small companies to benefit from NLP without relying on OpenAI's generosity. It might make sense for small companies to start building with a hosted LLM, but when you get serious, you need to own your model [^HelloDollyDemocratizing2023].

## Trust and Reproducibility

Explainability and trustworthiness of models are now a crucial part of the machine learning landscape.
It is often very important to understand why an algorithm made a particular decision, in order to eliminate latent biases and discrimination and to ensure that the reasoning behind a decision is sound in general. There are plenty of opportunities to improve the current state of the art in this space by training models that can explain their rationale as part of their decision process [^chanUNIREXUnifiedLearning2022] and by developing benchmarks and tests that can draw out problematic biases [^ribeiroAccuracyBehavioralTesting2020] [^morrisTextAttackFrameworkAdversarial2020].

The big players have started to signal their intent not to make their models and datasets open any more [^vincentOpenAICofounderCompany2023] [^snyderAILeaderSays2023]. By hiding this detail, they are effectively withdrawing from the scientific community, and we can no longer meaningfully reproduce their findings or trust their results. For example, there are some pretty feasible hypotheses about how GPT-4 may have previously been exposed to, and overfit on, the bar exam papers that it supposedly aced [^narayananGPT4ProfessionalBenchmarks2023]. Without access to the model dataset or weights, nobody can check this.

In fact, we've got something of a reproducibility crisis when it comes to AI in general [^knightSloppyUseMachine]. There are lots of opportunities for budding practitioners to enter the arena, tidy up processes and tools, and reproduce results.


## Conclusion

In conclusion, while the world's gone mad with GPT fever, it's important to remember that there are still a huge number of opportunities within the NLP space for small research groups and businesses.
+ +So for early career researchers and engineers considering NLP: it's definitely learning about LLMs and considering their strengths and weaknesses but also consider that, regardless of what the Silicon Valley Giants would have you believe, NLP is more than just LLMs. + +## Other Resources for AI Beyond LLMs + +Here are some more resources on nlp and ml stuff that is going on outside of the current LLM bubble from others in the nlp space: + +https://twitter.com/andriy_mulyar/status/1636139257805828096 - a thread where some nlp experts weigh in on unsolved problems + +https://twitter.com/vboykis/status/1635987389381222406 - a recent chat between AI and ML practitioners on stuff they are working on outside of LLMs + +https://link.medium.com/6Bz5jc2hsyb - a blog post from an NLP professor about finding problems to work on outside of the LLM bubble. + +[^Partridge]: Partridge - a web based tool used for scientific paper retrieval and filtering that makes use of Machine Learning techniques. https://beta.papro.org.uk +[^aromatic_eye_6268ShouldSpecializeNLP2023]: Aromatic_Eye_6268. (2023, March 25). Should I specialize in NLP considering the advent of Large Language Models? [Reddit Post]. R/LanguageTechnology. www.reddit.com/r/LanguageTechnology/comments/121gv4c/should_i_specialize_in_nlp_considering_the_advent/ +[^smithGPT4CreatorIlya]: Smith, C. S. (n.d.). GPT-4 Creator Ilya Sutskever on AI Hallucinations and AI Democracy. Forbes. Retrieved 25 March 2023, from https://www.forbes.com/sites/craigsmith/2023/03/15/gpt-4-creator-ilya-sutskever-on-ai-hallucinations-and-ai-democracy/ +[^SemanticSearchFAISS]: Semantic search with FAISS - Hugging Face Course. (n.d.). Retrieved 25 March 2023, from https://huggingface.co/course/chapter5/6 +[^vaswaniAttentionAllYou]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (n.d.). Attention is All you Need. 11. 
+[^SometimesItHallucinates]: Sometimes it hallucinates despite fetching accurate data! · Issue #2 · simonw/datasette-chatgpt-plugin. (n.d.). GitHub. Retrieved 25 March 2023, from https://github.com/simonw/datasette-chatgpt-plugin/issues/2 +[^vincentGoogleMicrosoftChatbots2023]: Vincent, J. (2023, March 22). Google and Microsoft’s chatbots are already citing one another in a misinformation shitshow. The Verge. https://www.theverge.com/2023/3/22/23651564/google-microsoft-bard-bing-chatbots-misinformation +[^GPTZero]: GPTZero. (n.d.). Retrieved 25 March 2023, from https://gptzero.me/ +[^GLTRGlitterV0]: GLTR (glitter) v0.5. (n.d.). Retrieved 25 March 2023, from http://gltr.io/dist/index.html +[^mitchellDetectGPTZeroShotMachineGenerated2023]: Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., & Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (arXiv:2301.11305). arXiv. http://arxiv.org/abs/2301.11305 +[^ConfirmedNewBing]: Confirmed: The new Bing runs on OpenAI’s GPT-4 | Bing Search Blog. (n.d.). Retrieved 25 March 2023, from https://blogs.bing.com/search/march_2023/Confirmed-the-new-Bing-runs-on-OpenAI%E2%80%99s-GPT-4 +[^vynckMicrosoftAIChatbot2023]: Vynck, G. D., Lerman, R., & Tiku, N. (2023, February 17). Microsoft’s AI chatbot is going off the rails. Washington Post. https://www.washingtonpost.com/technology/2023/02/16/microsoft-bing-ai-chatbot-sydney/ +[^tsakalidisIdentifyingMomentsChange2022]: Tsakalidis, A., Nanni, F., Hills, A., Chim, J., Song, J., & Liakata, M. (2022). Identifying Moments of Change from Longitudinal User Text. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4647–4660. https://doi.org/10.18653/v1/2022.acl-long.318 +[^prabhakaranPredictingRiseFall2016]: Prabhakaran, V., Hamilton, W. L., McFarland, D., & Jurafsky, D. (2016). Predicting the Rise and Fall of Scientific Topics from Trends in their Rhetorical Framing. 
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1170–1180. https://doi.org/10.18653/v1/P16-1111 +[^thorneFEVERLargescaleDataset2018]: Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: A Large-scale Dataset for Fact Extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 809–819. https://doi.org/10.18653/v1/N18-1074 +[^kochkinaTuringSemEval2017Task2017]: Kochkina, E., Liakata, M., & Augenstein, I. (2017). Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM (arXiv:1704.07221). arXiv. http://arxiv.org/abs/1704.07221 +[^schickExploitingClozeQuestionsFewShot2021]: Schick, T., & Schütze, H. (2021). Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 255–269. https://www.aclweb.org/anthology/2021.eacl-main.20 +[^schickTrueFewShotLearning2021]: Schick, T., & Schütze, H. (2021). True Few-Shot Learning with Prompts—A Real-World Perspective. ArXiv:2111.13440 [Cs]. http://arxiv.org/abs/2111.13440 +[^gaoMakingPretrainedLanguage2021]: Gao, T., Fisch, A., & Chen, D. (2021). Making Pre-trained Language Models Better Few-shot Learners. ArXiv:2012.15723 [Cs]. http://arxiv.org/abs/2012.15723 +[^LargeLanguageModels]: Large language models are having their Stable Diffusion moment. (n.d.). Retrieved 25 March 2023, from https://simonwillison.net/2023/Mar/11/llama/ +[^HelloDollyDemocratizing2023]: Hello Dolly: Democratizing the magic of ChatGPT with open models. (2023, March 24). Databricks. https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html +[^vincentOpenAICofounderCompany2023]: Vincent, J. 
(2023, March 15). OpenAI co-founder on company’s past approach to openly sharing research: “We were wrong”. The Verge. https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch-closed-research-ilya-sutskever-interview +[^snyderAILeaderSays2023]: Snyder, A. (2023, March 2). AI leader says field’s new territory is promising but risky. Axios. https://www.axios.com/2023/03/02/demis-hassabis-deepmind-ai-new-territory +[^narayananGPT4ProfessionalBenchmarks2023]: Narayanan, A., & Kapoor, S. (2023, March 20). GPT-4 and professional benchmarks: The wrong answer to the wrong question [Substack newsletter]. AI Snake Oil. https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks +[^chanUNIREXUnifiedLearning2022]: Chan, A., Sanjabi, M., Mathias, L., Tan, L., Nie, S., Peng, X., Ren, X., & Firooz, H. (2022). UNIREX: A Unified Learning Framework for Language Model Rationale Extraction. https://doi.org/10.18653/v1/2022.bigscience-1.5 +[^ribeiroAccuracyBehavioralTesting2020]: Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–4912. https://doi.org/10.18653/v1/2020.acl-main.442 +[^morrisTextAttackFrameworkAdversarial2020]: Morris, J. X., Lifland, E., Yoo, J. Y., Grigsby, J., Jin, D., & Qi, Y. (2020). TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP (arXiv:2005.05909). arXiv. https://doi.org/10.48550/arXiv.2005.05909 +[^knightSloppyUseMachine]: Knight, W. (n.d.). Sloppy Use of Machine Learning Is Causing a ‘Reproducibility Crisis’ in Science. Wired.
Retrieved 25 March 2023, from https://www.wired.com/story/machine-learning-reproducibility-crisis/ +[^CrossPlag]: AI Content Detector - Crossplag - https://crossplag.com/ai-content-detector/ \ No newline at end of file diff --git a/brainsteam/data/mentions.json b/brainsteam/data/mentions.json index d76099c..7d58c54 100644 --- a/brainsteam/data/mentions.json +++ b/brainsteam/data/mentions.json @@ -2186,6 +2186,25 @@ "content": null, "published": "2022-05-05T16:24:01+00:00" } + }, + { + "id": 1662026, + "source": "https:\/\/jamesg.coffee\/2023\/liked-brainsteamcouk2022130debugging-bridgy-for-my-blog", + "target": "https:\/\/brainsteam.co.uk\/2022\/1\/30\/debugging-bridgy-for-my-blog\/", + "activity": { + "type": "like" + }, + "verified_date": "2023-04-14T15:51:52.180446", + "data": { + "author": { + "type": "card", + "name": "James' Coffee Blog", + "photo": "https:\/\/webmention.io\/avatar\/jamesg.coffee\/44a4b81d3ad1303a2acd54e82a33c5333b80a611e540f4971bcb5fd93096c352.jpg", + "url": "https:\/\/jamesg.coffee\/profile\/capjamesg" + }, + "content": null, + "published": "2023-04-14T15:51:46+00:00" + } } ], "\/notes\/2022\/02\/04\/1643990322\/": [ @@ -14609,6 +14628,31 @@ "content": null, "published": null } + }, + { + "source": "https:\/\/brainsteam.co.uk\/2023\/03\/12\/week-10\/", + "verified": true, + "verified_date": "2023-03-12T19:41:21+00:00", + "id": 1639493, + "private": false, + "data": { + "author": { + "name": "James Ravenscroft", + "url": "https:\/\/brainsteam.co.uk", + "photo": null + }, + "url": "https:\/\/brainsteam.co.uk\/2023\/03\/12\/week-10\/", + "name": "Weeknote 2023 Week 10", + "content": "