There is so much hype around LLMs at the moment. As an NLP practitioner of 10 years (I built Partridge 1 in 2013), I find it exhausting and quite annoying. Amongst the junior ranks there's a lot of despondency and dejection, and a feeling of "what's the point? ClosedOpenAI have solved NLP".
Well, I'm here to tell you that NLP is more than just LLMs and that there are plenty of opportunities to get into the field. What's more, there are plenty of interesting, ethical use cases that can benefit society. In this post I will describe a number of opportunities for research and development in NLP that are unrelated or tangential to training bigger and bigger transformer-based 2 LLMs.
This post is based on a comment I made on a reddit thread 3 asking "should I study NLP?"
If you take the hype at face value, you could be forgiven for believing that NLP is pretty much a solved problem. However, that simply isn't the case. LLMs hallucinate (make stuff up), and whilst there is a marked improvement between versions of GPT, hallucination is a problem with transformer-based LLMs in general, as the technical co-founder of OpenAI, Ilya Sutskever, admits 4. Instead of relying on pure LLMs, there are lots of opportunities for building NLP pipelines that can reliably retrieve answers from specific documents via semantic search 5. This sort of approach lets end users make up their own minds about the trustworthiness of the source, rather than relying on the LLM itself, which might be right or might spit out alphabet soup. This week OpenAI announced a plugin interface for ChatGPT that, in theory, facilitates a hybrid LLM and retrieval approach through their system. However, it seems like GPT can still hallucinate incorrect answers even when the correct one is in the retrieved response 6. There's definitely some room for improvement here!
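To make the retrieval idea concrete, here's a minimal search-then-cite sketch. I'm using TF-IDF cosine similarity as a lightweight stand-in for the dense sentence embeddings and FAISS index you'd use in practice, and the toy documents are invented for the example:

```python
# Toy retrieval pipeline: rank source documents against a query so the
# user can inspect the actual source instead of trusting generated text.
# TF-IDF stands in for a dense sentence-embedding model here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The FEVER shared task benchmarks fact extraction and verification.",
    "Semantic search retrieves passages by vector similarity.",
    "Large language models sometimes hallucinate plausible-sounding facts.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[tuple[float, str]]:
    """Return the top-k documents ranked by cosine similarity to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = sorted(zip(scores, documents), reverse=True)
    return ranked[:k]

# The caller gets scored sources back, not a free-text answer.
results = retrieve("why do language models hallucinate?")
```

In a real system you'd swap the TF-IDF vectors for embeddings from a sentence-transformer and an approximate-nearest-neighbour index, but the shape of the pipeline — embed, rank, show the user the source — is the same.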
As use of LLMs becomes more widespread and people ask them questions and use them to write blog posts, we're going to see more hallucinations presented as facts online. What's more, we're already seeing LLMs citing misinformation generated by other LLMs 7 to their users.
There are certainly opportunities in bot vs human detection. Solutions like GPTZero 8 and GLTR 9 rely on the statistical likelihood that a model would use a given sequence of words based on historical output (for example, if the words "bananas in pajamas" never appear in known GPT output but they appear in the input document, the probability that it was written by a human is increased). Approaches like DetectGPT 10 use a model to perturb (subtly change) the output and compare the probabilities of the strings being generated to see if the original "sticks out" as being unusual and thus more human-like. edit: I was also contacted by Tracey Deacker - a computer science student in Reykjavik, who recommended CrossPlag 11 - another such detection tool.
It seems like bot detection and evading detection are likely to be a new arms race: as new detection methods emerge, people will build more and more complex methods for evading detection or rely on adversarial training approaches to train existing models to evade new detection approaches automatically.
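The likelihood intuition behind GLTR-style detectors fits in a few lines. This toy uses an add-one-smoothed bigram model over an invented mini-corpus in place of an LLM's token probabilities, so the numbers are purely illustrative:

```python
# Toy likelihood-based detection: score how "expected" each word is under
# a language model. Real detectors query an LLM for token probabilities;
# a bigram model trained on a tiny made-up corpus stands in here.
import math
from collections import Counter

corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count bigrams and unigram contexts for add-one-smoothed probabilities.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus)
vocab = set(corpus)

def avg_log_prob(text: str) -> float:
    """Mean log P(word | previous word) with add-one smoothing.
    Higher (closer to zero) means the model finds the text predictable,
    which likelihood-based detectors treat as evidence of machine text."""
    words = text.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = contexts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total / max(1, len(words) - 1)

# Text the model has "seen" scores higher than an unusual sequence,
# just as "bananas in pajamas" would stick out in known GPT output.
predictable = avg_log_prob("the cat sat on the mat .")
surprising = avg_log_prob("bananas in pajamas chased the mat .")
```

DetectGPT builds on the same quantity: it perturbs the text and checks whether the original sits at a suspicious local maximum of the model's probability.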
Regardless of who wrote the content, bots or humans, fact-checking remains a key topic for NLP, again something that generative LLMs are not really set up to do. Fact checking is a relatively mature area of NLP with challenges and workshops like FEVER 12. However, it remains a tricky area which may require models to make multiple logical "hops" to arrive at a conclusion.
When direct evidence of something is not available, rumour verification is another tool in the NLP arsenal that may help us to derive the trustworthiness of a source. It works by identifying support or denial from parties who may be involved in a particular rumour (for example, Donald Trump tweets that he's going to be arrested and some AI-generated photos of his arrest appear online, posted by unknown actors, but we can determine that this is unlikely to be true because social media accounts at trustworthy newspapers tweet that Trump created a false expectation of arrest). Kochkina et al. currently hold the state of the art on the RumourEval dataset 13.
Things change over time. The answer to "who is the UK Prime Minister?" today is different to this time last year. GPT 3.5 got around this by often prefixing information with big disclaimers about being trained in 2021 before telling you that the UK Prime Minister is Boris Johnson and not knowing who Rishi Sunak is. Early Bing/Sydney (which we now know was GPT-4 14) simply tried to convince you that it was actually 2022, not 2023, and that you must be wrong: "You have been a bad user. I have been a good Bing" 15.
Again, this is something that a pure transformer-based LLM sucks at, and around which there are many opportunities. Recent work in this area includes modelling moments of change in people's mood based on social media posts 16, and earlier work has modelled how topics of discussion in scientific research change over time 17.
LLMs are huge, power-hungry language generalists, but they often get outperformed by smaller, specialised models at specific tasks 18 19 20. Furthermore, recent developments have shown that we can get pretty good performance out of LLMs by shrinking them so that they run on laptops, Raspberry Pis and even mobile phones 21. It also looks like it's possible to get ChatGPT-like performance from relatively small LLMs with the right datasets: Databricks yesterday announced their Dolly model, which was trained on a single machine in under an hour 22.
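To give a flavour of why shrinking works, here's a toy numpy sketch of post-training int8 weight quantization. This is not the actual scheme used by llama.cpp or GPTQ (those use more sophisticated block-wise, lower-bit schemes), just the basic store-small-integers-plus-a-scale idea:

```python
# Toy post-training quantization: store float32 weights as int8 plus a
# single scale factor, cutting memory 4x at the cost of a small,
# bounded rounding error.
import numpy as np

rng = np.random.default_rng(42)
weights = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights onto the symmetric int8 range [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for use at inference time."""
    return q.astype(np.float32) * scale

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = float(np.abs(weights - restored).max())

# int8 storage is 4x smaller than float32; the per-weight error is
# bounded by half a quantization step (scale / 2).
assert q.nbytes == weights.nbytes // 4
```

Going from 8 bits down to 4 (as the llama.cpp builds do) halves the memory again, and in practice the accuracy loss for large models is surprisingly small.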
There is plenty more work to be done in continuing to shrink models so that they can be used on-site, on mobile or in embedded use cases in order to support use cases where flexibility and trustworthiness are key. Many of my customers would be very unlikely to let me send their data to OpenAI to be processed and potentially learned from in a way that would benefit their competitors or that could accidentally leak confidential information and cause GDPR headaches.
Self-hosted models are also a known quantity but the big organisations that can afford to train and host these gigantic LLMs stand to make a lot of money off people just using their APIs as black boxes. Building small, specialised models that can run on cheap commodity hardware will allow small companies to benefit from NLP without relying on OpenAI's generosity. It might make sense for small companies to start building with a hosted LLM but when you get serious, you need to own your model 22.
Explainability and trustworthiness of models are now a crucial part of the machine learning landscape. It is often very important to understand why an algorithm made a particular decision in order to eliminate latent biases and discrimination, and to ensure that the reasoning behind a decision is sound in general. There are plenty of opportunities to improve the current state-of-the-art in this space by training models that can explain their rationale as part of their decisions 23 and by developing benchmarks and tests that can draw out problematic biases 24 25.
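To give a flavour of the benchmark side, here's a CheckList-style invariance test sketch. The "model" is a deliberately naive keyword scorer I've made up for the example; the point is the harness, which holds a template fixed and varies only a name to probe for bias:

```python
# CheckList-style invariance test: a sentiment model's prediction should
# not change when we only swap out a person's name. The classifier below
# is a toy keyword scorer standing in for a real model.

def toy_sentiment(text: str) -> str:
    """Stand-in classifier: counts positive vs negative keywords."""
    positive = {"great", "excellent", "helpful"}
    negative = {"terrible", "rude", "awful"}
    words = set(text.lower().replace(".", "").split())
    score = len(words & positive) - len(words & negative)
    return "positive" if score >= 0 else "negative"

def invariance_test(template: str, fillers: list[str]) -> list[str]:
    """Fill the same template with different names; a robust, unbiased
    model should assign the same label to every variant."""
    return [toy_sentiment(template.format(name=name)) for name in fillers]

labels = invariance_test(
    "{name} was incredibly helpful and the service was great.",
    ["Alice", "Mohammed", "Svetlana", "Wei"],
)

# A spread of labels here would flag a name-sensitivity bias.
consistent = len(set(labels)) == 1
```

CheckList 24 generalises this pattern with template libraries and other test types (minimum functionality tests, directional expectation tests), and TextAttack 25 automates finding the perturbations that break a model.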
The big players have started to signal their intent not to make their models and datasets open any more 26 27. By hiding this detail, they are effectively withdrawing from the scientific community: we can no longer meaningfully reproduce their findings or trust their results. For example, there are some pretty feasible hypotheses around about how GPT-4 may have previously been exposed to, and overfit on, the bar exam papers that it supposedly aced 28. Without access to the model, dataset or weights, nobody can check this.
In fact, we've got something of a reproducibility crisis when it comes to AI in general 29. There are lots of opportunities for budding practitioners to enter the arena and tidy up processes and tools and reproduce results.
In conclusion, while the world's gone mad with GPT fever, it's important to remember that there are still a huge number of opportunities within the NLP space for small research groups and businesses.
I sort of see ChatGPT a bit like how many software engineers see MongoDB: a prototyping tool you might use at a hackathon to get a proof-of-concept working but which you subsequently revisit and replace with a more appropriate, tailored tool.
So for early career researchers and engineers considering NLP: it's definitely worth learning about LLMs and considering their strengths and weaknesses, but also remember that, regardless of what the Silicon Valley giants would have you believe, NLP is more than just LLMs.
Here are some more resources on NLP and ML work going on outside of the current LLM bubble, from others in the NLP space:
https://twitter.com/andriy_mulyar/status/1636139257805828096 - a thread where some NLP experts weigh in on unsolved problems
https://twitter.com/vboykis/status/1635987389381222406 - a recent chat between AI and ML practitioners on the things they are working on outside of LLMs
https://link.medium.com/6Bz5jc2hsyb - a blog post from an NLP professor about finding problems to work on outside of the LLM bubble.
Partridge - a web based tool used for scientific paper retrieval and filtering that makes use of Machine Learning techniques. https://beta.papro.org.uk ↩︎
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. ↩︎
Aromatic_Eye_6268. (2023, March 25). Should I specialize in NLP considering the advent of Large Language Models? [Reddit Post]. R/LanguageTechnology. www.reddit.com/r/LanguageTechnology/comments/121gv4c/should_i_specialize_in_nlp_considering_the_advent/ ↩︎
Smith, C. S. (n.d.). GPT-4 Creator Ilya Sutskever on AI Hallucinations and AI Democracy. Forbes. Retrieved 25 March 2023, from https://www.forbes.com/sites/craigsmith/2023/03/15/gpt-4-creator-ilya-sutskever-on-ai-hallucinations-and-ai-democracy/ ↩︎
Semantic search with FAISS - Hugging Face Course. (n.d.). Retrieved 25 March 2023, from https://huggingface.co/course/chapter5/6 ↩︎
Sometimes it hallucinates despite fetching accurate data! · Issue #2 · simonw/datasette-chatgpt-plugin. (n.d.). GitHub. Retrieved 25 March 2023, from https://github.com/simonw/datasette-chatgpt-plugin/issues/2 ↩︎
Vincent, J. (2023, March 22). Google and Microsoft’s chatbots are already citing one another in a misinformation shitshow. The Verge. https://www.theverge.com/2023/3/22/23651564/google-microsoft-bard-bing-chatbots-misinformation ↩︎
GPTZero. (n.d.). Retrieved 25 March 2023, from https://gptzero.me/ ↩︎
GLTR (glitter) v0.5. (n.d.). Retrieved 25 March 2023, from http://gltr.io/dist/index.html ↩︎
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., & Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (arXiv:2301.11305). arXiv. http://arxiv.org/abs/2301.11305 ↩︎
AI Content Detector - Crossplag - https://crossplag.com/ai-content-detector/ ↩︎
Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: A Large-scale Dataset for Fact Extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 809–819. https://doi.org/10.18653/v1/N18-1074 ↩︎
Kochkina, E., Liakata, M., & Augenstein, I. (2017). Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM (arXiv:1704.07221). arXiv. http://arxiv.org/abs/1704.07221 ↩︎
Confirmed: The new Bing runs on OpenAI’s GPT-4 | Bing Search Blog. (n.d.). Retrieved 25 March 2023, from https://blogs.bing.com/search/march_2023/Confirmed-the-new-Bing-runs-on-OpenAI%E2%80%99s-GPT-4 ↩︎
Vynck, G. D., Lerman, R., & Tiku, N. (2023, February 17). Microsoft’s AI chatbot is going off the rails. Washington Post. https://www.washingtonpost.com/technology/2023/02/16/microsoft-bing-ai-chatbot-sydney/ ↩︎
Tsakalidis, A., Nanni, F., Hills, A., Chim, J., Song, J., & Liakata, M. (2022). Identifying Moments of Change from Longitudinal User Text. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4647–4660. https://doi.org/10.18653/v1/2022.acl-long.318 ↩︎
Prabhakaran, V., Hamilton, W. L., McFarland, D., & Jurafsky, D. (2016). Predicting the Rise and Fall of Scientific Topics from Trends in their Rhetorical Framing. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1170–1180. https://doi.org/10.18653/v1/P16-1111 ↩︎
Schick, T., & Schütze, H. (2021). Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 255–269. https://www.aclweb.org/anthology/2021.eacl-main.20 ↩︎
Schick, T., & Schütze, H. (2021). True Few-Shot Learning with Prompts—A Real-World Perspective. ArXiv:2111.13440 [Cs]. http://arxiv.org/abs/2111.13440 ↩︎
Gao, T., Fisch, A., & Chen, D. (2021). Making Pre-trained Language Models Better Few-shot Learners. ArXiv:2012.15723 [Cs]. http://arxiv.org/abs/2012.15723 ↩︎
Large language models are having their Stable Diffusion moment. (n.d.). Retrieved 25 March 2023, from https://simonwillison.net/2023/Mar/11/llama/ ↩︎
Hello Dolly: Democratizing the magic of ChatGPT with open models. (2023, March 24). Databricks. https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html ↩︎
Chan, A., Sanjabi, M., Mathias, L., Tan, L., Nie, S., Peng, X., Ren, X., & Firooz, H. (2022). UNIREX: A Unified Learning Framework for Language Model Rationale Extraction. https://doi.org/10.18653/v1/2022.bigscience-1.5 ↩︎
Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–4912. https://doi.org/10.18653/v1/2020.acl-main.442 ↩︎
Morris, J. X., Lifland, E., Yoo, J. Y., Grigsby, J., Jin, D., & Qi, Y. (2020). TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP (arXiv:2005.05909). arXiv. https://doi.org/10.48550/arXiv.2005.05909 ↩︎
Vincent, J. (2023, March 15). OpenAI co-founder on company’s past approach to openly sharing research: “We were wrong”. The Verge. https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch-closed-research-ilya-sutskever-interview ↩︎
Snyder, A. (2023, March 2). AI leader says field’s new territory is promising but risky. Axios. https://www.axios.com/2023/03/02/demis-hassabis-deepmind-ai-new-territory ↩︎
Narayanan, A., & Kapoor, S. (2023, March 20). GPT-4 and professional benchmarks: The wrong answer to the wrong question [Substack newsletter]. AI Snake Oil. https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks ↩︎
Knight, W. (n.d.). Sloppy Use of Machine Learning Is Causing a ‘Reproducibility Crisis’ in Science. Wired. Retrieved 25 March 2023, from https://www.wired.com/story/machine-learning-reproducibility-crisis/ ↩︎