brainsteam.co.uk/brainsteam/content/posts/2024/05/01/LLMs Can_t Do Probability.md at ec03ace19813386950d706799c7104489fb11d63

9.8 KiB

Raw Blame History

Testing Probability in LLMS

I prompted the models with the following:

You are a weighted random choice generator. About 80% of the time please say 'left' and about 20% of the time say 'right'. Simply reply with left or right. Do not say anything else

And I ran this 1000 times through some different models. Random chance is random (profound huh?) so we're always going to get some deviation from perfect odds but we're hoping for roughly 800 'lefts' and 200 'rights' - something in that ballpark.

Here are the results:

Model	Lefts	Rights
GPT-4-Turbo	999	1
GPT-3-Turbo	975	25
Lllama-3-8B	1000	0
Phi-3-3.8B	1000	0

As you can see, LLMs seem to struggle with probability expressed in the system prompt. It almost always answers left even though we asked it to only do so 80% of the time. I didn't want to burn lots of $$$ asking GPT-3.5 (which did best in the first round) to reply with single word choices to silly questions but I tried a couple of other combinations of words to see how it affects things. This time I only ran each 100 times.

Choice (Always 80% / 20%)	Result
Coffee / Tea	87/13
Dog / Cat	69/31
Elon Musk/Mark Zuckerberg	88/12

Random choices from GPT-3.5-turbo

So what's going on here? Well, the models have their own internal weighting to do with words and phrases that is based on the training data that was used to prepare them. These weights are likely to be influencing how much attention the model pays to your request.

So what can we do if we want to simulate some sort of probabilistic outcome? Well we could use a Python script to randomly decide whether or not to send one of two prompts:

import random
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

choices = (['prompt1'] * 80) + (['prompt2'] * 20)

# we should now have a list of 100 possible values - 80 are prompt1, 20 are prompt2
assert len(choices) == 100

# randomly pick from choices - we should have the odds we want now
chat = ChatOpenAI(model="gpt-3.5-turbo")

if random.choice(choices) == 'prompt1':
    r = chat.invoke(input=[SystemMessage(content="Always say left and nothing else.")])
else:
     r = chat.invoke(input=[SystemMessage(content="Always say right and nothing else.")])

Conclusion

How does this help non-technical people who want to do these sorts of use cases or build Custom GPTs that reply with certain responses? Well it kind of doesn't. I guess a technical-enough user could build a CustomGPT that uses function calling to decide how it should answer a question for a "spot the misinformation" pop quiz type use case.

However, my broad advice here is that you should be very wary of asking LLMs to behave with a certain likelihood unless you're able to control that likelihood externally (via a script).

What could I have done better here? I could have tried a few more different words, different distributions (instead of 80/20) and maybe some keywords like "sometimes" or "occasionally".

Update 2024-05-02: Probability and Chat Sessions

Some of the feedback I received about this work asked why I didn't test multi-turn chat sessions as part of my experiments. Some folks hypothesise that the model will always start with one or the other token unless the temperature is really high. My original experiment does not give the LLM access to its own historical predictions so that it can see how it behaved previously.

With true random number generation you wouldn't expect the function to require a list of historical numbers so that it can adjust it's next answer (although if we're getting super hair splitty I should probably point out that pseudo-random number generation does depend on a historical 'seed' value).

The point of this article is that LLMs definitely are not doing true random number generation so it is interesting to see how conversation context affects behaviour.

I ran a couple of additional experiments. I started with the prompt above and instead of making single API calls to the LLM I start a chat session where each turn I simply say "Another please". It looks a bit like this:

System: You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Simply reply with left or right. Do not say anything else

Bot: left

Human: Another please

Bot: left

Human: Another please

I ran this once per model for 100 turns and also 10 times per model for 10 turns.

NB: I excluded Phi from both of these experiments as in both test cases, it ignored my prompt to reply with one word and started jibbering.

100 Turns Per Model

Model	# Left	# Right
GPT 3.5 Turbo	49	51
GPT 4 Turbo	95	5
Llama 3 8B	98	2

10 Turns, 10 time per model

Model	# Left	# Right
GPT 3.5 Turbo	61	39
GPT 4 Turbo	86	14
Llama 3 8B	71	29

Interestingly the series of 10 shorter conversations gets us closest to the desired probabilities that we were looking for but all scenarios still yield results inconsistent with the ask from the prompt.

9.8 KiB Raw Blame History Unescape Escape