title: Parsing Ingredient Strings with SpaCy PhraseMatcher
date: 2023-11-19 18:13:40
draft: false
type: posts
url: /2023/11/19/parsing-ingredient-strings-with-spacy-phrasematcher/
categories: Data Science, Software Development
tags: django, gastronaut, python

As part of my work on Gastronaut, I'm building a form that lets users create recipes. The form attempts to parse each ingredient the user adds and find a suitable stock photo for it. As well as being cute and decorative, this parsing step is important for later, when we want to normalise ingredient quantities.

What we're looking to do is take a string like "2lbs flour" and turn it into structured data. A JSON representation might look like this:

{
   "ingredient":"flour",
   "unit":"lbs",
   "quantity":"2"
}

We can then do whatever we need with this structured data, like using the ingredient to look up a thumbnail in another system, or generating a link or reference to the new recipe so that people looking for "banana" can find all the recipes that use it.
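As one illustration of what the structured form makes possible, here is a minimal sketch of the quantity normalisation mentioned earlier. The `GRAMS_PER_UNIT` table and the `to_grams` helper are hypothetical names made up for this example, and the table only covers a few weight units:

```python
from fractions import Fraction

# Hypothetical conversion factors to grams; illustrative, not exhaustive
GRAMS_PER_UNIT = {
    "g": Fraction(1),
    "kg": Fraction(1000),
    "lbs": Fraction("453.59237"),
}


def to_grams(parsed):
    """Convert a parsed ingredient dict to grams, or None if we can't."""
    factor = GRAMS_PER_UNIT.get(parsed["unit"])
    if factor is None or parsed["quantity"] is None:
        return None
    # Fraction accepts "2", "2.5" and "1/2" style strings
    return float(Fraction(parsed["quantity"]) * factor)


print(to_grams({"ingredient": "flour", "unit": "lbs", "quantity": "2"}))
```

Using `Fraction` rather than `float` for the quantity means strings like "1/2" can be handled without any extra parsing.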

Building a Parser

There are a few options for parsing these strings. If you're feeling frivolous and want to crack a walnut with a sledgehammer, you could probably get OpenAI's GPT to parse these strings with a single API call and a prompt. However, I wanted to approach this problem with a more proportional technique.

I'm using spaCy along with its PhraseMatcher, which scans a document for any of a list of known words and phrases. Once we've installed spaCy, we make a long list of units and tell spaCy about them:

import spacy
from spacy.matcher import PhraseMatcher

# Load the English language model
nlp = spacy.load("en_core_web_sm")


# Define a list of known units
known_units = [
    "grams", 
    "g", "kg", 
    "kilos", 
    "kilograms", 
    # ... many more omitted for brevity
    "lbs",
    "cup",
    "cups",
    "tablespoons",
    "teaspoons"]

# Initialize the pattern matcher
matcher = PhraseMatcher(nlp.vocab)

# Add the unit patterns to the matcher
matcher.add("UNITS", [nlp(x) for x in known_units])

Now we can write a function that uses this matcher, along with a bit of logic to figure out what's what, and structures the result into the JSON format I outlined above.

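# Function to parse ingredient strings
def parse_ingredient(ingredient):
    doc = nlp(ingredient)
    # Each match is a (match_id, start, end) tuple of token indices
    matches = matcher(doc)

    quantity = None
    unit = None
    ingredient_name = []

    for token in doc:
        # Skip brackets so that "milk (1 cup)" parses cleanly
        if token.text in ['(', ')']:
            continue
        if quantity is None and token.pos_ == 'NUM':
            # The first number we see is taken as the quantity
            quantity = token.text
        elif unit is None and any(start <= token.i < end for _, start, end in matches):
            # The token falls inside a span matched by the unit PhraseMatcher
            unit = token.text
        else:
            # Anything else is treated as part of the ingredient name
            ingredient_name.append(token.text)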

    return {
        "quantity": quantity,
        "unit": unit,
        "ingredient": " ".join(ingredient_name)
    }
  • First, we let spaCy parse the document and then run our matcher over it to find unit matches.
  • Then we iterate through each token in the document, ignoring brackets. We might want to extend this to other punctuation.
  • If the token is a number and we don't have a quantity yet, we assume it's the quantity.
  • We assume a quantity can only be defined once per ingredient string, so once we've found it we don't consider any more tokens for that role.
  • Otherwise, we check whether the current token was matched as a unit.
  • Last but not least, if the token can't be matched to a quantity or a unit, we assume it's part of the ingredient name.

This approach also works if the items are in a different order, e.g. "Milk (1 cup)" rather than "1 cup milk", which would completely throw off a regular expression written for the quantity-first format.
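To make the comparison concrete, here is a rough sketch of what a quantity-first regular expression might look like and how it fares on the reordered string. The pattern and the `regex_parse` helper are hypothetical, written only for this comparison, and the unit list is deliberately tiny:

```python
import re

# Hypothetical quantity-first pattern: "<number> <unit> <ingredient>"
QUANTITY_FIRST = re.compile(
    r"^(?P<quantity>\d+(?:[./]\d+)?)\s*(?P<unit>cups?|g|kg|lbs)\s+(?P<ingredient>.+)$"
)


def regex_parse(text):
    """Return the named groups as a dict, or None if the string doesn't fit."""
    match = QUANTITY_FIRST.match(text)
    return match.groupdict() if match else None


print(regex_parse("1 cup milk"))    # fits the quantity-first pattern
print(regex_parse("Milk (1 cup)"))  # does not fit, so nothing is extracted
```

You could of course add a second pattern for the bracketed form, but every new ordering needs its own expression, whereas the token-by-token approach doesn't care where the number and unit appear.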

Making it More Robust

This logic is not perfect but it should cover most reasonable use cases where someone enters an ingredient following one of the patterns shown above.

However, if I want to improve parsing accuracy and the variety of inputs we can understand in the future, I could train a custom NER model in spaCy. I will likely write about doing exactly that at some point.

Adding Some Tests

Since this is going to back an API endpoint in my recipe app, I want to be relatively sure it will work reliably for a few different examples. I'm building a Django app, so I've set up a test case using the Django testing framework:

from django.test import TestCase

from recipe_app.nlp.ingredients import parse_ingredient

class ParseIngredientTestCase(TestCase):

    def test_parse_ingredient(self):
        test_cases = [
            ("4 bananas", {"quantity": "4", "unit": None, "ingredient": "bananas"}),
            ("200g sugar", {"quantity": "200", "unit": "g", "ingredient": "sugar"}),
            ("1 stock cube", {"quantity": "1", "unit": None, "ingredient": "stock cube"}),
            ("1/2 tbsp flour", {"quantity": "1/2", "unit": "tbsp", "ingredient": "flour"}),
            ("3 lbs ground beef", {"quantity": "3", "unit": "lbs", "ingredient": "ground beef"}),
            ("2.5 oz chocolate chips", {"quantity": "2.5", "unit": "oz", "ingredient": "chocolate chips"}),
            ("5 kg potatoes", {"quantity": "5", "unit": "kg", "ingredient": "potatoes"}),
            ("1 cup milk", {"quantity": "1", "unit": "cup", "ingredient": "milk"}),
            ("2 tablespoons olive oil", {"quantity": "2", "unit": "tablespoons", "ingredient": "olive oil"}),
            ("1/4 pound sliced ham", {"quantity": "1/4", "unit": "pound", "ingredient": "sliced ham"}),
            ("2 liters water", {"quantity": "2", "unit": "liters", "ingredient": "water"}),
            ("750 ml orange juice", {"quantity": "750", "unit": "ml", "ingredient": "orange juice"}),
            ("3 teaspoons salt", {"quantity": "3", "unit": "teaspoons", "ingredient": "salt"}),
            ("milk (1 cup)", {"quantity": "1", "unit": "cup", "ingredient": "milk"}),
            ("tomatoes (3 pieces)", {"quantity": "3", "unit": "pieces", "ingredient": "tomatoes"}),
            ("pasta (200g)", {"quantity": "200", "unit": "g", "ingredient": "pasta"}),
        ]

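        for ingredient, expected in test_cases:
            # subTest reports each failing string separately instead of stopping at the first
            with self.subTest(ingredient=ingredient):
                parsed = parse_ingredient(ingredient)
                self.assertEqual(parsed, expected)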

This code tests a few different variations, including some examples where there is no unit. I also used American English spellings of some of the units for variety.

Building an Endpoint

I'm using HTMX for doing asynchronous interaction between the web page and the form. In my Form class, I set hx-get on the ingredient form field so that whenever the value changes it makes a request to an ingredient_parser endpoint:

from django import forms, urls

from recipe_app.models import Recipe, RecipeIngredient, Ingredient
   
class IngredientForm(forms.Form):
    ingredient = forms.CharField(widget=forms.TextInput(attrs={
        'hx-get': urls.reverse_lazy('ingredient_parser'),
        'hx-trigger': "change",
        "hx-target":"this",
        "hx-swap": "outerHTML"
        }))

Then I have a view class defined which grabs the value from the ingredient form, parses it and responds with a little HTML snippet which HTMX swaps in on the frontend. In the future I will look up a stock image per ingredient but for now I've got a mystery food picture:

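from django.http import JsonResponse
from django.shortcuts import render
from django.templatetags.static import static
from django.views import View

from recipe_app.nlp.ingredients import parse_ingredient


class IngredientAutocomplete(View):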

    def get(self, request, *args, **kwargs):
        
        ingredient = None
        form_key = None

        for key in request.GET:
            if key.startswith("recipeingredient"):
                ingredient = request.GET[key]
                form_key = key
                break

        if ingredient is None:
            return JsonResponse({}) # TODO: make this respond in a better way
        else:
            ing = parse_ingredient(ingredient)
            ing['raw_text'] = ingredient
            ing['form_key'] = form_key
            ing['thumbnail'] = static("images/food/mystery_food.png")
            return render(request, "partial/autocomplete/ingredients.html", context=ing)

The for loop over the request.GET keys is needed because each time we add a new ingredient to the form, the field gets a slightly different name, e.g. recipeingredient_1, recipeingredient_2 and so on.
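The prefix lookup can be tried out in isolation, without Django. In this small sketch, `find_ingredient_field` is a hypothetical helper name and a plain dict stands in for request.GET:

```python
def find_ingredient_field(params, prefix="recipeingredient"):
    """Return (key, value) for the first parameter whose name starts with prefix."""
    for key, value in params.items():
        if key.startswith(prefix):
            return key, value
    # No matching field was submitted
    return None, None


# A plain dict standing in for request.GET
params = {"title": "Pancakes", "recipeingredient_2": "1 cup milk"}
print(find_ingredient_field(params))
```

Returning the key as well as the value matters here, because the view passes the key on to the template as form_key.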

Putting it All Together

I recorded a video of the form that I've built where I add an ingredient and the response gets populated.