| categories | date | draft | tags | title | type | url |
|---|---|---|---|---|---|---|
| | 2023-11-19 18:13:40 | false | | Parsing Ingredient Strings with SpaCy PhraseMatcher | posts | /2023/11/19/parsing-ingredient-strings-with-spacy-phrasematcher/ |
As part of my work on Gastronaut, I'm building a form that lets users create recipes. The form will attempt to parse each ingredient the user adds to their recipe and find a suitable stock photo for it. As well as being cute and decorative, this parsing step is important for later, when we want to normalise ingredient quantities.
What we're looking to do is take a string like "2lbs flour" and turn it into structured data. A JSON representation might look like this:
{ "ingredient":"flour", "unit":"lbs", "quantity":"2" }
We can then do whatever we need with this structured data, like using the `ingredient` value to look up a thumbnail in another system, or generating a link or reference to the new recipe so that people looking for "banana" can find all the recipes that use it.
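As one illustration of what the structured form makes possible, here is a minimal sketch of the quantity normalisation mentioned earlier. The helper `to_grams` and the `GRAMS_PER_UNIT` table are hypothetical names introduced for this example, not part of Gastronaut, and the table only covers a few weight units.

```python
from fractions import Fraction

# Hypothetical lookup table: grams per unit, for a few weight units only.
GRAMS_PER_UNIT = {
    "g": Fraction(1),
    "kg": Fraction(1000),
    "oz": Fraction("28.349523125"),
    "lbs": Fraction("453.59237"),
}


def to_grams(parsed):
    """Convert a parsed ingredient dict to grams, or return None if we can't."""
    quantity = parsed.get("quantity")
    unit = parsed.get("unit")
    if quantity is None or unit not in GRAMS_PER_UNIT:
        return None
    # Fraction accepts strings such as "2", "2.5" and "1/2".
    return float(Fraction(quantity) * GRAMS_PER_UNIT[unit])


print(to_grams({"ingredient": "flour", "unit": "lbs", "quantity": "2"}))
print(to_grams({"ingredient": "potatoes", "unit": "kg", "quantity": "1/2"}))  # 500.0
print(to_grams({"ingredient": "bananas", "unit": None, "quantity": "4"}))  # None
```

Volume units such as cups would need a per-ingredient density, which is why they are left out of this sketch.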
Building a Parser
There are a few options for parsing these strings. If you're feeling frivolous and want to crack a walnut with a sledgehammer, you could probably get OpenAI's GPT to parse these strings with a single API call and a prompt. However, I wanted to approach this problem with a more proportional technique.
I'm using spaCy along with its PhraseMatcher, which scans a document for any of a list of known words and phrases. Once we've installed spaCy, we make a long list of units and tell spaCy about them:
```python
import spacy
from spacy.matcher import PhraseMatcher

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Define a list of known units
known_units = [
    "grams", "g", "kg", "kilos", "kilograms",
    # ... many more omitted for brevity
    "lbs", "cup", "cups", "tablespoons", "teaspoons",
]

# Initialize the phrase matcher
matcher = PhraseMatcher(nlp.vocab)

# Add the unit patterns to the matcher
matcher.add("UNITS", [nlp(x) for x in known_units])
```
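To make the matcher's output concrete, here is a small standalone sketch of what calling a PhraseMatcher returns. It makes a few assumptions that differ from the setup above: it uses a blank English pipeline (`spacy.blank("en")`) so no model download is needed, it passes `attr="LOWER"` so matching is case-insensitive, and `find_units` is a hypothetical helper name.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline (tokenizer only) is enough to demo matching.
nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text, so "Cups" matches "cups".
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("UNITS", [nlp.make_doc(u) for u in ["cup", "cups", "lbs"]])


def find_units(text):
    """Return (label, start, end, matched_text) for every unit match."""
    doc = nlp.make_doc(text)
    results = []
    # Each match is a (match_id, start, end) tuple; start and end are
    # token offsets, and match_id is a hash that maps back to "UNITS".
    for match_id, start, end in matcher(doc):
        label = nlp.vocab.strings[match_id]
        results.append((label, start, end, doc[start:end].text))
    return results


print(find_units("2 Cups flour"))  # [('UNITS', 1, 2, 'Cups')]
```

The important detail is that `start` and `end` are token positions rather than character positions, so they can be compared directly with `token.i`.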
Now we can write a function that uses this matcher, along with a bit of logic to figure out what's what, and structures the result into the JSON format I outlined above.
```python
# Function to parse ingredient strings
def parse_ingredient(ingredient):
    doc = nlp(ingredient)
    # Each match is a (match_id, start, end) tuple of token offsets
    matches = matcher(doc)

    quantity = None
    unit = None
    ingredient_name = []

    for token in doc:
        # Skip brackets, e.g. in "milk (1 cup)"
        if token.text in ['(', ')']:
            continue

        if quantity is None and token.pos_ == 'NUM':
            # The first number we see is the quantity
            quantity = token.text
        elif unit is None and any(start <= token.i < end for _, start, end in matches):
            # The token falls inside a span that matched a known unit
            unit = token.text
        else:
            ingredient_name.append(token.text)

    return {
        "quantity": quantity,
        "unit": unit,
        "ingredient": " ".join(ingredient_name),
    }
```
- First, we allow spaCy to parse the document and then use our matcher to try and find unit matches.
- Then we iterate through each token in the document, ignoring brackets - we might want to expand this to other punctuation.
- If the token is a number and we don't have a quantity yet, assume it's the quantity.
- We assume `quantity` can only be defined once per ingredient string, so once we've found it we don't consider any more tokens for that role.
- Otherwise, we check whether the current token was matched as a unit.
- Last but not least, if the token can't be matched to a quantity or a unit then we assume it's part of the ingredient name.
This approach works if the items are in a different order too (which would completely throw a regular expression off) e.g. Milk (1 cup) rather than 1 cup milk.
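For comparison, here is a rough sketch of the kind of regular expression that handles the quantity-first order but gets thrown off by the reversed one. The pattern and the `regex_parse` helper are illustrative assumptions, not code from Gastronaut, and the unit list is deliberately tiny.

```python
import re

# Hypothetical pattern: expects "<quantity> <unit> <ingredient>" in that order.
PREFIX_PATTERN = re.compile(
    r"^(?P<quantity>\d+(?:[./]\d+)?)\s*"
    r"(?P<unit>cups?|g|kg|lbs)\s+"
    r"(?P<ingredient>.+)$"
)


def regex_parse(text):
    """Return the named groups as a dict, or None if the order is unexpected."""
    match = PREFIX_PATTERN.match(text)
    return match.groupdict() if match else None


print(regex_parse("1 cup milk"))
# {'quantity': '1', 'unit': 'cup', 'ingredient': 'milk'}
print(regex_parse("Milk (1 cup)"))
# None
```

You could keep adding alternative patterns for each ordering, but the token-based approach avoids that by classifying each token independently of its position.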
Making it More Robust
This logic is not perfect, but it should cover most reasonable use cases where someone enters an ingredient in one of the formats shown above.
However, if I want to improve the parsing performance and variety of things that we want to be able to understand in the future, I could train a custom NER model inside spacy. I will likely write about doing exactly that at some point in the future.
Adding Some Tests
Since this is going to sit behind an API endpoint in my recipe app, I want to be relatively sure it will work reliably for a few different examples. I'm building a Django app, so I've set up a test case using the Django testing framework:
```python
from django.test import TestCase

from recipe_app.nlp.ingredients import parse_ingredient


class ParseIngredientTestCase(TestCase):
    def test_parse_ingredient(self):
        test_cases = [
            ("4 bananas", {"quantity": "4", "unit": None, "ingredient": "bananas"}),
            ("200g sugar", {"quantity": "200", "unit": "g", "ingredient": "sugar"}),
            ("1 stock cube", {"quantity": "1", "unit": None, "ingredient": "stock cube"}),
            ("1/2 tbsp flour", {"quantity": "1/2", "unit": "tbsp", "ingredient": "flour"}),
            ("3 lbs ground beef", {"quantity": "3", "unit": "lbs", "ingredient": "ground beef"}),
            ("2.5 oz chocolate chips", {"quantity": "2.5", "unit": "oz", "ingredient": "chocolate chips"}),
            ("5 kg potatoes", {"quantity": "5", "unit": "kg", "ingredient": "potatoes"}),
            ("1 cup milk", {"quantity": "1", "unit": "cup", "ingredient": "milk"}),
            ("2 tablespoons olive oil", {"quantity": "2", "unit": "tablespoons", "ingredient": "olive oil"}),
            ("1/4 pound sliced ham", {"quantity": "1/4", "unit": "pound", "ingredient": "sliced ham"}),
            ("2 liters water", {"quantity": "2", "unit": "liters", "ingredient": "water"}),
            ("750 ml orange juice", {"quantity": "750", "unit": "ml", "ingredient": "orange juice"}),
            ("3 teaspoons salt", {"quantity": "3", "unit": "teaspoons", "ingredient": "salt"}),
            ("milk (1 cup)", {"quantity": "1", "unit": "cup", "ingredient": "milk"}),
            ("tomatoes (3 pieces)", {"quantity": "3", "unit": "pieces", "ingredient": "tomatoes"}),
            ("pasta (200g)", {"quantity": "200", "unit": "g", "ingredient": "pasta"}),
        ]

        for ingredient, expected in test_cases:
            parsed = parse_ingredient(ingredient)
            self.assertEqual(parsed, expected)
```
This code tests a few different variations, including some examples where there is no unit. I also used American English spellings of some of the units for variety.
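One possible refinement for a table-driven test like this is `subTest`, which reports each failing case separately instead of stopping at the first one. It comes from `unittest.TestCase`, which Django's `TestCase` extends. The sketch below uses plain `unittest` and a hypothetical `fake_parse` stand-in so that it runs without spaCy or Django; in the real test you would call `parse_ingredient` instead.

```python
import unittest


def fake_parse(text):
    """Hypothetical stand-in for parse_ingredient: quantity first, no units."""
    quantity, _, name = text.partition(" ")
    return {"quantity": quantity, "unit": None, "ingredient": name}


class ParseIngredientSubTests(unittest.TestCase):
    def test_cases_individually(self):
        test_cases = [
            ("4 bananas", {"quantity": "4", "unit": None, "ingredient": "bananas"}),
            ("1 stock cube", {"quantity": "1", "unit": None, "ingredient": "stock cube"}),
        ]
        for text, expected in test_cases:
            # Each input string becomes its own labelled sub-test, so a
            # failure names the offending ingredient and the loop continues.
            with self.subTest(ingredient=text):
                self.assertEqual(fake_parse(text), expected)
```

With this pattern, a single broken example no longer hides the results for the examples that come after it in the list.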
Building an Endpoint
I'm using HTMX for asynchronous interaction between the web page and the form. In my Form class, I set `hx-get` on the ingredient form field so that whenever the value changes it makes a request to an `ingredient_parser` endpoint:
```python
from django import forms, urls

from recipe_app.models import Recipe, RecipeIngredient, Ingredient


class IngredientForm(forms.Form):
    ingredient = forms.CharField(widget=forms.TextInput(attrs={
        'hx-get': urls.reverse_lazy('ingredient_parser'),
        'hx-trigger': "change",
        "hx-target": "this",
        "hx-swap": "outerHTML",
    }))
```
Then I have a view class defined which grabs the value from the ingredient form, parses it and responds with a little HTML snippet which HTMX swaps in on the frontend. In the future I will look up a stock image per ingredient but for now I've got a mystery food picture:
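```python
from django.http import JsonResponse
from django.shortcuts import render
from django.templatetags.static import static
from django.views import View

from recipe_app.nlp.ingredients import parse_ingredient


class IngredientAutocomplete(View):
    def get(self, request, *args, **kwargs):
        ingredient = None
        form_key = None

        # Find the first query parameter that belongs to an ingredient field
        for key in request.GET:
            if key.startswith("recipeingredient"):
                ingredient = request.GET[key]
                form_key = key
                break

        if ingredient is None:
            # TODO: make this respond in a better way
            return JsonResponse({})
        else:
            ing = parse_ingredient(ingredient)
            ing['raw_text'] = ingredient
            ing['form_key'] = form_key
            ing['thumbnail'] = static("images/food/mystery_food.png")
            return render(request, "partial/autocomplete/ingredients.html", context=ing)
```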
The `for` loop over the `request.GET` keys is needed because each time we add a new ingredient to the form, the field gets a slightly different name, e.g. recipeingredient_1, recipeingredient_2 and so on.
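That lookup could also be pulled out into a small helper, which makes it easy to test without a request object. This is only a sketch: `find_ingredient_field` is a hypothetical name, and a plain dict stands in for `request.GET` here, on the assumption that only key iteration and item access are needed.

```python
def find_ingredient_field(params, prefix="recipeingredient"):
    """Return (key, value) for the first parameter whose name starts with prefix.

    Returns (None, None) when no such parameter is present.
    """
    for key in params:
        if key.startswith(prefix):
            return key, params[key]
    return None, None


# A plain dict stands in for request.GET in this example.
print(find_ingredient_field({"csrf": "abc", "recipeingredient_2": "1 cup milk"}))
# ('recipeingredient_2', '1 cup milk')
print(find_ingredient_field({"csrf": "abc"}))
# (None, None)
```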
Putting it All Together
I recorded a video of the form that I've built where I add an ingredient and the response gets populated.