---
categories:
- Data Science
- Software Development
date: '2023-11-19 18:13:40'
draft: false
tags:
- django
- gastronaut
- python
title: Parsing Ingredient Strings with SpaCy PhraseMatcher
type: posts
---
<!-- wp:paragraph -->
<p>As part of my work on <a href="https://brainsteam.co.uk/2023/11/13/gastronaut-fediverse-recipe-app/" data-type="post" data-id="380">Gastronaut</a>, I'm building a form that lets users create recipes. The form will attempt to parse each ingredient the user adds and find a suitable stock photo for it. As well as being cute and decorative, this parsing step is important for later, when we want to normalise ingredient quantities.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>What we're looking to do is take a string like "2lbs flour" and turn it into structured data. A JSON representation might look like this:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"json"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="json" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">{
"ingredient":"flour",
"unit":"lbs",
"quantity":"2"
}</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>We can then do whatever we need with this structured data, like using the <code>ingredient</code> to look up a thumbnail in another system, or generating a link or reference to the new recipe so that people looking for "banana" can find all the recipes that use it.</p>
<!-- /wp:paragraph -->
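<!-- wp:paragraph -->
<p>Because the <code>quantity</code> is kept as a string, any later normalisation step has to turn it into a number first. The snippet below is a minimal sketch of one way to do that with Python's standard <code>fractions</code> module. The helper name <code>quantity_to_number</code> is hypothetical and not part of Gastronaut.</p>
<!-- /wp:paragraph -->

```python
from fractions import Fraction


def quantity_to_number(quantity):
    """Convert a quantity string such as "2", "2.5" or "1/2" to a Fraction.

    Returns None when no quantity was parsed or the string is not numeric.
    (Hypothetical helper, for illustration only.)
    """
    if quantity is None:
        return None
    try:
        return Fraction(quantity)
    except (ValueError, ZeroDivisionError):
        return None


parsed = {"ingredient": "flour", "unit": "lbs", "quantity": "2"}
amount = quantity_to_number(parsed["quantity"])
print(amount * 2)  # doubling the recipe becomes simple arithmetic
```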
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Building a Parser</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>There are a few options for parsing these strings. If you're feeling frivolous and want to crack a walnut with a sledgehammer, you could probably get OpenAI's GPT to parse these strings with a single API call and a prompt. However, I wanted to approach this problem with a more proportional technique.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I'm using spaCy along with its <a href="https://spacy.io/api/phrasematcher">PhraseMatcher</a>, which scans a document for any of a given list of words and phrases. Once we've <a href="https://spacy.io/usage">installed spaCy</a>, we make a long list of units and tell spaCy about them:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import spacy
from spacy.matcher import PhraseMatcher
# Load the English language model
nlp = spacy.load("en_core_web_sm")
# Define a list of known units
known_units = [
    "grams",
    "g", "kg",
    "kilos",
    "kilograms",
    # ... many more omitted for brevity
    "lbs",
    "cup",
    "cups",
    "tablespoons",
    "teaspoons",
]
# Initialize the pattern matcher
matcher = PhraseMatcher(nlp.vocab)
# Add the unit patterns to the matcher
matcher.add("UNITS", [nlp(x) for x in known_units])</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>Now we can write a function that uses this matcher, along with a bit of logic to figure out what's what, and structures the result into the JSON format I outlined above.</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Function to parse ingredient strings
def parse_ingredient(ingredient):
    doc = nlp(ingredient)
    # Each match is a (match_id, start, end) tuple of token indices
    matches = matcher(doc)
    # Token indices at which a known unit begins
    unit_starts = {start for _, start, _ in matches}

    quantity = None
    unit = None
    ingredient_name = []

    for token in doc:
        # Skip brackets so "milk (1 cup)" parses cleanly
        if token.text in ['(', ')']:
            continue
        if not quantity and token.pos_ == 'NUM':
            quantity = token.text
        elif unit is None and token.i in unit_starts:
            unit = token.text
        else:
            ingredient_name.append(token.text)

    return {
        "quantity": quantity,
        "unit": unit,
        "ingredient": " ".join(ingredient_name)
    }
</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:list -->
<ul><!-- wp:list-item -->
<li>First, we let spaCy parse the document and then use our matcher to try to find unit matches.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Then we iterate through each token in the document, ignoring brackets - we might want to expand this to other punctuation.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>If the token is a number and we don't have a quantity yet, assume it's the quantity.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>We assume <code>quantity</code> can only be defined once per ingredient string, so once we've found it we don't consider any more tokens for that role.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Otherwise, we check whether the current token was matched to a unit.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Last but not least, if the token can't be matched to a quantity or a unit then we assume it's part of the ingredient name.</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->
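<!-- wp:paragraph -->
<p>To make the control flow in the steps above easy to try without installing spaCy or a language model, here is a dependency-free sketch of the same loop. It is not the code Gastronaut uses: a crude regular-expression tokenizer and a plain set of units stand in for spaCy's tokenizer, part-of-speech tagger and PhraseMatcher, and the names <code>KNOWN_UNITS</code>, <code>TOKEN_RE</code> and <code>parse_ingredient_sketch</code> are made up for this illustration.</p>
<!-- /wp:paragraph -->

```python
import re

# A tiny stand-in for the unit list; the real list is much longer.
KNOWN_UNITS = {"g", "kg", "lbs", "cup", "cups", "tablespoons", "teaspoons"}

# Crude tokenizer (an assumption, not spaCy's rules):
# numbers such as 2, 2.5 or 1/2, runs of letters, or a single bracket.
TOKEN_RE = re.compile(r"\d+(?:[./]\d+)?|[A-Za-z]+|[()]")


def parse_ingredient_sketch(text):
    quantity = None
    unit = None
    name = []
    for token in TOKEN_RE.findall(text):
        # Ignore brackets entirely
        if token in ("(", ")"):
            continue
        # The first numeric token becomes the quantity
        if quantity is None and token[0].isdigit():
            quantity = token
        # The first known unit becomes the unit
        elif unit is None and token in KNOWN_UNITS:
            unit = token
        # Everything else is treated as part of the ingredient name
        else:
            name.append(token)
    return {"quantity": quantity, "unit": unit, "ingredient": " ".join(name)}


print(parse_ingredient_sketch("2lbs flour"))
print(parse_ingredient_sketch("milk (1 cup)"))
```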
<!-- wp:paragraph -->
<p>This approach also works if the items are in a different order (which would completely throw off a regular expression written for one fixed order), e.g. <code>Milk (1 cup)</code> rather than <code>1 cup milk</code>.</p>
<!-- /wp:paragraph -->
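<!-- wp:paragraph -->
<p>To illustrate the point about ordering, here is a small sketch of a regular expression that hard-codes one order. The pattern <code>ORDERED</code> is a hypothetical example written for this post, not something from the app. It handles the conventional layout but finds nothing when the ingredient comes first.</p>
<!-- /wp:paragraph -->

```python
import re

# A pattern that assumes the fixed order: quantity, unit, ingredient.
ORDERED = re.compile(
    r"^(?P<quantity>\d+(?:[./]\d+)?)\s*(?P<unit>[A-Za-z]+)\s+(?P<ingredient>.+)$"
)

conventional = ORDERED.match("1 cup milk")
reordered = ORDERED.match("Milk (1 cup)")

print(conventional.groupdict())  # all three fields are captured
print(reordered)                 # no match: the string does not start with a number
```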
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Making it More Robust</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>This logic is not perfect but it should cover most reasonable use cases where someone enters an ingredient following <code>&lt;quantity&gt; &lt;unit&gt; &lt;ingredient&gt;</code>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>However, if I want to improve parsing performance and widen the variety of inputs we can understand, I could train a custom NER model inside spaCy. I will likely write about doing exactly that at some point in the future.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Adding Some Tests</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Since this is going to sit behind an API endpoint in my recipe app, I want to be relatively sure it will work reliably for a few different examples. I'm building a Django app, so I've set up a test case using the Django testing framework:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from django.test import TestCase
from recipe_app.nlp.ingredients import parse_ingredient


class ParseIngredientTestCase(TestCase):
    def test_parse_ingredient(self):
        test_cases = [
            ("4 bananas", {"quantity": "4", "unit": None, "ingredient": "bananas"}),
            ("200g sugar", {"quantity": "200", "unit": "g", "ingredient": "sugar"}),
            ("1 stock cube", {"quantity": "1", "unit": None, "ingredient": "stock cube"}),
            ("1/2 tbsp flour", {"quantity": "1/2", "unit": "tbsp", "ingredient": "flour"}),
            ("3 lbs ground beef", {"quantity": "3", "unit": "lbs", "ingredient": "ground beef"}),
            ("2.5 oz chocolate chips", {"quantity": "2.5", "unit": "oz", "ingredient": "chocolate chips"}),
            ("5 kg potatoes", {"quantity": "5", "unit": "kg", "ingredient": "potatoes"}),
            ("1 cup milk", {"quantity": "1", "unit": "cup", "ingredient": "milk"}),
            ("2 tablespoons olive oil", {"quantity": "2", "unit": "tablespoons", "ingredient": "olive oil"}),
            ("1/4 pound sliced ham", {"quantity": "1/4", "unit": "pound", "ingredient": "sliced ham"}),
            ("2 liters water", {"quantity": "2", "unit": "liters", "ingredient": "water"}),
            ("750 ml orange juice", {"quantity": "750", "unit": "ml", "ingredient": "orange juice"}),
            ("3 teaspoons salt", {"quantity": "3", "unit": "teaspoons", "ingredient": "salt"}),
            ("milk (1 cup)", {"quantity": "1", "unit": "cup", "ingredient": "milk"}),
            ("tomatoes (3 pieces)", {"quantity": "3", "unit": "pieces", "ingredient": "tomatoes"}),
            ("pasta (200g)", {"quantity": "200", "unit": "g", "ingredient": "pasta"}),
        ]
        for ingredient, expected in test_cases:
            # subTest reports each failing input string separately
            with self.subTest(ingredient=ingredient):
                parsed = parse_ingredient(ingredient)
                self.assertEqual(parsed, expected)
</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>This code tests a few different variations and includes some examples where there is no unit. I also used American English spellings of some of the units for variety.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Building an Endpoint</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I'm using HTMX for asynchronous interaction between the web page and the form. In my Form class, I set <code>hx-get</code> on the ingredient form field so that whenever the value changes it makes a request to an <code>ingredient_parser</code> endpoint:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from django import forms, urls
from recipe_app.models import Recipe, RecipeIngredient, Ingredient


class IngredientForm(forms.Form):
    ingredient = forms.CharField(widget=forms.TextInput(attrs={
        'hx-get': urls.reverse_lazy('ingredient_parser'),
        'hx-trigger': "change",
        "hx-target": "this",
        "hx-swap": "outerHTML"
    }))
</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>Then I have a view class defined which grabs the value from the ingredient form, parses it and responds with a little HTML snippet which HTMX swaps in on the frontend. In the future I will look up a stock image per ingredient but for now I've got a mystery food picture:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from django.http import JsonResponse
from django.shortcuts import render
from django.templatetags.static import static
from django.views import View
from recipe_app.nlp.ingredients import parse_ingredient


class IngredientAutocomplete(View):
    def get(self, request, *args, **kwargs):
        ingredient = None
        form_key = None
        # Find the first query parameter that belongs to an ingredient field
        for key in request.GET:
            if key.startswith("recipeingredient"):
                ingredient = request.GET[key]
                form_key = key
                break
        if ingredient is None:
            return JsonResponse({})  # TODO: make this respond in a better way
        else:
            ing = parse_ingredient(ingredient)
            ing['raw_text'] = ingredient
            ing['form_key'] = form_key
            ing['thumbnail'] = static("images/food/mystery_food.png")
            return render(request, "partial/autocomplete/ingredients.html", context=ing)</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>The for loop over the <code>request.GET</code> keys is needed because each time we add a new ingredient to the form, the field gets a slightly different name, e.g. <code>recipeingredient_1</code>, <code>recipeingredient_2</code> and so on.</p>
<!-- /wp:paragraph -->
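<!-- wp:paragraph -->
<p>If the same prefix lookup were needed in more than one view, it could be pulled out into a small helper. The sketch below is only an illustration: <code>find_prefixed_field</code> is a hypothetical name, and a plain dict stands in for Django's dict-like <code>request.GET</code>.</p>
<!-- /wp:paragraph -->

```python
def find_prefixed_field(params, prefix="recipeingredient"):
    """Return (key, value) for the first parameter whose name starts with prefix.

    Returns (None, None) when no parameter matches.
    (Hypothetical helper, for illustration only.)
    """
    for key, value in params.items():
        if key.startswith(prefix):
            return key, value
    return None, None


# A plain dict standing in for the query parameters of one HTMX request
query = {"csrf": "abc123", "recipeingredient_2": "200g sugar"}
form_key, raw_text = find_prefixed_field(query)
print(form_key, raw_text)
```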
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Putting it All Together</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I recorded a video of the form that I've built where I add an ingredient and the response gets populated.</p>
<!-- /wp:paragraph -->
<!-- wp:embed {"url":"https://youtu.be/_l0_Lxwm4TY","type":"video","providerNameSlug":"youtube","responsive":true,"className":"wp-embed-aspect-4-3 wp-has-aspect-ratio"} -->
<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
https://youtu.be/_l0_Lxwm4TY
</div></figure>
<!-- /wp:embed -->