<p>As part of my work on <a href="https://brainsteam.co.uk/2023/11/13/gastronaut-fediverse-recipe-app/" data-type="post" data-id="380">Gastronaut</a>, I'm building a form that lets users create recipes. The form will attempt to parse each ingredient the user adds and find a suitable stock photo for it. As well as being cute and decorative, this parsing step is important for later, when we want to normalise ingredient quantities.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>What we're looking to do is take a string like "2lbs flour" and turn it into structured data. A JSON representation might look like this:</p>
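<p>A minimal sketch of that structure, assuming the same three keys (<code>quantity</code>, <code>unit</code>, <code>ingredient</code>) that the parser later in this post returns. The values are kept as strings because the parser stores token text as-is:</p>

```python
import json

# Structured form of "2lbs flour", using the keys returned by parse_ingredient
parsed = {
    "quantity": "2",
    "unit": "lbs",
    "ingredient": "flour",
}

# Serialise to JSON, e.g. for an API response
print(json.dumps(parsed, indent=2))
```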
<p>We can then do whatever we need with this structured data, like using the <code>ingredient</code> to look up a thumbnail in another system, or to generate a link back to the new recipe so that people looking for "banana" can find all the recipes that use it.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Building a Parser</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>There are a few options for parsing these strings. If you're feeling frivolous and want to crack a walnut with a sledgehammer, you could probably get OpenAI's GPT to parse these strings with a single API call and a prompt. However, I wanted to approach this problem with a more proportionate technique.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I'm using spaCy along with its <a href="https://spacy.io/api/phrasematcher">PhraseMatcher</a> functionality, which scans a text for any of a list of known words and phrases. Once we've <a href="https://spacy.io/usage">installed spaCy</a>, we make a long list of units and tell spaCy about them:</p>
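<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Register every known unit with the matcher as a phrase pattern
matcher.add("UNITS", [nlp(x) for x in known_units])</pre>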
<!-- /wp:enlighter/codeblock -->
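<p>For context, here is a minimal sketch of the setup around that <code>matcher.add</code> call. The short <code>known_units</code> list and the blank English pipeline are assumptions made for illustration. A blank pipeline is enough for phrase matching, but the parser below reads <code>token.pos_</code>, which needs a trained pipeline loaded with <code>spacy.load</code> instead:</p>

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline (tokeniser only), so no model download
# is needed for this sketch. It does not provide part-of-speech tags.
nlp = spacy.blank("en")

# Hypothetical, deliberately short list; a real one would be much longer
known_units = ["cup", "cups", "tablespoon", "tablespoons",
               "lb", "lbs", "gram", "grams", "liter", "liters"]

# The matcher shares the pipeline's vocabulary
matcher = PhraseMatcher(nlp.vocab)

# make_doc only tokenises, which is all the patterns need
matcher.add("UNITS", [nlp.make_doc(unit) for unit in known_units])

# Each match is a (match_id, start, end) tuple of token indices
doc = nlp("2 cups flour")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

<p>By default the matcher compares exact token text, so "Cups" would not match "cups". Passing <code>attr="LOWER"</code> to <code>PhraseMatcher</code> makes the comparison case-insensitive.</p>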
<!-- wp:paragraph -->
<p>Now we can write a function that uses this matcher, along with a bit of logic to figure out what's what, and structures the result into the JSON format I outlined above.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Function to parse ingredient strings
def parse_ingredient(ingredient):
    doc = nlp(ingredient)
    # Each match is a (match_id, start, end) tuple of token indices
    matches = matcher(doc)

    quantity = None
    unit = None
    ingredient_name = []

    for token in doc:
        # Skip brackets, e.g. "Milk (1 cup)"
        if token.text in ['(', ')']:
            continue

        if not quantity and token.pos_ == 'NUM':
            # The first number we see is taken as the quantity
            quantity = token.text
        elif unit is None and any(start == token.i for _, start, _ in matches):
            # The token begins a span that matched one of the known units
            unit = token.text
        else:
            ingredient_name.append(token.text)

    return {
        "quantity": quantity,
        "unit": unit,
        "ingredient": " ".join(ingredient_name)
    }
</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:list -->
<ul><!-- wp:list-item -->
<li>First, we let spaCy parse the document and then use our matcher to try to find unit matches.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Then we iterate through each token in the document, ignoring brackets - we might want to expand this to other punctuation.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>If the token is a number and we don't have a quantity yet, assume it's the quantity.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>We assume <code>quantity</code> can only be defined once per ingredient string, so once we've found it we don't consider any more tokens for that role.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Otherwise, we check whether the current token was matched as a unit.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Last but not least, if the token can't be matched to a quantity or a unit then we assume it's part of the ingredient name.</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->
<!-- wp:paragraph -->
<p>This approach also works when the items appear in a different order, which would throw off a regular expression written for one fixed order, e.g. <code>Milk (1 cup)</code> rather than <code>1 cup milk</code>.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Making it More Robust</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>This logic is not perfect, but it should cover most reasonable use cases where someone enters an ingredient in the form <code>&lt;quantity&gt; &lt;unit&gt; &lt;ingredient&gt;</code>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>However, if I want to improve parsing accuracy and handle a wider variety of inputs in the future, I could train a custom NER model in spaCy. I will likely write about doing exactly that at some point.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Adding Some Tests</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Since this is going to power an API endpoint in my recipe app, I want to be reasonably sure it works reliably across a few different examples. I'm building a Django app, so I've set up a test case using the Django testing framework.</p>
<p>The tests cover a few different variations, including some examples where there is no unit. I also used American English spellings of some of the units for variety.</p>
<!-- /wp:paragraph -->
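<p>A minimal sketch of what such a test case could look like. It uses the standard library's <code>unittest.TestCase</code>, which Django's <code>django.test.TestCase</code> builds on. The example strings, the expected values and the <code>build_test_case</code> helper are all assumptions made for illustration: the expected results are what the parsing logic above should give if spaCy tags the numbers as <code>NUM</code> and each unit is in <code>known_units</code>. The parser is passed in as an argument so the sketch runs without spaCy; in the Django project the test would simply import and call <code>parse_ingredient</code>.</p>

```python
import unittest

# (input, expected) pairs. The expected dicts use the same keys that
# parse_ingredient returns; the example strings are illustrative.
CASES = [
    ("2 cups flour", {"quantity": "2", "unit": "cups", "ingredient": "flour"}),
    ("Milk (1 cup)", {"quantity": "1", "unit": "cup", "ingredient": "Milk"}),
    ("1 liter water", {"quantity": "1", "unit": "liter", "ingredient": "water"}),
    ("3 eggs", {"quantity": "3", "unit": None, "ingredient": "eggs"}),
]


def build_test_case(parse):
    """Return a TestCase class that checks `parse` against CASES."""

    class IngredientParserTest(unittest.TestCase):
        def test_known_examples(self):
            for text, expected in CASES:
                # subTest reports each failing example separately
                with self.subTest(text=text):
                    self.assertEqual(parse(text), expected)

    return IngredientParserTest
```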
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Building an Endpoint</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I'm using HTMX for doing asynchronous interaction between the web page and the form. In my Form class, I set <code>hx-get</code> on the ingredient form field so that whenever the value changes it makes a request to an <code>ingredient_parser</code> endpoint:</p>
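<p>As a rough sketch, the HTMX attributes can be built as a plain dictionary, which in a Django form would be passed as the <code>attrs</code> argument of the field's widget, e.g. <code>forms.TextInput(attrs=...)</code>. The helper name, the URL and the target selector below are hypothetical:</p>

```python
def htmx_attrs(endpoint_url, target_selector):
    """Build HTML attributes that make an input call an endpoint via HTMX."""
    return {
        # Issue a GET request to the parser endpoint
        "hx-get": endpoint_url,
        # Fire the request when the field's value changes
        "hx-trigger": "change",
        # CSS selector of the element that receives the response
        "hx-target": target_selector,
        # Replace the target's inner HTML with the returned snippet
        "hx-swap": "innerHTML",
    }


# Hypothetical URL and target element
attrs = htmx_attrs("/ingredient_parser/", "#ingredient-preview")
```

<p>HTMX includes the value of the input that triggered the request, so the endpoint receives it as a query parameter named after the field.</p>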
<p>Then I have a view class defined which grabs the value from the ingredient form, parses it and responds with a little HTML snippet which HTMX swaps in on the frontend. In the future I will look up a stock image per ingredient but for now I've got a mystery food picture:</p>
<p>The <code>for</code> loop over the <code>request.GET</code> items is needed because each time we add a new ingredient to the form, the field gets a slightly different name, e.g. <code>recipeingredient_1</code>, <code>recipeingredient_2</code> and so on.</p>
<!-- /wp:paragraph -->
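<p>The core of that view can be sketched without Django, using a plain dictionary in place of <code>request.GET</code> (whose <code>items()</code> method can be looped over in the same way). The function names, the placeholder image path and the HTML markup are assumptions for illustration; only the <code>recipeingredient</code> field prefix comes from the form described above:</p>

```python
from html import escape

# Hypothetical path to the "mystery food" placeholder picture
PLACEHOLDER_IMAGE = "/static/mystery-food.png"


def find_ingredient_text(params):
    """Return the first non-empty value whose key starts with 'recipeingredient'."""
    for name, value in params.items():
        # Field names vary per row: recipeingredient_1, recipeingredient_2, ...
        if name.startswith("recipeingredient") and value.strip():
            return value
    return None


def render_snippet(parsed, image_url=PLACEHOLDER_IMAGE):
    """Render the parsed ingredient as a small HTML fragment for HTMX to swap in."""
    parts = [parsed.get("quantity"), parsed.get("unit"), parsed.get("ingredient")]
    # Leave out missing pieces, e.g. when there is no unit
    label = " ".join(part for part in parts if part)
    # Escape user-supplied text before placing it in HTML
    return (
        f'<span class="ingredient">'
        f'<img src="{escape(image_url)}" alt=""> {escape(label)}'
        f"</span>"
    )
```

<p>In the real view, the text returned by <code>find_ingredient_text</code> would be passed to <code>parse_ingredient</code>, and the rendered snippet would be returned as the HTTP response.</p>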
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Putting it All Together</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I recorded a video of the form that I've built where I add an ingredient and the response gets populated.</p>