---
date: 2024-12-23 14:04:49+00:00
description: A short update about my handwriting OCR project PenParse (formerly AnnoMemo)
preview: /social/9918440fba3786be318d76daffc796607384aa6f788e2e6c141f7af4d9b3ff75.png
tags:
- post
- annomemo
- penparse
- python
- softeng
title: Working on PenParse
type: posts
url: /2024/12/23/penparse-update
---
I'm currently working on PenParse (initially I called this [AnnoMemo](https://brainsteam.co.uk/2024/11/3/03-annomemo-telegram-bot/)). It's a system for transcribing photos of handwritten notes into markdown notes and then adding them automatically to a personal knowledge management (PKM) app like [[Obsidian]], [[Joplin]] or [[Memos]]. In other words, it's a [Handwritten Text Recognition](https://en.wikipedia.org/wiki/Handwriting_recognition) (HTR) tool.
## Motivation for the project
Why work on this project? Well, firstly, I love writing with pen and paper and also working with digital PKM apps. However, I am not a fan of typing up my notes or dictating them out loud to a speech-to-text program. The convenience of being able to take a photo of my notes and have them appear in Obsidian is seductive. Furthermore, there seems to be an appetite for it, per the recently announced initiative from Joplin around [integrating handwritten text recognition into their app](https://joplinapp.org/news/20241217-project-4-htr/).
Now is a good time to build this tool too. Local multi-modal language models like [Qwen2 VL](Qwen/Qwen2-VL-2B-Instruct) have gotten good enough at this task that we don't necessarily need to send any data (i.e. photos of your innermost thoughts) to OpenAI or Anthropic for processing. Models like Qwen2 VL 2B can now run on an (admittedly high-end) consumer laptop or graphics card and process a page of text in a few seconds.
The stakes of a model making a mistake are pretty low but the task is laborious, which is the sweet spot for this kind of tool. Effectively, we're transcribing handwritten text that's going to end up as a note in Obsidian or a blog draft. If we store the extracted text next to the image of the handwritten note, you can quickly spot anything that looks wrong and verify it against the image.
I'm also interested in building a completely opt-in, optional, voluntary HTR dataset which could be used to train smaller, more efficient HTR models, potentially opening up the possibility of running this pipeline completely locally on lower-end machines in the future. ***I cannot stress this enough: Participation in this dataset would be completely optional and opt-in.*** I won't ever build a tool that gobbles up your innermost thoughts automatically without consent. You don't have to trust me on that either: you'll be able to read the source code of the application and see that it's true. I plan on adding some friction to the process of submitting to this dataset so that people don't accidentally end up doing it. If that means we don't collect many samples then meh, so be it!
I've written in more detail about the project and my philosophy and choices [over on my digital garden](https://notes.jamesravey.me/Projects/PenParse).
## Progress so far
### Web App
I've taken the proof-of-concept that I built [a few weeks ago](https://brainsteam.co.uk/2024/11/3/03-annomemo-telegram-bot/) and started to build a web app around it, using Django for the app and Celery for running the image processing in the background. I've deliberately chosen to decouple the specific models from the application itself, so that self-hosters who want to, or who can't run the model locally, can use a third-party API instead.
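To illustrate what that decoupling could look like, here is a minimal sketch. It is not PenParse's actual code: the names `TranscriptionBackend`, `EchoBackend`, `BACKENDS`, `get_backend` and `process_image` are all hypothetical, and the stand-in backend just reports the image size rather than calling a model. The idea is that the background task only depends on a small interface, and a configuration value decides whether a local model or a third-party API sits behind it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class TranscriptionBackend(Protocol):
    """Hypothetical interface: anything that turns image bytes into text."""

    def transcribe(self, image: bytes) -> str: ...


@dataclass
class EchoBackend:
    """Stand-in backend for illustration only; it does no real OCR."""

    prefix: str = "transcribed"

    def transcribe(self, image: bytes) -> str:
        # A real backend would run a local model or call a remote API here.
        return f"{self.prefix}: {len(image)} bytes"


# Hypothetical registry mapping a configuration value to a backend factory.
BACKENDS: Dict[str, Callable[[], TranscriptionBackend]] = {
    "local": lambda: EchoBackend(prefix="local"),
    "remote": lambda: EchoBackend(prefix="remote"),
}


def get_backend(name: str) -> TranscriptionBackend:
    """Look up the backend chosen in the self-hoster's configuration."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None


def process_image(image: bytes, backend_name: str = "local") -> str:
    """What a background task might do: pick a backend, then transcribe."""
    return get_backend(backend_name).transcribe(image)
```

Swapping from a local model to a hosted one would then be a one-line configuration change rather than a change to the task code.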
I've built out a simple dashboard where you can upload image files and track analysis progress. For each image you can see its status and when it was last updated, and if the scan was successful you can see a snippet of the text.
![screenshot of penparse dashboard showing two example uploaded images and their corresponding statuses. The leftmost one was successfully scanned and a snippet of the text is shown in the card. The rightmost one failed to parse and its status is marked "error"](https://media.jamesravey.me/i/c6e6a193-7888-49be-a20d-b94a032f36da.jpg)
You can click 'View' on a card to see the full content. This view shows the full content that was extracted and also the image as it was scanned so that you can compare and contrast. You can also copy the note content to the clipboard.
![a view from a document, the scanned image is presented on the right and the extracted text on the left. You can click to copy the content to clipboard and there are other options for export or delete.](https://media.jamesravey.me/i/946daf42-1c69-4532-8793-cb02d43a840e.jpg)
### API
I'm working on adding a REST API to the app using [Django REST framework](https://www.django-rest-framework.org/tutorial/quickstart/). This will allow me to code up some plugins for common PKM apps fairly quickly. Users can currently authenticate via an API key.
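As a rough idea of how a plugin might talk to the API, here is a minimal sketch using only the Python standard library. The endpoint path `/api/documents/` and the `Authorization: Api-Key <key>` header format are assumptions made for illustration, not the actual PenParse API, and `build_request` is a hypothetical helper. The function only builds the request; it does not send anything.

```python
import urllib.request


def build_request(
    base_url: str,
    api_key: str,
    path: str = "/api/documents/",  # assumed endpoint, not confirmed
) -> urllib.request.Request:
    """Build an authenticated GET request for a PenParse instance."""
    url = base_url.rstrip("/") + path
    return urllib.request.Request(
        url,
        headers={
            # Assumed header scheme; the real API may expect something else.
            "Authorization": f"Api-Key {api_key}",
            "Accept": "application/json",
        },
        method="GET",
    )
```

A plugin would pass the resulting object to `urllib.request.urlopen` and decode the JSON response.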
## Plans and Next Steps
My next steps are likely to be writing some rudimentary plugins for Obsidian and possibly Joplin to prove the end-to-end concept and I'd also like to re-implement the Telegram bot functionality I built out in my proof-of-concept.
The flow would be:
1. User writes their note in their notebook
2. User opens up their smartphone and opens Telegram (or other chat app) or simply navigates to their PenParse instance from their browser
3. User snaps a photo of their note and uploads it to PenParse
4. Photo is processed in the background and information is extracted
5. Note + original photo are synced to the PKM app
6. User opens PKM app on their smartphone or on their laptop and the note text is visible.
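Steps 4 and 5 of that flow can be sketched as below. This is an illustration under stated assumptions rather than the real implementation: `Status`, `Document`, `process`, `to_note` and `sync` are hypothetical names, the status values are guesses based on the dashboard, and the `transcribe` callable stands in for whichever model does the actual work.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    """Hypothetical document states, loosely based on the dashboard."""

    PENDING = "pending"
    DONE = "done"
    ERROR = "error"


@dataclass
class Document:
    image_name: str
    status: Status = Status.PENDING
    text: str = ""


def process(doc: Document, transcribe: Callable[[str], str]) -> Document:
    """Step 4: run transcription and record success or failure."""
    try:
        doc.text = transcribe(doc.image_name)
        doc.status = Status.DONE
    except Exception:
        # Any failure marks the document as errored instead of crashing.
        doc.status = Status.ERROR
    return doc


def to_note(doc: Document) -> str:
    """Step 5: build a markdown note that keeps the photo next to the text."""
    return f"{doc.text}\n\n![original photo]({doc.image_name})\n"


def sync(docs: List[Document], transcribe: Callable[[str], str]) -> List[str]:
    """Process a batch and return notes for successfully scanned documents."""
    processed = [process(d, transcribe) for d in docs]
    return [to_note(d) for d in processed if d.status is Status.DONE]
```

Keeping the image link in the generated note is what lets you check the transcription against the original handwriting later.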
I am building the project in [my private Forgejo instance](https://git.jamesravey.me/ravenscroftj/PenParse). I'm interested in feedback and comments, and if anyone is keen on getting involved, let me know. PRs welcome!