diff --git a/brainsteam/content/posts/2025/02/personal-archive-hoarder.md b/brainsteam/content/posts/2025/02/personal-archive-hoarder.md
new file mode 100644
index 0000000..cf81460
--- /dev/null
+++ b/brainsteam/content/posts/2025/02/personal-archive-hoarder.md
@@ -0,0 +1,101 @@
+---
+title: "Building a Personal Archive With Hoarder"
+date: 2025-02-15T14:02:40Z
+draft: true
+description: How to self-host a personal archive of web content even if stuff gets taken down
+url: /2025/2/15/personal-archive-hoarder
+type: posts
+mp-syndicate-to:
+- https://brid.gy/publish/mastodon
+- https://brid.gy/publish/twitter
+tags:
+  - technology
+  - internet
+  - society
+---
+
+
+In this day and age, what with *gestures at everything*, it's important to preserve and record information that may be removed from the internet, lost or forgotten. I've recently been using [Hoarder](https://hoarder.app/) to create a self-hosted personal archive of web content that I've found interesting or useful. Hoarder is an open source project that runs on your own server and allows you to search, filter and tag web content. Crucially, it also takes a full copy of web content and stores it locally so that you can access it even if the original site goes down.
+
+## A Brief Review of Hoarder
+
+Hoarder runs a headless version of Chrome (i.e. it doesn't actually open any windows on your server, it just simulates them) and uses this to download content from sites. For paywalled content (maybe you want to save a copy of a newspaper article for later reference), it can work with [SingleFile](https://github.com/gildas-lormeau/SingleFile), a browser plugin for Chrome and Firefox (including mobile) that sends a full copy of whatever you are currently looking at in your browser to Hoarder. That means that even if you are looking at something you had to log in to get to, you can save it to Hoarder without having to share any credentials with the app.
+
+Hoarder optionally includes some AI features which you can enable or disable depending on your disposition. These features allow Hoarder to automatically generate tags for the content you save and, optionally, a summary of any articles you save too. By default, Hoarder works with OpenAI APIs, and the developers recommend using `gpt-4o-mini`. However, I've found that Hoarder plays nicely with my [LiteLLM and OpenWebUI setup](https://brainsteam.co.uk/2024/07/08/ditch-that-chatgpt-subscription-moving-to-pay-as-you-go-ai-usage-with-open-web-ui/), meaning that I can generate summaries and tags for bookmarks on my own server using small language models, minimal electricity, no water and without Sam Altman knowing what I've bookmarked.
+
+The web app is pretty good. It provides full-text search over the pages you have bookmarked and filtering by tag. It also allows you to create lists or 'feeds' based on sets of tags you are interested in. Once you click into an article, you can see the cached content and optionally generate a summary of the page. You can manually add tags, and you can also highlight and annotate the page inside Hoarder.
+
+Hoarder also has [an Android app](https://play.google.com/store/apps/details?id=app.hoarder.hoardermobile&hl=en_GB) which allows you to access your bookmarked content from your phone. The app is still a bit bare-bones and does not appear to let you see the cached/saved content yet, but I imagine it will get better with time.
+
+Hoarder is a fast-evolving project that only turns one year old in the next couple of weeks. It has a single lead maintainer who is doing a pretty stellar job given that it's his side-gig.
+
+## Setting Up Hoarder
+
+I primarily use docker and docker-compose for my self-hosted apps. I followed [the developer-provided instructions](https://docs.hoarder.app/Installation/docker) to get Hoarder up and running (there's a sketch of my compose file in the LiteLLM section below). Then, in the `.env` file, we provide some slightly different values for the OpenAI API base URL, the API key and the inference models we want to use.
+
+By default, Hoarder will pull down the page, attempt to extract and simplify the content and then throw away the original. If you want Hoarder to keep a full copy of the original content with all the bells and whistles, set `CRAWLER_FULL_PAGE_ARCHIVE` to `true` in your `.env` file. This will take up more disk space but means that you will have more authentic copies of the original data.
+
+You'll probably want to set up an HTTP reverse proxy to forward requests for Hoarder to the right container. I use Caddy because it is super easy and has built-in Let's Encrypt support:
+
+```Caddyfile
+hoarder.yourdomain.example {
+    reverse_proxy localhost:3011
+}
+```
+
+Once that's all set up, you can log in for the first time. Navigate to user settings and go to API Keys; you'll need to generate a key for the browser integrations below.
+
+## Configuring SingleFile and the Hoarder Browser Extension
+
+I have both [SingleFile](https://github.com/gildas-lormeau/SingleFile) and the [official Hoarder extension](https://addons.mozilla.org/en-GB/firefox/addon/hoarder-app/?utm_source=addons.mozilla.org&utm_medium=referral&utm_content=search) installed in Firefox. Both extensions have their place in my workflow, but your mileage may vary. By default, I'll click through into the Hoarder extension, which has tighter integration with the server and knows whether I've already bookmarked a page. If I'm logged into a paywalled page, or I had to click through a load of cookie banners and close a load of ads, I'll use SingleFile instead.
+
+For the Hoarder extension, click on the extension icon, then simply enter the base URL of your new instance and paste in your API key when prompted. The next time you click the button, it will try to hoard whatever you have open in that tab.
+
+For SingleFile, you can follow the guidance [here](https://docs.hoarder.app/next/Guides/singlefile/). Essentially, you'll want to right-click on the extension icon, go to 'Manage Extension' and open Preferences. Expand Destination and enter the API URL (`https://YOUR_SERVER_ADDRESS/api/v1/bookmarks/singlefile`) and your API key (the one you generated above and used for the Hoarder extension), then set `data field name` to `file` and `URL field name` to `url`.
+
+![A screenshot of the preferences page for the SingleFile extension showing the fields described in the guide linked above](https://media.jamesravey.me/i/25368228-8d6c-48c2-aa05-1f0c2aa6a255.jpg)
+
+Once you've done this, the next time you click the SingleFile extension icon, it should work through several steps to save the current page, including any supporting images, to Hoarder.
+
+## Adding Self-Hosted AI Tags and Summaries with LiteLLM
+
+I already have a LiteLLM instance configured; you can refer to [my earlier post](https://brainsteam.co.uk/2024/07/08/ditch-that-chatgpt-subscription-moving-to-pay-as-you-go-ai-usage-with-open-web-ui/) for hints and tips on getting this working.
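+
+For context (and as promised in the setup section above), my Hoarder deployment is essentially the stock compose file from the developer-provided instructions. Here's a minimal sketch from memory — service names, image tags and variables may have drifted, so treat the official docs as canonical:
+
+```yaml
+services:
+  web:
+    image: ghcr.io/hoarder-app/hoarder:release
+    restart: unless-stopped
+    volumes:
+      - data:/data
+    ports:
+      - 3011:3000 # host port 3011 matches the Caddy config above
+    env_file:
+      - .env # the settings discussed in this post live here
+    environment:
+      MEILI_ADDR: http://meilisearch:7700
+      BROWSER_WEB_URL: http://chrome:9222 # the headless Chrome crawler
+      DATA_DIR: /data
+  chrome:
+    image: gcr.io/zenika-hub/alpine-chrome:123
+    restart: unless-stopped
+    command:
+      - --no-sandbox
+      - --disable-gpu
+      - --remote-debugging-address=0.0.0.0
+      - --remote-debugging-port=9222
+  meilisearch: # provides the full-text search mentioned earlier
+    image: getmeili/meilisearch:v1.11.1
+    restart: unless-stopped
+    environment:
+      MEILI_NO_ANALYTICS: "true"
+    volumes:
+      - meilisearch:/meilisearch
+
+volumes:
+  data:
+  meilisearch:
+```
+
+The `web` service reads the same `.env` file we've been editing (alongside the secrets the docs ask for), so that's also where I point Hoarder's OpenAI-compatible settings at LiteLLM: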
+
+```shell
+OPENAI_BASE_URL=https://litellm.yourdomain.example
+OPENAI_API_KEY=
+INFERENCE_TEXT_MODEL="qwen2.5:14b"
+INFERENCE_IMAGE_MODEL=gpt-4o-mini
+```
+
+I also found a quirk of LiteLLM which means that you have to use `ollama_chat` as the model prefix in your config, rather than `ollama`, to enable error-free JSON output and model 'tool usage'. Here's an excerpt from my LiteLLM config YAML:
+
+```yaml
+model_list:
+  - model_name: gpt-4o-mini
+    litellm_params:
+      model: openai/gpt-4o-mini
+      api_key: "os.environ/OPENAI_API_KEY"
+  - model_name: qwen2.5:14b
+    litellm_params:
+      drop_params: true
+      model: ollama_chat/qwen2.5:14b
+      api_base: http://ollama:11434
+```
+
+I don't have any local multi-modal models that both a) work with LiteLLM and b) actually do a good job of answering prompts, so I still rely on `gpt-4o-mini` for vision-based tasks within Hoarder.
+
+## Migrating from Linkding
+
+I had been using Linkding for bookmarking and personal archiving until very recently, but I wanted to try Hoarder because I'm easily distracted by shiny things. There is no official path for migrating from Linkding to Hoarder as far as I can tell, but I was able to use Linkding's RSS feed feature for this purpose.
+
+First, I logged into my Linkding instance, navigated to Settings > Integrations and grabbed the RSS feed link for 'All bookmarks'.
+
+![a screenshot of the linkding integration settings page with a big red arrow pointing to the All Bookmarks link](https://media.jamesravey.me/i/32f2ee31-75f0-428e-9099-ee7254ca5c9a.jpg)
+
+Then, I opened up Hoarder's User Settings > RSS Subscriptions and added my feed as a subscription there. I clicked "Fetch Now" to trigger an initial import.
+
+![A screenshot of the hoarder RSS Subscriptions page showing my linkding instance link](https://media.jamesravey.me/i/e62fb5e6-85af-4f25-8be6-88e5bbe74a8c.jpg)
+
+## Conclusion
+
+Hoarder is a pretty cool tool, and it's been easy to get up and running. It's evolving quickly despite the small team behind it, and it already provides an impressive, easy user experience. To become even more useful for me personally, I'd love to see better annotation support, both in the desktop web experience and in the mobile app. I'd also love the mobile app to gain features for reading articles in-app rather than opening things in the browser. Finally, it would be great if we could export cached content as an ebook so that I can read bookmarked content on my Kindle or my Kobo.
\ No newline at end of file