<p>In the world of AI assistants, subscription services like <a href="https://openai.com/index/chatgpt-plus/">ChatGPT Plus</a>, <a href="https://www.anthropic.com/news/claude-pro">Claude Pro</a> and <a href="https://one.google.com/about/ai-premium/">Google One AI</a> have become increasingly popular amongst knowledge workers. However, these subscriptions may not be the most cost-effective or flexible option for everyone, and the not-insignificant fees encourage users to stick with the one model they've already paid for rather than trying out different options.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I previously wrote about <a href="https://brainsteam.co.uk/2024/04/20/self-hosting-llama-3-on-a-home-server/">how you can self-host Llama 3</a> on a machine with an older graphics card using open source tools. In this post, I demonstrate how to expand that setup so that you can interact with OpenAI, Anthropic, Google and others alongside your local models via a single, self-hosted UI.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Why ditch my subscription?</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Hosting your own AI user interface allows for more granular cost control, potentially saving money for lighter users of commercial models (are you really getting $20/mo of usage out of ChatGPT?). It grants you greater flexibility in choosing and comparing different AI models without worrying about which subscription to spend your $20 on this month or forking out for more than one at a time. Additionally, commercial AI providers usually provide <a href="https://www.reddit.com/r/ChatGPT/comments/1d80onp/pro_tip_the_chatgpt_api_is_still_up_and_working/">a more stable experience via their APIs</a> since business users, the target audience of the API offerings, tend to have more leverage than consumers when it comes to service-level agreements. API terms and conditions around data usage are normally better too (within ChatGPT, data collection is opt-out, and opting out loses you functionality).</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Self-hosting your chat interface with Open Web UI brings enhanced privacy through a local UI, the capability to run prompts through multiple models simultaneously, and the freedom to use different models for different tasks without being locked into a single provider. You can build your own bespoke catalogue of commercial and local models (<a href="https://brainsteam.co.uk/2024/04/20/self-hosting-llama-3-on-a-home-server/">like Llama 3</a>) and access them and their outputs all from one place. If you're already familiar with Docker and docker-compose you can have something up and running in 15 minutes.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">When might this not work for me?</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>If you are a particularly heavy user of the service that you're subscribed to and you're not keen on trying other models, you may find that moving to pay-as-you-go works out more expensive. The <a href="https://www.penguin.co.uk/articles/2020/09/book-length-debate-fiction-long-novels">average length of a fiction book</a> is something like 60-90k words. It costs about $5 to have GPT-4o read roughly 16 books' worth of text (input/prompt tokens) and $15 to have it write the same amount of output. If you're spending all day every day chatting to Jippy then you might find that you end up spending more than $20/mo on API usage.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>If cost is your primary driver, you should factor in the cost of hosting your server too. If you are already running a homelab or a cheap VPS you might be able to run the web UI at "no extra charge" but if you need to spin up a new server just for hosting your web UI that's going to cut into your subscription savings.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>If privacy is your primary driver, you should be aware that this approach still involves sending data to third parties. If you want to avoid that altogether you'll need to go fully self-hosted.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Finally, fair warning: this process is technical, and you'll need to be familiar with (or willing to learn about) Docker, YAML config files and APIs.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">The Setup</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>This setup builds on my previous post about Llama 3. Previously, we used Open Web UI to provide a web app that allows you to talk to the LLM, and we used Ollama to host Llama 3 locally and provide a system that Open Web UI could send requests to.</p>
<!-- wp:paragraph -->
<p>The expanded architecture looks like this: Caddy receives web traffic from the open internet and routes <code>chat.example.com</code> to Open Web UI and <code>api.example.com</code> to LiteLLM. Open Web UI sends self-hosted model traffic (Llama 3, Gemma 2) to Ollama and commercial model traffic (GPT-4, Claude) to LiteLLM, which calls out to the third-party APIs and logs usage and API credentials to PostgreSQL. Everything except the third-party APIs sits inside the local network boundary.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This setup has five main components:</p>
<!-- /wp:paragraph -->
<!-- wp:list {"ordered":true} -->
<ol><!-- wp:list-item -->
<li>Ollama - a system for running local language models on your computer, using your GPU or Apple silicon to provide reasonable response speeds</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Open Web UI - an open source frontend for chatting with LLMs that supports Ollama as well as any OpenAI-compatible API endpoint.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>LiteLLM Proxy - allows us to input our API keys and provides a single service that Open Web UI can call to access a whole bunch of commercial AI models.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>PostgreSQL - a database server which LiteLLM will use to store data about API usage and costs.</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Caddy - used for reverse proxying HTTP traffic from the open web and routing it to Open Web UI or LiteLLM.</li>
<!-- /wp:list-item --></ol>
<!-- /wp:list -->
<!-- wp:paragraph -->
<p>The resulting docker-compose file will look something like this:</p>
<!-- wp:enlighter/codeblock -->
<pre class="EnlighterJSRAW" data-enlighter-language="yaml">  # everything below this line is optional and can be commented out
  ollama:
    image: ollama/ollama
    restart: always
    ports:
      - 11434:11434
    volumes:
      - ./ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>Note that Ollama is optional in this setup and we can deploy without it. That might be helpful if you want to take advantage of being able to switch between different commercial APIs but don't want to run local models (or perhaps don't have the hardware for it). If you want to turn off Ollama, comment it out or remove it from the docker-compose file. You'll also need to remove the corresponding Ollama connection from Open Web UI's settings later on.</p>
<!-- /wp:paragraph -->
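<!-- wp:paragraph -->
<p>For reference, the rest of the stack (the non-optional part of the compose file) might look something like the sketch below. The image tags, host ports and volume paths here are illustrative assumptions and may need adjusting for your environment:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock -->
<pre class="EnlighterJSRAW" data-enlighter-language="yaml">services:
  caddy:
    image: caddy:2
    restart: always
    ports:
      - 80:80
      - 443:443
    volumes:
      - ./caddy/Caddyfile:/etc/caddy/Caddyfile
      - ./caddy/data:/data

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: always
    ports:
      - 3000:8080
    volumes:
      - ./open-webui:/app/backend/data

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    restart: always
    ports:
      - 8080:4000
    env_file:
      - .env.docker
    environment:
      DATABASE_URL: "postgresql://litellm:litellm@postgres:5432/litellm"
    volumes:
      - ./litellm/config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]
    depends_on:
      - postgres

  postgres:
    image: postgres:16
    restart: always
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
      POSTGRES_DB: litellm
    volumes:
      - ./postgres:/var/lib/postgresql/data
</pre>
<!-- /wp:enlighter/codeblock -->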
<!-- wp:paragraph -->
<p>We also need some support files. </p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Caddyfile</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>The Caddyfile is used to route incoming web traffic to your services. Edit the file <code>caddy/Caddyfile</code>. Assuming you have set up DNS as required, you can do something like this:</p>
<!-- /wp:paragraph -->
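<!-- wp:paragraph -->
<p>A minimal example, assuming the container names from the compose file and the default internal ports for Open Web UI (8080) and LiteLLM (4000):</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock -->
<pre class="EnlighterJSRAW" data-enlighter-language="generic">chat.example.com {
    reverse_proxy open-webui:8080
}

api.example.com {
    reverse_proxy litellm:4000
}
</pre>
<!-- /wp:enlighter/codeblock -->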
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Secrets in <code>.env.docker</code></h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>We need to create a docker env file which contains our API keys for the services we want to use. </p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Here we also define a "master" API key which we can use to authenticate Open Web UI against LiteLLM and to log in to LiteLLM to see the API call stats.</p>
<!-- /wp:paragraph -->
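<!-- wp:paragraph -->
<p>Something like the following, with your own keys substituted in. Which provider variables you need depends on which models you enable in the LiteLLM config below; the names just have to match the <code>os.environ/...</code> references there:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock -->
<pre class="EnlighterJSRAW" data-enlighter-language="generic"># authenticates Open Web UI against LiteLLM and logs you into the LiteLLM UI
# LiteLLM expects this to start with "sk-"
LITELLM_MASTER_KEY=sk-some-long-random-string
# provider keys - only needed for the providers you actually use
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
</pre>
<!-- /wp:enlighter/codeblock -->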
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">LiteLLM <code>config.yaml</code></h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>We need to create a very basic <code>config.yaml</code> file which LiteLLM will read to tell it which external models you want to allow users of your web UI to access.</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock -->
<pre class="EnlighterJSRAW" data-enlighter-language="yaml">model_list:
  - model_name: claude-3-opus ### RECEIVED MODEL NAME ###
    litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
      model: claude-3-opus-20240229 ### MODEL NAME sent to `litellm.completion()` ###
      api_key: "os.environ/ANTHROPIC_API_KEY" # reads os.environ["ANTHROPIC_API_KEY"]
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: groq-llama-70b
    litellm_params:
      api_base: https://api.groq.com/openai/v1
      api_key: "os.environ/GROQ_API_KEY"
      model: openai/llama3-70b-8192</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>In the example above we tell LiteLLM that we want to connect to Anthropic, OpenAI and Groq and we will allow access to Claude Opus, GPT-4o and Llama 3 70B respectively. The <code>api_key</code> directives tell LiteLLM to grab the values from the named environment variable (<code>os.environ/ENV_VAR_NAME</code>) so we can name our env vars however makes sense for us.</p>
<!-- /wp:paragraph -->
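<!-- wp:paragraph -->
<p>To make that lookup concrete, here is a simplified sketch of the substitution LiteLLM performs when it reads the config (this illustrates the behaviour; it is not LiteLLM's actual implementation):</p>
<!-- /wp:paragraph -->

```python
import os

def resolve_secret(value: str) -> str:
    """Resolve a LiteLLM-style "os.environ/VAR" reference to the value of
    that environment variable; anything else is treated as a literal key
    and passed through unchanged."""
    prefix = "os.environ/"
    if value.startswith(prefix):
        return os.environ[value[len(prefix):]]
    return value

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-example"
print(resolve_secret("os.environ/ANTHROPIC_API_KEY"))  # prints sk-ant-example
print(resolve_secret("sk-literal-key"))                # prints sk-literal-key
```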
<!-- wp:paragraph -->
<p>LiteLLM uses the prefixes on model names to know which API it needs to use, e.g. it sees <code>claude</code> and knows to use the Anthropic API client. We can also use the <code>openai/</code> prefix to instruct LiteLLM to talk to any model served behind an OpenAI-compatible endpoint, as we do for Groq above.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">First Run Setup</h2>
<!-- /wp:heading -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Start Docker</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>OK, now that we've created all the config files we can start our docker-compose stack. Run <code>docker-compose up -d</code> to bring all the services online.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Log in to LiteLLM</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Let's start by testing that our LiteLLM setup works. You should be able to navigate to LiteLLM via the Caddy subdomain (e.g. <a href="https://api.example.com">https://api.example.com</a>) or via <a href="http://localhost:8080/ui">http://localhost:8080/ui</a> to get to the LiteLLM UI. If you're not running it on your current machine you can also use your LAN IP address instead. You'll need to enter <code>admin</code> as the username and whatever value you used for <code>LITELLM_MASTER_KEY</code> above as the password (including the <code>sk-</code> prefix). If all goes well you should see a list of API keys which initially only contains your master key:</p>
<figure class="wp-block-image size-large"><img src="/media/image-1024x523_d60ee748.png" alt="Screenshot of the LiteLLM dashboard showing the API Keys view for the Default Team: a single partially-blurred secret key with no alias, $0.0000 spend, an unlimited budget, access to all-proxy-models and unlimited TPM/RPM limits, with icons for editing, copying and deleting the key. The sidebar lists API Keys, Test Key, Models, Usage, Teams, Internal Users, Logging &amp; Alerts, Caching, Budgets, Router Settings, Admin, API Reference and Model Hub." class="wp-image-3169"/></figure>
<!-- wp:paragraph -->
<p>If you navigate to the "Models" tab you should also see the models that you enabled in your config.yaml and, if all has gone well, pricing information should have been pulled through from the API.</p>
<!-- /wp:paragraph -->
<!-- wp:image -->
<figure class="wp-block-image size-large"><img src="/media/image-2-1024x488_1ec17dd2.png" alt="Screenshot of the LiteLLM dashboard's Models section listing the configured models with their input/output prices per million tokens: claude-3-5-sonnet-20240620, claude-3-opus ($15.00 in / $75.00 out), gpt-4 ($30.00 in / $60.00 out), gpt-4o ($5.00 in / $15.00 out) and groq-llama-70b (API base https://api.groq.com). Each row shows a Config Model status and icons for editing, copying and deleting. The sidebar lists API Keys, Test Key, Models, Usage, Teams, Internal Users, Logging &amp; Alerts, Caching, Budgets, Router Settings, Admin, API Reference and Model Hub." class="wp-image-3170"/></figure>
<!-- /wp:image -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Create Open Web UI Account</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Next we need to create our Open Web UI account. Go to your chat subdomain <a href="https://chat.example.com/">https://chat.example.com/</a> or <a href="http://localhost:3000">http://localhost:3000</a> and follow the registration wizard. Open Web UI will <a href="https://docs.openwebui.com/getting-started/#how-to-install-">automatically treat the first ever user to sign up as the administrator</a>.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Connect Open Web UI to LiteLLM</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Once we're signed in we need to connect Open Web UI to LiteLLM so that it can use the models. Click on your profile picture > Admin Panel > Settings and open the Connections tab:</p>
<!-- wp:image -->
<figure class="wp-block-image size-large"><img src="/media/image-3-1024x524_141fb2cb.png" alt="Screenshot of the Open Web UI Admin Panel with the Settings tab selected and the Connections page open. The OpenAI API connection is set to http://litellm:4000/v1 with a masked API key field, and the Ollama API connection is set to http://ollama:11434; both toggles are turned on." class="wp-image-3171"/></figure>
<!-- /wp:image -->
<!-- wp:paragraph -->
<p>Since everything is running inside docker-compose, we can address LiteLLM (and Ollama if you enabled it) using their container names. In the OpenAI API box enter the path to your LiteLLM API endpoint - it should be <code>http://litellm:4000/v1</code> - and in the API key box enter your master key from <code>LITELLM_MASTER_KEY</code>. Click the button to test the connection and hopefully you'll get a green toast notification to say that it worked.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Ollama should already be hooked up if you turned it on. If you didn't enable Ollama but want to now, make sure the container is started (<code>docker-compose up -d ollama</code>) and then enter <code>http://ollama:11434</code> in the URL field.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p><strong>Question: If Open Web UI lets us connect multiple OpenAI endpoints, why do we need LiteLLM?</strong></p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Open Web UI won't talk directly to models that don't use OpenAI compatible endpoints (e.g. Anthropic Claude or Google Gemini). LiteLLM also lets you be specific about which models you want to pass through to OWUI. For example, you can hide the more expensive ones so that you don't burn credits too quickly. Finally, LiteLLM gives you centralised usage/cost analytics which saves you opening all of your providers' API consoles and manually tallying up your totals.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Testing it Out</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Congratulations if you got this far - you now have a working self-hosted chat proxy system. Time to try out some prompts: use the drop-down menu at the top of the window to select the model you want to use.</p>
<figure class="wp-block-image size-large"><img src="/media/image-1-1024x481_5e1f8014.png" alt="Screenshot of a coding chat where the user requests a Python script with CLI options using the click module. The response provides installation instructions (pip install click) and a code snippet for processing an input file and saving the result to an output file. The sidebar shows recent conversations such as 'Firms Embrace AI' and 'Recipe Ingredients Transcription'." class="wp-image-3155"/></figure>
<figure class="wp-block-image size-large"><img src="/media/image-4-1024x184_f4d02995.png" alt="Screenshot of the model drop-down with three options: gemma2:latest, gpt-4 and claude-3-5-sonnet-20240620. A user query asks 'how many books is 1 million tokens?'" class="wp-image-3172"/></figure>
<p>You can even use the multi-modal features of the model from this view by uploading images and documents. Open Web UI will automatically pass them through as required.</p>
<figure class="wp-block-image size-large"><img src="/media/image-5-1024x325_aa729448.png" alt="Screenshot of a chat where the user uploads a photo of a recipe and asks for the ingredients to be transcribed into a Markdown list. gpt-4o responds with a Markdown-formatted list of the recipe ingredients." class="wp-image-3173"/></figure>
<figure class="wp-block-image size-large"><img src="/media/litellm_usage-1024x482_cff1ce2f.png" alt="A bar chart breaking down spend by model in LiteLLM's web UI" class="wp-image-3153"/></figure>
<p>That's it - you now have a fully self-hosted AI chat interface. You can load up your API keys with a few dollars each and track how much you're spending from LiteLLM's control panel.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>If you enabled ollama, you can stick to cheap/self-hosted models by default and switch to a more powerful commercial model for specific use cases.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>If you found this tutorial useful, please consider subscribing to my RSS feed, my <a href="https://jamesravey.medium.com/">newsletter on Medium</a> or <a href="https://fosstodon.org/@jamesravey">following me on Mastodon</a> or the fediverse:</p>