<h4 class="wp-block-heading">Self-hosting Llama 3 as your own ChatGPT replacement service using a 10 year old graphics card and open source components.</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Last week Meta<a href="https://ai.meta.com/blog/meta-llama-3/"> launched Llama 3</a>, the latest in their open source LLM series. Llama 3 is particularly interesting because the 8 billion parameter model, which is small enough to run on a laptop, <a href="https://news.ycombinator.com/item?id=40084699">performs as well as models 10x bigger than it</a>. The responses it provides are as good as GPT-4 for many use cases.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I finally decided that this was motivation enough to dig out my old Nvidia Titan X card from the loft and slot it into my home server so that I could stand up a ChatGPT clone on my home network. In this post I explain some of the pros and cons of self-hosting Llama 3 and provide configuration and resources to help you do it too.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">How it works</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>The model is served by <a href="https://ollama.com/">Ollama</a>, a GPU-enabled open source server for running LLMs. Ollama makes heavy use of llama.cpp, <a href="https://brainsteam.co.uk/2023/09/30/turbopilot-obit/">the same tech that I used to build turbopilot</a> around 1 year ago. The frontend is powered by <a href="https://github.com/open-webui/open-webui">OpenWebUI</a>, which provides a ChatGPT-like user experience for interacting with Ollama models.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I use <a href="https://docs.docker.com/compose/">docker compose</a> to run the two services and wire them together, and I've got a Caddy web server set up to let in traffic from the outside world.</p>
<figure class="wp-block-image size-large"><img src="/media/Ollama-1024x350_319adaca.png" alt="Drawing of the setup as described above. Caddy brokers comms with the outside world over https and feeds messages to OpenWebUI" class="wp-image-2430"/></figure>
<p>My setup is running on a cheap and cheerful AMD CPU and motherboard package and a 10 year old Nvidia Titan X card (much better GPUs are available on eBay for around £150; the RTX 3060 with 12GB VRAM would be a great choice). My server has 32GB RAM but this software combo uses a lot less than that. You could probably get away with 16GB and run it smoothly, or possibly even 8GB at a push.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>You could buy <a href="https://www.scan.co.uk/products/3xs-a520-home-bundle-amd-ryzen-5-5500-asus-tuf-a520-gaming-plus-wifi-16gb-ddr4-amd-wraith-stealth">this bundle</a> and a used RTX 3060 on eBay or a brand new one for around £250 and have a functional ChatGPT replacement in your house for less than £500.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Pros and Cons of Llama 3</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Llama 3 8B truly is a huge step forward for open source alternatives to relying on APIs from OpenAI, Anthropic and their peers. I am still in the early stages of working with my self-hosted Llama 3 instance but so far I'm finding that it is just as capable as GPT-4 in many arenas.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Pro: Price</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Self-hosting Llama 3 with Ollama and OpenWebUI is free-ish, aside from any initial investment you need to make in hardware and the ongoing electricity consumption. ChatGPT Plus is currently $20/month and techies are likely burning a similar amount in API calls too. I already had all the components for this build lying around the house, but if I had bought them second-hand it would take around a year for them to pay for themselves. That said, I could massively increase my consumption through my self-hosted models since it's effectively "free".</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Pro: Privacy</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>A huge advantage of this approach is that you're not sending your data to an external company to be mined. The consumer version of ChatGPT that most people use is heavily data mined to improve OpenAI's models and anything that you type in may end up in their corpus. Ollama runs entirely on your machine and never sends data back to any third party company.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Pro: Energy Consumption and Carbon Footprint</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Another advantage is that since Llama 3 8B is small and runs on a single GPU, it uses a lot less energy than <a href="https://www.reddit.com/r/aipromptprogramming/comments/1212kmm/according_to_chatgpt_a_single_gpt_query_consumes/">an average query to ChatGPT</a>. My Titan X card consumes about 250 watts at max load but RTX 3060 cards only require 170 watts. Again, I had all the components lying around so I didn't buy anything new to build this server, and it means I won't be throwing away components that would otherwise become e-waste.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Con: Speed on old hardware</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Self-hosting Llama 3 8B on a Titan X is a little slower than ChatGPT but is still perfectly serviceable. It would almost certainly be faster on RTX 30 and 40 series cards.</p>
<p>The biggest missing feature for me is currently multi-modal support. I use GPT-4 to do handwriting recognition and transcription for me and current gen open source models aren't quite up to this yet. However, given the superb quality of Llama 3, I have no doubt that a similarly brilliant open multi-modal model is just around the corner.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Con: Training Transparency</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Although Llama 3's weights are free to download, the content of the training corpus is unknown. The model was built by Meta and thus is likely to have been trained on a large amount of user generated content and copyrighted content. Hosted third party models like ChatGPT are likely to be equally problematic in this regard.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Setting up Llama 3 with Ollama and OpenWebUI</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Once you have the hardware assembled and the operating system installed, the fiddliest part is configuring Docker and Nvidia correctly. </p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Ubuntu</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>If you're on Ubuntu, you'll need to install docker first. I recommend using <a href="https://docs.docker.com/engine/install/ubuntu/">the guide from Docker themselves</a> which installs the latest and greatest packages. Then follow<a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt"> this guide</a> to install the nvidia runtime. Finally, verify that it's all set up using the checking step below.</p>
<!-- /wp:paragraph -->
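<!-- wp:paragraph -->
<p>As a rough sketch of what those guides boil down to (the exact package names and repository setup steps can change over time, so do check the linked documentation), the nvidia runtime install looks something like this:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code># install the NVIDIA container toolkit (after adding NVIDIA's apt repository as per their guide)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# tell docker about the nvidia runtime and restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker</code></pre>
<!-- /wp:code -->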
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Unraid</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I actually run Unraid on my home server rather than Ubuntu. To get things running there, simply install the <a href="https://forums.unraid.net/topic/98978-plugin-nvidia-driver/">unraid nvidia plugin</a> through the community apps page and make sure to stop and start docker before trying out the step below.</p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Checking the Docker and Nvidia Setup (All OSes)</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>To make sure that Docker and Nvidia are installed properly and able to talk to each other you can run:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code>docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi</code></pre>
<!-- /wp:code -->
<!-- wp:paragraph -->
<p>This runs the nvidia-smi status utility, which should show what your GPU is currently doing. Crucially, it's doing so from inside docker, which means that nvidia's container runtime is set up to pass the nvidia drivers through to whatever you're running inside your container. You should see something like this:</p>
<figure class="wp-block-image size-full"><img src="/media/image-2_c9a9cd31.png" alt="A screenshot of nvidia-smi output which shows the GPU name, how much power it is drawing, how much VRAM is in use and any processes using the card." class="wp-image-2431"/></figure>
<p>Create a new directory and a new empty text file called <code>docker-compose.yml</code>. In that file paste the following:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code>version: "3.0"
services:
  ui:
    image: ghcr.io/open-webui/open-webui:main
    restart: always
    ports:
      - 3011:8080
    volumes:
      - ./open-webui:/app/backend/data
    environment:
      # - "ENABLE_SIGNUP=false"
      - "OLLAMA_BASE_URL=http://ollama:11434"
  ollama:
    image: ollama/ollama
    restart: always
    ports:
      - 11434:11434
    volumes:
      - ./ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
</code></pre>
<!-- /wp:code -->
<!-- wp:paragraph -->
<p>We define the two services and provide both with volume mounts so that they can persist data to disk (such as the models you download and your chat history).</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>For now we leave ENABLE_SIGNUP commented out so that you can create an account in the web UI; later we can come back and turn it off so that internet denizens can't sign up to use your chat.</p>
<!-- /wp:paragraph -->
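<!-- wp:paragraph -->
<p>Once you've created your own account, coming back to lock it down is just a case of uncommenting that line in <code>docker-compose.yml</code> and recreating the UI container. Something along these lines should do it:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code># after editing docker-compose.yml to uncomment - "ENABLE_SIGNUP=false"
docker-compose up -d ui   # recreates the ui container with the new environment</code></pre>
<!-- /wp:code -->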
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Turn on Ollama</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>First we will turn on ollama and test it. Start by running <code>docker-compose up -d ollama</code>. (Depending on which version of docker you are running you might need to run <code>docker compose</code> rather than <code>docker-compose</code>). This will start just the ollama model server. We can interact with the model server by running an interactive chat session and downloading the model:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code>docker-compose exec ollama ollama run llama3:8b</code></pre>
<!-- /wp:code -->
<!-- wp:paragraph -->
<p>In this command the first <code>ollama</code> refers to the container and <code>ollama run llama3:8b</code> is the command that will be executed inside the container. If all goes well you will see the server burst into action and download the llama3 model if this is the first time you've run it. You'll then be presented with an interactive prompt where you'll be able to chat to the model.</p>
<figure class="wp-block-image size-full"><img src="/media/image-3_da9c54d3.png" alt="Screenshot showing the interactive prompt. I have entered hello and the model has responded 'Hello, it's nice to meet you. Is there something I can help you with or would you like to chat?'" class="wp-image-2432"/></figure>
<p>You can press CTRL+D to quit and move on to the next step.</p>
<!-- /wp:paragraph -->
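<!-- wp:paragraph -->
<p>Since we also exposed port 11434, you can sanity-check Ollama's HTTP API directly from the host, which is what OpenWebUI will be talking to behind the scenes. For example (assuming the llama3:8b model has been pulled as above):</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code># ask the model a question via Ollama's REST API and get a single JSON response back
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'</code></pre>
<!-- /wp:code -->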
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Turn on the Web UI</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Now we will start up the web UI. Run <code>docker-compose up -d ui</code>, then open up your browser and go to http://localhost:3011/ to see the web UI. You will need to register for an account and log in, after which you will be able to interact with the model like so:</p>
<figure class="wp-block-image size-large is-resized"><img src="/media/image-1-1024x566_16fed72a.png" alt="A screenshot of the web ui. I have asked the model what noise a fox makes." class="wp-image-2429" style="width:760px;height:auto"/></figure>
<p>If you want to be able to chat to your models from the outside world, you might want to stand up a reverse proxy to your server (a minimal Caddy example is below). If you're new to self hosting and you're not sure about how to do this, a safer option is probably to use <a href="https://tailscale.com/">Tailscale</a> to build a VPN which you can use to securely connect to your home network without risking exposing your systems to the public and/or hackers.</p>
<!-- /wp:paragraph -->
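<!-- wp:paragraph -->
<p>If you do go the reverse proxy route, Caddy makes it pretty painless since it handles HTTPS certificates for you. As a minimal sketch (assuming you own a domain, chat.example.com here is just a placeholder, that points at your server and that ports 80 and 443 are forwarded to it):</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code># proxy HTTPS traffic for your domain through to the OpenWebUI container
caddy reverse-proxy --from chat.example.com --to localhost:3011</code></pre>
<!-- /wp:code -->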
<!-- wp:heading -->
<h2 class="wp-block-heading">Conclusion</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Llama 3 is a tremendously powerful model that is useful for a whole bunch of use cases including summarisation, creative brainstorming, code copiloting and more. The quality of the responses is in line with GPT-4 and it runs on much older, smaller hardware. Self-hosting Llama 3 won't be for everyone and it's quite technically involved. However, for AI geeks like me, running my own ChatGPT clone at home for next-to-nothing was too good an experiment to miss out on.</p>