As of today, I am deprecating and archiving turbopilot, my experimental LLM runtime for code-assistant models. In this post I'm going to dive a little bit into why I built it, why I'm stopping work on it, and what you can do now.
In April I caught COVID over the Easter break and had to stay home for a bit. After the first couple of days I started to get restless: I needed a project to dive into while I was cooped up at home. It just so happened that people were starting to get excited about running large language models on their home computers after ggerganov published [llama.cpp](https://github.com/ggerganov/llama.cpp). Lots of people were experimenting with asking LLaMA to generate funny stories, but I wanted to do something more practical and useful to me.
I started to play around with a project called fauxpilot, which touted itself as an open-source alternative to GitHub Copilot that could run the Salesforce Codegen models locally on your machine. However, I found it a bit tricky to get running, and it didn't do any kind of quantization or optimization, which meant you could only run models on your GPU if you had enough VRAM and a recent enough card. At the time I had an Nvidia Titan X from 2015, and it didn't support a new enough version of CUDA to run the models I wanted to run. I also found the brilliant vscode-fauxpilot, an experimental VS Code plugin that pipes autocomplete suggestions from fauxpilot into the IDE.
This gave me an itch to scratch and a relatively narrow scope within which to build a proof of concept. Could I quantize a code-generation model and run it using ggerganov's runtime? Could I open up local code completion to people who don't have the latest and greatest Nvidia chips? I set out to find out...
What were the main challenges during the PoC?
I was able to whip up a proof of concept over the course of a couple of days, and I was pretty pleased with that. The most difficult part was finding my way around the GGML library and working out how to use it to build a computation graph for a transformer model that was originally written in PyTorch. That's absolutely not a criticism of ggerganov's work; it's more a statement about how coddled we are as developers these days, with high-level Python libraries abstracting away all the work that goes on whenever we build out these complex models. Eventually, I found a way to cheat: a script written by moyix converts the Codegen models into an architecture (GPT-J) that the ggml example code already supports. That meant I didn't need to spend several days figuring out how to code up the compute graph by hand, and it helped me get my PoC together quickly.
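To give a flavour of what "coding up the compute graph" means: in GGML even a single matrix-vector multiply is spelled out step by step. This is a minimal sketch, not code from turbopilot, and the function names follow a recent ggml API (they have shifted a little between versions):

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // Every tensor lives inside a context backed by one pre-allocated arena.
    ggml_init_params params {};
    params.mem_size = 16 * 1024 * 1024; // 16 MB scratch arena
    ggml_context * ctx = ggml_init(params);

    // y = W * x — a one-liner in PyTorch, several explicit steps here.
    ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
    ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_set_f32(W, 1.0f); // fill with dummy values
    ggml_set_f32(x, 2.0f);

    ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    // Declare the graph explicitly, then execute it.
    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);

    printf("y[0] = %f\n", ggml_get_f32_1d(y, 0)); // 8.0 for these dummy values
    ggml_free(ctx);
    return 0;
}
```

A full transformer layer chains dozens of these ops, plus manual bookkeeping for things like the KV cache, which is exactly the work that reusing an already-supported architecture let me skip.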
Once I'd figured out the model, it was just a case of quantizing it and running the example code. Then I made use of CrowCPP to provide a lightweight HTTP server over the top, reverse-engineered the fauxpilot code to figure out what the REST interface needed to look like, and started crafting.
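The shape of that server was roughly as follows. This is a minimal sketch rather than turbopilot's actual code: `run_model` is a hypothetical stand-in for the GGML inference call, and the route and JSON layout assume the OpenAI-style completions API that fauxpilot exposes and vscode-fauxpilot expects:

```cpp
#include "crow.h"
#include <string>

// Hypothetical stand-in for the actual GGML inference call.
static std::string run_model(const std::string & prompt) {
    return "// completion for: " + prompt;
}

int main() {
    crow::SimpleApp app;

    // vscode-fauxpilot speaks the OpenAI completions protocol, so the
    // server just has to answer POSTs on the engine's completions route.
    CROW_ROUTE(app, "/v1/engines/codegen/completions").methods("POST"_method)
    ([](const crow::request & req) {
        auto body = crow::json::load(req.body);
        if (!body) return crow::response(400);

        std::string prompt = body["prompt"].s();
        std::string completion = run_model(prompt);

        // Echo back the subset of the OpenAI response schema the plugin reads.
        crow::json::wvalue res;
        res["choices"][0]["text"] = completion;
        res["choices"][0]["finish_reason"] = "length";
        return crow::response(res);
    });

    app.port(18080).multithreaded().run();
}
```

Keeping the wire format compatible meant existing fauxpilot clients could point at the new server without any changes on their side.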
When I typed in my IDE and got those first code suggestions back, I got that magical, tingly feeling of making something work.
How successful was the PoC?
Once I had my PoC, I added a README, some badges, and some CI pipelines for Docker images, Mac packages and so on. Then I shared my project on Twitter, Reddit and Mastodon. I was surprised at how much attention it got: I accumulated about 2.5k stars on GitHub in the first couple of days, and then it slowed down to about 100 stars a day for the rest of the week. I think it helped a lot that Clem from Hugging Face retweeted and replied to my tweet: