<p>The relatively recently released Phi 3.5 model series includes a mixture-of-experts model featuring 16 expert models of 3.8 billion parameters each. It activates two experts at a time, giving pretty good performance with only around 6.6 billion parameters active for any given token. I recently wanted to try running Phi 3.5 MoE on my MacBook but was blocked from doing so using <a href="https://brainsteam.co.uk/2024/07/08/ditch-that-chatgpt-subscription-moving-to-pay-as-you-go-ai-usage-with-open-web-ui/">my usual method</a> while <a href="https://github.com/ggerganov/llama.cpp/issues/9119#issuecomment-2319393405">support is still being built into llama.cpp</a> and then <a href="https://github.com/ollama/ollama/issues/6449">ollama</a>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I decided to try out another library, <a href="https://github.com/EricLBuehler/mistral.rs">mistral.rs</a>, which is written in the Rust programming language and already supports these newer models. It required a little bit of fiddling around, but I did manage to get it working and the model is relatively responsive.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Getting Our Dependencies and Building Mistral.rs</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>To get started you will need to have the Rust compiler toolchain installed on your MacBook, including <code>rustc</code> and <code>cargo</code>. The easiest way to do this is via brew:</p>
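<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># one option: install the Rust toolchain (rustc and cargo) via Homebrew
# (installing via rustup works just as well if you prefer)
brew install rust</pre>
<!-- /wp:enlighter/codeblock -->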
<p>Once you have both of these in place we can build the project. Since we're running on a Mac, we want the compiler to make use of Apple's <a href="https://developer.apple.com/metal/">Metal</a> framework, which allows the model to use the GPU capabilities of the M-series chip to accelerate inference.</p>
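<p>Something along the following lines should do it, assuming you're building from a fresh clone of the repository (the <code>metal</code> feature flag is what enables GPU acceleration; check the mistral.rs README if the build flags have changed):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># fetch the source code
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

# build the server in release mode with Metal support
cargo build --release --features metal</pre>
<!-- /wp:enlighter/codeblock -->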
<p>This command may take a couple of minutes to run. The compiled server will be saved in the <code>target/release</code> folder relative to your project folder. </p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Running the Model with Quantization</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>The <a href="https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#quick-examples">default instructions in the project readme</a> (sketched below) work, but you might find that they use a lot of memory and take a really long time to run. That's because, by default, mistral.rs does not do any quantization, so running the model requires 12GB of memory.</p>
<!-- /wp:paragraph -->
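<p>For reference, the unquantized quick-start invocation looks roughly like the command below: the same one we'll use in a moment, just without a quantization flag.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># interactive mode, no in-situ quantization: loads the full-precision weights
./target/release/mistralrs-server -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3</pre>
<!-- /wp:enlighter/codeblock -->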
<!-- wp:paragraph -->
<p>mistral.rs supports in-situ quantization (ISQ), which essentially means that the framework loads the model and quantizes it at run time (as opposed to requiring you to download a GGUF file that was already quantized). I recommend running the following:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">./target/release/mistralrs-server --isq Q4_0 -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>In this mode we use ISQ to quantize the model down to 4-bit precision (<code>--isq Q4_0</code>). You should then be able to chat with the model through the terminal.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Running as a Server</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Mistral.rs provides an HTTP API that is compatible with the OpenAI API. To run in server mode we remove the <code>-i</code> argument and replace it with a port number to run on, <code>--port 1234</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">./target/release/mistralrs-server --port 1234 --isq Q4_0 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>We still want to use ISQ, but this time we swap <code>plain</code> for <code>vision-plain</code>, swap the model name for its vision equivalent, and change the architecture from <code>-a phi3</code> to <code>-a phi3v</code>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Likewise we can now interact with the model via HTTP tooling, following <a href="https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3V.md">the example from the documentation</a>.</p>
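<p>As a rough sketch, a request against the OpenAI-compatible chat completions endpoint can look something like this (the port matches the one we passed above; the model name, image URL and prompt here are placeholders, and the linked docs show the exact format expected by the vision model):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># send a vision chat request to the local mistral.rs server
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3v",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}},
          {"type": "text", "text": "What is shown in this image?"}
        ]
      }
    ],
    "max_tokens": 256
  }'</pre>
<!-- /wp:enlighter/codeblock -->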
<!-- wp:heading -->
<h2 class="wp-block-heading">Running on Linux and Nvidia</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I am still struggling to get mistral.rs to build on Linux at the moment; the Docker images provided by the project don't seem to play ball with my systems. Once I figure this out I'll release an updated version of this blog post.</p>