---
categories:
- AI and Machine Learning
date: "2024-09-05 14:17:39"
draft: false
tags:
- AI
- llms
title: Running Phi MoE 3.5 on MacBook Pro
type: posts
mp-syndicate-to:
- https://brid.gy/publish/mastodon
- https://brid.gy/publish/twitter
url: /2024/09/05/runing-phi-moe-3-5-on-macbook-pro/
---
<!-- wp:paragraph -->
<p>The recently released Phi 3.5 model series includes a mixture-of-experts model made up of 16 expert networks of roughly 3.8 billion parameters each. It activates two experts at a time, which gives it pretty good performance while only around 6.6 billion parameters are active for any given token. I recently wanted to try running Phi MoE 3.5 on my MacBook but was blocked from doing so with <a href="https://brainsteam.co.uk/2024/07/08/ditch-that-chatgpt-subscription-moving-to-pay-as-you-go-ai-usage-with-open-web-ui/">my usual method</a> while <a href="https://github.com/ggerganov/llama.cpp/issues/9119#issuecomment-2319393405">support is still being added to llama.cpp</a> and then <a href="https://github.com/ollama/ollama/issues/6449">ollama</a>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I decided to try out another library, <a href="https://github.com/EricLBuehler/mistral.rs">mistral.rs</a>, which is written in the Rust programming language and already supports these newer models. It took a little bit of fiddling around, but I did manage to get it working and the model is reasonably responsive.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Getting Our Dependencies and Building mistral.rs</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>To get started you will need the Rust compiler toolchain installed on your MacBook, including <code>rustc</code> and <code>cargo</code>. The easiest way to do this is via Homebrew:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">brew install rust</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>You'll also need to grab the code for the project:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">git clone https://github.com/EricLBuehler/mistral.rs.git</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>Once you have both of these in place, we can build the project. Since we're running on a Mac, we want the compiler to make use of Apple's <a href="https://developer.apple.com/metal/">Metal</a> framework, which lets the model use the GPU on the M-series chip for acceleration.</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cd mistral.rs
cargo install --path mistralrs-server --features metal</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>This command may take a couple of minutes to run. The compiled <code>mistralrs-server</code> binary will end up in the <code>target/release</code> folder inside the project directory.</p>
<!-- /wp:paragraph -->
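<!-- wp:paragraph -->
<p>As a quick sanity check that the build worked, you can ask the binary for its help text (assuming it ended up in <code>target/release</code> as described above):</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Print the CLI help to confirm the binary built correctly
./target/release/mistralrs-server --help</pre>
<!-- /wp:enlighter/codeblock -->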
<!-- wp:heading -->
<h2 class="wp-block-heading">Running the Model with Quantization</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>The <a href="https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#quick-examples">default instructions in the project readme</a> work, but you might find the model takes up a lot of memory and runs very slowly. That's because, by default, mistral.rs does not apply any quantization, so running the model requires around 12GB of memory.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>mistral.rs supports in-situ quantization (ISQ), which essentially means that the framework loads the full model and quantizes it at load time (as opposed to requiring you to download a GGUF file that has already been quantized). I recommend running the following:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">./target/release/mistralrs-server --isq Q4_0 -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>Here we use ISQ to quantize the model down to 4-bit (<code>--isq Q4_0</code>), and the <code>-i</code> flag runs mistral.rs in interactive mode so you can chat with the model directly in the terminal.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Running as a Server</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>mistral.rs provides an HTTP API that is compatible with the OpenAI API. To run in server mode, we remove the <code>-i</code> argument and instead specify a port to listen on with <code>--port 1234</code>:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">./target/release/mistralrs-server --port 1234 --isq Q4_0 plain -m microsoft/Phi-3.5-mini-instruct -a phi3</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>You can then use an app like Postman or Bruno to interact with your model:</p>
<!-- /wp:paragraph -->
<!-- wp:image {"id":3999,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="/media/image-1024x466_2f2e29bd.png" alt="Screenshot of a REST tooling interface. A pane on the left shows a json payload that was sent to the server containing messages to the model telling it to behave as a useful assistant and write a poem.
On the right is the response which contains a message and the beginning of a poem as written by the model." class="wp-image-3999"/></figure>
<!-- /wp:image -->
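<!-- wp:paragraph -->
<p>If you'd rather stay in the terminal, here's a rough sketch of the same kind of request made with <code>curl</code>. It assumes the server exposes the standard OpenAI-style <code>/v1/chat/completions</code> route on the port we chose above; the <code>model</code> field is just a placeholder since only one model is loaded:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Example chat completion request against the local mistral.rs server
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write me a short poem about autumn."}
    ],
    "max_tokens": 256
  }'</pre>
<!-- /wp:enlighter/codeblock -->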
<!-- wp:heading -->
<h2 class="wp-block-heading">Running the Vision Model</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>To run the vision model, we just need to make a couple of changes to our command line arguments:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">./target/release/mistralrs-server --port 1234 --isq Q4_0 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>We still use ISQ, but this time we swap <code>plain</code> for <code>vision-plain</code>, swap the model name for its vision equivalent, and change the architecture from <code>-a phi3</code> to <code>-a phi3v</code>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Likewise we can now interact with the model via HTTP tooling. Here's <a href="https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3V.md">a response based on the example from the documentation</a>:</p>
<!-- /wp:paragraph -->
<!-- wp:image {"id":4000,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="/media/image-1-1024x674_477cd2fe.png" alt="Screenshot of a REST interface. A pane on the left shows a json payload that was sent to the server containing messages to the model telling it to analyse an image url.
On the right is the response which describes the mountain in the picture that was sent." class="wp-image-4000"/></figure>
<!-- /wp:image -->
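<!-- wp:paragraph -->
<p>For reference, a request to the vision model might look roughly like the sketch below, following the OpenAI-style multimodal message format shown in the mistral.rs docs. The image URL is a placeholder, and the <code>&lt;|image_1|&gt;</code> tag in the text prompt refers to the first image in the message:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Example vision request: send an image URL plus a text prompt
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3v",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/mountain.jpg"}},
          {"type": "text", "text": "&lt;|image_1|&gt;\nWhat is shown in this image?"}
        ]
      }
    ],
    "max_tokens": 256
  }'</pre>
<!-- /wp:enlighter/codeblock -->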
<!-- wp:heading -->
<h2 class="wp-block-heading">Running on Linux and Nvidia</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I am still struggling to get mistral.rs to build on Linux at the moment; the Docker images provided by the project don't seem to play ball with my systems. Once I figure this out I'll release an updated version of this blog post.</p>
<!-- /wp:paragraph -->
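<!-- wp:paragraph -->
<p>For reference, the CUDA equivalent of the Metal build above just swaps the feature flag (the <code>cuda</code> feature is listed in the project readme), although as noted I haven't had any luck with it on my machines yet:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"bash"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Build with CUDA support instead of Metal (for Linux machines with an Nvidia GPU)
cargo install --path mistralrs-server --features cuda</pre>
<!-- /wp:enlighter/codeblock -->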