<p>The relatively recently released Phi 3.5 model series includes a mixture-of-experts model featuring 16 expert models of 3.8 billion parameters each. It activates two experts at a time, giving pretty good performance with only around 6.6 billion parameters active for any given token. I recently wanted to try running Phi 3.5 MoE on my MacBook but was blocked from doing so using <a href="https://brainsteam.co.uk/2024/07/08/ditch-that-chatgpt-subscription-moving-to-pay-as-you-go-ai-usage-with-open-web-ui/">my usual method</a> while <a href="https://github.com/ggerganov/llama.cpp/issues/9119#issuecomment-2319393405">support is still being built into llama.cpp</a> and then <a href="https://github.com/ollama/ollama/issues/6449">ollama</a>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I decided to try out another library, <a href="https://github.com/EricLBuehler/mistral.rs">mistral.rs</a>, which is written in the Rust programming language and already supports these newer models. It required a little bit of fiddling around, but I did manage to get it working and the model is relatively responsive.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Getting Our Dependencies and Building Mistral.rs</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>To get started you will need to have the Rust compiler toolchain installed on your MacBook, including <code>rustc</code> and <code>cargo</code>. The easiest way to do this is via brew:</p>
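<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># one option: install the Rust toolchain (rustc and cargo) via Homebrew
# (installing via rustup works just as well if you prefer)
brew install rust</pre>
<!-- /wp:enlighter/codeblock -->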
<p>Once you have both of these in place we can build the project. Since we're running on a Mac, we want the compiler to make use of Apple's <a href="https://developer.apple.com/metal/">Metal</a> framework, which allows the model to use the GPU capabilities of the M-series chip to accelerate inference.</p>
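<p>Something along the following lines should do it, assuming you're building from a fresh clone of the repository (the <code>metal</code> feature flag is what enables GPU acceleration; check the mistral.rs README if the build flags have changed):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># fetch the source code
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

# build the server in release mode with Metal support
cargo build --release --features metal</pre>
<!-- /wp:enlighter/codeblock -->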
<p>This command may take a couple of minutes to run. The compiled server will be saved in the <code>target/release</code> folder relative to your project folder. </p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Running the Model with Quantization</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>The <a href="https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#quick-examples">default instructions in the project readme</a> (sketched below) work, but you might find that they use a lot of memory and take a really long time to run. That's because, by default, mistral.rs does not do any quantization, so running the model requires 12GB of memory.</p>
<!-- /wp:paragraph -->
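<p>For reference, the unquantized quick-start invocation looks roughly like the command below: the same one we'll use in a moment, just without a quantization flag.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># interactive mode, no in-situ quantization: loads the full-precision weights
./target/release/mistralrs-server -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3</pre>
<!-- /wp:enlighter/codeblock -->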
<!-- wp:paragraph -->
<p>mistral.rs supports in-situ quantization (ISQ), which essentially means that the framework loads the model and quantizes it at run time (as opposed to requiring you to download a GGUF file that was already quantized). I recommend running the following:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">./target/release/mistralrs-server --isq Q4_0 -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>In this mode we use ISQ to quantize the model down to 4-bit precision (<code>--isq Q4_0</code>). You should then be able to chat with the model through the terminal.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Running as a Server</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Mistral.rs provides an HTTP API that is compatible with the OpenAI API. To run in server mode we remove the <code>-i</code> argument and replace it with a port number to run on, <code>--port 1234</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">./target/release/mistralrs-server --port 1234 --isq Q4_0 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>We still want to use ISQ, but this time we swap <code>plain</code> for <code>vision-plain</code>, swap the model name for its vision equivalent, and change the architecture from <code>-a phi3</code> to <code>-a phi3v</code>.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Likewise we can now interact with the model via HTTP tooling, following <a href="https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3V.md">the example from the documentation</a>.</p>
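<p>As a rough sketch, a request against the OpenAI-compatible chat completions endpoint can look something like this (the port matches the one we passed above; the model name, image URL and prompt here are placeholders, and the linked docs show the exact format expected by the vision model):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># send a vision chat request to the local mistral.rs server
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3v",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}},
          {"type": "text", "text": "What is shown in this image?"}
        ]
      }
    ],
    "max_tokens": 256
  }'</pre>
<!-- /wp:enlighter/codeblock -->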
<!-- wp:heading -->
<h2 class="wp-block-heading">Running on Linux and Nvidia</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I am still struggling to get mistral.rs to build on Linux at the moment; the Docker images provided by the project don't seem to play ball with my systems. Once I figure this out I'll release an updated version of this blog post.</p>