Convert and split CoreML model

#1
by andmev - opened

Hello @smpanaro , could you please let me know how you converted and split your Llama 3.2 into chunks? I'm planning to make a similar CoreML model, but for Llama 3.2 3B.
Thanks

Hey @andmev I have a forked version of lit-gpt that I’ve modified to produce these models. I’ll release it eventually, but it’s still fairly in flux. The chunking is simply taking N consecutive blocks (block = norms + attn + mlp) and making them one chunk.

3B might be slightly too big to run on M1 without compression. I can try and convert one.
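
For illustration, a minimal sketch of what that chunking could look like with PyTorch and coremltools follows. This is not the actual export code from the lit-gpt fork; the Chunk wrapper, the model.blocks attribute, and the tensor shapes are assumptions.

```python
# Hypothetical sketch of chunked export -- not the actual lit-gpt fork.
# Assumes a PyTorch GPT-style model whose transformer blocks (norms + attn + mlp)
# live in `model.blocks`, and that each chunk only passes hidden states through.
import torch
import coremltools as ct

class Chunk(torch.nn.Module):
    """Wraps N consecutive transformer blocks as one exportable module."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, hidden_states):
        for block in self.blocks:
            hidden_states = block(hidden_states)
        return hidden_states

def export_chunks(model, blocks_per_chunk, seq_len, hidden_size):
    blocks = list(model.blocks)
    example = torch.zeros(1, seq_len, hidden_size)
    for i in range(0, len(blocks), blocks_per_chunk):
        chunk = Chunk(blocks[i : i + blocks_per_chunk]).eval()
        traced = torch.jit.trace(chunk, example)
        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(name="hidden_states", shape=example.shape)],
            convert_to="mlprogram",
            compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer the Neural Engine
        )
        mlmodel.save(f"chunk_{i // blocks_per_chunk}.mlpackage")
```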

@smpanaro Can you share your export code?

I will eventually but it’s a mess right now and I’m in the middle of another project. I don’t want to drop it and then not support it.

Are there particular models you’re interested in converting / do you have a particular use case for Neural Engine models?

Thank you for your reply. Apple's official tutorials for both Llama and Mistral use the GPU backend, and in my project model inference and computation must not consume GPU resources, so I am very interested in using the ANE backend for inference. @smpanaro

Hey @andmev -- I converted a 3B-Instruct here. You can run it the same way as this model. It's going to be painfully slow on M1 (~1 token/second) but should be better on M2 and newer. It's probably too big for phones too. (Quantization should help, but I would need to see how it impacts quality.)
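
As a reference point, one way to try quantization on an exported chunk is coremltools' weight palettization. This is a hedged sketch: the chunk_0.mlpackage file name is hypothetical, and the quality impact would still need to be measured.

```python
# Hedged example: compress one exported chunk to 4-bit palettized weights.
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("chunk_0.mlpackage")  # hypothetical chunk path
config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=4)
)
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("chunk_0_4bit.mlpackage")
```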

@eoe Nice! One of the reasons I started converting these models is because I wanted to run them without putting load on the GPU. I tweet and write about this somewhat regularly -- you might find it interesting.

Hi @smpanaro , thank you. I got the following results on an M2 Max:

andrii@MacStudio coreml-llm-cli $ swift run -c release LLMCLI --repo-id smpanaro/Llama-3.2-3B-Instruct-CoreML --max-new-tokens 80
Building for production...
[23/23] Linking LLMCLI
Build of product 'LLMCLI' complete! (12.32s)
ModelPipeline Llama-3.2-3B-Instruct (16 chunks)
Compiling models: ****************
Loading models  : ****************
128000 35862 9606 315 14362 392 315 279 64677 76233 12791 8356 279 63494 505 902 11033 374 28532 13 578 1403 1925 9606 54453 67166 527 76233 12791 649 24453 6347 320 14804 82979 8 323 76233 12791 52412 3074 320 7098 370 3074 570 10989 3074 374 6646 311 387 315 5190 4367 323 374 15042 520 1579 4902 21237 304 13918 449 23900 20472 323 1664 19158 2692 17614 13 4997 82979 374 810 8624 47056 323 8831 311 3139 11 719 706 264 25984 261 17615 323 374 3629 1511 304 9888 11033 58943 13 7089 9606 11 1778 439 76233 12791 34929 3074 323 76233 12791 3521 

<|begin_of_text|>Several species of shrub of the genus Coffea produce the berries from which coffee is extracted. The two main species commercially cultivated are Coffea canephora (Robusta) and Coffea arabica (Arabica). Arabica is considered to be of higher quality and is grown at high altitudes in regions with mild temperatures and well-drained soil. Robusta is more disease-resistant and easier to grow, but has a harsher flavor and is often used in instant coffee blends. Other species, such as Coffea liberica and Coffea exc

Compile + Load: 66,75 sec
Generate      : 124,15 +/- 5,15 ms / token
                8,07 +/- 0,25 token / sec

Hey, could you help me out with running this Llama 3.2 on my Mac?

@Srinarayan02 you can use the CLI linked in the README. Run the command `swift run -c release LLMCLI --repo-id smpanaro/Llama-3.2-3B-Instruct-CoreML`.

Make sure that you install the Hugging Face CLI, log in with it, and have been granted access to the official Llama 3.2 repo first.
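
If it helps, you can check that authentication and gated-repo access are set up from Python before running the Swift CLI. This is a small sketch assuming the huggingface_hub package is installed; the repo id below is an assumption about which gated repo is needed.

```python
# Quick sanity check that your Hugging Face token works and the gated
# Llama 3.2 repo is accessible before running the Swift CLI.
# Assumes `pip install huggingface_hub`; the repo id below is an assumption.
from huggingface_hub import login, model_info

login()  # prompts for a token; same effect as `huggingface-cli login`
info = model_info("meta-llama/Llama-3.2-3B-Instruct")
print("Access OK:", info.id)
```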
