MobileLLM-R1-950M meets Apple Silicon
From Stub to Coherent MLX Model, a story-shaped release note
Getting Meta's efficient 950M parameter model running natively on Apple Silicon for fast, local inference
I’ve made a few small contributions to the llama.cpp community, including the first long-context support for Qwen models and my llama-gguf-optimize project. But I ran out of runway. Apart from a PR to bring it up to date with modern versions of Python (a major contributor spent several days helping me polish it, but it was never picked up, for whatever reason), I haven’t worked on the internals much since. Meta’s MobileLLM-R1 announcement caught my interest, though I did not yet realize it would bring me back to quantization work of any kind.
I use models around this size a lot in easy-success-rate agentic roles that need tool calling: typically qwen3:4b-instruct-2507-q4_K_M, my own upload robbiemu/Salesforce_Llama-xLAM-2:8b-fc-r-q8_0, or gemma3n:e4b-it-q8_0 with a custom template. I also use thinking models, but they tend to be larger still. I need a nice, small thinking model that won’t go off the rails, so I figured I’d see if I could go much smaller with the new model. But alas, I don’t see it on Ollama, and I don’t see any GGUFs either. I have a MacBook Pro M3, so I benefit from MLX, and I have been curious to learn more. Thus starts this investigation.
Where it lives
robbiemu/MobileLLM-R1-950M-MLX – float32 and 4-bit weights, plus every script that got us there.
This is a proof-of-concept runtime, not an official mlx-community release (that needs a PR to upstream mlx-lm).
Try it out, report issues, send PRs – the glue code is all there.
Quick start (copy/paste)
```bash
# 1. Grab the repo
git clone https://huggingface.co/robbiemu/MobileLLM-R1-950M-MLX
cd MobileLLM-R1-950M-MLX

# 2. Install (uv preferred)
uv pip install -r pyproject.toml

# 3. Float32 chat
uv run python inference.py \
  --prompt "Explain quicksort in one sentence." \
  --temperature 0.7 --top-p 0.9

# 4. Make 4-bit variant
uv run python custom_mlx_lm/custom_convert_2.py \
  --hf-path . --mlx-path mlx_q4 --dynamic-quant --target-bpw 4.5

# 5. Load 4-bit with mlx-lm (uses our custom_loader under the hood)
uv run python -m mlx_lm.generate \
  --model mlx_q4 \
  --prompt "What is 2+2?" --max-tokens 32
```
Why it wasn’t copy-paste – the messy bits
I worked through this with AI coding assistants (Gemini 2.5 Pro and OpenAI Codex) as debugging partners: they caught edge cases I missed, I caught their hallucinations, and together we got to working code.
Config vs. weights boxing match
The JSON advertises two MLP intermediate sizes, hinting at a dual-branch feed-forward. Open the safetensors, though, and you’ll find only the classic SwiGLU trio (gate_proj, up_proj, down_proj). Rather than hard-coding either guess, we let the loader peek at the key names and flip a use_dual_mlp flag on the fly. That single check saves anyone downstream from a shape-mismatch headache.
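A minimal sketch of that key-name check, in plain Python. The Llama-style key layout and the rule used here (any projection under `mlp` other than the SwiGLU trio implies a second branch) are assumptions for illustration; the loader in the repo may test something different.

```python
# Sketch only: decide whether a checkpoint needs a dual-branch MLP by
# looking at its parameter names. The key layout is an assumed,
# Llama-style naming scheme, not copied from the actual checkpoint.
SWIGLU_PROJECTIONS = {"gate_proj", "up_proj", "down_proj"}


def detect_dual_mlp(weight_keys):
    """Return True if any MLP projection falls outside the SwiGLU trio."""
    for key in weight_keys:
        parts = key.split(".")
        if "mlp" not in parts:
            continue
        # The projection name is the component right after "mlp".
        idx = parts.index("mlp") + 1
        if idx < len(parts) and parts[idx] not in SWIGLU_PROJECTIONS:
            return True
    return False


keys = [
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.self_attn.q_proj.weight",
]
use_dual_mlp = detect_dual_mlp(keys)
print(use_dual_mlp)  # False
```

Deciding from the weights rather than the config means the model definition follows whatever tensors are actually on disk.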
GQA head-count mismatch
Twenty-four query heads, six KV heads. If you treat that like a vanilla transformer, the very first mat-mul explodes. We simply repeat K/V four times along the head axis (mx.repeat(..., axis=1)). One line, one crash gone.
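To make the head bookkeeping concrete, here is a small stand-alone sketch using plain Python lists instead of MLX arrays. The head counts come from the text above; the helper name is hypothetical, and it only imitates what a repeat along the head axis does, assuming each element is repeated consecutively.

```python
# Sketch only: how repeating K/V along the head axis lines KV heads up
# with query heads. Plain lists stand in for tensors here.
N_HEADS = 24      # query heads
N_KV_HEADS = 6    # key/value heads
N_REP = N_HEADS // N_KV_HEADS  # each KV head serves this many query heads


def repeat_heads(kv_heads, n_rep):
    """Imitate a repeat along the head axis: [a, b] -> [a, a, b, b] for n_rep=2."""
    return [head for head in kv_heads for _ in range(n_rep)]


kv = [f"kv{i}" for i in range(N_KV_HEADS)]
expanded = repeat_heads(kv, N_REP)

print(N_REP)          # 4
print(len(expanded))  # 24
print(expanded[:5])   # ['kv0', 'kv0', 'kv0', 'kv0', 'kv1']
```

After the repeat, query head `h` is paired with KV head `h // 4`, so the shapes on both sides of the attention product agree.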
Attention scale trap
The config carries an attn_scale=0.1 that looks harmless. Feed it into MLX’s scaled_dot_product_attention and the logits collapse into white noise. We kept the optional Q/K RMSNorm (because the flag says so) but dropped the multiplier. Attention maths returned to normal, and the model started speaking English again.
RoPE “off for everyone”
no_rope_layers lists every single layer, which would leave the model position-blind. We treat an “all layers disabled” list as a config artefact and simply apply RoPE everywhere. Positional encoding restored, no philosophical debate required.
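The rule can be sketched in a few lines. The function name is hypothetical, and the sketch assumes `no_rope_layers` holds layer indices; if the real config encodes it differently, the same "everything disabled means ignore the list" idea would need adapting.

```python
# Sketch only: treat a no_rope_layers list that disables every layer as
# a config artefact. Assumes the list holds layer indices.
def rope_enabled_per_layer(num_layers, no_rope_layers):
    disabled = {i for i in no_rope_layers if 0 <= i < num_layers}
    if len(disabled) == num_layers:
        # Every layer is "disabled": ignore the list, apply RoPE everywhere.
        return [True] * num_layers
    return [i not in disabled for i in range(num_layers)]


print(rope_enabled_per_layer(4, [0, 1, 2, 3]))  # [True, True, True, True]
print(rope_enabled_per_layer(4, [1, 3]))        # [True, False, True, False]
```

A partial list is still honoured, so a config that genuinely skips RoPE on some layers keeps working.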
Chat-template ramble
Out of the box the template ends with “Please reason step-by-step and put your final answer within \boxed{}.” That nudges the model into verbose, looping monologues. We added CLI flags (--system, --disable-chat-template, --final-only) plus repetition and frequency penalties so you can ask for a crisp sentence instead of a three-page proof.
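For readers unfamiliar with the two penalties, the sketch below shows common formulations of each on a plain list of logits. It is not taken from inference.py; the function name, defaults, and the order in which the penalties are applied are all assumptions.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, repetition_penalty=1.1,
                    frequency_penalty=0.2):
    """Return a penalised copy of logits (a list of floats, one per token id)."""
    out = list(logits)
    counts = Counter(generated_ids)
    for token_id, count in counts.items():
        value = out[token_id]
        # Repetition penalty: shrink positive logits, push negative ones lower.
        if value > 0:
            value = value / repetition_penalty
        else:
            value = value * repetition_penalty
        # Frequency penalty: subtract more the more often the token appeared.
        out[token_id] = value - frequency_penalty * count
    return out


logits = [2.0, -1.0, 0.5]
print(apply_penalties(logits, [0, 0, 1],
                      repetition_penalty=2.0, frequency_penalty=0.5))
# [0.0, -2.5, 0.5]
```

Token 2 was never generated, so its logit is untouched; tokens 0 and 1 become less likely in proportion to how often they already appeared.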
Sampler API foot-gun (the one that ate a night)
We masked probabilities then called mx.random.categorical – but MLX expects raw logits. The result was an endless stream of “Sort Sort Sort…”. Once we masked the logits and let categorical eat logits, the repetition stopped and actual sentences appeared.
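The difference is easier to see with numbers. The sketch below uses only the Python standard library: `sample_from_logits` is a hypothetical stand-in for a categorical sampler that expects logits, and the top-p mask is one common formulation, not necessarily the one in inference.py. It illustrates one way the mismatch goes wrong: a probability zeroed by the mask, when read as a logit of 0, is no longer excluded.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_mask_logits(logits, top_p):
    """Keep the smallest set of tokens whose probability mass reaches top_p;
    set every other logit to -inf so it can never be sampled."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def sample_from_logits(logits, rng):
    """Stand-in for a categorical sampler that takes logits, not probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [3.0, 2.0, 0.0, -1.0]

# Correct: mask in logit space, then sample from the masked logits.
masked = top_p_mask_logits(logits, top_p=0.9)
print(masked)  # [3.0, 2.0, -inf, -inf]
rng = random.Random(0)
samples = [sample_from_logits(masked, rng) for _ in range(200)]

# Mismatch: zero out probabilities, then hand them to a logits-based sampler.
masked_probs = [p if i in (0, 1) else 0.0 for i, p in enumerate(softmax(logits))]
wrong = softmax(masked_probs)  # "masked" tokens regain real probability
print(wrong[2] > 0.0)  # True
```

With the mask applied to logits, the two excluded tokens have probability exactly zero; in the mismatched version they each get a sizeable share back.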
Put together, those six course-corrections turned a stub that loaded but rambled into a model that answers questions without existential dread.
Size talk (because people ask)
| File | Size (GB) | What happened |
| --- | --- | --- |
| model.safetensors (original) | ~3.8 | Straight from Meta |
| weights.npz (MLX float32) | ~3.8 | Same values, MLX layout, no zip compression |
| weights.npz (4-bit mixed) | 1.26 | 153 layers @ 4-bit, 1 layer @ 8-bit, embeddings left float32 |
You might see "950M parameters" and wonder why the file is 3.8 GB, not 950 MB. That's because the original model stores each of its ~950 million parameters as a 32-bit float, which takes up 4 bytes. So, 950M params × 4 bytes/param ≈ 3.8 GB.
So, how did we shrink it to 1.26 GB? It's not just a simple division. Our mixed-precision quantization is smart about what it shrinks:
- The Big Float Part (752 MiB): Some layers are too sensitive for quantization. The huge token embedding table (197M parameters alone!) is kept in full float32 to preserve the model's vocabulary and nuance. This chunk alone is ~752 MiB.
- The Tiny 4-bit Part (358 MiB): The vast majority of the model's workhorse layers (153 of them!) have their weights compressed to 4 bits.
- The Metadata Overhead (45 MiB): Those 4-bit numbers need scaling factors to become useful again. We store one float16 scale and bias for every group of 64 weights, adding a bit of necessary overhead.
Add it all up (752 + 358 + 45), and you land at about 1,155 MiB, which is roughly 1.13 GiB or 1.21 GB. The remaining gap up to the 1.26 GB on disk comes from the NPZ file format itself. It’s a huge saving, achieved by being selective about what to compress.
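The arithmetic in this section can be rechecked in a few lines. The parameter counts are the rounded figures quoted above, so the results are approximate.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
GB = 1000 ** 3

# Float32 checkpoint: ~950M parameters at 4 bytes each.
full_bytes = 950_000_000 * 4
print(round(full_bytes / GB, 2))         # 3.8

# Embedding table kept in float32: ~197M parameters at 4 bytes each.
embed_mib = 197_000_000 * 4 / MIB
print(round(embed_mib, 1))               # 751.5, close to the ~752 MiB quoted

# Per-group overhead: one float16 scale + one float16 bias per 64 weights.
overhead_bits_per_weight = (16 + 16) / 64
print(overhead_bits_per_weight)          # 0.5

# Sum of the three parts listed above.
total_mib = 752 + 358 + 45
print(total_mib)                         # 1155
print(round(total_mib * MIB / GIB, 2))   # 1.13
print(round(total_mib * MIB / GB, 2))    # 1.21
```

The half bit per weight of scale-and-bias overhead is why a nominally 4-bit layer costs closer to 4.5 bits per weight in practice.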
What's inside the repo
- model.py – MLX implementation with GQA, QK-norm, RoPE gating, weight-tying
- inference.py – temp/top-p, penalties, boxed-answer helpers, chat/raw toggle
- custom_mlx_lm/ – quantiser + loader that teaches mlx-lm how to read our bit-width map. You’ll want to copy model.py into the mlx-lm folder of your venv, under models/
- test_model.py & check_shape.py – sanity checks and MLP-variant inspector
- weights.npz (float32) & mlx_q4/weights.npz (4-bit) – ready-to-use checkpoints
Is this the final word?
No – it's a working proof-of-concept.
Upstream mlx-lm will hopefully grow native Llama-4 text support; until then, these scripts let you run and quantise the model today.
Issues/PRs welcome – let's make it obsolete the right way.
