Particle.news

Ollama Previews MLX-Powered Mac Runtime for Faster Local AI

The preview uses Apple's MLX framework to cut latency on M5‑class Macs, making local agents feel faster.

Overview

  • Ollama released a preview of version 0.19 on Monday that plugs its local LLM runner into Apple’s MLX framework on Apple Silicon.
  • On M5‑series chips, the runtime uses unified memory and the GPU's Neural Accelerators to speed responses, with Ollama's tests showing prefill at 1,810 tokens per second versus 1,154, and decode at 112 versus 58.
  • The update adds NVIDIA’s NVFP4 weight format, which reduces memory use while preserving accuracy so local outputs better match production inference.
  • Caching now reuses work across chats, saves snapshots at key points in prompts, and keeps shared prefixes longer, which makes coding and agent tools like OpenClaw and Claude Code respond faster.
  • The preview accelerates only Alibaba's Qwen3.5‑35B‑A3B model and requires a Mac with more than 32GB of unified memory, and it does not address known agent risks such as prompt injection and data leakage.
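The memory savings from a 4-bit weight format like NVFP4 can be estimated with back-of-envelope arithmetic. The sketch below assumes NVFP4's published layout of 4-bit float weights with one 8-bit scale shared per 16-value block (roughly 4.5 bits per weight); the 35B parameter count matches the model named above, and the helper name is illustrative.

```python
# Back-of-envelope weight memory for a 35B-parameter model (illustrative).
# NVFP4 packs weights as 4-bit floats plus an 8-bit scale per 16-value
# block, i.e. about 4 + 8/16 = 4.5 bits per weight.
PARAMS = 35e9

def model_bytes(bits_per_weight: float, params: float = PARAMS) -> float:
    """Total bytes to store the weights at a given bit width."""
    return params * bits_per_weight / 8

print(f"FP16:  {model_bytes(16) / 1e9:.1f} GB")   # 70.0 GB
print(f"NVFP4: {model_bytes(4.5) / 1e9:.1f} GB")  # 19.7 GB
```

At roughly 20 GB of weights, the >32GB unified-memory requirement leaves headroom for the KV cache and activations.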
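The caching described above rests on a simple idea: if two requests share a prompt prefix (a system prompt, earlier turns of a chat), the work for that prefix can be reused instead of recomputed. The toy sketch below models that idea with a snapshot lookup by longest saved prefix; it is a minimal illustration of the concept, not Ollama's implementation, and all names in it are hypothetical.

```python
# Toy model of prefix caching: snapshot state after a shared prompt
# prefix, then reuse the longest saved prefix on later requests.
class PrefixCache:
    def __init__(self):
        self._snapshots = {}  # token-prefix tuple -> cached "KV state"

    def save(self, tokens):
        """Snapshot state at a key point in the prompt (e.g. after the system prompt)."""
        self._snapshots[tuple(tokens)] = f"kv_state[{len(tokens)}]"

    def longest_prefix(self, tokens):
        """Return (cached_state, n_reused) for the longest saved prefix, or (None, 0)."""
        for n in range(len(tokens), 0, -1):
            state = self._snapshots.get(tuple(tokens[:n]))
            if state is not None:
                return state, n
        return None, 0

cache = PrefixCache()
system = ["<sys>", "You", "are", "an", "agent"]
cache.save(system)  # snapshot after the shared prefix

turn = system + ["user:", "list", "files"]
state, reused = cache.longest_prefix(turn)
print(reused, "of", len(turn), "tokens reused")  # 5 of 8 tokens reused
```

Agent tools benefit most because they replay a long, identical system prompt on every call, so nearly all prefill work lands in the reused prefix.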