raullen 2 days ago
Built this to run coding agents locally on Apple Silicon. The main problem I kept hitting: most models fail at structured tool calling, and existing servers are slow on MLX.
Two findings from benchmarking 7 models across 5 agent frameworks:
1. The Qwen family gets 100% tool-calling success across every framework tested. Non-Qwen models (Llama, DeepSeek-R1) vary wildly, from 40% to 100% depending on the framework.
2. smolagents (HuggingFace) sidesteps structured function calling entirely by having the model emit code instead. DeepSeek-R1 goes from 40% with structured function calling to 100% with smolagents.
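To make the distinction in finding 2 concrete, here's a minimal sketch (not from the repo; `get_weather` and the model outputs are made up) of why code generation can be more forgiving than structured function calling:

```python
import json

# A single "tool" the agent can call. Hypothetical example tool.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# --- Structured function calling: the model must emit exact JSON. ---
# One malformed key or stray token and json.loads() raises, which is
# where the weaker models lose their 40-100% accuracy spread.
model_output_fc = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(model_output_fc)
result_fc = TOOLS[call["name"]](**call["arguments"])

# --- smolagents-style code generation: the model writes plain Python. ---
# Models trained heavily on code tend to produce this more reliably
# than a rigid JSON schema; the agent executes it in a sandboxed scope.
model_output_code = 'result = get_weather("Paris")'
scope = dict(TOOLS)
exec(model_output_code, scope)
result_code = scope["result"]

print(result_fc)    # Sunny in Paris
print(result_code)  # Sunny in Paris
```

Both paths reach the same tool; the difference is how much formatting precision the model has to get right.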
Speed-wise, MLX's unified memory means zero CPU↔GPU copies. On an M3 Ultra: Qwen3.5-9B hits 108 tok/s (vs ~41 on Ollama), Qwen 3.6 35B does 100 tok/s with only 3B active params.
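For anyone wanting to reproduce the tok/s numbers, the measurement itself is framework-agnostic: time a token stream and divide. A sketch (the `fake_model` stub stands in for whatever streaming backend you point it at; real runs would stream from the MLX or Ollama server instead):

```python
import time
from typing import Iterable, Tuple

def tokens_per_second(stream: Iterable[str]) -> Tuple[int, float]:
    """Drain a token stream, returning (token_count, tokens/sec)."""
    start = time.perf_counter()
    n = sum(1 for _ in stream)
    elapsed = time.perf_counter() - start
    return n, n / elapsed

# Stub generator simulating per-token decode latency; replace with
# the streaming output of your inference server.
def fake_model(n_tokens: int):
    for _ in range(n_tokens):
        time.sleep(0.001)  # pretend decode latency
        yield "tok"

count, tps = tokens_per_second(fake_model(50))
print(count, round(tps))
```

One caveat when comparing numbers across servers: measure decode throughput separately from prompt processing, or the prompt length skews the comparison.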
The full benchmark data is in the README. Happy to discuss the MLX performance characteristics or tool calling architecture.