MLX Frontend
The MLX frontend allows the Proxy Base Agent (PBA) to run inference using Large Language Models optimized for Apple Silicon (M-series chips) via the MLX framework.
Overview
MLX is a NumPy-like array framework designed by Apple for efficient machine learning on Apple Silicon. The agent.llm.frontend.mlx.MLXInference class implements the Frontend interface for MLX models.
Key Features:
- Optimized Performance: Leverages the unified memory architecture and GPU of Apple Silicon (through Metal) for fast local inference.
- Model Compatibility: Works with models converted to the MLX format (often available on Hugging Face hubs like mlx-community).
- KV Caching: Supports efficient Key-Value caching, including saving/loading system prompt caches to disk for faster startup (supports_reusing_prompt_cache() returns True).
- PSE Integration: Seamlessly integrates with the StructuringEngine for constrained generation during the generate_step process.
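The prompt-cache reuse described above can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the helper names (cache_path_for, compute_prompt_state, load_or_build_cache), the file naming scheme, and the use of JSON in place of the tensor serialization a real KV cache would need. It only shows the general pattern of paying the prefill cost once and loading the result from disk on later runs.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_path_for(system_prompt: str, cache_dir: Path) -> Path:
    """Derive a stable file name from the prompt text (hypothetical naming scheme)."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"prompt_cache_{digest}.json"


def compute_prompt_state(system_prompt: str) -> dict:
    """Stand-in for the expensive prefill pass that would populate a real KV cache."""
    tokens = system_prompt.split()
    return {"num_tokens": len(tokens), "tokens": tokens}


def load_or_build_cache(system_prompt: str, cache_dir: Path) -> tuple[dict, bool]:
    """Return (state, reused); reused is True when the state was loaded from disk."""
    path = cache_path_for(system_prompt, cache_dir)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    state = compute_prompt_state(system_prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state), encoding="utf-8")
    return state, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prompt = "You are a helpful agent."
        _, reused_first = load_or_build_cache(prompt, Path(tmp))
        _, reused_second = load_or_build_cache(prompt, Path(tmp))
        print(reused_first, reused_second)
```

Keying the file on a hash of the prompt means a changed system prompt never picks up a stale cache; whether PBA uses the same scheme is not stated in this page.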
Usage
- Installation: Ensure you have the necessary MLX dependencies installed. This is typically handled by installing PBA with the [mlx] extra:

      pip install "proxy-base-agent[mlx]"
      # or
      uv pip install "proxy-base-agent[mlx]"

  You also need the mlx-lm package:

      pip install mlx-lm
      # or
      uv pip install mlx-lm

- Model Selection: During the PBA setup wizard (python -m agent), choose a model compatible with MLX (e.g., from the mlx-community hub on Hugging Face or one you have converted locally). Select "MLX" as the inference backend when prompted.
- Configuration: The LocalInference class automatically instantiates MLXInference when an MLX model path and the MLX frontend are selected. Relevant inference_kwargs (like temp, seed, max_tokens, and caching options) passed to the Agent constructor are used during generation.
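As a rough illustration of how "relevant" inference_kwargs might be separated from unrelated ones, here is a small self-contained sketch. It does not use PBA's actual classes: GenerationSettings, resolve_settings, and the default values are hypothetical, and only the option names temp, seed, and max_tokens come from the text above.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class GenerationSettings:
    """Hypothetical container; the defaults are placeholders, not PBA's real defaults."""

    temp: float = 1.0
    seed: Optional[int] = None
    max_tokens: int = 1000


def resolve_settings(
    inference_kwargs: dict[str, Any],
) -> tuple[GenerationSettings, dict[str, Any]]:
    """Split kwargs into recognized generation settings and everything else."""
    known = {f.name for f in fields(GenerationSettings)}
    used = {k: v for k, v in inference_kwargs.items() if k in known}
    ignored = {k: v for k, v in inference_kwargs.items() if k not in known}
    return GenerationSettings(**used), ignored


if __name__ == "__main__":
    settings, ignored = resolve_settings({"temp": 0.7, "seed": 11, "unrelated": True})
    print(settings, ignored)
```

Returning the ignored keys instead of silently dropping them makes it easier to spot a misspelled option name.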
How it Works
- Loading: MLXInference uses mlx_proxy.utils.load_model to load the model and agent.llm.tokenizer.Tokenizer.load for the tokenizer.
- Inference Loop: The inference() method uses mlx_proxy.generate_step.generate_step, passing the engine.process_logits function to the logits_processors argument and a sampler created via engine.sample wrapping mlx_proxy.samplers.make_sampler.
- Caching: It utilizes mlx_proxy.cache.BaseCache for KV caching and implements the save_cache_to_file and load_cache_from_file methods using safetensors for persistent prompt caching.
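The interplay between a logits processor and a sampler in the inference loop can be sketched in plain Python. This is a toy under stated assumptions, not the mlx_proxy implementation: the functions below only borrow the names generate_step and make_sampler for readability, the (tokens, logits) processor signature is assumed, toy_model is a stand-in for a real model, and a fixed allow-list stands in for the grammar-driven masking a structuring engine would perform.

```python
import math
import random
from typing import Callable, Iterator, Sequence

Logits = list[float]
VOCAB_SIZE = 5


def toy_model(tokens: Sequence[int]) -> Logits:
    """Stand-in model: strongly prefers the token after the last one (mod vocab)."""
    logits = [0.0] * VOCAB_SIZE
    logits[(tokens[-1] + 1) % VOCAB_SIZE] = 5.0
    return logits


def make_allowed_token_processor(allowed: set[int]) -> Callable[[Sequence[int], Logits], Logits]:
    """Build a processor that masks every token outside the allowed set."""
    def process_logits(tokens: Sequence[int], logits: Logits) -> Logits:
        return [v if i in allowed else -math.inf for i, v in enumerate(logits)]
    return process_logits


def make_sampler(temp: float, rng: random.Random) -> Callable[[Logits], int]:
    """Greedy pick when temp is 0, otherwise temperature-scaled softmax sampling."""
    def sample(logits: Logits) -> int:
        peak = max(logits)
        if peak == -math.inf:
            raise ValueError("every token was masked out")
        if temp == 0:
            return logits.index(peak)  # first index holding the maximum
        # Masked entries (-inf) get weight exp(-inf) == 0.0 and are never drawn.
        weights = [math.exp((v - peak) / temp) for v in logits]
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]
    return sample


def generate_step(
    prompt: Sequence[int],
    model: Callable[[Sequence[int]], Logits],
    logits_processors: Sequence[Callable[[Sequence[int], Logits], Logits]],
    sampler: Callable[[Logits], int],
    max_tokens: int,
) -> Iterator[int]:
    """Yield tokens one at a time: model -> processors -> sampler."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        logits = model(tokens)
        for processor in logits_processors:
            logits = processor(tokens, logits)
        token = sampler(logits)
        tokens.append(token)
        yield token


if __name__ == "__main__":
    processor = make_allowed_token_processor({0, 2})
    sampler = make_sampler(0.0, random.Random(0))
    print(list(generate_step([1], toy_model, [processor], sampler, max_tokens=3)))
```

Because masking happens before sampling, the sampler never needs to know about the constraint; it only sees logits in which disallowed tokens already have zero probability.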
The MLX frontend provides an efficient way to run PBA locally on Apple Silicon hardware.