MLX Frontend
The MLX frontend allows the Proxy Base Agent (PBA) to run local inference with Large Language Models on Apple Silicon (M-series chips) via Apple's MLX framework.
Overview
MLX is a NumPy-like array framework designed by Apple for efficient machine learning on Apple Silicon. The `agent.llm.frontend.mlx.MLXInference` class implements the `Frontend` interface for MLX models.
Key Features:

- Optimized Performance: Leverages Apple Silicon's unified memory architecture and Metal-accelerated GPU for fast local inference.
- Model Compatibility: Works with models converted to the MLX format (often available on Hugging Face hubs such as `mlx-community`).
- KV Caching: Supports efficient key-value caching, including saving and loading system prompt caches to disk for faster startup (`supports_reusing_prompt_cache()` returns `True`; see the sketch below).
- PSE Integration: Integrates seamlessly with the `StructuringEngine` for constrained generation during the `generate_step` process.
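For illustration, the snippet below sketches how the prompt-cache capability could be queried if the frontend were constructed directly. The constructor argument (a model path) is an assumption; in normal use `LocalInference` instantiates this class for you.

```python
# Hypothetical usage sketch: constructing the MLX frontend by hand and
# checking its prompt-cache capability. The constructor signature (a model
# path) is an assumption; the setup wizard normally handles this.
from agent.llm.frontend.mlx import MLXInference

frontend = MLXInference("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# The MLX frontend advertises support for persisting the system-prompt
# KV cache to disk between runs.
assert frontend.supports_reusing_prompt_cache()
```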
Usage
- Installation: Ensure you have the necessary MLX dependencies installed. This is typically handled by installing PBA with the `[mlx]` extra:

  ```bash
  pip install proxy-base-agent[mlx]
  # or
  uv pip install proxy-base-agent[mlx]
  ```

  You also need the `mlx-lm` package:

  ```bash
  pip install mlx-lm
  # or
  uv pip install mlx-lm
  ```

- Model Selection: During the PBA setup wizard (`python -m agent`), choose a model compatible with MLX (e.g., from the `mlx-community` hub on Hugging Face, or one you have converted locally). Select "MLX" as the inference backend when prompted.

- Configuration: The `LocalInference` class will automatically instantiate `MLXInference` when an MLX model path and the MLX frontend are selected. Relevant `inference_kwargs` (such as `temp`, `seed`, `max_tokens`, and caching options) passed to the `Agent` constructor will be used during generation (see the sketch after this list).
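As a hypothetical illustration of the configuration step, the sketch below passes `inference_kwargs` to the `Agent` constructor. Only the `inference_kwargs` options (`temp`, `seed`, `max_tokens`) are documented above; the `model` and `frontend` keyword names are assumptions, and the setup wizard remains the supported configuration path.

```python
# Hypothetical configuration sketch. The keyword names `model` and
# `frontend` are assumptions; `inference_kwargs` with temp/seed/max_tokens
# matches the options described above.
from agent import Agent  # import path assumed

agent = Agent(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",  # any MLX-format model
    frontend="mlx",  # select the MLX inference backend
    inference_kwargs={
        "temp": 0.7,        # sampling temperature
        "seed": 42,         # reproducible sampling
        "max_tokens": 1024, # generation cap
    },
)
```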
How it Works
- Loading: `MLXInference` uses `mlx_proxy.utils.load_model` to load the model and `agent.llm.tokenizer.Tokenizer.load` for the tokenizer.
- Inference Loop: The `inference()` method uses `mlx_proxy.generate_step.generate_step`, passing the `engine.process_logits` function to the `logits_processors` argument and a sampler created via `engine.sample` wrapping `mlx_proxy.samplers.make_sampler` (see the first sketch below).
- Caching: It uses `mlx_proxy.cache.BaseCache` for KV caching and implements the `save_cache_to_file` and `load_cache_from_file` methods using `safetensors` for persistent prompt caching (see the second sketch below).
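To make the inference loop concrete, here is a condensed sketch of how `generate_step` could be wired to the `StructuringEngine`. The exact signatures of `generate_step`, `make_sampler`, and `engine.sample` are assumptions modeled on the description above (and on `mlx-lm`'s similarly named API); consult the PBA and `mlx_proxy` sources for the authoritative code.

```python
# Condensed sketch of the constrained inference loop described above.
# The signatures of generate_step, make_sampler, and engine.sample are
# assumptions; they mirror the prose description, not the exact source.
from mlx_proxy.generate_step import generate_step
from mlx_proxy.samplers import make_sampler


def constrained_generate(model, tokenizer, engine, prompt_ids, max_tokens=512, temp=0.7):
    """Generate tokens while the StructuringEngine constrains the output."""
    base_sampler = make_sampler(temp)  # standard temperature sampling

    # The engine wraps the base sampler so it can steer token selection,
    # while process_logits masks tokens that would break the structure.
    def sampler(logprobs):
        return engine.sample(logprobs, base_sampler)

    tokens = []
    for step, (token, _logprobs) in enumerate(
        generate_step(
            prompt_ids,
            model,
            logits_processors=[engine.process_logits],
            sampler=sampler,
        )
    ):
        if step >= max_tokens or token == tokenizer.eos_token_id:
            break
        tokens.append(token)

    return tokenizer.decode(tokens)
```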
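Prompt-cache persistence can be pictured as the round trip below. The file-path argument is an assumption; the documentation above only names the `save_cache_to_file` and `load_cache_from_file` methods and the use of `safetensors` for storage.

```python
# Hypothetical prompt-cache round trip. The path argument is an assumption;
# only the method names and the safetensors-backed storage are documented.
from agent.llm.frontend.mlx import MLXInference

frontend = MLXInference("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# First run: after the system prompt has been processed, persist its KV cache.
frontend.save_cache_to_file("system_prompt_cache.safetensors")

# Later runs: restore the cache instead of re-processing the system prompt,
# which makes startup noticeably faster.
frontend.load_cache_from_file("system_prompt_cache.safetensors")
```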
The MLX frontend provides an efficient way to run PBA locally on Apple Silicon hardware.