One envelope schema, every upstream eval tool, one public HF dataset.
mlx-benchmarks is the result-envelope contract and publisher for benchmarking MLX-quantized and locally-hosted LLMs on Apple Silicon. It is the thin glue between upstream evaluation tools (lm-eval, vllm benchmark_serving, agent-framework harnesses) and a single public HuggingFace dataset, with a Gradio viewer on top.
What it does
- Defines envelope v1 in `schema.json`, the authoritative, versioned contract every published shard validates against.
- Provides `mlx-bench-publish`, a CLI that converts raw tool output into the envelope, validates it, and uploads to the HF dataset with content-addressed filenames (`data/run-<timestamp>-<git_sha>-<suite>-<model_slug>.parquet`).
- Owns converters for `lm-eval`, `vllm benchmark_serving`, and framework-eval (OpenAI / Qwen-Agent / smolagents / ADK).
- Auto-detects runtime metadata (OS, chip, memory, Python, MLX, lm-eval versions) via `detect_system()` so envelopes are fully reproducible without hand-curation.
- Deploys a Gradio viewer to HF Spaces on every `main` push touching `space/`.
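Two of the bullets above, runtime metadata detection and shard naming, can be illustrated with the standard library alone. The following is a minimal sketch, not the project's implementation: `detect_system_sketch`, `package_version`, `slugify`, and `shard_filename` are hypothetical names, the dictionary keys are not taken from `schema.json`, and the timestamp format and slug rule are assumptions. Chip and memory detection are left out because they need platform-specific calls.

```python
import platform
import re
from datetime import datetime, timezone
from importlib import metadata
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def detect_system_sketch() -> dict:
    """Collect basic runtime metadata (hypothetical keys, not the schema.json fields)."""
    return {
        "os": platform.system(),
        "os_version": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "mlx": package_version("mlx"),          # None when MLX is not installed
        "lm_eval": package_version("lm_eval"),  # None when lm-eval is not installed
    }


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def shard_filename(when: datetime, git_sha: str, suite: str, model: str) -> str:
    """Fill the data/run-<timestamp>-<git_sha>-<suite>-<model_slug>.parquet pattern.

    The compact UTC timestamp format is an assumption made for this sketch.
    """
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"data/run-{timestamp}-{git_sha}-{slugify(suite)}-{slugify(model)}.parquet"


# Example usage with made-up inputs.
print(detect_system_sketch())
print(shard_filename(datetime.now(timezone.utc), "abc1234", "reasoning", "example-org/Example-Model 4bit"))
```

Returning `None` for a missing package, rather than raising, keeps the metadata step usable on machines that have only part of the stack installed.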
How it fits
| Feeds into | Consumes |
|---|---|
| HF dataset, HF Space viewer | nix-ai (vllm-mlx, llama-swap), lm-eval, vllm, agent-framework SDKs |
Getting started
Bring up the inference stack
From the nix-darwin flake: `darwin-rebuild switch --flake .`. This starts vllm-mlx + llama-swap on `localhost:11434` via nix-ai. Or run `vllm-mlx serve` directly if you're not on the Nix stack.
Install and authenticate
`git clone https://github.com/JacobPEvans/mlx-benchmarks && cd mlx-benchmarks && uv sync`. Then `export HF_TOKEN=...` with write scope on the dataset namespace.
Publish (dry-run first)
`.venv/bin/mlx-bench-publish ./run-output/<model-dir>/results_*.json --kind lm-eval --suite reasoning --dry-run` validates the envelope locally against `schema.json`. Drop `--dry-run` to push to the HF dataset.
View results
Open the HF Space viewer, which auto-loads every published shard. Or run `cd space && python app.py` for a local copy.
Related repos
nix-ai
Packages the inference stack: vllm-mlx LaunchAgent, llama-swap, MLX module derivations. Where models actually run.
nix-darwin
macOS host config. Composes nix-ai into the system flake so benchmarks have a reproducible environment.
ai-assistant-instructions
Model routing + permission policy. Tells AI clients which models to benchmark.
Source on GitHub
Schema, publisher, converters, full README, `docs/architecture.md`.