model: at a local server and your data never
leaves the machine. This page is the ladder, from the one-command path to
production servers.
1. The one-command path: Ollama
Install Ollama, then:nika doctor confirms the wiring (it detects the running Ollama server and
prints the exact fix if something is off).
2. Any Hugging Face GGUF, still one command
Ollama pulls directly from the Hugging Face Hub. Any public GGUF repo works, no account needed:Q4_K_M is the sane default (quality per
GB), Q8_0 when you have RAM to spare, Q2/Q3 only when memory is tight.
The Hub’s GGUF filter lists
every compatible repo.
3. LM Studio: the visual browser
Prefer a GUI? LM Studio browses Hugging Face, downloads models with a click, and serves an OpenAI-compatible endpoint. Start its server, then:4. Servers: llama.cpp and vLLM
For shared machines and production:- llama.cpp
llama-serverserves any GGUF:model: llamacpp/<model> - vLLM serves full-precision Hub models at
datacenter throughput:
model: vllm/<hub-repo-id>
Checking what’s wired
Swapping between local and cloud
The file does not change shape. One line moves the workflow between a laptop and an API:The permits boundary applies either way: a
permits: block with no
net.http entry means the workflow cannot reach the network even if the
model could. Local model + closed permits = fully air-gapped AI work.Read next
Providers
The full catalog and how one InferRequest speaks every dialect.
First workflow
Five minutes from install to a checked, runnable file.