Local models

Local inference is Piixie’s default. The app downloads a GGUF model on first launch and runs it through a llama-server process (from llama.cpp) that it manages for you: spawned in the background on startup, stopped when the app quits.

Available models

Model	Size	Parameters	Notes
Gemma 4 E4B	~5.0 GB	4.5B effective	Default. Fast, compact, good for everyday anonymization
Gemma 4 12B	~7.1 GB	12B	Higher quality, needs more memory and a faster machine

Both are instruction-tuned Gemma 4 models quantized to Q4_K_M, downloaded from Hugging Face. Each ships with a vision projector (mmproj) file, so both models analyze rendered pages and embedded images directly. No OCR step.

Switching models

Open the model selector in settings. Models not yet downloaded show a download button with a progress bar; switching to an already-downloaded model restarts the local server with the new weights. Downloads are cached, so switching back and forth costs nothing after the first fetch.

The same selector also lists models from any remote endpoints you have configured, marked with their endpoint name. Local and remote models are interchangeable from the pipeline’s point of view.

Where models live

Platform	Location
macOS	`~/Library/Application Support/Piixie/models/`
Windows	`%APPDATA%\Piixie\models\`
Linux	`~/.config/Piixie/models/`

Deleting a model file from this folder frees the disk space; Piixie will offer to re-download it next time it’s selected.

How inference runs

llama-server exposes an OpenAI-compatible HTTP API on a local port that only Piixie talks to. Document text (and rendered images, when present) is sent to that local process, the model streams back a JSON object with PII mappings, and Piixie applies them. The whole loop stays on your machine.

Performance depends on hardware. Apple Silicon runs the E4B model comfortably; on older Intel machines, expect slower runs or consider the server setup to offload inference to a faster box.