Local models
Local inference is Piixie’s default. The app downloads a GGUF model on first launch and runs it through a llama-server process (from llama.cpp) that it manages for you: spawned in the background on startup, stopped when the app quits.
Available models
Section titled “Available models”| Model | Size | Parameters | Notes |
|---|---|---|---|
| Gemma 4 E4B | ~5.0 GB | 4.5B effective | Default. Fast, compact, good for everyday anonymization |
| Gemma 4 12B | ~7.1 GB | 12B | Higher quality, needs more memory and a faster machine |
Both are instruction-tuned Gemma 4 models quantized to Q4_K_M, downloaded from Hugging Face. Each ships with a vision projector (mmproj) file, so both models analyze rendered pages and embedded images directly. No OCR step.
Switching models
Section titled “Switching models”Open the model selector in settings. Models not yet downloaded show a download button with a progress bar; switching to an already-downloaded model restarts the local server with the new weights. Downloads are cached, so switching back and forth costs nothing after the first fetch.
The same selector also lists models from any remote endpoints you have configured, marked with their endpoint name. Local and remote models are interchangeable from the pipeline’s point of view.
Where models live
Section titled “Where models live”| Platform | Location |
|---|---|
| macOS | ~/Library/Application Support/Piixie/models/ |
| Windows | %APPDATA%\Piixie\models\ |
| Linux | ~/.config/Piixie/models/ |
Deleting a model file from this folder frees the disk space; Piixie will offer to re-download it next time it’s selected.
How inference runs
Section titled “How inference runs”llama-server exposes an OpenAI-compatible HTTP API on a local port that only Piixie talks to. Document text (and rendered images, when present) is sent to that local process, the model streams back a JSON object with PII mappings, and Piixie applies them. The whole loop stays on your machine.
Performance depends on hardware. Apple Silicon runs the E4B model comfortably; on older Intel machines, expect slower runs or consider the server setup to offload inference to a faster box.