Skip to content

Run Piixie as a local server

LLM inference is the heavy part of anonymization. A laptop can run Piixie’s local model, but a 5 GB model doing vision analysis on a long PDF will keep its fans busy. Server mode moves that burden to one machine you choose, typically a Linux box with a GPU, while the desktop UI stays exactly the same.

The document workflow is unchanged: workstations submit documents to the Piixie server over your local network, the server runs the model and returns the mappings, and the desktop app writes the anonymized copy. Nothing routes through the internet.

  • Team members on laptops that can’t comfortably run a local model
  • High document volumes where one fast GPU beats ten slow CPUs
  • Centralized control: one machine to provision, monitor, and audit
  • Air-gapped or compliance-bound networks where external APIs are off the table but a shared internal server is fine

On the machine that will host the model (GPU recommended; a recent NVIDIA card or Apple Silicon makes a large difference):

Terminal window
piixie server \
--host 0.0.0.0 \
--port 8787

The server downloads its model on first run, the same Gemma 4 models the desktop app uses, and exposes an OpenAI-compatible chat API on the given port. Pick the larger 12B model for better quality if the GPU has the memory:

Terminal window
piixie server \
--host 0.0.0.0 \
--port 8787 \
--model gemma-4-12b

Keep --host 127.0.0.1 if the server should only accept local connections (for example behind a reverse proxy that adds TLS and authentication).

On each workstation, add the server as an endpoint:

  1. Open settings and go to AI endpoints.
  2. Add an endpoint with the OpenAI chat protocol.
  3. Set the base URL to your server, for example http://gpu-box.local:8787 (no /v1 suffix; Piixie appends API paths itself). The API key field can stay empty for self-hosted endpoints.
  4. Piixie queries the endpoint for available models; enable the one the server is hosting.
  5. Pick that model in the model selector.

From then on, anonymization runs on the server. The UI behaves identically: same modes, same profiles, same streaming progress, same history. Switching back to the bundled local model is one click in the same selector.

The configuration details are the same as for any remote endpoint; see Remote endpoints.

In server mode, the document text (and rendered images, for vision analysis) is sent to the Piixie server over the LAN. It is not stored there; the server holds documents in memory for the duration of inference. The anonymized output, the history database, and the replacement tables all stay on the workstation. If the network between workstation and server isn’t trusted, put the server behind TLS.

  • The default Gemma 4 E4B model needs roughly 6 GB of memory headroom; the 12B model needs more, around 9 GB.
  • GPU memory is the limit that matters. A card that fits the whole model in VRAM gives interactive speeds; spillover to CPU is functional but slow.
  • One server handles requests from multiple workstations; documents queue when the server is busy.