Supported file formats
Piixie reads four file types and writes anonymized copies that keep the original format.
| Extension | Extraction | Rewriting |
|---|---|---|
| .txt | Plain text | Line breaks preserved |
| .md | Markdown source | Markdown syntax untouched |
| .pdf | Text layer plus rendered pages | Text replaced in place, embedded images optionally redacted or blurred |
| .docx | Document XML | Styles, structure and metadata preserved (text only; embedded images pass through unchanged) |
Format preservation
Anonymization happens as targeted substring substitutions, never a full rewrite. The model returns a list of exact original strings and their replacements; Piixie applies them to the original file in its native format. A DOCX keeps its styles and tables. A PDF keeps its layout. Markdown keeps its links and headings.
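The substitution idea can be illustrated with a minimal sketch. This is not Piixie's implementation: the function name `apply_replacements`, the `(original, replacement)` pair shape, and the placeholder labels are assumptions made for illustration only.

```python
def apply_replacements(text: str, replacements: list[tuple[str, str]]) -> str:
    """Apply exact-substring substitutions and leave all other text untouched.

    Hypothetical helper: the (original, replacement) pair format is an
    assumption, not Piixie's documented data model.
    """
    # Handle longer originals first, so "Jane Doe" is replaced before "Jane".
    ordered = sorted(replacements, key=lambda pair: len(pair[0]), reverse=True)
    for original, replacement in ordered:
        if not original:
            continue  # an empty match would insert text between every character
        text = text.replace(original, replacement)
    return text


source = "# Report\n\nContact [Jane Doe](mailto:jane@example.com) for details.\n"
pairs = [("Jane Doe", "PERSON_1"), ("jane@example.com", "EMAIL_1")]
result = apply_replacements(source, pairs)
print(result)
```

Because only the matched substrings change, the Markdown heading and link syntax in `source` come through intact. A sequential approach like this assumes no replacement string contains another original string; a production version would need to guard against that.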
PDF specifics
PDF handling is pure Go (no native dependencies), based on a conversion of the MuPDF library. Piixie extracts the text layer for detection and replaces matched strings within the PDF content streams.
Embedded images get their own pass. The model receives rendered images with page coordinates and can return bounding boxes for regions that contain PII, such as a scanned signature or a photo of an ID. Piixie then draws redaction boxes at those coordinates.
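As a rough illustration of the coordinate step, the sketch below converts a bounding box from rendered-page pixels into PDF points. It rests on assumptions that this page does not confirm: that boxes arrive as pixel coordinates with a top-left origin, and that the page was rendered at a known DPI. The `Box` type and `pixel_box_to_pdf` function are hypothetical names.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # size of the default PDF user space unit


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


def pixel_box_to_pdf(box: Box, dpi: float, page_height_pt: float) -> Box:
    """Map a top-left-origin pixel box to a bottom-left-origin box in points.

    Assumption: the box format and render DPI are illustrative, not Piixie's
    actual interface.
    """
    scale = POINTS_PER_INCH / dpi
    left, right = sorted((box.x0 * scale, box.x1 * scale))
    # Default PDF user space has its origin at the bottom-left, so flip the y axis.
    bottom, top = sorted(
        (page_height_pt - box.y0 * scale, page_height_pt - box.y1 * scale)
    )
    return Box(left, bottom, right, top)


# A signature region on a US Letter page (792 pt tall) rendered at 144 DPI.
redaction = pixel_box_to_pdf(Box(100, 200, 300, 260), dpi=144, page_height_pt=792)
print(redaction)
```

The returned rectangle is where a redaction box would be drawn under these assumptions. Pages with rotation or a non-default crop box would need an extra transform.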
If the selected model has no vision support and a PDF contains images, Piixie asks how to proceed: continue text-only, or auto-blur every embedded image without model involvement. You can also apply the same choice to all remaining files in the queue.
DOCX specifics
Word documents are processed through their underlying XML. Replacements land inside the runs that contained the original text, so character formatting (bold, color, font) survives, and document metadata (author, title) is anonymized along with the body. Embedded images are not currently analyzed or redacted in DOCX files; if a Word document contains sensitive images, export it to PDF first and process that.
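To show why replacing text inside a run keeps its formatting, here is a minimal sketch that operates on a WordprocessingML fragment using Python's standard library. It is an illustration under assumptions, not Piixie's code: `replace_in_runs` is a hypothetical helper, and it only handles strings that sit entirely within one `w:t` element.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the conventional "w:" prefix on output


def replace_in_runs(document_xml: str, replacements: dict[str, str]) -> str:
    """Replace strings inside w:t text nodes without touching run properties.

    Hypothetical helper: text split across several runs is not handled here.
    """
    root = ET.fromstring(document_xml)
    for text_node in root.iter(f"{{{W_NS}}}t"):
        if text_node.text is None:
            continue
        for original, replacement in replacements.items():
            text_node.text = text_node.text.replace(original, replacement)
    return ET.tostring(root, encoding="unicode")


# Two runs: plain text, then a bold run (w:b) holding the name.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Signed by </w:t></w:r>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Jane Doe</w:t></w:r>"
    "</w:p></w:body></w:document>"
)
out = replace_in_runs(SAMPLE, {"Jane Doe": "PERSON_1"})
print(out)
```

The bold run keeps its `w:rPr` element because only the text node changes. In a real .docx this XML lives in `word/document.xml` inside the zip container, and a name that Word has split across several runs would need matching across run boundaries.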
Large documents
A document is processed in a single model pass, so the practical size limit is the model’s context window. Very long documents can exceed it and fail or come back incomplete; if that happens, split the file and process the parts, or switch to a model with a larger context via a remote endpoint.
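For plain text or Markdown, splitting by hand can follow paragraph boundaries so that no sentence is cut in half. The sketch below is one possible approach, not a Piixie feature: `split_by_paragraphs` is a hypothetical helper, and the character budget is only a crude stand-in for a model's token limit.

```python
def split_by_paragraphs(text: str, max_chars: int) -> list[str]:
    """Group blank-line-separated paragraphs into parts of at most max_chars.

    Assumption: character count approximates the context budget. A single
    paragraph longer than max_chars is kept whole in its own part.
    """
    parts: list[str] = []
    current: list[str] = []
    size = 0
    for paragraph in text.split("\n\n"):
        # Account for the "\n\n" separator when the part already has content.
        extra = len(paragraph) + (2 if current else 0)
        if current and size + extra > max_chars:
            parts.append("\n\n".join(current))
            current, size = [], 0
            extra = len(paragraph)
        current.append(paragraph)
        size += extra
    if current:
        parts.append("\n\n".join(current))
    return parts


document = "aaaa\n\nbbbb\n\ncccc"
chunks = split_by_paragraphs(document, max_chars=10)
print(chunks)
```

Joining the parts with a blank line reproduces the original text, so the anonymized parts can be reassembled afterwards. Note that each part is processed independently, so replacements may need to be reconciled across parts.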
Scans and image-backed documents
No OCR pass needed. The local Gemma models support vision, so a scanned PDF with no text layer is analyzed from its rendered pages directly. This avoids the classic OCR failure modes: misread characters, flattened multi-column layouts, and table structures that fall apart.
Viewing in the editor
The Anonymization Editor renders each format for review: PDFs are shown page by page with a selectable text layer, Word documents as a reflowed HTML reading view, and plain text as-is. The reflowed Word view is for reviewing content only; the saved .docx keeps its original layout. Everything renders locally; nothing is sent anywhere to display a document.