Supported file formats

Piixie reads four file types and writes anonymized copies that keep the original format.

Extension | Extraction | Rewriting
.txt | Plain text | Line breaks preserved
.md | Markdown source | Markdown syntax untouched
.pdf | Text layer plus rendered pages | Text replaced in place, embedded images optionally redacted or blurred
.docx | Document XML | Styles, structure and metadata preserved (text only; embedded images pass through unchanged)

Anonymization happens as targeted substring substitutions, never a full rewrite. The model returns a list of exact original strings and their replacements; Piixie applies them to the original file in its native format. A DOCX keeps its styles and tables. A PDF keeps its layout. Markdown keeps its links and headings.
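The substitution step described above can be sketched in a few lines of Python. This is an illustration only, not Piixie's actual code: the function name `apply_substitutions` and the `(original, replacement)` pair format are assumptions. It does one pass over the text and tries longer originals first, so a full name is replaced before a first name contained in it, and replacement text is never scanned again.

```python
import re
from typing import Dict, List, Tuple


def apply_substitutions(text: str, pairs: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Apply exact-string replacements in one pass (hypothetical helper).

    Returns the rewritten text and the originals that never occurred in it.
    """
    mapping: Dict[str, str] = {orig: repl for orig, repl in pairs if orig}
    missing = [orig for orig in mapping if orig not in text]
    found = [orig for orig in mapping if orig in text]
    if not found:
        return text, missing
    # Longest originals first, so "Jane Doe" wins over "Jane" at the same position.
    found.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(orig) for orig in found))
    # Everything outside the matches is copied through unchanged.
    return pattern.sub(lambda m: mapping[m.group(0)], text), missing
```

Reporting the strings that were not found lets a caller notice when the model returned text that does not appear verbatim in the file.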

PDF handling is pure Go (no native dependencies), based on a conversion of the MuPDF library. Piixie extracts the text layer for detection and replaces matched strings within the PDF content streams.

Embedded images get their own pass. The model receives rendered images with page coordinates and can return bounding boxes for regions that contain PII, such as a scanned signature or a photo of an ID. Piixie then draws redaction boxes at those coordinates.
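As a rough illustration of the last step, the sketch below fills rectangular regions on a rendered page. Everything here is an assumption made for the example: the page is a grid of grayscale pixel values, boxes are `(x0, y0, x1, y1)` in pixel coordinates with exclusive upper bounds, and `draw_redactions` is a hypothetical name. The source does not specify Piixie's coordinate format.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixels, upper bounds exclusive.
Box = Tuple[int, int, int, int]


def draw_redactions(pixels: List[List[int]], boxes: List[Box], fill: int = 0) -> List[List[int]]:
    """Return a copy of the page with every box filled with a solid value."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = [row[:] for row in pixels]
    for x0, y0, x1, y1 in boxes:
        # Normalize corner order and clamp to the page, since a model may
        # return boxes that extend slightly past the edge.
        left, right = max(0, min(x0, x1)), min(width, max(x0, x1))
        top, bottom = max(0, min(y0, y1)), min(height, max(y0, y1))
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = fill
    return out
```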

If the selected model has no vision support and a PDF contains images, Piixie asks how to proceed: continue text-only, or auto-blur every embedded image without model involvement. You can also apply your choice to all remaining files in the queue.
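The prompt logic can be pictured as a small planning loop over the queue. This is a sketch under stated assumptions, not Piixie's implementation: `QueuedPdf`, `ImageFallback`, `plan_queue`, and the `ask` callback are all hypothetical names, and the action labels are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional, Tuple


class ImageFallback(Enum):
    TEXT_ONLY = "text_only"   # skip images, anonymize the text only
    AUTO_BLUR = "auto_blur"   # blur every embedded image, no model involved


@dataclass
class QueuedPdf:
    name: str
    has_images: bool


# ask(file) returns the user's choice and whether it applies to the rest of the queue.
AskFn = Callable[[QueuedPdf], Tuple[ImageFallback, bool]]


def plan_queue(files: Iterable[QueuedPdf], model_has_vision: bool, ask: AskFn) -> List[Tuple[str, str]]:
    """Decide, per file, how embedded images are handled (hypothetical sketch)."""
    plan: List[Tuple[str, str]] = []
    remembered: Optional[ImageFallback] = None
    for f in files:
        if not f.has_images:
            plan.append((f.name, "text_only"))      # nothing to decide
        elif model_has_vision:
            plan.append((f.name, "model_vision"))   # the model inspects images
        else:
            choice = remembered
            if choice is None:
                choice, apply_to_rest = ask(f)
                if apply_to_rest:
                    remembered = choice
            plan.append((f.name, choice.value))
    return plan
```

The user is only prompted for files that actually contain images, and never again once a choice has been applied to the rest of the queue.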

Word documents are processed through their underlying XML. Replacements land inside the runs that contained the original text, so character formatting (bold, color, font) survives, and document metadata (author, title) is anonymized along with the body. Embedded images are not currently analyzed or redacted in DOCX files; if a Word document contains sensitive images, export it to PDF first and process that.
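To show why formatting survives, here is a minimal sketch of a run-level replacement on WordprocessingML using only the Python standard library. It is an illustration, not Piixie's code: `replace_in_runs` is a hypothetical helper, it works on the XML string of the main document part, and it only handles text that sits entirely inside one `w:t` element. Text split across several runs would need extra handling that is omitted here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace used by WordprocessingML elements such as w:r and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def replace_in_runs(document_xml: str, pairs: List[Tuple[str, str]]) -> str:
    """Replace strings inside w:t elements, leaving run properties alone."""
    ET.register_namespace("w", W_NS)  # keep the familiar "w:" prefix on output
    root = ET.fromstring(document_xml)
    for t in root.iter(f"{{{W_NS}}}t"):
        if not t.text:
            continue
        for original, replacement in pairs:
            # Only the text node changes; the sibling w:rPr element that
            # carries bold, color, and font is never touched.
            t.text = t.text.replace(original, replacement)
    return ET.tostring(root, encoding="unicode")
```

Because only the text node inside each run is edited, a bold name stays bold after its replacement.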

A document is processed in a single model pass, so the practical size limit is the model’s context window. Very long documents can exceed it and fail or come back incomplete; if that happens, split the file and process the parts, or switch to a model with a larger context via a remote endpoint.
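If you need to split a plain-text or Markdown file by hand, a sketch like the following keeps paragraphs together where it can. The character budget is an assumption standing in for the model's real token limit, which depends on the tokenizer, and `split_for_context` is a hypothetical helper rather than a Piixie feature.

```python
from typing import List


def split_for_context(text: str, max_chars: int) -> List[str]:
    """Split text into parts of at most max_chars, breaking at blank lines when possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    parts: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            parts.append(current)
            current = ""
        # A single paragraph longer than the budget is cut at the limit.
        while len(para) > max_chars:
            parts.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        parts.append(current)
    return parts
```

Each part can then be saved as its own file and processed separately. A name that is cut at a hard split point may be missed, so choose a budget that keeps such cuts rare.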

No OCR pass needed. The local Gemma models support vision, so a scanned PDF with no text layer is analyzed from its rendered pages directly. This avoids the classic OCR failure modes: misread characters, flattened multi-column layouts, and table structures that fall apart.

The Anonymization Editor renders each format for review: PDFs are shown page by page with a selectable text layer, Word documents as a reflowed HTML reading view, and plain text as-is. The reflowed Word view is for reviewing content only; the saved .docx keeps its original layout. Everything renders locally; nothing is sent anywhere to display a document.