This runbook covers the stable single-node production profile: local CLI use, one API process, or an API plus SQLite-backed worker. Source documents, converted files, page manifests, cleaned outputs, and search indexes are private runtime data.
SQLite is the supported 1.0 storage backend. Librarian opens connections with WAL mode, foreign keys, synchronous=NORMAL, and a 5-second busy timeout. Use a persistent disk or volume and run migrations before starting API or worker processes when startup ordering matters.
librarian migrate
librarian admin db-check
librarian admin db-stats
librarian admin db-maintain
For long-lived databases, run maintenance during a quiet window. Use librarian admin db-maintain --vacuum only when compaction is worth the extra runtime.
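The connection settings named above (WAL mode, foreign keys, synchronous=NORMAL, and a 5-second busy timeout) can be reproduced with Python's standard sqlite3 module. This is a minimal sketch, not Librarian's actual connection code; the function name open_connection is hypothetical.

```python
import sqlite3


def open_connection(path: str) -> sqlite3.Connection:
    """Open a SQLite connection with the pragmas described in this runbook.

    Hypothetical helper for illustration; Librarian's own code may differ.
    """
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # SQLite does not enforce foreign keys unless asked per connection.
    conn.execute("PRAGMA foreign_keys=ON")
    # NORMAL syncs less often than FULL; commonly paired with WAL.
    conn.execute("PRAGMA synchronous=NORMAL")
    # Wait up to 5000 ms for a lock instead of failing immediately.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```

WAL mode needs a real file on a local filesystem, which is one reason a persistent disk or volume is recommended over network storage.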
Create both database-only and full workspace backups when operating the API with uploads:
librarian admin workspace-backup /backups/librarian-workspace-$(date +%Y%m%d%H%M%S).zip
librarian admin db-backup /backups/librarian-$(date +%Y%m%d%H%M%S).sqlite
Backup and restore paths must not be symlinks and must not pass through symlinked parent directories. Workspace backup skips symlinked files under the data directory so archives do not copy targets outside the workspace.
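The symlink rule for backup and restore paths can be illustrated with a small check that walks each path component below a trusted root. This is a sketch under assumptions: the function name crosses_symlink and the idea of a trusted root directory are hypothetical and not taken from Librarian's code.

```python
from pathlib import Path


def crosses_symlink(path: Path, root: Path) -> bool:
    """Return True if path, or any directory between root and path, is a symlink.

    root is treated as trusted and is not itself checked. Raises ValueError
    if path is not located under root.
    """
    root = root.absolute()
    relative = path.absolute().relative_to(root)
    current = root
    for part in relative.parts:
        current = current / part
        # is_symlink() inspects the link itself and does not follow it.
        if current.is_symlink():
            return True
    return False
```

Checking each component rather than calling resolve() keeps the original path visible in error messages and avoids silently following a link before the decision is made.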
Stop API and worker processes before restoring:
librarian admin db-restore /backups/librarian-20260522120000.sqlite --yes
librarian admin workspace-restore /backups/librarian-workspace-20260522120000.zip --yes
librarian admin db-check
Workspace restore rejects oversized manifests, duplicate archive paths, unsafe member paths, symlink archive members, excessive file counts, and archives whose expanded size exceeds the configured limit.
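Several of the restore checks listed above can be sketched with Python's zipfile module. The limits, the function name validate_archive, and the message strings are hypothetical; Librarian's real validation and configured limits may differ, and manifest size checks are not shown.

```python
import stat
import zipfile
from pathlib import PurePosixPath

MAX_FILES = 10_000              # hypothetical limit
MAX_EXPANDED_BYTES = 1024 ** 3  # hypothetical limit (1 GiB)


def validate_archive(archive_path, max_files=MAX_FILES,
                     max_expanded_bytes=MAX_EXPANDED_BYTES):
    """Return a list of problems found in a zip archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(archive_path) as zf:
        infos = zf.infolist()
    if len(infos) > max_files:
        problems.append("too many files")
    seen = set()
    total = 0
    for info in infos:
        name = info.filename
        # Absolute paths, parent references, and backslashes can escape
        # the extraction directory on some platforms.
        if name.startswith("/") or "\\" in name or ".." in PurePosixPath(name).parts:
            problems.append(f"unsafe path: {name}")
        if name in seen:
            problems.append(f"duplicate path: {name}")
        seen.add(name)
        # Unix file mode is stored in the upper 16 bits of external_attr.
        if stat.S_ISLNK(info.external_attr >> 16):
            problems.append(f"symlink member: {name}")
        # file_size is the size declared in the archive header; a stricter
        # check would also count bytes during extraction.
        total += info.file_size
    if total > max_expanded_bytes:
        problems.append("expanded size exceeds limit")
    return problems
```

Validating the full member list before extracting anything means a rejected archive leaves the workspace untouched.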
Markdown is the canonical structured conversion format. Built-in support covers .txt, .md, .csv, .json, .srt, .vtt, .docx, .pdf, and common OCR image formats. Optional MarkItDown support adds broader formats such as .pptx, .xlsx, .html, .rtf, .epub, and .xml.
Local conversion and import enforce configurable input limits before expensive parsing:
LIBRARIAN_MAX_SOURCE_BYTES
LIBRARIAN_TEXT_MAX_INPUT_BYTES
LIBRARIAN_DOCX_MAX_INPUT_BYTES
LIBRARIAN_PDF_MAX_INPUT_BYTES
LIBRARIAN_PDF_MAX_PAGES
LIBRARIAN_API_MAX_UPLOAD_BYTES
PDF extraction is page-aware. Embedded-text pages use embedded extraction; empty or scanned pages are OCRed. Long OCR jobs write <output>.pages.json manifests when conversion sidecars are enabled. These manifests record page status, source, OCR confidence, retry attempts, correction state, warning codes, and optional preserved page image paths.
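The idea of enforcing input limits before expensive parsing can be sketched as a size check driven by one of the environment variables listed above. This is an illustration only: the default value, the assumption that the variable holds a plain integer byte count, and the function names are all hypothetical.

```python
import os
from pathlib import Path

DEFAULT_MAX_SOURCE_BYTES = 50 * 1024 * 1024  # hypothetical default (50 MiB)


def max_source_bytes() -> int:
    """Read the limit from the environment, assuming an integer byte count."""
    raw = os.environ.get("LIBRARIAN_MAX_SOURCE_BYTES")
    return int(raw) if raw else DEFAULT_MAX_SOURCE_BYTES


def check_source_size(path: Path) -> None:
    """Raise ValueError if the file exceeds the limit; called before parsing."""
    size = path.stat().st_size
    limit = max_source_bytes()
    if size > limit:
        raise ValueError(
            f"{path.name}: {size} bytes exceeds limit of {limit} bytes"
        )
```

Because stat() reads only file metadata, the check costs almost nothing compared with opening a PDF or running OCR.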
Inspect page manifests without dumping raw page text:
librarian admin page-manifest ./out/report.md.pages.json --failures-only
librarian admin page-manifest ./out/report.md.pages.json --json --failures-only
API manifest inspection is available at GET /imports/page-manifest, constrained to LIBRARIAN_API_IMPORT_ROOT, and requires operational/write-scope credentials.
CLI users are trusted local operators. API callers are untrusted unless they present a configured API key. Public API binds require an API key and import root. Read-scoped keys can read documents and search results; write-scoped keys are required for operational endpoints such as config, metrics, audit, and page-manifest inspection.
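The read/write scope split described above can be modeled as a simple authorization check. This is a simplified sketch, not the API's actual logic: the Scope enum, the is_allowed function, and the /config and /audit paths are hypothetical (only /metrics and /imports/page-manifest appear in this runbook), and it treats every non-operational path as readable by any valid key.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    READ = "read"
    WRITE = "write"


# Hypothetical path prefixes for operational endpoints.
OPERATIONAL_PREFIXES = ("/config", "/metrics", "/audit", "/imports/page-manifest")


def is_allowed(scope: Optional[Scope], path: str) -> bool:
    """Decide whether a caller with the given key scope may access path.

    scope is None when no configured API key was presented.
    """
    if scope is None:
        # Callers without a configured key are untrusted.
        return False
    if path.startswith(OPERATIONAL_PREFIXES):
        # Operational endpoints require a write-scoped key.
        return scope is Scope.WRITE
    # Assumption: a write-scoped key can also perform reads.
    return True
```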
Generated sidecars, reports, and page manifests are internal metadata and must not be treated as corpus input. Recursive conversion and import skip Librarian-generated metadata to avoid self-ingestion.
Archive formats are rejected by default, and common archive signatures are rejected even when renamed. Unpack archives outside Librarian after organization-approved malware scanning, then import extracted files from a controlled directory.
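Rejecting archives by signature rather than by extension can be sketched by reading the first few bytes of a file and comparing them with well-known magic numbers. The function name detect_archive and the chosen set of formats are assumptions; Librarian's own list is not documented here. ZIP-based document formats such as .docx share the ZIP signature, so a real check would need to inspect the container contents before rejecting them.

```python
from pathlib import Path
from typing import Optional

# Well-known leading byte signatures for common archive formats.
ARCHIVE_SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"Rar!\x1a\x07": "rar",
    b"\xfd7zXZ\x00": "xz",
    b"BZh": "bzip2",
}


def detect_archive(path: Path) -> Optional[str]:
    """Return the archive type implied by the file's leading bytes, or None.

    The file extension is ignored, so a renamed archive is still detected.
    """
    with open(path, "rb") as handle:
        header = handle.read(8)
    for signature, kind in ARCHIVE_SIGNATURES.items():
        if header.startswith(signature):
            return kind
    return None
```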
Librarian does not log source text or generated document content. Persisted error strings are redacted and length-capped before status APIs expose them. JSON and text logging redact common credential patterns, bearer tokens, and sk-... provider keys.
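Redacting and length-capping an error string can be illustrated with two regular expressions and a truncation step. The patterns, the cap of 500 characters, and the function name redact are hypothetical; they show the general approach rather than the exact rules Librarian applies.

```python
import re

MAX_ERROR_CHARS = 500  # hypothetical cap

# Hypothetical patterns: bearer tokens and sk-... style provider keys.
_PATTERNS = [
    re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"),
    re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"),
]


def redact(message: str, limit: int = MAX_ERROR_CHARS) -> str:
    """Replace credential-like substrings, then cap the message length."""
    for pattern in _PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    if len(message) > limit:
        # Truncate after redaction so a secret is never cut in half
        # and left partially visible.
        message = message[:limit] + "..."
    return message
```

For example, redact("Authorization: Bearer abc.def-123 failed") returns "Authorization: [REDACTED] failed".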
Performance depends on provider, model, document type, OCR path, and concurrency settings. Record these values when comparing runs:
LIBRARIAN_LLM_MAX_CONCURRENCY
LIBRARIAN_OCR_PAGE_CONCURRENCY
LIBRARIAN_OCR_LLM_CORRECTION
LIBRARIAN_OCR_ROTATION_RETRY
/metrics OCR throughput, correction counts, queue wait, run-stage timing, and provider token usage
For large PDFs, measure conversion separately from processing:
time librarian convert ./large.pdf --format md --output ./large.md
time librarian ingest ./large.md
time librarian process doc_...
Run once with LIBRARIAN_OCR_LLM_CORRECTION=never to isolate extraction/OCR throughput, then run with the intended correction provider to measure final quality and cost.
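One way to record the settings listed above alongside timing results is to snapshot the relevant environment variables into a JSON file next to each run's output. This is a sketch; the function name snapshot_settings and the output layout are hypothetical and not part of the Librarian CLI.

```python
import json
import os

# Setting names taken from this runbook.
TRACKED_SETTINGS = (
    "LIBRARIAN_LLM_MAX_CONCURRENCY",
    "LIBRARIAN_OCR_PAGE_CONCURRENCY",
    "LIBRARIAN_OCR_LLM_CORRECTION",
    "LIBRARIAN_OCR_ROTATION_RETRY",
)


def snapshot_settings(output_path: str) -> dict:
    """Write the tracked settings to a JSON file and return them.

    Unset variables are recorded as None so runs that rely on defaults
    remain distinguishable from runs with explicit values.
    """
    snapshot = {name: os.environ.get(name) for name in TRACKED_SETTINGS}
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2, sort_keys=True)
    return snapshot
```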
Maintainers can run prompt, corpus, and throughput checks without exposing them as top-level user commands:
librarian maintainer eval examples/eval_cases.json --output eval-provider.json
librarian maintainer benchmark --paragraphs 40 --paragraph-chars 1000 --repeats 3 --output bench-provider.json
librarian maintainer corpus-eval examples/corpus_eval_cases.json --output-dir .librarian/corpus-eval --output corpus-eval-provider.json --overwrite
Do not commit provider outputs that contain private text.
Before tagging a stable release, run:
ruff check .
pyright
pytest
pip-audit --progress-spinner off --skip-editable
librarian doctor --strict
rm -rf dist
python -m build
docker build -t librarian-release-check .
The tag release workflow verifies tag/version alignment, changelog readiness, secret scanning, dependency audit, tests, type checking, wheel build, smoke installation, SBOM generation, checksums, distribution attestations, Docker build, image scan, image attestation, and GitHub release creation.
SQLite migrations live in src/librarian/storage/migrations and apply in lexical order. Each applied filename, for example 0006_add_field.sql, is recorded in schema_migrations.
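Applying migrations in lexical order while recording each filename can be sketched with the standard sqlite3 module. This is an illustration under assumptions: the function name apply_migrations and the single filename column in schema_migrations are hypothetical, and Librarian's real runner may track more metadata or wrap each file in a transaction.

```python
import sqlite3
from pathlib import Path


def apply_migrations(conn: sqlite3.Connection, migrations_dir: Path) -> list:
    """Apply unapplied .sql files in lexical filename order.

    Returns the filenames applied during this call.
    """
    # Hypothetical table layout: one row per applied filename.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (filename TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT filename FROM schema_migrations")}
    newly_applied = []
    # Zero-padded numeric prefixes make lexical order match numeric order.
    for path in sorted(migrations_dir.glob("*.sql"), key=lambda p: p.name):
        if path.name in applied:
            continue
        # executescript runs all statements in the file; it is not wrapped
        # in a single transaction in this sketch.
        conn.executescript(path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO schema_migrations (filename) VALUES (?)", (path.name,)
        )
        conn.commit()
        newly_applied.append(path.name)
    return newly_applied
```

Running the function a second time applies nothing, because every filename is already present in schema_migrations.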