Librarian

Drop in messy documents. Get back a clean, classified, searchable library.

Librarian is a local-first parser + copy-editor + librarian in one. Hand it transcripts, PDFs, DOCX, images, or scans; it extracts the text at near-commercial fidelity, cleans it with an LLM to Chicago-Manual style without inventing or dropping a single fact, files every document under a Dewey-style classification, and makes the whole collection full-text searchable. Runs as a native Mac app, a scriptable CLI, and a FastAPI service — all on the same engine, all on your own machine.

⬇️ Download the Mac app · 🚀 Quick start · ⌨️ CLI reference · 🔌 API · 🏛️ Architecture

This is not just a PDF-to-text converter. Plenty of tools turn a PDF into a wall of text. Librarian’s job starts after extraction: it copy-edits the result, gives it a clean title and a Dewey number, writes an 80–100 word synopsis and metadata tags, and drops it into a searchable, exportable library. The extractor is best-in-class; the clean-up and organization are what make it Librarian.

Version 1.7.1 is the stable production release. Everything runs locally by default — source files and generated outputs live in a SQLite-backed workspace on your disk, and text leaves your machine only when you point cleaning, classification, or OCR-correction at an external model provider.

What it is

| Surface | What you get | Best for |
| --- | --- | --- |
| 🖥️ Mac app | Drag files into a window. Live progress, full-text search, one-click Markdown export. The entire engine, Python runtime, and OCR tools are inside the download. | Just using it. Zero terminal. |
| ⌨️ CLI | The whole pipeline as composable commands; every query command speaks `--json`. | Scripting, automation, bulk corpora. |
| 🔌 API | A local FastAPI service with the same engine behind an HTTP surface. | Wiring Librarian into other tools/agents. |

All three run the same engine and the same local SQLite library.

⬇️ Install in 60 seconds

The Mac app (no terminal, nothing to set up)

1. Download Librarian-AppleSilicon.dmg (M-series) or Librarian-Intel.dmg (Intel). If a direct link doesn’t resolve, grab the DMG from the latest release assets.
2. Open the DMG and drag Librarian to Applications. First launch: right-click → Open once to clear Gatekeeper.
3. Drop files anywhere in the window.

The app bundles the high-fidelity extraction engine and its OCR — scanned PDFs are read fully offline, no Homebrew, no PATH setup, no first-run downloads. See apps/macos for data locations, model-provider setup, and how it’s built.

The CLI / API (Python 3.12+)

python -m venv .venv && source .venv/bin/activate
pip install "nampara-librarian[all]"      # [all] pulls every optional capability
librarian doctor                          # confirm what's available

From a release wheel: pip install "nampara_librarian-1.7.1-py3-none-any.whl[all]" · From a checkout: pip install -e ".[dev,all]"

🚀 Quick start (60 seconds)

librarian init                                  # create a local workspace (./.librarian)
librarian import ./my-documents --recursive --process   # convert → clean → classify everything
librarian list                                  # see what landed, with Dewey codes + titles
librarian search "canter transitions" --details # full-text search across the library
librarian show doc_1a2b3c4d                      # one document's metadata + synopsis
librarian export doc_1a2b3c4d --format md --output clean.md

That’s the whole loop: import → search → export. By default everything runs with a built-in mock model (no network, deterministic), so you can try the mechanics instantly. Point it at a real model when you want real cleaning and classification:

export LIBRARIAN_LLM_PROVIDER=openai-compatible
export LIBRARIAN_LLM_MODEL=gpt-4.1-mini
export OPENAI_API_KEY=sk-...
librarian import ./input --recursive --format md --process

🧠 What actually happens to a document

Each file flows through five stages — and you can stop at any of them:

1. Extract: PDFs, DOCX, images, transcripts, and 20+ formats → clean Markdown. Tables, headings, lists, and figures are reconstructed; only the pages that need OCR get it.
2. Clean: an LLM copy-edits the Markdown to Chicago-Manual-of-Style prose, fixing OCR noise, line-break artifacts, and spacing without summarizing, reordering, or inventing. Source fidelity is validated, not assumed.
3. Classify: a Dewey-style code, a human title, an 80–100 word synopsis, and metadata tags. Recurring publications are linked into series across editions.
4. Search: everything is indexed for fast full-text search with facets and citation lookup.
5. Export: single documents as Markdown/JSON/text, or the whole library as an Open Knowledge Format bundle for handing to another agent or knowledge tool.
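
The Clean stage promises copy-editing without inventing or dropping facts, and says fidelity is validated. This README doesn't describe how, so the sketch below is only a guess at one kind of check such a validator could include: comparing the numbers that appear before and after cleaning. `numeric_tokens` and `fidelity_report` are hypothetical names, not part of Librarian.

```python
import re
from collections import Counter

# Matches number-like tokens such as 1999, 3.5 or 1,200.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numeric_tokens(text: str) -> Counter:
    """Count every number-like token in the text."""
    return Counter(NUMBER.findall(text))


def fidelity_report(source: str, cleaned: str) -> dict:
    """Compare the numbers in the source with those in the cleaned text.

    'dropped' lists numbers that occur more often in the source;
    'invented' lists numbers that occur more often in the cleaned text.
    """
    before, after = numeric_tokens(source), numeric_tokens(cleaned)
    return {
        "dropped": sorted((before - after).elements()),
        "invented": sorted((after - before).elements()),
    }
```

A copy-edit that only repairs spacing and line breaks leaves both lists empty; a changed figure shows up in both. A real validator would presumably look at much more than numbers (names, dates, ordering), so treat this as an illustration of the idea only.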

📄 The extraction engine

With the liteparse extra (included in [all]), Librarian extracts PDFs and images with liteparse (Apache-2.0) — reconstructed Markdown tables, headings, lists, and figure placeholders, with selective OCR and its own bundled PDFium + Tesseract (no poppler needed). The built-in pdfplumber + Tesseract path stays as a per-document fallback. Standalone images (PNG/JPG/scans) get the same treatment: Librarian converts them to a one-page PDF with Pillow (oriented upright first) and runs them through liteparse — no ImageMagick required.

| Capability | How to turn it on | What it does |
| --- | --- | --- |
| Engine select | `LIBRARIAN_PDF_ENGINE=auto\|liteparse\|legacy` | `auto` (default) uses liteparse when installed, else built-in. |
| Offline OCR data | `LIBRARIAN_LITEPARSE_TESSDATA_PATH=/path/to/tessdata` | Point liteparse’s OCR at local language data (the Mac app does this for you). |
| Higher-accuracy OCR | `LIBRARIAN_LITEPARSE_OCR_SERVER_URL=...` | Offload OCR to a Surya/EasyOCR/PaddleOCR server. |
| Figure → data (vision) | `LIBRARIAN_FIGURE_VISION_ENABLED=true` | A vision model describes each figure and reconstructs chart data as a Markdown table, injected next to the figure so the numbers become searchable text. |
| Extraction cache | on by default | Re-ingesting unchanged files skips re-extraction (keyed by content hash + engine/OCR config). |
| Parallel imports | `LIBRARIAN_IMPORT_CONCURRENCY=N` (default 2) | Convert/ingest several files at once; order, resume, and per-file failure isolation preserved. |
| Extraction timeout | `LIBRARIAN_EXTRACTION_TIMEOUT_SECONDS=N` | Bound a single document’s extraction so one pathological file can’t hang a batch. |
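
The extraction cache is described as keyed by content hash plus engine/OCR config. As a rough illustration of what such a key could look like, here is a minimal sketch; the function name, the choice of SHA-256, and the shape of `ocr_config` are all assumptions, not Librarian's actual scheme.

```python
import hashlib
import json


def extraction_cache_key(content: bytes, engine: str, ocr_config: dict) -> str:
    """Derive a cache key from the file bytes and the extraction settings."""
    # Hash the file content on its own first, so renaming or moving
    # a file does not change the key.
    content_hash = hashlib.sha256(content).hexdigest()
    # Serialize the settings with sorted keys so that dict ordering
    # never produces a different key for the same configuration.
    payload = json.dumps(
        {"content": content_hash, "engine": engine, "ocr": ocr_config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key like this, an unchanged file re-imported under the same settings hits the cache, while switching engines or OCR options forces a fresh extraction.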

OCR system tools (CLI/API only — the Mac app bundles these)

The built-in OCR fallback needs two system binaries on your PATH:

brew install tesseract poppler                       # macOS
sudo apt-get install -y tesseract-ocr poppler-utils  # Debian/Ubuntu

Without them, text-layer PDFs still work; scanned pages can’t be read by the fallback path. Run librarian doctor to see exactly what’s available.

Rotated scans are handled automatically: sideways or upside-down images and pages are detected with Tesseract’s orientation detection and rotated upright before OCR, so they yield real text instead of garbage (it only rotates when detection is confident, never flipping a correct page). On by default; set LIBRARIAN_OCR_AUTO_ORIENT=false to disable.
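
To make the "only rotates when detection is confident" rule concrete, here is a small sketch that reads a Tesseract orientation (OSD) report and decides whether to rotate. The threshold of 2.0 and the function name are assumptions for illustration; the README doesn't state the cutoff Librarian uses.

```python
import re


def rotation_from_osd(osd_text: str, min_confidence: float = 2.0) -> int:
    """Return the rotation Tesseract suggests, or 0 when it is unsure.

    Expects the plain-text OSD report, which includes lines such as
    'Rotate: 90' and 'Orientation confidence: 5.31'.
    """
    rotate = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    confidence = re.search(
        r"^Orientation confidence:\s*([\d.]+)", osd_text, re.MULTILINE
    )
    if rotate is None or confidence is None:
        return 0  # Incomplete report: leave the page untouched.
    if float(confidence.group(1)) < min_confidence:
        return 0  # Low confidence: never flip a page that may be correct.
    return int(rotate.group(1)) % 360
```

With pytesseract installed, the report text would typically come from `pytesseract.image_to_osd(image)`.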

⌨️ CLI reference (every command)

All query/control commands accept --json for machine-readable output, so an agent can drive the whole pipeline without scraping tables. Run any command with --help for its full flag set.
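
Because `--json` is available on query commands, a script can call the CLI and parse the result instead of scraping tables. A minimal sketch follows; the helper names are hypothetical, and the shape of the returned JSON is not documented here, so the function simply hands back whatever was decoded.

```python
import json
import subprocess
from typing import Any, Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def _run(argv: Sequence[str]) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def librarian_json(*args: str, runner: Runner = _run) -> Any:
    """Run `librarian <args> --json` and decode the JSON it prints."""
    return json.loads(runner(["librarian", *args, "--json"]))
```

For example, `librarian_json("search", "canter transitions")` would run the search command from the quick start with `--json` appended. The `runner` parameter exists only so the wrapper can be exercised without the CLI installed.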

Setup & health

Convert (no database, file → file)

Ingest, clean & classify

| Command | What it does |
| --- | --- |
| `librarian import ./folder --recursive --process` | The big one: convert → ingest → (optionally) clean+classify a whole tree. `--manifest <path> --resume` makes huge imports idempotent; `--report report.json` writes a full run report; exits non-zero if anything failed. |
| `librarian ingest transcript.txt` | Ingest a single file and persist its extracted text. |
| `librarian process doc_...` | Run cleaning + classification on an ingested document. |
| `librarian worker --once` | Drain the durable SQLite job queue (for `--process-deferred` imports). |
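
For scheduled or unattended imports, the manifest, resume, and report flags can be combined with an exit-code check. The sketch below assumes those flags compose with `--recursive --process` in a single invocation, which the table implies but does not spell out; the function names and file paths are placeholders.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def import_command(source: Path, manifest: Path, report: Path) -> list[str]:
    """Build the argv for a resumable import that writes a run report."""
    return [
        "librarian", "import", str(source),
        "--recursive", "--process",
        "--manifest", str(manifest), "--resume",
        "--report", str(report),
    ]


def run_import(source: Path, manifest: Path, report: Path) -> bool:
    """Run the import; True means the command exited with status 0."""
    completed = subprocess.run(import_command(source, manifest, report))
    # The README says the command exits non-zero if anything failed.
    return completed.returncode == 0
```

When `run_import` returns `False`, the report file is the place to look for what went wrong; rerunning the same command should pick up where the manifest left off.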

Browse, search & export

Run the service

`librarian admin …` — operator & storage

`librarian maintainer …` — quality & release harness (source checkouts only)

⚙️ Configuration

Everything is configured by LIBRARIAN_* environment variables (or a .env file in the workspace). The essentials:

| Variable | Default | What it controls |
| --- | --- | --- |
| `LIBRARIAN_DATA_DIR` | `.librarian` | Where the workspace + SQLite database live. |
| `LIBRARIAN_LLM_PROVIDER` | `mock` | `mock` (offline, deterministic) or `openai-compatible`. |
| `LIBRARIAN_LLM_MODEL` | `mock-cleaner` | Model name for cleaning + classification. |
| `LIBRARIAN_LLM_BASE_URL` | – | Base URL for an OpenAI-compatible endpoint. |
| `OPENAI_API_KEY` | – | API key (env-var name configurable via `LIBRARIAN_LLM_API_KEY_ENV`). |
| `LIBRARIAN_LLM_MAX_CONCURRENCY` | `8` | Parallel chunk-cleaning requests. |
| `LIBRARIAN_PDF_ENGINE` | `auto` | Extraction engine (see "The extraction engine" above). |
| `LIBRARIAN_FIGURE_VISION_ENABLED` | `false` | Vision pass that turns charts into data tables. |
| `LIBRARIAN_IMPORT_CONCURRENCY` | `2` | Files converted/ingested in parallel. |
| `LIBRARIAN_API_KEY` / `LIBRARIAN_API_KEYS` | – | Require an API key for protected endpoints. |
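
A wrapper script that wants to know the effective configuration can mirror these defaults. The sketch below is not Librarian's own settings loader: the `Settings` class and `load_settings` function are hypothetical, and the boolean parsing is an assumption. Only the variable names and defaults are taken from the table.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    data_dir: str
    llm_provider: str
    llm_model: str
    llm_max_concurrency: int
    pdf_engine: str
    figure_vision_enabled: bool
    import_concurrency: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented LIBRARIAN_* variables, falling back to defaults."""
    vision = env.get("LIBRARIAN_FIGURE_VISION_ENABLED", "false")
    return Settings(
        data_dir=env.get("LIBRARIAN_DATA_DIR", ".librarian"),
        llm_provider=env.get("LIBRARIAN_LLM_PROVIDER", "mock"),
        llm_model=env.get("LIBRARIAN_LLM_MODEL", "mock-cleaner"),
        llm_max_concurrency=int(env.get("LIBRARIAN_LLM_MAX_CONCURRENCY", "8")),
        pdf_engine=env.get("LIBRARIAN_PDF_ENGINE", "auto"),
        # Assumption: only the string "true" (any case) enables the flag.
        figure_vision_enabled=vision.strip().lower() == "true",
        import_concurrency=int(env.get("LIBRARIAN_IMPORT_CONCURRENCY", "2")),
    )
```

Calling `load_settings({})` returns the defaults from the table, which is a quick way to confirm what a clean environment would run with.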

Optional dependency extras

The base install is lean; opt into capabilities (or take [all]):

| Extra | Enables |
| --- | --- |
| `pdf` | Built-in PDF text extraction (`pdfplumber`). |
| `ocr` | Scanned/image-PDF OCR for the built-in engine (`pdf2image`, `pillow`, `pytesseract`). |
| `liteparse` | High-fidelity engine: tables, headings, figures, selective OCR (bundles PDFium + Tesseract). |
| `universal` | Broad conversion via `markitdown` (PPTX, XLSX, Outlook, …). |
| `otel` | OpenTelemetry tracing/metrics export. |
| `all` | Everything above. |

🔌 API

uvicorn librarian.api.app:create_app --factory --host 127.0.0.1 --port 8080
# or simply: librarian api

Primary endpoints:

GET /health, GET /ready, GET /version
POST /documents, GET /documents, GET /documents/{id}, DELETE /documents/{id}
POST /imports, GET /imports/status, GET /imports/page-manifest
POST /runs, GET /runs, GET /runs/{id}, POST /runs/{id}/cancel, POST /runs/{id}/retry
GET /runs/{id}/events, GET /runs/{id}/events/stream
GET /documents/{id}/content, GET /documents/{id}/export?format=json|txt|md
GET /export/okf, GET /documents/{id}/okf
POST /search, POST /search/results, POST /search/facets
GET /metrics, GET /metrics/prometheus

Set LIBRARIAN_API_KEY (or LIBRARIAN_API_KEYS) to require a key via x-api-key or Authorization: Bearer …. Read-scoped keys reach document/search endpoints; operational endpoints need write scope. Full details in docs/API.md.
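
As an illustration of the two ways to present a key, here is a standard-library sketch that prepares GET requests against the local service. The base URL reuses the host and port from the uvicorn command above; `api_request` is a hypothetical helper, and response formats are left to docs/API.md.

```python
from __future__ import annotations

import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # host and port from the uvicorn example


def api_request(
    path: str, api_key: str | None = None, use_bearer: bool = False
) -> urllib.request.Request:
    """Prepare a GET request, attaching the API key if one is given."""
    request = urllib.request.Request(BASE_URL + path, method="GET")
    if api_key:
        if use_bearer:
            request.add_header("Authorization", f"Bearer {api_key}")
        else:
            request.add_header("x-api-key", api_key)
    return request
```

With the service running, `urllib.request.urlopen(api_request("/health"))` sends the request. Nothing is sent until `urlopen` is called, so the helper can be checked without a live server.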

📂 Where your writing lives

A workspace is just a folder (.librarian/ by default):

.librarian/librarian.sqlite   ← the library: documents, cleaned text, classifications, search index
.librarian/converted/         ← Markdown/text produced by `import` (originals are never touched)

librarian import converts sources into the workspace by default; pass --output-mode with new-directory, original, or subdirectory to place converted files elsewhere. Back the whole thing up with librarian admin workspace-backup.
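
Since the library is an ordinary SQLite file, it can be inspected with standard tools. The sketch below opens a database read-only and lists its tables; it makes no assumptions about Librarian's schema, which this README doesn't document, and `list_tables` is a hypothetical helper.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro guarantees this inspection can never modify the library.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

For the default workspace this would be called as `list_tables(Path(".librarian/librarian.sqlite"))`.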

🔒 Private by default

Librarian stores everything locally. Text is sent to a model provider only when cleaning, classification, or OCR-correction actually needs LLM work — and only to the provider you configure. With the default mock provider, nothing leaves your machine. Keep API keys in environment variables or .env, never in Git. CI runs secret scanning, dependency audit, type checking, the full test suite, a wheel smoke-install, and Docker build checks on every change.

🐳 Docker

Containerized deployment is optional and aimed at server installs — the Mac app and CLI need none of it. Images publish to ghcr.io/nampara-ai/librarian; compose/run examples live in docs/DEPLOYMENT.md.

🗂️ What’s in here

src/librarian/            The engine: ingest, pipeline, application, storage, api, cli, taxonomy
src/librarian/ingest/     Extraction adapters (liteparse, pdfplumber/Tesseract, DOCX, markitdown, …)
src/librarian/pipeline/   Chunking, cleaning, validation
src/librarian/taxonomy/   Dewey classification
apps/macos/               The native Mac app (Swift) + its build/bundle/sign scripts
docs/                     ARCHITECTURE · API · DEPLOYMENT · OPERATIONS · OKF
tests/                    The full test suite (unit + integration)
examples/corpus/          Sample documents to try the pipeline on

📚 Documentation

Start with docs/ARCHITECTURE.md. Then: API · deployment · operations runbooks · Open Knowledge Format · Mac app. Release history is in CHANGELOG.md.

🤝 Contributing & 📜 License

Contributions welcome — see CONTRIBUTING.md for setup and the quality gate, and CODE_OF_CONDUCT.md for community expectations. To report a vulnerability, follow SECURITY.md.

Licensed under the MIT License — see LICENSE. Librarian bundles or builds on third-party components under their own licenses (notably the Apache-2.0 liteparse engine); see NOTICE.
