osd.traineddata),
applying the rotation only when detection is confident so a correctly-oriented page is never
flipped. On by default for both standalone images and scanned PDF pages; disable with
LIBRARIAN_OCR_AUTO_ORIENT=false. Best-effort and graceful — if orientation can’t be determined
(e.g. too little text, OSD data unavailable), the page is left as-is.

The macOS app now ships the high-fidelity liteparse engine as its default (the bundled backend
already installs the [all] extras), with fully offline OCR: the app points liteparse’s bundled
Tesseract at the same bundled eng/osd traineddata via the new
LIBRARIAN_LITEPARSE_TESSDATA_PATH setting, so scanned PDFs are read without any first-use
language-data download. librarian doctor now reports liteparse availability.
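The confidence-gated auto-orientation noted above (LIBRARIAN_OCR_AUTO_ORIENT) can be illustrated with a minimal sketch. It assumes the textual report that Tesseract's OSD mode prints (lines such as "Rotate: 90" and "Orientation confidence: 21.27"). The function names and the 2.0 threshold are hypothetical and are not taken from Librarian's code.

```python
def parse_osd(osd_text: str) -> dict[str, float]:
    """Collect the numeric fields from a Tesseract OSD report."""
    fields: dict[str, float] = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            fields[key.strip()] = float(value)
        except ValueError:
            continue  # non-numeric fields such as "Script: Latin"
    return fields


def rotation_to_apply(osd_text: str, min_confidence: float = 2.0) -> int:
    """Return the rotation in degrees to apply, or 0 to leave the page as-is.

    min_confidence is an assumed threshold chosen for illustration only.
    """
    fields = parse_osd(osd_text)
    rotate = fields.get("Rotate")
    confidence = fields.get("Orientation confidence")
    if rotate is None or confidence is None:
        return 0  # orientation could not be determined
    if confidence < min_confidence:
        return 0  # not confident enough, so never flip the page
    return int(rotate) % 360
```

Returning 0 on every uncertain path mirrors the best-effort behaviour described in the entry: a missing or low-confidence report leaves the page untouched.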
LIBRARIAN_FIGURE_VISION_ENABLED). With the liteparse
engine active, each embedded figure image is sent to a vision-capable model that returns a
description and, for charts, a reconstructed Markdown data table; the result is injected next to the
figure’s placeholder so otherwise-lost chart data becomes searchable, classifiable text. Bounded by
LIBRARIAN_FIGURE_VISION_MAX_FIGURES/_MIN_BYTES/_MAX_BYTES/_MAX_CONCURRENCY; uses
LIBRARIAN_FIGURE_VISION_MODEL (defaults to the cleaning model). Per-figure failures are swallowed
so one bad image never fails the document, and the output-affecting vision settings (model, figure
cap, and the size/length gates) fold into the extraction-cache signature so toggling them
re-extracts instead of serving stale text. Providers gained a describe_image capability
(OpenAI-compatible vision content parts; deterministic mock for tests/dry runs).

liteparse engine
(liteparse, Apache-2.0). When the liteparse extra is
installed (included in [all]), PDFs and images are extracted to Markdown with reconstructed
tables, headings, lists, and figure placeholders, OCR-ing only the pages that need it, and
bundling its own PDFium + Tesseract (no poppler system binary needed for PDFs). The richer
Markdown feeds the existing cleaning/classification/OKF pipeline unchanged. LIBRARIAN_PDF_ENGINE
selects auto (default; liteparse when installed, otherwise the built-in pdfplumber + Tesseract
path), liteparse, or legacy; the built-in path remains a per-document fallback. See NOTICE
for attribution.

0009_extraction_cache.sql; toggle with
LIBRARIAN_EXTRACTION_CACHE_ENABLED (default on). admin db-stats reports the extraction_cache
row count.

LIBRARIAN_EXTRACTION_TIMEOUT_SECONDS (default 0, disabled) bounds a
single document’s extraction so one pathological file cannot hang a batch, raising
ExtractionTimeoutError when exceeded.

Directory imports now convert/ingest files with bounded concurrency
(LIBRARIAN_IMPORT_CONCURRENCY, default 2), so the per-file extraction work overlaps instead of
running strictly one at a time. Output paths are reserved up front to stay collision-free, and
result order, manifest resume, and per-file failure isolation are preserved. Set it to 1 for
fully sequential imports; raising it speeds bulk imports but, for --process/--queue runs, also
multiplies with llm_max_concurrency, so keep it modest on rate-limited providers.
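The bounded-concurrency import described above can be sketched roughly as follows. This is an illustration under assumptions, not Librarian's implementation: import_files, ImportResult, and the convert callback are hypothetical names, and a thread pool stands in for whatever scheduler the engine actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ImportResult:
    path: str
    output: Optional[str] = None
    error: Optional[Exception] = None


def import_files(
    paths: Sequence[str],
    convert: Callable[[str], str],
    concurrency: int = 2,
) -> list[ImportResult]:
    """Convert files with at most `concurrency` in flight, keeping input order."""

    def run_one(path: str) -> ImportResult:
        try:
            return ImportResult(path=path, output=convert(path))
        except Exception as exc:
            # Per-file failure isolation: record the error, keep the batch going.
            return ImportResult(path=path, error=exc)

    with ThreadPoolExecutor(max_workers=max(1, concurrency)) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(run_one, paths))
```

With concurrency=1 the same code degrades to a fully sequential import, which matches the documented meaning of setting LIBRARIAN_IMPORT_CONCURRENCY to 1.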
issuer, series_title, a normalized series_key, and an orderable
period. The new dewey_v5 prompt extracts issuer/series/period; the series_key is derived
deterministically by stripping date/period tokens so monthly editions converge, and falls back to
a distinctive source filename when the model gives no series (generic names like report.pdf are
ignored). Documents classified before v5 keep parsing with these fields unset.

“## Series Editions” heading ordered by reporting period, and carries issuer / series / series_key / period as
frontmatter extension fields. librarian export-okf --series <key-or-name-fragment> (and the
series query parameter on GET /export/okf) filters a bundle to one series.

0008_classification_series.sql adds the four nullable columns and an index on
series_key.

Relicensed to MIT, plus an OKF output mode in the Mac app and a small PDF cleanup. (The v1.6.0 tag was accidentally created on the v1.5.0 commit before this work merged; protected tags cannot be moved, so it is retained as inert history and superseded by 1.6.1.)
#### Page N) instead of ## Page N, so
page boundaries no longer dominate a document’s heading outline.

Librarian can now emit a processed corpus as an Open Knowledge Format (OKF) v0.1 bundle: a vendor-neutral, agent- and human-readable knowledge format (spec). Turn a pile of scanned PDFs, transcripts, and documents into a portable knowledge wiki an agent can reason over.
librarian export-okf ./bundle (and GET /export/okf, GET /documents/{id}/okf) render
processed documents as conformant OKF concept files: markdown with YAML frontmatter, organized
into a Dewey-derived directory hierarchy, cross-linked to same-classification siblings, and
accompanied by generated index.md files for progressive disclosure. The bundle root declares
okf_version: "0.1". Filters: --classification-prefix, --tag, --limit; --json summary;
non-zero exit when nothing matches.

description (the new dewey_v4
prompt), used as the OKF concept abstract; documents classified before v4 fall back to the first
sentence of their synopsis. Frontmatter maps title → title, the one-line abstract →
description, tags → tags, the document kind → type, with the Dewey code/label and
confidence as extension fields. There is no runtime OKF dependency — Librarian emits the format
directly. See docs/OKF.md.

Makes the CLI fully scriptable, so an agent can drive bulk document processing end to end without scraping human-readable tables.
--json now covers the core query and control commands: ingest and process return the
new document_id/run_id (plus run status and chunk counts), status returns
status/stage/total_chunks/completed_chunks/failed_chunks and the event list for
polling, and list, show, and search (with or without --details) return structured
records — show and detailed search include the Dewey code, title, tags, and summary.
Output is clean, unstyled JSON suitable for piping into jq or a parser.

import --recursive --process --report report.json (full JSON
report), import --manifest <path> --resume (idempotent bulk imports across restarts),
non-zero exit on any failed item, and SHA-based ingest de-duplication, the CLI is now a
complete machine-driveable surface. The README documents the automation flow.

Scanned and image-based PDFs now work in the Mac app, out of the box.
Librarian.app, fully self-contained and relocated so it
depends on nothing outside the bundle. Previously a PDF with image pages failed
at upload: macOS GUI apps inherit only a bare system PATH, so the engine could
not find an OCR binary even when one was installed via Homebrew. The app now puts
its bundled OCR tools (then common Homebrew locations) on the engine’s PATH and
sets TESSDATA_PREFIX.

Cleaned documents now come out of the pipeline shelf-ready: named, summarized, and tagged.
636.1 Saddle Fit and Groundwork Notes.md. The engine suggests the
name on every export (sanitized for the filesystem and HTTP headers), the Mac app uses it
when auto-saving to the destination folder, and identical or near-identical documents that
produce the same name fall into the existing “ (2)”, “ (3)” collision numbering — never
overwritten. When classification produced no title, the original filename is used as before.

title, summary, tags, and suggested_stem fields; plain-text
exports remain the cleaned text verbatim.

LIBRARIAN_CLASSIFICATION_PROMPT_VERSION.

Closes the last silent black hole. A stale preference combination (built-in engine disabled plus an empty external server address) could survive reinstalls indefinitely; the app started no engine, sent every file to an empty address, and showed nothing wrong.
Fixes “Couldn’t reach the AI provider” failures on Macs where the app could connect but the embedded engine could not. The app’s own networking follows macOS system settings; the bundled Python runtime does not. The app now bridges both into the engine’s environment at launch:
scutil --proxy) are passed as proxy environment
variables, so VPN and proxy setups work for cleaning calls, with loopback excluded.

SSL_CERT_FILE, so corporate or security-tool TLS interception
roots that the Mac trusts are also trusted by the engine.

Settings becomes connect-first and idiot-proof:
The Mac app is redesigned around its real job — a pipeline, not a database browser. One window, one column, one verb: drop files, pick a destination, let it cook.
.env are migrated automatically) and are handed
to the engine through its environment, never written to disk.

librarian doctor --json). Engine health appears in the footer
only when something is wrong.

BackendController’s static path helpers are now nonisolated). The
Mac App workflow now builds the app on every pull request that touches apps/macos, so a
compile failure can never first surface on an immutable release tag again. The v1.1.4 tag was
burned by exactly that failure and joins v1.1.0–v1.1.2 as inert history.

First fully published release of the 1.1 line, containing all 1.1.0 and 1.1.1 changes below. This repository publishes immutable releases, so assets cannot be attached after publication; the release workflow now waits for the Mac app DMG builds, collects them as workflow artifacts, and includes them, checksummed alongside the engine artifacts, in a single atomic release creation. The v1.1.1 release published with engine artifacts only (its DMG attach step was rejected by release immutability) and is superseded by this version. The v1.1.0–v1.1.2 tags remain as inert history: v1.1.2 was accidentally created on the 1.1.1 commit before this release’s pipeline fix merged, and protected tags cannot be moved.
Patch release on top of the unpublished 1.1.0. The Docker image build now upgrades base-layer packages, picking up Debian’s fix for CVE-2026-45447 (OpenSSL), which was published mid-release and blocked the image scan gate. The v1.1.0 GitHub release was never published: its tag hit a release-assembly race (fixed in this version’s workflows) and is retained as an inert tag. v1.1.1 is the first release with attached Mac app DMGs; all 1.1.0 changes below are included.
Librarian 1.1.0 introduces the native macOS app: a self-contained download with the entire engine inside. Release builds bundle a relocatable Python runtime plus the Librarian wheel in Librarian.app, launch the backend automatically on a loopback port secured by a random per-launch API key, and store data in ~/Library/Application Support/Librarian. The app offers drag-and-drop ingest, live per-run progress with expandable run events, cleaned-output viewing with classification, full-text search, Markdown export, and a backend readiness checklist — all over the same public HTTP API the CLI uses. DMG installers for Apple Silicon and Intel are built by the new macapp.yml workflow and attached to releases, with optional Developer ID signing and notarization via repository secrets, plus a download landing page under site/.
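The loopback-plus-per-launch-key arrangement described above can be sketched as follows. This is a hedged illustration only: the real launcher lives in the Mac app, the function names here are hypothetical, and how the port is passed to the backend is not stated in these notes, so the sketch returns it rather than guessing a setting name. Only LIBRARIAN_API_KEY comes from the changelog.

```python
import os
import secrets
import socket


def pick_loopback_port() -> int:
    """Ask the OS for a free TCP port bound to 127.0.0.1 only."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def build_backend_launch() -> tuple[int, str, dict[str, str]]:
    """Return (port, api_key, env) for one launch of the spawned backend."""
    port = pick_loopback_port()
    api_key = secrets.token_urlsafe(32)  # fresh random key on every launch
    env = dict(os.environ)
    env["LIBRARIAN_API_KEY"] = api_key  # handed to the child via its environment
    return port, api_key, env
```

Because the key exists only in the spawned process's environment and in the client that launched it, another local process that finds the port still cannot authenticate. Note that the port is released before the backend binds it, so a production launcher would need to handle the small chance of a collision.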
Engine and tooling changes:
LIBRARIAN_API_KEY to the spawned process and attached by the app’s API client, so other
local processes cannot read or modify the corpus over localhost.

database_path so it defaults to librarian.sqlite inside data_dir. Setting only
LIBRARIAN_DATA_DIR no longer leaves the SQLite database at a working-directory-relative
.librarian/librarian.sqlite while uploads follow the configured data directory. Explicit
LIBRARIAN_DATABASE_PATH values are unchanged.

workspace conversion output mode and made it the default for librarian import and
POST /imports: converted files now land under <data_dir>/converted instead of a
librarian-converted/ directory created next to the source documents. convert-dir keeps its
explicit subdirectory default but also accepts workspace.

python -m librarian module entry point.

librarian.maintainer, which ships with source checkouts and is excluded from release wheels.
librarian maintainer commands print an actionable message when the harness is absent.

httpx2 to the dev dependencies: starlette 1.2 deprecates its httpx 1.x test-client shim
and leaves it untyped, which broke pyright on fresh installs. The lockfile now pins
starlette 1.2.1/fastapi 0.136.3 so local runs match fresh CI resolution.

BLE001) for src/: every except Exception must either
re-raise or carry an inline justification stating where the error is recorded. The 11
deliberate boundary handlers are annotated; new silent swallows fail lint. A settings audit
confirmed all 71 configuration fields are read by runtime code.

CONTRIBUTING.md release status and librarian maintainer command examples, and
trimmed the README Docker section to a pointer into docs/DEPLOYMENT.md.

Librarian 1.0.0 is the stable release of the local-first document ingestion, cleaning, classification, and search engine. It ships a focused user CLI and FastAPI service for converting documents, importing corpora, running provenance-rich LLM cleaning, classifying outputs with Dewey-style labels, searching SQLite FTS indexes, and exporting cleaned content with optional transcript citation evidence. The release supports Markdown, text-like files, DOCX, PDFs, OCR images, and SRT/VTT transcript normalization, including page-aware PDF extraction with durable OCR page manifests for long-running jobs. Operational commands are grouped under librarian admin, while evaluation and benchmark tools are grouped under librarian maintainer so the production surface stays clear. The release workflow keeps secret scanning, dependency audit, SBOM generation, checksums, artifact attestations, wheel smoke installation, Docker build, and image scanning, while removing alpha-era mock evidence artifacts from published releases.