chunkr

Overview

chunkr is a Rust CLI for turning an ebook library into search- and retrieval-ready text artifacts. It extracts text and metadata from Calibre-managed books, normalizes and chunks that text, generates embeddings through an HTTP provider, and inserts the resulting records into downstream systems such as Qdrant and Quickwit.

Ambition

Build a high-throughput data pipeline for RAG (Retrieval-Augmented Generation) systems that can process and index millions of documents with precision.

What’s novel

End-to-end normalization and chunking optimized for LLM contexts.
Native integration with the Qdrant vector database and the Quickwit search engine.
High-performance Rust implementation for massive document throughput.

Highlights

Extract .epub and .pdf content from a Calibre library tree.
Preserve book-level metadata in JSON sidecars during extraction.
Normalize text before chunking, including Unicode cleanup and whitespace collapsing.
Produce JSONL chunk records with stable metadata fields and per-chunk offsets.
Generate embeddings through a configurable HTTP embedding provider.
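
The normalization and chunking steps listed above can be illustrated with a small standard-library sketch. This is not chunkr's actual code: the `Chunk` struct, its field names, and the functions `collapse_whitespace` and `chunk_by_chars` are hypothetical, and fixed-size character chunking is only an assumed strategy chosen to show how per-chunk offsets can be tracked.

```rust
/// One chunk of normalized text plus its position in the normalized string.
/// Field names are illustrative assumptions, not chunkr's real schema.
#[derive(Debug, PartialEq)]
pub struct Chunk {
    pub index: usize, // position of the chunk within the document
    pub start: usize, // byte offset into the normalized text, inclusive
    pub end: usize,   // byte offset into the normalized text, exclusive
    pub text: String,
}

/// Collapse every run of Unicode whitespace into a single ASCII space
/// and trim leading and trailing whitespace.
pub fn collapse_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Split `text` into chunks of at most `max_chars` characters.
/// Cuts always fall on char boundaries, so multi-byte text stays valid UTF-8.
pub fn chunk_by_chars(text: &str, max_chars: usize) -> Vec<Chunk> {
    assert!(max_chars > 0, "max_chars must be positive");
    let mut chunks = Vec::new();
    let mut start = 0; // byte offset where the current chunk begins
    let mut count = 0; // characters collected in the current chunk
    for (pos, _) in text.char_indices() {
        if count == max_chars {
            let index = chunks.len();
            chunks.push(Chunk { index, start, end: pos, text: text[start..pos].to_string() });
            start = pos;
            count = 0;
        }
        count += 1;
    }
    // Emit whatever remains after the last full chunk.
    if start < text.len() {
        let index = chunks.len();
        chunks.push(Chunk { index, start, end: text.len(), text: text[start..].to_string() });
    }
    chunks
}
```

Calling `chunk_by_chars(&collapse_whitespace(raw), 512)` would yield records whose `start` and `end` offsets index back into the normalized text, which is one way to keep chunk records traceable to their source. A real pipeline would more likely cut on sentence or paragraph boundaries and allow overlap between chunks.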

Stats

Project page: /projects/chunkr/
Primary language: Rust
Commits: 610
Created: 2026-01-29T10:17:54Z
Last updated: 2026-05-03T21:44:58Z

Links

Repo: https://github.com/sguzman/chunkr
README: /projects/readme/chunkr/
DeepWiki: https://deepwiki.com/sguzman/chunkr/