chunkr
Overview
chunkr is a Rust CLI for turning an ebook library into search- and retrieval-ready text artifacts. It extracts text and metadata from Calibre-managed books, normalizes and chunks that text, generates embeddings through an HTTP provider, and inserts the resulting records into downstream systems such as Qdrant and Quickwit.
Ambition
Build a high-throughput data pipeline for RAG (Retrieval-Augmented Generation) systems that can process and index millions of documents with precision.
What’s novel
- End-to-end normalization and chunking optimized for LLM contexts.
- Native integration with the Qdrant vector database and the Quickwit search engine.
- High-performance Rust implementation for massive document throughput.
Highlights
- Extract .epub and .pdf content from a Calibre library tree.
- Preserve book-level metadata in JSON sidecars during extraction.
- Normalize text before chunking, including Unicode cleanup and whitespace collapsing.
- Produce JSONL chunk records with stable metadata fields and per-chunk offsets.
- Generate embeddings through a configurable HTTP embedding provider.
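The normalize, chunk, and emit steps listed above can be illustrated with a small standard-library-only sketch. It is not chunkr's actual code: the function names (`collapse_whitespace`, `chunk_words`, `to_jsonl`), the `Chunk` struct, the JSON field names (`book_id`, `chunk_index`, `start`, `end`, `text`), and the greedy word-packing strategy are all assumptions made for illustration. Unicode normalization (for example NFC) is left out because it needs an external crate, and a real pipeline would more likely serialize with a JSON library such as `serde_json` than with the hand-written escaper shown here.

```rust
/// Collapse every run of Unicode whitespace into one ASCII space and trim
/// both ends. Hypothetical helper, not taken from chunkr.
fn collapse_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// One chunk plus its byte offsets into the normalized text (assumed shape).
#[derive(Debug, PartialEq)]
struct Chunk {
    index: usize,
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
    text: String,
}

/// Greedily pack whole words into chunks of at most `max_chars` characters.
/// Expects text with single-space separators, as `collapse_whitespace`
/// produces. A word longer than `max_chars` becomes its own oversized chunk.
fn chunk_words(text: &str, max_chars: usize) -> Vec<Chunk> {
    let mut spans: Vec<(usize, usize)> = Vec::new();
    // (start byte, end byte, character count) of the chunk being built.
    let mut current: Option<(usize, usize, usize)> = None;
    let mut pos = 0usize; // byte offset of the next word

    for word in text.split(' ') {
        let (word_start, word_end) = (pos, pos + word.len());
        pos = word_end + 1; // step over the separating space
        if word.is_empty() {
            continue;
        }
        let word_chars = word.chars().count();
        current = match current.take() {
            // The word, plus one separating space, still fits: extend.
            Some((start, _, count)) if count + 1 + word_chars <= max_chars => {
                Some((start, word_end, count + 1 + word_chars))
            }
            // It does not fit: close the current chunk and start a new one.
            Some((start, end, _)) => {
                spans.push((start, end));
                Some((word_start, word_end, word_chars))
            }
            None => Some((word_start, word_end, word_chars)),
        };
    }
    if let Some((start, end, _)) = current {
        spans.push((start, end));
    }

    spans
        .into_iter()
        .enumerate()
        .map(|(index, (start, end))| Chunk {
            index,
            start,
            end,
            text: text[start..end].to_string(),
        })
        .collect()
}

/// Minimal JSON string escaping for quotes, backslashes, and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one chunk as a single JSONL line (field names are assumptions).
fn to_jsonl(book_id: &str, chunk: &Chunk) -> String {
    format!(
        "{{\"book_id\":\"{}\",\"chunk_index\":{},\"start\":{},\"end\":{},\"text\":\"{}\"}}",
        json_escape(book_id),
        chunk.index,
        chunk.start,
        chunk.end,
        json_escape(&chunk.text)
    )
}

fn main() {
    let raw = "  Call me\tIshmael.\n\nSome years   ago  ";
    let clean = collapse_whitespace(raw);
    for chunk in chunk_words(&clean, 16) {
        println!("{}", to_jsonl("book-1", &chunk));
    }
}
```

Recording offsets against the normalized text, rather than the raw extraction, keeps `text[start..end]` equal to the chunk body, so a downstream consumer can re-slice the source without re-running the chunker.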
Stats
- Project page: /projects/chunkr/
- Primary language: Rust
- Commits: 610
- Created: 2026-01-29T10:17:54Z
- Last updated: 2026-05-03T21:44:58Z
Links
- Repo: https://github.com/sguzman/chunkr
- README: /projects/readme/chunkr/
- DeepWiki: https://deepwiki.com/sguzman/chunkr/