cfr-to-text

Overview

Extract text from CFR XML files (Code of Federal Regulations) into plain text or JSONL.
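As an illustration of the JSONL output mode, here is a minimal sketch of turning one extracted section into a single JSON line using only the standard library. The field names `source` and `text` and the helper names are assumptions made for this example; the tool's actual record layout is not documented here.

```rust
/// Escape a string for embedding inside a JSON string literal.
/// Covers quotes, backslashes, and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters become \u00XX escapes.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build one JSONL record (one JSON object, no trailing newline).
/// The `source` and `text` keys are hypothetical, chosen for illustration.
fn jsonl_line(source: &str, text: &str) -> String {
    format!(
        "{{\"source\":\"{}\",\"text\":\"{}\"}}",
        json_escape(source),
        json_escape(text)
    )
}
```

Because newlines inside the text are escaped, every record stays on exactly one line, which is what lets downstream tools read the output line by line.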

Ambition

The goal is a robust extraction tool for the Code of Federal Regulations that reduces its heavily nested XML markup to clean, readable text.

What’s novel

  • CLI with TOML configuration for fine-grained control over which elements are excluded and how whitespace is normalized.
  • High-speed, event-based XML processing capable of handling the entire Code of Federal Regulations.
  • Automated output splitting and file management for massive, multi-part datasets.
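To make element exclusion and whitespace normalization concrete, here is a minimal sketch under stated assumptions. It uses a hand-rolled tag scanner from the standard library rather than a real XML parser, so it does not handle comments containing `>`, CDATA sections, or entity decoding, and it is not the tool's actual implementation. The function name `extract_text`, the sample element names, and the choice to insert a space at every tag boundary are all assumptions made for illustration.

```rust
/// Strip tags from an XML fragment, drop the contents of excluded elements,
/// and collapse runs of whitespace into single spaces.
/// Hypothetical helper: a simplified scanner, not a conforming XML parser.
fn extract_text(xml: &str, exclude: &[&str]) -> String {
    let mut raw = String::new();
    let mut skip_depth = 0usize; // > 0 while inside an excluded element
    let mut rest = xml;
    loop {
        let lt = match rest.find('<') {
            Some(i) => i,
            None => {
                // No more tags: keep the trailing text unless we are skipping.
                if skip_depth == 0 {
                    raw.push_str(rest);
                }
                break;
            }
        };
        if skip_depth == 0 {
            raw.push_str(&rest[..lt]);
        }
        let after = &rest[lt + 1..];
        let gt = match after.find('>') {
            Some(i) => i,
            None => break, // unterminated tag: drop the remainder
        };
        let tag = &after[..gt];
        rest = &after[gt + 1..];

        // Ignore declarations such as <?xml ...?> and <!DOCTYPE ...>.
        if tag.starts_with('?') || tag.starts_with('!') {
            continue;
        }
        let is_close = tag.starts_with('/');
        let is_self_closing = tag.ends_with('/');
        let name = tag
            .trim_start_matches('/')
            .split(|c: char| c.is_whitespace() || c == '/')
            .next()
            .unwrap_or("");

        if skip_depth > 0 {
            // Track nesting so the skip ends at the matching close tag.
            if is_close {
                skip_depth -= 1;
            } else if !is_self_closing {
                skip_depth += 1;
            }
        } else if !is_close && !is_self_closing && exclude.iter().any(|e| *e == name) {
            skip_depth = 1;
        }
        // Treat every tag boundary as a word break (a simplification).
        if skip_depth == 0 {
            raw.push(' ');
        }
    }
    // Whitespace normalization: collapse all runs to a single space and trim.
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

For example, calling `extract_text` on `<P>Hello   world.</P><FTNT><P>skip me</P></FTNT><P>Done</P>` with `FTNT` excluded yields `Hello world. Done`. Treating every tag as a word break keeps adjacent paragraphs apart, at the cost of adding a space around inline elements; a real extractor would likely distinguish block from inline elements.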

Highlights

  • --config <FILE>: Config file path (default cfr-to-text.toml)
  • --input-dir <DIR> / positional inputs
  • --recursive / --no-recursive
  • --glob <GLOB> (repeatable)
  • --output-dir <DIR> or --output <FILE>
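The flags above can be modeled with a small hand-rolled parser. This is only a sketch of how such options might be represented, not the tool's actual argument handling. The `Options` struct, the `parse_args` and `take_value` names, the rule that `--output-dir` and `--output` are mutually exclusive, and leaving the recursion default unset are all assumptions; only the flag names and the `cfr-to-text.toml` default come from the list above.

```rust
/// Parsed command-line options (hypothetical layout for illustration).
#[derive(Debug, PartialEq)]
struct Options {
    config: String,            // defaults to cfr-to-text.toml
    input_dir: Option<String>,
    recursive: Option<bool>,   // None = flag not given; real default unknown
    globs: Vec<String>,        // --glob may be repeated
    output_dir: Option<String>,
    output: Option<String>,
    inputs: Vec<String>,       // positional inputs
}

/// Fetch the value that must follow a flag such as `--config`.
fn take_value<I: Iterator<Item = String>>(it: &mut I, flag: &str) -> Result<String, String> {
    it.next().ok_or_else(|| format!("{} requires a value", flag))
}

/// Parse arguments (without the program name) into `Options`.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Options, String> {
    let mut opts = Options {
        config: "cfr-to-text.toml".to_string(),
        input_dir: None,
        recursive: None,
        globs: Vec::new(),
        output_dir: None,
        output: None,
        inputs: Vec::new(),
    };
    let mut it = args.into_iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--config" => opts.config = take_value(&mut it, "--config")?,
            "--input-dir" => opts.input_dir = Some(take_value(&mut it, "--input-dir")?),
            "--recursive" => opts.recursive = Some(true),
            "--no-recursive" => opts.recursive = Some(false),
            "--glob" => opts.globs.push(take_value(&mut it, "--glob")?),
            "--output-dir" => opts.output_dir = Some(take_value(&mut it, "--output-dir")?),
            "--output" => opts.output = Some(take_value(&mut it, "--output")?),
            other if other.starts_with("--") => {
                return Err(format!("unknown flag: {}", other));
            }
            other => opts.inputs.push(other.to_string()),
        }
    }
    // Assumption: a single output file and an output directory cannot be combined.
    if opts.output_dir.is_some() && opts.output.is_some() {
        return Err("--output-dir and --output are mutually exclusive".to_string());
    }
    Ok(opts)
}
```

In a real binary, `parse_args(std::env::args().skip(1))` would feed the process arguments into this function. Keeping `recursive` as an `Option<bool>` lets a config file supply the value when neither `--recursive` nor `--no-recursive` is given.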

Stats

  • Project page: /projects/cfr-to-text/
  • Primary language: Rust
  • Commits: 5
  • Created: 2026-01-29T11:54:35Z
  • Last updated: 2026-01-29T12:49:25Z