⚖️→🔢

Words to Data

Convert Legal Documents Into Diffable Data Structures

Features

📄

Parse Legal Documents

Extract hierarchical structure from US Code titles and Public Laws in USLM XML format with full structural preservation.

📊

Hierarchical Diffing

Compute word-level differences between document versions to precisely track changes over time.

Parallel Processing

Parse multiple documents concurrently using Rayon, spreading the work across available CPU cores.

📦

JSON Serialization

All data structures implement Serde traits, so they can be serialized to JSON and consumed by other systems.

🔍

Bill Amendment Extraction

Automatically identify USC references and amending actions from bills to track legislative changes.

📜

MIT Licensed

Free and open source software. Use it in your projects, modify it, and distribute it freely.

And More

We're actively working on Python bindings, legal-specific diff algorithms, enhanced bill parsing, pre-built datasets, and congressional vote tracking.

View Full Roadmap

Installation

Add to Cargo.toml

TOML
[dependencies]
words-to-data = "0.1.1"

Build from Source

Bash
git clone https://github.com/Scronkfinkle/words-to-data
cd words-to-data
cargo build --lib --release

Quick Start Examples

Parse a US Code Document

Rust
use words_to_data::uslm::parser::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let document = parse("tests/test_data/usc/2025-07-18/usc07.xml", "2025-07-18")?;

    println!("Parsed: {}", document.data.verbose_name);
    println!("USLM ID: {:?}", document.data.uslm_id);
    println!("Children: {}", document.children.len());

    Ok(())
}
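
Once a document is parsed, the result is a tree that can be walked recursively. The sketch below illustrates that idea with hypothetical stand-in types: only the field names `data`, `verbose_name`, and `children` come from the example above, while the struct definitions, the helper functions, and the sample names are assumptions made for illustration and may not match the crate's real types.

```rust
// Hypothetical stand-ins for the crate's element types. Only the field
// names `data`, `verbose_name`, and `children` come from the example above.
struct ElementData {
    verbose_name: String,
}

struct Element {
    data: ElementData,
    children: Vec<Element>,
}

/// Build a node with the given name and children.
fn node(name: &str, children: Vec<Element>) -> Element {
    Element {
        data: ElementData { verbose_name: name.to_string() },
        children,
    }
}

/// Count this element plus all of its descendants.
fn count_elements(element: &Element) -> usize {
    1 + element.children.iter().map(count_elements).sum::<usize>()
}

/// Collect an indented outline, one line per element, depth-first.
fn outline(element: &Element, depth: usize, out: &mut Vec<String>) {
    out.push(format!("{}{}", "  ".repeat(depth), element.data.verbose_name));
    for child in &element.children {
        outline(child, depth + 1, out);
    }
}

/// A tiny tree with invented names, used only for demonstration.
fn sample_tree() -> Element {
    node(
        "Title 7",
        vec![
            node(
                "Chapter 1",
                vec![node("Section 1", vec![]), node("Section 2", vec![])],
            ),
            node("Chapter 2", vec![]),
        ],
    )
}

fn main() {
    let tree = sample_tree();
    let mut lines = Vec::new();
    outline(&tree, 0, &mut lines);
    println!("{}", lines.join("\n"));
    println!("{} elements", count_elements(&tree));
}
```

The same recursive pattern should carry over to a real parsed document, provided its children are themselves elements with a `children` field.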

Compute a Diff Between Versions

Rust
use words_to_data::{diff::TreeDiff, uslm::parser::parse};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse two snapshots of the same file (usc07.xml) taken on different dates.
    let doc_old = parse("tests/test_data/usc/2025-07-18/usc07.xml", "2025-07-18")?;
    let doc_new = parse("tests/test_data/usc/2025-07-30/usc07.xml", "2025-07-30")?;

    // Compare the two trees and write the result to diff.json.
    let diff = TreeDiff::from_elements(&doc_old, &doc_new);
    words_to_data::utils::write_json_file(&diff, "diff.json")?;
    Ok(())
}
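
To make "word-level differences" concrete, here is a minimal standard-library sketch of one common approach: split both texts on whitespace and align them with a longest-common-subsequence (LCS) table. This is not the crate's algorithm; the `Change` enum and the `word_diff` function are hypothetical names introduced only for this illustration.

```rust
/// One step in a word-level diff. Hypothetical type, not part of the crate.
#[derive(Debug, PartialEq)]
enum Change<'a> {
    Same(&'a str),
    Removed(&'a str),
    Added(&'a str),
}

/// Diff two texts word by word using a longest-common-subsequence table.
fn word_diff<'a>(old: &'a str, new: &'a str) -> Vec<Change<'a>> {
    let a: Vec<&str> = old.split_whitespace().collect();
    let b: Vec<&str> = new.split_whitespace().collect();

    // lcs[i][j] holds the LCS length of a[i..] and b[j..].
    let mut lcs = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in (0..a.len()).rev() {
        for j in (0..b.len()).rev() {
            lcs[i][j] = if a[i] == b[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }

    // Walk the table from the start, emitting one change per word.
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(Change::Same(a[i]));
            i += 1;
            j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            out.push(Change::Removed(a[i]));
            i += 1;
        } else {
            out.push(Change::Added(b[j]));
            j += 1;
        }
    }

    // Whatever is left on either side is a pure removal or addition.
    out.extend(a[i..].iter().map(|&w| Change::Removed(w)));
    out.extend(b[j..].iter().map(|&w| Change::Added(w)));
    out
}

fn main() {
    let old = "shall not exceed 30 days";
    let new = "shall not exceed 60 days";
    for change in word_diff(old, new) {
        println!("{:?}", change);
    }
}
```

The table costs memory proportional to the product of the two word counts, so a sketch like this suits individual provisions better than whole titles.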

Extract Amendments from a Bill

Rust
use words_to_data::uslm::bill_parser::parse_bill_amendments;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = parse_bill_amendments("tests/test_data/bills/hr-119-21.xml")?;

    println!("Bill {}: {} amendments found", data.bill_id, data.amendments.len());

    for amendment in &data.amendments {
        println!("\nAmendment at: {}", amendment.source_path);
        println!("  USC sections modified: {}", amendment.target_paths.len());
        println!("  Actions: {:?}", amendment.action_types);
    }

    Ok(())
}
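
As a rough illustration of what "identifying USC references" can mean, the sketch below scans plain text for citations of the form `7 U.S.C. 2012`. It is a simplified, hypothetical example: the `UscRef` struct and `find_usc_refs` function are invented here, it ignores variants such as citations written with a section sign, and the crate itself works on USLM XML and may take a quite different approach.

```rust
/// A plain-text U.S. Code citation. Hypothetical type, not part of the crate.
#[derive(Debug, PartialEq)]
struct UscRef {
    title: u32,
    section: String,
}

/// Find citations shaped like "<title> U.S.C. <section>" in free text.
fn find_usc_refs(text: &str) -> Vec<UscRef> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut refs = Vec::new();

    for window in words.windows(3) {
        if window[1] != "U.S.C." {
            continue;
        }

        // Drop leading punctuation such as "(" before parsing the title number.
        let title_text = window[0].trim_start_matches(|c: char| !c.is_ascii_digit());
        let title = match title_text.parse::<u32>() {
            Ok(t) => t,
            Err(_) => continue,
        };

        // Keep the leading alphanumeric run, so "2012(a)" becomes "2012".
        let section: String = window[2]
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect();
        if section.starts_with(|c: char| c.is_ascii_digit()) {
            refs.push(UscRef { title, section });
        }
    }

    refs
}

fn main() {
    let text = "Section 3 (7 U.S.C. 2012) and 26 U.S.C. 45Z(b) are amended.";
    for r in find_usc_refs(text) {
        println!("Title {}, section {}", r.title, r.section);
    }
}
```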

Parse Multiple Documents in Parallel

Rust
use words_to_data::utils::parse_uslm_directory;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let documents = parse_uslm_directory("tests/test_data/usc/2025-07-18", "2025-07-18")?;

    println!("Parsed {} documents in parallel", documents.len());

    for doc in documents.iter().take(5) {
        println!("  - {} ({})", doc.data.verbose_name, doc.data.path);
    }

    Ok(())
}
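
The crate uses Rayon for parallel parsing. To show the underlying idea without any dependencies, the sketch below swaps in scoped threads from the standard library: the inputs are split into chunks, each chunk is processed on its own thread, and the results are joined back in their original order. `parse_stub` and `parse_all` are hypothetical names, and the stub merely counts words in place of real parsing.

```rust
use std::thread;

/// Stand-in for real parsing work: count whitespace-separated words.
fn parse_stub(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Process every input on up to `workers` scoped threads, preserving order.
fn parse_all(inputs: &[&str], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_size = ((inputs.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn one thread per chunk before joining any of them.
        let handles: Vec<_> = inputs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|t| parse_stub(t)).collect::<Vec<usize>>()
                })
            })
            .collect();

        // Join in spawn order so results line up with the inputs.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let inputs = ["a b c", "d e", "", "f"];
    let counts = parse_all(&inputs, 2);
    println!("{:?}", counts);
}
```

Rayon's parallel iterators handle the chunking and scheduling automatically, which is why a library like this would reach for it over hand-rolled threads.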

Documentation & Resources

🔧

GitHub Repository

Source code and issue tracking for the Words to Data project.

View Source

📦

Crates.io

Download the latest version and view version history on the official Rust package registry.

View Crate

Get in Touch

Have questions, feedback, or partnership opportunities? We'd love to hear from you.

📧
contact@wordstodata.com

For technical support, please open an issue on GitHub.