Tiny, fast, configurable tokenizer for Rust: sentience-tokenize

2025-08-19 Rust, Tokenizer, Iterator API, Zero-copy, Diagnostics

sentience-tokenize is a lightweight, focused tokenizer (lexical scanner) for Rust.
It is well suited to small languages, config files, templating, and parser-combinator front ends when you want clarity, stable performance, and predictable errors without heavy dependencies.

Highlights

  • Small & fast: minimal core, no unnecessary allocations (zero-copy when possible).
  • Configurable via spec: describe tokens explicitly (identifiers, numbers, strings, whitespace, comments…).
  • Iterator-first API: stream tokens or peek without copying.
  • Robust errors: line/col positions, expected vs. found, ready-to-display diagnostics.
  • Test-friendly: deterministic output, easy snapshot / assertion testing.
  • No magic: no codegen macros; everything is explicit and maintainable.

Quick Start

Install

# Cargo.toml
[dependencies]
sentience-tokenize = "0.2.1"

Minimal usage

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "let x = 42;";
    let tokenizer = Tokenizer::new(input);

    // Each item is a token (or a lexing error); Debug-print them all.
    for token in tokenizer {
        println!("{:?}", token);
    }
}

Zero-copy scanning example

use sentience_tokenize::Tokenizer;
 
fn main() {
    let input = "foo bar baz";
    let tokenizer = Tokenizer::new(input);
 
    for token in tokenizer {
        // tokens borrow slices directly from the input
        println!("{:?}", token.lexeme());
    }
}
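
If you want to sanity-check the zero-copy claim, compare addresses: a borrowed lexeme's bytes must lie inside the input buffer. A minimal sketch, assuming lexeme() returns a &str borrowed from the input:

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "foo bar baz";
    let start = input.as_ptr() as usize;
    let end = start + input.len();

    // Skip lexing errors and check every successful token.
    for token in Tokenizer::new(input).flatten() {
        let lexeme = token.lexeme();
        let p = lexeme.as_ptr() as usize;
        // A borrowed slice's address range falls within `input`.
        assert!(p >= start && p + lexeme.len() <= end);
    }
    println!("all lexemes borrow from the input");
}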

ℹ️ See the full API docs on docs.rs: sentience-tokenize API

Features

  • Core tokens: identifiers, numbers (int/float), strings (quote variants), whitespace, comments.
  • Custom rules: easily extend with your own patterns.
  • Diagnostics: offset, line/column, expected vs. found.
  • Stable API surface: iterator interface, zero-copy variants, testing helpers.
  • Optional features: keep the core minimal and enable only what you need (see the Cargo.toml sketch below).
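
For illustration, trimming the crate down could look like the snippet below. The feature names here are hypothetical, so check the crate's docs.rs page for the actual feature list:

# Cargo.toml (feature names are hypothetical examples)
[dependencies]
sentience-tokenize = { version = "0.2.1", default-features = false, features = ["comments", "diagnostics"] }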

Spec

Aspect       Rule
Ident        [a-zA-Z_][a-zA-Z0-9_]*
Number       decimal/hex literals, with optional sign and fraction
String       "double quoted" or 'single quoted', with escape sequences
Whitespace   space, tab, newline
Comment      // line comments, /* block comments */

The tokenizer is fully configurable: you can enable/disable token kinds as needed.
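
The Ident rule above is simple enough to hand-roll, which makes the spec easy to reason about. As a standalone illustration (this is not the crate's internals), here is a scanner for [a-zA-Z_][a-zA-Z0-9_]*:

/// Returns the identifier at the start of `s`, if any,
/// matching [a-zA-Z_][a-zA-Z0-9_]*.
fn scan_ident(s: &str) -> Option<&str> {
    let mut chars = s.char_indices();
    // First character must be a letter or underscore.
    match chars.next() {
        Some((_, c)) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return None,
    }
    // Consume letters, digits, and underscores until the first mismatch.
    let end = chars
        .find(|&(_, c)| !(c.is_ascii_alphanumeric() || c == '_'))
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    Some(&s[..end])
}

fn main() {
    assert_eq!(scan_ident("foo_bar42 = 1"), Some("foo_bar42"));
    assert_eq!(scan_ident("42abc"), None);
}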

Error Reporting

When encountering unexpected input, errors include the token offset, line/col, and a human-readable message.

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "foo 123.abc";

    for tok in Tokenizer::new(input) {
        if let Err(err) = tok {
            eprintln!("Error: {:?}", err);
        }
    }
}

Output:

Error at 1:4: unexpected character '.' after number

Stable API Surface

  • Tokenizer::new(&str) — create a new scanner.
  • Iterator: each item is a Result, yielding a token on success or a lexing error (see Error Reporting above).
  • Zero-copy: token.lexeme() borrows from the original string.
  • Optional helpers for testing and diagnostics (see the snapshot-test sketch below).
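
Deterministic output makes snapshot-style assertions cheap: tokenize the same input twice (or against a stored expectation) and compare. A minimal sketch using only Debug formatting, so it does not depend on the token type's internals:

use sentience_tokenize::Tokenizer;

// Render every item (token or error) as its Debug string.
fn render(input: &str) -> Vec<String> {
    Tokenizer::new(input).map(|tok| format!("{:?}", tok)).collect()
}

#[test]
fn token_stream_is_deterministic() {
    // Two passes over the same input must agree exactly.
    assert_eq!(render("let x = 42;"), render("let x = 42;"));
}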

Iterator API Example

use sentience_tokenize::Tokenizer;
 
fn main() {
    let input = "let sum = a + b;";
    for token in Tokenizer::new(input) {
        println!("{:?}", token);
    }
}
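
Because the tokenizer is an ordinary Rust iterator, the "peek without copying" mentioned in the highlights falls out of std::iter::Peekable; nothing crate-specific is needed. A minimal sketch:

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "let sum = a + b;";
    let mut tokens = Tokenizer::new(input).peekable();

    // Inspect the next item without consuming it.
    if let Some(next) = tokens.peek() {
        println!("lookahead: {:?}", next);
    }

    // The peeked item is still yielded by the loop.
    for token in tokens {
        println!("{:?}", token);
    }
}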

Run Tests

cargo test

Example Binary

A simple demo binary is included:

cargo run --example demo -- "let x = 5 + 10;"

Dev / Benchmark / Fuzzing

  • Dev loop: cargo check
  • Benchmark: cargo bench
  • Fuzzing: cargo fuzz run tokenizer

Why?

  • Zero allocations for common cases.
  • Explicit error reporting and control.
  • Tailored for small DSLs, interpreters, or config languages.

Background

Originally built for experimental DSLs in Sentience, it has since been reused across projects that need a compact, predictable tokenizer.