sentience-tokenize is a lightweight, focused tokenizer / lexical scanner for Rust.
It is ideal for small languages, config files, templating, and parser combinators when you want clarity, stable performance, and predictable errors without heavy dependencies.
## Highlights
- Small & fast: minimal core, no unnecessary allocations (zero-copy when possible).
- Configurable via spec: describe tokens explicitly (identifiers, numbers, strings, whitespace, comments…).
- Iterator-first API: stream tokens or peek without copying.
- Robust errors: line/col positions, expected vs. found, ready-to-display diagnostics.
- Test-friendly: deterministic output, easy snapshot / assertion testing.
- No magic: no codegen macros; everything is explicit and maintainable.
## Quick Start
### Install
```toml
# Cargo.toml
[dependencies]
sentience-tokenize = "0.2.1"
```
### Minimal usage
```rust
use sentience_tokenize::Tokenizer;

fn main() {
    let input = "let x = 42;";
    let tokenizer = Tokenizer::new(input);
    for token in tokenizer {
        println!("{:?}", token);
    }
}
```
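Each item the iterator yields is either a token or an error; see Error Reporting below.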
### Zero-copy scanning
```rust
use sentience_tokenize::Tokenizer;

fn main() {
    let input = "foo bar baz";
    let tokenizer = Tokenizer::new(input);
    for token in tokenizer {
        // Tokens borrow slices directly from the input.
        println!("{:?}", token.lexeme());
    }
}
```
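Because tokens borrow directly from the input string, the input must outlive any tokens you keep around; this borrowing is what makes scanning allocation-free.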
ℹ️ See the full API docs on docs.rs: [sentience-tokenize API](https://docs.rs/sentience-tokenize)
## Features
- Core tokens: identifiers, numbers (int/float), strings (quote variants), whitespace, comments.
- Custom rules: easily extend with your own patterns.
- Diagnostics: offset, line/column, expected vs. found.
- Stable API surface: iterator interface, zero-copy variants, testing helpers.
- Optional features: keep core minimal, enable only what you need.
## Spec
| Aspect     | Rule |
|------------|------|
| Ident      | `[a-zA-Z_][a-zA-Z0-9_]*` |
| Number     | Decimal/hex literals, with optional sign and fraction |
| String     | `"double quoted"` or `'single quoted'`, with escape sequences |
| Whitespace | Space, tab, newline |
| Comment    | `// line comments`, `/* block comments */` |
The tokenizer is fully configurable: you can enable or disable token kinds as needed, as sketched below.
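The exact spec API is documented on docs.rs; the following is only a rough sketch of the enable/disable idea using local stand-in types (the `TokenKind` and `Spec` names here are illustrative, not the crate's actual API):

```rust
// Illustrative stand-in types; the crate's real spec API may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Ident,
    Number,
    String,
    Whitespace,
    Comment,
}

struct Spec {
    enabled: Vec<TokenKind>,
}

impl Spec {
    fn emits(&self, kind: TokenKind) -> bool {
        self.enabled.contains(&kind)
    }
}

fn main() {
    // Emit identifiers and numbers; skip whitespace and comments.
    let spec = Spec {
        enabled: vec![TokenKind::Ident, TokenKind::Number],
    };
    assert!(spec.emits(TokenKind::Number));
    assert!(!spec.emits(TokenKind::Comment));
}
```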
## Error Reporting
When it encounters unexpected input, the tokenizer reports errors with the offset, line/column position, and a human-readable message.
let input = "foo 123.abc";
let mut t = Tokenizer::new(input);
for tok in t {
if let Err(err) = tok {
eprintln!("Error: {:?}", err);
}
}
Output:

```text
Error at 1:4: unexpected character '.' after number
```
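Because the iterator yields `Result` items, the standard adapters can also split successes from failures; a minimal sketch using `Iterator::partition`:

```rust
use sentience_tokenize::Tokenizer;

fn main() {
    let input = "foo 123.abc";
    // Separate successfully scanned tokens from errors.
    let (tokens, errors): (Vec<_>, Vec<_>) =
        Tokenizer::new(input).partition(|item| item.is_ok());
    println!("{} tokens, {} errors", tokens.len(), errors.len());
}
```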
## Stable API Surface

- `Tokenizer::new(&str)`: create a new scanner.
- `Iterator<Item = Token>`: iterate over tokens or errors.
- Zero-copy: `token.lexeme()` borrows from the original string.
- Optional helpers for testing and diagnostics.
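Because the tokenizer is an ordinary `Iterator`, the standard adapters apply; for example, lookahead via `std`'s `Peekable` (a sketch, assuming the item type implements `Debug` as in the examples above):

```rust
use sentience_tokenize::Tokenizer;

fn main() {
    let mut tokens = Tokenizer::new("let x = 42;").peekable();
    // Inspect the next item without consuming it.
    if let Some(next) = tokens.peek() {
        println!("peeked: {:?}", next);
    }
    // The peeked item is still yielded by the iterator.
    for token in tokens {
        println!("{:?}", token);
    }
}
```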
## Iterator API Example
```rust
use sentience_tokenize::Tokenizer;

fn main() {
    let input = "let sum = a + b;";
    for token in Tokenizer::new(input) {
        println!("{:?}", token);
    }
}
```
## Run Tests
```sh
cargo test
```
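Assertion-style tests need nothing beyond the iterator interface; a minimal sketch (the test name and expectation are illustrative):

```rust
use sentience_tokenize::Tokenizer;

#[test]
fn scans_a_let_binding() {
    // Deterministic Debug output also makes this amenable to
    // snapshot-style comparisons.
    let rendered: Vec<String> = Tokenizer::new("let x = 42;")
        .map(|item| format!("{:?}", item))
        .collect();
    assert!(!rendered.is_empty());
}
```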
## Example Binary
A simple demo binary is included:
```sh
cargo run --example demo -- "let x = 5 + 10;"
```
## Dev / Benchmark / Fuzzing
- Dev loop: `cargo check`
- Benchmark: `cargo bench`
- Fuzzing: `cargo fuzz run tokenizer` (a sketch of the target follows)
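The `tokenizer` fuzz target presumably follows the standard cargo-fuzz layout under `fuzz/fuzz_targets/`; a sketch of what such a target looks like (the body here is an assumption, not the repository's actual target):

```rust
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Tokenizing arbitrary input should never panic.
    if let Ok(text) = std::str::from_utf8(data) {
        for _ in sentience_tokenize::Tokenizer::new(text) {}
    }
});
```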
## Why?
- Zero allocations for common cases.
- Explicit error reporting and control.
- Tailored for small DSLs, interpreters, or config languages.
## Background
Originally built for experimental DSLs in Sentience and reused across projects where a compact and predictable tokenizer was needed.