Tiny, fast, configurable tokenizer for Rust: sentience-tokenize

2025-08-19 Rust, Tokenizer, Iterator API, Zero-copy, Diagnostics

sentience-tokenize is a lightweight, focused tokenizer (lexical scanner) for Rust.
It is well suited to small languages, config files, templating, and parser-combinator front ends when you want clarity, stable performance, and predictable errors without heavy dependencies.

Highlights

  • Small & fast: minimal core, no unnecessary allocations (zero-copy when possible).
  • Configurable via spec: describe tokens explicitly (identifiers, numbers, strings, whitespace, comments…).
  • Iterator-first API: stream tokens or peek without copying.
  • Robust errors: line/col positions, expected vs. found, ready-to-display diagnostics.
  • Test-friendly: deterministic output, easy snapshot / assertion testing.
  • No magic: no codegen macros; everything is explicit and maintainable.

Quick Start

Install

# Cargo.toml
[dependencies]
sentience-tokenize = "0.2.1"

Minimal usage

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "let x = 42;";
    let tokenizer = Tokenizer::new(input);

    // Each item is a token (or a lexing error); Debug-print them all.
    for token in tokenizer {
        println!("{:?}", token);
    }
}

Zero-copy scanning example

use sentience_tokenize::Tokenizer;
 
fn main() {
    let input = "foo bar baz";
    let tokenizer = Tokenizer::new(input);
 
    for token in tokenizer {
        // tokens borrow slices directly from the input
        println!("{:?}", token.lexeme());
    }
}
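
If you want to sanity-check the zero-copy claim, compare addresses: a borrowed lexeme's bytes must lie inside the input buffer. A minimal sketch, assuming lexeme() returns a &str borrowed from the input:

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "foo bar baz";
    let start = input.as_ptr() as usize;
    let end = start + input.len();

    // Skip lexing errors and check every successful token.
    for token in Tokenizer::new(input).flatten() {
        let lexeme = token.lexeme();
        let p = lexeme.as_ptr() as usize;
        // A borrowed slice's address range falls within `input`.
        assert!(p >= start && p + lexeme.len() <= end);
    }
    println!("all lexemes borrow from the input");
}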

ℹ️ See the full API docs on docs.rs: sentience-tokenize API

Features

  • Core tokens: identifiers, numbers (int/float), strings (quote variants), whitespace, comments.
  • Custom rules: easily extend with your own patterns.
  • Diagnostics: offset, line/column, expected vs. found.
  • Stable API surface: iterator interface, zero-copy variants, testing helpers.
  • Optional features: keep the core minimal and enable only what you need (see the Cargo.toml sketch below).
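
For illustration, trimming the crate down could look like the snippet below. The feature names here are hypothetical, so check the crate's docs.rs page for the actual feature list:

# Cargo.toml (feature names are hypothetical examples)
[dependencies]
sentience-tokenize = { version = "0.2.1", default-features = false, features = ["comments", "diagnostics"] }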

Spec

Aspect       Rule
Ident        [a-zA-Z_][a-zA-Z0-9_]*
Number       decimal/hex literals, with optional sign and fraction
String       "double quoted" or 'single quoted', with escape sequences
Whitespace   space, tab, newline
Comment      // line comments, /* block comments */

The tokenizer is fully configurable: you can enable/disable token kinds as needed.
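
The Ident rule above is simple enough to hand-roll, which makes the spec easy to reason about. As a standalone illustration (this is not the crate's internals), here is a scanner for [a-zA-Z_][a-zA-Z0-9_]*:

/// Returns the identifier at the start of `s`, if any,
/// matching [a-zA-Z_][a-zA-Z0-9_]*.
fn scan_ident(s: &str) -> Option<&str> {
    let mut chars = s.char_indices();
    // First character must be a letter or underscore.
    match chars.next() {
        Some((_, c)) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return None,
    }
    // Consume letters, digits, and underscores until the first mismatch.
    let end = chars
        .find(|&(_, c)| !(c.is_ascii_alphanumeric() || c == '_'))
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    Some(&s[..end])
}

fn main() {
    assert_eq!(scan_ident("foo_bar42 = 1"), Some("foo_bar42"));
    assert_eq!(scan_ident("42abc"), None);
}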

Error Reporting

When encountering unexpected input, errors include the token offset, line/col, and a human-readable message.

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "foo 123.abc";

    for tok in Tokenizer::new(input) {
        if let Err(err) = tok {
            eprintln!("Error: {:?}", err);
        }
    }
}

Output:

Error at 1:4: unexpected character '.' after number

Stable API Surface

  • Tokenizer::new(&str) — create a new scanner.
  • Iterator: each item is a Result, yielding a token on success or a lexing error (see Error Reporting above).
  • Zero-copy: token.lexeme() borrows from the original string.
  • Optional helpers for testing and diagnostics (see the snapshot-test sketch below).
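
Deterministic output makes snapshot-style assertions cheap: tokenize the same input twice (or against a stored expectation) and compare. A minimal sketch using only Debug formatting, so it does not depend on the token type's internals:

use sentience_tokenize::Tokenizer;

// Render every item (token or error) as its Debug string.
fn render(input: &str) -> Vec<String> {
    Tokenizer::new(input).map(|tok| format!("{:?}", tok)).collect()
}

#[test]
fn token_stream_is_deterministic() {
    // Two passes over the same input must agree exactly.
    assert_eq!(render("let x = 42;"), render("let x = 42;"));
}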

Iterator API Example

use sentience_tokenize::Tokenizer;
 
fn main() {
    let input = "let sum = a + b;";
    for token in Tokenizer::new(input) {
        println!("{:?}", token);
    }
}
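
Because the tokenizer is an ordinary Rust iterator, the "peek without copying" mentioned in the highlights falls out of std::iter::Peekable; nothing crate-specific is needed. A minimal sketch:

use sentience_tokenize::Tokenizer;

fn main() {
    let input = "let sum = a + b;";
    let mut tokens = Tokenizer::new(input).peekable();

    // Inspect the next item without consuming it.
    if let Some(next) = tokens.peek() {
        println!("lookahead: {:?}", next);
    }

    // The peeked item is still yielded by the loop.
    for token in tokens {
        println!("{:?}", token);
    }
}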

Run Tests

cargo test

Example Binary

A simple demo binary is included:

cargo run --example demo -- "let x = 5 + 10;"

Dev / Benchmark / Fuzzing

  • Dev loop: cargo check
  • Benchmark: cargo bench
  • Fuzzing: cargo fuzz run tokenizer

Why?

  • Zero allocations for common cases.
  • Explicit error reporting and control.
  • Tailored for small DSLs, interpreters, or config languages.

Background

Originally built for experimental DSLs in Sentience, it has since been reused across projects that need a compact, predictable tokenizer.