regex.pcre Module Documentation

The regex.pcre module is a high-performance regular expression engine for V, built on a virtual machine (VM).

Key Features

Non-recursive VM: Safe execution that avoids stack overflows on complex patterns.
Zero-Allocation Search: Uses a pre-allocated Machine workspace for search operations.
Fast ASCII Path: Optimized path for characters < 128 to bypass heavy UTF-8 decoding.
Bitmap Lookups: ASCII character classes use a 128-bit bitset for O(1) matching.
Instruction Merging: Consecutive character matches are merged into string blocks for faster execution.
NFA Virtual Machine: Executes bytecode instructions to simulate pattern matching.
Dynamic Stack Growth: Automatically expands the backtracking stack to prevent false negatives.
Anchored Optimization: Patterns starting with '^' skip the scanning loop.
Prefix Skipping: Uses Boyer-Moore-like skipping for literal prefixes.
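
The fast ASCII path can be illustrated with a small standalone decoder. This is a minimal sketch, not the module's own code: `decode_at` is a hypothetical helper, and it assumes the input is valid UTF-8 (it performs no validation).

```v
// decode_at returns the code point that starts at byte index i of s,
// together with its width in bytes. Hypothetical helper; assumes valid UTF-8.
fn decode_at(s string, i int) (rune, int) {
    b := s[i]
    if b < 128 {
        // Fast path: an ASCII byte is its own code point.
        return rune(b), 1
    }
    // Slow path: take the payload bits of the lead byte,
    // then append six bits from each continuation byte.
    mut width := 2
    mut cp := u32(b & 0x1f)
    if b >= 0xf0 {
        width = 4
        cp = u32(b & 0x07)
    } else if b >= 0xe0 {
        width = 3
        cp = u32(b & 0x0f)
    }
    for k in 1 .. width {
        cp = (cp << 6) | u32(s[i + k] & 0x3f)
    }
    return rune(cp), width
}

fn main() {
    text := 'aé'
    c1, w1 := decode_at(text, 0)
    c2, w2 := decode_at(text, 1)
    println('${c1} ${w1}') // a 1
    println('${c2} ${w2}') // é 2
}
```

For text that is mostly ASCII, almost every call returns from the first branch, so the slow path is rarely reached.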

Supported Syntax

| Feature | Syntax | Description |
| :--- | :--- | :--- |
| Literals | `abc` | Matches exact characters (UTF-8 supported). |
| Wildcard | `.` | Matches any character (excluding `\n` unless the `(?s)` flag is used). |
| Alternation | `\|` | Matches the left OR right expression (e.g., `cat\|dog`). |
| Quantifiers | `*`, `+`, `?` | Matches 0+, 1+, or 0-1 times. |
| Lazy | `*?`, `+?`, `??` | Non-greedy versions of the above. |
| Repetition | `{m,n}` | Matches between m and n times. `{m,}` for m or more. |
| Groups | `(...)` | Capturing group. |
| | `(?:...)` | Non-capturing group. |
| | `(?P<name>...)` | Named capturing group. |
| Anchors | `^`, `$` | Start/End of string (or line with `(?m)`). |
| | `\b`, `\B` | Word boundary and Non-word boundary. |
| Classes | `[abc]`, `[^abc]` | Character set and Negated character set. |
| | `[a-z]` | Range of characters. |
| | `\w`, `\W` | Word / Non-word (`[a-zA-Z0-9_]`). |
| | `\d`, `\D` | Digit / Non-digit. |
| | `\s`, `\S` | Whitespace / Non-whitespace (space, `\t`, `\n`, `\r`, `\v`, `\f`). |
| | `\a`, `\A` | Lowercase / Uppercase ASCII character class. |
| Flags | `(?i)` | Case-insensitive matching. |
| | `(?m)` | Multiline mode (`^` and `$` match start/end of lines). |
| | `(?s)` | Dot-all mode (`.` matches newlines). |

Structs

Regex

The compiled regular expression object.

pub struct Regex {
pub:
    pattern      string         // The original pattern
    prog         []Inst         // Compiled VM bytecode
    total_groups int            // Number of capture groups
    group_map    map[string]int // Map for named groups
}

Match

Represents the result of a successful search.

pub struct Match {
pub:
    text   string   // The full substring that matched
    start  int      // Byte index where match starts
    end    int      // Byte index where match ends
    groups []string // Text captured by each group
}

Core Functions

`compile`

Compiles a pattern into a Regex object.

fn compile(pattern string) !Regex
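
Because `compile` returns a result type (`!Regex`), a failed compilation has to be handled or propagated by the caller. A minimal sketch of handling it with an `or` block follows; `compile_or_report` is a hypothetical wrapper, not part of the module.

```v
import regex.pcre

// compile_or_report compiles `pattern` and logs the error on failure.
// Hypothetical wrapper used only for this illustration.
fn compile_or_report(pattern string) ?pcre.Regex {
    r := pcre.compile(pattern) or {
        eprintln('invalid pattern "${pattern}": ${err}')
        return none
    }
    return r
}

fn main() {
    if r := compile_or_report(r'(\d+)-(\d+)') {
        // `pattern` is the public field holding the original pattern text.
        println(r.pattern)
    }
}
```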

`find`

Finds the first match in the text. Returns none if no match is found.

fn (r Regex) find(text string) ?Match
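
Since `find` returns an option (`?Match`), the "no match" case can be unwrapped with `if`, `or`, or `?`. The sketch below uses only `compile`, `find`, and the `text` field described above; `first_number` is a hypothetical helper.

```v
import regex.pcre

// first_number returns the first run of digits in `text`, or none.
// Hypothetical helper; it recompiles the pattern on every call for brevity.
fn first_number(text string) ?string {
    r := pcre.compile(r'\d+') or { return none }
    m := r.find(text)? // propagates none when nothing matches
    return m.text
}

fn main() {
    if n := first_number('order 1234 shipped') {
        println(n)
    } else {
        println('no digits found')
    }
}
```

In real code the compiled Regex would normally be created once and reused across calls.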

`find_all`

Returns all non-overlapping matches in a string.

fn (r Regex) find_all(text string) []Match
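
A short sketch of iterating over the result of `find_all` and collecting the matched text; `words` is a hypothetical helper built only on the functions and fields documented here.

```v
import regex.pcre

// words returns the text of every run of lowercase ASCII letters in `text`.
// Hypothetical helper for illustration.
fn words(text string) ![]string {
    r := pcre.compile(r'[a-z]+')!
    mut out := []string{}
    for m in r.find_all(text) {
        out << m.text
    }
    return out
}

fn main() {
    found := words('red green blue') or { []string{} }
    println(found)
}
```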

`replace`

Replaces the first match in text with repl. Supports backreferences like $1, $2.

fn (r Regex) replace(text string, repl string) string
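
The following sketch shows a replacement that uses backreferences. It assumes that `$1` and `$2` refer to the first and second capture groups, as the description suggests; under that assumption `alice@example` should become `example at alice`. `swap_parts` is a hypothetical helper, and the replacement is written as a raw string so that V passes `$1` and `$2` through unchanged.

```v
import regex.pcre

// swap_parts rewrites the first `name@host` occurrence as `host at name`.
// Hypothetical helper; assumes $1/$2 map to capture groups 1 and 2.
fn swap_parts(text string) !string {
    r := pcre.compile(r'(\w+)@(\w+)')!
    return r.replace(text, r'$2 at $1')
}

fn main() {
    println(swap_parts('alice@example') or { 'invalid pattern' })
}
```

Only the first match is replaced, so any later `name@host` occurrence in the same text is left as it is.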

`change_stack_depth`

Updates the maximum backtracking depth for the VM. The default is 1024. Raise it if a very complex pattern makes a search return none prematurely.

fn (mut r Regex) change_stack_depth(depth int)
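
Because `change_stack_depth` takes a mutable receiver, the Regex must be declared with `mut`. A minimal sketch follows; `build_regex` is a hypothetical helper, and the pattern and the depth of 8192 are arbitrary illustration values, not recommendations.

```v
import regex.pcre

// build_regex compiles a backtracking-heavy pattern and sets its stack depth.
// Hypothetical helper for illustration.
fn build_regex(depth int) !pcre.Regex {
    mut r := pcre.compile(r'(a|b)*c')!
    r.change_stack_depth(depth)
    return r
}

fn main() {
    r := build_regex(8192) or {
        eprintln('invalid pattern: ${err}')
        return
    }
    // A long input that forces many loop iterations before the final 'c'.
    if m := r.find('ab'.repeat(2000) + 'c') {
        println('matched ${m.text.len} bytes')
    } else {
        println('no match')
    }
}
```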

Named Groups Example

import regex.pcre

fn main() {
    r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
    m := r.find('Date: 2026-02') or { return }

    year := r.group_by_name(m, 'year')
    month := r.group_by_name(m, 'month')
    println('Year: ${year}, Month: ${month}') // Year: 2026, Month: 02
}

PCRE Compatibility Layer

To ease migration from other engines, the module provides a compatibility layer.

Example:

import regex.pcre

r := pcre.new_regex(r'(\w+) (\w+)', 0)!
if m := r.match_str('hello world', 0, 0) {
    println(m.get(0)?) // "hello world"
    println(m.get(1)?) // "hello"
    println(m.get(2)?) // "world"
}

Performance Note

The engine implements the following optimizations:

Raw Pointer Access: The VM bypasses standard array bounds checking by using unsafe pointer arithmetic for both the instruction set and the string text, significantly speeding up the hot loop.
Zero-Allocation Search: The Machine struct pre-allocates the backtracking stack and capture arrays, so running a search performs no new heap allocations and adds no garbage-collection pressure.
Fast ASCII Path: The code checks if a byte is < 128 before decoding. If it is ASCII, it skips the expensive UTF-8 decoding logic entirely.
Bitmap Class Lookups: Character classes (like \w, \d, [a-z]) use a 128-bit bitset. Checking if an ASCII character matches a class is a single O(1) bitwise operation.
Instruction Merging: The compiler groups consecutive literal characters into a single string instruction (e.g., a, b, c becomes "abc"), reducing the number of VM cycles required.
Prefix Skipping: If a pattern starts with a literal string, the engine scans ahead for that substring (Boyer-Moore style) before initializing the VM, avoiding useless execution.
Anchored Optimization: If the pattern starts with ^, the engine only attempts a match at the start of the string (or line), skipping the character-by-character scan of the rest of the text.
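
The bitmap class lookup described above can be sketched in a few lines of standalone V. This is an illustration of the technique under stated assumptions, not the module's internal code: `AsciiSet`, `add_range`, `has`, and `word_class` are hypothetical names, and the 128 bits are stored as two `u64` words.

```v
// AsciiSet is a hypothetical 128-bit set: bit n is set when ASCII byte n belongs to the class.
struct AsciiSet {
mut:
    bits [2]u64
}

// add_range marks every byte from lo to hi (inclusive) as a member.
// Both bounds must be below 128.
fn (mut s AsciiSet) add_range(lo u8, hi u8) {
    for c in int(lo) .. int(hi) + 1 {
        s.bits[c >> 6] |= u64(1) << u32(c & 63)
    }
}

// has tests membership with one shift and one mask; bytes >= 128 never match here.
fn (s AsciiSet) has(c u8) bool {
    if c >= 128 {
        return false
    }
    word := s.bits[int(c) >> 6]
    return ((word >> u32(c & 63)) & 1) == 1
}

// word_class builds the set for \w, i.e. [a-zA-Z0-9_].
fn word_class() AsciiSet {
    mut s := AsciiSet{}
    s.add_range(u8(`a`), u8(`z`))
    s.add_range(u8(`A`), u8(`Z`))
    s.add_range(u8(`0`), u8(`9`))
    s.add_range(u8(`_`), u8(`_`))
    return s
}

fn main() {
    w := word_class()
    println(w.has(u8(`k`))) // true
    println(w.has(u8(`-`))) // false
}
```

The cost of `has` does not depend on how many ranges the class contains, which is what makes the lookup O(1).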