Lexing and Parsing: Tokenization and AST Construction

Key Insights

  • Lexers and parsers form a two-stage pipeline that transforms raw text into structured data—understanding this separation makes building language tools dramatically simpler.
  • Recursive descent parsing maps grammar rules directly to functions, making it intuitive to implement and debug without external tooling.
  • Good AST design captures semantic meaning while discarding syntactic noise, enabling clean downstream processing for interpreters, compilers, or analysis tools.

Why Build Your Own Parser?

Parsers appear everywhere in software engineering. Compilers and interpreters are the obvious examples, but you’ll also find parsing logic in configuration file readers, template engines, linters, code formatters, query languages, and domain-specific languages (DSLs). Even Markdown renderers and JSON parsers follow the same fundamental patterns.

Understanding how to build a parser from scratch gives you the ability to create custom languages tailored to your domain, build better tooling, and debug mysterious syntax errors in existing tools. More practically, it demystifies what happens between typing code and executing it.

The parsing pipeline splits into two distinct phases:

  1. Lexical analysis (lexing): Transform raw characters into tokens
  2. Syntactic analysis (parsing): Transform tokens into an Abstract Syntax Tree (AST)

This separation of concerns makes each phase simpler to reason about and test independently.

Lexical Analysis Fundamentals

A lexer (also called a tokenizer or scanner) reads raw text and produces a stream of tokens. Three concepts matter here:

  • Lexeme: The actual character sequence in the source ("42", "while", "+=")
  • Token: A categorized lexeme with metadata (type, position, value)
  • Pattern: The rule that matches a lexeme to a token type (regex, character class, or explicit match)

Token types typically include keywords (if, while, return), identifiers (variable names), literals (numbers, strings), operators (+, ==, &&), and delimiters ((, ), {, }, ;).

Whitespace handling varies by language. Most lexers skip whitespace between tokens, but languages like Python require tracking indentation. Position tracking (line and column numbers) is essential for producing useful error messages.

Here’s a token definition for a simple expression language:

from enum import Enum, auto
from dataclasses import dataclass
from typing import Any

class TokenType(Enum):
    NUMBER = auto()
    IDENTIFIER = auto()
    PLUS = auto()
    MINUS = auto()
    STAR = auto()
    SLASH = auto()
    LPAREN = auto()
    RPAREN = auto()
    EOF = auto()
    ERROR = auto()

@dataclass
class Token:
    type: TokenType
    lexeme: str
    value: Any
    line: int
    column: int

Building a Lexer

A lexer scans input character by character, matching patterns and emitting tokens. The core loop advances through the source, identifies token boundaries, and constructs token objects.

Lookahead handles multi-character tokens. When you see =, you need to peek at the next character to distinguish = from ==. Most languages require only single-character lookahead, though some edge cases need more.
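The expression lexer below has no `=` operator, but the technique can be sketched in isolation. This is a minimal, standalone illustration; `TwoCharLexer` and `match_next` are hypothetical names, not part of the lexer that follows:

```python
class TwoCharLexer:
    """Sketch: distinguishing '=' from '==' with one character of lookahead."""

    def __init__(self, source: str):
        self.source = source
        self.pos = 0

    def match_next(self, expected: str) -> bool:
        # Consume the next character only if it matches; otherwise leave it.
        if self.pos < len(self.source) and self.source[self.pos] == expected:
            self.pos += 1
            return True
        return False

    def scan_equals(self) -> str:
        self.pos += 1  # consume the first '='
        return "EQ_EQ" if self.match_next("=") else "ASSIGN"

print(TwoCharLexer("==").scan_equals())  # EQ_EQ
print(TwoCharLexer("=x").scan_equals())  # ASSIGN
```

The key property is that `match_next` only consumes on a match, so a lone `=` leaves the following character for the next token.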

class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1
        self.column = 1
        self.start = 0
        self.start_column = 1
    
    def peek(self) -> str:
        if self.pos >= len(self.source):
            return '\0'
        return self.source[self.pos]
    
    def advance(self) -> str:
        char = self.peek()
        self.pos += 1
        if char == '\n':
            self.line += 1
            self.column = 1
        else:
            self.column += 1
        return char
    
    def skip_whitespace(self):
        while self.peek() in ' \t\n\r':
            self.advance()
    
    def make_token(self, token_type: TokenType, value: Any = None) -> Token:
        lexeme = self.source[self.start:self.pos]
        return Token(token_type, lexeme, value, self.line, self.start_column)
    
    def number(self) -> Token:
        while self.peek().isdigit():
            self.advance()
        
        # Handle decimals
        if self.peek() == '.' and self.source[self.pos + 1:self.pos + 2].isdigit():
            self.advance()  # consume '.'
            while self.peek().isdigit():
                self.advance()
        
        lexeme = self.source[self.start:self.pos]
        return self.make_token(TokenType.NUMBER, float(lexeme))
    
    def identifier(self) -> Token:
        while self.peek().isalnum() or self.peek() == '_':
            self.advance()
        return self.make_token(TokenType.IDENTIFIER)
    
    def next_token(self) -> Token:
        self.skip_whitespace()
        self.start = self.pos
        self.start_column = self.column
        
        if self.pos >= len(self.source):
            return self.make_token(TokenType.EOF)
        
        char = self.advance()
        
        if char.isdigit():
            return self.number()
        if char.isalpha() or char == '_':
            return self.identifier()
        
        simple_tokens = {
            '+': TokenType.PLUS,
            '-': TokenType.MINUS,
            '*': TokenType.STAR,
            '/': TokenType.SLASH,
            '(': TokenType.LPAREN,
            ')': TokenType.RPAREN,
        }
        
        if char in simple_tokens:
            return self.make_token(simple_tokens[char])
        
        return self.make_token(TokenType.ERROR, f"Unexpected character: {char}")
    
    def tokenize(self) -> list[Token]:
        tokens = []
        while True:
            token = self.next_token()
            tokens.append(token)
            if token.type == TokenType.EOF:
                break
        return tokens

Testing this lexer:

lexer = Lexer("3 + 42 * x")
for token in lexer.tokenize():
    print(f"{token.type.name}: '{token.lexeme}'")

# Output:
# NUMBER: '3'
# PLUS: '+'
# NUMBER: '42'
# STAR: '*'
# IDENTIFIER: 'x'
# EOF: ''

Parsing Theory: Grammars and Precedence

Context-free grammars (CFGs) formally describe language syntax. BNF (Backus-Naur Form) notation expresses these grammars concisely. Each rule defines how a non-terminal symbol expands into terminals and other non-terminals.

Here’s a grammar for our expression language:

expression  → term (('+' | '-') term)*
term        → factor (('*' | '/') factor)*
factor      → NUMBER | IDENTIFIER | '(' expression ')'

This grammar encodes operator precedence through its structure. Multiplication and division bind tighter than addition and subtraction because term handles * and /, while expression handles + and -. The parser descends through expression → term → factor, so multiplicative operators are grouped first.

Recursive descent parsing translates each grammar rule into a function. The function consumes tokens matching its rule and returns an AST node. This approach is straightforward to implement, debug, and extend.

Recursive Descent Parsing

First, define AST nodes that capture the semantic structure:

from abc import ABC
from dataclasses import dataclass

class Expr(ABC):
    pass

@dataclass
class NumberExpr(Expr):
    value: float
    token: Token

@dataclass
class IdentifierExpr(Expr):
    name: str
    token: Token

@dataclass
class BinaryExpr(Expr):
    left: Expr
    operator: Token
    right: Expr

@dataclass
class GroupingExpr(Expr):
    expression: Expr

Now the parser, with each grammar rule becoming a method:

class Parser:
    def __init__(self, tokens: list[Token]):
        self.tokens = tokens
        self.pos = 0
    
    def peek(self) -> Token:
        return self.tokens[self.pos]
    
    def previous(self) -> Token:
        return self.tokens[self.pos - 1]
    
    def is_at_end(self) -> bool:
        return self.peek().type == TokenType.EOF
    
    def advance(self) -> Token:
        if not self.is_at_end():
            self.pos += 1
        return self.previous()
    
    def check(self, token_type: TokenType) -> bool:
        if self.is_at_end():
            return False
        return self.peek().type == token_type
    
    def match(self, *types: TokenType) -> bool:
        for token_type in types:
            if self.check(token_type):
                self.advance()
                return True
        return False
    
    def consume(self, token_type: TokenType, message: str) -> Token:
        if self.check(token_type):
            return self.advance()
        raise SyntaxError(f"{message} at line {self.peek().line}")
    
    def parse(self) -> Expr:
        return self.expression()
    
    # expression → term (('+' | '-') term)*
    def expression(self) -> Expr:
        expr = self.term()
        
        while self.match(TokenType.PLUS, TokenType.MINUS):
            operator = self.previous()
            right = self.term()
            expr = BinaryExpr(expr, operator, right)
        
        return expr
    
    # term → factor (('*' | '/') factor)*
    def term(self) -> Expr:
        expr = self.factor()
        
        while self.match(TokenType.STAR, TokenType.SLASH):
            operator = self.previous()
            right = self.factor()
            expr = BinaryExpr(expr, operator, right)
        
        return expr
    
    # factor → NUMBER | IDENTIFIER | '(' expression ')'
    def factor(self) -> Expr:
        if self.match(TokenType.NUMBER):
            token = self.previous()
            return NumberExpr(token.value, token)
        
        if self.match(TokenType.IDENTIFIER):
            token = self.previous()
            return IdentifierExpr(token.lexeme, token)
        
        if self.match(TokenType.LPAREN):
            expr = self.expression()
            self.consume(TokenType.RPAREN, "Expected ')' after expression")
            return GroupingExpr(expr)
        
        raise SyntaxError(f"Expected expression at line {self.peek().line}")

AST Design and Construction

The AST captures semantic meaning while discarding syntactic details like parentheses (captured implicitly through tree structure) and whitespace. Each node type represents a distinct language construct.

Good AST design principles:

  1. Preserve source locations for error messages and tooling
  2. Use specific node types rather than generic “node with children” structures
  3. Capture semantic information like resolved operator precedence

Here’s the complete pipeline in action:

def parse_expression(source: str) -> Expr:
    lexer = Lexer(source)
    tokens = lexer.tokenize()
    parser = Parser(tokens)
    return parser.parse()

def print_ast(expr: Expr, indent: int = 0) -> None:
    prefix = "  " * indent
    match expr:
        case NumberExpr(value, _):
            print(f"{prefix}Number({value})")
        case IdentifierExpr(name, _):
            print(f"{prefix}Identifier({name})")
        case BinaryExpr(left, op, right):
            print(f"{prefix}Binary({op.lexeme})")
            print_ast(left, indent + 1)
            print_ast(right, indent + 1)
        case GroupingExpr(inner):
            print(f"{prefix}Group")
            print_ast(inner, indent + 1)

# Parse and visualize
ast = parse_expression("3 + 42 * x")
print_ast(ast)

# Output:
# Binary(+)
#   Number(3.0)
#   Binary(*)
#     Number(42.0)
#     Identifier(x)

Notice how 42 * x becomes a subtree of the addition, correctly representing that multiplication binds tighter.

Practical Considerations and Next Steps

Error recovery makes parsers production-ready. When encountering an error, synchronize to a known state (like the next statement) rather than aborting completely. This lets you report multiple errors in one pass.
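This "panic mode" recovery can be sketched standalone. The expression parser above has no statements to synchronize to, so the token types here (SEMICOLON, IF, WHILE, RETURN) are hypothetical additions for a statement-oriented language:

```python
from enum import Enum, auto

# Hypothetical token types for a language with statements; the expression
# parser above has none of these, so this illustrates the technique only.
class TT(Enum):
    SEMICOLON = auto()
    IF = auto()
    WHILE = auto()
    RETURN = auto()
    OTHER = auto()
    EOF = auto()

def synchronize(tokens: list[TT], pos: int) -> int:
    """Skip tokens until a likely statement boundary; return the new position."""
    pos += 1  # skip the token that triggered the error
    while tokens[pos] != TT.EOF:
        if tokens[pos - 1] == TT.SEMICOLON:
            return pos  # just past the end of the broken statement
        if tokens[pos] in (TT.IF, TT.WHILE, TT.RETURN):
            return pos  # a keyword that usually starts a new statement
        pos += 1
    return pos

toks = [TT.OTHER, TT.OTHER, TT.SEMICOLON, TT.WHILE, TT.OTHER, TT.EOF]
print(synchronize(toks, 0))  # 3: resumes at the WHILE after the semicolon
```

After synchronizing, the parser resumes at a point where a fresh statement is likely to begin, so one malformed statement doesn't cascade into spurious errors for the rest of the file.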

Testing strategies for lexers and parsers:

  • Test individual tokens in isolation
  • Test grammar rules with minimal inputs
  • Test precedence with expressions like 1 + 2 * 3
  • Test error cases and verify error messages include line numbers

Parser generators like ANTLR, tree-sitter, or PEG libraries generate parsers from grammar specifications. They’re valuable for complex languages but add dependencies and reduce control. Start with hand-written parsers to understand the fundamentals, then evaluate generators for larger projects.

From here, you can extend this foundation to build interpreters (walk the AST and evaluate), compilers (transform AST to target code), or analysis tools (traverse AST to find patterns). The lexer-parser-AST pipeline remains the same regardless of what you build on top.
