> cat backus-naur-form:-syntax-specification-for-language-design.md

Backus-Naur Form: Syntax Specification for Language Design

📅

The specification of programming language syntax requires a notation that is both mathematically precise and practically useful for parser generation. Backus-Naur Form (BNF), developed independently by John Backus and Peter Naur in the late 1950s, provides exactly this capability through a context-free grammar notation that has become the standard for describing language syntax.

Basic Notation

BNF uses a simple set of symbols to define grammar rules:

  • ::= means “is defined as”
  • | means “or” (alternative productions)
  • <symbol> denotes a non-terminal (to be further defined)
  • Literal text appears without brackets

A complete grammar consists of production rules that recursively define the language structure.

Fundamental Example

Consider a simple arithmetic expression grammar:

expression::=termexpression+termexpressionterm\langle\text{expression}\rangle ::= \langle\text{term}\rangle \mid \langle\text{expression}\rangle + \langle\text{term}\rangle \mid \langle\text{expression}\rangle - \langle\text{term}\rangle

term::=factortermfactorterm/factor\langle\text{term}\rangle ::= \langle\text{factor}\rangle \mid \langle\text{term}\rangle * \langle\text{factor}\rangle \mid \langle\text{term}\rangle / \langle\text{factor}\rangle

factor::=number(expression)\langle\text{factor}\rangle ::= \langle\text{number}\rangle \mid ( \langle\text{expression}\rangle )

number::=digitnumberdigit\langle\text{number}\rangle ::= \langle\text{digit}\rangle \mid \langle\text{number}\rangle \langle\text{digit}\rangle

digit::=0123456789\langle\text{digit}\rangle ::= 0 \mid 1 \mid 2 \mid 3 \mid 4 \mid 5 \mid 6 \mid 7 \mid 8 \mid 9

This grammar captures operator precedence through the hierarchical structure: multiplication and division bind more tightly than addition and subtraction, whilst parentheses override precedence.

Language Definition

The same notation scales to define complete programming languages. A simplified variable assignment might be specified as:

statement::=assignmentif-statementwhile-statement\langle\text{statement}\rangle ::= \langle\text{assignment}\rangle \mid \langle\text{if-statement}\rangle \mid \langle\text{while-statement}\rangle

assignment::=identifier=expression\langle\text{assignment}\rangle ::= \langle\text{identifier}\rangle = \langle\text{expression}\rangle

identifier::=letteridentifierletteridentifierdigit\langle\text{identifier}\rangle ::= \langle\text{letter}\rangle \mid \langle\text{identifier}\rangle \langle\text{letter}\rangle \mid \langle\text{identifier}\rangle \langle\text{digit}\rangle

letter::=abczABZ\langle\text{letter}\rangle ::= a \mid b \mid c \mid \ldots \mid z \mid A \mid B \mid \ldots \mid Z

The recursive definition of identifier\langle\text{identifier}\rangle allows arbitrarily long variable names, whilst the alternatives in statement\langle\text{statement}\rangle permit different statement types within the same syntactic category.

Extended Backus-Naur Form

Extended BNF (EBNF) introduces additional operators to reduce verbosity:

  • [ ] denotes optional elements
  • { } denotes zero or more repetitions
  • ( ) groups alternatives
  • + denotes one or more repetitions

The arithmetic grammar becomes more concise in EBNF:

expression=term,{(+),term};\text{expression} = \text{term}, \{ (+ \mid -), \text{term} \} ;

term=factor,{(/),factor};\text{term} = \text{factor}, \{ (* \mid /) , \text{factor} \} ;

factor=number(,expression,);\text{factor} = \text{number} \mid (, \text{expression}, ) ;

number=digit,{digit};\text{number} = \text{digit}, \{ \text{digit} \} ;

digit=0123456789;\text{digit} = 0 \mid 1 \mid 2 \mid 3 \mid 4 \mid 5 \mid 6 \mid 7 \mid 8 \mid 9 ;

Real-World Example: Standard ML

The Standard ML definition provides an exemplary use of BNF for complete language specification. The formal grammar covers pattern matching, type expressions, and functional constructs with mathematical precision:

dec::=val tyvarseqvalbind\langle\text{dec}\rangle ::= \text{val } \langle\text{tyvarseq}\rangle \langle\text{valbind}\rangle fun tyvarseqfvalbind\mid \text{fun } \langle\text{tyvarseq}\rangle \langle\text{fvalbind}\rangle type typbind\mid \text{type } \langle\text{typbind}\rangle datatype datbind\mid \text{datatype } \langle\text{datbind}\rangle dec;dec\mid \langle\text{dec}\rangle ; \langle\text{dec}\rangle

This excerpt demonstrates how BNF captures both concrete syntax (val, fun) and recursive structure (declarations can be sequenced with semicolons). The complete specification runs to hundreds of such production rules, yet remains precise enough for implementation and proof.

Parser Generation

BNF grammars serve as direct input to parser generators such as Yacc, Bison, and ANTLR. The formal structure enables automatic generation of parsing tables and recursive descent parsers, eliminating hand-coded parsing logic for most languages.

Modern parser generators often accept EBNF variants, translating the concise notation into the finite state automata required for efficient parsing.

Limitations and Extensions

Pure BNF describes only context-free languages, which cannot express certain constraints such as variable declaration before use or type consistency. These semantic properties require additional specification methods:

  • Attribute grammars augment productions with semantic actions
  • Two-phase compilation separates syntactic parsing from semantic analysis
  • Constraint languages specify additional well-formedness conditions

Despite these limitations, BNF remains the foundation for syntax specification across virtually all programming languages, from C’s formal grammar in the ISO standard to the ECMAScript specification for JavaScript.

The notation’s enduring success stems from its mathematical foundation in formal language theory combined with immediate practical utility for compiler construction—a rare convergence of theoretical elegance and engineering pragmatism.