Syntax Specification Regular Expressions 1 Phases of Compilation 2 Syntax Analysis Syntax: Websters definition: 1 a : the way in which linguistic elements (as words) are put together to form constituents (as phrases or clauses)
The syntax of a programming language Describes its form i.e. Organization of tokens (elements) Formal notation Context Free Grammars (CFGs) 3 Review: Formal definition of tokens A set of tokens is a set of strings over an alphabet {read, write, +, -, *, /, :=, 1, 2, , 10, , 3.45e-3, }
A set of tokens is a regular set that can be defined by comprehension using a regular expression For every regular set, there is a deterministic finite automaton (DFA) that can recognize it i.e. determine whether a string belongs to the set or not Scanners extract tokens from source code in the same way DFAs determine membership 4 Regular Expressions A regular expression (RE) is: A single character
The empty string, The concatenation of two regular expressions Notation: RE1 RE2 (i.e. RE1 followed by RE2) The union of two regular expressions Notation: RE1 | RE2 The closure of a regular expression Notation: RE* * is known as the Kleene star
* represents the concatenation of 0 or more strings Non-null enumeration Notation: RE+ represents all non-null concatenations of RE (1 or more times) 5 Regular Expressions Basics Let alphabet ={a,b} (means a and b are its only letters) a*=(, a, aa, aaa, ...} (ab)*=(, ab, abab, ababab, ...} a b=(a, , b, bb, bb, ...} (a b)*= all strings containing as and bs
(a*b*)*=(ab*)*= all strings containing as and bs a*b*={aibj| i >=0, j>=0) 6 Building Regular Expressions Regular Expressions as Language * while loop iterates 0 or more times concatenation uv sequential; first u, then v uv OR select from one or the other or both
7 Description Regular Expression Let ={a,b} all expressions over this alphabet Strings with exactly one a b*ab* exactly two as one or more as even number of as b*ab*ab* (b*ab*)* or (a b)*a (a b)*
(b*ab*ab*)* even number of as and exactly one b (aa)*b(aa)* (aa)*ab(aa)*a odd number of as (b*ab*ab*)*b*ab* that dont contain aa (b ab)*( a) 8 Regular Expression Description Same alphabet
(aa)* even number of as (a b) (a b) (a b) (a b) all strings of length 4 ((a b) (a b) (a b) (a b))* strings of length divisible by 4 (aa)* ((a b) (a b) (a b) (a b))* strings of as of length divisible by 4 9 Token Definition Example
Numeric literals in Pascal Definition of the token unsigned_number digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 unsigned_integer digit digit* | digit+ unsigned_number unsigned_integer ( ( . unsigned_integer ) | ) ( ( e ( + | | ) unsigned_integer ) | ) Recursion is not allowed in Regular Expressions! 10 Exercise digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 unsigned_integer digit digit*
unsigned_number unsigned_integer ( ( . unsigned_integer ) | ) ( ( e ( + | | ) unsigned_integer ) | ) Regular expression for Decimal numbers number 11 Exercise digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 unsigned_integer digit digit* unsigned_number unsigned_integer ( ( . unsigned_integer ) | ) ( ( e ( + | | ) unsigned_integer ) | )
Regular expression for Decimal numbers number ( + | | ) unsigned_integer ( ( unsigned_integer ) | ) 12 Exercise digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 unsigned_integer digit digit* unsigned_number unsigned_integer ( ( . unsigned_integer ) | ) ( ( e ( + | | ) unsigned_integer ) | ) Regular expression for
Identifiers identifier 13 Exercise digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 unsigned_integer digit digit* unsigned_number unsigned_integer ( ( . unsigned_integer ) | ) ( ( e ( + | | ) unsigned_integer ) | ) Regular expression for Identifiers identifier letter ( letter | digit | )*
letter a | b | c | | z 14