当前位置：网站首页>Compilation Principle -- syntax analysis

Compilation Principle -- syntax analysis

2022-07-03 21:00:00 【raindayinrain】

Syntax analysis

 First introduce the basic concepts ,
 Then introduce the technology suitable for manual implementation ,
 Finally, the algorithm for automation tools is introduced ．
 And discuss how to extend parsing methods , To recover from common errors ．
 When designing languages ,
 Every programming language has a set of precise rules to describe the grammatical structure of well constructed programs ．
 The syntax constructed by programming language can be used 2.2 Introduce context free grammar or BNF Representation to describe ．
１． Grammar gives a precise and easy to understand grammar specification of programming language 
２． For a certain type of grammar , It can automatically construct an efficient parser ,
 It can determine the grammatical structure of a source program ．
 meanwhile , The construction process of parser can reveal the ambiguity of grammar ,
 We can also find some problems that are easy to be ignored in the initial design stage of the language ．
３． A properly designed grammar gives the structure of a language ．
 This structure helps to translate the source program into the correct object code , And detection errors ．
４． A grammar supports the gradual addition of new language constructs that can complete new tasks to iteratively evolve and develop languages ．

Introduction

 How to integrate the parser into a typical compiler ．
 Study the typical grammar of arithmetic expressions ．
 Parsing techniques that deal with expressions can be used to deal with most constructs of programming languages ．
 Error handling ．

The role of the parser

 In our compiler model ,
 The parser obtains a string of lexical units from the lexical analyzer ,
 And verify that this string can be generated by the grammar of the source language ．
 Pictured 4-1 Shown ．
 Expect the parser to report syntax errors in an easy to understand way ,
 And can recover from common errors and continue to process the rest of the program ．
 Conceptually ,
 For well structured procedures ,
 The parser constructs a parsing tree ,
 And pass it to the rest of the compiler for further processing ．
 actually ,
 There is no need to explicitly construct this parsing tree ,
 As we will see ,
 The checking and translation of the source program can be interleaved with the parsing process ．
 The parser and other parts of the front end can be implemented with a module ．

 There are three types of parsers that deal with grammars ：
 General purpose ,
 Top down ,
 From the bottom up ．

 The commonly used methods of compilers can be divided into top-down and bottom-up ．
 Of the two analytical methods ,
 The input of the parser is always scanned from left to right ,
 Scan one symbol at a time ．

Please add a picture description

 The most efficient top-down and bottom-up methods can only deal with some grammar subclasses ,
 But some subclasses , especially LL and LR Grammar ,
 Its expressive ability is enough to describe most of the grammatical structures of modern programming languages ．
 Hand implemented parsers are often used LL Grammar ．
 Such as 2.4.2 The predictive parsing method in section can handle LL Grammar ．
 Deal with larger LR Grammatical parsers are usually constructed using automated tools ．
 In this chapter , Suppose the output of the parser is a representation of the parsing tree ．
 The parsing tree corresponds to the stream of lexical units from the lexical analyzer ．
 In practice , The parsing process may involve multiple tasks ,
 Such as collecting the information of different lexical units into the symbol table ,
 Perform type checking and other types of semantic analysis ,
 And generate intermediate code ．
 Summarize all these activities into figure 4-1 Of " The rest of the front end "

Representative grammar

 First, give some grammars that will be studied in this chapter ．
 The following grammar indicates the associativity and priority of operators ．
 This grammar and the description expression used in Chapter 2 , term , The grammar of factors is similar ．
E Represents a group with + An expression consisting of items separated by symbols ;
T Means by a group with * Items consisting of factors separated by the symbol ;
 and F Representation factor , It may be an expression enclosed in parentheses , It may also be an identifier ：

Please add a picture description

 Expression grammar ４．１ Belong to LR Grammar class ,
 It is suitable for bottom-up parsing technology ．
 This grammar can be modified to handle more operators and more priority levels ．
 but , It cannot be used for top-down parsing , Because it is left recursive ．
 Here's how T->T*F | F No left recursive version of , This version will be used for top-down parsing ：

Please add a picture description

 The following grammar is treated in the same way + and *,
 Therefore, it can be used to illustrate those techniques that deal with ambiguity in the process of parsing ：

Please add a picture description

 here E Represents various types of expressions ．
 Grammar 4.3 Allow an expression , Such as ａ＋ｂ＝ｃ, With multiple parsing trees ．

Handling of grammatical errors

 The nature of grammatical errors and general strategies for error recovery ．
 The program may have different levels of errors 
１． Lexical errors 
 Include identifiers , Keywords or operators are misspelled and incorrectly quoted on string text 
２． Grammar mistakes 
 Including the semicolon misplaced , Curly braces are redundant or missing ．
３． Semantic error 
 Including type mismatch between operators and operational components ．
４． Logic error 
 It can be any error caused by the programmer's wrong reasoning 

 Some parsing methods , such as LL and LR Method ,
 Can find the error at the first time ．
 That is to say ,
 When the stream of lexical units from the lexical analyzer cannot be further analyzed according to the grammar of the language 
 You'll find mistakes ．
 To be more precise ,
 They have feasible prefix characteristics ．
 namely ,
 Once they find a prefix, they cannot add some symbols 
 And form the string of this language ,
 Grammar errors can be detected immediately ．
 Many errors occur in the form of grammatical errors , And it is exposed that we cannot continue parsing ．
 Some semantic errors can also be detected efficiently ．
 in general , It is difficult to accurately detect semantic and logical errors at compile time ．
 The goal of the error handler in the parser is simple , Realize the challenge ．
１． Report errors clearly and accurately 
２． Recover from a mistake , Continue to detect the following errors 
３． Minimize the overhead of processing the correct program

Error recovery strategy

 The easiest way is to let the parser detect the first error ,
 Give an error message ,
 And then quit ．
 For example, the parser can restore itself to a certain state ,
 And it is reasonable to expect that continuing to process the input from there will provide meaningful diagnostic information ,
 Then it will usually find more errors ．
 Too many mistakes , It is best to stop analysis after exceeding a certain number ．
－　 The resumption of panic patterns 
 Once the lexical analyzer finds an error, it constantly discards the symbols in the input ,
 Discard one symbol at a time ,
 Until an element in the set of synchronous lexical units is found ．
 Synchronous lexical units are often delimiters , Such as semicolon or }．
 The compiler designer needs to choose the appropriate synchronous lexical unit for the source language ．
 The error correction method of panic mode often skips a lot of input ,
 Do not check for other errors in the skipped part ．
－　 Phrase level recovery 
 When an error is found ,
 The parser can locally correct the remaining input ．
 He may replace one of the remaining prefixes with another string ,
 Enables the parser to continue parsing ．
 Common local correction methods include 
 Replace a comma with a semicolon , Delete an extra semicolon , Insert a missing semicolon ．
 You need to choose the replacement method carefully , To avoid entering an infinite loop ．
－　 Error production 
 By predicting common errors that may be encountered ,
 Special production expressions can be added to the grammar of the current language ,
 These production formulas can contain incorrect constructions ,
 Based on the grammar construction with the addition of error production formula, we get a parser ．
 For example, an error generator is used in the parsing process ,
 The parser detected an expected error ．
 Based on this, appropriate error diagnosis information can be generated , Point out the wrong structure identified in the input 
－　 Global correction 
 It is hoped that the compiler can convert a wrong input string into a syntactically correct string with minimal changes ．
 Some algorithms can choose a minimum sequence of changes ,
 Get the lowest cost global correction method ．
 Given an incorrect input ｘ And grammar Ｇ
 The algorithm will find a correlation string ｙ Syntax analysis tree 
 To cause to ｘ Convert to ｙ Required insertion , Delete , The number of lexical units changed is the least ．

Context free method

2.2 The concept of grammar has been introduced 
 It is used to systematically describe the syntax of the construction of programming languages ．
 The following production uses syntax variables stmt To express the statement ,
 With variable expr Expression

Please add a picture description

 The above production describes the structure of this form of conditional statements ．
 Other production formulas precisely define expr What is it? ,stmt What can it be

Formal definition of context free grammar

 According to the 2.2 The introduction shows that ,
 A context free grammar consists of a terminator , Non terminal symbols , A start symbol and a set of production forms 
１． The terminator is the basic symbol that makes up the string ．
 The term " Lexical unit name " yes " Terminator " A synonym for ．
 When we discuss the names of apparently lexical units , Commonly used " lexical unit " To refer to the terminator ．
 Assume that the terminal symbol is the first component of the lexical unit output by the lexical analyzer ．
4.4 in , The terminator is a keyword if and else And symbols "(" and ")"
２． Non terminating symbols are syntax variables that represent a set of strings 
4.4 in ,stmt and expr Yes no terminator 
 A set of strings represented by non terminating symbols is used to define the language generated by grammar 
 Non terminal symbols give the hierarchy of language , This hierarchical structure is the key to grammar analysis and Translation 
３． In a grammar , A non terminal symbol is specified as the start symbol 
 The string set represented by this symbol is the language generated by this grammar ．
 The customary , First list the production of the start symbol ．
４． The production of a grammar describes the method of combining terminal and non terminal symbols into a string ．
 Each production consists of the following elements 
ａ． A non terminal symbol called the production head or left ．
 This production defines a part of the string set represented by this header ．
ｂ． Symbol ->．
 Sometimes you use ::= Instead of arrows 
ｃ． A production or right part consisting of zero or more terminations and non terminations ．
 The components in the production body describe a certain construction method of the string corresponding to the non terminal symbol on the production head ．

 chart 4-2 The grammar in defines simple arithmetic expressions ．
 In this grammar , The terminator is 
id + - * / ( )
 The nonterminal symbol is expression term factor
expression Is the beginning sign

Please add a picture description

Convention of symbolic representation

 To avoid always declaring " These are terminations "," These are nonterminal symbols " wait for ．
 The rest of this book uses the following conventions for the representation of grammatical symbols 
１． The following symbols are terminations 
ａ． The first small letters in the alphabet , Such as ａ,ｂ,ｃ．
ｂ． Operation symbol , such as ＋,＊ etc. 
ｃ． Punctuation , For example, brackets , Comma, etc 
ｄ． Numbers ０～９
ｅ． Bold string , such as ｉｄ or ｉｆ． Each such string represents a Terminator 
２． The following symbols are non terminating symbols 
ａ． The first capital letter in the alphabet , such as Ａ,Ｂ,Ｃ
ｂ． Letter Ｓ, When it appears, it usually represents the start symbol 
ｃ． A lowercase letter , Names in italics , such as ｅｘｐｒ or ｓｔｍｔ
ｄ． When discussing the construction of programming language ,
 Capital letters can be used to represent non terminal symbols representing program construction ．
 such as , expression , The nonterminal symbols of terms and factors are usually used separately Ｅ,Ｔ and Ｆ Express ．
ｅ． The last capital letters in the alphabet ［ Such as Ｘ,Ｙ,Ｚ］ Indicates a grammatical symbol ．
 namely , Indicates a non terminal symbol or a terminal symbol 
ｆ． The lower case letters that follow in the alphabet ［ Mainly ｕ,ｖ,．．．,ｚ］ Express 
［ It could be empty ］ Terminator string ．
ｇ． Small Greek letters , such as α β γ  Express ［ It could be empty ］ Grammar symbol string ．
 therefore ,
 A common production can be written A->α, among A It's the head of production ,α Is the productive body 
ｈ． A set of production formulas with the same head A->α1,A->α2,．．．,A->α_{k}［Ａ Generative formula ］
 Can write A->α1|α2|...|αk|．
 hold α１,α２,．．．,αｋ Referred to as Ａ Non identical alternative of 
ｉ． Unless otherwise specified ,
 The head of the first production is the start symbol 

 As agreed above , example ４．５ It can be changed to the following simpler form

Please add a picture description

 The above symbol indicates that the Convention tells us Ｅ,Ｔ and Ｆ Yes no terminator , among Ｅ Is the beginning sign ．
 The remaining symbols are terminations ．

deduction

 Think of production as rewriting rules ,
 From the perspective of derivation, we can accurately describe the method of constructing syntax analysis tree ．
 Start with the start symbol ,
 Each rewriting step replaces a non terminal symbol with one of its production bodies ．
 This derivation idea corresponds to the process of constructing parsing tree from top to bottom ．
 But the accuracy of the derivation concept is particularly useful in discussing the bottom-up parsing process ．
 Bottom up parsing and a method called " The most right " The derivation type of derivation is related ．
 In this derivation , Each step rewrites the rightmost non terminal symbol ．

 Consider the following , There is only one non terminal symbol Ｅ Grammar ．
 It's in grammar 4.3 A production formula is added in E->-E

Please add a picture description

 Generative formula E->-E indicate ,
 If Ｅ Represents an expression ,
 be -E Must also represent an expression ．
 Will a Ｅ Replace with -E The process of writing

Please add a picture description

 The above formula reads "E Deduce -E"
 Generative formula E->(E) Any grammar symbol that appears in the string E Replace any instance of with (E)．
 Such as ,

Please add a picture description

 Individual... Can be sorted in any order E Constantly apply various production formulas , Get a replacement sequence ．
 Such as ：

Please add a picture description

 We call this replacement sequence from Ｅ To -(id) The derivation of ．
 This derivation proves that string -(id) Is an example of an expression ．

 Give a general definition of derivation ,
 Consider a non terminating symbol in the middle of a sequence of grammatical symbols Ａ,
 Such as αAβ,
 among α β Is an arbitrary string of grammatical symbols ．
 hypothesis A->γ  It's a generative ．
 Then we write

Please add a picture description

 Symbol

Please add a picture description

 Means to deduce in one step ．
 When a derivation sequence

Please add a picture description

 take a1 Replace with an．
 say a1 Deduce an
 We often say －－ Deduce... After zero or more steps ,
 You can use symbols

Please add a picture description

 To express this relationship ．
 therefore ,
１． For any string ａ,ａ⇒_{*}ａ And 
２． Such as ａ⇒_{*}ｂ, And ｂ⇒ｒ, be ａ⇒_{*}ｒ
 Similarly ,⇒_{+} Express " Deduce... In one or more steps "

 If Ｓ⇒_{*}ａ,
 among Ｓ It's grammar Ｇ The start sign of 
 We said ａ yes Ｇ A sentence pattern of 
 A sentence pattern may contain both terminations and non terminations ,
 Or it could be an empty string ．
 Grammar Ｇ A sentence of is a sentence pattern that does not contain non terminations ．
 A grammatically generated language is a collection of all its sentences ．
 therefore ,
 A terminator string ｗ stay Ｇ Generated language Ｌ(G) in ,
 If and only if ｗ yes Ｇ A sentence of ［ Or say S⇒_{*}ｗ］
 The language that can be generated by grammar is called context free language ．
 For example, two grammars generate the same language ,
 These two grammars are called equivalent ．

 strand -(id+id) It's grammar 4.7 A sentence of ,
 Because there is a derivation process

Please add a picture description

 strand E,-E,-(E),...,-(id+id) It's all the sentence pattern of this grammar ．
 use E⇒_{*}-(id+id) To indicate -(id+id) Can be obtained from E To derive ．

 There are two choices to be made in each derivation step ．
 Choose which non terminal symbol to replace ,
 And after making this decision ,
 You also need to select a production with this non terminal symbol as the header ．
 Such as , The following -(id+id) Another derivation and derivation of 4.8 The last two steps are different ．

Please add a picture description

 In these two derivations ,
 Each nonterminal symbol is replaced by the same production body , But the order of replacement is different ．
 To understand how the parser works ,
 Two derivation processes will be considered to select the replaced non terminal symbols in each derivation step as follows 
１． In the leftmost derivation ,
 Always select the leftmost non terminator of each sentence pattern ．
 If ａ⇒ｂ It's a derivation step ,
 And replaced by ａ Leftmost nonterminal symbol in ,
 writing ａ⇒_{lm}ｂ
２． In the rightmost derivation ,
 Always select the rightmost non terminating symbol ,
 Writing at this time ａ⇒_{rm}ｂ
 deduction 4.8 Is the leftmost derivation , So it can be written as

Please add a picture description

 Be careful , deduction ４．９ Is a rightmost derivation ．
 According to our notation Convention ,
 Each leftmost derivation step can be written as

Please add a picture description

 among ｗ Contains only the Terminator ,
A->δ Is the applied production ,
 and ｒ Is a string of grammatical symbols ．
 To emphasize ａ After a leftmost derivation process, we get ｂ,
 writing

Please add a picture description

If

Please add a picture description

 So let's say ａ Is the leftmost sentence pattern in the current grammar ．
 There is a similar definition of rightmost derivation ．
 Rightmost derivation is sometimes called canonical derivation ．

Parsing tree and derivation

 Parsing tree is a graphical representation of derivation ,
 It filters out the sequence of applying production to non terminal symbols in the derivation process ．
 Each internal node of the parsing tree represents a production application ．
 The label of the internal node is the non terminal symbol in the production header Ａ;
 The labels of the child nodes of this node form from left to right to replace this in the derivation process Ａ Production of ．

 Such as , chart 4-3 in ,-(id+id) The parsing tree of is based on derivation 4.8 Got ,
 It can also be deduced 4.9 obtain ．

Please add a picture description