当前位置:网站首页>Compilation Principle -- syntax analysis
Compilation Principle -- syntax analysis
2022-07-03 21:00:00 【raindayinrain】
Syntax analysis
First introduce the basic concepts ,
Then introduce the technology suitable for manual implementation ,
Finally, the algorithm for automation tools is introduced .
And discuss how to extend parsing methods , To recover from common errors .
When designing languages ,
Every programming language has a set of precise rules to describe the grammatical structure of well constructed programs .
The syntax constructed by programming language can be used 2.2 Introduce context free grammar or BNF Representation to describe .
1. Grammar gives a precise and easy to understand grammar specification of programming language
2. For a certain type of grammar , It can automatically construct an efficient parser ,
It can determine the grammatical structure of a source program .
meanwhile , The construction process of parser can reveal the ambiguity of grammar ,
We can also find some problems that are easy to be ignored in the initial design stage of the language .
3. A properly designed grammar gives the structure of a language .
This structure helps to translate the source program into the correct object code , And detection errors .
4. A grammar supports the gradual addition of new language constructs that can complete new tasks to iteratively evolve and develop languages .
Introduction
How to integrate the parser into a typical compiler .
Study the typical grammar of arithmetic expressions .
Parsing techniques that deal with expressions can be used to deal with most constructs of programming languages .
Error handling .
The role of the parser
In our compiler model ,
The parser obtains a string of lexical units from the lexical analyzer ,
And verify that this string can be generated by the grammar of the source language .
Pictured 4-1 Shown .
Expect the parser to report syntax errors in an easy to understand way ,
And can recover from common errors and continue to process the rest of the program .
Conceptually ,
For well structured procedures ,
The parser constructs a parsing tree ,
And pass it to the rest of the compiler for further processing .
actually ,
There is no need to explicitly construct this parsing tree ,
As we will see ,
The checking and translation of the source program can be interleaved with the parsing process .
The parser and other parts of the front end can be implemented with a module .
There are three types of parsers that deal with grammars :
General purpose ,
Top down ,
From the bottom up .
The commonly used methods of compilers can be divided into top-down and bottom-up .
Of the two analytical methods ,
The input of the parser is always scanned from left to right ,
Scan one symbol at a time .

The most efficient top-down and bottom-up methods can only deal with some grammar subclasses ,
But some subclasses , especially LL and LR Grammar ,
Its expressive ability is enough to describe most of the grammatical structures of modern programming languages .
Hand implemented parsers are often used LL Grammar .
Such as 2.4.2 The predictive parsing method in section can handle LL Grammar .
Deal with larger LR Grammatical parsers are usually constructed using automated tools .
In this chapter , Suppose the output of the parser is a representation of the parsing tree .
The parsing tree corresponds to the stream of lexical units from the lexical analyzer .
In practice , The parsing process may involve multiple tasks ,
Such as collecting the information of different lexical units into the symbol table ,
Perform type checking and other types of semantic analysis ,
And generate intermediate code .
Summarize all these activities into figure 4-1 Of " The rest of the front end "
Representative grammar
First, give some grammars that will be studied in this chapter .
The following grammar indicates the associativity and priority of operators .
This grammar and the description expression used in Chapter 2 , term , The grammar of factors is similar .
E Represents a group with + An expression consisting of items separated by symbols ;
T Means by a group with * Items consisting of factors separated by the symbol ;
and F Representation factor , It may be an expression enclosed in parentheses , It may also be an identifier :

Expression grammar 4.1 Belong to LR Grammar class ,
It is suitable for bottom-up parsing technology .
This grammar can be modified to handle more operators and more priority levels .
but , It cannot be used for top-down parsing , Because it is left recursive .
Here's how T->T*F | F No left recursive version of , This version will be used for top-down parsing :

The following grammar is treated in the same way + and *,
Therefore, it can be used to illustrate those techniques that deal with ambiguity in the process of parsing :

here E Represents various types of expressions .
Grammar 4.3 Allow an expression , Such as a+b=c, With multiple parsing trees .
Handling of grammatical errors
The nature of grammatical errors and general strategies for error recovery .
The program may have different levels of errors
1. Lexical errors
Include identifiers , Keywords or operators are misspelled and incorrectly quoted on string text
2. Grammar mistakes
Including the semicolon misplaced , Curly braces are redundant or missing .
3. Semantic error
Including type mismatch between operators and operational components .
4. Logic error
It can be any error caused by the programmer's wrong reasoning
Some parsing methods , such as LL and LR Method ,
Can find the error at the first time .
That is to say ,
When the stream of lexical units from the lexical analyzer cannot be further analyzed according to the grammar of the language
You'll find mistakes .
To be more precise ,
They have feasible prefix characteristics .
namely ,
Once they find a prefix, they cannot add some symbols
And form the string of this language ,
Grammar errors can be detected immediately .
Many errors occur in the form of grammatical errors , And it is exposed that we cannot continue parsing .
Some semantic errors can also be detected efficiently .
in general , It is difficult to accurately detect semantic and logical errors at compile time .
The goal of the error handler in the parser is simple , Realize the challenge .
1. Report errors clearly and accurately
2. Recover from a mistake , Continue to detect the following errors
3. Minimize the overhead of processing the correct program
Error recovery strategy
The easiest way is to let the parser detect the first error ,
Give an error message ,
And then quit .
For example, the parser can restore itself to a certain state ,
And it is reasonable to expect that continuing to process the input from there will provide meaningful diagnostic information ,
Then it will usually find more errors .
Too many mistakes , It is best to stop analysis after exceeding a certain number .
- The resumption of panic patterns
Once the lexical analyzer finds an error, it constantly discards the symbols in the input ,
Discard one symbol at a time ,
Until an element in the set of synchronous lexical units is found .
Synchronous lexical units are often delimiters , Such as semicolon or }.
The compiler designer needs to choose the appropriate synchronous lexical unit for the source language .
The error correction method of panic mode often skips a lot of input ,
Do not check for other errors in the skipped part .
- Phrase level recovery
When an error is found ,
The parser can locally correct the remaining input .
He may replace one of the remaining prefixes with another string ,
Enables the parser to continue parsing .
Common local correction methods include
Replace a comma with a semicolon , Delete an extra semicolon , Insert a missing semicolon .
You need to choose the replacement method carefully , To avoid entering an infinite loop .
- Error production
By predicting common errors that may be encountered ,
Special production expressions can be added to the grammar of the current language ,
These production formulas can contain incorrect constructions ,
Based on the grammar construction with the addition of error production formula, we get a parser .
For example, an error generator is used in the parsing process ,
The parser detected an expected error .
Based on this, appropriate error diagnosis information can be generated , Point out the wrong structure identified in the input
- Global correction
It is hoped that the compiler can convert a wrong input string into a syntactically correct string with minimal changes .
Some algorithms can choose a minimum sequence of changes ,
Get the lowest cost global correction method .
Given an incorrect input x And grammar G
The algorithm will find a correlation string y Syntax analysis tree
To cause to x Convert to y Required insertion , Delete , The number of lexical units changed is the least .
Context free method
2.2 The concept of grammar has been introduced
It is used to systematically describe the syntax of the construction of programming languages .
The following production uses syntax variables stmt To express the statement ,
With variable expr Expression

The above production describes the structure of this form of conditional statements .
Other production formulas precisely define expr What is it? ,stmt What can it be
Formal definition of context free grammar
According to the 2.2 The introduction shows that ,
A context free grammar consists of a terminator , Non terminal symbols , A start symbol and a set of production forms
1. The terminator is the basic symbol that makes up the string .
The term " Lexical unit name " yes " Terminator " A synonym for .
When we discuss the names of apparently lexical units , Commonly used " lexical unit " To refer to the terminator .
Assume that the terminal symbol is the first component of the lexical unit output by the lexical analyzer .
4.4 in , The terminator is a keyword if and else And symbols "(" and ")"
2. Non terminating symbols are syntax variables that represent a set of strings
4.4 in ,stmt and expr Yes no terminator
A set of strings represented by non terminating symbols is used to define the language generated by grammar
Non terminal symbols give the hierarchy of language , This hierarchical structure is the key to grammar analysis and Translation
3. In a grammar , A non terminal symbol is specified as the start symbol
The string set represented by this symbol is the language generated by this grammar .
The customary , First list the production of the start symbol .
4. The production of a grammar describes the method of combining terminal and non terminal symbols into a string .
Each production consists of the following elements
a. A non terminal symbol called the production head or left .
This production defines a part of the string set represented by this header .
b. Symbol ->.
Sometimes you use ::= Instead of arrows
c. A production or right part consisting of zero or more terminations and non terminations .
The components in the production body describe a certain construction method of the string corresponding to the non terminal symbol on the production head .
chart 4-2 The grammar in defines simple arithmetic expressions .
In this grammar , The terminator is
id + - * / ( )
The nonterminal symbol is expression term factor
expression Is the beginning sign

Convention of symbolic representation
To avoid always declaring " These are terminations "," These are nonterminal symbols " wait for .
The rest of this book uses the following conventions for the representation of grammatical symbols
1. The following symbols are terminations
a. The first small letters in the alphabet , Such as a,b,c.
b. Operation symbol , such as +,* etc.
c. Punctuation , For example, brackets , Comma, etc
d. Numbers 0~9
e. Bold string , such as id or if. Each such string represents a Terminator
2. The following symbols are non terminating symbols
a. The first capital letter in the alphabet , such as A,B,C
b. Letter S, When it appears, it usually represents the start symbol
c. A lowercase letter , Names in italics , such as expr or stmt
d. When discussing the construction of programming language ,
Capital letters can be used to represent non terminal symbols representing program construction .
such as , expression , The nonterminal symbols of terms and factors are usually used separately E,T and F Express .
e. The last capital letters in the alphabet [ Such as X,Y,Z] Indicates a grammatical symbol .
namely , Indicates a non terminal symbol or a terminal symbol
f. The lower case letters that follow in the alphabet [ Mainly u,v,...,z] Express
[ It could be empty ] Terminator string .
g. Small Greek letters , such as α β γ Express [ It could be empty ] Grammar symbol string .
therefore ,
A common production can be written A->α, among A It's the head of production ,α Is the productive body
h. A set of production formulas with the same head A->α1,A->α2,...,A->α_{k}[A Generative formula ]
Can write A->α1|α2|...|αk|.
hold α1,α2,...,αk Referred to as A Non identical alternative of
i. Unless otherwise specified ,
The head of the first production is the start symbol
As agreed above , example 4.5 It can be changed to the following simpler form

The above symbol indicates that the Convention tells us E,T and F Yes no terminator , among E Is the beginning sign .
The remaining symbols are terminations .
deduction
Think of production as rewriting rules ,
From the perspective of derivation, we can accurately describe the method of constructing syntax analysis tree .
Start with the start symbol ,
Each rewriting step replaces a non terminal symbol with one of its production bodies .
This derivation idea corresponds to the process of constructing parsing tree from top to bottom .
But the accuracy of the derivation concept is particularly useful in discussing the bottom-up parsing process .
Bottom up parsing and a method called " The most right " The derivation type of derivation is related .
In this derivation , Each step rewrites the rightmost non terminal symbol .
Consider the following , There is only one non terminal symbol E Grammar .
It's in grammar 4.3 A production formula is added in E->-E

Generative formula E->-E indicate ,
If E Represents an expression ,
be -E Must also represent an expression .
Will a E Replace with -E The process of writing

The above formula reads "E Deduce -E"
Generative formula E->(E) Any grammar symbol that appears in the string E Replace any instance of with (E).
Such as ,

Individual... Can be sorted in any order E Constantly apply various production formulas , Get a replacement sequence .
Such as :

We call this replacement sequence from E To -(id) The derivation of .
This derivation proves that string -(id) Is an example of an expression .
Give a general definition of derivation ,
Consider a non terminating symbol in the middle of a sequence of grammatical symbols A,
Such as αAβ,
among α β Is an arbitrary string of grammatical symbols .
hypothesis A->γ It's a generative .
Then we write

Symbol

Means to deduce in one step .
When a derivation sequence

take a1 Replace with an.
say a1 Deduce an
We often say -- Deduce... After zero or more steps ,
You can use symbols

To express this relationship .
therefore ,
1. For any string a,a⇒_{*}a And
2. Such as a⇒_{*}b, And b⇒r, be a⇒_{*}r
Similarly ,⇒_{+} Express " Deduce... In one or more steps "
If S⇒_{*}a,
among S It's grammar G The start sign of
We said a yes G A sentence pattern of
A sentence pattern may contain both terminations and non terminations ,
Or it could be an empty string .
Grammar G A sentence of is a sentence pattern that does not contain non terminations .
A grammatically generated language is a collection of all its sentences .
therefore ,
A terminator string w stay G Generated language L(G) in ,
If and only if w yes G A sentence of [ Or say S⇒_{*}w]
The language that can be generated by grammar is called context free language .
For example, two grammars generate the same language ,
These two grammars are called equivalent .
strand -(id+id) It's grammar 4.7 A sentence of ,
Because there is a derivation process

strand E,-E,-(E),...,-(id+id) It's all the sentence pattern of this grammar .
use E⇒_{*}-(id+id) To indicate -(id+id) Can be obtained from E To derive .
There are two choices to be made in each derivation step .
Choose which non terminal symbol to replace ,
And after making this decision ,
You also need to select a production with this non terminal symbol as the header .
Such as , The following -(id+id) Another derivation and derivation of 4.8 The last two steps are different .

In these two derivations ,
Each nonterminal symbol is replaced by the same production body , But the order of replacement is different .
To understand how the parser works ,
Two derivation processes will be considered to select the replaced non terminal symbols in each derivation step as follows
1. In the leftmost derivation ,
Always select the leftmost non terminator of each sentence pattern .
If a⇒b It's a derivation step ,
And replaced by a Leftmost nonterminal symbol in ,
writing a⇒_{lm}b
2. In the rightmost derivation ,
Always select the rightmost non terminating symbol ,
Writing at this time a⇒_{rm}b
deduction 4.8 Is the leftmost derivation , So it can be written as

Be careful , deduction 4.9 Is a rightmost derivation .
According to our notation Convention ,
Each leftmost derivation step can be written as

among w Contains only the Terminator ,
A->δ Is the applied production ,
and r Is a string of grammatical symbols .
To emphasize a After a leftmost derivation process, we get b,
writing

If

So let's say a Is the leftmost sentence pattern in the current grammar .
There is a similar definition of rightmost derivation .
Rightmost derivation is sometimes called canonical derivation .
Parsing tree and derivation
Parsing tree is a graphical representation of derivation ,
It filters out the sequence of applying production to non terminal symbols in the derivation process .
Each internal node of the parsing tree represents a production application .
The label of the internal node is the non terminal symbol in the production header A;
The labels of the child nodes of this node form from left to right to replace this in the derivation process A Production of .
Such as , chart 4-3 in ,-(id+id) The parsing tree of is based on derivation 4.8 Got ,
It can also be deduced 4.9 obtain .

边栏推荐
- Brief analysis of ref nerf
- Last week's content review
- MDM mass data synchronization test verification
- Phpexcel import export
- 2166. Design bit set
- JVM JNI and PVM pybind11 mass data transmission and optimization
- TLS environment construction and plaintext analysis
- Scientific research document management Zotero
- thrift go
- Go learning notes (4) basic types and statements (3)
猜你喜欢
![Oak-d raspberry pie cloud project [with detailed code]](/img/34/76b461bf03fba373da5b5898c5204c.jpg)
Oak-d raspberry pie cloud project [with detailed code]

Node MySQL serialize cannot rollback transactions

MySQL master-slave synchronization principle

19、 MySQL -- SQL statements and queries

Discussion Net legacy application transformation

18、 MySQL -- index

Q&A:Transformer, Bert, ELMO, GPT, VIT

Based on laravel 5.5\5.6\5 X solution to the failure of installing laravel ide helper

From the behind the scenes arena of the ice and snow event, see how digital builders can ensure large-scale events

强化學習-學習筆記1 | 基礎概念
随机推荐
In 2021, the global revenue of syphilis rapid detection kits was about US $608.1 million, and it is expected to reach US $712.9 million in 2028
如临现场的视觉感染力,NBA决赛直播还能这样看?
Pytorch sets the weight and bias of the model to zero
First knowledge of database
In 2021, the global revenue of thick film resistors was about $1537.3 million, and it is expected to reach $2118.7 million in 2028
Deep search DFS + wide search BFS + traversal of trees and graphs + topological sequence (template article acwing)
Cap and base theory
"Actbert" Baidu & Sydney University of technology proposed actbert to learn the global and local video text representation, which is effective in five video text tasks
19、 MySQL -- SQL statements and queries
Gauss elimination solves linear equations (floating-point Gauss elimination template)
Hcie security Day11: preliminarily learn the concepts of firewall dual machine hot standby and vgmp
Qtablewidget control of QT
Operate BOM objects (key)
Node MySQL serialize cannot rollback transactions
鹏城杯 WEB_WP
In 2021, the global general crop protection revenue was about $52750 million, and it is expected to reach $64730 million in 2028
Talk about daily newspaper design - how to write a daily newspaper and what is the use of a daily newspaper?
C 10 new feature [caller parameter expression] solves my confusion seven years ago
2022 high voltage electrician examination and high voltage electrician reexamination examination
Fingerprint password lock based on Hal Library