
Lex & Yacc in Pisa-Proxy SQL Parsing

2022-07-07 17:19:00 InfoQ

1. Preface

1.1 About the author

Wang Bo, R&D engineer at SphereEx MeshLab, focused on Database Mesh and Cloud Native development. Linux, LLVM, Yacc, and eBPF user; Gopher, Rustacean, and C bug hunter.
GitHub: https://xie.infoq.cn/article/d59e47a90edaf763df760ac2d
1.2 Background

The previous article, "Pisa-Proxy and SQL Parsing in Practice", introduced the SQL parser, one of the core modules of Pisa-Proxy. In MySQL and PostgreSQL, SQL parsing is implemented with Yacc, and the SQL parser in Pisa-Proxy is likewise built with Yacc-style tools. This article therefore uses the SQL parser as a lens to introduce some compiler fundamentals and the use of Lex & Yacc, and shows readers how to implement a simple SQL parser with Lex & Yacc, so as to better understand how the SQL parser in Pisa-Proxy works.

2. Exploring compilers


Any programming language, whether a commonly used one such as Java or Golang, or SQL, is essentially a notation system. Like a natural language, its complete definition covers both syntax and semantics. The syntax of a language is a set of rules that can be used to form and produce well-formed programs. The most widely used formalism today is the context-free grammar, which serves as a tool for describing the syntax of programming languages. Grammar only defines which symbol sequences are legal; it says nothing about what those symbols mean. Semantics, in turn, falls into two categories: static and dynamic. Static semantics is a set of restriction rules that determine which syntactically valid programs are well-formed; dynamic semantics, also called runtime or execution semantics, specifies what the program computes.
2.1  Compiler workflow

As shown in Figure 2.1.1, a compiler generally turns source code into an executable through the following steps:

  • Scan the source file and split its character stream into words (tokens): this is lexical analysis
  • Build these tokens into a syntax tree according to the grammar rules: this is syntax analysis
  • Check the relationships between the syntax-tree nodes for semantic-rule violations, optimizing the tree where necessary: this is semantic analysis
  • Traverse the syntax-tree nodes, translate each node into intermediate code, and assemble the pieces in a specific order: this is intermediate code generation
  • Optimize the intermediate code
  • Translate the intermediate code into object code
  • Optimize the object code and generate the final target program

Figure 2.1.1

For SQL parsing, the steps above can be simplified to the form shown in Figure 2.1.2: the source input (a SQL statement) goes through lexical analysis, which produces a stream of SQL-specific tokens; the token stream is then parsed to produce the final SQL AST.

Figure 2.1.2
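This simplified two-stage flow (statement to token stream to AST) can be sketched in a few lines of Python. The function and field names below are illustrative only, not Pisa-Proxy's actual API:

```python
# A toy illustration of the simplified pipeline: lexical analysis turns the
# character stream into tokens, and syntax analysis turns the tokens into a
# small AST. All names here are illustrative, not Pisa-Proxy's actual API.

def tokenize(sql):
    """Lexical analysis: split the statement into whitespace-separated tokens."""
    return sql.split()

def parse(tokens):
    """Syntax analysis: build a tiny AST (a nested dict) for SELECT <col> FROM <table>."""
    if tokens[0].upper() != "SELECT" or tokens[2].upper() != "FROM":
        raise SyntaxError("expected SELECT <column> FROM <table>")
    return {"type": "Select", "columns": [tokens[1]], "table": tokens[3]}

ast = parse(tokenize("SELECT * FROM pisa_proxy"))
print(ast)  # {'type': 'Select', 'columns': ['*'], 'table': 'pisa_proxy'}
```

Real parsers build a full tree rather than a flat dict, but the division of labor between the two stages is the same.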

2.2  Lexical analysis

As mentioned above, a key step for both a compiler and a SQL parser is lexical analysis of the source, which can be understood as tokenizing the SQL statement itself. At this stage, the SQL parser scans the source from left to right and splits the SQL statement into individual tokens, where a token is a string in SQL that cannot be divided any further. For example, after lexical analysis, the SQL statement in Figure 2.1.2 produces the tokens `SELECT`, `*`, `FROM`, `pisa_proxy`, and so on.

The token categories that can appear in a SQL statement are also limited: reserved words such as `SELECT`, `INSERT`, and `DELETE`; operators, such as arithmetic and comparison operators; and identifiers, such as built-in function names. Each token scanned at this stage is kept in a data structure for use in the next stage, syntax analysis. Broadly speaking, lexical analysis can be done by direct scanning or by regular-expression matching.
2.2.1  Direct scanning method

The logic of direct scanning is straightforward: each scan decides the token type from the first character, applies the corresponding strategy to scan out a complete token, and then starts the next scan. SQL parsing in Pisa-Proxy uses this approach for lexical analysis. The following Python code shows how a simple SQL lexer scans a statement:

# -*- coding: utf-8 -*-

single_char_operators_typeA = {
    ";", ",", "(", ")", "/", "+", "-", "*", "%", ".",
}

single_char_operators_typeB = {
    "<", ">", "=", "!"
}

double_char_operators = {
    ">=", "<=", "==", "~="
}

reservedWords = {
    "select", "insert", "update", "delete", "show",
    "create", "set", "grant", "from", "where"
}

class Token:
    def __init__(self, _type, _val=None):
        if _val is None:
            self.type = "T_" + _type
            self.val = _type
        else:
            self.type, self.val = _type, _val

    def __str__(self):
        return "%-20s%s" % (self.type, self.val)

class NoneTerminateQuoteError(Exception):
    pass

def isWhiteSpace(ch):
    return ch in " \t\r\a\n"

def isDigit(ch):
    return ch in "0123456789"

def isLetter(ch):
    return ch in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def scan(s):
    n, i = len(s), 0
    while i < n:
        ch, i = s[i], i + 1

        if isWhiteSpace(ch):
            continue

        if ch == "#":  # comment: ignore the rest of the input
            return

        if ch in single_char_operators_typeA:
            yield Token(ch)
        elif ch in single_char_operators_typeB:
            if i < n and s[i] == "=":
                i += 1
                yield Token(ch + "=")
            else:
                yield Token(ch)
        elif isLetter(ch) or ch == "_":
            begin = i - 1
            while i < n and (isLetter(s[i]) or isDigit(s[i]) or s[i] == "_"):
                i += 1
            word = s[begin:i]
            if word in reservedWords:
                yield Token(word)
            else:
                yield Token("T_identifier", word)
        elif isDigit(ch):
            begin = i - 1
            aDot = False
            while i < n:
                if s[i] == ".":
                    if aDot:
                        raise Exception("Too many dots in a number!")
                    aDot = True
                elif not isDigit(s[i]):
                    break
                i += 1
            yield Token("T_double" if aDot else "T_integer", s[begin:i])
        elif ord(ch) == 34:  # 34 is the ASCII code of '"'
            begin = i
            while i < n and ord(s[i]) != 34:
                i += 1
            if i == n:
                raise NoneTerminateQuoteError("Non-terminated string quote!")
            yield Token("T_string", chr(34) + s[begin:i] + chr(34))
            i += 1
        else:
            raise Exception("Unknown symbol: " + ch)

if __name__ == "__main__":
    print("%-20s%s" % ("TOKEN TYPE", "TOKEN VALUE"))
    print("-" * 50)
    sql = "select * from pisa_proxy where id = 1;"
    for token in scan(sql):
        print(token)

The final output is as follows:

TOKEN TYPE          TOKEN VALUE
--------------------------------------------------
T_select            select
T_*                 *
T_from              from
T_identifier        pisa_proxy
T_where             where
T_identifier        id
T_=                 =
T_integer           1
T_;                 ;

As the code above shows, we begin by defining several token types: for example, the `reservedWords` set holds the SQL reserved words, and `single_char_operators_typeA` holds SQL operators. The final output shows that the scanner splits a SQL statement into tokens of the corresponding types.
2.2.2  Regular expression scanning

In essence, a program is a string, and a language can be seen as the set of its legal programs. The essence of a compiler or SQL parser is therefore to decide whether an input string is a legal program of the language, and to translate legal programs into the target language. The essence of regular-expression scanning is to use a finite state machine to decide whether a string matches a regular expression. The flex tool introduced below works exactly this way: it builds a finite automaton from the regular-expression token patterns the user writes.
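The idea can be demonstrated with Python's `re` module, which compiles regular expressions into exactly this kind of matching machinery. The token patterns below are illustrative; note that keyword patterns are listed before the generic identifier rule, mirroring how flex resolves overlapping patterns:

```python
import re

# A regex-based scanner. Order matters: the keyword alternative is tried
# before the generic identifier rule, mirroring how flex resolves
# overlapping patterns. These token patterns are illustrative only.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:select|insert|update|delete|from|where)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"\d+"),
    ("OP",         r"[*,;=<>]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % (name, pat) for name, pat in TOKEN_SPEC),
                    re.IGNORECASE)

def scan(sql):
    """Yield (token type, lexeme) pairs, dropping whitespace."""
    for m in MASTER.finditer(sql):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(scan("select * from pisa_proxy where id = 1;")))
```

Compared with direct scanning, the per-character dispatch logic disappears; it is encoded in the patterns and executed by the compiled automaton.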
2.3  Syntax analysis

After lexical analysis, the character stream of the SQL statement has been split into tokens. Syntax analysis then analyzes the grammatical structure of the SQL, turning the linear token stream into a tree in preparation for subsequent semantic analysis and AST generation. In a compiler, regular expressions are not expressive enough to describe the set of sentences of a programming language, so context-free grammars are introduced; they can describe the syntactic structure of today's programming languages.

Take natural language as an example. Suppose a toy language SQLX contains only subject-verb-object sentences: the subject can only be "I", "you", or "he", the only verb is "love", and the object is "Rust" or "Pisanix". Its grammar can then take the following form:

sentence -> subject verb object
subject  -> I
subject  -> you
subject  -> he
verb     -> love
object   -> Rust
object   -> Pisanix

From the example above, we can substitute each of subject, verb, and object in turn to write out every sentence with this structure; conversely, we can match any sentence against the structure to decide whether it belongs to the SQLX language. This introduces several concepts. A rule of the form "sentence -> subject verb object" is called a production. The symbols on the left of productions (sentence, subject, verb, object) are called non-terminals. Symbols such as "I", "you", "he", "love", "Rust", and "Pisanix" cannot produce any new symbols, so they are called terminals; a terminal may only appear on the right side of a production. "sentence" is the starting point of every sentence, so it is also called the start symbol.

The productions of one non-terminal are usually written together, separated by "|", which condenses the grammar to:

sentence -> subject verb object
subject  -> I | you | he
verb     -> love
object   -> Rust | Pisanix
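Because the grammar is so small, the set of legal SQLX sentences can simply be enumerated, and membership in the language is then a set lookup. A quick Python sketch:

```python
# The SQLX toy grammar above, encoded as choices per non-terminal.
# Enumerating every subject/verb/object combination yields the whole
# language, so checking a sentence is just a set lookup.
from itertools import product

grammar = {
    "subject": ["I", "you", "he"],
    "verb":    ["love"],
    "object":  ["Rust", "Pisanix"],
}

sentences = {" ".join(words)
             for words in product(grammar["subject"], grammar["verb"], grammar["object"])}

def is_valid(sentence):
    return sentence in sentences

print(len(sentences))          # 3 subjects x 1 verb x 2 objects = 6 sentences
print(is_valid("I love Rust")) # True
```

Real grammars are recursive and describe infinitely many sentences, which is why parsers derive sentences from productions instead of enumerating them.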

Taking the SQL statement SELECT 1 + 1 as an example, we can write out its derivation as follows; the parse tree is shown in Figure 2.3.1.

Expr => SELECT Expr + Expr => SELECT number + Expr => SELECT number + number => SELECT 1 + 1

Figure 2.3.1

2.3.1 Two parsing approaches

Syntax analysis can proceed in two ways:

  • Top-down parsing: start from the start symbol, repeatedly pick a suitable production, and expand the non-terminals in the intermediate sentential form until the given sentence is reached.
  • Bottom-up parsing: proceed in exactly the opposite order, starting from the given sentence, repeatedly picking a suitable production and folding substrings of the intermediate sentential form into non-terminals, until only the start symbol remains.

During a derivation, only one production can be applied at each step, and every other production is ruled out. In actual parsing, however, an intermediate step may find that no production applies, or that several do. The second case requires backtracking: first try one production; if the derivation reaches the final sentence (or the start symbol), that production works. If continuing the derivation instead runs into the first case, back up to this point and choose another production. If every production at this point eventually runs into the first case, the sentence does not conform to the grammar. If several productions here can each derive the final sentence (or the start symbol), the grammar is ambiguous. Parsing with backtracking is generally very slow, so backtracking is usually avoided by constructing the grammar carefully.
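The try-and-backtrack behaviour just described can be made concrete with a tiny top-down parser for the SQLX grammar above. This is an illustrative sketch, not how production parsers are built:

```python
# A minimal top-down parser with backtracking for the SQLX toy grammar.
# parse() tries each production of a non-terminal in turn; if one fails
# partway through, it backs up and tries the next, exactly the retry
# behaviour described in the text.
GRAMMAR = {
    "sentence": [["subject", "verb", "object"]],
    "subject":  [["I"], ["you"], ["he"]],
    "verb":     [["love"]],
    "object":   [["Rust"], ["Pisanix"]],
}

def parse(symbol, tokens, pos=0):
    """Return the position after matching `symbol` at `pos`, or None on failure."""
    if symbol not in GRAMMAR:  # terminal: must match the next input token
        return pos + 1 if pos < len(tokens) and tokens[pos] == symbol else None
    for production in GRAMMAR[symbol]:  # try each production, backtracking on failure
        p = pos
        for sym in production:
            p = parse(sym, tokens, p)
            if p is None:
                break
        else:
            return p
    return None

def accepts(sentence):
    tokens = sentence.split()
    return parse("sentence", tokens) == len(tokens)

print(accepts("you love Pisanix"))  # True
print(accepts("love you Pisanix"))  # False
```

For "you love Pisanix", the subject non-terminal first tries the production "I", fails, backtracks, and succeeds with "you"; on a grammar with deeper recursion this retrying is what makes naive backtracking slow.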
2.4  Summary

The content above introduced the main concepts around compilers and used a small example to give an intuitive feel for the basic principles and workflow of lexical analysis. Next, we briefly introduce the two tools most commonly used for lexical and syntax analysis: Lex and Yacc.

3. Getting to know Lex and Yacc

3.1 Getting to know Lex

Flex (the fast lexical analyzer generator) is a free, open-source alternative to Lex. It is a program that generates lexical analyzers (also known as "scanners" or "lexers"). A scanner is a program that recognizes lexical patterns in text.
3.1.1 The three sections of a Lex file

  • Definition section: includes literal blocks, definitions, internal table declarations, start conditions, and translations.
  • Rules section: a series of matching patterns and actions; patterns are usually written as regular expressions, and the action parts are C code.
  • User subroutine section: C code that is copied verbatim into the generated C file; auxiliary functions used by the action code are usually defined here.

Here is a simple example that uses Lex to implement a basic SQL lexer. The lex code is as follows:

%{
#include <stdio.h>
%}

%%

select                  printf("KW-SELECT      : %s\n", yytext);
from                    printf("KW-FROM        : %s\n", yytext);
where                   printf("KW-WHERE       : %s\n", yytext);
and                     printf("KW-AND         : %s\n", yytext);
or                      printf("KW-OR          : %s\n", yytext);
[*]                     printf("IDENTIFIED     : %s\n", yytext);
[,]                     printf("IDENTIFIED     : %s\n", yytext);
[=]                     printf("OP-EQ          : %s\n", yytext);
[<]                     printf("KW-LT          : %s\n", yytext);
[>]                     printf("KW-GT          : %s\n", yytext);
[a-zA-Z][a-zA-Z0-9]*    printf("IDENTIFIED:    : %s\n", yytext);
[0-9]+                  printf("NUM:           : %s\n", yytext);
[ \t]+                  printf(" ");
.                       printf("Unknown : %c\n", yytext[0]);

%%

int main(int argc, char* argv[]) {
    yylex();
    return 0;
}

int yywrap() {
    return 1;
}

# flex sql.l            # compile the .l file with flex to generate C code
# ls
lex.yy.c sql.l
# gcc -o sql lex.yy.c   # compile into an executable binary
# ./sql                 # run the binary
select * from pisaproxy where id > 1 and sid < 2    # input test SQL
KW-SELECT      : select
IDENTIFIED     : *
KW-FROM        : from
IDENTIFIED:    : pisaproxy
KW-WHERE       : where
IDENTIFIED:    : id
KW-GT          : >
NUM:           : 1
KW-AND         : and
IDENTIFIED:    : sid
KW-LT          : <
NUM:           : 2

As the example shows, Lex successfully splits a SQL statement into individual tokens.
3.2 Getting to know Yacc

Yacc is an industrial-strength tool for building compilers; it uses the LALR(1) parsing method. In LR(k), the k (k >= 0) in parentheses is the number of input symbols of lookahead to the right. The LR method, given the symbol string on the current parse stack and up to k symbols of lookahead, uniquely determines whether the parser's next action is to shift or to reduce, and which production to reduce by.
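The shift/reduce decision can be illustrated with a hand-rolled loop for a tiny expression grammar. This is only an illustrative Python sketch; a real LALR(1) parser drives the same two actions from generated tables rather than hard-coded pattern checks:

```python
# A hand-rolled shift/reduce loop for the tiny grammar
#   E -> E "+" "n" | "n"
# At each step the parser either shifts the next token onto the stack or
# reduces the top of the stack by a production: the two actions an
# LR-family (e.g. LALR(1)) parser chooses between. The one-token peek at
# tokens[0] plays the role of the "1" of lookahead.
def shift_reduce(tokens):
    stack, trace = [], []
    tokens = list(tokens)
    while tokens or stack != ["E"]:
        if stack[-3:] == ["E", "+", "n"]:                      # reduce E -> E + n
            stack[-3:] = ["E"]; trace.append("reduce E->E+n")
        elif stack[-1:] == ["n"] and (not tokens or tokens[0] == "+"):
            stack[-1:] = ["E"]; trace.append("reduce E->n")    # reduce E -> n
        elif tokens:
            stack.append(tokens.pop(0)); trace.append("shift")
        else:
            raise SyntaxError("cannot parse")
    return trace

print(shift_reduce(["n", "+", "n"]))
```

For `n + n` the trace is shift, reduce E->n, shift, shift, reduce E->E+n: the bottom-up order described in section 2.3.1, collapsing the sentence back to the start symbol.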

Like Lex, a Yacc file consists of three sections separated by "%%": declarations, grammar rules, and C code.

  • Declarations and token definitions: the code between %{ and %}, as in Lex, is generally called the definition section; it holds header includes, macro definitions, variable and function declarations, and so on. Here, %left declares a left-associative operator and %right a right-associative one; definitions listed later have higher precedence, so multiplication and division bind tighter than addition and subtraction, and all four arithmetic operators are left-associative. With this simple technique, we can remove ambiguity from a grammar.
  • Rules section: grammar rules together with actions written in C. In a rule, the target (non-terminal) is placed on the left, followed by a colon (:), then the right-hand side of the production, then the corresponding action (enclosed in {}).
  • Code section: the function part. When Yacc hits a parse error, it calls yyerror(), whose implementation users can customize.
Below, a simple example implements a basic SQL parser with Yacc and Lex.

Example sql.l:

%{
#include <stdio.h>
#include <string.h>
#include "struct.h"
#include "sql.tab.h"
int numerorighe = 0;
%}
%option noyywrap
%%
select                  return SELECT;
from                    return FROM;
where                   return WHERE;
and                     return AND;
or                      return OR;
[*]                     return *yytext;
[,]                     return *yytext;
[=]                     return *yytext;
[<]                     return *yytext;
[>]                     return *yytext;
[a-zA-Z][a-zA-Z0-9]*    {yylval.Mystr = strdup(yytext); return IDENTIFIER;}
[0-9]+                  return CONST;
\n                      {++yylval.numerorighe; return NL;}
[ \t]+                  /* ignore whitespace */

%%

Excerpts from sql.y:

%{

%}

%token <numerorighe> NL
%token <Mystr> IDENTIFIER CONST '<' '>' '=' '*'
%token SELECT FROM WHERE AND OR
%type <Mystr> identifiers cond compare op

%%

lines:
    line
    | lines line
    | lines error
    ;
line:
    select identifiers FROM identifiers WHERE cond NL
    {
        ptr = putsymb($2, $4, $7);
    }
    ;

identifiers:
    '*' {$$ = "ALL";}
    | IDENTIFIER {$$ = $1;}
    | IDENTIFIER ',' identifiers
        {
            char* s = malloc(sizeof(char) * (strlen($1) + strlen($3) + 2));
            strcpy(s, $1);
            strcat(s, " ");
            strcat(s, $3); $$ = s;
        }
    ;

select:
    SELECT
    ;

cond:
    IDENTIFIER op compare
    | IDENTIFIER op compare conn cond;

compare:
    IDENTIFIER
    | CONST
    ;

op:
    '<'
    | '='
    | '>'
    ;

conn:
    AND
    | OR
    ;
%%


# The build steps are tedious, so only the key result is shown here
# ./parser "select id,name,age from pisaproxy1,pisaproxy2,pisaproxy3 where id > 1 and name = 'dasheng'"

Row #1 is correct
 columns: id name age
 tables: pisaproxy1 pisaproxy2 pisaproxy3

As we can see, Lex first splits the SQL into tokens, and Yacc then performs syntax analysis, correctly resolving the columns and tables according to the rules in the .y file.

4. SQL parsing in Pisa-Proxy


The SQL parser in Pisa-Proxy is implemented with Grmtools, a Rust library that provides lex and yacc functionality. SQL parsing in Pisa-Proxy consists of two main parts. First is the lex.rs file, which uses the facilities Grmtools provides to implement a hand-written lexer; as described earlier, this module splits a SQL statement into tokens. Then there is the grammar.y file, which describes the derivations of SQL statements; Grmtools parses this file and ultimately produces the SQL AST.

5. Summary


This article shared the concepts and principles behind compilers: what a SQL parser makes of a SQL statement, and how a SQL statement is transformed from a string into the abstract syntax tree we need. Lex and Yacc are two very powerful tools that help developers implement their own parsers quickly and easily, although compiler construction remains a very complex subject. The SQL parser in Pisa-Proxy also ran into many problems during implementation, such as resolving conflicts, ambiguity, and precedence. A later article will analyze in depth how SQL statements are parsed in the Rust implementation, so this article will not go into that here.

6. Related links

6.1 Pisanix:

Project address: https://github.com/database-mesh/pisanix
Official website: Hello from Pisanix | Pisanix
Database Mesh: https://www.database-mesh.io/
SphereEx official website: https://www.sphere-ex.com
6.2  Community

An open-source project has countless steps ahead of it, and Pisanix has only just begun. Open source is a door, and Pisanix welcomes everyone to join: share your ideas and insights, whether through code or documentation, an issue or a pull request. If you care about database governance, come build the Pisanix community with us.
The Pisanix community currently holds an online discussion every two weeks; we look forward to seeing you there.

7. References


  • "Compiler Principles"
  • "Modern Compiler Principles"
  • https://github.com/mysql/mysql-server/blob/8.0/sql/sql_yacc.yy
  • http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/