Lex & Yacc in Pisa-Proxy SQL Parsing
2022-07-07 17:19:00 【InfoQ】
1. Preface
1.1 About the authors
1.2 Background
2. Compiler Exploration
2.1 Compiler workflow
- Scan the source file and split its character stream into tokens; this is lexical analysis.
- Build those tokens into a syntax tree according to the grammar rules; this is syntax analysis.
- Check the relationships between syntax-tree nodes for semantic-rule violations and optimize the tree where necessary; this is semantic analysis.
- Traverse the syntax tree, convert each node into intermediate code, and assemble the pieces in a specific order; this is intermediate code generation.
- Optimize the intermediate code.
- Convert the intermediate code into object code.
- Optimize the object code to produce the final target program.
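To make this pipeline concrete, the sketch below runs a stripped-down version of it on arithmetic expressions such as 1+2*3: a tokenizer, a tiny hand-written parser, and a code generator that emits instructions for an imaginary stack machine. The intermediate-code and optimization phases are skipped, and every name in it (tokenize, parse, generate, the PUSH/ADD/MUL instructions) is made up for illustration rather than taken from any real compiler.

```python
import re

def tokenize(src):
    """Lexical analysis: split the character stream into tokens."""
    return re.findall(r"\d+|[+*]", src)

def parse(tokens):
    """Syntax analysis: build a tree for  expr -> term ('+' term)*,  term -> NUM ('*' NUM)*."""
    def term(pos):
        node, pos = ("num", int(tokens[pos])), pos + 1
        while pos < len(tokens) and tokens[pos] == "*":
            node, pos = ("*", node, ("num", int(tokens[pos + 1]))), pos + 2
        return node, pos

    def expr(pos):
        node, pos = term(pos)
        while pos < len(tokens) and tokens[pos] == "+":
            rhs, pos = term(pos + 1)
            node = ("+", node, rhs)
        return node, pos

    tree, pos = expr(0)
    if pos != len(tokens):          # a trivial stand-in for the checking phase
        raise SyntaxError("unexpected trailing tokens")
    return tree

def generate(tree):
    """Code generation: emit instructions for a toy stack machine."""
    if tree[0] == "num":
        return ["PUSH %d" % tree[1]]
    op = {"+": "ADD", "*": "MUL"}[tree[0]]
    return generate(tree[1]) + generate(tree[2]) + [op]

if __name__ == "__main__":
    for instr in generate(parse(tokenize("1+2*3"))):
        print(instr)
    # PUSH 1, PUSH 2, PUSH 3, MUL, ADD
```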
2.2 Lexical analysis
For example, lexical analysis splits the statement

SELECT * FROM pisa_proxy

into the tokens SELECT, *, FROM, and pisa_proxy. Keywords such as SELECT, INSERT, and DELETE are recognized as reserved words rather than ordinary identifiers.
2.2.1 Direct scanning method
# -*- coding: utf-8 -*-
single_char_operators_typeA = {
    ";", ",", "(", ")", "/", "+", "-", "*", "%", ".",
}
single_char_operators_typeB = {
    "<", ">", "=", "!"
}
# Listed for completeness; scan() below recognizes these by peeking one character ahead.
double_char_operators = {
    ">=", "<=", "==", "~="
}
reservedWords = {
    "select", "insert", "update", "delete", "show",
    "create", "set", "grant", "from", "where"
}


class Token:
    def __init__(self, _type, _val=None):
        if _val is None:
            self.type = "T_" + _type
            self.val = _type
        else:
            self.type, self.val = _type, _val

    def __str__(self):
        return "%-20s%s" % (self.type, self.val)


class NoneTerminateQuoteError(Exception):
    pass


def isWhiteSpace(ch):
    return ch in " \t\r\a\n"


def isDigit(ch):
    return ch in "0123456789"


def isLetter(ch):
    return ch in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def scan(s):
    n, i = len(s), 0
    while i < n:
        ch, i = s[i], i + 1
        if isWhiteSpace(ch):
            continue
        if ch == "#":                      # '#' starts a comment: stop scanning
            return
        if ch in single_char_operators_typeA:
            yield Token(ch)
        elif ch in single_char_operators_typeB:
            # peek ahead: "<=", ">=", "==", "!=" are two-character operators
            if i < n and s[i] == "=":
                yield Token(ch + "=")
                i += 1
            else:
                yield Token(ch)
        elif isLetter(ch) or ch == "_":
            begin = i - 1
            while i < n and (isLetter(s[i]) or isDigit(s[i]) or s[i] == "_"):
                i += 1
            word = s[begin:i]
            if word in reservedWords:
                yield Token(word)
            else:
                yield Token("T_identifier", word)
        elif isDigit(ch):
            begin = i - 1
            aDot = False
            while i < n:
                if s[i] == ".":
                    if aDot:
                        raise Exception("Too many dots in a number at position %d" % i)
                    aDot = True
                elif not isDigit(s[i]):
                    break
                i += 1
            yield Token("T_double" if aDot else "T_integer", s[begin:i])
        elif ord(ch) == 34:                # 34 is the ASCII code of '"'
            begin = i
            while i < n and ord(s[i]) != 34:
                i += 1
            if i == n:
                raise NoneTerminateQuoteError("Non-terminated string quote at position %d" % begin)
            yield Token("T_string", chr(34) + s[begin:i] + chr(34))
            i += 1
        else:
            raise Exception("Unknown symbol %r at position %d" % (ch, i - 1))


if __name__ == "__main__":
    print("%-20s%s" % ("TOKEN TYPE", "TOKEN VALUE"))
    print("-" * 50)
    sql = "select * from pisa_proxy where id = 1;"
    for token in scan(sql):
        print(token)
TOKEN TYPE TOKEN VALUE
--------------------------------------------------
T_select select
T_* *
T_from from
T_identifier pisa_proxy
T_where where
T_identifier id
T_= =
T_integer 1
T_; ;
The scanner recognizes keywords and operators simply by looking them up in the reservedWords and single_char_operators_typeA / typeB tables defined above; supporting more of SQL means growing these tables and the corresponding branches in scan().
2.2.2 Regular expression scanning
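The same token stream can also be produced by pattern matching instead of hand-written character checks. The sketch below uses Python's re module for this; the token names and regular expressions are assumptions chosen to mirror the direct scanner above, not rules taken from any real SQL lexer.

```python
import re

# A regex-based scanner sketch for the same toy SQL subset.
# Each token class becomes one named pattern; order matters (keywords before identifiers).
TOKEN_SPEC = [
    ("T_keyword",    r"\b(?:select|insert|update|delete|from|where)\b"),
    ("T_double",     r"\d+\.\d+"),
    ("T_integer",    r"\d+"),
    ("T_identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("T_operator",   r"<=|>=|==|[<>=!*,;().+%/-]"),
    ("T_whitespace", r"\s+"),
]
MASTER_RE = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def regex_scan(sql):
    """Yield (token_type, token_value) pairs, skipping whitespace."""
    pos = 0
    while pos < len(sql):
        m = MASTER_RE.match(sql, pos)
        if m is None:
            raise SyntaxError("unknown symbol %r at position %d" % (sql[pos], pos))
        if m.lastgroup != "T_whitespace":
            yield m.lastgroup, m.group()
        pos = m.end()

if __name__ == "__main__":
    for tok_type, tok_val in regex_scan("select * from pisa_proxy where id = 1;"):
        print("%-20s%s" % (tok_type, tok_val))
```

This is essentially what Lex/Flex automates for us: each pattern-action pair in a .l file plays the same role as one entry in TOKEN_SPEC here.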
2.3 Syntax analysis
Suppose the language we want to parse contains only the words: you, I, he, love, Rust, Pisanix.
sentence  -> subject predicate object
subject   -> I
subject   -> you
subject   -> he
predicate -> love
object    -> Rust
object    -> Pisanix
Each rule above is called a production. Symbols that appear on the left-hand side of some production, such as sentence and subject, are non-terminals; symbols that only ever appear on the right-hand side, such as love and Rust, are terminals; and the symbol from which every derivation starts, here sentence, is the start symbol. Using | to combine alternatives, the grammar can be written more compactly:
sentence  -> subject predicate object
subject   -> you | I | he
predicate -> love
object    -> Rust | Pisanix
For a grammar of simple SELECT expressions, the statement SELECT 1 + 2 can be derived step by step from the start symbol Expr:

Expr => SELECT Expr
     => SELECT Expr + Expr
     => SELECT number + Expr
     => SELECT number + number
     => SELECT 1 + 2
2.3.1 Two analytical methods
- Top-down analysis: start from the start symbol and repeatedly pick a suitable production to expand a non-terminal in the current sentential form, until the given sentence is derived (see the recursive-descent sketch after this list).
- Bottom-up analysis: proceed in the opposite direction; start from the given sentence and repeatedly pick a suitable production to fold a substring of the current sentential form back into a non-terminal, until only the start symbol remains.
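As a concrete illustration of top-down analysis, here is a minimal recursive-descent parser for the toy sentence grammar above. It is only a sketch: the set names and the tuple-based parse tree are choices made for this example, and error handling is limited to raising SyntaxError.

```python
# Grammar (terminals in quotes):
#   sentence  -> subject predicate object
#   subject   -> "you" | "I" | "he"
#   predicate -> "love"
#   object    -> "Rust" | "Pisanix"

SUBJECTS = {"you", "I", "he"}
PREDICATES = {"love"}
OBJECTS = {"Rust", "Pisanix"}

def parse_sentence(tokens):
    """Top-down parse: expand sentence -> subject predicate object, left to right."""
    pos = 0

    def expect(valid, name):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in valid:
            tok = tokens[pos]
            pos += 1
            return (name, tok)
        raise SyntaxError("expected a %s at position %d" % (name, pos))

    tree = ("sentence",
            expect(SUBJECTS, "subject"),
            expect(PREDICATES, "predicate"),
            expect(OBJECTS, "object"))
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens: %r" % tokens[pos:])
    return tree

if __name__ == "__main__":
    print(parse_sentence("I love Pisanix".split()))
    # ('sentence', ('subject', 'I'), ('predicate', 'love'), ('object', 'Pisanix'))
```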
2.4 Summary
3. Getting to Know Lex and Yacc
3.1 Getting to know Lex
3.1.1 The three parts of a Lex file
- Definition section: contains literal code blocks, definitions, internal table declarations, start conditions, and transitions.
- Rules section: a series of matching patterns and actions; the patterns are usually written as regular expressions and the actions are C code.
- User subroutine section: C code that is copied verbatim into the generated .c file; helper functions used by the action code are typically defined here.
%{
#include <stdio.h>
%}
%%
select printf("KW-SELECT : %s\n", yytext);
from printf("KW-FROM : %s\n", yytext);
where printf("KW-WHERE : %s\n", yytext);
and printf("KW-AND : %s\n", yytext);
or printf("KW-OR : %s\n", yytext);
[*] printf("IDENTIFIED : %s\n", yytext);
[,] printf("IDENTIFIED : %s\n", yytext);
[=] printf("OP-EQ : %s\n", yytext);
[<] printf("KW-LT : %s\n", yytext);
[>] printf("KW-GT : %s\n", yytext);
[a-zA-Z][a-zA-Z0-9]* printf("IDENTIFIED: : %s\n", yytext);
[0-9]+ printf("NUM: : %s\n", yytext);
[ \t]+ printf(" ");
. printf("Unknown : %c\n",yytext[0]);
%%
int main(int argc, char* argv[]) {
    yylex();
    return 0;
}

int yywrap() {
    return 1;
}
# flex sql.l                 # compile the .l file with flex to generate C code
# ls
lex.yy.c  sql.l
# gcc -o sql lex.yy.c        # compile the generated C code into an executable
# ./sql                      # run the binary
select * from pisaproxy where id > 1 and sid < 2     # type a test SQL statement
KW-SELECT : select
IDENTIFIED : *
KW-FROM : from
IDENTIFIED: : pisaproxy
KW-WHERE : where
IDENTIFIED: : id
KW-GT : >
NUM: : 1
KW-AND : and
IDENTIFIED: : sid
KW-LT : <
NUM: : 2
3.2 Getting to know Yacc
- Definition and token declaration section: declares tokens and specifies operator precedence and associativity with directives such as %left and %right.
- Rules section: grammar rules together with actions written in C. The target (a non-terminal) is placed on the left, followed by a colon (:), then the right-hand side of the production, and finally the corresponding action (enclosed in {}).
- Code section: user functions go here. When Yacc hits a parse error it calls yyerror(), which users can implement themselves.
%{
#include <stdio.h>
#include <string.h>
#include "struct.h"
#include "sql.tab.h"
int numerorighe=0;
%}
%option noyywrap
%%
select return SELECT;
from return FROM;
where return WHERE;
and return AND;
or return OR;
[*] return *yytext;
[,] return *yytext;
[=] return *yytext;
[<] return *yytext;
[>] return *yytext;
[a-zA-Z][a-zA-Z0-9]* {yylval.Mystr=strdup(yytext);return IDENTIFIER;}
[0-9]+ return CONST;
\n {++yylval.numerorighe; return NL;}
[ \t]+ /* ignore whitespace */
%%
%{
%}
%token <numerorighe> NL
%token <Mystr> IDENTIFIER CONST '<' '>' '=' '*'
%token SELECT FROM WHERE AND OR
%type <Mystr> identifiers cond compare op
%%
lines:
      line
    | lines line
    | lines error
    ;

line:
      select identifiers FROM identifiers WHERE cond NL
      {
          ptr = putsymb($2, $4, $7);
      }
    ;
identifiers:
      '*'          { $$ = "ALL"; }
    | IDENTIFIER   { $$ = $1; }
    | IDENTIFIER ',' identifiers
      {
          /* concatenate the identifiers with a space; +2 covers the separator and the trailing '\0' */
          char* s = malloc(sizeof(char) * (strlen($1) + strlen($3) + 2));
          strcpy(s, $1);
          strcat(s, " ");
          strcat(s, $3);
          $$ = s;
      }
    ;
select:
      SELECT
    ;

cond:
      IDENTIFIER op compare
    | IDENTIFIER op compare conn cond
    ;

compare:
      IDENTIFIER
    | CONST
    ;

op:
      '<'
    | '='
    | '>'
    ;

conn:
      AND
    | OR
    ;
%%
# The build steps are omitted here; only the key results are shown
# ./parser "select id,name,age from pisaproxy1,pisaproxy2,pisaproxy3 where id > 1 and name = 'dasheng'"
Row #1 is correct
columns: id name age
tables: pisaproxy1 pisaproxy2 pisaproxy3
The parser correctly extracts the column list and the table list from the statement.
4. Pisa-Proxy SQL Parsing Implementation
Pisa-Proxy's SQL parser follows the same pattern: the lexer is implemented in lex.rs and the grammar rules are defined in grammar.y.
5. Summary
6. Related Links
6.1 Pisanix:
6.2 Community
7. References
- 《Compiler Principles》
- 《Principles of Modern Compilation》
- https://github.com/mysql/mysql-server/blob/8.0/sql/sql_yacc.yy
- http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/