当前位置:网站首页>Lex & yacc of Pisa proxy SQL parsing
Lex & yacc of Pisa proxy SQL parsing
2022-07-07 17:19:00 【InfoQ】
One 、 Preface
1.1 The authors introduce
1.2 background
Two 、 Compiler exploration
2.1 Compiler workflow
- Scan the source file , Split the character stream of the source file into words (token), This is lexical analysis
- According to the grammar rules, these marks are constructed into a grammar tree , This is syntax analysis
- Check the relationship between the nodes of the syntax tree , Check whether semantic rules are violated , At the same time, the syntax tree is optimized as necessary , This is semantic analysis
- Traverse the nodes of the syntax tree , Convert each node into intermediate code , And put them together in a specific order , This is intermediate code generation
- Optimize the intermediate code
- Convert intermediate code into object code
- Optimize the object code , Generate the final target program
![](/img/02/5509fd4291f431e1121899411803be.png)
![](/img/26/e4724ef12d7bdc3382d359692a6289.png)
2.2 Lexical analysis
SELECT
*
FROM
pisa_proxy
SELECT
INSERT
DELETE
2.2.1 Direct scanning method
# -*- coding: utf-8 -*-
single_char_operators_typeA = {
";", ",", "(", ")","/", "+", "-", "*", "%", ".",
}
single_char_operators_typeB = {
"<", ">", "=", "!"
}
double_char_operators = {
">=", "<=", "==", "~="
}
reservedWords = {
"select", "insert", "update", "delete", "show",
"create", "set", "grant", "from", "where"
}
class Token:
def __init__(self, _type, _val = None):
if _val is None:
self.type = "T_" + _type;
self.val = _type;
else:
self.type, self.val = _type, _val
def __str__(self):
return "%-20s%s" % (self.type, self.val)
class NoneTerminateQuoteError(Exception):
pass
def isWhiteSpace(ch):
return ch in " \t\r\a\n"
def isDigit(ch):
return ch in "0123456789"
def isLetter(ch):
return ch in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def scan(s):
n, i = len(s), 0
while i < n:
ch, i = s[i], i + 1
if isWhiteSpace(ch):
continue
if ch == "#":
return
if ch in single_char_operators_typeA:
yield Token(ch)
elif ch in single_char_operators_typeB:
if i < n and s[i] == "=":
yield Token(ch + "=")
else:
yield Token(ch)
elif isLetter(ch) or ch == "_":
begin = i - 1
while i < n and (isLetter(s[i]) or isDigit(s[i]) or s[i] == "_"):
i += 1
word = s[begin:i]
if word in reservedWords:
yield Token(word)
else:
yield Token("T_identifier", word)
elif isDigit(ch):
begin = i - 1
aDot = False
while i < n:
if s[i] == ".":
if aDot:
raise Exception("Too many dot in a number!\n\tline:"+line)
aDot = True
elif not isDigit(s[i]):
break
i += 1
yield Token("T_double" if aDot else "T_integer", s[begin:i])
elif ord(ch) == 34: # 34 means '"'
begin = i
while i < n and ord(s[i]) != 34:
i += 1
if i == n:
raise Exception("Non-terminated string quote!\n\tline:"+line)
yield Token("T_string", chr(34) + s[begin:i] + chr(34))
i += 1
else:
raise Exception("Unknown symbol!\n\tline:"+line+"\n\tchar:"+ch)
if __name__ == "__main__":
print "%-20s%s" % ("TOKEN TYPE", "TOKEN VALUE")
print "-" * 50
sql = "select * from pisa_proxy where id = 1;"
for token in scan(sql):
print token
TOKEN TYPE TOKEN VALUE
--------------------------------------------------
T_select select
T_* *
T_from from
T_identifier pisa_proxy
T_where where
T_identifier id
T_= =
T_integer 1
T_; ;
reservedWords
single_char_operators_typeA
2.2.2 Regular expression scanning
2.3 Syntax analysis
you
I
He
Love
Rust
Pisanix
sentence -> The subject Predicate The object
The subject -> I
The subject -> you
The subject -> He
Predicate -> Love
The object -> Rust
The object -> Pisanix
Generative formula
non-terminal
Terminator
Start symbol
sentence -> The subject Predicate The object
The subject -> you | I | He
Predicate -> Love
The object -> Rust | Pisanix
Expr => Expr Expr + Expr => SELECT Expr + Expr => SELECT number + Expr => SELECT number + number => SELECT 1 + 2
![](/img/71/ca04be665586744e7a124587916d6e.png)
2.3.1 Two analytical methods
- Top down analysis : Top down analysis starts with the starting symbol , Keep picking out the right production , Expand the non terminator in the middle sentence , Finally, expand to the given sentence .
- Bottom up analysis : The order of bottom-up analysis is just the opposite to that of top-down analysis , Start with the given sentence , Keep picking out the right production , Fold the substring in the middle sentence into a non terminator , Finally collapse to the starting symbol
2.4 Summary
3、 ... and 、 know Lex and Yacc
3.1 know Lex
3.1.1 Lex It's made up of three parts
- Definition section : The definition segment includes text blocks 、 Definition 、 Internal table declaration 、 Starting conditions and transitions .
- The rules section : The rule segment is a series of matching patterns and actions , Patterns are usually written in regular expressions , The action part is C Code .
- User subroutine segment : Here for C Code , Will be copied to c In file , Generally, some auxiliary functions are defined here , Such as the auxiliary function used in the action code .
%{
#include <stdio.h>
%}
%%
select printf("KW-SELECT : %s\n", yytext);
from printf("KW-FROM : %s\n", yytext);
where printf("KW-WHERE : %s\n", yytext);
and printf("KW-AND : %s\n", yytext);
or printf("KW-OR : %s\n", yytext);
[*] printf("IDENTIFIED : %s\n", yytext);
[,] printf("IDENTIFIED : %s\n", yytext);
[=] printf("OP-EQ : %s\n", yytext);
[<] printf("KW-LT : %s\n", yytext);
[>] printf("KW-GT : %s\n", yytext);
[a-zA-Z][a-zA-Z0-9]* printf("IDENTIFIED: : %s\n", yytext);
[0-9]+ printf("NUM: : %s\n", yytext);
[ \t]+ printf(" ");
. printf("Unknown : %c\n",yytext[0]);
%%
int main(int argc, char* argv[]) {
yylex();
return 0;
}
int yywrap() {
return 1;
}
# flex sql.l # use flex compile .l File generation c Code
# ls
lex.yy.c sql.l
# gcc -o sql lex.yy.c # Compile and generate executable binaries
# ./sql # Execute binary
select * from pisaproxy where id > 1 and sid < 2 # Input test sql
KW-SELECT : select
IDENTIFIED : *
KW-FROM : from
IDENTIFIED: : pisaproxy
KW-WHERE : where
IDENTIFIED: : id
KW-GT : >
NUM: : 1
KW-AND : and
IDENTIFIED: : sid
KW-LT : <
NUM: : 2
3.2 know Yacc
- Define segments and predefined tag sections :
%left
%right
- The rules section : Rule segments consist of grammatical rules and include C The action composition of the code . In the rule, the target or non terminal character is placed on the left , Followed by a colon (:), Then there is the right side of production , Then there is the corresponding action ( use {} contain )
- Code section : This part is the function part . When Yacc When parsing error , Would call
yyerror()
, Users can customize the implementation of functions .
%{
#include <stdio.h>
#include <string.h>
#include "struct.h"
#include "sql.tab.h"
int numerorighe=0;
%}
%option noyywrap
%%
select return SELECT;
from return FROM;
where return WHERE;
and return AND;
or return OR;
[*] return *yytext;
[,] return *yytext;
[=] return *yytext;
[<] return *yytext;
[>] return *yytext;
[a-zA-Z][a-zA-Z0-9]* {yylval.Mystr=strdup(yytext);return IDENTIFIER;}
[0-9]+ return CONST;
\n {++yylval.numerorighe; return NL;}
[ \t]+ /* ignore whitespace */
%%
%{
%}
%token <numerorighe> NL
%token <Mystr> IDENTIFIER CONST '<' '>' '=' '*'
%token SELECT FROM WHERE AND OR
%type <Mystr> identifiers cond compare op
%%
lines:
line
| lines line
| lines error
;
line:
select identifiers FROM identifiers WHERE cond NL
{
ptr=putsymb($2,$4,$7);
}
;
identifiers:
'*' {$="ALL";}
| IDENTIFIER {$=$1;}
| IDENTIFIER','identifiers
{
char* s = malloc(sizeof(char)*(strlen($1)+strlen($3)+1));
strcpy(s,$1);
strcat(s," ");
strcat(s,$3); $=s;
}
;
select:
SELECT
;
cond:
IDENTIFIER op compare
| IDENTIFIER op compare conn cond;
compare:
IDENTIFIER
| CONST
;
op:
'<'
|'='
|'>'
;
conn:
AND
| OR
;
%%
# Here, because the compilation process is cumbersome , Here are just the key results
# ./parser "select id,name,age from pisaproxy1,pisaproxy2,pisaproxy3 where id > 1 and name = 'dasheng'"
''Row #1 is correct
columns: id name age
tables: pisaproxy1 pisaproxy2 pisaproxy3
column
tables
Four 、Pisa-Proxy SQL Analysis and implementation
lex.rs
grammar.y
5、 ... and 、 summary
6、 ... and . Related links :
6.1 Pisanix:
6.2 Community
7、 ... and 、 Reference material
- 《 Compiler principle 》
- 《 Principles of modern compilation 》
- https://github.com/mysql/mysql-server/blob/8.0/sql/sql_yacc.yy
- http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/
边栏推荐
- 编程模式-表驱动编程
- Number of exchanges in the 9th Blue Bridge Cup finals
- Skimage learning (1)
- Seaborn data visualization
- SlashData开发者工具榜首等你而定!!!
- 【Seaborn】组合图表、多子图的实现
- LeetCode 1049. Weight of the last stone II daily question
- 蓝桥杯 决赛 异或变换 100分
- Master this promotion path and share interview materials
- Mrs offline data analysis: process OBS data through Flink job
猜你喜欢
管理VDI的几个最佳实践
Pycharm IDE下载
skimage学习(3)——Gamma 和 log对比度调整、直方图均衡、为灰度图像着色
The latest interview experience of Android manufacturers in 2022, Android view+handler+binder
麒麟信安操作系统衍生产品解决方案 | 存储多路径管理系统,有效提高数据传输可靠性
QT中自定义控件的创建到封装到工具栏过程(二):自定义控件封装到工具栏
NeRF:DeepFake的最终替代者?
QT中自定义控件的创建到封装到工具栏过程(一):自定义控件的创建
直接上干货,100%好评
正在准备面试,分享面经
随机推荐
Solidity函数学习
最新阿里P7技术体系,妈妈再也不用担心我找工作了
【Seaborn】组合图表:PairPlot和JointPlot
LeetCode 213. 打家劫舍 II 每日一题
MRS离线数据分析:通过Flink作业处理OBS数据
Rpcms method of obtaining articles under the specified classification
The server is completely broken and cannot be repaired. How to use backup to restore it into a virtual machine without damage?
【源码解读】| LiveListenerBus源码解读
On Apache Doris Fe processing query SQL source code analysis
LeetCode 1186. 删除一次得到子数组最大和 每日一题
Smart logistics platform: make overseas warehouses smarter
Flask搭建api服务-生成API文档
LeetCode刷题day49
科普达人丨一文弄懂什么是云计算?
DNS 系列(一):为什么更新了 DNS 记录不生效?
Leetcode brush questions day49
[Seaborn] combination chart: pairplot and jointplot
Biped robot controlled by Arduino
LeetCode 300. Daily question of the longest increasing subsequence
The mail server is listed in the blacklist. How to unblock it quickly?