Lex & Yacc in Pisa-Proxy SQL Parsing
2022-07-07 17:19:00 【InfoQ】
1. Preface
1.1 About the authors
1.2 Background
2. Compiler Exploration
2.1 Compiler workflow
- Scan the source file and split its character stream into tokens; this is lexical analysis.
- Build those tokens into a syntax tree according to the grammar rules; this is syntax analysis.
- Check the relationships between syntax-tree nodes for semantic-rule violations and optimize the tree where necessary; this is semantic analysis.
- Traverse the syntax tree, convert each node into intermediate code, and assemble the pieces in a specific order; this is intermediate code generation.
- Optimize the intermediate code.
- Convert the intermediate code into object code.
- Optimize the object code to produce the final target program.
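To make this pipeline concrete, the sketch below runs a stripped-down version of it on arithmetic expressions such as 1+2*3: a tokenizer, a tiny hand-written parser, and a code generator that emits instructions for an imaginary stack machine. The intermediate-code and optimization phases are skipped, and every name in it (tokenize, parse, generate, the PUSH/ADD/MUL instructions) is made up for illustration rather than taken from any real compiler.

```python
import re

def tokenize(src):
    """Lexical analysis: split the character stream into tokens."""
    return re.findall(r"\d+|[+*]", src)

def parse(tokens):
    """Syntax analysis: build a tree for  expr -> term ('+' term)*,  term -> NUM ('*' NUM)*."""
    def term(pos):
        node, pos = ("num", int(tokens[pos])), pos + 1
        while pos < len(tokens) and tokens[pos] == "*":
            node, pos = ("*", node, ("num", int(tokens[pos + 1]))), pos + 2
        return node, pos

    def expr(pos):
        node, pos = term(pos)
        while pos < len(tokens) and tokens[pos] == "+":
            rhs, pos = term(pos + 1)
            node = ("+", node, rhs)
        return node, pos

    tree, pos = expr(0)
    if pos != len(tokens):          # a trivial stand-in for the checking phase
        raise SyntaxError("unexpected trailing tokens")
    return tree

def generate(tree):
    """Code generation: emit instructions for a toy stack machine."""
    if tree[0] == "num":
        return ["PUSH %d" % tree[1]]
    op = {"+": "ADD", "*": "MUL"}[tree[0]]
    return generate(tree[1]) + generate(tree[2]) + [op]

if __name__ == "__main__":
    for instr in generate(parse(tokenize("1+2*3"))):
        print(instr)
    # PUSH 1, PUSH 2, PUSH 3, MUL, ADD
```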
2.2 Lexical analysis
For example, lexical analysis splits the statement

SELECT * FROM pisa_proxy

into the tokens SELECT, *, FROM, and pisa_proxy. Keywords such as SELECT, INSERT, and DELETE are recognized as reserved words rather than ordinary identifiers.
2.2.1 Direct scanning method
# -*- coding: utf-8 -*-
single_char_operators_typeA = {
    ";", ",", "(", ")", "/", "+", "-", "*", "%", ".",
}
single_char_operators_typeB = {
    "<", ">", "=", "!"
}
# Listed for completeness; scan() below recognizes these by peeking one character ahead.
double_char_operators = {
    ">=", "<=", "==", "~="
}
reservedWords = {
    "select", "insert", "update", "delete", "show",
    "create", "set", "grant", "from", "where"
}


class Token:
    def __init__(self, _type, _val=None):
        if _val is None:
            self.type = "T_" + _type
            self.val = _type
        else:
            self.type, self.val = _type, _val

    def __str__(self):
        return "%-20s%s" % (self.type, self.val)


class NoneTerminateQuoteError(Exception):
    pass


def isWhiteSpace(ch):
    return ch in " \t\r\a\n"


def isDigit(ch):
    return ch in "0123456789"


def isLetter(ch):
    return ch in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def scan(s):
    n, i = len(s), 0
    while i < n:
        ch, i = s[i], i + 1
        if isWhiteSpace(ch):
            continue
        if ch == "#":                      # '#' starts a comment: stop scanning
            return
        if ch in single_char_operators_typeA:
            yield Token(ch)
        elif ch in single_char_operators_typeB:
            # peek ahead: "<=", ">=", "==", "!=" are two-character operators
            if i < n and s[i] == "=":
                yield Token(ch + "=")
                i += 1
            else:
                yield Token(ch)
        elif isLetter(ch) or ch == "_":
            begin = i - 1
            while i < n and (isLetter(s[i]) or isDigit(s[i]) or s[i] == "_"):
                i += 1
            word = s[begin:i]
            if word in reservedWords:
                yield Token(word)
            else:
                yield Token("T_identifier", word)
        elif isDigit(ch):
            begin = i - 1
            aDot = False
            while i < n:
                if s[i] == ".":
                    if aDot:
                        raise Exception("Too many dots in a number at position %d" % i)
                    aDot = True
                elif not isDigit(s[i]):
                    break
                i += 1
            yield Token("T_double" if aDot else "T_integer", s[begin:i])
        elif ord(ch) == 34:                # 34 is the ASCII code of '"'
            begin = i
            while i < n and ord(s[i]) != 34:
                i += 1
            if i == n:
                raise NoneTerminateQuoteError("Non-terminated string quote at position %d" % begin)
            yield Token("T_string", chr(34) + s[begin:i] + chr(34))
            i += 1
        else:
            raise Exception("Unknown symbol %r at position %d" % (ch, i - 1))


if __name__ == "__main__":
    print("%-20s%s" % ("TOKEN TYPE", "TOKEN VALUE"))
    print("-" * 50)
    sql = "select * from pisa_proxy where id = 1;"
    for token in scan(sql):
        print(token)
TOKEN TYPE TOKEN VALUE
--------------------------------------------------
T_select select
T_* *
T_from from
T_identifier pisa_proxy
T_where where
T_identifier id
T_= =
T_integer 1
T_; ;
The scanner recognizes keywords and operators simply by looking them up in the reservedWords and single_char_operators_typeA / typeB tables defined above; supporting more of SQL means growing these tables and the corresponding branches in scan().
2.2.2 Regular expression scanning
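The same token stream can also be produced by pattern matching instead of hand-written character checks. The sketch below uses Python's re module for this; the token names and regular expressions are assumptions chosen to mirror the direct scanner above, not rules taken from any real SQL lexer.

```python
import re

# A regex-based scanner sketch for the same toy SQL subset.
# Each token class becomes one named pattern; order matters (keywords before identifiers).
TOKEN_SPEC = [
    ("T_keyword",    r"\b(?:select|insert|update|delete|from|where)\b"),
    ("T_double",     r"\d+\.\d+"),
    ("T_integer",    r"\d+"),
    ("T_identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("T_operator",   r"<=|>=|==|[<>=!*,;().+%/-]"),
    ("T_whitespace", r"\s+"),
]
MASTER_RE = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def regex_scan(sql):
    """Yield (token_type, token_value) pairs, skipping whitespace."""
    pos = 0
    while pos < len(sql):
        m = MASTER_RE.match(sql, pos)
        if m is None:
            raise SyntaxError("unknown symbol %r at position %d" % (sql[pos], pos))
        if m.lastgroup != "T_whitespace":
            yield m.lastgroup, m.group()
        pos = m.end()

if __name__ == "__main__":
    for tok_type, tok_val in regex_scan("select * from pisa_proxy where id = 1;"):
        print("%-20s%s" % (tok_type, tok_val))
```

This is essentially what Lex/Flex automates for us: each pattern-action pair in a .l file plays the same role as one entry in TOKEN_SPEC here.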
2.3 Syntax analysis
Suppose the language we want to parse contains only the words: you, I, he, love, Rust, Pisanix.
sentence  -> subject predicate object
subject   -> I
subject   -> you
subject   -> he
predicate -> love
object    -> Rust
object    -> Pisanix
Each rule above is called a production. Symbols that appear on the left-hand side of some production, such as sentence and subject, are non-terminals; symbols that only ever appear on the right-hand side, such as love and Rust, are terminals; and the symbol from which every derivation starts, here sentence, is the start symbol. Using | to combine alternatives, the grammar can be written more compactly:
sentence  -> subject predicate object
subject   -> you | I | he
predicate -> love
object    -> Rust | Pisanix
For a grammar of simple SELECT expressions, the statement SELECT 1 + 2 can be derived step by step from the start symbol Expr:

Expr => SELECT Expr
     => SELECT Expr + Expr
     => SELECT number + Expr
     => SELECT number + number
     => SELECT 1 + 2
2.3.1 Two analytical methods
- Top-down analysis: start from the start symbol and repeatedly pick a suitable production to expand a non-terminal in the current sentential form, until the given sentence is derived (see the recursive-descent sketch after this list).
- Bottom-up analysis: proceed in the opposite direction; start from the given sentence and repeatedly pick a suitable production to fold a substring of the current sentential form back into a non-terminal, until only the start symbol remains.
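As a concrete illustration of top-down analysis, here is a minimal recursive-descent parser for the toy sentence grammar above. It is only a sketch: the set names and the tuple-based parse tree are choices made for this example, and error handling is limited to raising SyntaxError.

```python
# Grammar (terminals in quotes):
#   sentence  -> subject predicate object
#   subject   -> "you" | "I" | "he"
#   predicate -> "love"
#   object    -> "Rust" | "Pisanix"

SUBJECTS = {"you", "I", "he"}
PREDICATES = {"love"}
OBJECTS = {"Rust", "Pisanix"}

def parse_sentence(tokens):
    """Top-down parse: expand sentence -> subject predicate object, left to right."""
    pos = 0

    def expect(valid, name):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in valid:
            tok = tokens[pos]
            pos += 1
            return (name, tok)
        raise SyntaxError("expected a %s at position %d" % (name, pos))

    tree = ("sentence",
            expect(SUBJECTS, "subject"),
            expect(PREDICATES, "predicate"),
            expect(OBJECTS, "object"))
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens: %r" % tokens[pos:])
    return tree

if __name__ == "__main__":
    print(parse_sentence("I love Pisanix".split()))
    # ('sentence', ('subject', 'I'), ('predicate', 'love'), ('object', 'Pisanix'))
```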
2.4 Summary
3. Getting to Know Lex and Yacc
3.1 Getting to know Lex
3.1.1 The three parts of a Lex file
- Definition section: contains literal code blocks, definitions, internal table declarations, start conditions, and transitions.
- Rules section: a series of matching patterns and actions; the patterns are usually written as regular expressions and the actions are C code.
- User subroutine section: C code that is copied verbatim into the generated .c file; helper functions used by the action code are typically defined here.
%{
#include <stdio.h>
%}
%%
select printf("KW-SELECT : %s\n", yytext);
from printf("KW-FROM : %s\n", yytext);
where printf("KW-WHERE : %s\n", yytext);
and printf("KW-AND : %s\n", yytext);
or printf("KW-OR : %s\n", yytext);
[*] printf("IDENTIFIED : %s\n", yytext);
[,] printf("IDENTIFIED : %s\n", yytext);
[=] printf("OP-EQ : %s\n", yytext);
[<] printf("KW-LT : %s\n", yytext);
[>] printf("KW-GT : %s\n", yytext);
[a-zA-Z][a-zA-Z0-9]* printf("IDENTIFIED: : %s\n", yytext);
[0-9]+ printf("NUM: : %s\n", yytext);
[ \t]+ printf(" ");
. printf("Unknown : %c\n",yytext[0]);
%%
int main(int argc, char* argv[]) {
    yylex();
    return 0;
}

int yywrap() {
    return 1;
}
# flex sql.l                 # compile the .l file with flex to generate C code
# ls
lex.yy.c  sql.l
# gcc -o sql lex.yy.c        # compile the generated C code into an executable
# ./sql                      # run the binary
select * from pisaproxy where id > 1 and sid < 2     # type a test SQL statement
KW-SELECT : select
IDENTIFIED : *
KW-FROM : from
IDENTIFIED: : pisaproxy
KW-WHERE : where
IDENTIFIED: : id
KW-GT : >
NUM: : 1
KW-AND : and
IDENTIFIED: : sid
KW-LT : <
NUM: : 2
3.2 Getting to know Yacc
- Definition and token declaration section: declares tokens and specifies operator precedence and associativity with directives such as %left and %right.
- Rules section: grammar rules together with actions written in C. The target (a non-terminal) is placed on the left, followed by a colon (:), then the right-hand side of the production, and finally the corresponding action (enclosed in {}).
- Code section: user functions go here. When Yacc hits a parse error it calls yyerror(), which users can implement themselves.
%{
#include <stdio.h>
#include <string.h>
#include "struct.h"
#include "sql.tab.h"
int numerorighe=0;
%}
%option noyywrap
%%
select return SELECT;
from return FROM;
where return WHERE;
and return AND;
or return OR;
[*] return *yytext;
[,] return *yytext;
[=] return *yytext;
[<] return *yytext;
[>] return *yytext;
[a-zA-Z][a-zA-Z0-9]* {yylval.Mystr=strdup(yytext);return IDENTIFIER;}
[0-9]+ return CONST;
\n {++yylval.numerorighe; return NL;}
[ \t]+ /* ignore whitespace */
%%
%{
%}
%token <numerorighe> NL
%token <Mystr> IDENTIFIER CONST '<' '>' '=' '*'
%token SELECT FROM WHERE AND OR
%type <Mystr> identifiers cond compare op
%%
lines:
      line
    | lines line
    | lines error
    ;

line:
      select identifiers FROM identifiers WHERE cond NL
      {
          ptr = putsymb($2, $4, $7);
      }
    ;
identifiers:
      '*'          { $$ = "ALL"; }
    | IDENTIFIER   { $$ = $1; }
    | IDENTIFIER ',' identifiers
      {
          /* concatenate the identifiers with a space; +2 covers the separator and the trailing '\0' */
          char* s = malloc(sizeof(char) * (strlen($1) + strlen($3) + 2));
          strcpy(s, $1);
          strcat(s, " ");
          strcat(s, $3);
          $$ = s;
      }
    ;
select:
      SELECT
    ;

cond:
      IDENTIFIER op compare
    | IDENTIFIER op compare conn cond
    ;

compare:
      IDENTIFIER
    | CONST
    ;

op:
      '<'
    | '='
    | '>'
    ;

conn:
      AND
    | OR
    ;
%%
# The build steps are omitted here; only the key results are shown
# ./parser "select id,name,age from pisaproxy1,pisaproxy2,pisaproxy3 where id > 1 and name = 'dasheng'"
Row #1 is correct
columns: id name age
tables: pisaproxy1 pisaproxy2 pisaproxy3
The parser correctly extracts the column list and the table list from the statement.
4. Pisa-Proxy SQL Parsing Implementation
Pisa-Proxy's SQL parser follows the same pattern: the lexer is implemented in lex.rs and the grammar rules are defined in grammar.y.
5. Summary
6. Related Links
6.1 Pisanix:
6.2 Community
7. References
- 《Compiler Principles》
- 《Principles of Modern Compilation》
- https://github.com/mysql/mysql-server/blob/8.0/sql/sql_yacc.yy
- http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/