A Detailed Look at the Abstract Syntax Tree (AST)
2022-08-03 19:41:00 【Horse stepping on flying swallows & lin_li】
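AST: Abstract Syntax Tree
I. What an AST is for
(I) Introduction
An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code. Each node of the tree denotes a construct that occurs in the source. It is called abstract because it does not record every detail of the concrete syntax: nested parentheses, for example, are implied by the shape of the tree instead of appearing as nodes. An AST does not depend on the grammar of the source language, that is, on the context-free grammar used in the parsing phase. When writing a grammar, authors often apply equivalence transformations (eliminating left recursion, backtracking, ambiguity, and so on), which introduces extra elements into the grammar, adversely affects later phases, and can even leave the phases tangled together. For this reason many compilers build the syntax tree as a separate structure, giving the front end and the back end a clean interface.
Abstract syntax trees are widely used, for example in browsers, smart editors, and compilers.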
(II) Examples of abstract syntax trees
(1) An arithmetic expression
Expression: 1+3*(4-1)+2
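As a minimal sketch of what the tree for this expression looks like, Python's built-in ast module can parse it. Python is used here only for illustration, and the helper name shape is made up for this example. The parentheses around 4-1 leave no node of their own; they only decide where the subtraction hangs in the tree.

```python
import ast

# Parse the expression from the text; mode="eval" accepts a single expression.
tree = ast.parse("1+3*(4-1)+2", mode="eval")


def shape(node):
    """Render a BinOp/Constant tree as nested tuples (hypothetical helper)."""
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, shape(node.left), shape(node.right))
    if isinstance(node, ast.Constant):  # numeric literals (Python 3.8+)
        return node.value
    raise TypeError(f"unexpected node: {type(node).__name__}")


tree_shape = shape(tree.body)
print(tree_shape)
```

The nested tuples mirror operator precedence: the multiplication sits below the first addition, and the parenthesized subtraction sits below the multiplication.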
(2) XML
<letter>
    <address>
        <city>ShiChuang</city>
    </address>
    <people>
        <id>12478</id>
        <name>Nosic</name>
    </people>
</letter>
(3) Program 1
while b != 0
{
    if a > b
        a = a-b
    else
        b = b-a
}
return a
(4) Program 2
sum=0
for i in range(0,100)
    sum=sum+i
end
(III) Why abstract syntax trees are needed
Syntax analysis of a source program is carried out under the guidance of the grammar rules of the programming language. The grammar rules describe the structure of the language's syntactic components, and they can usually be stated precisely with a context-free grammar or the equivalent Backus-Naur form (BNF). Context-free grammars are further divided into classes such as LL(1), LR(0), LR(1), LR(k), and LALR(1). Each class imposes different requirements; LL(1), for example, requires the grammar to be unambiguous and free of left recursion. When a grammar is rewritten into LL(1) form, extra grammar symbols and productions have to be introduced.
The first property of an abstract syntax tree is that it does not depend on the concrete grammar. Whether the parser uses an LL(1) grammar, an LR(1) grammar, or some other method, it is required to construct the same syntax tree during parsing, which gives the compiler back end a clear, uniform interface. Even if the front end switches to a different grammar, only the front-end code has to change, and the back end is not affected. This reduces the amount of work and also improves the maintainability of the compiler.
The second property of an abstract syntax tree is that it does not depend on the details of the language. In the compiler family, the well-known gcc is a heavyweight that can compile many languages, for example C, C++, Java, Ada, Objective-C, Fortran, Pascal, and COBOL. In its front end, gcc performs lexical analysis, parsing, and semantic analysis for each language and then emits intermediate code in the form of an abstract syntax tree for the back end to process. To make this possible, the syntax tree must be built so that it is independent of language details. For example, a statement like if-condition-then is written differently in different languages.
In C:
if(condition)
{
    do_something();
}
In Fortran:
If condition then
    do_something()
end if
When building the abstract syntax tree of the if-condition-then statement, two branch nodes are enough to represent it: one for condition and one for if_body.
[Figure: AST of the if-condition-then statement]
The braces and keywords of the source program are discarded.
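The same effect can be observed inside a single language. The sketch below assumes Python's built-in ast module and two made-up spellings of the same statement; it compares an if statement written with redundant parentheses against one written without them.

```python
import ast

# Two spellings of the same statement; only the surface syntax differs.
with_parens = ast.parse("if (a > b):\n    a = (a - b)\n")
without_parens = ast.parse("if a > b:\n    a = a - b\n")

# ast.dump omits line/column attributes by default, so it compares pure structure.
same_tree = ast.dump(with_parens) == ast.dump(without_parens)

if_node = with_parens.body[0]
print(same_tree)
print(type(if_node).__name__, type(if_node.test).__name__, len(if_node.body))
```

Both spellings yield an If node with a test branch (the condition) and a body branch, which matches the two-branch picture described above.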
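II. The AST pipeline
Turning plain text into an AST
Step 1: lexical analysis, also called scanning (the scanner)
The scanner reads the code and merges characters into tokens according to predefined rules. It also strips whitespace, comments, and the like. In the end the whole program is split into a list of tokens (a one-dimensional array).
const a = 5;
// becomes
[{value: 'const', type: 'keyword'}, {value: 'a', type: 'identifier'}, ...]
During lexical analysis the source is read character by character, hence the name scanning. When the scanner meets whitespace, an operator, or a special symbol, it considers the current token complete.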
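For comparison, here is a small sketch of the same step using Python's standard tokenize module. Python has no const keyword, so the input is the assumed equivalent a = 5, and Python's token type names (NAME, OP, NUMBER) differ from the JavaScript-style labels above.

```python
import io
import tokenize

source = "a = 5\n"

# generate_tokens expects a readline callable that yields lines of text.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

for kind, text in tokens:
    print(kind, repr(text))
```

The whitespace between tokens does not appear in the list, while the end of the line is reported as its own NEWLINE token.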
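Step 2: syntax analysis, also called parsing
The parser converts the token array produced by lexical analysis into a tree and validates the syntax at the same time. If the syntax is wrong, it throws a syntax error.
[{value: 'const', type: 'keyword'}, {value: 'a', type: 'identifier'}, ...]
// tree form after syntax analysis
{
    type: "VariableDeclarator",
    id: {
        type: "Identifier",
        name: "a"
    },
    ...
}
While building the tree, the parser drops tokens that are no longer needed (for example, redundant parentheses), so the AST does not match the source text 100%.
A tree in which the parser covers 100% of the code structure is called a CST (concrete syntax tree).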
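Now let's take apart a simple add function:
function add(a, b) {
    return a + b
}
First, the chunk of syntax we have is a FunctionDeclaration (function definition) object.
Pulled apart, it has three pieces:
- an id, its name, i.e. add
- two params, its parameters, i.e. [a, b]
- a body, everything inside the braces
add cannot be broken down any further. It is a basic Identifier object that serves as the function's unique label, like a person's name.
{
    name: 'add',
    type: 'Identifier',
    ...
}
params breaks down into an array of two Identifier objects, which cannot be split further either.
[
    {
        name: 'a',
        type: 'Identifier',
        ...
    },
    {
        name: 'b',
        type: 'Identifier',
        ...
    }
]
Next, we open up body.
body turns out to be a BlockStatement (block) object representing { return a + b }.
Inside the BlockStatement is a ReturnStatement object representing return a + b.
Inside the ReturnStatement is a BinaryExpression (binary expression) object representing a + b.
The BinaryExpression has three parts: left, operator, and right.
- operator is +
- left holds the Identifier object a
- right holds the Identifier object b
That completes the breakdown of the simple add function.
[Figure: tree diagram of the add function]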
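The same walk can be sketched in Python with the built-in ast module. Python's node names are not the JavaScript ones used above: the function is a FunctionDef, there is no BlockStatement (the body is a plain list of statements), and the binary expression is a BinOp. The Python version of add below is an assumed translation of the JavaScript function.

```python
import ast

tree = ast.parse("def add(a, b):\n    return a + b\n")

func = tree.body[0]                                # FunctionDef node
param_names = [arg.arg for arg in func.args.args]  # parameter names
ret = func.body[0]                                 # Return node
binop = ret.value                                  # BinOp node: left, op, right

print(type(func).__name__, func.name, param_names)
print(type(ret).__name__, type(binop).__name__)
print(binop.left.id, type(binop.op).__name__, binop.right.id)
```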
III. Generating an AST in Python
CPython's implementation already contains the conversion from source code to AST to code object. Python also exposes a set of tools for manipulating the AST directly; once mastered, they enable some interesting tricks.
(I) From source code to AST
The official CPython interpreter processes Python source as follows (the file names are those of older CPython releases):
Parse source code into a parse tree (Parser/pgen.c)
Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
Transform AST into a Control Flow Graph (Python/compile.c)
Emit bytecode based on the Control Flow Graph (Python/compile.c)
In other words, Python code goes through these stages:
source code parsing --> parse tree --> abstract syntax tree (AST) --> control flow graph --> bytecode
Official AST documentation: https://docs.python.org/zh-cn/3/library/ast.html
Source of the ast module: https://github.com/python/cpython/blob/3.10/Lib/ast.py
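The compile function
compile(source, filename, mode[, flags[, dont_inherit]])
- source: a string or an AST (Abstract Syntax Trees) object. Typically the whole content of a .py file, read with file.read(), is passed in.
- filename: the name of the code file; if the code was not read from a file, pass some recognizable value.
- mode: the kind of code to compile. It can be exec, eval, or single.
- flags and dont_inherit: control which compiler options and future statements affect the compilation of the source.
>>> cm = compile(func_def, '<string>', 'exec')
>>> exec(cm)
When the flag ast.PyCF_ONLY_AST is passed, compile returns an AST instead of a code object, which is what this call does:
ast.parse(source, filename='<unknown>', mode='exec')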
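demo2.py
import types

func_def = """
def add(x, y):
    return x + y

print(add(3, 5))
"""
cm = compile(func_def, '<string>', 'exec')  # compile the source into a code object
print(type(cm))
print(isinstance(cm, types.CodeType))
exec(func_def)  # exec accepts str, bytes, or a code object
exec(cm)        # exec accepts str, bytes, or a code object
Compiling func_def with compile above yields a code object, which holds the bytecode.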
Generating an AST
Python ships with the ast module, which can generate an AST directly from source code and provides tools for adjusting it. Starting with the basics, here is how to get an AST object from source:
ast.parse(source, filename='<unknown>', mode='exec', *, type_comments=False, feature_version=None)
Main parameters:
- source: the code to parse, as a string;
- filename: the file name reported in error messages;
- mode: "eval" for a single expression, "exec" for multi-line code;
The return value is an AST object.
An AST object is a tree; each node may have several child nodes. ast.dump makes it easy to inspect the inside of an AST.
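import ast

src = '''
a = 1
b = 2
c = a + b
'''
ast_node = ast.parse(src, "msg.log", mode="exec")
print(ast.dump(ast_node))
This produces output like the following (taken from an older Python release; since Python 3.8 each Num(n=...) node appears as Constant(value=...)):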
Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=Num(n=1)), Assign(targets=[Name(id='b', ctx=Store())], value=Num(n=2)), Assign(targets=[Name(id='c', ctx=Store())], value=BinOp(left=Name(id='a', ctx=Load()), op=Add(), right=Name(id='b', ctx=Load())))])
Besides ast.dump, many third-party libraries can dump an AST, such as astunparse, codegen, and unparse. These libraries not only display the AST structure more readably but can also turn an AST back into Python source code.
Source analysis:
ast.parse (you can read the source of the ast module directly) simply calls the built-in function compile. Its source, in the older and shorter form, is:
def parse(source, filename='<unknown>', mode='exec'):
    """
    Parse the source into an AST node.
    Equivalent to compile(source, filename, mode, PyCF_ONLY_AST).
    """
    return compile(source, filename, mode, PyCF_ONLY_AST)
Passing the special flag PyCF_ONLY_AST makes compile return the abstract syntax tree.
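A quick way to check this equivalence is to call both entry points on the same input. The sketch below uses a made-up one-line source and also shows the second half of the pipeline: compile accepts the AST and turns it into a code object.

```python
import ast

source = "c = a + b"

# Same input, two entry points: compile() with the AST-only flag, and ast.parse().
via_compile = compile(source, "<string>", "exec", ast.PyCF_ONLY_AST)
via_parse = ast.parse(source, filename="<string>", mode="exec")

print(type(via_compile).__name__)
print(ast.dump(via_compile) == ast.dump(via_parse))

# compile() also accepts the AST itself and produces a code object from it.
code = compile(via_compile, "<string>", "exec")
namespace = {"a": 1, "b": 2}
exec(code, namespace)
print(namespace["c"])
```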
Prettier output with astpretty
An AST is essentially tree-structured data, and the single-line output above is hard to read. astpretty provides a more readable output:
Module(
    body=[
        Assign(
            lineno=2,
            col_offset=0,
            end_lineno=2,
            end_col_offset=5,
            targets=[Name(lineno=2, col_offset=0, end_lineno=2, end_col_offset=1, id='a', ctx=Store())],
            value=Constant(lineno=2, col_offset=4, end_lineno=2, end_col_offset=5, value=1, kind=None),
            type_comment=None,
        ),
        Assign(
            lineno=3,
            col_offset=0,
            end_lineno=3,
            end_col_offset=5,
            targets=[Name(lineno=3, col_offset=0, end_lineno=3, end_col_offset=1, id='b', ctx=Store())],
            value=Constant(lineno=3, col_offset=4, end_lineno=3, end_col_offset=5, value=2, kind=None),
            type_comment=None,
        ),
        Assign(
            lineno=4,
            col_offset=0,
            end_lineno=4,
            end_col_offset=9,
            targets=[Name(lineno=4, col_offset=0, end_lineno=4, end_col_offset=1, id='c', ctx=Store())],
            value=BinOp(
                lineno=4,
                col_offset=4,
                end_lineno=4,
                end_col_offset=9,
                left=Name(lineno=4, col_offset=4, end_lineno=4, end_col_offset=5, id='a', ctx=Load()),
                op=Add(),
                right=Name(lineno=4, col_offset=8, end_lineno=4, end_col_offset=9, id='b', ctx=Load()),
            ),
            type_comment=None,
        ),
    ],
    type_ignores=[],
)
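If installing a third-party package is not an option, a similar multi-line view is available from the standard library. This is a sketch assuming Python 3.9 or later, where ast.dump gained the indent parameter; the exact layout differs from the astpretty output above.

```python
import ast

src = "a = 1\nb = 2\nc = a + b\n"
tree = ast.parse(src)

# indent= needs Python 3.9+; include_attributes adds lineno/col_offset and friends.
pretty = ast.dump(tree, indent=4, include_attributes=True)
print(pretty)
```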
Analyzing the AST
Each node in the syntax tree corresponds to a type in the ast module; the root node is of type ast.Module. During analysis, isinstance makes it easy to check the type of a node.
import ast
root_node = ast.parse("print('hello world')")
print(ast.dump(root_node))
print(isinstance(root_node,ast.Module))
print(isinstance(root_node,ast.Expr))
print(isinstance(root_node.body[0],ast.Expr))
Instances of the subclasses of ast.expr and ast.stmt have lineno, col_offset, end_lineno, and end_col_offset attributes. lineno and end_lineno are the first and last line numbers of the source text span (1-indexed, so the first line is line 1), while col_offset and end_col_offset are the corresponding UTF-8 byte offsets of the first and last tokens that generated the node. The UTF-8 offset is recorded because the parser uses UTF-8 internally.
A class is defined for each left-hand side symbol in the abstract grammar (for example, ast.stmt or ast.expr). In addition, a class is defined for each constructor on the right-hand side; these classes inherit from the class of the left-hand side symbol. For example, ast.Assign inherits from ast.stmt.
Each instance of a concrete class has one attribute for each child node, of the type defined in the grammar. For example, ast.Assign instances have an attribute targets, a list of ast.expr nodes, and an attribute value of type ast.expr.
For example, a statement such as a = 10 corresponds to an ast.Assign node. The Assign node has two children: a, of type ast.Name, and 10, of type ast.Num (ast.Constant since Python 3.8). ast.dump(node) formats a node for printing so that its contents can be inspected. Take the line "a = 10" as an example:
Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=Num(n=10))])
(1) The root node
Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=Num(n=10))])
The root node is of type Module. Because there is only one line of code, the root has a single child, the Assign node.
(2) Child nodes
Assign(targets=[Name(id='a', ctx=Store())], value=Num(n=10))
The Assign node above has two children, Name and Num:
Name(id='a', ctx=Store())
Num(n=10)
Name in turn has one child, Store.
Store() (Store means the name is being assigned to; the other contexts are Load and Del, see the node class documentation)
Even for a single line such as "a = 10", the AST above lets us analyze and modify the structure of the code.
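The parent-child relationships described above can be checked with ast.iter_child_nodes, which yields the direct children of a node. This is a small sketch assuming Python 3.8 or later, where the literal 10 is reported as a Constant node.

```python
import ast

tree = ast.parse("a = 10")
assign = tree.body[0]

# Direct children of the Assign node, then of its Name target.
children = [type(child).__name__ for child in ast.iter_child_nodes(assign)]
name_children = [
    type(child).__name__ for child in ast.iter_child_nodes(assign.targets[0])
]

print(children)
print(name_children)
print(assign.targets[0].id, assign.value.value)
```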
Appendix: the abstract grammar is defined as follows (excerpt):
module Python
{
    mod = Module(stmt* body, type_ignore* type_ignores)
        | Interactive(stmt* body)
        | Expression(expr body)
        | FunctionType(expr* argtypes, expr returns)

    stmt = FunctionDef(identifier name, arguments args,
                       stmt* body, expr* decorator_list, expr? returns,
                       string? type_comment)
          | AsyncFunctionDef(identifier name, arguments args,
                             stmt* body, expr* decorator_list, expr? returns,
                             string? type_comment)

          | ClassDef(identifier name,
             expr* bases,
             keyword* keywords,
             stmt* body,
             expr* decorator_list)
          | Return(expr? value)

          | Delete(expr* targets)
          | Assign(expr* targets, expr value, string? type_comment)
          | AugAssign(expr target, operator op, expr value)
          -- 'simple' indicates that we annotate simple name without parens
          | AnnAssign(expr target, expr annotation, expr? value, int simple)

          -- use 'orelse' because else is a keyword in target languages
          | For(expr target, expr iter, stmt* body, stmt* orelse, string? type_comment)
          | AsyncFor(expr target, expr iter, stmt* body, stmt* orelse, string? type_comment)
          | While(expr test, stmt* body, stmt* orelse)
          | If(expr test, stmt* body, stmt* orelse)
          | With(withitem* items, stmt* body, string? type_comment)
          | AsyncWith(withitem* items, stmt* body, string? type_comment)

          | Match(expr subject, match_case* cases)

          | Raise(expr? exc, expr? cause)
          | Try(stmt* body, excepthandler* handlers, stmt* orelse, stmt* finalbody)
          | Assert(expr test, expr? msg)

          | Import(alias* names)
          | ImportFrom(identifier? module, alias* names, int? level)

          | Global(identifier* names)
          | Nonlocal(identifier* names)
          | Expr(expr value)
          | Pass | Break | Continue

          -- col_offset is the byte offset in the utf8 string the parser uses
          attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)

          -- BoolOp() can use left & right?
    expr = BoolOp(boolop op, expr* values)
         | NamedExpr(expr target, expr value)
         | BinOp(expr left, operator op, expr right)
         | UnaryOp(unaryop op, expr operand)
         | Lambda(arguments args, expr body)
         | IfExp(expr test, expr body, expr orelse)
         | Dict(expr* keys, expr* values)
         | Set(expr* elts)
         | ListComp(expr elt, comprehension* generators)
         | SetComp(expr elt, comprehension* generators)
         | DictComp(expr key, expr value, comprehension* generators)
         | GeneratorExp(expr elt, comprehension* generators)
         -- the grammar constrains where yield expressions can occur
         | Await(expr value)
         | Yield(expr? value)
         | YieldFrom(expr value)
         -- need sequences for compare to distinguish between
         -- x < 4 < 3 and (x < 4) < 3
         | Compare(expr left, cmpop* ops, expr* comparators)
         | Call(expr func, expr* args, keyword* keywords)
         | FormattedValue(expr value, int conversion, expr? format_spec)
         | JoinedStr(expr* values)
         | Constant(constant value, string? kind)

         -- the following expression can appear in assignment context
         | Attribute(expr value, identifier attr, expr_context ctx)
         | Subscript(expr value, expr slice, expr_context ctx)
         | Starred(expr value, expr_context ctx)
         | Name(identifier id, expr_context ctx)
         | List(expr* elts, expr_context ctx)
         | Tuple(expr* elts, expr_context ctx)

         -- can appear only in Subscript
         | Slice(expr? lower, expr? upper, expr? step)

          -- col_offset is the byte offset in the utf8 string the parser uses
          attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)

    expr_context = Load | Store | Del

    boolop = And | Or
astunparse, a tool for turning an AST back into source
Install astunparse: pip install astunparse
astunparse on PyPI: https://pypi.org/project/astunparse/
import ast
import astunparse
src = '''
a = 1
b = 2
c = a + b
print("hello world")
'''
# get back the source code
print(astunparse.unparse(ast.parse(src)))
Output:
a = 1
b = 2
c = (a + b)
print('hello world')
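On Python 3.9 and later the standard library offers a comparable function, ast.unparse, so the round trip can also be sketched without a third-party package. Its formatting is not identical to astunparse; for instance, it does not wrap a + b in parentheses.

```python
import ast

src = "a = 1\nb = 2\nc = a + b\n"
tree = ast.parse(src)

# ast.unparse was added in Python 3.9; it regenerates source text from the tree.
regenerated = ast.unparse(tree)
print(regenerated)
```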
(II) Traversing and analyzing the syntax tree
NodeVisitor is mainly used to walk the tree, and it can also change the AST by modifying nodes in place; NodeTransformer is mainly used to replace nodes in the AST.
1. Defining a visitor
The ast module provides a visitor for traversing the whole syntax tree.
ast.NodeVisitor is a tool dedicated to traversing the syntax tree. By subclassing it we can traverse the tree and process nodes along the way.
import ast

func_def = """
a = 3
b = 5

def add(x, y):
    return x + y

print(add(a, b))
"""

class CodeVisitor(ast.NodeVisitor):
    def generic_visit(self, node):
        # fallback for every node type without a dedicated visit_<Type> method
        print(type(node).__name__)
        ast.NodeVisitor.generic_visit(self, node)

    def visit_FunctionDef(self, node):
        print(type(node).__name__)
        ast.NodeVisitor.generic_visit(self, node)

    def visit_Assign(self, node):
        print(type(node).__name__)
        ast.NodeVisitor.generic_visit(self, node)

r_node = ast.parse(func_def)
visitor = CodeVisitor()
visitor.visit(r_node)
In the code above, the class CodeVisitor inherits from NodeVisitor and has two kinds of methods: generic_visit, and methods named "visit_" + node type.
The visitor starts at the root node. During traversal, if the current node is of type Assign and a visit_Assign method exists, visit_Assign is called; otherwise generic_visit is called.
Each handler must call ast.NodeVisitor.generic_visit(self, node) whenever the traversal should continue; otherwise the visitor does not visit the child nodes of the current node.
For example, suppose the method is defined as follows:
def visit_Module(self, node):
    print(type(node).__name__)
The root node is visited first. It is of type Module, so visit_Module is called. Because visit_Module does not call NodeVisitor.generic_visit(self, node), the traversal stops at the root and no other node is visited.
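The difference is easy to observe by recording which handlers run. In this sketch the class names ShallowVisitor and DeepVisitor and the seen list are made up for illustration.

```python
import ast


class ShallowVisitor(ast.NodeVisitor):
    """Handles Module but never descends (illustrative name)."""

    def __init__(self):
        self.seen = []

    def visit_Module(self, node):
        self.seen.append(type(node).__name__)
        # No generic_visit call here, so the children are never reached.


class DeepVisitor(ShallowVisitor):
    """Same handler, but it continues the traversal (illustrative name)."""

    def visit_Module(self, node):
        self.seen.append(type(node).__name__)
        self.generic_visit(node)

    def visit_Assign(self, node):
        self.seen.append(type(node).__name__)
        self.generic_visit(node)


tree = ast.parse("a = 3\nb = 5\n")
shallow, deep = ShallowVisitor(), DeepVisitor()
shallow.visit(tree)
deep.visit(tree)
print(shallow.seen)
print(deep.seen)
```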
2. A visitor example
Change the addition in the add function into a subtraction.
import ast
import astunparse

func_def = """
a = 3
b = 5

def add(x, y):
    return x + y

print(add(a, b))
"""

class CodeVisitor(ast.NodeVisitor):
    def generic_visit(self, node):
        print(type(node).__name__)
        ast.NodeVisitor.generic_visit(self, node)

    def visit_FunctionDef(self, node):
        print(type(node).__name__)
        ast.NodeVisitor.generic_visit(self, node)

    def visit_Assign(self, node):
        print(type(node).__name__)
        ast.NodeVisitor.generic_visit(self, node)

    def visit_BinOp(self, node):
        # swap + for - in place, then keep walking into the operands
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        self.generic_visit(node)

r_node = ast.parse(func_def)
visitor = CodeVisitor()
visitor.visit(r_node)
print(astunparse.unparse(r_node))
exec(compile(r_node, '<string>', 'exec'))  # add(3, 5) now computes 3 - 5
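3. Traversal with walk
import ast
tree = ast.parse("def add(x, y):\n    return x + y")  # any parsed module works here
for node in ast.walk(tree):  # yields every node; the order is unspecified
    if isinstance(node, ast.FunctionDef):
        print(node.name)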
4. Defining a NodeTransformer
The ast module also provides NodeTransformer to support modifying nodes. NodeTransformer inherits from NodeVisitor and overrides generic_visit.
In a NodeTransformer, generic_visit and every visit_ + node type method must return a node: the original node, a new replacement node, or None to remove the node.
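As a minimal sketch of the removal case, the transformer below deletes every assert statement by returning None from its handler. The class name StripAsserts and the sample source are made up for this example.

```python
import ast


class StripAsserts(ast.NodeTransformer):
    """Remove assert statements from a tree (illustrative name)."""

    def visit_Assert(self, node):
        return None  # returning None drops the node from its parent's list


source = "x = 1\nassert x == 2\ny = x + 1\n"
tree = StripAsserts().visit(ast.parse(source))

remaining = [type(stmt).__name__ for stmt in tree.body]
print(remaining)

# The failing assert is gone, so the transformed module runs to completion.
namespace = {}
exec(compile(tree, "<string>", "exec"), namespace)
print(namespace["y"])
```

No new nodes are created here, so the existing line and column information is still valid and the tree compiles as is.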
Suppose we have the following code:
"""ast test code"""
a = 10
b = "test"
print(a)
We define a NodeTransformer visitor as follows:
import ast

class ReWriteName(ast.NodeTransformer):
    def generic_visit(self, node):
        has_lineno = getattr(node, "lineno", "None")
        col_offset = getattr(node, "col_offset", "None")
        print(type(node).__name__, has_lineno, col_offset)
        ast.NodeTransformer.generic_visit(self, node)
        return node

    def visit_Name(self, node):
        new_node = node
        if node.id == "a":
            # the new node has no lineno/col_offset yet; fix_missing_locations fills them in
            new_node = ast.Name(id="a_rep", ctx=node.ctx)
        return new_node

    def visit_Constant(self, node):
        # visit_Num is deprecated since Python 3.8; number literals are Constant nodes
        if node.value == 10:
            node.value = 100
        return node

with open("code.py", "r") as file:
    source = file.read()
visitor = ReWriteName()
root = ast.parse(source)
root = visitor.visit(root)
ast.fix_missing_locations(root)
code_object = compile(root, "<string>", "exec")
exec(code_object)
In visit_Name, the variable "a" is replaced with the variable "a_rep". Both at a = 10 and at print(a), a is replaced with a_rep and a new node is returned.
In visit_Num (visit_Constant on Python 3.8 and later), 10 is bluntly replaced with 100 and the modified original node is returned.
The ast module operates after Python has parsed the source and before the result is compiled into the bytecode structure (the code object). After NodeTransformer has modified the tree and returned it, we compile it into a code object with the built-in compile and hand it to the Python virtual machine for execution.
Result of execution: 100
As shown, the name a is replaced with a_rep in both a = 10 and print(a), and 10 is replaced with 100, so the final printed value is 100. The syntax tree nodes were changed successfully.
Note:
Be careful when modifying syntax tree nodes, and especially when deleting them: a modification or deletion may produce an invalid syntax tree, and the problem only surfaces at compile time or at run time.
Python code can be changed through its nodes with the method above. Note, however, that the code using the visitor contains the line ast.fix_missing_locations(root). It is needed because the nodes we create do not carry the required lineno and col_offset attributes; the line number and column offset of newly added nodes must be filled in, either by hand or with this helper.
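To see what ast.fix_missing_locations does in isolation, the sketch below builds the expression 1 + 2 entirely by hand and then fills in the location attributes before compiling. The variable names are made up for this example.

```python
import ast

# Build the expression 1 + 2 by hand; these nodes are created without
# lineno/col_offset attributes.
tree = ast.Expression(
    body=ast.BinOp(
        left=ast.Constant(value=1),
        op=ast.Add(),
        right=ast.Constant(value=2),
    )
)
had_location = hasattr(tree.body, "lineno")

# Fill in the missing location attributes on every node, in place.
ast.fix_missing_locations(tree)
has_location = hasattr(tree.body, "lineno")

result = eval(compile(tree, "<ast>", "eval"))
print(had_location, has_location, result)
```

When a new node replaces an existing one, ast.copy_location(new_node, old_node) is an alternative that keeps the original position instead of a synthesized one.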
5. A NodeTransformer example
For a good reference on modifying nodes, see: https://greentreesnakes.readthedocs.org/en/latest/examples.html
Turn the add function defined with def into a subtraction function; the function name, the references to its parameters, and the name at the call site are all changed in the AST.
import ast
import astunparse

func_def = """
a = 3
b = 5

def add(x, y):
    return x + y

print(add(a, b))
"""

class CodeTransformer(ast.NodeTransformer):
    def visit_BinOp(self, node):
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        if node.name == 'add':
            node.name = 'sub'
        args_num = len(node.args.args)
        print(node.args.args)
        # args = tuple([arg.id for arg in node.args.args])
        # func_log_stmt = ''.join(["print 'calling func: %s', " % node.name, "'args:'", ", %s" * args_num % args])
        # node.body.insert(0, ast.parse(func_log_stmt))
        return node

    def visit_Name(self, node):
        # rename references only; the parameter list itself consists of ast.arg nodes
        replace = {'add': 'sub', 'x': 'a', 'y': 'b'}
        re_id = replace.get(node.id, None)
        node.id = re_id or node.id
        self.generic_visit(node)
        return node

r_node = ast.parse(func_def)
transformer = CodeTransformer()
r_node = transformer.visit(r_node)
# print(astunparse.dump(r_node))
source = astunparse.unparse(r_node)
print(source)
# exec(compile(r_node, '<string>', 'exec'))  # the newly inserted func_log_stmt node lacks lineno and col_offset
exec(compile(source, '<string>', 'exec'))
exec(compile(ast.parse(source), '<string>', 'exec'))
IV. Applications of the AST
The ast module is seldom used in everyday programming, but it is a valuable aid for inspecting source code: syntax checking, debugging, detecting special fields, and so on.
(I) Detecting Chinese text
Below is the Unicode range for CJK (Chinese, Japanese, Korean) characters:
CJK Unified Ideographs
Range: 4E00-9FFF
Number of characters: 20992
Languages: chinese, japanese, korean, vietnamese
Chinese characters can be identified with the Unicode range \u4e00 - \u9fff. Note that this range does not include Chinese punctuation (for example, the full-width semicolon ';' is u'\uff1b').
The following class, CNCheckHelper, checks whether a string contains Chinese characters:
import ast
import os

class CNCheckHelper(object):
    # candidate encodings of the text to be checked
    VALID_ENCODING = ('utf-8', 'gbk')

    def _get_unicode_imp(self, value, idx=0):
        # try each candidate encoding in turn; returns None if none of them fits
        if idx < len(self.VALID_ENCODING):
            try:
                return value.decode(self.VALID_ENCODING[idx])
            except UnicodeDecodeError:
                return self._get_unicode_imp(value, idx + 1)

    def _get_unicode(self, from_str):
        # Python 3: str is already Unicode text, only bytes need decoding
        if isinstance(from_str, str):
            return from_str
        return self._get_unicode_imp(from_str)

    def is_any_chinese(self, check_str, is_strict=True):
        unicode_str = self._get_unicode(check_str)
        if unicode_str:
            c_func = any if is_strict else all
            return c_func(u'\u4e00' <= char <= u'\u9fff' for char in unicode_str)
        return False
The is_any_chinese interface has two modes: in strict mode a string is reported as soon as it contains any Chinese character; in non-strict mode every character must be Chinese.
Next we use ast to walk the abstract syntax tree of each source file and detect the strings that contain Chinese characters.
class CodeCheck(ast.NodeVisitor):
    def __init__(self):
        self.cn_checker = CNCheckHelper()

    def visit_Constant(self, node):
        # string literals are Constant nodes since Python 3.8 (visit_Str is deprecated)
        self.generic_visit(node)
        # if isinstance(node.value, str) and any(u'\u4e00' <= char <= u'\u9fff' for char in node.value):
        if isinstance(node.value, (str, bytes)) and self.cn_checker.is_any_chinese(node.value, True):
            print('line no: %d, column offset: %d, CN_Str: %s' % (node.lineno, node.col_offset, node.value))

project_dir = './your_project/script'
for root, dirs, files in os.walk(project_dir):
    print(root, dirs, files)
    py_files = [name for name in files if name.endswith('.py')]
    checker = CodeCheck()
    for file in py_files:
        file_path = os.path.join(root, file)
        print('Checking: %s' % file_path)
        with open(file_path, 'r', encoding='utf-8') as f:
            root_node = ast.parse(f.read())
        checker.visit(root_node)
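The check can also be tried on an in-memory string without touching the file system. The sketch below is a self-contained variant assuming Python 3.8 or later; the names contains_cjk and CJKStringFinder and the sample source are made up for illustration and are not part of the classes above.

```python
import ast


def contains_cjk(text):
    """True if any character lies in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


class CJKStringFinder(ast.NodeVisitor):
    """Collect string literals containing Chinese characters (illustrative name)."""

    def __init__(self):
        self.hits = []

    def visit_Constant(self, node):
        # String literals are Constant nodes on Python 3.8+.
        if isinstance(node.value, str) and contains_cjk(node.value):
            self.hits.append((node.lineno, node.col_offset, node.value))


source = 'title = "你好"\nname = "hello"\nprint("结果:", name)\n'
finder = CJKStringFinder()
finder.visit(ast.parse(source))

for lineno, col_offset, value in finder.hits:
    print(lineno, col_offset, value)
```

Collecting the hits in a list instead of printing inside the visitor keeps the traversal reusable, for example when the results should be aggregated per file.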
V. References
https://www.jb51.net/article/257225.htm
https://greentreesnakes.readthedocs.io/en/latest/examples.html
https://www.cnblogs.com/us-wjz/articles/11013200.html
https://docs.python.org/zh-cn/3/library/ast.html#node-classes
https://github.com/python/cpython/blob/3.10/Lib/ast.py