
Text-to-SQL Model: IRNet

2022-06-26 08:28:00 xuanningmeng


I have been working on the Text-to-SQL task recently, so I read this paper and wrote down my understanding of it. If anything here is wrong, corrections are welcome.
IRNet is a model published by Microsoft at ACL 2019. The paper is titled Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation and can be downloaded from https://arxiv.org/pdf/1905.08205.pdf
The code is at https://github.com/microsoft/IRNet
The main contribution of the paper is SemQL, an intermediate representation between the natural language question and SQL: the model first maps the question to SemQL, and the SemQL is then converted into SQL. In real data, the columns used in SQL clauses such as GROUP BY and HAVING often do not appear in the natural language question. SQL is designed for querying a database conveniently (GROUP BY, for instance, exists to support aggregate functions), and the question rarely spells out such implementation details, which makes end-to-end learning hard. These issues are collectively called the mismatch problem.

IRNet Model

IRNet mainly consists of SemQL, schema linking, an NL encoder, a schema encoder, and a decoder. SemQL represents SQL as a SemQL tree; schema linking assigns types to spans of the NL question and to columns; the NL encoder encodes the input question; the schema encoder encodes the columns and tables; and the decoder generates the SemQL query from which the SQL statement is obtained. The structure of the model is shown below:
[Figure: IRNet model]

SemQL

To address the mismatch between NL and SQL, the paper proposes SemQL, an intermediate representation between the two. Building SemQL requires a grammar defined in advance, under which a SQL query is represented as a tree. SemQL drops SQL's GROUP BY and HAVING clauses and its explicit nested-clause syntax, and the conditions of WHERE and HAVING are expressed uniformly as Filter. Each A node holds both a column and a table; naming the table makes it possible to tell apart columns that share a name across tables. The SemQL grammar is as follows:
[Figure: SemQL grammar rules]
where
(1) Z indicates whether the SQL statement contains INTERSECT, UNION, or EXCEPT;
(2) R is a single SELECT query, whose variants indicate whether it has a WHERE condition and an ORDER BY;
(3) Order and Superlative correspond to the ORDER BY part: if the ORDER BY comes with a LIMIT it maps to Superlative, otherwise to Order;
(4) Filter covers the different comparison operators. If the A node under a Filter has an aggregate function, the Filter corresponds to HAVING; if its aggregate function is None, it corresponds to WHERE. If an R appears under a Filter node, it represents a nested query.

Here is a SQL query and its SemQL representation:
NL: Show the names of students who have a grade higher than 5 and have at least 2 friends.
SQL: SELECT T1.name FROM friend AS T1 JOIN highschooler AS T2 ON T1.student_id = T2.id WHERE T2.grade > 5 GROUP BY T1.student_id HAVING count(*) >= 2
[Figure: SemQL tree for the query above]
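To make the Filter idea concrete, here is a minimal Python sketch of the two conditions in the example query. The class names `A` and `Filter` follow the SemQL node names, but the field layout, the helper `clause_for`, and the hand-written tree are illustrative assumptions, not the authors' code. The sketch assumes that a Filter whose A node carries an aggregate function maps to HAVING, and one without an aggregate maps to WHERE.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class A:
    """Column node: aggregate function, column name, and owning table."""
    agg: str      # "none", "count", "max", ...
    column: str
    table: str


@dataclass
class Filter:
    """One condition: comparison operator, column node, literal value."""
    op: str
    a: A
    value: Union[int, float, str]


def clause_for(f: Filter) -> str:
    # Assumption: an aggregated column is filtered in HAVING,
    # a plain column is filtered in WHERE.
    return "WHERE" if f.a.agg == "none" else "HAVING"


def render(f: Filter) -> str:
    """Render one Filter as the SQL clause fragment it would turn into."""
    if f.a.agg == "none":
        col = f"{f.a.table}.{f.a.column}"
    else:
        col = f"{f.a.agg}({f.a.column})"
    return f"{clause_for(f)} {col} {f.op} {f.value}"


# Hand-written conditions for the example question (illustrative layout).
filters = [
    Filter(">", A("none", "grade", "highschooler"), 5),
    Filter(">=", A("count", "*", "friend"), 2),
]

rendered = [render(f) for f in filters]
print(rendered)
```

Both conditions are plain Filter nodes in the tree; the WHERE/HAVING split is only recovered when the tree is turned back into SQL.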

Schema linking

Schema linking builds the relationship between the natural language input (in the data, the question) and the columns and tables of the database schema: it recognizes the columns and tables mentioned in the question and assigns them types. Schema linking in Text-to-SQL can be seen as a form of entity linking, where the entities are the columns, tables, and values of the database.
In IRNet, columns and tables are recognized in the question by n-gram matching, and the two cases work the same way. The general process is:
(1) n-grams of the question are matched against column names, exactly or partially; a matched span gets the type column;
(2) n-grams of the question are matched against table names, exactly or partially; a matched span gets the type table;
(3) spans of the question are matched against values in the database. Value matching uses ConceptNet and considers only two kinds of ConceptNet results, "is a type of" and "related terms". When a result for a value matches a column, the type of that column is set to VALUE EXACT MATCH or VALUE PARTIAL MATCH, depending on how well it matches.
Notes:
(1) n-gram matches do not overlap;
(2) if an n-gram matches both a column and a table, the column match takes priority;
(3) a column can have one of four types: exact match, partial match, value exact match, and value partial match.
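The column and table matching rules above can be sketched in a few lines of Python. This is a simplified illustration, not the IRNet preprocessing code: the function names `match_type` and `link_schema`, the whitespace tokenization, the longest-first ordering, and the `max_n=6` limit are all assumptions made for the example, and value matching through ConceptNet is left out.

```python
def match_type(gram, names):
    """Return 'exact match', 'partial match', or None for a token list."""
    if " ".join(gram) in names:
        return "exact match"
    n = len(gram)
    for name in names:
        words = name.split()
        # Partial match: the n-gram is a contiguous sub-sequence of the name.
        if any(words[i:i + n] == gram for i in range(len(words) - n + 1)):
            return "partial match"
    return None


def link_schema(question, columns, tables, max_n=6):
    """Tag non-overlapping question spans as column or table mentions."""
    tokens = question.lower().split()
    used = [False] * len(tokens)
    spans = []
    # Longer n-grams are tried first so they win over their sub-spans.
    for n in range(min(max_n, len(tokens)), 0, -1):
        for start in range(len(tokens) - n + 1):
            if any(used[start:start + n]):
                continue  # keep matches non-overlapping
            gram = tokens[start:start + n]
            # Columns are checked before tables, so a column match wins.
            for kind, names in (("column", columns), ("table", tables)):
                found = match_type(gram, names)
                if found:
                    spans.append((" ".join(gram), kind, found))
                    used[start:start + n] = [True] * n
                    break
    return spans


columns = ["name", "grade", "student id"]
tables = ["friend", "highschooler"]
print(link_schema("what is the student id and grade of each highschooler",
                  columns, tables))
print(link_schema("list every student", columns, tables))
```

In the second call the single word "student" is only part of the column name "student id", so it is tagged as a partial match. A real implementation would also need stemming so that "names" can match the column "name".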

Model

The structure of the IRNet model is shown below:
[Figure: IRNet model structure]
The model has three parts: the NL Encoder, the Schema Encoder, and the Decoder.
(1) NL Encoder
The NL encoder encodes the input question together with the types produced by schema linking. Embeddings come from GloVe or BERT, and an LSTM does the encoding. Before encoding, the type label of each span (column, table, or value) is attached to the span, and the two are embedded together.
(2) Schema Encoder
The schema encoder encodes the columns and tables after schema linking. Its output is used by the decoder, both in the pointer network and when selecting columns and tables. Columns and tables are encoded in a similar way. The schema is written as $s = (c, t)$, where $c = \{(c_1, \phi_1), (c_2, \phi_2), \ldots, (c_n, \phi_n)\}$ denotes the columns with their types and $t = \{t_1, t_2, \ldots, t_m\}$ denotes the tables.
A column is encoded as follows:
i. Embed every word of column $c_i$ and average the embeddings to get the vector $\hat{e}_c^i$.
ii. Embed the type $\phi_i$ of column $c_i$ to get the type vector $\varphi_i$.
iii. Compute the context vector $c_c^i$ as a sum of the question representations weighted by their cosine similarity to the column vector.
iv. Sum the three vectors from the steps above to get the final column embedding.
[Figure: column encoder formulas]
A table is encoded as follows:
[Figure: table encoder formulas]
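The four column-encoding steps can be sketched in plain Python. This is a toy illustration under stated assumptions: the function names are hypothetical, the vectors are hand-written two-dimensional lists instead of learned embeddings, and the context vector is assumed to be a sum over question vectors weighted by raw cosine similarity (no extra normalization).

```python
import math


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def encode_column(word_vecs, type_vec, question_vecs):
    # Step i: average the word embeddings of the column name.
    e_hat = mean(word_vecs)
    # Step iii: weight every question vector by its cosine similarity
    # to the column vector and sum them into a context vector.
    weights = [cosine(e_hat, q) for q in question_vecs]
    context = [sum(w * q[k] for w, q in zip(weights, question_vecs))
               for k in range(len(e_hat))]
    # Step iv: add column vector, type vector (step ii), and context vector.
    return [e + t + c for e, t, c in zip(e_hat, type_vec, context)]


column_embedding = encode_column(
    word_vecs=[[1.0, 0.0], [1.0, 0.0]],      # two words of a column name
    type_vec=[0.0, 1.0],                     # embedding of the column type
    question_vecs=[[1.0, 0.0], [0.0, 1.0]],  # two question span vectors
)
print(column_embedding)
```

Only the first question vector points in the same direction as the column vector, so it alone contributes to the context vector.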
(3) Decoder
The goal of the decoder is to generate SemQL. It uses three kinds of actions: ApplyRule, SelectColumn, and SelectTable. Since SemQL has a tree structure, an LSTM-based grammar decoder generates the SemQL query through a sequence of actions. Decoding follows a coarse-to-fine framework: a first decoder produces the skeleton of the SemQL query, and a fine-grained decoder then fills in the missing details by selecting columns and tables. The decoder is formulated as follows:
[Figure: decoder formula]
where $a_i$ is the action taken at step $i$, $a_{<i}$ is the sequence of actions before step $i$, and $T$ is the total number of steps.
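A factorization consistent with these definitions would be the usual left-to-right product over actions. The symbols $x$ (the question) and $y$ (the SemQL query) are assumptions introduced here; $s$ is the schema as defined in the Schema Encoder part.

$$
p(y \mid x, s) = \prod_{i=1}^{T} p(a_i \mid x, s, a_{<i})
$$

Each factor is the probability of one action (ApplyRule, SelectColumn, or SelectTable) given the question, the schema, and all earlier actions.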
ApplyRule is implemented as a perceptron layer.
i. SelectColumn and SelectTable
Column and table selection uses a pointer network. It can be understood as a memory network that remembers the columns already selected. A gate in the decoder decides whether to pick a column from the memory or from the database schema; if a column is picked from the schema, it is then added to the pointer network's memory. The mathematical expressions are as follows:
[Figure: SelectColumn formula]
[Figure: SelectTable formula]
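The gating behaviour described above can be illustrated with a small Python sketch. It is a rough approximation under several assumptions, not the formula from the paper: the function `select_column` is hypothetical, the per-column scores and the gate logit are supplied by hand (in the model they would come from the decoder state and the schema encoder), the gate is a sigmoid, and the schema pool is taken to be the columns not yet in memory.

```python
import math


def softmax(scores):
    """Softmax over a dict of column -> score."""
    top = max(scores.values())
    exps = {c: math.exp(v - top) for c, v in scores.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}


def select_column(scores, memory, gate_logit):
    """Pick one column; `memory` (a list) is updated in place."""
    in_mem = {c: v for c, v in scores.items() if c in memory}
    in_schema = {c: v for c, v in scores.items() if c not in memory}
    # Gate: probability of reading from memory rather than the schema.
    p_mem = 1.0 / (1.0 + math.exp(-gate_logit))
    if not in_mem:
        p_mem = 0.0   # nothing remembered yet, must use the schema
    if not in_schema:
        p_mem = 1.0   # every column is already in memory
    probs = {}
    if in_mem:
        for c, p in softmax(in_mem).items():
            probs[c] = p_mem * p
    if in_schema:
        for c, p in softmax(in_schema).items():
            probs[c] = (1.0 - p_mem) * p
    choice = max(probs, key=probs.get)
    if choice not in memory:
        memory.append(choice)  # a column taken from the schema is remembered
    return choice, probs


scores = {"name": 2.0, "grade": 1.0, "student_id": 0.5}
memory = []
first, _ = select_column(scores, memory, gate_logit=3.0)
second, _ = select_column(scores, memory, gate_logit=3.0)
third, probs = select_column(scores, memory, gate_logit=-3.0)
print(first, second, third, memory)
```

With a high gate value the second step re-selects the remembered column, while a low gate value sends the third step back to the schema, where a new column is chosen and added to memory.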
ii. Coarse-to-fine
Decoding proceeds from coarse to fine. The model is shown below:
[Figure: coarse-to-fine decoding]
The corresponding formulas are:
[Figure: coarse-to-fine formulas]

Experimental Results

On the Spider dataset, IRNet + GloVe scores 53.2% on the validation set and 46.7% on the test set, and IRNet + BERT scores 61.9% and 54.7%. On the dev set IRNet makes 483 wrong predictions, which fall mainly into three categories: Column Prediction (32.3%), Nested Query (23.9%), and Operator (12.4%):
i. Column Prediction errors: the correct column cannot be found from the cell value mentioned in the question.
ii. Nested Query errors: mostly complex nested queries at the Extra Hard level that are not constructed accurately.
iii. Operator errors: choosing some operators requires common sense, e.g. "from old to young" requires sorting by age in descending order.

That is all on the IRNet model. If there are mistakes, corrections are welcome.
