当前位置:网站首页>Page file system based on C language
Page file system based on C language
2022-07-26 22:40:00 【biyezuopin】
be based on C Page file system of language
Additional functions implemented
- Non cluster indexes that can be dynamically added or deleted , Both primary and non primary indexes support using more than one domain as a key ;
- Attribute domain constraints , Support “CHECK ( in ())”;
- Foreign key constraints ;
- DATE The type and VARCHAR type . however VARCHAR The type is still implemented as a fixed length type ;
- Query optimization : Choose which index to use when there are several indexes in the table ;
- Optimization of multi table connection : Optimize the order of querying tables when connecting multiple tables , Make it use indexes as much as possible ;
- Aggregate query SUM、AVG、MIN、MAX;
- Grouping query clauses GROUP BY;
- Sort clause ORDER BY;
- Two different kinds of I/O Pattern : Read in SQL file , Output CSV file ; Or command line interactive read , Output a character table that is easy for human to interpret .
The system design
summary
From a bottom-up perspective , The backbone of the system can be divided into the following modules : The first is a page file system , The module exchanges data with the hard disk in whole pages , But it provides a byte level read-write interface , So that other modules can access data in different memory pages as data in ordinary memory , Don't worry about specific cache replacement .
The page file system solves the problem of how to store byte blocks , We should also solve the problem of how to store different types of data . So , I defined INT、FLOAT、CHAR、VARCHAR、DATE Five different types , Use a unified interface in easy to store byte blocks 、 Easy to operate runtime representation and easy I/O Conversion between literal quantities of . among DATE Type used C Language native strptime and localtime Function to realize parsing and output . On the other hand , I specialize memory pages into two :BitmapPage Class is used to store bit strings 、ListPage Class is used to store the above different types of data , So far, the system has realized the storage of different types of data in memory pages .
Next, we need to take the memory page as the basic component to establish the table in the database . The implementation of tables in this project is divided into three layers :TablePages Class manages each memory page used in a table ;BaseTable Class manages the basic queries in the table , The index is also implemented here ;Table Class provides a literal oriented interface for tables , This is because SQL Data types in the query can be converted to each other , Using literal expressions outside a single table can avoid the inconvenience of type conversion .
With a watch , You need to organize tables into databases . In this project ,TableMgr Class is responsible for managing multiple tables , Among them, the meta information such as the header of each table is stored in a separate system table , The header of the system table is hard coded .
Last ,SQL The parsing module is responsible for SQL Explain it as specific to TableMgr Executed command , And pass I/O The module performs input and output .
The following focuses on some more complex modules .
Page file system
Because the page file system provided by the course has many defects , I re implemented a page file system .
First , I abstract the direct reading and writing of pages into PageMgr Pure virtual class , And implemented FilePageMgr Subclasses read and write pages on files . The advantage of this is : When unit testing other units , You can replace it with a test class that only reads and writes in memory FilePageMgr, To avoid reading and writing a lot of files during testing .
secondly , I use hash table and linked list together , Store the key of hash table with linked list , Implemented a similar Java in LinkedHashMap Data structure of , Used for page LRU cache , And implemented PageCache Class automatically performs cache replacement , Make the cache transparent to the upper layer .PageCache The interface provided to the upper layer is in the form of iterators rather than bare pointers , Upper level functions can use these iterators just like ordinary pointers . Every time the iterator is called , Will determine whether the target address is in the cache , If not , The page where the address is located will be automatically loaded into the cache , The corresponding pages ejected from the cache will also be automatically written to the file .
If the iterator needs to look up the table every time it determines that the target is not in the cache , The cost is enormous . The actual measurement shows that , A 100000 line insertion operation will bring hundreds of millions of memory accesses . Therefore, an optimization is carried out here : All iterators accessing the same memory page share the same piece of management information space , This space includes whether the page cache is invalid 、 Information such as the location of the page in the cache , The difference between different iterators is limited to the offset . So when an iterator updates the location of the target in the cache , Other iterators can know immediately , There is no need to check the table again .
The data pages in the table are organized (TablePages class )
The main information of a table is in various forms as described above ListPage Page load , these ListPage Pages are stored “ Database name . Table name .data.db” In file . But many ListPage Pages will be fragmented during continuous application and release , To manage ListPage Usage situation , One or more more more BitmapPage Which pages record ListPage It's leisure ( Can be applied for ) Of , This BitmapPage Pages are stored “ Database name . Table name .freelist.db” In file .
ListPage Very flexible . First , It is slotted , Each slot can store several different types of columns , You can manage null values , You can also configure some columns to be non empty . secondly , Every ListPage Each has 4 Reserved fields , Respectively used to store the number of elements 、 Previous page ID、 Next page ID And the mark of the page . So different ListPage Can be marked as different types , It can also be used as a linked list . Each table uses both ListPage Store specific data , Also used ListPage Store index , In order to mix and use these pages in the same file, they are scattered and not messy , these ListPage Will be marked as follows 5 Class :
- RECORD, Indicates that this page is used to store specific data , Each slot in the page stores a record ;
- ENTRY, Indicates that this page is used to store each index entry ( The unowned index is the entry of the linked list ), Each slot in the page stores the... Of an entry page ID;
- REF, Indicates that the page is used as a leaf node of a non cluster index , Each slot in the page stores the key of a primary index ;
- PRIMARY, Indicates that the page is used as a non leaf node of the primary index , Each slot in the page stores the relevant information of a node , That is, for each child node ID And its corresponding key ;
- NON_CLUSTER, Indicates that the page is used as a non leaf node of a non cluster index , The specific non cluster index will be further distinguished , Use the same PRIMARY.
The table inserted 、 Delete 、 Query operation (BaseTable class )
This class uses the lower layer (TablePages class ) The provided page performs a specific insertion in a table 、 Delete 、 Query operation , And back to the top . I regard the modification of the table as deleting and then inserting , So put it on a higher level .
Indexes
Index is implemented in this module . The implementation of index in different cases is as follows :
- If there is no primary index , Each record page is linked into a linked list , Nonclustered index ( if there be ) The page number of the corresponding record is stored in . When querying records directly , Traverse the linked list ; When querying records through non cluster index , Find the page first , Then traverse the page change query ;
- If there is a primary index , Main index of each record B+ The form of leaf nodes exists , Nonclustered index ( if there be ) The primary key of the corresponding record is stored in . Whether the primary index or non primary index , Its leaf nodes will also be linked into a linked list . When querying records directly , Traverse the linked list ; When querying records through the main index , Query directly on the main index tree ; When querying records through non cluster index , First find its primary key , Then query through the main index .
When creating a new non cluster index , First update the above ENTRY Page registration index entry , Then insert the relevant information of all records into the index tree . When deleting a non cluster index , First recursively delete the index number , And then update ENTRY Page delete entry .
The implementation of index is essentially to operate these data pages , And manage the link relationship between them , The complexity of implementing an index is not in implementing B+ Trees , Instead, it's about taking the topological B+ Trees are stored on linear pages . The specific principle of the index is the same as that described in the class , No more details here .
Optimization of queries and inserts

System management module (TableMgr class )
The above classes need to be initialized with some meta information , For example, what columns does a table contain 、 Which indexes, etc . To solve this problem , I have implemented several system tables with known meta information , Respectively used to store the database name 、 Table name 、 Column 、 Main index 、 Nonclustered index 、 Foreign key constraints 、 Information about attribute domain constraints . When a table needs to be used , You need to initialize the corresponding Table object . The function of the system management module is to manage these meta information 、 maintain Table object , In this way, the database 、 Addition and deletion of tables and indexes .
Input inspection
A major function of the system management module is to check the input , Report errors as early as possible when illegal operations are found , Ensure that the data in each table will not be affected after entering illegal instructions . If an error is found during specific table operations , It will make it difficult to restore the state of the table to that before the operation . Illegal operations can be divided into two categories , First, foreign key constraints and attribute domain constraints , Such constraints have complex but explicit behavior , It stipulates what circumstances are legal 、 What is illegal ; The other is input error , For example, access to a nonexistent table or domain 、 Type error 、 Violation of non empty restrictions, etc , There are many kinds of such errors , It needs to be checked patiently .
There are three cases for checking foreign key constraints :
- Insertion time , Check whether the current table has constraints that reference other tables , If you have any , Ensure that the foreign key of the current table exists in the primary key of the referenced table ;
- deleted , Check whether there are other tables that reference the constraints of the current table , If you have any , Ensure that the primary key of the current table does not exist in the foreign key of the reference table ;
- update , First check whether the current table has constraints that reference other tables , If you have any , Ensure that the new value of the foreign key of the current table exists in the primary key of the referenced table ; then , If the primary key of the current table changes , You should also check whether other tables reference the constraints of the current table , If you have any , Ensure that the old value of the primary key of the current table does not exist in the foreign key of the reference table .
For attribute domain constraints , You just need to ensure that the new values of each field meet the constraints of the attribute field when inserting and deleting .
For how many kinds of input errors , No more details here .
In this project , All illegal operations are defined in the form of exceptions exception In the folder , These exceptions inherit C++ STL Of runtime_error abnormal , To take advantage of its error information mechanism , It is convenient to provide users with perfect error information .
Multi-table query

Optimize multi table queries (Optimizer class )

Aggregate query (Aggregate class )
All aggregate functions , Whether it's MIN、MAX、SUM still AVG, Can be regarded as a state machine . Its evaluation can be divided into two steps : First, keep inputting the aggregated records , Change the internal state ; Then output . As long as these two steps are defined , And the initial value of the internal state , You can implement an aggregation function . The aggregation functions can be summarized as follows :
Aggregate Class defines its “ Read in value
X
Post operation ” and “ Output
Y
Time operation ”. When actually evaluating , First direction Aggregate Class to enter the value of the field involved in each record , And then from Aggregate Class to get the output . The only thing to notice is that , If the input is null, you need to ignore .
notes : Because only SUM、AVG、MIN、MAX These four aggregate functions , So I didn't achieve COUNT function , but COUNT Function also fully conforms to the above analysis , Easy to implement . Due to the fact that COUNT, This is the explanation .
Result sorting and grouping aggregation query
Because each data type used in this system has defined the logic to compare its value , It is very simple to sort the results , Use C++ STL The sorting function in .
On the other hand , Group aggregation query can be implemented based on sorting . After sorting by the field on which the group is based , The records belonging to different groups in the result set are naturally separated . Then you can use the above “ Aggregate query ” The method in Section 1 evaluates the aggregation function for each group separately .
Besides , Because the result sorting and grouping aggregation queries may appear in the same query statement at the same time , therefore “ Group aggregation query ” Can't destroy “ Sorting results ” Result . Or first “ Group aggregation query ” Sort , Proceed again “ Sorting results ” Sort ; Or go ahead “ Group aggregation query ” Use stable sorting when sorting (stable sort), This system uses the latter method .
notes : Because only ORDER BY Without mentioning ORDER BY DESC, So I only realize the sorting in one direction , To achieve reverse order , Just modify the comparison function . Due to the fact that ORDER BY DESC, This is the explanation .
Parsing module
I use ANTLR 4 Realized SQL Parsing module . In the implementation , I have improved some grammars , As required by the operation 《 Grammar rules 》 There is a little difference between the provisions in , Specifically listed here :
- Numeric literal quantities now support decimals and negative numbers ;
- String literal quantity now supports “\’” and “\” escape ;
- Both primary and non clustered indexes now support using more than one domain as a key ;
- It is now allowed to define an empty table ;
- INT Type can now specify length ( Used to indicate the width of the corresponding column in the output table );
- It is now allowed to query 、 Delete 、 Omitted in the modification where clause .
I/O modular
This system supports two different I/O Pattern : One , Read in SQL file , Output CSV file ; Two , Command line interactive read , Output a character table that is easy for human to interpret . This enhances the user friendliness of the whole system . Program utilization Linux Of isatty Mechanism judgment stdin Is it a file or a terminal , If it is a file, enter mode 1 , If it is a terminal, enter mode 2 .
In mode one , The entire file is used as SQL Analyze the input of the module , If you find a syntax error , Then directly abandon the execution of the entire file . For the results of the query , Output one record per line , The different fields in the record are separated by commas ( namely CSV Format ), Different queries are separated by blank lines .
In mode 2 , The system constantly accepts the lines entered by the user , If the user's input does not end with a semicolon , Then remind the user to enter ( The command prompt will change ); If the user's input ends with a semicolon , Then submit it to SQL Parsing module processing . If you find a syntax error , Don't quit the program , Instead, the user is allowed to enter . For the results of the query , First calculate the width required for each column ( The width is affected by the width of the title 、 The longest result width of this column and INT The width specified by the user in the type affects ), And then use ASCII Draw characters into a table and output .
Whether it's mode one or mode two , The query results are output to stdout, Both error messages and prompt messages are output to stderr, So if the user will stdout Redirect to file , Only the query results will be recorded in the file , Error messages and prompts will also appear on the screen .
test
The project relies on Google Mock / Google Test The framework has written unit tests in total 155 strip , Cover all data types 、 file system 、 Specialized memory pages 、 Single table operation 、 System ( Multiple tables ) management 、SQL Parser 、I/O, All pass . Of course, all functions of this system can be used normally . For many illegal inputs , This system also has certain robustness .
In unit test , I focused on testing situations that are difficult to replicate by running the program directly , for example : For tests involving memory page cache replacement , I reduce the cache size , The replacement of memory pages is tested through a small number of data pages ; For tests involving indexes , I increase the size of each record , Through a small number of records, the situation with more nodes in the index is tested . For more complex parts of the system, such as indexes , I also based on GNU The native code coverage analysis tool in the tool chain gcov Guidelines , Some boundary conditions that were not covered at the beginning were supplemented .
Besides , I also used random data to test the performance of the system , And use GNU Native in the tool chain gprof The performance analysis tool takes time to analyze each function , It turns out that a lot of time is spent on string based unordered_map( Hashtable ) On . After discovering this problem , I replace many redundant hash tables with lists , And some other hash tables are changed into the form of lazy evaluation ( Realized LazyMap Class and replace it ). Besides , For a very time-consuming redundant list assignment , I also changed it into the form of lazy evaluation ( Realized VectorRef Class and replace it ).
notes : I mentioned in my brief report that my system has performance problems , But it's not . At first, the miscarriage of justice occurred because the performance analyzer itself was too slow .
Third party tools used
- ANTLR 4 Grammar parser generation tool . It's a similar one Lex/YACC Tools for , Used to generate SQL Parser ;
- Google Mock / Google Test Unit test framework . For testing only , Not included in the product program .
It is mentioned in that my system has performance problems , But it's not . At first, the miscarriage of justice occurred because the performance analyzer itself was too slow .
Third party tools used
- ANTLR 4 Grammar parser generation tool . It's a similar one Lex/YACC Tools for , Used to generate SQL Parser ;
- Google Mock / Google Test Unit test framework . For testing only , Not included in the product program .
边栏推荐
- At the 100 billion yogurt track, dairy giants and new brands have started a tug of war
- Arduino实验一:双色灯实验
- SQL injection less26 (filter spaces and comments, and use error injection without spaces)
- NVIDIA SMI error: NVIDIA SMI has failed because it could't communicate with the NVIDIA driver complete record
- 让程序在一秒或者多秒中做一件事情
- Understand China's financial system in one article
- 顺序表实现
- Detailed explanation of SQL secondary injection
- Nacos作为注册中心、配置中心入门使用篇-实现远程调用、动态获取配置文件、数据库配置信息
- Cached database for memcached
猜你喜欢

APP信息侦察&夜神模拟器Burp抓包配置

golang中的信号量的实现原理

SQL injection less26 (filter spaces and comments, and use error injection without spaces)

利用Go制作微信机器人(一)发送消息

2018 arXiv preprint | MolGAN: An implicit generative model for small molecular graphs

动态规划之线性DP

《强化学习周刊》第55期:LB-SGD、MSP-DRL & 对抗鲁棒强化学习

模块8(消息队列存储消息数据的mysql表格)

由若干QWidget实现日历文档

Summary of debugging stc8a8k64d4 MCU 485 communication
随机推荐
Detailed explanation of SQL secondary injection
Overview of banking industry
『Mysql』汇总Mysql索引失效的常见场景
Embedded sig | distributed soft bus
2018 arXiv preprint | MolGAN: An implicit generative model for small molecular graphs
2022年最新西藏建筑安全员模拟题库及答案
Nacos作为注册中心、配置中心入门使用篇-实现远程调用、动态获取配置文件、数据库配置信息
MySQL recommendation
LeetCode 121:买卖股票的最佳时机
Y78. Chapter IV Prometheus' monitoring system and practice -- Prometheus' service discovery mechanism (IX)
在灯塔工厂点亮5G,宁德时代抢先探路中国智造
Linear DP of dynamic programming
Implementation of V-model syntax sugar
Height collapse caused by floating
国产DRAM年底将量产,但前路依旧漫漫!
Network and VPC hands-on experiment
面试:你印象最深的BUG,举个例子
Does Guosen Securities charge for opening a mobile account? Is it safe to open an account?
Apifox--比 Postman 还好用的 API 测试工具
A chip company fell in round B