当前位置：网站首页>Page file system based on C language

Page file system based on C language

2022-07-26 22:40:00 【biyezuopin】

be based on C Page file system of language

Additional functions implemented

Non cluster indexes that can be dynamically added or deleted , Both primary and non primary indexes support using more than one domain as a key ;
Attribute domain constraints , Support “CHECK ( in ())”;
Foreign key constraints ;
DATE The type and VARCHAR type . however VARCHAR The type is still implemented as a fixed length type ;
Query optimization ： Choose which index to use when there are several indexes in the table ;
Optimization of multi table connection ： Optimize the order of querying tables when connecting multiple tables , Make it use indexes as much as possible ;
Aggregate query SUM、AVG、MIN、MAX;
Grouping query clauses GROUP BY;
Sort clause ORDER BY;
Two different kinds of I/O Pattern ： Read in SQL file , Output CSV file ; Or command line interactive read , Output a character table that is easy for human to interpret .

The system design

summary

From a bottom-up perspective , The backbone of the system can be divided into the following modules ： The first is a page file system , The module exchanges data with the hard disk in whole pages , But it provides a byte level read-write interface , So that other modules can access data in different memory pages as data in ordinary memory , Don't worry about specific cache replacement .

The page file system solves the problem of how to store byte blocks , We should also solve the problem of how to store different types of data . So , I defined INT、FLOAT、CHAR、VARCHAR、DATE Five different types , Use a unified interface in easy to store byte blocks 、 Easy to operate runtime representation and easy I/O Conversion between literal quantities of . among DATE Type used C Language native strptime and localtime Function to realize parsing and output . On the other hand , I specialize memory pages into two ：BitmapPage Class is used to store bit strings 、ListPage Class is used to store the above different types of data , So far, the system has realized the storage of different types of data in memory pages .

Next, we need to take the memory page as the basic component to establish the table in the database . The implementation of tables in this project is divided into three layers ：TablePages Class manages each memory page used in a table ;BaseTable Class manages the basic queries in the table , The index is also implemented here ;Table Class provides a literal oriented interface for tables , This is because SQL Data types in the query can be converted to each other , Using literal expressions outside a single table can avoid the inconvenience of type conversion .

With a watch , You need to organize tables into databases . In this project ,TableMgr Class is responsible for managing multiple tables , Among them, the meta information such as the header of each table is stored in a separate system table , The header of the system table is hard coded .

Last ,SQL The parsing module is responsible for SQL Explain it as specific to TableMgr Executed command , And pass I/O The module performs input and output .

The following focuses on some more complex modules .

Page file system

Because the page file system provided by the course has many defects , I re implemented a page file system .

First , I abstract the direct reading and writing of pages into PageMgr Pure virtual class , And implemented FilePageMgr Subclasses read and write pages on files . The advantage of this is ： When unit testing other units , You can replace it with a test class that only reads and writes in memory FilePageMgr, To avoid reading and writing a lot of files during testing .

secondly , I use hash table and linked list together , Store the key of hash table with linked list , Implemented a similar Java in LinkedHashMap Data structure of , Used for page LRU cache , And implemented PageCache Class automatically performs cache replacement , Make the cache transparent to the upper layer .PageCache The interface provided to the upper layer is in the form of iterators rather than bare pointers , Upper level functions can use these iterators just like ordinary pointers . Every time the iterator is called , Will determine whether the target address is in the cache , If not , The page where the address is located will be automatically loaded into the cache , The corresponding pages ejected from the cache will also be automatically written to the file .

If the iterator needs to look up the table every time it determines that the target is not in the cache , The cost is enormous . The actual measurement shows that , A 100000 line insertion operation will bring hundreds of millions of memory accesses . Therefore, an optimization is carried out here ： All iterators accessing the same memory page share the same piece of management information space , This space includes whether the page cache is invalid 、 Information such as the location of the page in the cache , The difference between different iterators is limited to the offset . So when an iterator updates the location of the target in the cache , Other iterators can know immediately , There is no need to check the table again .

The data pages in the table are organized （TablePages class ）

The main information of a table is in various forms as described above ListPage Page load , these ListPage Pages are stored “ Database name . Table name .data.db” In file . But many ListPage Pages will be fragmented during continuous application and release , To manage ListPage Usage situation , One or more more more BitmapPage Which pages record ListPage It's leisure （ Can be applied for ） Of , This BitmapPage Pages are stored “ Database name . Table name .freelist.db” In file .

ListPage Very flexible . First , It is slotted , Each slot can store several different types of columns , You can manage null values , You can also configure some columns to be non empty . secondly , Every ListPage Each has 4 Reserved fields , Respectively used to store the number of elements 、 Previous page ID、 Next page ID And the mark of the page . So different ListPage Can be marked as different types , It can also be used as a linked list . Each table uses both ListPage Store specific data , Also used ListPage Store index , In order to mix and use these pages in the same file, they are scattered and not messy , these ListPage Will be marked as follows 5 Class ：

RECORD, Indicates that this page is used to store specific data , Each slot in the page stores a record ;
ENTRY, Indicates that this page is used to store each index entry （ The unowned index is the entry of the linked list ）, Each slot in the page stores the... Of an entry page ID;
REF, Indicates that the page is used as a leaf node of a non cluster index , Each slot in the page stores the key of a primary index ;
PRIMARY, Indicates that the page is used as a non leaf node of the primary index , Each slot in the page stores the relevant information of a node , That is, for each child node ID And its corresponding key ;
NON_CLUSTER, Indicates that the page is used as a non leaf node of a non cluster index , The specific non cluster index will be further distinguished , Use the same PRIMARY.

The table inserted 、 Delete 、 Query operation （BaseTable class ）

This class uses the lower layer （TablePages class ） The provided page performs a specific insertion in a table 、 Delete 、 Query operation , And back to the top . I regard the modification of the table as deleting and then inserting , So put it on a higher level .

Indexes

Index is implemented in this module . The implementation of index in different cases is as follows ：

If there is no primary index , Each record page is linked into a linked list , Nonclustered index （ if there be ） The page number of the corresponding record is stored in . When querying records directly , Traverse the linked list ; When querying records through non cluster index , Find the page first , Then traverse the page change query ;
If there is a primary index , Main index of each record B+ The form of leaf nodes exists , Nonclustered index （ if there be ） The primary key of the corresponding record is stored in . Whether the primary index or non primary index , Its leaf nodes will also be linked into a linked list . When querying records directly , Traverse the linked list ; When querying records through the main index , Query directly on the main index tree ; When querying records through non cluster index , First find its primary key , Then query through the main index .

When creating a new non cluster index , First update the above ENTRY Page registration index entry , Then insert the relevant information of all records into the index tree . When deleting a non cluster index , First recursively delete the index number , And then update ENTRY Page delete entry .

The implementation of index is essentially to operate these data pages , And manage the link relationship between them , The complexity of implementing an index is not in implementing B+ Trees , Instead, it's about taking the topological B+ Trees are stored on linear pages . The specific principle of the index is the same as that described in the class , No more details here .

Optimization of queries and inserts

Insert picture description here

System management module （TableMgr class ）

The above classes need to be initialized with some meta information , For example, what columns does a table contain 、 Which indexes, etc . To solve this problem , I have implemented several system tables with known meta information , Respectively used to store the database name 、 Table name 、 Column 、 Main index 、 Nonclustered index 、 Foreign key constraints 、 Information about attribute domain constraints . When a table needs to be used , You need to initialize the corresponding Table object . The function of the system management module is to manage these meta information 、 maintain Table object , In this way, the database 、 Addition and deletion of tables and indexes .

Input inspection

A major function of the system management module is to check the input , Report errors as early as possible when illegal operations are found , Ensure that the data in each table will not be affected after entering illegal instructions . If an error is found during specific table operations , It will make it difficult to restore the state of the table to that before the operation . Illegal operations can be divided into two categories , First, foreign key constraints and attribute domain constraints , Such constraints have complex but explicit behavior , It stipulates what circumstances are legal 、 What is illegal ; The other is input error , For example, access to a nonexistent table or domain 、 Type error 、 Violation of non empty restrictions, etc , There are many kinds of such errors , It needs to be checked patiently .

There are three cases for checking foreign key constraints ：

Insertion time , Check whether the current table has constraints that reference other tables , If you have any , Ensure that the foreign key of the current table exists in the primary key of the referenced table ;
deleted , Check whether there are other tables that reference the constraints of the current table , If you have any , Ensure that the primary key of the current table does not exist in the foreign key of the reference table ;
update , First check whether the current table has constraints that reference other tables , If you have any , Ensure that the new value of the foreign key of the current table exists in the primary key of the referenced table ; then , If the primary key of the current table changes , You should also check whether other tables reference the constraints of the current table , If you have any , Ensure that the old value of the primary key of the current table does not exist in the foreign key of the reference table .

For attribute domain constraints , You just need to ensure that the new values of each field meet the constraints of the attribute field when inserting and deleting .

For how many kinds of input errors , No more details here .

In this project , All illegal operations are defined in the form of exceptions exception In the folder , These exceptions inherit C++ STL Of runtime_error abnormal , To take advantage of its error information mechanism , It is convenient to provide users with perfect error information .

Multi-table query

Insert picture description here

Optimize multi table queries （Optimizer class ）

Insert picture description here

Aggregate query （Aggregate class ）

All aggregate functions , Whether it's MIN、MAX、SUM still AVG, Can be regarded as a state machine . Its evaluation can be divided into two steps ： First, keep inputting the aggregated records , Change the internal state ; Then output . As long as these two steps are defined , And the initial value of the internal state , You can implement an aggregation function . The aggregation functions can be summarized as follows ：
Insert picture description here

Aggregate Class defines its “ Read in value

Post operation ” and “ Output

Time operation ”. When actually evaluating , First direction Aggregate Class to enter the value of the field involved in each record , And then from Aggregate Class to get the output . The only thing to notice is that , If the input is null, you need to ignore .

notes ： Because only SUM、AVG、MIN、MAX These four aggregate functions , So I didn't achieve COUNT function , but COUNT Function also fully conforms to the above analysis , Easy to implement . Due to the fact that COUNT, This is the explanation .

Result sorting and grouping aggregation query

Because each data type used in this system has defined the logic to compare its value , It is very simple to sort the results , Use C++ STL The sorting function in .

On the other hand , Group aggregation query can be implemented based on sorting . After sorting by the field on which the group is based , The records belonging to different groups in the result set are naturally separated . Then you can use the above “ Aggregate query ” The method in Section 1 evaluates the aggregation function for each group separately .

Besides , Because the result sorting and grouping aggregation queries may appear in the same query statement at the same time , therefore “ Group aggregation query ” Can't destroy “ Sorting results ” Result . Or first “ Group aggregation query ” Sort , Proceed again “ Sorting results ” Sort ; Or go ahead “ Group aggregation query ” Use stable sorting when sorting （stable sort）, This system uses the latter method .

notes ： Because only ORDER BY Without mentioning ORDER BY DESC, So I only realize the sorting in one direction , To achieve reverse order , Just modify the comparison function . Due to the fact that ORDER BY DESC, This is the explanation .

Parsing module

I use ANTLR 4 Realized SQL Parsing module . In the implementation , I have improved some grammars , As required by the operation 《 Grammar rules 》 There is a little difference between the provisions in , Specifically listed here ：

Numeric literal quantities now support decimals and negative numbers ;
String literal quantity now supports “\’” and “\” escape ;
Both primary and non clustered indexes now support using more than one domain as a key ;
It is now allowed to define an empty table ;
INT Type can now specify length （ Used to indicate the width of the corresponding column in the output table ）;
It is now allowed to query 、 Delete 、 Omitted in the modification where clause .

I/O modular

This system supports two different I/O Pattern ： One , Read in SQL file , Output CSV file ; Two , Command line interactive read , Output a character table that is easy for human to interpret . This enhances the user friendliness of the whole system . Program utilization Linux Of isatty Mechanism judgment stdin Is it a file or a terminal , If it is a file, enter mode 1 , If it is a terminal, enter mode 2 .

In mode one , The entire file is used as SQL Analyze the input of the module , If you find a syntax error , Then directly abandon the execution of the entire file . For the results of the query , Output one record per line , The different fields in the record are separated by commas （ namely CSV Format ）, Different queries are separated by blank lines .

In mode 2 , The system constantly accepts the lines entered by the user , If the user's input does not end with a semicolon , Then remind the user to enter （ The command prompt will change ）; If the user's input ends with a semicolon , Then submit it to SQL Parsing module processing . If you find a syntax error , Don't quit the program , Instead, the user is allowed to enter . For the results of the query , First calculate the width required for each column （ The width is affected by the width of the title 、 The longest result width of this column and INT The width specified by the user in the type affects ）, And then use ASCII Draw characters into a table and output .

Whether it's mode one or mode two , The query results are output to stdout, Both error messages and prompt messages are output to stderr, So if the user will stdout Redirect to file , Only the query results will be recorded in the file , Error messages and prompts will also appear on the screen .

test

The project relies on Google Mock / Google Test The framework has written unit tests in total 155 strip , Cover all data types 、 file system 、 Specialized memory pages 、 Single table operation 、 System （ Multiple tables ） management 、SQL Parser 、I/O, All pass . Of course, all functions of this system can be used normally . For many illegal inputs , This system also has certain robustness .

In unit test , I focused on testing situations that are difficult to replicate by running the program directly , for example ： For tests involving memory page cache replacement , I reduce the cache size , The replacement of memory pages is tested through a small number of data pages ; For tests involving indexes , I increase the size of each record , Through a small number of records, the situation with more nodes in the index is tested . For more complex parts of the system, such as indexes , I also based on GNU The native code coverage analysis tool in the tool chain gcov Guidelines , Some boundary conditions that were not covered at the beginning were supplemented .

Besides , I also used random data to test the performance of the system , And use GNU Native in the tool chain gprof The performance analysis tool takes time to analyze each function , It turns out that a lot of time is spent on string based unordered_map（ Hashtable ） On . After discovering this problem , I replace many redundant hash tables with lists , And some other hash tables are changed into the form of lazy evaluation （ Realized LazyMap Class and replace it ）. Besides , For a very time-consuming redundant list assignment , I also changed it into the form of lazy evaluation （ Realized VectorRef Class and replace it ）.

notes ： I mentioned in my brief report that my system has performance problems , But it's not . At first, the miscarriage of justice occurred because the performance analyzer itself was too slow .

Third party tools used

ANTLR 4 Grammar parser generation tool . It's a similar one Lex/YACC Tools for , Used to generate SQL Parser ;
Google Mock / Google Test Unit test framework . For testing only , Not included in the product program .
It is mentioned in that my system has performance problems , But it's not . At first, the miscarriage of justice occurred because the performance analyzer itself was too slow .

Third party tools used

ANTLR 4 Grammar parser generation tool . It's a similar one Lex/YACC Tools for , Used to generate SQL Parser ;
Google Mock / Google Test Unit test framework . For testing only , Not included in the product program .

原网站

版权声明
本文为[biyezuopin]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/207/202207262149455340.html