TiFlash is the analytical engine of TiDB and a key component of TiDB's HTAP architecture. The TiFlash source code reading series walks you through TiFlash's internal implementation at the source-code level. In the previous installment of the source code reading series we introduced TiFlash's storage engine; this article covers the TiFlash DDL module, including the design ideas behind it and the concrete code that implements them.
This article is based on the design and source code of TiFlash v6.1.0. Some designs may change in later versions and invalidate parts of this article, so please keep that in mind. The v6.1.0 code can be viewed by switching to the v6.1.0 tag in the TiFlash git repo.
Overview
In this chapter we give an overview of the DDL module: the scenarios in TiFlash that involve DDL, and the overall design ideas of the whole DDL module in TiFlash.
The DDL module here refers to the module that handles DDL statements such as add column, drop column, drop table, and recover table, and that manages the schema information of databases and tables.
Scenarios in TiFlash that involve DDL

<center>Figure 1: TiFlash architecture</center>
Figure 1 shows the TiFlash architecture. At the top are the TiDB/TiSpark compute-layer nodes; to the left of the dotted line are four TiKV nodes, and on the right are two TiFlash nodes. The figure illustrates an important design point of TiFlash: using the Raft consensus algorithm, TiFlash joins the Raft groups as a Raft learner and replicates data asynchronously. A Raft group consists of the raft leader and raft followers of the replicas of one region in TiKV. Data synchronized from TiKV to TiFlash is still partitioned by region inside TiFlash, but internally it is stored in TiFlash's columnar storage engine.

<center>Figure 2: TiFlash architecture (with schema)</center>
Figure 2 is a simplified view of the architecture with many details omitted. The two red circles in the figure mark the places where the protagonist of this article, the DDL module, comes into play.
The lower red circle is about TiFlash's write path. A TiFlash node joins the raft group of each TiKV region as a learner, and the raft leader keeps sending raft logs or raft snapshots to the learner to synchronize data. However, the data in TiKV is in row format, while TiFlash needs it in columnar format, so after receiving the row-format data sent by TiKV, a TiFlash node must convert it from rows to columns. This conversion relies on the schema information of the corresponding table. Likewise, the upper red circle marks TiDB/TiSpark reading data from TiFlash, and reading also depends on the schema to parse the data.
In short, both the read and the write paths of TiFlash strongly depend on the schema, so the schema plays an important role in TiFlash.
Overall design ideas of the DDL module
Before diving into the overall design ideas of the TiFlash DDL module, let's first look at the corresponding DDL modules in TiDB and TiKV, because the schema change information received by TiFlash also originates from TiKV nodes.
Basics of the DDL module in TiDB
TiDB's DDL module implements lock-free, online schema change in a distributed setting, following the design of Google F1. For the concrete implementation, see "TiDB Source Code Reading Series (17): DDL Source Code Analysis | PingCAP". TiDB's DDL mechanism provides two main guarantees:
The first guarantee: DDL operations avoid data reorg whenever possible (data reorg here means rewriting the data already stored in the table).
Take the add column example in Figure 3. The original table has two columns a and b and two rows of data. When the add column DDL is executed, we do not go back and fill the new column c with default values in the two existing rows. If a later read touches these two rows, the default value of column c is filled in the read result instead. This way, no data reorg happens during the DDL. Operations such as add column, drop column, and widening an integer column do not need to trigger data reorg.

<center>Figure 3: add column example</center>
However, for lossy DDL changes (for example, shortening a column's length, hereafter "column shrinking", which may truncate user data), data reorg is unavoidable. Even for lossy changes, though, we do not modify or rewrite the data in the original column. Instead, we add a new column, convert the data into the new column, and finally drop the original column and rename the new column, as a sequence of DDL operations. Take the column-shrinking (modify column) case in Figure 4: the table has columns a and b, and the DDL shrinks column a from int to tinyint. The whole DDL operation proceeds as follows: a new column _col_a_0 of the target type is added, the values of column a are converted into _col_a_0, and after the conversion the original column a is dropped and _col_a_0 is renamed to a. (Dropping column a here does not physically delete its values; it is done by modifying the meta information.)

<center>Figure 4: modify column example</center>
In addition, for the shrinking DDL operation itself, we require that no data is lost during the shrink. For example, when shrinking a column from int to tinyint, every existing value in the column must already be within the tinyint range; converting a value that exceeds the tinyint range is not supported. In that case the operation directly reports an overflow error and the column-shrinking DDL fails.
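To make the constraint concrete, here is a small, purely illustrative C++ check of the kind of per-value validation described above. It is not TiDB's actual reorg code; it only shows the idea that any value outside the narrower range aborts the whole DDL with an overflow error instead of truncating data.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Returns the narrowed value, or std::nullopt if the value does not fit in tinyint.
// If any row yields std::nullopt, the modify-column DDL reports overflow and fails
// rather than silently truncating user data.
std::optional<int8_t> narrowIntToTinyInt(int32_t value)
{
    if (value < std::numeric_limits<int8_t>::min() || value > std::numeric_limits<int8_t>::max())
        return std::nullopt;
    return static_cast<int8_t>(value);
}
```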
The second guarantee: a schema that is newer than the data can always parse the old data. This guarantee is also one that the TiFlash DDL module relies on heavily, and it rests on the format of the row-store data. When a row is stored, the column id is stored together with the column value, rather than the column name with the value, so the row format can be simplified as a map from column_id to value. (In reality the row store is not literally a map but a binary encoding; for details see "Proposal: A new storage row format for efficient decoding".)
Figure 5 helps illustrate this property. On the left is the original table with two columns. Through DDL operations, column a is dropped and column c is added, arriving at the schema on the right. Now the old data has to be parsed with the new schema: for each column id in the new schema, we look up the corresponding value in the old data. id_2 finds its value, but id_3 has no corresponding value, so we simply fill in the default value of that column for id_3. The value in the data that corresponds to id_1, which no column in the new schema references, is simply discarded. In this way the old data is parsed correctly.

<center>Figure 5: parsing old data with a new schema</center>
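To make the parsing rule concrete, here is a minimal C++ sketch of decoding a row stored as a column_id → value map against a newer schema. The types and helper names are simplified stand-ins, not TiFlash's actual decoder (which lives in RegionBlockReader); only the lookup / fill-default / discard logic mirrors the description above.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Simplified stand-ins for the real column definition and row encoding.
struct ColumnDef
{
    int64_t column_id;
    std::optional<std::string> default_value; // value to fill when the row lacks this column
};

using RowData = std::map<int64_t, std::string>; // column_id -> encoded value

// Decode one stored row against a (possibly newer) schema:
//  - a column id found in the row uses the stored value;
//  - a column id missing from the row gets its default value (or NULL);
//  - values in the row whose column id no schema column references are discarded.
std::vector<std::optional<std::string>> decodeRow(const RowData & row,
                                                  const std::vector<ColumnDef> & schema)
{
    std::vector<std::optional<std::string>> decoded;
    decoded.reserve(schema.size());
    for (const auto & col : schema)
    {
        auto it = row.find(col.column_id);
        decoded.push_back(it != row.end() ? std::optional<std::string>(it->second)
                                          : col.default_value);
    }
    return decoded;
}
```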
Basics of the DDL module in TiKV
TiKV, the row-store storage layer, does not keep the schema information of the data tables on its nodes, because TiKV's own read and write paths do not need to rely on schema information.
When the coprocessor in TiKV needs to handle computation tasks pushed down from TiDB, TiKV does need schema information. But that schema information is included in the request TiDB sends, so TiKV can directly use the schema carried in the request to parse the data and handle any exceptions (for example, a parse failure). Therefore, even for this kind of read, TiKV does not need to provide schema information itself.
Design ideas of the DDL module in TiFlash
The design of the TiFlash DDL module centers on the following three points:
1. A TiFlash node keeps its own schema copy. Partly this is because TiFlash depends strongly on the schema: it needs the schema to parse data during row-to-column conversion and during reads. Partly it is because TiFlash is built on ClickHouse, so many designs evolved from ClickHouse's original design, and ClickHouse keeps a schema copy in its own design.
2. The schema copy on a TiFlash node is updated by periodically pulling the latest schema from TiKV (essentially, obtaining the latest schema information in TiDB). Updating the schema continuously would be very expensive, so we choose to update it periodically.
3. Read and write operations rely on the schema copy to parse data. If the schema copy on the node does not satisfy the current read or write, we pull the latest schema information, so that the schema is newer than the data and the data can be parsed correctly (this is the guarantee provided by the TiDB DDL mechanism mentioned earlier). The exact requirements that reads and writes place on the schema copy are described in detail in the later sections.
DDL Core Process
In this chapter we introduce the core workflow of the TiFlash DDL module.

<center>Figure 6: DDL core process</center>
The left side of Figure 6 is a thumbnail of the whole cluster; the right side zooms into the DDL-related core processes inside TiFlash, namely the local schema copy, the Schema Syncer, schema handling on data write, and schema handling on data read. Let's look at how each part is implemented one by one.
Local Schema Copy
The most important schema information in TiFlash is the information about each data table. In the TiFlash storage layer, every physical table has a StorageDeltaMerge instance, and this object holds two member variables responsible for storing schema-related information.

<center>Figure 7: how the schema copy is stored</center>
tidb_table_info stores all kinds of schema information about the table, including the table id, table name, column infos, schema version, and so on. Its storage structure is exactly the same as the structure in which TiDB/TiKV stores the table schema.
decoding_schema_snapshot is an object generated from tidb_table_info plus some information in StorageDeltaMerge. It exists to optimize the performance of row-to-column conversion on the write path. If we relied on tidb_table_info to obtain the schema information during row-to-column conversion, a series of conversion operations would be needed to adapt it. Considering that the schema is not updated frequently, and to avoid repeating these operations on every row-to-column decode, we use decoding_schema_snapshot to cache the converted result, and decoding relies on decoding_schema_snapshot.
Schema Syncer
The Schema Syncer module is implemented by the TiDBSchemaSyncer class. It fetches the latest schema updates from TiKV via RPC. For each schema diff it obtains, it finds the table the diff corresponds to and updates the schema information in that table's StorageDeltaMerge object, as well as the related content of the corresponding storage layer.

<center>Figure 8: schema sync flow</center>
The whole process is implemented by the syncSchema function of TiDBSchemaSyncer; Figure 8 shows the concrete flow.
During the sync, for each affected table we gather all the schema changes of this round that relate to that table, and then call StorageDeltaMerge::alterFromTiDB to apply them to the table's StorageDeltaMerge object.
Concretely, the change modifies tidb_table_info, the affected columns, and the primary key information. We also update the table's create-table statement: since the table itself has changed, its create statement must change accordingly, so that later operations such as recover work correctly.
Throughout syncSchema we never update decoding_schema_snapshot. decoding_schema_snapshot is updated lazily: only when a piece of data is about to be written and decoding_schema_snapshot actually needs to be used do we check whether it still corresponds to the latest schema; if not, it is rebuilt from the latest tidb_table_info. This avoids many unnecessary conversions. For example, if a table goes through frequent schema changes but receives no writes at all, we skip all the conversion work from tidb_table_info to decoding_schema_snapshot.

<center>Figure 9: DDL process</center>
Regarding the callers of the Schema Syncer module: the Read, Write, and BootStrap paths all call TiDBSchemaSyncer::syncSchema directly, while the Background Sync Thread is driven by SchemaSyncService. When the TiFlash server starts up, the syncSchema function is registered into the background thread pool and invoked roughly every 10 seconds, which implements the periodic update.
Schema on Data Write
Now let's look at what the write path itself has to deal with. A row of data arrives to be written; each column has to be parsed and the result written into the columnar storage engine, and the local schema copy is there to help with the parsing. However, the temporal order between this incoming row and our schema copy is unknown: the data is shipped as raft logs / raft snapshots, which is asynchronous, and the schema copy is refreshed periodically, which can also be seen as asynchronous, so we cannot tell which came first in TiDB, the schema version we hold or the write of this row. The write path has to parse and write the data correctly under this uncertainty.

<center>Figure 10: writing data</center>
For such a scenario there is a very direct approach: before doing the row-to-column decode, pull the latest schema first, which guarantees that our schema is newer than the data being written, so the decode is certain to succeed. But the schema does not change often, and pulling it on every write would be a huge cost. So what the write path finally does is: decode the row with the existing schema copy; if the decode succeeds we are done; if it fails, fetch the latest schema and decode again.
During the first round of decoding, besides decoding successfully, we may also run into the following three situations:

<center>Figure 11: unknown column scenario</center>
The first case is Unknown Column, where the row to be written has one more column, e, than the schema.
The first possibility, shown in Figure 11 (left): the data is newer than the schema. On the TiDB timeline, a new column e is added first, and then the row (a,b,c,d,e) is inserted. But the inserted row reaches TiFlash before the schema change for add column e does, so the data has one more column than the schema.
The second possibility, shown in Figure 11 (right): the data is older than the schema. On the TiDB timeline, the row (a,b,c,d,e) is inserted first, and then column e is dropped. The schema change for drop column e reaches the TiFlash side first, and the inserted row arrives afterwards, so again the data has one more column than the schema. At this point we have no way to tell which of the two situations we are in, and there is no handling that is correct for both, so the only option is to return a decode failure, which triggers pulling the latest schema for a second round of decoding.
The second case is Missing Column: the row to be written is missing one column, e, compared with the schema. Again there are two possibilities.

<center>Figure 12: missing column scenario</center>
The first possibility, shown in Figure 12 (left): the data is newer than the schema. On the TiDB timeline, column e is dropped first, then the row (a,b,c,d) is inserted.
The second possibility, shown in Figure 12 (right): the data is older than the schema. On the TiDB timeline, the row (a,b,c,d) is inserted first, then column e is added.
As before, we cannot tell which situation we are in, and following the earlier approach we should return a decode failure and pull the schema again. But in this case, if the missing column e has a default value or allows NULL, we can simply fill e with its default value or NULL and report the decode as successful. Let's look at the two possibilities to see what effect filling in the default value or NULL has.
In the first possibility, column e has already been dropped, so no later read will ever touch column e; whatever value we put there does not affect correctness. In the second possibility, the row (a,b,c,d) genuinely lacks a value for e, and reads of this row would have to fill in e's default value or NULL anyway, so filling in the default value or NULL for column e of this row up front also works correctly. In both cases, then, filling e with its default value or NULL is correct, and we do not need to return a decode failure. If, however, column e does not allow a default value or NULL to be filled in, we can only return a decode failure, which triggers pulling the latest schema for a second round of decoding.

<center>Figure 13: overflow column scenario</center>
The third case is Overflow Column, where the value of some column in the row exceeds the range of that column's type in the schema. Only the situation in Figure 13 (left) can occur: the column is widened first and then the new data is inserted, but the data reaches TiFlash before the schema change does. Figure 13 (right) shows why the reverse, inserting the data first and then shrinking the column, is impossible: if we insert the row (a,b,c,d,E) and then shrink column e from int to tinyint, then because the inserted value E exceeds the tinyint range, the DDL operation reports an overflow error and fails, so it can never lead to an overflow column.
Therefore an overflow can only arise from the situation in Figure 13 (left). But because the schema change has not yet reached TiFlash, we do not know the new value range of the column, so we cannot write this overflowing value E into the TiFlash storage engine. We can only return a decode failure, which triggers pulling the latest schema for a second round of decoding.
Having seen the three abnormal situations that can appear in the first round of decoding, let's now look at the second round: after the first decode fails and the latest schema has been pulled, what happens when we decode again. Just like before, besides finishing normally, the second round can also run into the three situations above, but with one difference: in the second round we are guaranteed that the schema is newer than the data being written.

<center>Figure 14: abnormal situations in the second round of decoding</center>
Case one, Unknown Column. Since the schema is newer than the data, we can be sure that after this row was written, a drop column e happened, and that schema change simply reached the TiFlash side first, which is what produced the Unknown Column situation. So we can just discard the data of column e.
Case two, Missing Column. Since the schema is newer than the data, the row must have been written before column e was added, which is exactly the second possibility discussed above, so filling in e's default value or NULL gives the correct result.
Case three, Overflow Column. Since the schema is now newer than the data being written, an overflow column at this point means something genuinely abnormal has happened, so we throw an exception directly.
That is the overall idea of the data write path. If you want the concrete code details, search for the function writeRegionDataToStorage. The row-to-column conversion itself is implemented by the RegionBlockReader class, and the schema information this class depends on is the decoding_schema_snapshot mentioned earlier. During row-to-column conversion, when RegionBlockReader fetches the decoding_schema_snapshot, it first checks whether decoding_schema_snapshot is aligned with the version of the latest tidb_table_info; if not, an update of decoding_schema_snapshot is triggered. For the concrete logic, see the function getSchemaSnapshotAndBlockForDecoding.
Schema on Data Read
Unlike the write path, before the internal read process starts we first check the schema version. The request sent by the upper layer carries a schema version, which we call Query_Version. The read request is only allowed to proceed if the local schema information of the table being read is consistent with the schema version carried by the read request.

<center>Figure 15: reading data</center>
The TiDBSchemaSyncer in TiFlash, which is responsible for pulling the schema, records an overall schema version, which we call Local_Version. The read path therefore requires Query_Version = Local_Version. If Query_Version is greater than Local_Version, we consider the local schema version to be lagging behind, so we trigger a schema sync, pull the latest schema, and check again. If Query_Version is less than Local_Version, we consider the query's schema version too old, so the read request is rejected and the upper-layer node has to update its schema version and resend the request.
Under this rule, if a table undergoes DDL operations very frequently, its schema version keeps moving forward. A read of this table can then easily bounce back and forth between the states Query_Version > Local_Version and Query_Version < Local_Version: at first the read request's schema version is larger, which triggers a TiFlash schema sync and updates the local schema copy; the updated local schema version is now newer than the read request's, so the read request is rejected; after the read request updates its schema version, it is again found to be newer than the local schema copy, and so on. We currently do no special handling for this scenario. We consider it extremely rare, if it occurs at all, so if such an unfortunate situation does arise, we can only wait for the two sides to reach a balance and let the read start normally.
We mentioned that the read path requires Query_Version and Local_Version to be exactly equal, so mismatches are easy to hit, causing many queries to be re-issued or the schema to be re-pulled. To reduce how often this happens, we made a small optimization.

<center>Figure 16: relationship between the versions</center>
Besides the overall TiFlash schema version, each table also has its own schema version, which we call Storage_Version. Storage_Version is always less than or equal to Local_Version: only when the latest schema change actually modified this table does Storage_Version equal Local_Version; in all other cases Storage_Version is less than Local_Version. Within the interval [Storage_Version, Local_Version], the schema of this table has not changed. In other words, as long as Query_Version falls inside [Storage_Version, Local_Version], the table schema the read request expects is exactly the same as our current schema for this table. So we can relax the rejection condition from Query_Version < Local_Version to Query_Version < Storage_Version: only when Query_Version < Storage_Version does the read request need to update its schema information.
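A sketch of the relaxed version check. The names are illustrative; the real logic lives in getAndLockStorageWithSchemaVersion and getAndLockStorages, mentioned below.

```cpp
#include <cstdint>

enum class ReadCheck { Ok, SyncSchemaAndRetry, RejectQuery };

ReadCheck checkSchemaVersion(int64_t query_version,   // schema version carried by the read request
                             int64_t local_version,   // overall version held by TiDBSchemaSyncer
                             int64_t storage_version)  // version of this table's last applied change
{
    // storage_version <= local_version always holds; within [storage_version, local_version]
    // the table's schema has not changed, so the query's view is still consistent.
    if (query_version > local_version)
        return ReadCheck::SyncSchemaAndRetry; // local copy lags behind: sync, then check again
    if (query_version < storage_version)
        return ReadCheck::RejectQuery;        // query's schema is too old: caller must refresh
    return ReadCheck::Ok;                     // storage_version <= query_version <= local_version
}
```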
After the check passes, the read module builds the streams for reading based on the tidb_table_info of the corresponding table. For the schema-related parts of the read flow, see getAndLockStorageWithSchemaVersion in InterpreterSelectQuery.cpp and getAndLockStorages in DAGStorageInterpreter.cpp. Both are responsible for reading TiFlash tables: the former handles reads issued through a clickhouse client connection, the latter handles reads coming from TiDB.
Special Case
Finally, let's use an example to look at how Drop Table and Recover Table are handled.

<center>Figure 17: special case, part 1</center>
The upper line in Figure 17 is the TiDB timeline, and the lower line is the TiFlash timeline. At t1, TiDB executes an insert, and at t2 it executes drop table. At t1', TiFlash receives the raft log of the insert but has not yet decoded and written it; at t2', TiFlash syncs the drop table schema DDL and updates its schema. By t2'', when TiFlash starts decoding the newly inserted row, the corresponding table has already been dropped, so this row is discarded. Up to this point, nothing is wrong.

<center>Figure 18: special case, part 2</center>
But if at t3 a recover operation is executed to restore the table, the row inserted earlier has been lost, and data loss is an unacceptable outcome. Therefore, for a DDL like drop table, TiFlash only marks the table as tombstone, and the actual physical removal is deferred to a later gc. For writes that arrive on a table after drop table, we keep decoding and writing them, so that a later recover does not lose any data.
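Conceptually, the drop/recover handling amounts to the sketch below: drop table only stamps a tombstone, writes keep being applied, recover clears the tombstone, and only gc removes the data physically once the tombstone is old enough. The field and function names are illustrative, not TiFlash's actual API.

```cpp
#include <cstdint>
#include <optional>

// Illustrative model of a dropped-but-not-yet-collected table.
struct TableLifecycleSketch
{
    std::optional<uint64_t> tombstone_ts; // set by DROP TABLE, cleared by RECOVER TABLE

    void dropTable(uint64_t ts) { tombstone_ts = ts; }    // no physical deletion here
    void recoverTable()         { tombstone_ts.reset(); } // the data is still there, nothing was lost

    // Writes arriving after DROP TABLE are still decoded and applied,
    // so a later RECOVER TABLE sees all of them.
    bool acceptsWrites() const { return true; }

    // Physical removal happens only during gc, once the tombstone is older than the gc safe point.
    bool shouldGc(uint64_t gc_safe_point) const
    {
        return tombstone_ts.has_value() && *tombstone_ts < gc_safe_point;
    }
};
```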
Summary
This article introduced the design ideas, concrete implementation, and core workflows of the DDL module in TiFlash. More code-reading content will be covered step by step in later installments; stay tuned.