
How to guarantee the data quality of data warehouse?

2022-06-11 02:25:00 Learn big data in five minutes

Reading guide

Youzan's data report center provides merchants with rich data indicators: 30+ pages, 100+ data reports and 400+ data indicators of different types. These help merchants understand and operate their stores more scientifically, and also directly provide analysis and decision-making material for merchants to use. In addition, the underlying tasks and data tables involved in each day's runs have reached the thousands.

Facing such a huge data system, how should testers formulate a quality assurance strategy? This article covers four aspects: 1. understanding the data link; 2. data layer testing; 3. application layer testing; 4. follow-up planning.


I. Understanding the data link

1. Data link introduction

First, here is an overview of Youzan's general data architecture:

From top to bottom, it can be roughly divided into the application service layer, data gateway layer, application storage layer and data warehouse, while platforms such as job development and metadata management provide basic capabilities for data computing, task scheduling and data querying.

The above gives a preliminary introduction to the overall architecture. For quality assurance, the two core parts are the data warehouse and the data applications: these two parts sit on the core links of the data pipeline, and compared with other layers they change more frequently in daily work, so the risk of problems is relatively high.

II. Data layer testing

1. Overall overview

Quality assurance at the data layer can be divided into three aspects: data timeliness, completeness and accuracy.

2. Data timeliness

Data timeliness, as the name suggests, means that data needs to be produced on time. The three key elements of timeliness are the scheduled execution time, the priority and the data deadline. A task's priority determines how much computing resource it gets, which affects its execution time. The data deadline is the unified standard for the latest time at which data must be produced, and it needs to be strictly observed.

Among these three elements, the one that is a "universal rule" and needs the most attention in the quality assurance stage is the data deadline. Based on the data deadline, the timeliness guarantee strategies can be divided into two types:

  • Monitor whether offline data tasks finish on time. This method relies on the monitoring and alerting of Youzan's job development platform: if a data task has not finished by its deadline, alerts are sent by email, enterprise WeChat, phone call and other channels to notify the corresponding people.

  • Check the number of rows in the whole table or in the partition. This approach relies on the interface automation platform: by calling the Dubbo interface, we judge whether the data indicator returned is 0, i.e. whether the data has been produced (a minimal sketch of the underlying row-count check is given below).
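As a minimal sketch of the row-count check behind such an interface (dw.dws_xx_order and the partition column par are borrowed from the examples later in this article; the tables actually monitored will differ):

-- Output check: has the expected partition been produced yet?
select
  count(*) as row_cnt   -- 0 means the day's data has not been produced
from dw.dws_xx_order
where par = '20211025'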

In addition, we can monitor failures and retry counts: when a task fails or retries abnormally multiple times during execution, an alert can be raised so that the relevant people are aware of it. This alert supplements the deadline alert, and the functionality is already integrated into Youzan's job development platform.

3. Data completeness

Data completeness, as the name suggests, checks whether the data is complete, focusing on two points: no extra data and no missing data.

  • No extra data: generally check the whole table and important enumeration values to see whether there is redundant or duplicate data, and whether the primary key is unique.

  • No missing data: generally check the whole table and important fields (such as primary key fields, enumeration values, dates, etc.) to see whether field values are empty or null.

As you can see, data completeness is not so closely tied to the business itself; it is mostly a general content check on warehouse tables. So, starting from some basic dimensions, we can divide the test focus into two directions: table level and field level.

Table-level completeness:

  • Whole-table dimension: check the total number of rows / total size of the whole table. If the total row count or total size stays unchanged or decreases, there may be a problem with the table data.

  • Partition dimension: check the number of rows / size of the current day's partition. If it differs too much from the previous partition (too large or too small), there may be a problem with the table data (see the sketch below).
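A minimal sketch of the partition-dimension check, again borrowing dw.dws_xx_order and the partition column par from the examples later in this article:

-- Compare the current partition's row count with the previous partition's.
select
  par,
  count(*) as row_cnt
from dw.dws_xx_order
where par in ('20211024', '20211025')   -- previous day vs. current day
group by par

If the two counts differ far beyond the usual day-to-day fluctuation, the table is flagged for investigation.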

At present, Youzan's metadata management platform has already integrated the relevant data views.

Field-level completeness:

  • Uniqueness check: ensure that the primary key or certain fields are unique, to prevent duplicate data from multiplying rows after joining with other tables and inflating the final statistics.

For example, to check whether the order number in the ods-layer order table is unique, the SQL is:

select 
count(order_no)
,count(distinct order_no) 
from ods.xx_order

If the two counts are equal, order_no is unique in the table; otherwise, order_no is not unique and there is a problem with the table data.

  • Not-null check: ensure that important fields are not null, to prevent null values from causing rows to be lost after joining with other tables and shrinking the final statistics.

For example, to check whether the order number in the ods-layer order table contains null values, the SQL is:

select 
count(*) 
from ods.xx_order 
where order_no is null

If the result equals 0, order_no contains no null values; if the result is greater than 0, order_no has null values and there is a problem with the table data.

  • Enumeration check: ensure that enumeration field values are within the expected range, to prevent dirty business data from causing missing or redundant data types in the final statistics.

For example, to check whether all enumeration values of the shop_type field in the ods-layer order table meet expectations, the SQL is:

select shop_type from ods.xx_order group by shop_type

Check whether the query results meet expectations, making sure there are no missing or redundant enumeration types.

  • Validity check: check whether the data format meets expectations, to prevent an incorrect field format from causing errors or missing records in the statistics. A common case is the date format yyyymmdd (a sketch is given below).
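A minimal sketch of such a validity check; ods.xx_order comes from the examples above, while the date column pay_date is an assumed name for illustration:

-- Validity check: count rows whose date field does not follow yyyymmdd.
select
  count(*)
from ods.xx_order
where not (pay_date rlike '^[0-9]{8}$')   -- pay_date is an assumed column name

A result greater than 0 means some records violate the expected format.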

Data completeness problems have a great impact on data quality. The completeness strategy is therefore most applicable at the ods layer, because we expect to find and fix unreasonable data at the source, stop the loss in time, and prevent dirty data from entering downstream layers and spreading the pollution.

In addition, the content of completeness verification is logically simple and relatively fixed, so with a little abstraction it can be templated. As testers, we therefore prefer to turn completeness verification into a tool. At Youzan, a "data form tool" has already landed; here are some of the ideas behind it:

  1. For all tables, define universal rules, such as the uniqueness of the table's primary key.

  2. For different field types, such as numeric, string, enumeration and date types, list common data validation rules.

  3. Rank each rule. For example, a non-unique table primary key is recorded as critical; a string field whose null ratio is greater than 70% is recorded as warning.

  4. Based on whether the table data satisfies the above rules, finally produce a visual report, and testers can evaluate data quality according to its content (a sketch of the null-ratio rule from point 3 is given below).
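As an illustration of point 3, a minimal sketch of the null-ratio rule for a string field; ods.xx_order comes from the examples above, while buyer_name is an assumed column name:

-- Null ratio of a string field, used to decide whether the warning rule fires.
select
  count(*) as total_cnt,
  sum(case when buyer_name is null or buyer_name = '' then 1 else 0 end) as null_cnt,
  sum(case when buyer_name is null or buyer_name = '' then 1 else 0 end) / count(*) as null_ratio
from ods.xx_order

If null_ratio exceeds 0.7, the field is recorded as warning in the report.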

4. Data accuracy

Data accuracy, as the name suggests, means the data should be "accurate". "Accuracy" is a rather abstract concept, because it is hard to show how accurate the data is with a single strong logical check; much of it lives in perceptual judgment. Accuracy testing is therefore a relatively divergent direction in data quality assurance.

Summing up, we can control data accuracy through self-checks of the fields themselves, horizontal and vertical data comparisons, and code review. These test points are also closely related to the business.

4.1 Self check

Self-check means checking the accuracy of the data against itself, without comparing it to other data; it is one of the most basic checks. Common self-checks include: numeric indicators should be greater than 0, and ratio indicators should fall within the range 0-1. Like the completeness checks, such basic rules can also be assisted by the "data form tool".

For instance, for the order table, the payment amount must be greater than or equal to 0 and can never be negative. The SQL is:

select 
count(pay_price) 
from 
dw.dws_xx_order 
where par = 20211025 and pay_price<0

If the result is 0, all payment amounts are greater than or equal to 0 and meet expectations; otherwise, if the count is greater than 0, there is a problem with the data.
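The 0-1 range check for ratio indicators follows the same pattern. A minimal sketch, assuming a hypothetical ratio column refund_rate in the same table:

-- Self-check for a ratio indicator: values must fall within [0, 1].
select
  count(*)
from dw.dws_xx_order
where par = '20211025'
  and (refund_rate < 0 or refund_rate > 1)   -- refund_rate is an assumed column name

A non-zero count points to out-of-range ratio values.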

4.2 Horizontal data comparison within a table

Horizontal comparison within a table means that, in the same table, two or more business-related fields have a certain logical relationship and can therefore be compared against each other.

For example, for the order table, from the actual business it is easy to see that for any item in any store, the number of orders >= the number of customers who placed orders. The SQL is:

select 
kdt_id
,goods_id
,count(order_no)
,count(distinct buyer_id) 
from dw.dws_xx_order
where par = '20211025'
group by kdt_id,goods_id
having count(order_no)<count(distinct buyer_id)

If the query returns no records, there is no case where the number of orders < the number of ordering customers; in other words, the number of orders >= the number of ordering customers, which meets expectations. Otherwise, if the query returns records, the data does not meet expectations.

4.3 Horizontal data comparison between tables

Horizontal comparison between tables means that two or more tables have fields that are business-related or share the same business meaning, which can be compared against each other:

  • Comparison between tables of the same type: for Hive payment tables A and B that both contain a payment amount field, under the same dimension, table A's payment amount = table B's payment amount.

  • Comparison between multiple storage engines: for example, Youzan's data report center stores the payment table in both MySQL and Kylin for active/standby switching, so under the same dimension, the Kylin table A payment amount = the MySQL table B payment amount.

  • Comparison between systems: across systems, for example the data report center and the CRM system both hold customer indicator data, so under the same dimension, the data report center table A customer indicators = the CRM table B customer indicators (a sketch follows this list).
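A minimal sketch of a between-table comparison under the same dimension; the table names dw.dws_pay_a and dw.dws_pay_b are assumed for illustration, while kdt_id, par and pay_price reuse columns from the examples above:

-- Same-dimension comparison of the payment amount in table A vs. table B.
select
  a.kdt_id,
  a.pay_price as pay_price_a,
  b.pay_price as pay_price_b
from dw.dws_pay_a a
join dw.dws_pay_b b
  on a.kdt_id = b.kdt_id
 and a.par = b.par
where a.par = '20211025'
  and a.pay_price <> b.pay_price   -- rows where the two tables disagree

An empty result means the two tables are consistent for that day.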

Digging into the underlying logic of horizontal data comparison, its essence is comparing fields of two tables with logical operators, which is also easy to abstract into a tool. At Youzan, a "data comparison tool" has already landed; here are some of the ideas behind it:

  • Enter two tables and set the primary key of each.

  • Enter the fields to be compared in the two tables, and set the comparison operator, such as >, = or <.

  • According to the configured rules, produce the passed / failed comparison records as a visual report, and testers can evaluate data quality according to its content.

4.4 Vertical data comparison

Vertical comparison is the comparison of upstream and downstream data; the goal is to ensure that important fields are processed correctly between upstream and downstream.

For example, the dw layer of the data warehouse has an order detail table, and the dm layer of the data product has an aggregate table of order counts; the statistical results of the two under the same dimension should be consistent (a minimal sketch is given below).
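A minimal sketch of such a vertical comparison; dw.dws_xx_order comes from the examples above, while dm.dm_xx_order_agg and its columns are assumed names for illustration:

-- Order count aggregated from the dw detail table vs. the dm aggregate table.
select
  d.kdt_id,
  d.order_cnt_from_detail,
  m.order_cnt
from (
  select kdt_id, count(order_no) as order_cnt_from_detail
  from dw.dws_xx_order
  where par = '20211025'
  group by kdt_id
) d
join dm.dm_xx_order_agg m
  on d.kdt_id = m.kdt_id
where m.par = '20211025'
  and d.order_cnt_from_detail <> m.order_cnt

Every returned row marks a store whose upstream and downstream counts diverge.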

4.5 Code review

First, in the requirements review stage before code review, we must clarify the detailed statistical caliber of the data. Here are two examples of actual requirements.

  • Requirement 1 (bad example): count the payment amount of all users in the store within the time period. The problem: the requirement description is too brief; the time dimension and filter conditions of the statistics are not explained clearly, so the statistical caliber is unclear and the product manager must be asked to clarify it.

  • Requirement 2 (good example): the offline payment amount of all online merchants at the dimension of Youzan home-domain stores, supporting natural day, natural week and natural month. Within the statistical time range, it is the sum of all payment orders (excluding lottery-group orders, gift cards and distribution supply orders).

After the requirements are clarified, here is a detailed introduction to some common concerns of code review:

1) Joins & filter conditions

  • Whether the associated tables use an outer join or an inner join depends on whether the data needs to be filtered.

  • In the join's on clause, whether the value types on the left and right sides are consistent.

  • If the relationship is 1:1, whether the join keys of the two tables are unique. If they are not unique, the join will produce a Cartesian product and inflate the data.

  • Whether the where conditions filter correctly. Taking the requirement above as an example, focus on whether the SQL correctly excludes lottery-group orders, gift cards and distribution supply orders (a sketch follows this list).
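A minimal sketch of what the reviewed SQL for the sample requirement might look like; dw.dws_xx_order comes from the examples above, while order_type, is_gift_card and is_distribution are assumed column names for illustration:

-- Payment amount per store, with the requirement's exclusions expressed as filters.
select
  kdt_id,
  sum(pay_price) as pay_amount
from dw.dws_xx_order
where par = '20211025'
  and order_type <> 'lottery_group'   -- exclude lottery-group orders (assumed encoding)
  and is_gift_card = 0                -- exclude gift cards (assumed flag)
  and is_distribution = 0             -- exclude distribution supply orders (assumed flag)
group by kdt_id

During review, the focus is whether each exclusion in the requirement maps to a filter like these, and whether any upstream join could duplicate rows before the sum.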

2) Statistical caliber of indicators

The statistics of data indicators involve two basic concepts:

  • Cumulative indicators: for example payment amount or page views, i.e. indicators that can be computed by simply summing values. For such indicators, the SQL generally uses sum.

  • Non-cumulative indicators: for example the number of visitors; they cannot be simply added, but must be deduplicated and then counted. For such indicators, the SQL generally uses count(distinct ) (a sketch follows this list).
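A minimal sketch showing both kinds of indicators in one query; dw.dws_xx_order, pay_price and buyer_id reuse names from the examples above:

-- Cumulative vs. non-cumulative indicators.
select
  kdt_id,
  sum(pay_price)           as pay_amount,   -- cumulative: simple sum
  count(distinct buyer_id) as buyer_cnt     -- non-cumulative: deduplicate, then count
from dw.dws_xx_order
where par = '20211025'
group by kdt_id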

3) Inserting data

  • Whether re-running is supported, i.e. whether the insert uses the overwrite keyword. Without it, re-running the workflow (executing it more than once) will not overwrite the dirty data but will append to the table instead, which may double the final statistics (see the sketch after this list).

  • Whether the order of the inserted fields exactly matches the structure of the target table. We need to make sure the fields are written in the right order; otherwise the inserted values will be misaligned.
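A minimal sketch of a re-runnable insert; dm.dm_xx_order_agg is an assumed target table, and the other names reuse the examples above:

-- insert overwrite replaces the partition on re-run instead of appending to it.
insert overwrite table dm.dm_xx_order_agg partition (par = '20211025')
select
  kdt_id,
  count(order_no)          as order_cnt,
  count(distinct buyer_id) as buyer_cnt
from dw.dws_xx_order
where par = '20211025'
group by kdt_id

During review, also check that the selected fields line up one-to-one with the column order of the target table.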

III. Application layer testing

1. Overall overview

Basic front-end page and server-side interface testing is consistent with the focus of general business testing and will not be repeated here. This section focuses on the points that need extra attention when testing "data applications".

2. Downgrade strategy

  • When a data table is added to a page, confirm in the requirement and technical review stages whether the "blue bar" feature needs to be supported; this belongs to "shift-left testing".

Blue bar introduction: a blue bar at the top of the page tells the merchant that the offline data has not been produced yet, where the "expected output time" = current access time + 2 hours, calculated dynamically.

  • When testing ratio indicators, pay attention to the special case where the divisor = 0, both during back-end code review and when testing page functionality. Youzan currently handles this case by displaying "-" on the front end.

3. Active/standby strategy

When there is an active/standby switching strategy, make sure during testing that data is double-written correctly, and that the data source used for queries can be switched between active and standby via configuration.

4. Data security

Pay attention to permission control on data queries, focusing on horizontal and vertical privilege escalation scenarios.

IV. Follow-up planning

At present, in the data accuracy comparisons of actual projects, the data comparison tool does not support SQL functions, so it can only replace about 50% of the manual testing; some complex horizontal and vertical comparisons still require hand-written SQL. We plan to support SQL functions such as sum, count, max and min, raising the tool's coverage to 75% or more and greatly reducing the cost of data comparison.

At present, the "data form report" and "data comparison tool" are mostly used in project testing. We plan to turn the form checks and data comparisons into online inspections, combining automation with these data tools to continuously guarantee the quality of warehouse tables.

Code review of SQL is currently mostly manual. We plan to turn some basic SQL checks, such as the insert into check, the uniqueness check of join on conditions and the field insertion order check, into SQL static scanning, integrate it into the big data testing service, and empower other business lines.


Original source

Copyright notice
This article was written by [Learn big data in five minutes]; please include the original link when reposting.
https://yzsam.com/2022/162/202206110121370499.html