How to guarantee the data quality of data warehouse?
2022-06-11 02:25:00 【Learn big data in five minutes】
Reading guide
Youzan's data report center provides merchants with a rich set of data indicators: 30+ pages, 100+ data reports, and 400+ different types of data indicators. They help merchants run their stores more rationally and scientifically, and directly provide analysis and decision-making support for merchants to use. In addition, the underlying tasks and data tables that run every day number in the thousands.
Faced with such a large data system, how should we, as testers, formulate a quality assurance strategy? This article covers four aspects: 1. understanding the data link; 2. data layer testing; 3. application layer testing; 4. follow-up planning.
I. Understanding the data link
1. Data link introduction
First, here is the overall data architecture diagram of Youzan:

From top to bottom, it can be roughly divided into the application service layer, the data gateway layer, the application storage layer, and the data warehouse, while platforms such as job development and metadata management provide the basic capabilities for data computing, task scheduling, and data query.
The above gives a preliminary introduction to the overall architecture. For quality assurance, the two core parts are the data warehouse and the data applications: they sit on the core path of the data link, change more frequently than the other layers, and therefore carry a relatively high risk of problems.
II. Data layer testing
1. Overview
First, quality assurance for the data layer can be divided into three aspects: data timeliness, completeness, and accuracy.

2. Data timeliness
Data timeliness, as the name suggests, means that data must be produced on time. The three key elements of timeliness are: the scheduled run time, the priority, and the data deadline. A task's priority determines how many computing resources it is allocated, which affects its execution time. The data deadline is the unified standard for the latest time at which data must be produced, and it must be strictly observed.
Of these three elements, the one that falls under the "universal rules" and needs particular attention during the quality assurance stage is the data deadline. Based on the data deadline, the guarantee strategies for timeliness can be divided into two types:
Monitor whether offline data tasks finish on time. This method relies on the monitoring and alerting of Youzan's job development platform: if a data task has not finished by the deadline, alerts are sent by email, enterprise WeChat, phone, and other channels to notify the relevant people.

Check the number of rows in the whole table or in a partition. This approach relies on the interface automation platform: by calling a Dubbo interface, it checks whether the data indicator returned by the interface is 0, thereby monitoring whether the data has been produced.

Second, we can also watch failure and retry counts: when a task fails many times or retries abnormally during execution, an alert can be raised so that the relevant people notice. This alerting supplements the deadline alerts, and the capability is also integrated into Youzan's job development platform.
3. Data completeness
Data completeness, as the name suggests, checks whether the data is complete, focusing on two points: no extra data and no missing data.
No extra data: generally check the whole table and important enumeration values to see whether there is redundant or duplicated data, and whether the primary key is unique.
No missing data: generally check the whole table and important fields (for example primary key fields, enumeration values, dates) to see whether field values are empty or null.
As you can see, data completeness is not that closely tied to the business itself; it is mostly generic content verification of warehouse tables. So, starting from a few basic dimensions, we can divide the test focus into two directions: table level and field level.

Table-level completeness:
Full-table dimension: look at the total number of rows and the total size of the whole table. If the total row count or size stays flat or decreases, the table data may have a problem.
Partition dimension: look at the row count and size of the current day's partition. If it differs too much from previous partitions (too large or too small), the table data may have a problem.
Youzan's metadata management platform has already integrated the relevant data views:

Field-level completeness:
Uniqueness check: ensure that the primary key or certain fields are unique, preventing duplicated data from inflating the results after joining with other tables and thus making the final statistics too large.
For example, to check whether the order number in the ODS-layer order table is unique, write the SQL:
select
count(order_no)
,count(distinct order_no)
from ods.xx_order
If the two counts are equal, order_no is unique in the table; otherwise order_no is not unique in the table and the table data has a problem.
Not-null check: ensure that important fields are not empty, preventing null values from losing rows after joining with other tables and thus making the final statistics too small.
For example, to check whether the order number in the ODS-layer order table contains null, write the SQL:
select
count(*)
from ods.xx_order
where order_no is null
If the result equals 0, order_no contains no null values; if the result is greater than 0, order_no has null values and the table data has a problem.
Enumeration check: ensure that enumerated field values fall within the expected range, preventing dirty business data from causing missing or redundant data types in the final statistics.
For example, to check whether all enumerated values of the shop_type field in the ODS-layer order table meet expectations, write the SQL:
select shop_type from ods.xx_order group by shop_type
Check whether the query results meet expectations, making sure there are no missing or redundant enumeration types.
Data validity check: check whether the data format meets expectations, preventing incorrectly formatted field values from causing wrong or missing statistics. A common example is the date format yyyymmdd.
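As a minimal sketch of such a format check (assuming a hypothetical string field order_date in ods.xx_order that is expected to be in yyyymmdd format), count the rows that violate the format:
select
    count(*)
from ods.xx_order
where order_date is null
   or not (order_date rlike '^[0-9]{8}$')
If the result is greater than 0, some order_date values are not in the expected yyyymmdd format.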
Completeness problems have a big impact on data quality, so the completeness strategy is most applicable to the ODS layer: we want to find and fix unreasonable data at the source, stop the loss in time, and prevent dirty data from entering downstream layers and spreading the pollution further.
In addition, the completeness checks are logically simple and relatively fixed, so with a little abstraction they can be templated. As testers, we therefore prefer to turn data completeness verification into a tool. At present the "data form tool" has already landed at Youzan; here are some of the ideas behind it:
Universal rules for all tables, for example the uniqueness of the table primary key.
For different field types such as numeric, string, enumeration, and date, list the common data validation rules.
Assign a severity to each rule. For example, a non-unique table primary key is recorded as critical; a string field whose null ratio exceeds 70% is recorded as warning.
Based on whether the table data satisfies the above rules, finally produce a visual report, and testers can evaluate data quality from the report content.

4. Data accuracy
Data accuracy, as the name suggests, means the data must be "accurate". "Accuracy" is a rather abstract concept, because it is hard to show how accurate data is through a single strong logical check; much of it lives in perceptual judgment. Accuracy testing is therefore one of the more open-ended directions in the data quality assurance process.
After some summarizing, we can control data accuracy from several angles: checks on the field itself, horizontal data comparison, vertical data comparison, and code review. These test points are also closely tied to the business.

4.1 Self-check
A self-check means checking accuracy using the data itself, without comparing against other data; it is one of the most basic checks. Common self-checks include: checking that a numeric indicator is greater than 0, or that a ratio indicator falls within the 0-1 range. Such basic rules, like the data completeness checks, can also be assisted by the "data form tool".
For example, for the order table, the payment amount must be greater than or equal to 0 and can never be negative; write the SQL:
select
count(pay_price)
from
dw.dws_xx_order
where par = 20211025 and pay_price<0
If the result is 0, all payment amounts are greater than or equal to 0 and expectations are met; otherwise, if the count is greater than 0, the data has a problem.
4.2 Horizontal data comparison within a table
Horizontal comparison within a table means that, in the same table, two or more business-related fields have a certain logical relationship and can therefore be used for data comparison.
For example, for the order table, business analysis easily shows that for any product in any store, the number of orders >= the number of ordering customers; write the SQL:
select
kdt_id
,goods_id
,count(order_no)
,count(distinct buyer_id)
from dw.dws_xx_order
where par = '20211025'
group by kdt_id,goods_id
having count(order_no)<count(distinct buyer_id)
If the query returns no records, there is no case where the number of orders < the number of ordering customers, i.e., the number of orders >= the number of ordering customers always holds, which meets expectations; otherwise, if the query returns one or more records, the expectation is not met.
4.3 Horizontal data comparison across tables
Horizontal comparison across tables means that two or more tables contain fields that are business-related or carry the same business meaning, and these can be used for data comparison:
Comparison between tables of the same type: for Hive payment table A and payment table B, both of which have a payment-amount field, in the same dimension table A's payment amount = table B's payment amount (see the sketch after this list).
Comparison across multiple storage engines: for example, for the payment table of Youzan's data report center, the application layer uses both MySQL and Kylin for active/standby switching, so in the same dimension the Kylin table's payment amount = the MySQL table's payment amount.
Comparison across systems: for example, the data report center and the CRM system both hold customer indicator data, so in the same dimension the data report center's customer indicator = the CRM system's customer indicator.
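Here is a minimal sketch of the same-type comparison, assuming two hypothetical Hive payment tables dw.dwd_pay_a and dw.dwd_pay_b, compared per store (kdt_id) on one partition; any returned row marks a dimension whose amounts diverge:
-- aggregate both tables by store, then join and keep the stores whose payment amounts differ
select
    a.kdt_id
    ,a.pay_amount as pay_amount_a
    ,b.pay_amount as pay_amount_b
from (select kdt_id, sum(pay_price) as pay_amount from dw.dwd_pay_a where par = '20211025' group by kdt_id) a
full outer join (select kdt_id, sum(pay_price) as pay_amount from dw.dwd_pay_b where par = '20211025' group by kdt_id) b
    on a.kdt_id = b.kdt_id
where a.pay_amount <> b.pay_amount
   or a.kdt_id is null
   or b.kdt_id is null
If the query returns no rows, the two tables agree in every store dimension.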
If we dig into the underlying logic of horizontal data comparison, its essence is comparing different fields of two tables with logical operators, which is also fairly easy to abstract into a tool. At present the "data comparison tool" has already landed at Youzan; here are some of the ideas behind it:
Input two tables and set the primary key of each table.
Input the fields to be compared in the two tables and set the comparison operator, such as >, =, <.
Based on the configured rules, produce the passed and failed comparison records as a visual report, and testers can evaluate data quality from the report content.
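As a rough sketch of the SQL such a configured rule might expand to (the table names, the primary key order_no, and the compared field pay_price are assumptions for illustration): join the two tables on their primary keys and list the rows that violate the configured operator:
-- rows where the configured rule "a.pay_price = b.pay_price" fails for the same primary key
select
    a.order_no
    ,a.pay_price as pay_price_a
    ,b.pay_price as pay_price_b
from dw.dws_order_a a
join dw.dws_order_b b
    on a.order_no = b.order_no
where a.pay_price <> b.pay_price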

4.4 Vertical data comparison
Vertical comparison compares upstream and downstream data; the purpose is to ensure that important fields are processed without problems along the upstream-downstream pipeline.
For example, the data warehouse dw layer has an order detail table and the dm layer has an order-count aggregate table; then the statistics of the two in the same dimension should be consistent.
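A minimal sketch of such a check, assuming a hypothetical detail table dw.dwd_order and aggregate table dm.dm_order_cnt, both partitioned by par and keyed by store (kdt_id): recount the orders from the detail table and compare the result with the aggregate table:
-- stores where the dm aggregate disagrees with a recount from the dw detail table
select
    d.kdt_id
    ,d.order_cnt as detail_cnt
    ,m.order_cnt as dm_cnt
from (select kdt_id, count(order_no) as order_cnt from dw.dwd_order where par = '20211025' group by kdt_id) d
join dm.dm_order_cnt m
    on d.kdt_id = m.kdt_id and m.par = '20211025'
where d.order_cnt <> m.order_cnt
An empty result means the two layers are consistent in this dimension.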
4.5 code review
First, in the requirement review stage before the code review, we must clarify the detailed statistical caliber of the data. Here are two examples from actual requirements.
Requirement 1 (a bad example): count the payment amount of all users in the store within the time period. The problem: the requirement description is too brief, the time dimension and filter conditions of the statistics are not explained, and the statistical caliber is unclear; the product manager must be asked to give a clear caliber.
Requirement 2 (a good example): the offline payment amount at the store dimension of the Youzan home domain, covering all online merchants; supports natural day, natural week, and natural month; within the statistical window, the sum of all payment orders (excluding lottery-group orders, gift cards, and distribution/supply orders).
With the requirement clarified, here is a detailed look at some common concerns during code review:
1) Join relations & filter conditions
Whether the table association uses an outer join or an inner join depends on whether the data needs to be filtered.
In the join's on condition, whether the value types on the left and right sides are consistent.
If the relationship is 1:1, whether the join keys of both tables are unique; if not, the join produces a Cartesian product and inflates the data.
Whether the where conditions filter correctly. Taking the requirement above as an example, focus on whether the SQL correctly excludes lottery groups, gift cards, and distribution/supply orders (a sketch follows below).
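As a minimal sketch of what the reviewed SQL for requirement 2 might look like (the table name dw.dwd_pay_order and the fields order_type and is_gift_card are assumptions, not the actual schema):
-- daily offline payment amount per store, excluding lottery groups, gift cards and distribution/supply orders
select
    kdt_id
    ,sum(pay_price) as pay_amount
from dw.dwd_pay_order
where par = '20211025'
  and order_type not in ('lottery_group', 'distribution_supply')  -- exclude lottery groups and distribution/supply orders
  and is_gift_card = 0                                            -- exclude gift cards
group by kdt_id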

2) Statistical caliber of the indicators
Computing data indicators involves two basic concepts:
Cumulative indicators: for example payment amount and page views, which can be computed by simply summing values; for such indicators the SQL generally uses sum.
Non-cumulative indicators: for example the number of visitors, which cannot simply be summed but must be deduplicated before counting; for such indicators the SQL generally uses count(distinct). A sketch covering both kinds follows below.
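A minimal sketch, assuming a hypothetical traffic/order detail table dw.dwd_shop_visit with fields pay_price and visitor_id:
-- cumulative indicator (payment amount) vs non-cumulative indicator (visitor count)
select
    kdt_id
    ,sum(pay_price)             as pay_amount    -- cumulative: simple sum
    ,count(distinct visitor_id) as visitor_cnt   -- non-cumulative: deduplicate, then count
from dw.dwd_shop_visit
where par = '20211025'
group by kdt_id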

3) Inserting data (insert)
Whether re-running is supported, which amounts to checking for the overwrite keyword. Without it, re-running the data (executing the workflow more than once) does not overwrite the dirty data but appends rows to the table, which in turn may double the final statistics.
Whether the order of the inserted fields exactly matches the structure of the target table. We must make sure the field order is written correctly, otherwise the inserted values will be misaligned (a sketch follows below).
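A minimal sketch of a re-runnable insert, assuming a hypothetical target table dm.dm_order_cnt(kdt_id, order_cnt) partitioned by par:
-- insert overwrite makes the task idempotent: re-running replaces the partition instead of appending
insert overwrite table dm.dm_order_cnt partition (par = '20211025')
select
    kdt_id           -- field order must match the target table: kdt_id first
    ,count(order_no) -- then order_cnt
from dw.dwd_order
where par = '20211025'
group by kdt_id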

III. Application layer testing
1. Overview

Basic front-end page and server interface testing is consistent with the focus of ordinary business testing and will not be repeated here. This article focuses on the points that need extra attention when testing "data applications".
2. Downgrade strategy
When a new data table is added to a page, confirm in the requirement and technical review stages whether the "blue bar" function needs to be supported; this belongs to "shifting testing left".
Blue bar introduction: a blue bar at the top of the page tells the merchant that the offline data has not yet been produced, where the "expected output time" = current access time + 2 hours, calculated dynamically.


When testing ratio indicators, focus on the special scenario where the divisor = 0. Pay attention to this point both during the back-end code review and when testing the page functions. At present Youzan handles this situation by showing "-" on the front end.
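One way the divisor = 0 case could be guarded on the data side (a sketch under assumed table and field names; the front end then renders the null as "-"):
-- return null for the conversion rate when the visitor count (the divisor) is 0
select
    kdt_id
    ,case when visitor_cnt = 0 then null
          else order_cnt / visitor_cnt
     end as conversion_rate
from dm.dm_shop_stat
where par = '20211025'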

3. Active/standby strategy
When an active/standby switching strategy exists, during testing pay attention to whether the data is double-written correctly, and whether queries can be switched between the active and standby data sources via configuration.

4. Data security
Pay attention to the permission control of data queries, focusing on horizontal and vertical privilege-escalation scenarios.
IV. Follow-up planning
At present, in the data accuracy comparisons of real projects, the data comparison tool does not support SQL functions, so it can only replace about 50% of manual testing; some complex horizontal and vertical comparisons still require hand-written SQL. The follow-up plan is to support SQL functions such as sum, count, max, and min, raising tool coverage to above 75% and greatly reducing the cost of data comparison.
At present the "data form report" and the "data comparison tool" are mostly used during project testing. The follow-up plan is to turn the form checks and data comparisons into online patrol inspections, combining automation with the data tools to continuously guarantee the quality of warehouse tables.
Currently SQL code review is mainly done manually. We plan to turn some of the basic SQL checks, such as the insert into check, the join-on-condition uniqueness check, and the field insertion-order check, into SQL static scanning, integrate it into the big data testing service, and empower other business lines.