0x01. Introduction to data quality check dimensions
An evaluation rule dimension provides a way to measure and manage information and data. Distinguishing between rule dimensions helps to:
- Match dimensions to business needs and prioritize the order of evaluation;
- Understand what can and cannot be learned from assessing each dimension;
- Better define and manage the sequence of actions in the project plan when time and resources are limited.
Data quality checks are mainly divided into the following rule dimensions:
- Completeness: describes how complete the information is.
- Uniqueness: describes whether the data contains duplicate records; no entity should appear more than once.
- Validity: describes whether a model or data item meets user-defined conditions. It is usually constrained in terms of name, data type, length, value range, and content specification.
- Consistency: describes whether the attributes of the same information subject are the same across different datasets, and whether entities and attributes conform to consistency constraints.
- Accuracy: describes whether the data matches the characteristics of the objective entity it represents (this requires a definitive, accessible, authoritative reference source).
- Timeliness: describes the interval between the occurrence of a business event and the moment the corresponding data is correctly stored and can be viewed normally, also known as data latency. The data should be available as close as possible to the actual time of the business event.
- Credibility: describes whether the data behaves in line with objective patterns.
Each rule dimension may require different measures, timing, and processes, so the time, money, and staff needed to complete an assessment will differ. Data quality is not improved overnight. With a clear understanding of the work required to assess each dimension, select the dimensions and rules that are currently most urgent, and advance the overall management and improvement of data quality step by step, from easy to difficult and from simple to deep. The initial assessment of a rule dimension establishes a baseline; subsequent assessments become part of ongoing testing and information improvement, and part of the business process.
0x02. Completeness
At the data level, the completeness dimension can be subdivided into the following categories:
- Non-null constraint: describes whether the check object contains null values. For example, when a customer opens an account, the customer name is required and cannot be empty.
1. Non-null constraint
A non-null constraint is easy to understand: the field cannot be empty. The check is also relatively easy. Decide which fields need to be checked, then use SQL to query for rows where the column is empty, and rectify the data that is found.
A non-null constraint can also be declared on the column so that empty values cannot be written to the database at all. Where this is supported, it avoids the need for after-the-fact null checks.
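Both approaches described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production check: the `customer` and `customer_strict` tables and their columns are hypothetical, and treating a blank string as empty is an assumption that depends on the business rule.

```python
import sqlite3

# Hypothetical customer table; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id TEXT, cust_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("C001", "Alice"), ("C002", None), ("C003", "")],
)

# After-the-fact check: find rows whose required field is NULL or blank.
# (Counting blank strings as empty is an assumption, not a rule from the text.)
null_rows = conn.execute(
    "SELECT cust_id FROM customer "
    "WHERE cust_name IS NULL OR TRIM(cust_name) = '' "
    "ORDER BY cust_id"
).fetchall()
print(null_rows)

# Preventive approach: declare the column NOT NULL so bad writes are rejected.
conn.execute(
    "CREATE TABLE customer_strict (cust_id TEXT, cust_name TEXT NOT NULL)"
)
try:
    conn.execute("INSERT INTO customer_strict VALUES ('C004', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

Note that a `NOT NULL` declaration alone does not reject blank strings, so a query like the first one may still be useful.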
0x03. Uniqueness
At the data level, the uniqueness dimension can be subdivided into the following categories:
- Uniqueness constraint: describes whether the information about the same objective entity, drawn from different business datasets, is unique after integration. The key is usually a single or composite primary key, such as certificate type + ID number + name; a customer number should likewise be unique.
1. Uniqueness constraint
As a simple example, a uniqueness constraint usually relies on a unique identifier field that can be used to judge uniqueness. In business terms, a unique business entity may also be identified by a combination of several associated business attributes. If data is duplicated in either case, the uniqueness constraint is violated. For a single business primary key, you can group by the key to find the duplicates and remove them. For a combination of business attributes, the check can only be done manually by business personnel.
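The grouping check for a single business key can be sketched as follows. The `customer` table and the `cust_no` column are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; cust_no stands in for the single business primary key.
conn.execute("CREATE TABLE customer (cust_no TEXT, cust_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("C001", "Zhang"), ("C001", "Zhang"), ("C002", "Li")],
)

# Group by the key and keep only keys that occur more than once.
duplicate_keys = conn.execute(
    "SELECT cust_no, COUNT(*) AS cnt FROM customer "
    "GROUP BY cust_no HAVING COUNT(*) > 1 "
    "ORDER BY cust_no"
).fetchall()
print(duplicate_keys)
```

The query only reports which keys are duplicated; deciding which of the duplicate rows to keep is a separate step.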
0x04. Validity
At the data level, the validity dimension can be subdivided into the following categories:
- Code range constraint: describes whether the code value of the check object is in the corresponding code table. For example, if a business rule defines the values of "Gender" as "1 - Unknown gender", "2 - Male", "3 - Female", and "4 - Unexplained gender", then a value such as "A" or "B" indicates a problem with the code range of "Gender";
- Length constraint: describes whether the length of the check object satisfies the length constraint. For example, the "Financial institution code" is specified as 14 digits in the People's Bank of China financial institution coding standard. A value that is not 14 digits long does not satisfy the length constraint and is not a valid "Financial institution code";
- Content specification constraint: describes whether the value of the check object is entered and stored according to certain requirements and specifications. For example, a "Deposit account number" should contain digits only. If letters or other illegal characters appear, it is not a valid "Deposit account number" and does not satisfy the content specification constraint;
- Value range constraint: describes whether the value of the check object is within a predefined range. For example, the "Credit line" should be greater than or equal to 0. A value less than 0 falls outside the value range and is not a valid "Credit line".
1. Code range constraint
Describes whether the code value of the check object is in the corresponding code table.
Example 1: if business rules define gender as only "0: male" and "1: female", the gender field should contain only 0 or 1.
Example 2: the currency code (CURCODE) should only take the value RMB or USD.
In data quality work, a code-value field first needs a unified enterprise-level code table, and source values are then converted through ETL according to the mapping. For reporting, a SQL query for values that are not in the allowed range is sufficient.
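A check against a code table might look like the sketch below. The `gender_code` and `policy_holder` tables are hypothetical; the codes follow Example 1 above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical enterprise-level code table, using the codes from Example 1.
conn.execute("CREATE TABLE gender_code (code TEXT PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO gender_code VALUES (?, ?)", [("0", "male"), ("1", "female")]
)
# Hypothetical business table holding the field to be checked.
conn.execute("CREATE TABLE policy_holder (holder_id TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO policy_holder VALUES (?, ?)",
    [("H1", "0"), ("H2", "1"), ("H3", "A")],
)

# Rows whose code has no match in the code table are out of range.
out_of_range = conn.execute(
    "SELECT h.holder_id, h.gender FROM policy_holder h "
    "LEFT JOIN gender_code c ON h.gender = c.code "
    "WHERE c.code IS NULL "
    "ORDER BY h.holder_id"
).fetchall()
print(out_of_range)
```

Because the join condition never matches a NULL, rows with an empty gender are also reported here; whether that belongs to this check or to the non-null check is a design choice.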
2. Length constraint
Describes whether the length of the check object satisfies the length constraint.
For example, an ID number is 18 digits long.
A length constraint can be enforced by specifying the character length when the table is created. If the business system had no such restriction from the start, the only option is to check the length with SQL, retrieve the abnormal values, and then handle them.
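A length check of this kind can be sketched as follows, assuming a hypothetical `person` table whose `id_no` column should always be 18 characters long. The sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the sample ID numbers are invented for illustration.
conn.execute("CREATE TABLE person (person_id TEXT, id_no TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("P1", "110101199003071234"), ("P2", "1101011990030712")],
)

# Any value whose length differs from 18 violates the length constraint.
bad_length = conn.execute(
    "SELECT person_id, LENGTH(id_no) FROM person "
    "WHERE LENGTH(id_no) <> 18 "
    "ORDER BY person_id"
).fetchall()
print(bad_length)
```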
3. Content specification constraint
Describes whether the value of the check object is entered and stored according to certain requirements and specifications.
For example: balances and dates are usually stored in a fixed type. If the original design stored them as character types, they should be converted to the appropriate type.
It is best to set a unified standard from the beginning and choose the technical type according to the business meaning. If that was not done well at the outset, profile the data by type and then unify its format.
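Profiling character data by type can be sketched in plain Python. The helper names `is_valid_date` and `is_valid_amount`, the sample values, and the `%Y-%m-%d` target format are all assumptions made for illustration.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def is_valid_date(text, fmt="%Y-%m-%d"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def is_valid_amount(text):
    """Return True if text parses as a finite decimal number."""
    try:
        return Decimal(text).is_finite()
    except InvalidOperation:
        return False


# Hypothetical values read from character-typed columns.
raw_dates = ["2024-01-31", "2024-02-30", "20240101"]
raw_amounts = ["100.50", "abc", "-3"]

bad_dates = [v for v in raw_dates if not is_valid_date(v)]
bad_amounts = [v for v in raw_amounts if not is_valid_amount(v)]
print(bad_dates, bad_amounts)
```

Here "-3" counts as a well-formed amount; whether a negative balance is acceptable is a value range question, not a content specification one.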
4. Value range constraint
Describes whether the value of the check object is within the predefined range.
For example: a balance cannot be negative, a date cannot be negative, and so on.
If the business system imposed no such restriction from the start, the only option is to filter and query the data with SQL and then handle the problem data through ETL.
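A range filter of this kind might look like the following sketch, using a hypothetical `account` table and the rule that a credit line must be greater than or equal to 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; credit_line must be >= 0 under the business rule.
conn.execute("CREATE TABLE account (account_id TEXT, credit_line REAL)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [("A1", 5000.0), ("A2", -100.0), ("A3", 0.0)],
)

# Select the rows that fall outside the allowed range.
out_of_bounds = conn.execute(
    "SELECT account_id, credit_line FROM account "
    "WHERE credit_line < 0 "
    "ORDER BY account_id"
).fetchall()
print(out_of_bounds)
```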
0x05. Consistency
At the data level, the consistency dimension can be subdivided into the following categories:
- Equivalence consistency dependency constraint: describes constraint rules on data values between check objects. Under a given rule, the data value of one check object must equal that of one or more other check objects.
- Existence consistency dependency constraint: describes constraint rules on the existence of data values between check objects. The data value of one check object must exist when another check object satisfies a certain condition.
- Logical consistency dependency constraint: describes constraint rules on the logical relationship between data values of check objects. The data value of one check object must satisfy some logical relationship with the data value of another check object (such as greater than, less than, or equal to).
1. Equivalence consistency dependency constraint
This generally refers to foreign key association scenarios.
For example: the policy number in a policy table or a claim table must exist in the policy master table. The same applies to an association between two fields within one table.
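The foreign key style check can be sketched as an orphan search. The `policy_master` and `claim` tables below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical master and detail tables.
conn.execute("CREATE TABLE policy_master (policy_no TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO policy_master VALUES (?)", [("P001",), ("P002",)])
conn.execute("CREATE TABLE claim (claim_no TEXT, policy_no TEXT)")
conn.executemany(
    "INSERT INTO claim VALUES (?, ?)", [("C1", "P001"), ("C2", "P999")]
)

# Claims whose policy number has no counterpart in the master table.
orphan_claims = conn.execute(
    "SELECT c.claim_no, c.policy_no FROM claim c "
    "LEFT JOIN policy_master m ON c.policy_no = m.policy_no "
    "WHERE m.policy_no IS NULL "
    "ORDER BY c.claim_no"
).fetchall()
print(orphan_claims)
```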
2. Existence consistency dependency constraint
This mainly emphasizes business relevance: if a certain state occurs, a certain value must exist.
For example: if the insurance status is "insured", the insurance date should not be empty.
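The insured-status example can be sketched as a conditional null check. The `policy` table, its columns, and the status values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical policy table; status values are illustrative.
conn.execute(
    "CREATE TABLE policy (policy_no TEXT, status TEXT, insure_date TEXT)"
)
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [
        ("P001", "insured", "2024-01-01"),
        ("P002", "insured", None),
        ("P003", "pending", None),
    ],
)

# The date is only required when the status is 'insured'.
missing_dates = conn.execute(
    "SELECT policy_no FROM policy "
    "WHERE status = 'insured' AND insure_date IS NULL "
    "ORDER BY policy_no"
).fetchall()
print(missing_dates)
```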
3. Logical consistency dependency constraint
This mainly emphasizes mutual constraints between fields.
For example: the insurance start time must be less than or equal to the insurance end time.
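A field-to-field comparison of this kind might look like the sketch below. The `policy` table is hypothetical, and the comparison assumes both columns hold ISO-8601 text (`YYYY-MM-DD`), which sorts in chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; dates are stored as ISO-8601 text so that string
# comparison matches chronological order (an assumption of this sketch).
conn.execute(
    "CREATE TABLE policy (policy_no TEXT, start_time TEXT, end_time TEXT)"
)
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [
        ("P001", "2024-01-01", "2024-12-31"),
        ("P002", "2024-06-01", "2024-05-01"),
    ],
)

# Rows where the start is later than the end break the logical rule.
inverted_periods = conn.execute(
    "SELECT policy_no FROM policy "
    "WHERE start_time > end_time "
    "ORDER BY policy_no"
).fetchall()
print(inverted_periods)
```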
0x06. Accuracy
At the data level, accuracy mainly refers to the accuracy of the value: it describes whether the check object matches the characteristics of the objective entity it represents.
For example: the gender code of an insured person is recorded as "0 - female". This satisfies the code range constraint but not the accuracy constraint, because the person is male and the gender code should be "1 - male".
Another example: the service charge for an international letter of guarantee should be booked as "International guarantee fee income", but it is recorded as "Domestic guarantee service fee income".
Accuracy requires not only that the range and content of the data satisfy the validity requirements, but also that the value reflects the objective real world. Valid data is therefore not necessarily accurate, but the converse holds: accurate data is valid.
Accuracy usually requires manual verification by business people or other parties.
In this situation, data quality rules cannot handle the problem directly in a uniform way. The results can only be checked in detail through join queries.
0x07. Timeliness
Timeliness constraint: describes whether the checked data reflects, in a timely manner, the state of the corresponding business at the actual point in time.
For example: the five-level loan classification in the system is updated a few days later than the actual classification. Another example: a wealth management transaction shows as successful in the wealth management system but is not recorded in the core system because of a communication problem.
Timeliness problems arise from multiple systems, communication, and other causes, and usually require manual verification by business or system personnel.
Data synchronization is generally based on technical fields of the business system (such as CREATE_DT), but there may be an interval between the time the business occurred and the value of that field. A simple SQL query can compare the two times to judge whether the timeliness of the data meets requirements.
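Comparing the two times can be sketched as below. Only `CREATE_DT` comes from the text; the `loan_class` table, the `biz_time` column, and the one-day threshold are assumptions. SQLite's `julianday` function returns a day count, so the difference is a delay in days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: biz_time is when the business event happened,
# create_dt is the technical field used for synchronization.
conn.execute(
    "CREATE TABLE loan_class (loan_id TEXT, biz_time TEXT, create_dt TEXT)"
)
conn.executemany(
    "INSERT INTO loan_class VALUES (?, ?, ?)",
    [
        ("L1", "2024-03-01 10:00:00", "2024-03-01 10:05:00"),
        ("L2", "2024-03-01 09:00:00", "2024-03-04 09:00:00"),
    ],
)

MAX_DELAY_DAYS = 1.0  # assumed threshold; set it according to the business need

# Delay in days between the business event and the stored record.
late_rows = conn.execute(
    "SELECT loan_id, "
    "ROUND(julianday(create_dt) - julianday(biz_time), 2) AS delay_days "
    "FROM loan_class "
    "WHERE julianday(create_dt) - julianday(biz_time) > ? "
    "ORDER BY loan_id",
    (MAX_DELAY_DAYS,),
).fetchall()
print(late_rows)
```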
0x08. Credibility
Data credibility constraint: describes whether the daily or monthly incremental data in data synchronization is in line with theoretical or empirical values.
For example: the daily partition of insurance policy data generally grows by about 10% over the previous day. If growth suddenly becomes 200%, data synchronization may have a problem.
Another example: total monthly revenue generally rises according to a certain pattern. If the data fluctuates suddenly, there may be a problem.
Credibility requires that the overall fluctuation of the data follows basic objective patterns. In general, this can be checked by comparing data over 7, 15, and 30 days, and investigating in detail if there is a large gap.
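One possible way to compare against 7, 15, and 30 day windows is sketched below in plain Python. The function names, the use of a trailing mean as the baseline, and the 50% tolerance are all assumptions; a real rule would be tuned to the growth pattern of the data in question.

```python
from statistics import fmean


def window_ratios(daily_counts, windows=(7, 15, 30)):
    """Ratio of the latest value to the mean of the preceding `window` days."""
    latest = daily_counts[-1]
    history = daily_counts[:-1]
    ratios = {}
    for window in windows:
        # Skip windows for which there is not enough history yet.
        if len(history) >= window:
            ratios[window] = latest / fmean(history[-window:])
    return ratios


def is_suspicious(daily_counts, tolerance=0.5):
    """Flag the latest value if it deviates from any window mean by more
    than `tolerance` (0.5 means 50%, an assumed threshold)."""
    ratios = window_ratios(daily_counts)
    return any(abs(ratio - 1.0) > tolerance for ratio in ratios.values())


# Hypothetical daily row counts: 30 stable days, then a sudden spike.
stable = [1000] * 31
spiked = [1000] * 30 + [3000]

print(window_ratios(spiked))
print(is_suspicious(stable), is_suspicious(spiked))
```

The sketch assumes the window means are non-zero; empty partitions would need separate handling.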