When we talk about data quality, what are we talking about?

2020-11-08 23:46:00 Station

0x01. Introduction to data quality check dimensions

Evaluation rule dimensions provide a way to measure and manage information and data. Distinguishing between rule dimensions helps to:

  • Match dimensions to business needs and prioritize the order of evaluation;
  • Understand what can and cannot be learned from assessing each dimension;
  • Better define and manage the sequence of actions in the project plan when time and resources are limited.

Data quality checks are mainly divided into the following rule dimensions:

  • Completeness: describes how complete the information is.
  • Uniqueness: describes whether the data contains duplicate records; no entity should appear more than once.
  • Validity: describes whether a model or data item meets user-defined conditions. It is usually constrained in terms of name, data type, length, range, value domain, content specification, and so on.
  • Consistency: describes whether the attributes of the same information subject are the same across different datasets, and whether entities and attributes conform to consistency constraints.
  • Accuracy: describes whether the data matches the characteristics of the objective entity it represents (this requires a definitive and accessible authoritative reference source).
  • Timeliness: describes the interval between the occurrence of a business event and the point at which the corresponding data is correctly stored and can be viewed normally, also known as data latency. Data should become available as close as possible to the actual time of the business event.
  • Credibility: describes whether the data behaves in line with objective patterns.

Each rule dimension may require different measures, timing, and processes, so the time, money, and human resources needed to complete an assessment will differ. Data quality is not improved overnight. With a clear understanding of the work required to assess each dimension, select the checks and rules that are currently most urgent, and advance the overall management and improvement of data quality step by step, from easy to difficult and from shallow to deep. The initial assessment of a rule dimension establishes a baseline; subsequent assessments become part of ongoing monitoring and information improvement, and part of the business process.

0x02. Completeness

At the data level, the completeness dimension can be subdivided into the following category:

  • Non-null constraint: describes whether the check object contains null values. For example, when a customer opens an account, the customer name is required and cannot be empty.

1. Non-null constraint

Non-null constraints are easy to understand: the field cannot be empty. The check is also fairly easy: specify the fields to be checked, query for rows where the column value is empty using SQL, and send the rows found for rectification.
A non-null constraint can also be declared in the database so that empty values cannot be written in the first place. Where this is supported, it avoids having to check for nulls after the fact.
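
The after-the-fact check described above can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module; the customer table, its columns, and the helper name find_null_rows are hypothetical, and treating an empty string as "empty" is an assumption that may not match every system.

```python
import sqlite3

def find_null_rows(conn, table, column, key):
    """Return the key values of rows whose `column` is NULL or an empty string."""
    # Table and column names are interpolated directly into the SQL text,
    # so they must come from trusted configuration, never from user input.
    sql = (
        f"SELECT {key} FROM {table} "
        f"WHERE {column} IS NULL OR TRIM({column}) = '' "
        f"ORDER BY {key}"
    )
    return [row[0] for row in conn.execute(sql)]

# Hypothetical sample data: customer_name is required when an account is opened.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, customer_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Alice"), (2, None), (3, ""), (4, "Bob")],
)

null_ids = find_null_rows(conn, "customer", "customer_name", "customer_id")
print(null_ids)  # customers whose name must be rectified
```

Declaring the column as NOT NULL when the table is created would reject the NULL row at write time, although an empty string would still get through.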

0x03. Uniqueness

At the data level, the uniqueness dimension can be subdivided into the following category:

  • Uniqueness constraint: describes that the information about the same objective entity, drawn from different business datasets, is unique after integration. The target is usually a single or composite primary key; for example, the combination of certificate type + ID number + name should be unique, and the customer number should be unique.

1. Uniqueness constraint

In the simplest case, a uniqueness constraint has a unique identifier field that can be used to judge uniqueness. In business terms, a unique business entity may also be determined by a combination of several related business attributes. If duplicated data appears in either case, the uniqueness constraint is violated. For a single business primary key, you can find the duplicates by grouping on the key and then remove them; for a combination of business attributes, the check can only be done manually by business staff.
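
As a minimal sketch of the grouping check, assuming a hypothetical customer table in an in-memory SQLite database: the first query finds duplicated single keys, and the second groups on the combined attributes. The second query only surfaces exact matches, so it is best read as a way to produce candidates for the manual review mentioned above, not a replacement for it.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_no TEXT, cert_type TEXT, cert_no TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        ("C001", "ID", "110101", "Zhang"),
        ("C002", "ID", "110102", "Li"),
        ("C002", "ID", "110103", "Wang"),   # same customer_no as the row above
        ("C003", "ID", "110101", "Zhang"),  # same certificate and name as C001
    ],
)

# Single business key: group on the key and keep groups with more than one row.
dup_keys = conn.execute(
    "SELECT customer_no, COUNT(*) FROM customer "
    "GROUP BY customer_no HAVING COUNT(*) > 1 ORDER BY customer_no"
).fetchall()

# Combined attributes: exact-match candidates to hand over for manual review.
dup_composite = conn.execute(
    "SELECT cert_type, cert_no, name, COUNT(*) FROM customer "
    "GROUP BY cert_type, cert_no, name HAVING COUNT(*) > 1"
).fetchall()

print(dup_keys)
print(dup_composite)
```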

0x04. Validity

At the data level, the validity dimension can be subdivided into the following categories:

  • Code range constraint: describes whether the code value of the check object is in the corresponding code table. For example, if a business rule defines the values of "gender" as "1 - unknown gender", "2 - male", "3 - female", and "4 - unspecified gender", then a value such as "A" or "B" indicates a problem with the code range of "gender";
  • Length constraint: describes whether the length of the check object satisfies the length constraint. For example, the "financial institution code" is specified as 14 digits long in the People's Bank of China's financial institution coding standard; a value that is not 14 digits long does not satisfy the length constraint and is not a valid "financial institution code";
  • Content specification constraint: describes whether the value of the check object is entered and stored according to certain requirements and specifications. For example, a "deposit account number" should contain digits only; if letters or other illegal characters appear, it is not a valid "deposit account number" and does not satisfy the content specification constraint;
  • Value range constraint: describes whether the value of the check object is within the predefined range. For example, the "credit line" should be greater than or equal to 0; a value less than 0 falls outside the value range and is not a valid "credit line".

1. Code range constraint

Describes whether the code value of the check object is in the corresponding code table.
Example 1: if business rules define gender as only "0: male" and "1: female", the gender field should only contain 0 or 1.
Example 2: the currency code (CURCODE) should only take the value RMB or USD.
For data quality, code value fields first need a unified enterprise-level code table, and source values are then converted in ETL according to the mapping. For reporting, a SQL query for values that fall outside the range is enough.
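
One possible way to query for out-of-range values is to join the data against the code table and keep the rows that find no match. The sketch below assumes hypothetical code_gender and insured tables and uses the 0/1 gender codes from example 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical enterprise-level code table for gender (codes from example 1).
conn.execute("CREATE TABLE code_gender (code TEXT PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO code_gender VALUES (?, ?)", [("0", "male"), ("1", "female")]
)

# Hypothetical business table whose gender column should only hold listed codes.
conn.execute("CREATE TABLE insured (insured_id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO insured VALUES (?, ?)", [(1, "0"), (2, "1"), (3, "A"), (4, "B")]
)

# Rows with no matching code come back with c.code IS NULL.
# A NULL gender would also be returned here; that case belongs to completeness.
out_of_range = conn.execute(
    """
    SELECT i.insured_id, i.gender
    FROM insured AS i
    LEFT JOIN code_gender AS c ON c.code = i.gender
    WHERE c.code IS NULL
    ORDER BY i.insured_id
    """
).fetchall()

print(out_of_range)
```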

2. Length constraint

Describes whether the length of the check object satisfies the length constraint.
For example, an ID number is 18 characters long.
A length constraint can be enforced by specifying the character length when the table is created. If the business system had no such restriction from the start, the only option is to find the abnormal values with a SQL length check and then deal with them.
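
A length check of this kind might look like the sketch below. The person table and the placeholder ID values are made up for illustration; only the expected length of 18 comes from the example above.

```python
import sqlite3

EXPECTED_LENGTH = 18  # length of an ID number, as in the example above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, id_number TEXT)")
# Placeholder values built from repeated characters, not real ID numbers.
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, "1" * 18), (2, "1" * 14), (3, "1" * 19)],
)

# Return every row whose stored length differs from the expected length.
bad_length = conn.execute(
    "SELECT person_id, LENGTH(id_number) FROM person "
    "WHERE LENGTH(id_number) <> ? ORDER BY person_id",
    (EXPECTED_LENGTH,),
).fetchall()

print(bad_length)
```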

3. Content specification constraint

Describes whether the value of the check object is entered and stored according to certain requirements and specifications.
For example, balances and dates are usually stored in a fixed type; if the original design stored them as character types, they should be adjusted to the corresponding type.
Ideally, a unified standard is set up from the beginning and the technical type is specified according to the business meaning. If this was not done well at the start, the data can be profiled by type and converted to a unified format.
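
When values were stored as character strings, profiling them against the intended format could look roughly like this. The helper names, the sample values, and the choice of YYYY-MM-DD as the target date format are all assumptions made for the sketch.

```python
import re
from datetime import datetime

def is_digits_only(value):
    """True if the string consists solely of ASCII digits (e.g. an account number)."""
    return re.fullmatch(r"[0-9]+", value) is not None

def is_iso_date(value):
    """True if the string parses as a calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Hypothetical values pulled from character-typed columns.
accounts = ["6222020200012345", "6222-0202", "62220202000A"]
dates = ["2020-11-08", "2020/11/08", "2020-13-01"]

invalid_accounts = [a for a in accounts if not is_digits_only(a)]
invalid_dates = [d for d in dates if not is_iso_date(d)]

print(invalid_accounts)
print(invalid_dates)
```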

4. Value range constraint

Describes whether the value of the check object is within the predefined range.
For example, a balance cannot be negative, a date cannot be negative, and so on.
If the business imposed no restrictions at the start, the only option is to filter the data with SQL queries and handle the problem data set in ETL.
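
A filter query for out-of-range values can be as simple as the sketch below, which assumes a hypothetical account table with balance and credit_line columns that must both be non-negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (account_id INTEGER, balance REAL, credit_line REAL)"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(1, 100.0, 5000.0), (2, -20.5, 1000.0), (3, 0.0, -1.0)],
)

# Select the accounts that violate either lower bound; zero is allowed.
out_of_bounds = [
    row[0]
    for row in conn.execute(
        "SELECT account_id FROM account "
        "WHERE balance < 0 OR credit_line < 0 ORDER BY account_id"
    )
]

print(out_of_bounds)
```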

0x05. Consistency

At the data level, the consistency dimension can be subdivided into the following categories:

  • Equivalence consistency dependency constraint: describes constraint rules on data values between check objects. The data value of one check object must, under certain rules, be equal to that of one or more other check objects.
  • Existence consistency dependency constraint: describes constraint rules on the existence of data values between check objects. The data value of one check object must exist when another check object satisfies a certain condition.
  • Logical consistency dependency constraint: describes constraint rules on the logical relationship between data values of check objects. The data value of one check object must satisfy some logical relationship (such as greater than, less than, or equal to) with the data value of another check object.

1. Equivalence consistency dependency constraint

This generally refers to foreign-key associations.
For example, the policy numbers in the policy table and the claim table must exist in the policy master table. The same applies to an association between two fields within the same table.
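
The foreign-key style check can be sketched as a search for orphan rows. The policy_master and claim tables below are hypothetical stand-ins for the tables named in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy_master (policy_no TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE claim (claim_id INTEGER, policy_no TEXT)")
conn.executemany("INSERT INTO policy_master VALUES (?)", [("P001",), ("P002",)])
conn.executemany(
    "INSERT INTO claim VALUES (?, ?)", [(1, "P001"), (2, "P003"), (3, "P002")]
)

# Claims whose policy number has no counterpart in the master table.
orphan_claims = conn.execute(
    """
    SELECT c.claim_id, c.policy_no
    FROM claim AS c
    WHERE NOT EXISTS (
        SELECT 1 FROM policy_master AS m WHERE m.policy_no = c.policy_no
    )
    ORDER BY c.claim_id
    """
).fetchall()

print(orphan_claims)
```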

2. Existence consistency dependency constraint

This mainly emphasizes business dependencies: if a certain status occurs, a certain value must be present.
For example, if the insurance status is "insured", the insurance date must not be empty.
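
A conditional existence rule like this one translates into a query with two predicates. In the sketch below the policy table, its column names, and the status literal 'insured' are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy (policy_no TEXT, insurance_status TEXT, insurance_date TEXT)"
)
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [
        ("P001", "insured", "2020-01-01"),
        ("P002", "insured", None),   # violates the rule
        ("P003", "pending", None),   # allowed: the condition is not triggered
    ],
)

# The date is only required when the status is 'insured'.
missing_date = [
    row[0]
    for row in conn.execute(
        "SELECT policy_no FROM policy "
        "WHERE insurance_status = 'insured' AND insurance_date IS NULL "
        "ORDER BY policy_no"
    )
]

print(missing_date)
```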

3. Logical consistency dependency constraint

This mainly emphasizes mutual constraints between fields.
For example, the insurance start time must be less than or equal to the insurance end time.
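
A field-to-field comparison of this kind could be checked as follows. The sketch assumes a hypothetical policy table that stores both times as ISO-8601 text (YYYY-MM-DD), which is what makes a plain string comparison order the dates correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (policy_no TEXT, start_time TEXT, end_time TEXT)")
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [
        ("P001", "2020-01-01", "2020-12-31"),
        ("P002", "2020-06-01", "2020-05-31"),  # starts after it ends
        ("P003", "2020-03-01", "2020-03-01"),  # equal times are allowed
    ],
)

# ISO-8601 strings sort chronologically, so '>' compares the dates directly.
bad_order = [
    row[0]
    for row in conn.execute(
        "SELECT policy_no FROM policy WHERE start_time > end_time ORDER BY policy_no"
    )
]

print(bad_order)
```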

0x06. Accuracy

At the data level, accuracy mainly refers to the accuracy of values: it describes whether the check object matches the characteristics of the objective entity it represents.
For example, the gender code of an insured person is recorded as "0 - female". This satisfies the code range constraint but not the accuracy constraint, because the person is actually male and the gender code should be "1 - male".
Another example: the service charge for an international letter of guarantee should be booked as international guarantee fee income, but is recorded as domestic guarantee fee income.
Accuracy requires not only that the range and content of the data satisfy the validity requirements, but also that the value reflects the objective real world. It follows that valid data is not necessarily accurate, whereas accurate data is always valid.
Accuracy usually requires manual verification by business staff or other parties.

In this situation, data quality rules cannot handle the data uniformly and directly; the results can only be checked in detail through join queries.

0x07. Timeliness

Timeliness constraint: describes whether the data under inspection reflects, in a timely manner, the state of the business at the corresponding point in time.
For example, the five-level loan classification in the system lags the actual classification by a few days. Another example: a wealth management transaction shows as successful in the wealth management system but is not recorded in the core system because of a communication problem.
Timeliness problems arise from the involvement of multiple systems, communication issues, and similar causes, and usually require manual verification by business or system staff.
Data synchronization is generally driven by technical fields of the business system (for example, CREATE_DT), but there may be an interval between the time the business event occurred and the time in that field. A simple SQL query can compare the two times to judge whether the timeliness of the data meets requirements.
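
Comparing the two times might look like the sketch below. Only the create_dt column echoes the CREATE_DT field mentioned above; the loan_classification table, the biz_time column, and the 24-hour threshold are assumptions chosen for illustration.

```python
import sqlite3

MAX_LAG_HOURS = 24  # assumed tolerance; the real threshold depends on the business

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loan_classification (loan_id INTEGER, biz_time TEXT, create_dt TEXT)"
)
conn.executemany(
    "INSERT INTO loan_classification VALUES (?, ?, ?)",
    [
        (1, "2020-11-01 10:00:00", "2020-11-01 10:05:00"),  # stored 5 minutes later
        (2, "2020-11-01 10:00:00", "2020-11-04 09:00:00"),  # stored about 3 days later
    ],
)

# julianday() returns fractional days, so the difference times 24 is the lag in hours.
late_loans = [
    row[0]
    for row in conn.execute(
        "SELECT loan_id FROM loan_classification "
        "WHERE (julianday(create_dt) - julianday(biz_time)) * 24.0 > ? "
        "ORDER BY loan_id",
        (MAX_LAG_HOURS,),
    )
]

print(late_loans)
```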

0x08. Credibility

Data credibility constraint: describes whether the daily or monthly incremental data in data synchronization is in line with theoretical or empirical values.
For example, the daily partition of policy data generally grows by about 10% over the previous day; if growth suddenly jumps to 200%, there may be a problem with data synchronization.
Another example: total monthly revenue generally rises according to a certain pattern; if the data fluctuates suddenly, there may be a problem.
Credibility requires that the overall fluctuation of the data follows basic objective patterns. In general, it can be checked by comparing data over 7, 15, and 30 days; if there is a large gap, investigate the problem in detail.
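
One simple way to compare against the 7, 15, and 30 day windows is to measure how far the latest daily count deviates from the average of each preceding window. The function names, the 50% tolerance, and the sample counts below are assumptions for the sketch; a real rule would be tuned to the data, and a window whose average is zero would need separate handling.

```python
from statistics import mean

def deviation_from_window(daily_counts, window):
    """Relative deviation of the latest count from the mean of the preceding `window` days."""
    if len(daily_counts) < window + 1:
        raise ValueError("not enough history for this window")
    baseline = mean(daily_counts[-(window + 1):-1])
    return (daily_counts[-1] - baseline) / baseline

def flag_windows(daily_counts, windows=(7, 15, 30), tolerance=0.5):
    """Return the windows for which the latest count deviates by more than `tolerance`."""
    return [
        w
        for w in windows
        if len(daily_counts) > w
        and abs(deviation_from_window(daily_counts, w)) > tolerance
    ]

# Hypothetical history: 30 steady days, then a day with three times the usual volume.
spike_history = [1000] * 30 + [3000]
steady_history = [1000] * 30 + [1100]

flagged = flag_windows(spike_history)
not_flagged = flag_windows(steady_history)

print(flagged)
print(not_flagged)
```

A flagged window is only a prompt for the detailed investigation described above, not proof that synchronization failed.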

Copyright notice
This article was written by [Station]. Please include a link to the original when reposting. Thank you.