How to improve data quality
2022-07-02 00:07:00 【000X000】
1. Preface
The key elements of data quality assurance are data quality rules, data quality indicators, data exploration, the data assurance mechanism, and data cleaning. Whether you already work on data quality or are planning to, studying them in detail should help.
This chapter covers data quality fundamentals, data quality rules and indicators (template download attached), data exploration (template download attached), the data assurance mechanism, data cleaning (template download attached), and common quality problems (document download attached).
2. Data quality fundamentals
Data quality management (Data Quality Management) refers to the management activities of identifying, measuring, monitoring, and giving early warning of the data quality problems that may arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application, and retirement), and to further improving data quality by raising the organization's management level.
The six most critical dimensions of data quality are:
1) Completeness: no data is missing or omitted during entry and transmission. This covers entity completeness, attribute completeness, record completeness, and field-value completeness.
2) Timeliness: the relevant data is recorded and transmitted in time to meet the business's time requirements for obtaining information.
3) Validity: the value, format, and presentation of the data meet the requirements of the data definition and the business definition.
4) Consistency: data and information are recorded and transmitted according to unified data standards. This mainly shows in whether data records are standardized and whether the data is logically consistent.
5) Uniqueness: the same piece of data has exactly one unique identifier.
6) Accuracy: the original data is recorded truthfully and accurately, with no false data or information.
3. Data quality rules and data quality indicators
Data quality rules are the core of data quality work. Whether the rules and indicators are designed completely and reasonably determines the quality of the data. The following is a version I compiled from Huawei's The Way of Data, The Way of Digital Transformation of Industrial Enterprises, and my own experience. If these rules are put in place, data quality should be assured.
Object | Quality characteristic | Rule type | Indicator |
Single column | Completeness | Not-null rule | Null rate |
Single column | Validity | Syntax constraint | 1 - outlier ratio of sample records |
Single column | Validity | Format specification | |
Single column | Validity | Length constraint | |
Single column | Validity | Range constraint | |
Single column | Accuracy | Fact reference standard | Ratio of true records in sample records |
Cross column | Completeness | Should-be-null rule | |
Cross column | Timeliness | Timely loading | Ratio of sample records meeting time requirements |
Cross column | Consistency | Single-table equivalence consistency constraint | |
Cross column | Consistency | Single-table logical consistency constraint | |
Cross row | Uniqueness | Record uniqueness | |
Cross row | Consistency | Hierarchical consistency constraint | |
Cross table | Consistency | Foreign-key association constraint | Ratio of sample records whose foreign key has no corresponding primary key |
Cross table | Consistency | Cross-table equivalence consistency constraint | |
Cross table | Consistency | Cross-table logical consistency constraint | |
Cross system | Consistency | Cross-system record consistency constraint | Matching rate of sample records with other systems |
Cross system | Timeliness | Timely loading | Ratio of sample records meeting time requirements |
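Several of the indicators in the table can be computed with a few lines of code. The following is a minimal sketch in plain Python, not the method of any of the cited books: the function names, the sample `orders` and `customers` tables, and the choice to treat `None` and the empty string as missing are all assumptions made for illustration.

```python
from typing import Any, Callable

Row = dict[str, Any]


def is_missing(value: Any) -> bool:
    """Assumption: None and the empty string both count as missing."""
    return value is None or value == ""


def null_rate(rows: list[Row], column: str) -> float:
    """Single column, completeness: share of records with no value."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if is_missing(r.get(column))) / len(rows)


def valid_ratio(rows: list[Row], column: str,
                is_valid: Callable[[Any], bool]) -> float:
    """Single column, validity: 1 - outlier ratio among non-missing values."""
    values = [r[column] for r in rows if not is_missing(r.get(column))]
    if not values:
        return 1.0
    return sum(1 for v in values if is_valid(v)) / len(values)


def duplicate_keys(rows: list[Row], key: str) -> set:
    """Cross row, uniqueness: key values that occur more than once."""
    seen, dupes = set(), set()
    for r in rows:
        k = r.get(key)
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes


def orphan_ratio(child_rows: list[Row], fk: str,
                 parent_rows: list[Row], pk: str) -> float:
    """Cross table, consistency: share of child records whose foreign key
    has no corresponding primary key in the parent table."""
    if not child_rows:
        return 0.0
    parent_keys = {p[pk] for p in parent_rows}
    orphans = sum(1 for c in child_rows if c.get(fk) not in parent_keys)
    return orphans / len(child_rows)


# Hypothetical sample data, for illustration only.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": None},
    {"order_id": 11, "customer_id": 3, "amount": -5.0},
    {"order_id": 12, "customer_id": 1, "amount": 40.0},
]

print(null_rate(orders, "amount"))
print(valid_ratio(orders, "amount", lambda v: v >= 0))
print(duplicate_keys(orders, "order_id"))
print(orphan_ratio(orders, "customer_id", customers, "id"))
```

Each function maps to one row group of the table, so a rule catalogue can be built up by pairing a column name with one of these checks and a threshold.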
4. Data exploration
Data exploration is a very important step in data quality assurance and is the foundation of design. By ruling out objective causes early, it lets good design improve both efficiency and quality. Without data exploration, data projects typically go through repeated rework, which can lead to staff turnover, difficult handovers, hard maintenance, long delivery cycles, and other problems.
Probe item | Analytical significance | Analysis point | Analysis point interpretation |
Completeness analysis | Ensure the reliability of the analysis | Number of null records | Number of records with no value for the probed field at the probe time point |
Completeness analysis | | Total number of records | Total number of records for the probed field at the probe time point |
Completeness analysis | | Missing rate | Proportion of records with missing information in the total records of the probed field at the probe time point |
Completeness analysis | | Null value alert | Give an early warning if the missing rate of the probed field at the probe time point is higher than 10% |
Completeness analysis | | Primary key uniqueness | Check whether the primary key field has duplicate records at the probe time point |
Range analysis | Analyze whether there is abnormal data | Maximum value | For numeric and date fields, the maximum value at the probe time point |
Range analysis | | Minimum value | For numeric and date fields, the minimum value at the probe time point |
Enumeration value analysis | List all enumeration values of the probed field | Enumeration range | The defined enumeration values of the attribute field |
Enumeration value analysis | | Actual enumeration values | The actual enumeration values of the attribute field and their distribution at the probe time point |
Enumeration value analysis | | Abnormal proportion | At the probe time point, the proportion of records whose value falls outside the defined enumeration range in the total number of records |
Logical exploration | Business logic | Check whether the field follows the applicable business logic | |
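The probe items above translate fairly directly into a small profiling routine. The sketch below is one possible way to do it in plain Python, under stated assumptions: `profile_column`, `primary_key_is_unique`, and the sample rows are hypothetical names and data, `None` and the empty string are treated as missing, and the 10% alert threshold is taken from the table.

```python
from collections import Counter
from typing import Any, Optional

NULL_ALERT_THRESHOLD = 0.10  # early warning when the missing rate exceeds 10%


def profile_column(rows: list[dict[str, Any]], column: str,
                   allowed_values: Optional[set] = None) -> dict[str, Any]:
    """Completeness, range, and (optionally) enumeration analysis for one field."""
    values = [r.get(column) for r in rows]
    total = len(values)
    # Assumption: None and "" both count as "no value".
    present = [v for v in values if v is not None and v != ""]
    null_count = total - len(present)
    missing_rate = null_count / total if total else 0.0
    report: dict[str, Any] = {
        "total_records": total,
        "null_records": null_count,
        "missing_rate": missing_rate,
        "null_alert": missing_rate > NULL_ALERT_THRESHOLD,
        # min/max assume the present values are mutually comparable
        # (all numbers, all dates, or all strings).
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }
    if allowed_values is not None:
        distribution = Counter(present)
        outside = sum(n for v, n in distribution.items()
                      if v not in allowed_values)
        report["actual_values"] = dict(distribution)
        report["abnormal_ratio"] = outside / total if total else 0.0
    return report


def primary_key_is_unique(rows: list[dict[str, Any]], key: str) -> bool:
    """True when no primary key value appears more than once."""
    keys = [r.get(key) for r in rows]
    return len(keys) == len(set(keys))


# Hypothetical sample data, for illustration only.
sample = [
    {"id": 1, "status": "open", "age": 34},
    {"id": 2, "status": "closed", "age": None},
    {"id": 3, "status": "unknown", "age": 51},
    {"id": 4, "status": "open", "age": 27},
]

age_report = profile_column(sample, "age")
status_report = profile_column(sample, "status", {"open", "closed"})
print(age_report)
print(status_report)
print(primary_key_is_unique(sample, "id"))
```

Logical exploration is left out on purpose: business-logic checks depend on the rules of the specific business and cannot be written generically.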
5. Data quality assurance mechanism
Continuous improvement of data quality depends on an assurance mechanism. Only automated, routine, continuous monitoring of data quality can keep improving it. Data quality assurance mainly includes the following key steps:
Design quantitative indicators -> Design quality scoring rules -> Design score assessment -> Monitor abnormal data -> Display indicators -> Push reminders to the responsible persons according to the rules
Example: if the null rate exceeds 5%, record 1 point; issue a daily null-rate indicator warning and a daily department-wide notification; the score affects the year-end assessment.
This part needs to be designed in detail according to the actual situation of the company .
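As a rough illustration of how the scoring and reminder steps could fit together, here is a minimal Python sketch. It only models the example rule (null rate above 5% records 1 point); the `ScoringRule` class, the `owner` field, the `"data-team"` value, and the reminder wording are all hypothetical, and actual delivery of the reminder is left out because it depends on the company's tooling.

```python
from dataclasses import dataclass


@dataclass
class ScoringRule:
    indicator: str    # name of the monitored indicator
    threshold: float  # a breach occurs when the measured value exceeds this
    points: int       # points recorded for each breach
    owner: str        # hypothetical responsible person or team


def evaluate(rules: list[ScoringRule],
             measurements: dict[str, float]) -> list[dict]:
    """Return one alert per rule whose measured indicator exceeds its threshold."""
    alerts = []
    for rule in rules:
        value = measurements.get(rule.indicator)
        if value is not None and value > rule.threshold:
            alerts.append({
                "indicator": rule.indicator,
                "value": value,
                "threshold": rule.threshold,
                "points": rule.points,
                "owner": rule.owner,
            })
    return alerts


def total_points(alerts: list[dict]) -> int:
    """Sum of the points recorded by a batch of alerts."""
    return sum(a["points"] for a in alerts)


def format_reminder(alert: dict) -> str:
    """Build the reminder text; sending it is outside this sketch."""
    return (f"[data quality] {alert['indicator']} = {alert['value']:.1%} "
            f"exceeds {alert['threshold']:.1%}; "
            f"{alert['points']} point(s) recorded for {alert['owner']}")


# The example rule from the text: null rate > 5% records 1 point.
rules = [ScoringRule("null_rate", 0.05, 1, "data-team")]
alerts = evaluate(rules, {"null_rate": 0.08})
for alert in alerts:
    print(format_reminder(alert))
```

Run daily, the accumulated points per owner would give the figure that feeds the periodic assessment.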
6. Data cleaning
Data cleaning is the process of re-examining and verifying data in order to remove duplicate information, correct existing errors, and ensure data consistency. The data to be cleaned fall into three main categories: incomplete data, erroneous data, and duplicate data.
If front-end controls are not in place but high-quality data is still required, data cleaning is the only option. It is a key step in improving the quality of existing (stock) data, and cleaned data better supports data analysis and data insight.
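The three categories can be separated in a single pass over the data. The sketch below is an assumption-laden illustration rather than a complete cleaning process: the `clean` function, its parameters, and the sample records are hypothetical, and it only sets problem rows aside with a reason instead of correcting them, since the right correction depends on the business.

```python
from typing import Any, Callable

Row = dict[str, Any]


def clean(rows: list[Row], key: str, required: list[str],
          validators: dict[str, Callable[[Any], bool]]
          ) -> tuple[list[Row], list[tuple[Row, str]]]:
    """Split rows into kept rows and (row, reason) rejects.

    Reasons follow the three categories: incomplete, erroneous, duplicate.
    """
    kept: list[Row] = []
    rejected: list[tuple[Row, str]] = []
    seen_keys = set()
    for row in rows:
        # Incomplete: a required field is absent, None, or empty.
        if any(row.get(c) is None or row.get(c) == "" for c in required):
            rejected.append((row, "incomplete"))
        # Erroneous: a present value fails its validation function.
        elif any(row.get(c) is not None and not check(row[c])
                 for c, check in validators.items()):
            rejected.append((row, "erroneous"))
        # Duplicate: the key was already kept from an earlier row.
        elif row.get(key) in seen_keys:
            rejected.append((row, "duplicate"))
        else:
            seen_keys.add(row.get(key))
            kept.append(row)
    return kept, rejected


# Hypothetical sample data, for illustration only.
raw = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "", "age": 41},
    {"id": 3, "email": "c@example.com", "age": -4},
    {"id": 1, "email": "a@example.com", "age": 30},
]

kept, rejected = clean(
    raw,
    key="id",
    required=["id", "email"],
    validators={"age": lambda v: 0 <= v <= 150},
)
print(kept)
print([reason for _, reason in rejected])
```

Keeping the rejected rows together with their reason makes it possible to route them back to the source system or to a manual correction step instead of silently dropping them.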
7. Conclusion
The above is my understanding and practical experience of data quality. If it helps you, please follow and share. If you have any questions, please leave a message or add me to the WeChat group so we can discuss them and keep building a data governance system together.