Data Lineage: Use Cases and Extension Practice
2022-06-09 05:35:00 【Zhuojiu South Street】
Preface
Data lineage describes where data comes from and where it goes, and how it is transformed across multiple processing steps. It is a foundational capability for an organization that wants to get value out of its data. Starting from an overview of ByteDance's data pipeline, this article introduces the application scenarios of data lineage at ByteDance, its overall design, its data model, and the metrics used to measure it.
Introduction to ByteDance's data pipeline
To make the scope of the discussion clear, we first introduce ByteDance's data pipeline.
ByteDance's data comes from two sources:
01、Client-side data: sent by app and web clients through the event-tracking SDK, passed through LogService, and finally landed in MQ;
02、Business data: apps, web clients, and third-party services write through various application services into RDS; the data in RDS is then ingested into MQ via Binlog;
There is also a routing step between MQs that performs format conversion, stream splitting, and so on.
The core of the offline data warehouse is Hive. Data is imported into it by various means, processed mainly with HiveSQL or Spark jobs, and then flows downstream to ClickHouse and other storage systems.
The core of the real-time data warehouse is MQ. Data is processed mainly with FlinkSQL or general Flink jobs, enriched along the way through side joins against various storage systems, and finally written to various storage systems.
There are three typical types of data egress:
01、Metric system: a set of data with strong business attributes, such as “TikTok”
02、Report system: presents data, before or after processing, visually along various dimensions
03、Data services: further processing and retrieval of data through API calls
At ByteDance, the system boundary of data lineage starts at RDS and MQ, runs through the various computation and storage systems, and ends at the metric, report, and data service systems.
Application scenarios of lineage
Before discussing technical details, we first need to clarify the application scenarios and business value of lineage, and from there the problems that data lineage has to solve. Different scenarios differ in how they consume lineage data, in the coverage they need, and in the quality they demand.
| Field | Example scenario | Scenario description | Scenario characteristics |
|---|---|---|---|
| Data assets | Reference heat calculation | An asset that is consumed frequently and referenced widely has thereby shown its authority, much like PageRank for web page references. A reference heat value is defined for each asset from its downstream lineage. The hotter the asset, the more trustworthy it is. | Consumes lineage data offline in bulk; the wider the coverage the better; a few errors do no harm |
| Data assets | Understanding data context | When looking for data, checking the lineage of an asset reveals its full context, which makes it easier to judge whether the asset is what you need and whether it can be trusted. It is like getting to know a person: the people around them tell you a lot and fill in their background. | Queries lineage data in real time; the wider the coverage the better; a few errors do no harm |
| Data development | Impact analysis | Before an asset or task is modified, its downstream lineage is used to determine the scope of impact and notify the affected parties, so that the change does not break downstream tasks. | Queries lineage data in real time; the wider the coverage the better; wrong lineage may cause serious incidents |
| Data development | Attribution analysis | When a task goes wrong, the tasks or assets upstream in its lineage are inspected to find the root cause | Queries lineage data in real time; the wider the coverage the better; lineage errors reduce efficiency |
| Data governance | Link state tracking | Known core tasks are picked in advance, the core links are sorted out automatically through lineage, and governance and guarantees are focused on them | Consumes lineage data offline in bulk; covers the core links; wrong lineage may cause serious incidents |
| Data governance | Data warehouse governance | Standardized governance of the data warehouse, including but not limited to: unreasonable reverse references between warehouse layers; unreasonable layering; redundant tables and links | Consumes lineage data offline in bulk; covers the offline and real-time data warehouses; a few errors do no harm |
| Data security | Security compliance check | Each asset has a security level, and an asset's level should not be lower than that of its upstream assets, otherwise permissions may leak. Based on lineage, the downstream of high-security assets is scanned to eliminate compliance risks | Consumes lineage data offline in bulk; covers the offline and real-time data warehouses; errors may pose a security risk |
| Data security | Tag propagation | Security labels are first identified for some assets, automatically by rules or manually, and then propagated automatically along lineage to a wider range of downstream assets | Consumes lineage data offline in bulk; covers the offline and real-time data warehouses; a small amount of inaccuracy does no harm |
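
As an illustration of the security compliance check in the table, the sketch below scans the downstream of every asset and flags any downstream asset whose security level is lower than that of an asset it derives from. The asset names, the numeric levels, and the `find_violations` helper are hypothetical, and lineage is simplified to a plain asset-to-asset adjacency map.

```python
from collections import deque


def find_violations(downstream, level):
    """Return (upstream, downstream) pairs where the downstream asset
    has a lower security level than an asset it derives from."""
    violations = set()
    for source in downstream:
        seen = {source}
        queue = deque([source])
        # Breadth-first scan of everything reachable from `source`.
        while queue:
            node = queue.popleft()
            for child in downstream.get(node, []):
                if child in seen:
                    continue
                seen.add(child)
                queue.append(child)
                if level[child] < level[source]:
                    violations.add((source, child))
    return sorted(violations)


# Hypothetical lineage: asset -> assets directly derived from it.
downstream = {
    "hive.user_profile": ["hive.user_stats"],
    "hive.user_stats": ["clickhouse.dashboard"],
}
# Hypothetical security levels: a higher number means more sensitive.
level = {
    "hive.user_profile": 3,
    "hive.user_stats": 3,
    "clickhouse.dashboard": 1,
}

violations = find_violations(downstream, level)
```

Because the scan follows lineage transitively, a low-level asset is reported against every more sensitive ancestor, not only its direct parent.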
Overall design of the data lineage system
01 - Overview
From the discussion of ByteDance's data pipeline and the application scenarios above, two key points for the overall design of the lineage system can be summarized:
Extensibility: ByteDance's business is large and complex. Across the whole data pipeline, dozens of storage systems are in use and there are dozens of fine-grained task types, so the lineage system must be able to support new storage systems and task types flexibly
Open integration: lineage is consumed both through real-time queries and through offline consumption, and downstream systems may also build extensions on top of the current data

The overall architecture of ByteDance's data lineage system consists of three parts:
Task access: obtain task information from the task management systems in some way
Lineage parsing: parse the information in each task to obtain lineage data
Data export: store the lineage data in the Data Catalog system and make it available for downstream systems to consume
02 - Task access
There are two key design considerations:
Two alternative links are provided to meet the different real-time requirements of downstream systems:
Near-real-time link: the task management system writes task-change messages to MQ, and the lineage module consumes them
Offline link: the lineage module periodically calls the task management system's API to pull full (or incremental) task information and process it
A unified Task model is defined, with TaskType distinguishing the different types of tasks, to keep subsequent processing extensible:
Different task management systems may manage the same type of task, for example both may support FlinkSQL tasks; conversely, a single task management system may support several task types, for example both FlinkSQL and HiveSQL
When a task management system or task type is added, only a new TaskType needs to be added
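
A minimal sketch of what such a unified Task model could look like follows. Only the names Task and TaskType come from the text; the fields, the enum values, the `to_task` helper, and the two scheduler names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    """Task types known to the lineage module (values are illustrative)."""
    HIVE_SQL = "hive_sql"
    FLINK_SQL = "flink_sql"
    DTS = "dts"


@dataclass
class Task:
    """Unified task model shared by every task management system."""
    task_id: str
    task_type: TaskType
    source_system: str  # task management system that reported the task
    payload: dict = field(default_factory=dict)  # type-specific content


def to_task(source_system: str, raw: dict) -> Task:
    """Normalize a raw task record (hypothetical shape) into the unified model."""
    extra = {k: v for k, v in raw.items() if k not in ("id", "type")}
    return Task(
        task_id=f"{source_system}:{raw['id']}",
        task_type=TaskType(raw["type"]),
        source_system=source_system,
        payload=extra,
    )


# Two hypothetical task management systems both report FlinkSQL tasks.
a = to_task("scheduler_a", {"id": 1, "type": "flink_sql", "sql": "SELECT 1"})
b = to_task("scheduler_b", {"id": 1, "type": "flink_sql", "sql": "SELECT 2"})
```

Downstream processing only needs to look at `task_type`, so tasks of the same type are handled identically no matter which system they came from.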
03 - Lineage parsing
There are two key design considerations:
A unified lineage data model, LineageInfo, is defined
Parsing can be customized flexibly for each TaskType, and a fallback parsing strategy shared across TaskTypes is also supported. For example:
SQL tasks: HiveSQL and FlinkSQL tasks, for example, call the SQL parsing service
Data Transfer Service (DTS) tasks: the configuration in the task is parsed to establish lineage between source and target
Other tasks: for example, some general tasks register their dependencies and outputs, and the control plane of the report system provides the database and table information behind each report
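
The per-type parsing with a shared fallback can be sketched as a small registry. Everything here is an assumption for illustration: the task dictionaries, the type strings, and in particular the regex-based `parse_sql`, which is only a toy stand-in for a real SQL parsing service and will not handle general SQL.

```python
import re

PARSERS = {}  # task type -> parser function


def register(*task_types):
    """Register one parser function for one or more task types."""
    def decorator(fn):
        for task_type in task_types:
            PARSERS[task_type] = fn
        return fn
    return decorator


@register("hive_sql", "flink_sql")
def parse_sql(task):
    # Toy stand-in for a SQL parsing service: handles simple INSERT ... SELECT only.
    sql = task["sql"]
    outputs = re.findall(r"\binsert\s+(?:into|overwrite\s+table)\s+([\w.]+)", sql, re.I)
    inputs = re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.I)
    return inputs, outputs


@register("dts")
def parse_dts(task):
    # Lineage comes straight from the source/target configuration.
    config = task["config"]
    return [config["source"]], [config["target"]]


def parse_registered(task):
    # Fallback: trust the dependencies and outputs the user registered.
    return task.get("registered_inputs", []), task.get("registered_outputs", [])


def parse(task):
    """Dispatch on task type, falling back to registered dependencies."""
    parser = PARSERS.get(task["type"], parse_registered)
    return parser(task)


sql_task = {
    "type": "hive_sql",
    "sql": "INSERT INTO dw.orders_daily SELECT * FROM ods.orders o "
           "JOIN ods.users u ON o.uid = u.id",
}
dts_task = {"type": "dts", "config": {"source": "mysql.shop.orders", "target": "hive.ods.orders"}}
shell_task = {"type": "shell", "registered_inputs": ["hive.ods.orders"], "registered_outputs": ["hdfs:/tmp/out"]}
```

Supporting a new task type then means registering one more parser; task types without a dedicated parser still get lineage through the fallback.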
04 - Data export
The LineageInfo produced by lineage parsing is sent to the Data Catalog system. Three integration modes are supported:
Call the lineage APIs of Data Catalog to pull the required lineage data in real time
Consume the incremental lineage-change messages in MQ to build other peripheral systems with near-real-time capability
Consume the offline lineage export in the data warehouse for analysis, sorting, and other business uses
Lineage data model
The definition of the lineage data model is one of the key designs for keeping the system extensible and making downstream consumption and integration easy.
01 - Overview

We model the whole lineage system as a graph. The graph contains two types of nodes and two types of edges:
Data node: an abstraction of the medium that stores data, such as a Hive table or a column of a Hive table
Task node: an abstraction of a task (or link), such as a HiveSQL script
Edge from a data node to a task node: a consumption relationship; the task reads the data of that data node
Edge from a task node to a data node: a production relationship; the task produces the data of that data node
Putting task nodes and data nodes into a single graph has two advantages:
The life cycle of lineage is tied to the life cycle of the task, so lineage is updated by updating the edges associated with the task
Traversal can start flexibly from either task nodes or data nodes. Scenarios in the data asset field mostly start from data nodes, while scenarios in data development mostly start from tasks; each can traverse the same large graph as needed
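
The graph model above can be sketched in memory as follows. The `LineageGraph` class, its method names, and the table names are hypothetical; a production system would sit on a graph store rather than Python dictionaries.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Single graph holding both data nodes and task nodes."""

    def __init__(self):
        self.successors = defaultdict(set)  # node -> nodes it points to
        self.edges_by_task = {}             # task id -> edges owned by that task

    def update_task(self, task_id, inputs, outputs):
        """Replace all edges of a task, tying lineage to the task's life cycle."""
        for src, dst in self.edges_by_task.pop(task_id, set()):
            self.successors[src].discard(dst)
        task_node = ("task", task_id)
        # data -> task edges are consumption, task -> data edges are production.
        edges = {(("data", name), task_node) for name in inputs}
        edges |= {(task_node, ("data", name)) for name in outputs}
        for src, dst in edges:
            self.successors[src].add(dst)
        self.edges_by_task[task_id] = edges

    def downstream_data(self, data_name):
        """Breadth-first traversal returning every data node reachable downstream."""
        start = ("data", data_name)
        seen, queue, result = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for nxt in sorted(self.successors.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    if nxt[0] == "data":
                        result.append(nxt[1])
        return result


graph = LineageGraph()
graph.update_task("t1", inputs=["ods.orders"], outputs=["dw.orders"])
graph.update_task("t2", inputs=["dw.orders"], outputs=["ads.report"])
```

Calling `update_task` again for an existing task drops its old edges first, which is how a task change keeps lineage in sync without touching edges owned by other tasks.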
02 - Field (column) level lineage
Field-level lineage is an edge case in the lineage model, so we discuss it briefly on its own. There are two alternative approaches to implementing it:
| Approach | Advantages | Disadvantages | Remarks |
|---|---|---|---|
| Reuse the task nodes and add specially defined edges for the relationships between fields | Intuitively easier to understand | The number of edge types may explode; writing and traversal become complex | |
| Add redundant task nodes between fields and reuse the existing edge semantics | Unifies the data model and the traversal process | Redundant task nodes | The task nodes between fields generally carry no meaning of their own. To find out which task introduced a relationship, an extra query on the edge between the virtual node and the task node is enough. |

We finally adopted the second approach.
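
A rough sketch of the approach with redundant (virtual) task nodes is shown below. The naming scheme for virtual nodes, the `column_edges` and `upstream_columns` helpers, and the column names are all assumptions; the point is only that column lineage reuses the same consume and produce edges as table lineage.

```python
def column_edges(task_id, column_mapping):
    """Build column-level edges for one task.

    column_mapping maps each output column to the input columns it is
    derived from. One virtual task node is created per output column.
    """
    edges = []
    virtual_to_task = {}
    for i, (out_col, in_cols) in enumerate(sorted(column_mapping.items())):
        virtual = f"{task_id}#col{i}"       # hypothetical naming scheme
        virtual_to_task[virtual] = task_id  # link back to the real task
        for in_col in in_cols:
            edges.append((in_col, virtual))  # consumption edge
        edges.append((virtual, out_col))     # production edge
    return edges, virtual_to_task


def upstream_columns(edges, column):
    """Columns that feed `column` through a virtual task node."""
    virtuals = {src for src, dst in edges if dst == column}
    return sorted(src for src, dst in edges if dst in virtuals)


edges, virtual_to_task = column_edges(
    "t1",
    {
        "dw.orders.amount": ["ods.orders.price", "ods.orders.qty"],
        "dw.orders.uid": ["ods.orders.user_id"],
    },
)
```

Looking up `virtual_to_task` is the extra query mentioned in the remarks: it tells which real task introduced a given column relationship.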
Lineage metrics
When we actually promote lineage, the question users ask most often is how good the lineage quality is and whether it can be used for their scenario. Faced with this pointed question, evaluating each user's case separately is too expensive, so we spent considerable effort discussing and exploring the three most commonly used technical metrics to demonstrate lineage quality. With these metrics, users can judge for themselves whether lineage fits their scenario.
01 - Accuracy
Definition: if the actual inputs and outputs of a task match the task's upstream and downstream in lineage, with nothing missing and nothing redundant, the lineage of that task is accurate. The proportion of tasks with accurate lineage among all tasks is the lineage accuracy.
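
The definition can be written down directly. In this sketch the `lineage_accuracy` helper and the sample tasks are hypothetical, and it assumes the actual inputs and outputs of each task are known, which in practice is the hard part.

```python
def lineage_accuracy(actual, recorded):
    """Share of tasks whose recorded lineage exactly matches reality.

    Both arguments map task id -> (set of inputs, set of outputs).
    A task counts as accurate only if nothing is missing or redundant.
    """
    accurate = sum(1 for task, io in actual.items() if recorded.get(task) == io)
    return accurate / len(actual)


actual = {
    "t1": ({"a"}, {"b"}),
    "t2": ({"b"}, {"c"}),
    "t3": ({"c"}, {"d"}),
    "t4": ({"d"}, {"e"}),
}
recorded = {
    "t1": ({"a"}, {"b"}),
    "t2": ({"b"}, {"c"}),
    "t3": ({"c"}, {"d", "x"}),  # redundant output, so not accurate
    # t4 is missing from lineage entirely, so not accurate
}

accuracy = lineage_accuracy(actual, recorded)
```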
Accuracy is the metric users care about most. In the impact analysis scenario of data development, for example, missing lineage may mean that the owners of important tasks are not notified, leading to online incidents.
Different task types have different lineage parsing logic, so accuracy is also calculated differently:
SQL tasks: for HiveSQL and FlinkSQL tasks, lineage comes from SQL parsing. When the SQL parsing service guarantees that the lineage of any successfully parsed SQL task is accurate, the lineage accuracy of these tasks reduces to the success rate of SQL parsing.
Data integration (DTS) tasks: for channel tasks such as MySQL->Hive, lineage comes from the upstream-downstream mappings that users register in the configuration, so the accuracy of this kind of lineage reduces to the success rate of parsing the task configuration.
Script tasks: for shell, Python, and similar tasks, lineage comes from the task outputs registered by users, so its accuracy reduces to the proportion of registered outputs that are correct.
One caveat: the reductions above rest on the assumption that the programs behave as we expect, which is not always the case. There is no need to agonize over this. Like any other program, lineage may produce unexpected results because of bugs, environment configuration, boundary inputs, and so on.
To complement accuracy, we use three methods in practice to find problematic lineage as early as possible:
Manual verification: just as other systems are verified with test cases, lineage accuracy can be verified by constructing cases. In practice, we sample some of the tasks running online and manually check whether the parsing results are correct; when necessary, we mock out the output and keep verifying continuously.
Verification against tracking data: some storage systems at ByteDance generate access tracking data. By cleaning this data, the lineage links of some scenarios can be derived and used to verify the lineage output by our programs. For example, HDFS tracking data can be used to verify the lineage of many Hive-related links.
User feedback: verifying the accuracy of the full lineage set is a vast undertaking, but for the business scenario of a specific user the problem is much simpler. In practice, we work closely with some business teams to check lineage accuracy together and fix problems.
02 - Coverage
Definition: an asset is covered by lineage when at least one lineage link is associated with it. The proportion of lineage-covered assets among the assets of concern is the lineage coverage.
Lineage coverage is a relatively coarse-grained metric. As a complement to accuracy, it tells users which asset types and task types are currently supported and how far each is covered.
Internally, we define the coverage metric for two purposes: first, to delimit the set of assets we care about most; second, to find whole task types that the system is missing.
Take Hive tables as an example. ByteDance's production environment has hundreds of thousands of Hive tables, a large part of which have not been used or looked at for a long time. When calculating lineage coverage, we select a subset by rule, for example tables that had data written in the past 7 days, as the denominator, and measure coverage on that subset.
Low lineage coverage usually means that we have missed a task type or that the selected asset range is unreasonable. For example, at the beginning we found that lineage coverage for MQ was only in the single digits; analysis showed that we had missed streaming data integration tasks defined in another format.
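
The coverage calculation with a rule-based denominator can be sketched as follows. The `lineage_coverage` helper, the table names, and the timestamps are made up for illustration; the 7-day write window is the example rule from the text.

```python
from datetime import datetime, timedelta


def lineage_coverage(last_write, covered, now, window_days=7):
    """Coverage over assets written within the last `window_days` days.

    last_write maps asset name -> time of last write;
    covered is the set of assets with at least one lineage link.
    """
    cutoff = now - timedelta(days=window_days)
    concerned = [name for name, ts in last_write.items() if ts >= cutoff]
    if not concerned:
        return 0.0
    return sum(1 for name in concerned if name in covered) / len(concerned)


now = datetime(2022, 6, 9)
last_write = {
    "hive.a": datetime(2022, 6, 8),
    "hive.b": datetime(2022, 6, 5),
    "hive.c": datetime(2022, 5, 1),  # stale, excluded from the denominator
    "hive.d": datetime(2022, 6, 3),
}
covered = {"hive.a", "hive.c", "hive.d"}

coverage = lineage_coverage(last_write, covered, now)
```

Here three tables fall inside the window and two of them have lineage, so the stale but covered table `hive.c` does not inflate the result.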
03 - Timeliness
Definition: the end-to-end delay from a task change to that change being reflected in the lineage storage system.
For some user scenarios the timeliness of lineage is not particularly important and is merely a bonus, but other scenarios depend on it heavily. Timeliness also varies across task types.
Impact analysis in data development is one of the scenarios with high real-time requirements for lineage. If users can only pull the state as of yesterday when determining the scope of impact of a change, serious business incidents may result.
The bottleneck for improving timeliness is usually not the lineage service itself, but whether the task management system can send out task changes as notifications in near real time.
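
As a small worked example of the timeliness definition, the sketch below computes the end-to-end delay for a few task changes. The timestamps and the `end_to_end_delays` helper are invented for illustration, and summarizing by median and maximum is one possible choice rather than the method described in the text.

```python
from datetime import datetime
from statistics import median


def end_to_end_delays(events):
    """Delay in seconds between a task change and its lineage becoming visible.

    events is a list of (task_changed_at, lineage_visible_at) pairs.
    """
    return [(visible - changed).total_seconds() for changed, visible in events]


events = [
    (datetime(2022, 6, 9, 10, 0, 0), datetime(2022, 6, 9, 10, 0, 30)),
    (datetime(2022, 6, 9, 10, 5, 0), datetime(2022, 6, 9, 10, 7, 0)),
    (datetime(2022, 6, 9, 11, 0, 0), datetime(2022, 6, 9, 11, 0, 45)),
]

delays = end_to_end_delays(events)
typical_delay = median(delays)
worst_delay = max(delays)
```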
Future work
Going forward, the ByteDance data platform's work on data lineage will focus on three directions:
First, continuously improving lineage accuracy. Our lineage accuracy has reached a usable level, but it still requires manual verification and repair. How to improve accuracy steadily and continuously is an important direction to explore.
Second, standardizing lineage. Besides data lineage, application-level lineage is also important, and we want the solution to be general and standard. At present, business teams splice in links from their own business domains on top of data lineage; after standardization, these use cases can reuse the whole infrastructure and only need customization at the product presentation layer.
Finally, strengthening support for the external ecosystem, in two directions: one is exploring a general SQL lineage parsing engine, since at present the cost of onboarding lineage for a new SQL engine is relatively high; the other is supporting end-to-end lineage for open-source products and products on public clouds.