
Data lineage: use cases and extension practice

2022-06-09 05:35:00 Zhuojiu South Street

Preface

Data lineage describes where data comes from and where it goes, and how it is transformed across multiple processing steps. It is an important foundational capability for an organization that wants to make its data valuable. Starting with an overview of ByteDance's data pipeline, this article introduces the application scenarios, overall design, data model, and measurement metrics of data lineage at ByteDance.

Introduction to ByteDance's data pipeline

To clarify the scope of the discussion, let's first introduce ByteDance's data pipeline.
ByteDance's data comes from two sources:

01. Client-side data: sent from the APP and Web clients through the event-tracking SDK, passed through LogService, and finally landed in MQ;

02. Business data: the APP, Web, and third-party services go through various application services, and the data eventually lands in RDS; data in RDS then flows into MQ by way of Binlog.

Data in MQ also goes through a redistribution process between MQs, for format conversion, stream splitting, and so on.

The core of the offline data warehouse is Hive. Data is imported into it by various means, processed for business purposes mainly with HiveSQL or SparkJob, and then flows downstream to ClickHouse and other storage.

The core of the real-time data warehouse is MQ. Data is processed mainly with FlinkSQL or general FlinkJob; along the way it is enriched through SideJoin with various storage systems, and it is finally written to various storage.

There are three typical types of data export:

01. Metrics system: a set of data with strong business attributes, such as "Tiktok"

02. Report system: presents data before or after processing in visual form, across various dimensions

03. Data services: further processing and data retrieval through API calls

At ByteDance, the system boundary of data lineage is: starting from RDS and MQ, passing through the various computation and storage stages, and ending in the metrics, report, and data service systems.

Application scenarios of lineage

Before discussing the technical details, we first need to clarify the application scenarios and business value of lineage, and from there the problems that data lineage needs to solve. Different application scenarios differ in how they consume lineage data, in the lineage coverage they need, and in their requirements for lineage quality.

| Domain | Example scenario | Scenario description | Scenario characteristics |
| --- | --- | --- | --- |
| Data assets | Reference heat calculation | An asset that is frequently consumed and widely referenced has strong evidence of its authority. Similar to the PageRank value for web page references, a reference heat value is defined for each asset based on its downstream lineage. The hotter an asset, the more trustworthy it is. | Lineage data is consumed offline in bulk; the wider the coverage the better; a small number of errors has no serious effect |
| Data assets | Understanding data context | When looking for data, checking the lineage of a data asset shows where it comes from and where it goes, which helps in judging whether the asset is what you need and whether it is trustworthy. It is like getting to know someone: much can be learned from the friends around them, which is a good supplement to that person's own background. | Lineage data is fetched in real time; the wider the coverage the better; a small number of errors has no serious effect |
| Data development | Impact analysis | Before a task or asset is modified, its downstream lineage is used to delimit the scope of impact, so that the affected downstream tasks can be notified. | Lineage data is fetched in real time; the wider the coverage the better; wrong lineage may cause serious accidents |
| Data development | Attribution analysis | When a task has a problem, the upstream tasks or assets in its lineage are examined to find the root cause. | Lineage data is fetched in real time; the wider the coverage the better; lineage errors reduce efficiency |
| Data governance | Link status tracking | Known core tasks are selected in advance; lineage is used to sort out the core links automatically, and these links then receive focused governance and guarantees. | Lineage data is consumed offline in bulk; covers the core links; wrong lineage may cause serious accidents |
| Data governance | Data warehouse governance | Standardized governance of the data warehouse, including but not limited to: unreasonable reverse references across warehouse layers; unreasonable warehouse layering; redundant tables and links. | Lineage data is consumed offline in bulk; covers the offline and real-time data warehouses; a small number of errors has no serious effect |
| Data security | Security compliance check | Every asset has a security level, and an asset's security level should not be lower than that of its upstream assets; otherwise there is a risk of privilege leakage. Based on lineage, the downstream of high-security assets is scanned to eliminate compliance risks. | Lineage data is consumed offline in bulk; covers the offline and real-time data warehouses; errors may pose a security risk |
| Data security | Tag propagation | Security tags for some assets are first identified automatically by rules (or manually); based on lineage, the tags are then propagated automatically to a wider range of downstream assets. | Lineage data is consumed offline in bulk; covers the offline and real-time data warehouses; a small amount of inaccuracy has no serious effect |
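
To make the security compliance check concrete, here is a minimal sketch in Python. It assumes that lineage has already been flattened into a mapping from each asset to its direct downstream assets (task nodes omitted), and that security levels are integers where a larger number means more sensitive. The asset names, the levels, and the function `find_compliance_violations` are hypothetical and only illustrate the rule that an asset's level should not be lower than that of its upstream assets.

```python
from collections import deque

def find_compliance_violations(downstream, security_level):
    """Return sorted (upstream, downstream) pairs where the downstream asset
    has a lower security level than an asset it is derived from."""
    violations = set()
    for source, source_level in security_level.items():
        seen = {source}
        queue = deque([source])
        # Breadth-first scan of everything reachable downstream of `source`.
        while queue:
            current = queue.popleft()
            for child in downstream.get(current, ()):
                if child in seen:
                    continue
                seen.add(child)
                queue.append(child)
                # Assumption: an asset without a level is treated as level 0.
                if security_level.get(child, 0) < source_level:
                    violations.add((source, child))
    return sorted(violations)

# Hypothetical assets: a raw table, a cleaned table, and a report table.
downstream = {
    "hive.user_raw": ["hive.user_clean"],
    "hive.user_clean": ["clickhouse.user_report"],
}
security_level = {
    "hive.user_raw": 3,
    "hive.user_clean": 3,
    "clickhouse.user_report": 1,
}
violations = find_compliance_violations(downstream, security_level)
```

Here the report table is flagged twice, once for each higher-level ancestor. Tag propagation could reuse the same downstream scan, attaching a tag to each visited asset instead of comparing levels.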

Overall design of the data lineage system

01 - Overview

From the discussion of ByteDance's data pipeline and the application scenarios, two key points to consider in the overall design of lineage can be summarized:

Extensibility: ByteDance's business is complex and large. Across the whole data pipeline, dozens of storage systems are in use, along with dozens of fine-grained task types. The lineage system needs to support the various storage and task types flexibly

Open integration: lineage is consumed both through real-time queries and through offline consumption, and downstream systems may also build extensions on top of the current data

The overall architecture of ByteDance's data lineage system can be divided into three parts:

Task access: obtain task information from the task management systems in some way

Lineage parsing: parse the information in a task to obtain lineage data

Data export: store lineage data in the Data Catalog system and make it available for downstream systems to consume

02 - Task access

There are two key design considerations:

Two alternative links are provided, to meet downstream systems' different requirements for data freshness:

Near-real-time link: the task management system writes task-modification messages to MQ, and the lineage module consumes them

Offline link: the lineage module periodically calls the task management system's API to pull full (or incremental) task information for processing

A unified Task model is defined, with TaskType distinguishing the different kinds of tasks, to keep subsequent processing extensible:

Different task management systems may manage tasks of the same type; for example, several of them support FlinkSQL tasks. A single task management system may also support several task types; for example, it may support writing both FlinkSQL and HiveSQL

When a task management system or a task type is added, a new TaskType can be added for it
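
The unified Task model can be sketched as follows. The article names only `Task` and `TaskType`; every field name, enum member, and system name below is an assumption made for illustration, not the actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    # Hypothetical members; a new task type is supported by adding a member.
    HIVE_SQL = "hive_sql"
    FLINK_SQL = "flink_sql"
    DTS = "dts"
    SHELL = "shell"

@dataclass(frozen=True)
class Task:
    task_id: str
    source_system: str   # which task management system reported the task
    task_type: TaskType  # decides which parser handles the task later
    content: str         # SQL text, serialized config, script body, ...

# Two different (hypothetical) task management systems can report tasks of
# the same type, and one system can report tasks of several types.
tasks = [
    Task("1001", "scheduler_a", TaskType.FLINK_SQL, "INSERT INTO t2 SELECT * FROM t1"),
    Task("2001", "scheduler_b", TaskType.FLINK_SQL, "INSERT INTO t4 SELECT * FROM t3"),
    Task("2002", "scheduler_b", TaskType.HIVE_SQL, "INSERT OVERWRITE TABLE t6 SELECT * FROM t5"),
]

# Downstream processing groups by TaskType, not by source system.
by_type = {}
for task in tasks:
    by_type.setdefault(task.task_type, []).append(task.task_id)
```

Keying later processing on `task_type` rather than on `source_system` is what lets two systems share a single parsing implementation.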

03 - Lineage parsing

There are two key design considerations:
A unified lineage data model, LineageInfo, is defined
Each TaskType can have its own customized parsing implementation, and a fallback parsing strategy that different TaskTypes can share is also supported. For example:
SQL tasks: for example HiveSQL and FlinkSQL, which call the SQL parsing service
Data Transfer Service (DTS) tasks: the configuration in the task is parsed to establish the lineage between source and target
Other tasks: for example, some general-purpose tasks register their dependencies and outputs, and the control plane of the report system provides the database and table information behind each report
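
The dispatch described above can be sketched as a registry keyed by task type with a shared fallback. `LineageInfo` is the name used in the article, but its fields, the payload keys, and the regex-based SQL "parser" are assumptions; the regex is a toy stand-in for a real SQL parsing service and handles only the simplest statements.

```python
import re
from dataclasses import dataclass, field

@dataclass
class LineageInfo:
    task_id: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)

PARSERS = {}  # task type name -> parsing function

def register(*task_types):
    """Decorator that registers one parser for one or more task types."""
    def decorator(fn):
        for task_type in task_types:
            PARSERS[task_type] = fn
        return fn
    return decorator

@register("hive_sql", "flink_sql")
def parse_sql(task_id, payload):
    sql = payload["sql"]
    outputs = set(re.findall(
        r"insert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", sql, re.I))
    inputs = set(re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I))
    return LineageInfo(task_id, inputs - outputs, outputs)

@register("dts")
def parse_dts(task_id, payload):
    # Source and target come straight from the task configuration.
    return LineageInfo(task_id, {payload["source"]}, {payload["target"]})

def parse_fallback(task_id, payload):
    # Fallback: trust whatever dependencies and outputs were registered.
    return LineageInfo(
        task_id,
        set(payload.get("registered_inputs", [])),
        set(payload.get("registered_outputs", [])),
    )

def parse_task(task_id, task_type, payload):
    return PARSERS.get(task_type, parse_fallback)(task_id, payload)

sql_info = parse_task("t1", "hive_sql", {
    "sql": "INSERT OVERWRITE TABLE dw.orders_daily "
           "SELECT * FROM ods.orders o JOIN ods.users u ON o.uid = u.id"
})
dts_info = parse_task("t2", "dts", {"source": "mysql.shop.orders", "target": "hive.ods.orders"})
shell_info = parse_task("t3", "shell", {"registered_outputs": ["hive.tmp.export"]})
```

Supporting a new task type then means registering one more function, while unregistered types still produce a `LineageInfo` through the fallback.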

04 - Data export

The LineageInfo produced by lineage parsing is sent to the Data Catalog system. Three integration modes are supported:
Call the lineage-related APIs of Data Catalog to pull the required lineage data in real time
Consume the incremental lineage-change messages in MQ to build other peripheral systems with near-real-time capability
Consume the offline lineage export in the data warehouse for analysis, sorting out links, and other business uses

Lineage data model

The definition of the lineage data model is one of the key designs for ensuring system extensibility and for making downstream consumption and integration easy.

01 - Overview


We model the whole lineage system as a graph. The graph contains two types of nodes and two types of edges:

Data node: an abstraction of a medium that stores data, such as a Hive table or a column of a Hive table

Task node: an abstraction of a task (or link), such as a HiveSQL script

Edge from a data node to a task node: represents a consumption relationship; the task reads the data of that data node

Edge from a task node to a data node: represents a production relationship; the task produces the data of that data node

Putting task nodes and data nodes into a single graph has two advantages:

The lifecycle of lineage is unified with the lifecycle of the task; lineage is updated by updating the edges associated with a task

Scenarios that start from tasks and scenarios that start from data nodes are both supported flexibly. For example, most scenarios in the data assets domain start from data nodes, while most scenarios in the data development domain start from tasks; different application scenarios can traverse the one large graph as needed
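
A minimal in-memory sketch of this graph, assuming plain strings as node identifiers. The class `LineageGraph` and its method names are hypothetical; a production system would sit on a graph store, not on Python dictionaries.

```python
from collections import defaultdict

class LineageGraph:
    def __init__(self):
        self.reads = defaultdict(set)   # task node -> data nodes it consumes
        self.writes = defaultdict(set)  # task node -> data nodes it produces

    def upsert_task(self, task, inputs, outputs):
        """Replace all edges of a task; lineage follows the task lifecycle."""
        self.reads[task] = set(inputs)
        self.writes[task] = set(outputs)

    def remove_task(self, task):
        self.reads.pop(task, None)
        self.writes.pop(task, None)

    def downstream_of_data(self, data_node):
        """Entry from a data node: every data node derived from it."""
        seen = set()
        stack = [data_node]
        while stack:
            current = stack.pop()
            for task, inputs in self.reads.items():
                if current not in inputs:
                    continue
                for produced in self.writes.get(task, ()):
                    if produced not in seen:
                        seen.add(produced)
                        stack.append(produced)
        return seen

    def downstream_of_task(self, task):
        """Entry from a task node: its outputs plus everything derived from them."""
        result = set(self.writes.get(task, ()))
        for produced in list(result):
            result |= self.downstream_of_data(produced)
        return result

graph = LineageGraph()
graph.upsert_task("hive_sql_1", ["ods.orders"], ["dw.orders_daily"])
graph.upsert_task("dts_1", ["dw.orders_daily"], ["clickhouse.orders_report"])
before = graph.downstream_of_data("ods.orders")
from_task = graph.downstream_of_task("hive_sql_1")

# The script is edited to read a different table: only its edges change.
graph.upsert_task("hive_sql_1", ["ods.orders_v2"], ["dw.orders_daily"])
after = graph.downstream_of_data("ods.orders")
```

Because every edge is attached to a task, editing or deleting that task is enough to keep the lineage of the affected data nodes current.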

02 - Field (Column) level lineage

Field-level lineage is a boundary case in the lineage model, so we discuss it briefly on its own. There are two alternative approaches to implementing it:

| Option | Advantages | Disadvantages | Remarks |
| --- | --- | --- | --- |
| 1. Reuse task nodes and add specially defined edges for the relationships between fields | Intuitively easier to understand | The number of edge types may explode; writing and traversal become complex | |
| 2. Add redundant task nodes between fields and reuse the existing edge semantics | Unifies the data model and the traversal process | Redundant task nodes | The task nodes between fields generally have no practical meaning. To find out which task introduced a relationship, query the edge between the virtual node and the task node with one additional query. |

We ultimately adopted the second option.
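
The second option can be sketched as follows: each field-to-field relationship gets its own virtual task node, so columns are traversed through the same consume and produce edges as tables. The naming scheme for virtual nodes, the `introduced_by` mapping, and all column names are assumptions for illustration.

```python
from collections import defaultdict

consumes = defaultdict(set)  # data node (column) -> virtual task nodes reading it
produces = defaultdict(set)  # virtual task node -> data nodes (columns) it writes
introduced_by = {}           # virtual task node -> the real task behind it

def add_column_lineage(real_task, source_columns, target_column):
    """Insert one virtual task node between source columns and a target column."""
    virtual = f"virtual::{real_task}::{target_column}"
    for column in source_columns:
        consumes[column].add(virtual)
    produces[virtual].add(target_column)
    introduced_by[virtual] = real_task
    return virtual

def downstream_columns(column):
    """Walk data -> task -> data edges; map each derived column to the real
    task that introduced it (the one extra lookup mentioned in the table)."""
    found = {}
    stack = [column]
    while stack:
        current = stack.pop()
        for virtual in consumes.get(current, ()):
            for target in produces.get(virtual, ()):
                if target not in found:
                    found[target] = introduced_by[virtual]
                    stack.append(target)
    return found

add_column_lineage("hive_sql_1",
                   ["ods.orders.price", "ods.orders.quantity"],
                   "dw.orders_daily.amount")
add_column_lineage("hive_sql_2",
                   ["dw.orders_daily.amount"],
                   "app.revenue.total")
derived = downstream_columns("ods.orders.price")
```

The cost is one extra node per field relationship, which is the redundancy listed as the disadvantage of this option.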

Lineage metrics

When actually promoting lineage, the question users ask most often is how good the lineage quality is and whether it can be used in their scenario. Faced with this pointed question, evaluating it separately for each user is too expensive, so we spent a lot of effort discussing and exploring the three most commonly used technical metrics to demonstrate lineage quality. Based on these metrics, users can judge for themselves whether lineage fits their scenario.

01 - Accuracy

Definition: if the actual inputs and outputs of a task match the task's upstream and downstream in lineage, with nothing missing and nothing redundant, the lineage of that task is accurate. The proportion of tasks with accurate lineage among all tasks is the lineage accuracy.
Accuracy is the metric users care about most. In the impact analysis scenario of data development, for example, missing lineage may mean that important tasks are not notified, which can cause production incidents.
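
The definition translates directly into a ratio. The sketch below assumes that a ground truth of actual inputs and outputs is available for the sampled tasks (for example from manual verification); the task names and the function `lineage_accuracy` are hypothetical.

```python
def is_accurate(actual, recorded):
    """A task's lineage is accurate only if inputs and outputs both match
    exactly: nothing missing and nothing redundant."""
    if recorded is None:
        return False
    actual_in, actual_out = actual
    recorded_in, recorded_out = recorded
    return set(actual_in) == set(recorded_in) and set(actual_out) == set(recorded_out)

def lineage_accuracy(actual_io, lineage_io):
    """Both arguments map task id -> (inputs, outputs)."""
    if not actual_io:
        return 0.0
    accurate = sum(
        1 for task, actual in actual_io.items()
        if is_accurate(actual, lineage_io.get(task))
    )
    return accurate / len(actual_io)

actual_io = {
    "t1": ({"a"}, {"b"}),
    "t2": ({"b", "c"}, {"d"}),
    "t3": ({"d"}, {"e"}),
    "t4": ({"e"}, {"f"}),
}
lineage_io = {
    "t1": ({"a"}, {"b"}),
    "t2": ({"b"}, {"d"}),   # upstream "c" is missing, so t2 is inaccurate
    "t3": ({"d"}, {"e"}),
    "t4": ({"e"}, {"f"}),
}
accuracy = lineage_accuracy(actual_io, lineage_io)
```

Three of the four sampled tasks match exactly, so the accuracy here is 0.75.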

Different task types have different lineage parsing logic, so the logic for calculating accuracy also differs:

SQL tasks: for example HiveSQL and FlinkSQL tasks, whose lineage comes from SQL parsing. When the quality guarantee given by the SQL parsing service is that a successfully parsed SQL task always yields accurate lineage, the lineage accuracy of such tasks can be converted into the SQL parsing success rate.

Data integration (DTS) tasks: for example channel tasks such as MySQL->Hive, whose lineage comes from the upstream-downstream mapping configuration registered by users. The accuracy of this lineage can be converted into the success rate of task configuration parsing.

Script tasks: for example shell and python tasks, whose lineage comes from the task outputs registered by users. The accuracy of this lineage can be converted into the proportion of registered outputs that are correct.

One caveat: the conversions above rest on the premise that the program runs the way we assume, which is not always the case. There is no need to agonize over this. Lineage is like any other program we run: bugs, environment configuration, boundary inputs, and so on may produce unexpected results.

As a supplement to accuracy, we use three approaches in practice to find problematic lineage as early as possible:

Manual verification: just as other systems are verified by constructing test cases, lineage accuracy can also be verified by constructing cases. In practice, we sample some of the tasks running online and manually verify whether the parsing results are correct; when necessary, we mock out the output and keep running the verification.

Verification with tracking data: some storage systems at ByteDance generate access tracking data. By cleaning this data, the lineage links of some scenarios can be derived and used to verify the correctness of the lineage the program outputs. For example, HDFS tracking data can be used to verify the lineage of many Hive-related links.

User feedback: verifying the accuracy of the entire lineage set is a vast undertaking, but for one user's specific business scenario the problem is much simpler. In practice, we work closely with some business teams to check lineage accuracy together and fix problems.

02 - Coverage

Definition: an asset is said to be covered by lineage when at least one lineage link is associated with it. The proportion of lineage-covered assets among the assets of concern is the lineage coverage.

Lineage coverage is a relatively coarse-grained metric. As a supplement to accuracy, coverage tells users which asset types and task types are currently supported, and how far each is covered.

Internally, we define the coverage metric for two purposes: first, to delineate the set of assets we care about most; second, to find whole classes of task types that the system is missing.

Take Hive tables as an example. The number of Hive tables in ByteDance's production environment has reached the hundreds of thousands, and a large share of them will not be used or looked at for a long time. When calculating lineage coverage, we select a subset by rule, for example tables that have had data written in the past 7 days, as the denominator, and look at lineage coverage on that basis.
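
The coverage calculation with a rule-based denominator can be sketched as follows. The 7-day write window comes from the example above; the table names, timestamps, and the function `lineage_coverage` are hypothetical.

```python
from datetime import datetime, timedelta

def lineage_coverage(last_write, covered, now, window_days=7):
    """last_write: asset -> time of its most recent write.
    covered: assets that have at least one lineage link."""
    cutoff = now - timedelta(days=window_days)
    # Denominator: only assets written within the window are "of concern".
    concerned = {asset for asset, ts in last_write.items() if ts >= cutoff}
    if not concerned:
        return 0.0
    return len(concerned & set(covered)) / len(concerned)

now = datetime(2022, 6, 9)
last_write = {
    "hive.t1": datetime(2022, 6, 8),
    "hive.t2": datetime(2022, 6, 5),
    "hive.t3": datetime(2022, 1, 1),   # stale table, excluded from the denominator
    "hive.t4": datetime(2022, 6, 7),
}
covered = {"hive.t1", "hive.t3", "hive.t4"}
coverage = lineage_coverage(last_write, covered, now)
```

Three tables fall inside the window and two of them have lineage, so the coverage is 2/3; the stale table counts neither for nor against it.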

When lineage coverage is low, it usually means that we have missed a task type or that the selected asset range is unreasonable. For instance, at the beginning we found that the lineage coverage of MQ was only in the single digits; analysis showed that we had missed a streaming data integration task type defined in another format.

03 - Timeliness

Definition: the end-to-end delay from a task change to that change finally being reflected in the lineage storage system.
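
As a small illustration of the definition, the delay per task is the time at which the change became visible in lineage storage minus the time of the task change. The timestamps, task names, and the function `lineage_delays` are hypothetical; the two example tasks stand for a near-real-time link and an offline link.

```python
from datetime import datetime

def lineage_delays(changed_at, visible_at):
    """Return task -> end-to-end delay in seconds, for tasks whose change
    has already shown up in lineage storage."""
    return {
        task: (visible_at[task] - changed_at[task]).total_seconds()
        for task in changed_at
        if task in visible_at
    }

changed_at = {
    "flink_sql_1": datetime(2022, 6, 9, 12, 0, 0),
    "hive_sql_1": datetime(2022, 6, 9, 12, 0, 0),
}
visible_at = {
    "flink_sql_1": datetime(2022, 6, 9, 12, 0, 45),  # picked up via MQ
    "hive_sql_1": datetime(2022, 6, 10, 2, 0, 0),    # picked up by a nightly pull
}
delays = lineage_delays(changed_at, visible_at)
worst_delay = max(delays.values())
```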

For some user scenarios, lineage timeliness is not especially important and counts as a bonus, but other scenarios depend on it heavily. Timeliness also differs by task type.

Impact analysis in the data development domain is one of the scenarios with high real-time requirements for lineage. When users delimit the scope of impact of a modification, being able to pull only the state as of yesterday can lead to serious business incidents.

The bottleneck in improving timeliness is usually not the lineage service, but whether the task management system can send out task modifications as notifications in near real time.

Future

Next, the ByteDance data platform's work on data lineage will focus on three directions:

First, continuing to improve lineage accuracy. Our lineage accuracy has reached a usable state, but it still needs manual verification and repair. How to improve accuracy continuously and steadily is an important direction to explore.

Second, standardizing lineage. Besides data lineage, application-level lineage is also very important, and we want the solution to be general and standard. At present, business teams splice together some links in their own business domains on top of data lineage; after standardization, these use cases can reuse the whole infrastructure and only need customization at the product presentation layer.

Finally, strengthening support for the external ecosystem. This splits into two directions: one is to explore a general SQL lineage parsing engine, since the cost of onboarding lineage for a new SQL engine is currently fairly high; the other is to support end-to-end lineage for products that are open source or on the public cloud.

Original site

Copyright notice
This article was written by [Zhuojiu South Street]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/160/202206090517190607.html