
[DataHub] LinkedIn DataHub learning notes

2022-06-23 14:44:00 Koma_ zhe

Preface

As digital transformation advances, more and more companies are putting data governance on the agenda. As a new-generation metadata management platform, DataHub has developed rapidly over the past year and looks likely to replace Atlas, the older metadata management tool.

Data governance and metadata management

In big data development, raw data is spread across many databases and tables, and data aggregation produces many dimension tables on top of them. Without tooling, these data assets go unmanaged. In fact, many companies have released open source solutions to this problem: data discovery and metadata management tools.

Metadata management:

Metadata is generally described as data about data. More concretely, it is a static description of dynamic data. Metadata management means organizing data assets effectively: it uses metadata to help an organization manage its data, and it helps data professionals collect, organize, access, and enrich metadata in support of data governance.
Common metadata:
In a data processing pipeline, we first need to define the data entities at each stage, which gives us schema metadata. Then we need to define the processing logic between data entities, known as ETL, which gives us relationship metadata between data entities. This processing logic needs a scheduler to actually execute it, which gives us scheduling metadata. After the data is processed, reports need to be published, which gives us report metadata. And because the whole system involves different users, there is also user metadata.
Of course, these are only the most common types of metadata on an enterprise data platform; there is plenty of other information, large and small. Building a metadata system is therefore an enterprise-level information construction effort.
Data classification model: metadata, reference data, master data, business data …
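The metadata categories described above can be pictured as tags attached to data entities. The following is a minimal Python sketch of that idea. The names `MetadataKind`, `MetadataRecord`, and `group_by_kind`, as well as the sample records, are hypothetical illustrations and are not part of DataHub's actual model.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class MetadataKind(Enum):
    """The common metadata categories named in the text."""
    SCHEMA = "schema"
    RELATIONSHIP = "relationship"   # ETL logic linking data entities
    SCHEDULING = "scheduling"
    REPORT = "report"
    USER = "user"


@dataclass(frozen=True)
class MetadataRecord:
    entity: str          # hypothetical entity name, e.g. a table
    kind: MetadataKind
    description: str


def group_by_kind(records):
    """Group entity names by metadata kind, preserving input order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.kind].append(record.entity)
    return dict(grouped)


# Made-up sample records, for illustration only.
SAMPLE_RECORDS = [
    MetadataRecord("orders", MetadataKind.SCHEMA, "column names and types"),
    MetadataRecord("orders_daily", MetadataKind.RELATIONSHIP, "aggregated from orders"),
    MetadataRecord("orders_daily", MetadataKind.SCHEDULING, "runs once per day"),
]

if __name__ == "__main__":
    for kind, entities in group_by_kind(SAMPLE_RECORDS).items():
        print(kind.value, entities)
```

A real platform stores far richer structures per category; the point here is only that one entity can carry several kinds of metadata at once.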

Data assets:

A data asset may be a table in an Oracle database. Modern enterprises have a dazzling variety of data assets: tables in relational databases or NoSQL stores, real-time streaming data, features in AI systems, metrics in a metrics platform, and dashboards in data visualization tools.

The functions of metadata management:
  • Search and discovery: tables, fields, tags, usage information
  • Access control: access control groups, users, policies
  • Data lineage: pipeline executions, queries
  • Compliance: taxonomy of data privacy / compliance annotation types
  • Data management: data source configuration, ingestion configuration, retention configuration, data purge policies
  • AI explainability and reproducibility: feature definitions, model definitions, training run executions, problem statements
  • Data operations: pipeline executions, processed data partitions, data statistics
  • Data quality: data quality rule definitions, rule execution results, data statistics
DataHub technology stack:

DataHub consists of four parts: metadata, gms, etl, and datahub. The whole system is built with Gradle.
metadata defines the models. It uses two data formats: Avro, the externally facing format, which is very practical, and PDSC, an internally developed format that is rarely used outside LinkedIn.
gms generates services from the models. Internally it uses Rest.li, another RESTful framework developed in-house at LinkedIn; it is easy to use but not widely adopted.
etl handles model data processing. It uses the Kafka Schema Registry and Kafka Streams, the technology LinkedIn knows best.
datahub provides the metadata application built on gms. It includes a backend service and a frontend UI; the backend uses the Play Framework, and the frontend uses Ember.js + TypeScript.
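To give a feel for what a model in Avro format looks like, here is a small Python sketch that parses an Avro record schema written as JSON and lists its fields. The schema itself (`DatasetInfo`, its namespace, and its fields) is a made-up example and not one of DataHub's real models; `list_fields` is likewise a hypothetical helper.

```python
import json

# A hypothetical Avro record schema; real DataHub models are different.
EXAMPLE_SCHEMA = """
{
  "type": "record",
  "name": "DatasetInfo",
  "namespace": "com.example.metadata",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "description", "type": ["null", "string"], "default": null},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}
"""


def list_fields(schema_json):
    """Return (field name, is_optional) pairs for an Avro record schema."""
    schema = json.loads(schema_json)
    if schema.get("type") != "record":
        raise ValueError("expected an Avro record schema")
    fields = []
    for field in schema["fields"]:
        field_type = field["type"]
        # By convention, an optional Avro field is a union that includes "null".
        is_optional = isinstance(field_type, list) and "null" in field_type
        fields.append((field["name"], is_optional))
    return fields


if __name__ == "__main__":
    for name, optional in list_fields(EXAMPLE_SCHEMA):
        print(name, "(optional)" if optional else "(required)")
```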

DataHub setup script (installation on a server):

Installation references:
Official DataHub Quickstart Guide
DataHub installation and configuration in detail
One-stop metadata management platform: an introduction to DataHub
DataHub installation and configuration, with detailed steps

yum -y install gcc

yum -y install docker

# Start Docker
sudo systemctl start docker
# Test that Docker is installed correctly
sudo docker run hello-world

# Docker Compose (Docker's service orchestration tool, mainly used to run multi-service setups)
curl -L "https://github.com/docker/compose/releases/download/1.27.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

chmod +x /usr/local/bin/docker-compose
# Then start Docker
# Reload the systemd daemon configuration
sudo systemctl daemon-reload
# Restart the Docker service
sudo systemctl restart docker
# Then check that it is running
docker container ls
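Before moving on, it can help to confirm that both tools are actually on the PATH. The following Python sketch does that with the standard library only. The helper names `find_tools` and `tool_version` are my own, and the sketch assumes each tool accepts a `--version` flag, which `docker` and `docker-compose` both do.

```python
import shutil
import subprocess


def find_tools(names):
    """Map each tool name to its absolute path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def tool_version(name):
    """Return the first line of `<name> --version`, or None if unavailable."""
    path = shutil.which(name)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = (result.stdout or result.stderr).strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    for tool, path in find_tools(["docker", "docker-compose"]).items():
        print(tool, "->", path or "not found", "|", tool_version(tool))
```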

yum install libffi-devel -y
yum install 'zlib*' -y

Because I already had a Python environment, I skipped installing Python.

pip3 install toml

Docker registry mirror, to speed up image pulls (set in /etc/docker/daemon.json):

{
  "registry-mirrors": ["http://hub-mirror.c.163.com"]
}

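If the Docker daemon configuration file already contains other settings, the mirror entry has to be merged in rather than pasted over them. The sketch below shows one way to do that merge in Python. `add_registry_mirror` is a hypothetical helper, and the existing `log-driver` setting is only an assumed example; on a real server you would load and save the actual file and then restart Docker.

```python
import json


def add_registry_mirror(config, mirror):
    """Return a copy of a daemon.json-style dict with `mirror` added once."""
    updated = dict(config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror not in mirrors:
        mirrors.append(mirror)
    updated["registry-mirrors"] = mirrors
    return updated


if __name__ == "__main__":
    # Assumed existing settings; read the real file on a server instead.
    existing = {"log-driver": "json-file"}
    merged = add_registry_mirror(existing, "http://hub-mirror.c.163.com")
    print(json.dumps(merged, indent=2))
```

The function leaves the input dict untouched and is idempotent, so running it twice does not duplicate the mirror.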
cd /opt
yum -y install git
git --version
# GitHub no longer serves the unauthenticated git:// protocol, so clone over HTTPS
git clone https://github.com/linkedin/datahub.git
cd /opt/datahub/docker
source ./quickstart.sh

When running the commands above, git clone failed for me, so I downloaded the project myself and uploaded it to the server. The last command is very slow, so it is worth configuring the Docker registry mirror beforehand.

python3 -m pip install --upgrade pip wheel setuptools -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
python3 -m pip uninstall datahub acryl-datahub || true  # sanity check - ok if it fails
python3 -m pip install --upgrade acryl-datahub -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
python3 -m datahub version
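The same check can be made from Python, which is handy in a setup script. This is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("acryl-datahub")
    print("acryl-datahub:", version or "not installed")
```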

During the installation above I also found a port conflict with the MySQL service already running on my machine. I stopped that service first and later changed its port to 3307 so that it does not conflict with the MySQL inside Docker.
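A port conflict like this can be detected before starting the containers. The sketch below tries a TCP connection to a local port and reports whether something is listening. `port_in_use` is a hypothetical helper; 3306 is MySQL's default port and 3307 is the replacement port mentioned above.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (3306, 3307):
        print(port, "in use" if port_in_use(port) else "free")
```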

docker push reported the error "received unexpected HTTP status: 500 Internal Server Error". This is how I solved it:

setenforce 0
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config 
egrep '^SELINUX=' /etc/selinux/config 

If source ./quickstart.sh runs successfully, you can access DataHub at ip:9002.
Check the installed plugins. DataHub is plugin-based: you can check the source plugins (data acquisition), the transformer plugins (transformation), and the sink plugins (output).

 python3 -m datahub check plugins

I ingested the data sources with my own yml recipes. Running them from the command line succeeded; running them from the UI failed.
I also ran into an Oracle ingestion failure; the fix was captured in a screenshot that is not reproduced here.

oracle:

source:
  type: "oracle"
  config:
    username: "system"
    password: "Admin134"
    database: "prod"
    host_port: "172.16.5.90"

mysql:

source:
  type: "mysql"
  config:
    username: "root"
    password: "123456"
    database: "mysql"
    host_port: "localhost:3307"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
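As a rough sanity check before ingesting, the two recipes above can be inspected programmatically. The sketch below writes them as Python dicts (passwords omitted) and reports a few structural observations. The checks in `check_recipe` are my own assumptions, not DataHub's validation rules, and a note is not necessarily an error. `ingest_command` builds the `datahub ingest -c <file>` invocation in the `python3 -m datahub` form used earlier; the recipe file name is hypothetical.

```python
def check_recipe(recipe):
    """Return a list of human-readable notes about a recipe dict."""
    source = recipe.get("source")
    if not isinstance(source, dict):
        return ["missing 'source' section"]
    notes = []
    if "type" not in source:
        notes.append("source has no 'type'")
    host_port = source.get("config", {}).get("host_port", "")
    if host_port and ":" not in host_port:
        notes.append("host_port has no explicit port: " + host_port)
    if "sink" not in recipe:
        notes.append("no 'sink' section")
    return notes


def ingest_command(recipe_path):
    """Build the CLI invocation for running a recipe file."""
    return ["python3", "-m", "datahub", "ingest", "-c", recipe_path]


# The two recipes from the text as dicts, with passwords left out.
ORACLE_RECIPE = {
    "source": {
        "type": "oracle",
        "config": {"username": "system", "database": "prod",
                   "host_port": "172.16.5.90"},
    },
}
MYSQL_RECIPE = {
    "source": {
        "type": "mysql",
        "config": {"username": "root", "database": "mysql",
                   "host_port": "localhost:3307"},
    },
    "sink": {"type": "datahub-rest",
             "config": {"server": "http://localhost:8080"}},
}

if __name__ == "__main__":
    for name, recipe in (("oracle", ORACLE_RECIPE), ("mysql", MYSQL_RECIPE)):
        print(name, check_recipe(recipe) or "no notes")
    print(" ".join(ingest_command("mysql.yml")))  # hypothetical file name
```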

To be continued
