[DataHub] LinkedIn DataHub learning notes
2022-06-23 14:44:00 【Koma_ zhe】
DataHub Learning notes
Preface
With the advance of digital transformation, more and more companies are putting data governance on the agenda. As a new-generation metadata management platform, DataHub has developed rapidly over the past year and looks likely to replace Atlas, the older metadata management tool.
Data governance and metadata management
In big data development, the raw data spans many databases and tables, and data aggregation produces many dimension tables on top of them. Without tooling, these data assets go unmanaged. In fact, several companies have released open source solutions to this problem: data discovery and metadata management tools.
Metadata management :
Metadata is, broadly, data about data. More concretely, it is a static description of dynamic data. Metadata management organizes data assets effectively: an organization uses metadata to help manage its data. It also helps data professionals collect, organize, access, and enrich metadata in support of data governance.
Common metadata :
In a data processing pipeline, we first need to define the data entities at each stage, which gives schema metadata. Next we define the processing logic between data entities, known as ETL, which gives relationship metadata between entities. That logic needs a scheduler to actually execute it, which gives scheduling metadata. Once the data has been processed, reports are published, which gives report metadata. And because different users are involved across the whole system, there is user metadata.
These are only the most common metadata types on an enterprise data platform; there is plenty of other information, large and small. Building a metadata system is therefore an enterprise-level information-building effort.
Data classification model: metadata, reference data, master data, business data …
Data assets :
A data asset may be a table in an Oracle database. Modern enterprises have a dazzling variety of data assets: tables in relational databases or NoSQL stores, real-time streaming data, features in AI systems, metrics in a metrics platform, and dashboards in data visualization tools.
What metadata management is used for:
- Search and discovery: tables, fields, tags, usage information
- Access control: access control groups, users, policies
- Data lineage: pipeline executions, queries
- Compliance: classification of data privacy / compliance annotation types
- Data management: data source configuration, ingestion configuration, retention configuration, data purge policies
- AI explainability and reproducibility: feature definitions, model definitions, training run executions, problem statements
- Data operations: pipeline executions, processed data partitions, data statistics
- Data quality: data quality rule definitions, rule execution results, data statistics
DataHub technology stack:
DataHub consists of four parts: metadata, gms, etl, and datahub. The whole system is built with Gradle.
metadata defines the models. It uses two data formats. One is Avro, the externally facing format, which is very practical. The other is PDSC, an internally developed format that is rarely used outside LinkedIn.
gms generates services from the models. Internally it uses rest.li, another in-house RESTful framework. It is also easy to use, but its adoption is narrow.
etl handles model data processing. It uses the Kafka schema registry and Kafka Streams, the technologies LinkedIn knows best.
datahub provides the metadata application built on gms. It consists of a backend service and a frontend UI: the backend uses the Play framework, and the frontend uses Ember.js + TypeScript.
DataHub setup script (installation on a server):
Installation references:
Official DataHub Quickstart Guide
DataHub installation and configuration in detail
One-stop metadata management platform: an introduction to DataHub
DataHub installation and configuration, with detailed steps
yum -y install gcc

yum -y install docker

# Start Docker
sudo systemctl start docker
# Verify the installation
sudo docker run hello-world

# Docker Compose: Docker's service orchestration tool, mainly used to bring up multiple services
curl -L "https://github.com/docker/compose/releases/download/1.27.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
# Then restart Docker: reload the systemd unit files
sudo systemctl daemon-reload
# Restart the Docker service
sudo systemctl restart docker
# Check which containers are running
docker container ls

yum install libffi-devel -y
yum install zlib* -y
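After the Docker and Docker Compose steps above, it may help to confirm that both commands are actually on the PATH before moving on. The following is a minimal sketch, not part of the original notes; the helper name `missing_tools` is made up for illustration.

```python
import shutil


def missing_tools(tools=("docker", "docker-compose")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("docker and docker-compose are both on PATH")
```

This only checks that the executables exist; it does not check that the Docker daemon is running, which `docker container ls` above already covers.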
Because I already had a Python environment, I skipped installing Python:
pip3 install toml
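Docker registry mirror configuration (on a default Linux install this goes in /etc/docker/daemon.json):
{
  "registry-mirrors": ["http://hub-mirror.c.163.com"]
}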
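If the mirror list needs to be produced by a script, the same JSON can be rendered with the standard library, which avoids hand-editing mistakes such as a missing quote or comma. This is only a sketch: the function name `render_daemon_config` is hypothetical, and it renders the text without writing any file.

```python
import json


def render_daemon_config(mirrors):
    """Return Docker daemon JSON text that sets the given registry mirrors."""
    return json.dumps({"registry-mirrors": list(mirrors)}, indent=2)


if __name__ == "__main__":
    # Mirror URL taken from the notes above; substitute one that works for you.
    print(render_daemon_config(["http://hub-mirror.c.163.com"]))
```

Docker has to be restarted after the daemon configuration changes, which is what the `systemctl restart docker` step earlier does.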
cd /opt
yum -y install git
git --version
git clone https://github.com/linkedin/datahub.git
cd /opt/datahub/docker
source ./quickstart.sh
When I ran the commands above, git clone failed, so I downloaded the project myself and uploaded it to the server. The last command is very slow, so it is worth configuring a Docker registry mirror in advance.
python3 -m pip install --upgrade pip wheel setuptools -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
python3 -m pip uninstall datahub acryl-datahub || true # sanity check - ok if it fails
python3 -m pip install --upgrade acryl-datahub -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
python3 -m datahub version

During the installation I also found that the MySQL service already running on my machine caused a port conflict. I stopped that service first, and later changed its port to 3307 so that it no longer conflicts with the MySQL port used in Docker.
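A port conflict like this can be detected before starting the containers. The sketch below is an illustration rather than part of the original setup; it assumes the conflict is on 3306, the default MySQL port, and the function name `port_in_use` is made up.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 3306 is the default MySQL port; 3307 is where the notes moved the local MySQL.
    for port in (3306, 3307):
        print(port, "in use" if port_in_use(port) else "free")
```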
Docker push reported an error: received unexpected HTTP status: 500 Internal Server Error. I solved it by disabling SELinux:
setenforce 0
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
egrep '^SELINUX=' /etc/selinux/config


If source ./quickstart.sh runs successfully, you can access DataHub at ip:9002.
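Whether the web UI is up can also be checked from a script instead of a browser. This is a minimal sketch using only the standard library; the helper names `frontend_url` and `is_reachable` are hypothetical, and the host in the usage line is a placeholder for your server's address.

```python
import urllib.error
import urllib.request


def frontend_url(host, port=9002):
    """Build the DataHub web UI address described in the notes (ip:9002)."""
    return f"http://{host}:{port}"


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP request to url gets any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


if __name__ == "__main__":
    url = frontend_url("localhost")  # placeholder host
    print(url, "reachable" if is_reachable(url) else "not reachable")
```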
Check the installed plugins. DataHub uses a plugin-based architecture: you can list the source plugins (which pull metadata from a data source), the transformer plugins (which transform it), and the sink plugins (which write it to a destination).
python3 -m datahub check plugins
I ingested data sources successfully with my own yml recipes (running them from the command line succeeded; running them from the UI failed).
This is how I worked around the Oracle ingestion failure:
Oracle recipe:
source:
  type: "oracle"
  config:
    username: "system"
    password: "Admin134"
    database: "prod"
    host_port: "172.16.5.90"
MySQL recipe:
source:
  type: "mysql"
  config:
    username: "root"
    password: "123456"
    database: "mysql"
    host_port: "localhost:3307"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
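The same recipe structure can be built and run from Python rather than from a yml file. The sketch below is an assumption-laden illustration, not the author's method: `build_mysql_recipe` and `run_recipe` are made-up helper names, and `run_recipe` relies on the `Pipeline` class from the `acryl-datahub` package installed earlier, with the MySQL source plugin available.

```python
def build_mysql_recipe(username, password, database, host_port, gms_server):
    """Return a recipe dict with the same shape as the MySQL yml recipe."""
    return {
        "source": {
            "type": "mysql",
            "config": {
                "username": username,
                "password": password,
                "database": database,
                "host_port": host_port,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_server},
        },
    }


def run_recipe(recipe):
    """Run a recipe in-process; needs acryl-datahub and the source plugin."""
    # Imported lazily so build_mysql_recipe works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    # Raises if the run reported failures.
    pipeline.raise_from_status()


if __name__ == "__main__":
    recipe = build_mysql_recipe(
        "root", "123456", "mysql", "localhost:3307", "http://localhost:8080"
    )
    print(recipe["source"]["type"], "->", recipe["sink"]["type"])
    # run_recipe(recipe)  # uncomment once DataHub is up and reachable
```

Building the recipe as a dict keeps credentials out of a file on disk and makes it easy to reuse one sink block across several sources, which matches how the two recipes above share the same datahub-rest sink.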
To be continued