Digital Intelligence Learning: Lake-Warehouse Integration in Practice and Exploration
2022-06-28 02:37:00 【Digital technology dtwave】
About this column
Shulan Technology has launched the column "Technical School +", which focuses on cutting-edge technology, tracks industry trends, and shares R&D experience and application practice from the front line.
In this installment, Bai Song, deputy general manager of the Shulan Technology R&D center, shares the practice and exploration of lake-warehouse integration.
Introduction
As society digitalizes at an accelerating pace, both the volume and the variety of data keep growing rapidly. To meet the demands of increasingly complex business analytics, big data infrastructure has evolved from the database to the data warehouse, then to the data lake, and now to lake-warehouse integration.
With successful deployments in many fields and scenarios, the concept of lake-warehouse integration has entered the mainstream. AWS, Alibaba, Huawei, Google, Tencent and other large vendors have launched cloud-based data lake products. In the 《Hype Cycle for Data Management 2021》 released by the well-known research firm Gartner, lake-warehouse integration (Lake house) was included in the hype cycle for the first time.
In essence, lake-warehouse integration breaks down the barrier between the data warehouse and the data lake. It unifies fragmented data, reduces data movement during analysis, and enables unified data management, which helps uncover and deliver more value from data.
Before explaining lake-warehouse integration, let's first look at the data lake and the data warehouse.
1. Data Lake and Data Warehouse
According to Wikipedia's definition, a data lake is a large repository that stores all kinds of raw enterprise data, where the data can be accessed, processed, analyzed, and transferred. A data lake is a system or repository that stores data in its natural format, usually as objects or files. It is typically a single store of all enterprise data, including raw copies of source-system data as well as transformed data used for tasks such as reporting, visualization, analytics, and machine learning. A data lake can hold structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).
A data warehouse, by contrast, is a subject-oriented, integrated, relatively stable data store that reflects historical change. It aggregates structured data from different sources for comparison and analysis in business intelligence. A data warehouse is a repository that contains many kinds of data and is highly modeled.
The differences between a data lake and a data warehouse are as follows:

In terms of storage, a data lake can hold structured, semi-structured, and unstructured data, whereas a data warehouse can only handle structured data. A data warehouse requires the data to be organized and its schema defined before the data can be loaded. A data lake ingests the raw data as-is, which leaves far more room for later machine learning and data mining on the lake.
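The contrast between defining the schema before loading and ingesting data as-is can be illustrated with a small plain-Python sketch. It is only a toy model: the `USER_SCHEMA` columns, the function names, and the sample records are assumptions made for illustration and do not correspond to any particular warehouse or lake product.

```python
import json

# Hypothetical warehouse schema: column name -> required Python type.
USER_SCHEMA = {"id": int, "name": str}


def load_into_warehouse(table, record, schema=USER_SCHEMA):
    """Schema-on-write: validate against the schema before storing."""
    if set(record) != set(schema):
        raise ValueError(f"columns {sorted(record)} do not match {sorted(schema)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"column {column!r} is not {expected_type.__name__}")
    table.append(record)


def land_in_lake(lake, raw_line):
    """Schema-on-read: keep the raw payload exactly as it arrived."""
    lake.append(raw_line)


def read_from_lake(lake, schema=USER_SCHEMA):
    """Apply the schema only at read time; skip rows that do not fit."""
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # not JSON, for example a free-text log line
        if isinstance(record, dict) and all(
            isinstance(record.get(col), typ) for col, typ in schema.items()
        ):
            yield {col: record[col] for col in schema}


warehouse_table, lake = [], []
incoming = [
    '{"id": 1, "name": "ann"}',
    '{"id": 2, "name": "bob", "vip": true}',  # extra column
    "plain log line",                          # not structured at all
]

for line in incoming:
    land_in_lake(lake, line)  # the lake accepts everything
    try:
        load_into_warehouse(warehouse_table, json.loads(line))
    except ValueError:
        pass  # the warehouse rejects anything that does not match its schema
```

In this sketch the lake keeps all three payloads, while the warehouse table only accepts the first record. Reading the lake with the same schema later recovers both user records, because the extra `vip` column is simply ignored at read time.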
Because of the constraints of its modeling paradigm, a data warehouse cannot accommodate arbitrary business change. With a data lake, even as new applications keep emerging in the Internet industry and both the business and the data model keep changing, data can still enter the lake easily. Collection, cleaning, and standardization can be deferred until the business actually needs them. For an enterprise, the data lake is therefore more flexible and adapts faster to changes in front-end business.
In terms of application, a data warehouse stores structured data and suits fast BI and decision support, while a data lake can store data in any format and often yields greater value from the data through mining.
Although their application scenarios and architectures differ, the data warehouse and the data lake are not opposites. In some scenarios, having the two coexist brings more benefit to the enterprise, and this is what gave rise to the lake-warehouse integration solution.
2. Definition of Lake-Warehouse Integration
Intuitively, lake-warehouse integration combines enterprise-grade data warehouse technology with low-cost data lake storage to give enterprises a unified, shareable data foundation. It avoids moving data between a traditional data lake and a data warehouse by storing raw data, processed and cleaned data, and modeled data together in one integrated "lake warehouse". This system can serve high-concurrency, precise, high-performance queries over historical and real-time data for business use, and it can also carry analytical workloads such as reporting, batch processing, and data mining.
The prevailing view in the industry is that lake-warehouse integration needs to connect the data warehouse and data lake systems so that data and computation flow freely between the lake and the warehouse, forming a complete and organic big data ecosystem.
The lake-warehouse integration approach helps enterprises build a new, integrated data platform. Supported by machine learning and AI algorithms, it closes the loop between data lake and data warehouse and improves business efficiency. The capabilities of the two are fully combined and complement each other, while connecting to the diverse computing ecosystem in the upper layer.
At present, the three main open source data lake solutions on the market are Delta, Apache Iceberg, and Apache Hudi.
Delta was launched by Databricks, the commercial company behind Apache Spark, and is used less by Internet companies in China. Apache Hudi is a scan-optimized data storage abstraction for analytical workloads. It enables datasets on DFS to accept changes with minute-level latency and supports incremental processing of those datasets by downstream systems. Apache Iceberg is an open table format for large-scale data analytics. At present its community attention does not match Delta's and its features are not as rich as Hudi's, but it has a highly abstract and very elegant design.
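The two Hudi capabilities mentioned above, record-level changes and incremental consumption by downstream systems, can be pictured with a toy in-memory table. This is a conceptual sketch only, not Hudi's API or storage layout: the class `ToyUpsertTable`, its method names, and the integer commit counter are all hypothetical.

```python
class ToyUpsertTable:
    """Toy model of a table with record-level upserts and incremental reads."""

    def __init__(self):
        self._rows = {}        # record key -> (commit_time, row)
        self._commit_time = 0  # monotonically increasing commit counter

    def upsert(self, rows, key="id"):
        """Insert new rows or overwrite existing ones; return the commit time."""
        self._commit_time += 1
        for row in rows:
            self._rows[row[key]] = (self._commit_time, dict(row))
        return self._commit_time

    def snapshot(self):
        """Latest state of every record, ordered by record key."""
        return [self._rows[k][1] for k in sorted(self._rows)]

    def incremental(self, since_commit):
        """Only the rows written by commits later than `since_commit`."""
        return [
            self._rows[k][1]
            for k in sorted(self._rows)
            if self._rows[k][0] > since_commit
        ]


table = ToyUpsertTable()
first = table.upsert([{"id": 1, "city": "Hangzhou"}, {"id": 2, "city": "Beijing"}])
second = table.upsert([{"id": 2, "city": "Shanghai"}, {"id": 3, "city": "Shenzhen"}])

# A downstream job that already processed `first` only needs the changed rows.
changed = table.incremental(since_commit=first)
```

Here `changed` contains only the rows for ids 2 and 3, so a downstream job does not have to rescan the record that was untouched by the second commit.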
The Shuqi platform, the core product of Shulan Technology, builds its data lake on Iceberg and Hudi. The following describes Shulan Technology's practical experience with lake-warehouse integration, covering the technical architecture, data ingestion into the lake, and data warehouse construction.
3. Practice of Lake-Warehouse Integration
In the lake-warehouse integration solution provided by Shulan Technology, data is stored uniformly on Iceberg + HDFS, and three engines, Flink, Spark, and Trino, access the data in the lake to provide different types of services.
A quasi-real-time data warehouse is built on Flink + Iceberg, turning the original T+1 offline warehouse into a quasi-real-time one. This improves the timeliness of the warehouse as a whole and better supports upstream and downstream businesses. In the warehouse processing layers, Trino can be used for simple queries. Iceberg also supports streaming reads, so Flink can be connected directly at the middle layers of the system to run batch or streaming jobs that further process intermediate results and output them downstream.
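The difference between a T+1 reload and a streaming read can be sketched with a toy snapshot-based table: each commit adds a snapshot, and a downstream reader remembers the last snapshot it consumed and asks only for newer ones. This is a simplified model for illustration, not Iceberg's API. The class `ToySnapshotTable`, its methods, and the integer snapshot ids are assumptions.

```python
class ToySnapshotTable:
    """Toy append-only table: every commit produces a new numbered snapshot."""

    def __init__(self):
        self._snapshots = []  # list of (snapshot_id, rows appended by that commit)

    def commit(self, rows):
        """Append rows as a new snapshot and return its id."""
        snapshot_id = len(self._snapshots) + 1
        self._snapshots.append((snapshot_id, list(rows)))
        return snapshot_id

    def scan(self):
        """Batch read: all rows as of the latest snapshot."""
        return [row for _, rows in self._snapshots for row in rows]

    def read_since(self, last_seen_id):
        """Incremental read: rows from snapshots newer than `last_seen_id`."""
        new = [(sid, rows) for sid, rows in self._snapshots if sid > last_seen_id]
        latest = new[-1][0] if new else last_seen_id
        return latest, [row for _, rows in new for row in rows]


table = ToySnapshotTable()
table.commit(["order-1", "order-2"])

# The reader starts from position 0 and picks up everything committed so far.
position, first_batch = table.read_since(0)

table.commit(["order-3"])

# On the next poll it receives only the rows added since its last position.
position, second_batch = table.read_since(position)
```

A batch engine would call `scan()` and see all three rows, while the polling reader sees `order-3` shortly after it is committed instead of waiting for the next daily load.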
In a concrete example, the data of the demo_users table in MySQL is ingested into the lake in real time through Flink CDC and lands in a table in the ODS layer of the data warehouse. The ODS layer is then joined with a real-time stream, flink_demo_complaint_data, to produce a table in the DWD layer.

Sample code :
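As a stand-in for the Flink job, here is a minimal plain-Python sketch of the same flow: change events from demo_users are applied to an ODS copy, and the complaint stream is joined against it to produce DWD rows. It is not the platform's actual code. Only the table and stream names come from the example. The column names (`id`, `name`, `user_id`, `complaint_id`, `text`) and the shape of the change events are assumptions made for illustration.

```python
def apply_cdc(ods_users, event):
    """Apply one change event to the ODS copy of demo_users (keyed by id)."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        ods_users[row["id"]] = row
    elif op == "delete":
        ods_users.pop(row["id"], None)


def enrich_complaints(complaints, ods_users):
    """Join each complaint with the current ODS user row to build DWD rows."""
    for complaint in complaints:
        user = ods_users.get(complaint["user_id"])
        if user is not None:  # inner join: drop complaints from unknown users
            yield {**complaint, "user_name": user["name"]}


# Hypothetical change events captured from demo_users.
cdc_events = [
    {"op": "insert", "row": {"id": 1, "name": "ann"}},
    {"op": "insert", "row": {"id": 2, "name": "bob"}},
    {"op": "update", "row": {"id": 2, "name": "bobby"}},
    {"op": "delete", "row": {"id": 1}},
]

# Hypothetical records from the flink_demo_complaint_data stream.
complaint_stream = [
    {"complaint_id": 100, "user_id": 2, "text": "late delivery"},
    {"complaint_id": 101, "user_id": 1, "text": "wrong item"},
]

ods_demo_users = {}
for event in cdc_events:
    apply_cdc(ods_demo_users, event)

dwd_complaints = list(enrich_complaints(complaint_stream, ods_demo_users))
```

After the four change events, only user 2 remains in the ODS copy (with the updated name), so the DWD result contains the single complaint from that user. A real streaming join also has to decide how to treat complaints that arrive before or after the matching user change, which this sketch deliberately ignores.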

Once the CDC data has been written into Iceberg, the common compute engines, such as Presto (Trino), Spark, and Hive, are connected as well, so that they can all read the latest data in the Iceberg tables in real time.

The figure below shows quasi-real-time queries on Iceberg table data through Trino, which provide customized SQL-based APIs to business teams.

The solution also targets pain points common to lake-warehouse solutions on the market today: the lack of a data cache layer, which slows data access, and the lack of a unified programming model, for example having to write two kinds of SQL, one for Spark and one for Flink, to cover batch and streaming. Shulan Technology is introducing Alluxio and Beam into the Shuqi platform to address these problems.

Data caching based on Alluxio
In the current architecture, the compute engine layer operates on the underlying file storage system through the APIs provided by Iceberg or Hudi, which makes data reads and writes slow. Alluxio is therefore being considered as a data orchestration layer to speed up reads and writes on the data lake. When Spark, Flink, or Trino accesses the file system, Alluxio acts as a virtual distributed storage system that accelerates data access, and it is deployed alongside each compute cluster.
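The effect of placing a cache layer between the engines and the underlying storage can be sketched as a read-through cache. This is a toy in-memory model of the caching idea only, not Alluxio's API. The classes `SlowStore` and `ReadThroughCache`, the file paths, and the capacity of two entries are hypothetical.

```python
from collections import OrderedDict


class SlowStore:
    """Stands in for the underlying file storage; counts how often it is hit."""

    def __init__(self, files):
        self._files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self._files[path]


class ReadThroughCache:
    """Serves repeated reads from memory; falls back to the store on a miss."""

    def __init__(self, store, capacity=2):
        self._store = store
        self._capacity = capacity
        self._cache = OrderedDict()  # path -> data, least recently used first

    def read(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
            return self._cache[path]
        data = self._store.read(path)      # cache miss: go to the slow store
        self._cache[path] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data


store = SlowStore({"/lake/a.parquet": b"A", "/lake/b.parquet": b"B"})
cache = ReadThroughCache(store)

# Four reads from the engines, but only two distinct files.
for path in ["/lake/a.parquet", "/lake/a.parquet", "/lake/b.parquet", "/lake/a.parquet"]:
    cache.read(path)
```

In this run the underlying store is read only twice, once per distinct file, and the repeated reads are served from memory. A distributed cache additionally has to handle invalidation when the underlying files change, which this sketch leaves out.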
Unified programming model based on Beam
Apache Beam is an open source unified programming model that unifies streaming and batch processing behind a single API. The resulting distributed data processing jobs can run on various distributed execution engines (for example Spark and Flink), and users can freely switch the execution engine and execution environment of those jobs. Beam is therefore planned as the unified programming model, with a unified IDE provided at the product layer of the data platform.
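The idea of describing a job once and choosing the execution style separately can be sketched in plain Python. This is a conceptual toy, not Beam's API: the `Pipeline` class and the `run_batch` and `run_streaming` functions are hypothetical stand-ins for a pipeline definition and two different runners.

```python
class Pipeline:
    """An engine-independent description of per-record steps."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self


def run_batch(pipeline, records):
    """'Batch engine': apply each step to the whole bounded dataset in turn."""
    data = list(records)
    for kind, fn in pipeline.steps:
        if kind == "map":
            data = [fn(r) for r in data]
        else:
            data = [r for r in data if fn(r)]
    return data


def run_streaming(pipeline, records):
    """'Streaming engine': push records through all steps one at a time."""
    for record in records:
        keep = True
        for kind, fn in pipeline.steps:
            if kind == "map":
                record = fn(record)
            elif not fn(record):
                keep = False
                break
        if keep:
            yield record


# The job is written once...
pipeline = (
    Pipeline()
    .filter(lambda r: r["amount"] > 0)
    .map(lambda r: {**r, "amount_cents": r["amount"] * 100})
)

orders = [{"id": 1, "amount": 5}, {"id": 2, "amount": 0}, {"id": 3, "amount": 2}]

# ...and executed by either "engine" without changing its definition.
batch_result = run_batch(pipeline, orders)
stream_result = list(run_streaming(pipeline, iter(orders)))
```

Both executions produce the same two output records. The point of the sketch is only that the job definition does not mention which engine runs it, which is the property a unified IDE would rely on.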
