
Digital Intelligence Learning | Lakehouse Integration: Practice and Exploration

2022-06-28 02:37:00 【Shulan Technology dtwave】

About this column

Shulan Technology has launched the column "Technical School+", which focuses on cutting-edge technology, tracks industry trends, and shares front-line R&D experience and application practice.

This installment comes from Bai Song, deputy general manager of the Shulan Technology R&D center, who shares the company's practice and exploration of lakehouse integration.

Introduction

As society digitalizes at an accelerating pace, both the scale and the variety of data continue to grow rapidly. To meet the demands of increasingly complex business data analysis, big data infrastructure has evolved from the database to the data warehouse, then to the data lake, and now to the integrated lakehouse.

As the approach has been deployed successfully in many fields and scenarios, the lakehouse concept has entered the mainstream. AWS, Alibaba, Huawei, Google, Tencent, and other major vendors have launched cloud-based data lake products. In "Hype Cycle for Data Management 2021", published by the research firm Gartner, the lakehouse (Lake house) was included in the technology maturity curve for the first time.

In essence, lakehouse integration breaks down the barrier between the data warehouse and the data lake. It unifies fragmented data, reduces data movement during analysis, and provides unified data management, which helps enterprises discover and deliver more value from their data.

Before explaining lakehouse integration, let us first look at the data lake and the data warehouse.

1. Data lake and data warehouse

According to Wikipedia, a data lake is a large repository that stores all kinds of raw enterprise data, where the data can be accessed, processed, analyzed, and transferred. A data lake is a system or repository of data stored in its natural format, usually as objects or files. It is usually a single store of all enterprise data, including raw copies of source system data as well as transformed data used for tasks such as reporting, visualization, analytics, and machine learning. A data lake can hold structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).

A data warehouse, by contrast, is a subject-oriented, integrated, relatively stable data store that reflects historical change. It aggregates structured data from different sources for comparison and analysis in business intelligence. A data warehouse is a repository that holds many kinds of data and is highly modeled.

The differences between a data lake and a data warehouse are as follows:

[Figure: comparison of data lake and data warehouse]

In terms of storage, a data lake can hold structured, semi-structured, and unstructured data, whereas a data warehouse handles only structured data. A data warehouse requires the data to be sorted out and its schema defined before it can be loaded. A data lake ingests the raw data as it is, which leaves many possibilities open for later machine learning and data mining on the lake.
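The contrast between defining a schema before loading and ingesting raw data as-is can be sketched in a few lines of Python. This is only an illustration under assumed names: `USER_SCHEMA`, `warehouse_insert`, `lake_ingest`, and `lake_read` are hypothetical helpers, not part of any product mentioned here.

```python
import json

# Hypothetical warehouse schema: column name -> required Python type.
USER_SCHEMA = {"id": int, "name": str}


def warehouse_insert(table, record, schema=USER_SCHEMA):
    """Schema-on-write: reject any record that does not match the schema."""
    if set(record) != set(schema):
        raise ValueError(
            f"columns {sorted(record)} do not match schema {sorted(schema)}"
        )
    for column, expected in schema.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"column {column!r} is not {expected.__name__}")
    table.append(record)


def lake_ingest(lake, raw_line):
    """Data lake: store the raw payload unchanged, with no validation."""
    lake.append(raw_line)


def lake_read(lake, schema=USER_SCHEMA):
    """Schema-on-read: parse and project only when a consumer asks for data."""
    rows = []
    for raw_line in lake:
        try:
            doc = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # non-JSON payloads stay in the lake for other consumers
        if not isinstance(doc, dict):
            continue
        if all(isinstance(doc.get(c), t) for c, t in schema.items()):
            rows.append({c: doc[c] for c in schema})
    return rows
```

In this sketch the warehouse refuses a record with an unexpected column, while the lake accepts both that record and a free-form log line, and the cleaning cost is paid only when `lake_read` is called.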

Because a data warehouse is bound by its modeling paradigm, the business it serves cannot change arbitrarily. With a data lake, even when new applications keep appearing, as in the Internet industry, and both the business and the data model keep changing, data can still enter the lake easily. Collection, cleaning, and standardization can be deferred until the business needs them. A data lake is therefore more flexible for the enterprise and adapts faster to changes in front-end business.

In terms of application, a data warehouse stores structured data and suits fast BI and decision support, while a data lake can store data in any format and often yields greater value through data mining.

Although data warehouses and data lakes differ in application scenarios and architecture, they are not opposites. In some scenarios, running both brings more benefit to the enterprise, which is what gave rise to the lakehouse solution.

2. Definition of lakehouse integration

Intuitively, lakehouse integration combines enterprise-grade data warehouse technology with low-cost data lake storage to give the enterprise a unified, shareable data foundation. It avoids moving data between a traditional data lake and a data warehouse by storing raw data, cleansed data, and modeled data together in one integrated "lakehouse". The lakehouse can serve high-concurrency, precise, high-performance queries over historical and real-time data for the business, and it can also carry analytical workloads such as reporting, batch processing, and data mining.

The view generally accepted in the industry today is that lakehouse integration must connect the data warehouse and data lake systems so that data and computation flow freely between the lake and the warehouse, forming a complete and organic big data technology ecosystem.

Lakehouse solutions help enterprises build a new, integrated data platform. With support from machine learning and AI algorithms, they close the loop between data lake and data warehouse and improve business efficiency. The capabilities of the two are combined and complement each other, while connecting to a diverse computing ecosystem in the upper layer.

At present, the three main open-source data lake solutions on the market are Delta, Apache Iceberg, and Apache Hudi.

Delta was launched by Databricks, the commercial company behind Apache Spark, and is used less by Internet companies in China. Apache Hudi is a scan-optimized data storage abstraction for analytical workloads. It lets data sets on DFS accept changes with minute-level latency and supports incremental processing of those data sets by downstream systems. Apache Iceberg is an open table format for large-scale data analytics. At present its community attention trails Delta's and its feature set is less rich than Hudi's, but its design is highly abstract and very elegant.
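The idea of letting downstream systems process a data set incrementally can be illustrated with a toy, append-only table in Python. This is a simplified model under stated assumptions: `SnapshotTable` and its methods are hypothetical and are not the Hudi, Iceberg, or Delta API. Real table formats track data files and metadata rather than rows in memory, and they also handle updates and deletes.

```python
class SnapshotTable:
    """Toy table: every commit appends an immutable snapshot of new rows."""

    def __init__(self):
        # One list of rows per commit; index + 1 is the snapshot id.
        self._snapshots = []

    def commit(self, rows):
        """Append a batch of rows as a new snapshot and return its id."""
        self._snapshots.append(list(rows))
        return len(self._snapshots)

    def scan(self, as_of=None):
        """Full read of the table, optionally as of an earlier snapshot id."""
        upto = len(self._snapshots) if as_of is None else as_of
        return [row for snap in self._snapshots[:upto] for row in snap]

    def incremental_scan(self, after):
        """Return rows committed after snapshot `after`, plus the new position."""
        rows = [row for snap in self._snapshots[after:] for row in snap]
        return rows, len(self._snapshots)
```

A downstream job would remember the position returned by `incremental_scan` and pass it back on its next run, so that each run reads only the newly committed rows instead of rescanning the whole table.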

The Shuqi platform, the core product of Shulan Technology, builds its data lake on Iceberg and Hudi. The following describes Shulan Technology's practical experience with lakehouse integration, covering technical architecture, data ingestion into the lake, and data warehouse construction.

3. Practice of lakehouse integration

In the lakehouse solution provided by Shulan Technology, data is stored uniformly on Iceberg + HDFS, and three engines, Flink, Spark, and Trino, access the data in the lake to provide different types of services.
The quasi-real-time lakehouse data warehouse is built on Flink + Iceberg. The original T+1 offline data warehouse becomes a quasi-real-time one, which improves the timeliness of the warehouse as a whole and better supports upstream and downstream businesses. In the warehouse processing layer, Trino can run simple queries. Iceberg also supports streaming read, so Flink can be connected directly at the middle layer to run batch or streaming jobs that further process intermediate results and output them downstream.

In a concrete example, data from the demo_users table in MySQL enters the lake in real time through Flink CDC and lands in a table in the ODS layer of the data warehouse. The ODS table is then joined with a real-time stream, flink_demo_complaint_data, to produce a table in the DWD layer.

Sample code:

[Figure: sample code]
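The flow in this example, applying change events to an ODS table and joining it with a stream to produce DWD rows, can be sketched in plain Python. The sketch below is not the article's Flink code. It only illustrates the same logic, and the event shape (`op`, `row`) and the column names (`id`, `name`, `user_id`, `text`) are assumptions made for illustration.

```python
def apply_cdc(ods_table, event):
    """Apply one change event to an ODS table keyed by primary key."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        ods_table[row["id"]] = row       # upsert by primary key
    elif op == "delete":
        ods_table.pop(row["id"], None)   # ignore deletes for unknown keys
    else:
        raise ValueError(f"unknown op: {op}")


def join_complaints(ods_users, complaints):
    """Enrich each complaint with the current user row (inner lookup join)."""
    for complaint in complaints:
        user = ods_users.get(complaint["user_id"])
        if user is not None:
            yield {**complaint, "user_name": user["name"]}
```

Here `ods_table` stands in for the ODS-layer copy of demo_users, and the output of `join_complaints` stands in for the DWD-layer table. In the real pipeline both steps would run continuously inside the stream engine rather than over in-memory Python objects.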
Once the CDC data has entered Iceberg, it is also opened up to the common compute engines, such as Presto (Trino), Spark, and Hive, all of which can read the latest data in the Iceberg table in real time.

The figure below shows quasi-real-time queries on Iceberg table data through Trino, which provide business teams with customized SQL-based APIs.

[Figure: quasi-real-time query of Iceberg table data with Trino]
Lakehouse solutions on the market today also share some common pain points. They lack a data cache layer, which slows data access, and they lack a unified programming model, so batch and stream jobs require two kinds of SQL, one for Spark and one for Flink. Shulan Technology introduces Alluxio and Beam into the Shuqi platform to address these problems.


Data caching based on Alluxio
In the current architecture, the compute engine layer operates on the underlying file storage system through the APIs provided by Iceberg or Hudi, which makes data reads and writes slow. Alluxio is therefore being considered as a data orchestration layer to speed up reads and writes on the data lake. When Spark, Flink, or Trino accesses the file system, Alluxio acts as a virtual distributed storage system that accelerates data access and is co-located with each compute cluster.
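The benefit of a cache layer between the compute engines and slow storage can be shown with a small read-through cache in Python. This is a generic LRU sketch, not Alluxio's API or its actual caching policy. `ReadThroughCache` and the `backing_read` callable are hypothetical names.

```python
from collections import OrderedDict


class ReadThroughCache:
    """LRU read-through cache between a compute engine and slow storage."""

    def __init__(self, backing_read, capacity=2):
        self._backing_read = backing_read   # callable: path -> bytes
        self._capacity = capacity
        self._entries = OrderedDict()       # path -> cached bytes, LRU first
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._entries:
            self.hits += 1
            self._entries.move_to_end(path)    # mark as most recently used
            return self._entries[path]
        self.misses += 1
        data = self._backing_read(path)        # slow path: go to the store
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return data
```

Repeated reads of the same path are served from memory and never reach the backing store, which is the effect the cache layer is meant to have on frequently accessed lake files.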
Unified programming model based on Beam
Apache Beam is an open-source unified programming model that unifies stream and batch processing behind a single abstract API. The resulting distributed data processing job can run on various distributed engines (for example Spark or Flink), and users are free to switch the execution engine and execution environment. The plan is therefore to introduce Beam as the unified programming model and to provide a unified IDE at the product layer of the data platform.
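The core idea, writing the processing logic once and choosing the execution mode separately, can be illustrated without any framework. The Python sketch below is a toy model and does not use Apache Beam's API. `build_pipeline`, `run_batch`, `run_streaming`, and the two example transforms are all hypothetical names.

```python
def build_pipeline(*transforms):
    """Compose per-record flat-map transforms into one reusable pipeline."""
    def pipeline(records):
        for record in records:
            out = [record]
            for transform in transforms:
                out = [r for item in out for r in transform(item)]
            yield from out
    return pipeline


def parse(line):
    """Turn 'user,amount' text into a record."""
    user, amount = line.split(",")
    return [{"user": user, "amount": int(amount)}]


def only_large(record):
    """Keep records with amount >= 100, drop the rest."""
    return [record] if record["amount"] >= 100 else []


def run_batch(pipeline, dataset):
    """Bounded runner: process the whole data set and return a list."""
    return list(pipeline(dataset))


def run_streaming(pipeline, source, sink):
    """Unbounded-style runner: push each result to a sink as it is produced."""
    for result in pipeline(source):
        sink(result)
```

The same `build_pipeline(parse, only_large)` object can be passed to either runner, so the business logic does not change when the execution mode does. A real unified model adds much more, such as windowing and distributed execution, which this sketch leaves out.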
