
Data Lake Series, Part 1: A Brief History of Data Platforms, from Data Warehouse and Data Lake to the Lakehouse

2022-07-01 14:20:00 InfoQ

1. Foreword

We live in an era of big data, and the volume of data inside enterprises has exploded. How to meet the challenges of storing and processing massive data, and how to build a good data platform, are key questions for any enterprise. From the data warehouse to the data lake and now the lakehouse, new methods and technologies for building data platforms keep emerging.

Understanding the evolution behind these methods and technologies, the key problems they address, and their core technical principles helps enterprises build better data platforms. That is the motivation behind Baidu AI Cloud's data lake series.

This series consists of several parts:

This article opens the series by introducing the history of data platform technology and some of the key technical problems encountered along the way.

The following articles are organized into two themes, storage and compute, introducing the core technical principles and best practices of the data platform from each perspective, together with Baidu AI Cloud's thinking on these issues.

2. The Value of Data

"Data is the new oil." — Clive Humby, 2006

After Clive Humby declared in 2006 that "data is the new oil", the phrase quickly became common wisdom. Humby's own career is perhaps the best footnote to the big data era: first a mathematician, then co-founder of a data company with his wife, and later founder of an investment fund focused on the data field. When he coined the phrase, Humby was working hard to sell the company he and his wife had founded to the capital market. Capital markets love such simple, powerful soundbites, and five years later his company sold at a good price.

For data owners and for practitioners in the data industry, however, this sentence tells only half the truth. Michael Palmer added to it:

"Data is just like crude. It's valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analysed for it to have value." — Michael Palmer

In short: data must be refined before it releases its real value.

For an enterprise, the easiest part of "big data" to understand and act on is the "big". Once a company realizes that the data generated across its operations may hold the secrets of revenue and user growth, it tends to accumulate large amounts of raw data. This raw data is the crude oil: precious, but full of noise, impurities, and even errors, with the relationships between different datasets far from obvious. It is still a long way from the insights the company is after. Unlocking those insights requires continued "refining": using appropriate methods to organize, purify, combine, and analyze the raw data, discarding the dross and keeping the essence, revealing the truly valuable part of the data, and finally turning it into a driving force for business growth.

The infrastructure that supports this entire "refining" process is the enterprise data platform.

The data platform is to data what the refinery is to crude oil. With the explosive growth of enterprise data and more and more enterprises moving to the cloud, the storage and processing challenges facing data platforms keep growing. Which technologies to use to build and iterate this platform has long been a hot topic in the industry, and new technologies and ideas continue to emerge. They can be grouped into two typical routes: the data warehouse (Data Warehouse) and the data lake (Data Lake). In recent years the boundary between these two routes has blurred as they evolve; they are gradually converging into the so-called modern data architecture (Modern Data Architecture), also known as the lakehouse (Data Lakehouse).

3. Composition of a Data Platform

Before discussing specific technical issues, let's look at what a data platform in the industry consists of:

Data platform = storage system + compute engine + interface

The role of each part can be summarized as follows.

3.1 Data Storage

Data storage solves the problem of storing the raw material. It is characterized by a long time span, scattered sources, and centralized storage.

  • "Long time span" means the data store should retain as much of the full history as possible. The value of historical data to an enterprise lies in "learning from history": observing trends, health, and other signals over a longer time horizon.
  • "Scattered sources" reflects the fact that data usually comes from various business systems: records in relational databases such as MySQL and Oracle, or logs written by business systems. Enterprises may also purchase or collect third-party datasets to supplement internal data. The data platform must be able to import data from all these sources; the storage format used after import depends on the particular technical scheme.
  • "Centralized storage" means building a single source of truth. Wherever the data originated, once it is incorporated into the data platform, the platform is the only trusted source. Centralization here is mostly logical; physically the data may still be dispersed, for example when an enterprise adopts a multi-cloud architecture and stores data with different cloud vendors, with the platform shielding users from the data's actual location. Centralized storage also enables finer-grained control, preventing data access rights from spreading beyond the necessary scope.

3.2 Compute Engine

The compute engine's goal is to extract useful information from the data store. Unfortunately, the industry has no single unified compute engine; depending on the model, latency requirements, and data volume, different solutions are used. Typically, deep learning tasks use frameworks such as TensorFlow, PyTorch, and PaddlePaddle; offline computation such as data mining uses engines like Hadoop MapReduce and Spark; and business intelligence analysis uses MPP data warehouses such as Apache Doris.

Different compute engines place different requirements on the storage format:

  • Some engines expose open, low-level interfaces and are very loose about data format. For example, Hadoop MapReduce and Spark can directly read files in HDFS; the engine itself does not care what format the data is in, and the business decides how to interpret and use it. Of course, some formats (such as Apache Parquet) are so widely used that, on top of the low-level interface, the engine may ship built-in processing logic for them, saving businesses from repeated development (see the sketch after this list).
  • Other engines are relatively closed: they support only a limited set of data formats, may not expose their internal format at all, and require all external data to be imported before processing. For example, Apache Doris decides for itself how data is stored; the benefit is that storage and compute can cooperate more tightly, yielding better performance.
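To make the contrast concrete, here is a minimal PySpark sketch of the open style; the paths and field layout are hypothetical. The engine first reads raw files without caring about their format, then uses its built-in reader for a widely adopted format like Parquet.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-engine-demo").getOrCreate()

# Format-agnostic access: each line is an opaque string and the job's own
# logic decides how to interpret it.
raw = spark.read.text("hdfs:///logs/app/2022-07-01/")   # hypothetical path
parsed = raw.selectExpr(
    "split(value, '\t')[0] AS user_id",
    "split(value, '\t')[1] AS action",
)
parsed.show(3)

# Format-aware access: the engine understands Parquet's schema and column
# layout, so it can prune columns and row groups by itself.
events = spark.read.parquet("hdfs:///lake/events/")     # hypothetical path
events.select("user_id").where("action = 'click'").show(3)
```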

Valuable intermediate data produced by the compute engine is generally saved back into the data store, so that other businesses can use it as well.

3.3  Interface

The interface determines how users of the data platform drive the compute engine. The most popular is the SQL language interface; some engines also provide programming interfaces at various levels of encapsulation. For an enterprise, the fewer and more standard the interfaces, the friendlier the platform is to its users.

4. Two Routes for the Data Platform: Data Warehouse and Data Lake

4.1 Data Warehouse

The data warehouse appeared much earlier than the data lake. The initial scenario was business intelligence (Business Intelligence): put simply, management wanted a dashboard for viewing all kinds of business data, showing statistics and trends, with the data sourced from ERP, CRM, and business databases. To serve this requirement well, the best approach is to collect data from the various sources inside the enterprise, archive it at a single site, maintain the history there, and answer all related queries from that site. This unified site is the data warehouse.

Mainstream data warehouse implementations are based on online analytical processing (Online Analytical Processing, OLAP) technology. Before the data warehouse was born, businesses already made wide use of relational databases such as MySQL and Oracle, which are built on online transaction processing (Online Transaction Processing, OLTP) technology. Data in an OLTP database has a fixed format and is well organized, and the SQL query language it supports is easy to use and understand; OLTP databases are also among the most important data sources for the warehouse. Building the data warehouse directly on an OLTP database was therefore a natural idea. But it soon became clear that the warehouse workload has its own characteristics, OLTP-based systems hit bottlenecks, and OLAP got the opportunity to develop independently:

  • On one hand, OLTP databases store data row-oriented: all the fields of a row are stored together, so even a query that needs only a few fields must read entire rows and then extract what it needs. Warehouse tables usually have many fields, which makes such reads inefficient. Column-oriented storage keeps different columns or column families separately, so a query reads only the parts it needs. This greatly reduces the amount of data read and suits the warehouse scenario much better (see the sketch after this list).
  • On the other hand, traditional OLTP databases scale up by improving the hardware of a single machine, which has a low ceiling. A warehouse query reads a very large amount of data and applies the same read logic repeatedly over the same fields, which is ideal for parallel processing within and across machines. Scaling out across a cluster shortens query time; this is the core idea of the MPP (Massively Parallel Processing) compute engine.
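As a toy illustration of the columnar idea, the following sketch uses pyarrow (table contents are made up) to write a Parquet file and then read back only the columns a query needs, leaving the other columns' bytes untouched on disk.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A "wide" table standing in for a warehouse fact table (values made up).
table = pa.table({
    "order_id": [1, 2, 3],
    "user_id":  [10, 20, 10],
    "amount":   [9.9, 19.9, 5.0],
    "comment":  ["a", "b", "c"],   # imagine dozens more columns here
})
pq.write_table(table, "orders.parquet")

# Column-oriented read: only the two requested columns are deserialized;
# the bytes of every other column are never read.
subset = pq.read_table("orders.parquet", columns=["user_id", "amount"])
print(subset.to_pydict())
```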

Modern data warehouse architecture is therefore characterized by distribution, columnar storage, and an MPP compute engine. When a user submits a computing task, the warehouse's MPP engine splits the computation, each node processes its part in parallel, and the final aggregated result is returned to the user. The data warehouse is the typical "Schema-on-Write" model: data must be processed into a predefined format, the schema, when written. It is as if the warehouse administrator decided on a standard packing box in advance; all goods (data) must be packed into boxes before they can enter the warehouse neatly.

The raw data from a source usually differs from the well-defined schema, so imported data goes through the ETL process, an abbreviation of its three steps: extraction (Extract), transformation (Transform), and load (Load). The Extract stage reads from the source and performs data cleansing (data cleansing) to correct errors and remove duplicates. The Transform stage then does whatever processing is needed to convert the data into the specified schema. Finally, the Load stage writes the data into the warehouse.
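Below is a minimal ETL sketch in PySpark, assuming a CSV dump of an OLTP orders table; all table and column names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: pull the raw export (here a CSV dump of an OLTP table).
raw = spark.read.option("header", True).csv("/staging/orders.csv")

# Transform: cleansing (drop duplicates, discard bad rows) plus conversion
# into the schema the warehouse table declares.
clean = (raw.dropDuplicates(["order_id"])
            .where(F.col("amount").isNotNull())
            .select(F.col("order_id").cast("bigint"),
                    F.col("user_id").cast("bigint"),
                    F.col("amount").cast("decimal(10,2)"),
                    F.to_date("created_at").alias("order_date")))

# Load: append into the warehouse-managed table.
clean.write.mode("append").saveAsTable("dw.fact_orders")
```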

4.2  Data Lake

The Bayan Obo mine in the Inner Mongolia Autonomous Region is the only deposit in the world that contains all 17 rare-earth elements. For some 60 years it was mined as an iron mine; later, as the strategic value of rare earths rose and mining technology advanced, it was transformed into the largest rare-earth deposit in China.

The point of this story is the importance of raw data. Raw data is like the Bayan Obo mine: besides the iron already identified, it may also hold rich reserves of rare earths. The warehouse's "Schema-on-Write" model requires us to know exactly what we are digging for before processing the data; as time passes and only the processed history remains in the warehouse, we may never even know which "rare earths" were discarded.

Retaining more of the original data and avoiding the loss of important unknown information: this is the original intent of the data lake concept. The data lake advocates storing all data, whether structured data from databases or unstructured data such as video, images, and logs, in its original format on a unified storage foundation. Like rivers, the various data sources converge into this single "lake", and all data consumers are supplied from it.

Because explicit structural information is absent, the data lake uses the "Schema-on-Read" model: the consumer converts data into the appropriate structure after reading it. Compared with the warehouse's "Schema-on-Write", the processing flow becomes ELT, that is, the Transform stage comes after Load.
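Here is a minimal Schema-on-Read sketch in PySpark, assuming JSON logs that landed in the lake as-is; the schema below is this job's private view, and another job may impose a different one on the very same files.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Structure is imposed only at read time; the files themselves stay raw.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("action",  StringType()),
    StructField("ts",      LongType()),
])

events = spark.read.schema(schema).json("s3://my-lake/raw/app-logs/")
events.groupBy("action").count().show()
```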

Because "Schema-on-Read" is so loosely structured, it places few constraints on the compute engine, and indeed the industry has developed a wide variety of engines for different scenarios.

The traditional data lake, essentially equivalent to the big data system, has gone through two phases: "storage-compute coupled" and "storage-compute separated".

Stage 1: the storage-compute coupled data lake.

At this stage enterprises built their data lakes on the Hadoop ecosystem, using HDFS as the data store and engines such as Hadoop MapReduce and Spark for compute. Compute and storage resources lived on the same machines, so expanding the cluster expanded compute power and capacity at the same time. When cloud computing took off, this architecture was moved from the offline IDC machine room to the cloud unchanged.

Stage 2: the storage-compute separated data lake.

After a period of practice, the coupled architecture ran into bottlenecks, mainly in several respects:

  • Compute and storage cannot be scaled separately. In reality most users' needs for the two resources do not match, so the coupled architecture inevitably wastes one kind of resource.
  • As storage capacity and file counts exploded, the single-point NameNode architecture of HDFS hit a metadata performance bottleneck. Enterprises upgraded the NameNode's hardware, ran multiple HDFS clusters, or adopted HDFS Federation to relieve the pressure, but none of these fundamentally solved the problem, and they placed a heavy burden on the platform's operations staff.
  • Storage cost is another pain point of the coupled architecture. HDFS's 3-replica mechanism is ill-suited to colder data, costing at least twice as much as erasure coding. On the cloud there is also replica amplification: the cloud disk has its own replication, so HDFS built on cloud disks stores even more physical copies, possibly up to 9 (see the back-of-the-envelope arithmetic after this list).
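The replica arithmetic is easy to make concrete. A minimal calculation, assuming 3 HDFS replicas, a cloud disk that itself keeps 3 copies of every block, and a typical RS(6,3) erasure-coding layout:

```python
# Physical bytes needed to store 100 TB of logical data under each scheme.
logical_tb = 100

hdfs_3_replica     = logical_tb * 3            # 300 TB on bare disks
hdfs_on_cloud_disk = logical_tb * 3 * 3        # 900 TB: the "9 copies" case
erasure_rs_6_3     = logical_tb * (6 + 3) / 6  # 150 TB: only 1.5x overhead

print(hdfs_3_replica, hdfs_on_cloud_disk, erasure_rs_6_3)
```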

While solving these problems, people noticed the object storage services of cloud vendors: near-infinitely scalable in performance and capacity, low-cost, serverless storage systems. Apart from some gaps in POSIX compatibility relative to file system interfaces (such as atomic rename, or reading a file while it is being written), object storage addresses all the pain points above and is a suitable replacement for HDFS. In fact Ozone, the next-generation system from the HDFS lineage, also borrows ideas from object storage to solve these problems.

The "storage-compute separated" data lake architecture was thus born on top of object storage. Its defining feature is that compute resources and storage resources can be scaled independently.
In the storage-compute separated architecture, the storage is an object storage service provided by a cloud vendor. Compared with self-built HDFS or Ozone, the cloud vendor's biggest advantage comes from scale. Cloud vendors need clusters large enough to hold massive amounts of user data; the more data, the larger the cluster and the more nodes and devices, and thus the higher the overall performance it can provide. A single user can "borrow" performance higher than a self-built HDFS of comparable scale could offer. A sufficiently large storage resource pool is the premise and foundation on which the storage-compute separated architecture works.

With scalability, performance, and cost solved by object storage, its serverless product form lets the data lake's compute engines scale compute power independently: resources can be allocated only when computation is needed and destroyed immediately afterwards, paying only for what is used, which is optimal for both cost and efficiency. None of this was possible before the storage-compute separated architecture and cloud computing.

For cloud vendors, this architectural shift put object storage services in the spotlight. It is sweet for the vendors but also tests their technical strength: every boast must be honored in full. The main challenges include:

  • Scale. A single customer may hold petabytes or tens of petabytes, and many customers share the resource pool, so the accumulated object storage easily reaches the EB level, with the corresponding metadata reaching the trillions. Serving EB-scale capacity and trillions of metadata entries in a single cluster requires an excellent hard-core architecture design, with no scalability short board anywhere in the system.
  • Stability. Supporting EB-scale capacity and trillions of metadata entries means tens of thousands or even hundreds of thousands of machines per cluster. On such a machine base, hardware and software failures are routine. Reducing or even eliminating the impact of these uncontrollable factors, and delivering stable latency and throughput with low tail latency, demands high-quality engineering and operations capability.
  • Compatibility. Although object storage as data lake storage has become the consensus, software in the big data ecosystem, whether because of historical baggage or because it genuinely cannot be changed, still depends on some HDFS-specific capabilities in certain scenarios. For example, Spark relies on rename to commit tasks, exploiting HDFS rename's speed and atomicity guarantee; but in AWS S3, the ancestor of object storage, rename is not supported and can only be crudely simulated by "copy + delete", which is slow and non-atomic (see the sketch after this list). If the typical object storage of cloud vendors replaces 70% of HDFS, the remaining 30% depends on how each vendor further closes these compatibility gaps, so that the storage-compute separated architecture can be implemented more thoroughly.
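The "copy + delete" simulation is easy to see in a hedged boto3 sketch (bucket and keys are hypothetical): two separate requests replace what HDFS performs as one atomic metadata operation.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-lake"
src, dst = "tmp/part-0000.parquet", "final/part-0000.parquet"

# Step 1: server-side copy of the whole object (cost grows with its size).
s3.copy_object(Bucket=bucket, Key=dst,
               CopySource={"Bucket": bucket, "Key": src})

# Step 2: delete the source. A crash between the two steps leaves both or
# neither key present, breaking the atomicity HDFS rename would guarantee.
s3.delete_object(Bucket=bucket, Key=src)
```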

4.3 Data Warehouse vs. Data Lake

The data warehouse and the data lake can be summarized as:

Data warehouse = structured data storage system + built-in compute engine + SQL interface
Data lake = raw data storage system + multiple compute engines + multiple interfaces including SQL

The data warehouse and the data lake are like the iOS and Android of mobile phones:

  • The data warehouse is like iOS: a relatively closed system with many constraints on how data flows in and out and on usage scenarios, but easier to use well. The closed system gives it stronger control, making optimizations such as storage format and compute parallelism easier; it still dominates query scenarios that demand extreme performance.
  • The data lake is like Android: it emphasizes openness and hands almost all choices to the user, with many "phone manufacturers" (compute engines) to choose from. But using it well requires a degree of expertise, and misuse has side effects; it easily leads to a "data swamp" (Data Swamp).

5. The Modern Data Platform: the Lakehouse

5.1 The Data Lake's Dilemma

The data lake returns the decisions about "what data to store and how to use it" to the user, with very loose constraints. But if users fail to manage data as it enters the lake, dumping useful and useless, high-quality and low-quality data in all at once, it becomes hard to find the data you need when you need it. Over time the lake turns into a huge garbage dump, whose standard name is the "data swamp".

To keep the data lake from ultimately becoming a data swamp, several important problems must be solved:

Problem 1: data quality.

Under pure "Schema-on-Read", every computation deals directly with data in its raw format and filters out the useless parts, and this work is repeated every single time, slowing computation and wasting compute power.

A feasible approach is to borrow the warehouse's practice inside the data lake: run one or more rounds of ETL to pre-process the raw data, converting it into data that is friendlier to compute engines and of higher quality. The raw data is not deleted, and the ETL outputs are also stored in the lake. This preserves the original data while ensuring computational efficiency.
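A sketch of this "ETL inside the lake" pattern, with hypothetical paths: the raw zone stays untouched while a cleansed, engine-friendly Parquet copy is written back to the lake for other businesses to reuse.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-refine-demo").getOrCreate()

# The raw zone is read but never modified or deleted.
raw = spark.read.json("s3://my-lake/raw/clickstream/")

refined = (raw.dropDuplicates(["event_id"])
              .where(F.col("user_id").isNotNull())
              .withColumn("event_date", F.to_date(F.from_unixtime("ts"))))

# One round of ETL; further rounds could aggregate this into topic tables.
(refined.write.mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://my-lake/cleansed/clickstream/"))
```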

Problem 2: metadata (metadata) management.

Metadata is data that describes data. Its importance lies in answering, on behalf of the data, the great philosophical questions: "Who am I? Where am I? Where did I come from?". The format of the data (for example, the field definitions of a database table file), its location (where it is stored), and its lineage (which upstream data it was derived from) all rely on metadata to explain.

Building complete metadata for the data lake helps users make better use of the data. Metadata generally divides into two parts, both important. One is a centralized Data Catalog service, usually with some automatic discovery and fuzzy search capability, used to manage and discover what data exists in the lake. The other is metadata embedded in the data itself, which ensures the data can be interpreted accurately even after it is moved. By analogy, the data catalog is like the shelves of a library: sorting and filing books makes them quick to locate. The embedded metadata is like a book's table of contents: it tells you what the book covers and on which page. When a book moves from one shelf to another, its position changes but its content does not.

Metadata management must also solve data permissions. Whether the underlying storage of the lake is HDFS or object storage, the permissions it offers are per directory and per file, a granularity that does not match what the upper business needs. For example, the dataset of an image recognition AI task contains many small files that should be treated as a whole; there should be no such situation as "a user may access some of these files but not others". Conversely, a single file may store business order data, and a salesperson and a company executive should see different ranges of that data. Both call for finer-grained permission control.

Problem 3: data versioning.

Data rarely enters the lake as a one-off deal, imported once and never updated. For example, when order data from the online user order database is collected into the lake for later analysis, new orders must be synchronized continuously. The simplest way to handle repeated imports is a full re-import each time, but this is obviously too crude: it inflates resource consumption and makes each import slow.

Supporting incremental updates is therefore an important data lake capability. It raises thorny questions, including: 1) how to serve read requests during an update; 2) how to recover when an update is interrupted; 3) how to identify incomplete updates; 4) how to recover after data is polluted by a bad operation. In databases and data warehouses the answer to these questions is ACID. The table formats (Table Format) that have emerged in the data lake field in recent years, such as Apache Iceberg, Apache Hudi, and Delta Lake, are dedicated to supplementing object storage with these capabilities, and have become an important part of the data lake.
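As one concrete example among the three table formats, here is a hedged Delta Lake sketch (paths hypothetical) of the two capabilities discussed above: an atomic incremental upsert via MERGE, and reading an older version to recover from a bad write.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("delta-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "s3://my-lake/tables/orders"                      # hypothetical
updates = spark.read.json("s3://my-lake/raw/order-updates/")

# Incremental update: MERGE applies new and changed orders atomically;
# concurrent readers keep seeing the previous consistent snapshot.
(DeltaTable.forPath(spark, path).alias("t")
 .merge(updates.alias("u"), "t.order_id = u.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Recovery: if a bad job pollutes the table, read an earlier version.
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
```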

Problem 4: data flow.

Real-world scenarios are complex and varied, with different requirements for real-time processing and accuracy, so the industry has developed many compute engines. If each engine speaks only its own dialect and recognizes only its own storage format, then whenever the same data is processed by different engines, Schema-on-Read or ETL must be repeated, wasting a great deal of resources. This is clearly unreasonable.

The ideal is that no translation is needed because everyone speaks a common language. As big data developed, a set of commonly used data formats (Apache Parquet, Apache ORC, etc.) and table formats (Apache Iceberg, Apache Hudi, Delta Lake, etc.) gradually formed. Supported by more and more compute engines, these technologies act, in a sense, as the lingua franca of the data lake, easing the data flow problem.
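A tiny interoperability sketch: a Parquet file produced by one tool (pandas here) is consumed directly by another (pyarrow here), and engines such as Spark could read the same file without any re-import.

```python
import pandas as pd
import pyarrow.parquet as pq

# Producer: pandas writes a Parquet file (contents made up).
pd.DataFrame({"user_id": [1, 2],
              "amount":  [9.9, 5.0]}).to_parquet("shared.parquet")

# Consumer: pyarrow reads the very same file; no per-engine ETL repeat.
table = pq.read_table("shared.parquet")
print(table.schema)
```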

5.2 The Trend Toward the Lakehouse

As the data lake iterates, its boundary with the data warehouse keeps blurring, and the two are gradually converging:

  • In the course of fixing the data swamp, to make a very loose ecosystem usable, industry practice has in fact imposed many restrictions on the data lake. Interestingly, these restrictions closely resemble things the data warehouse did all along, such as ETL, ACID, and permission control. The data lake is thus taking on characteristics of the warehouse.
  • After trying a whole round of non-SQL programming interfaces and interaction modes, the industry found that for many scenarios SQL is still the best choice. Meanwhile, data warehouses have grown more open in recent years: their support for the data formats and table formats common in the lake keeps improving, both as built-in ETL targets and as directly queryable external sources. These trends show that the data warehouse, as an important compute engine, can grow on top of the data lake.
  • The data warehouse also faces the limits of coupled storage and compute, and is likewise iterating toward storage-compute separation. Some systems adopt a hot/cold split: hot data lives on the fast local media of the nodes while cold data sinks into the lake, striking a balance between performance and cost. Other, more thoroughly cloud-native warehouse systems keep the full dataset in the lake and compensate for the lake's speed with local-node caches. This design simplifies the warehouse's architecture, frees it from worrying about data reliability, and lets multiple read-only clusters share the same data.
  • Important techniques and methods from the warehouse field can be borrowed by the big data engines on the lake, and vice versa. For example, acceleration techniques matured in warehouses such as ClickHouse, like vectorized execution (vectorization) and LLVM JIT, have been borrowed to build native engines for Spark; compared with the original JVM engine, a native engine uses hardware resources more efficiently and computes faster (see the toy comparison after this list).
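The vectorization point can be felt even at the Python level. This is a toy comparison, not an engine benchmark: processing a column in one batched pass versus one interpreter iteration per row.

```python
import time
import numpy as np

values = np.random.rand(5_000_000)

t0 = time.perf_counter()
total_loop = 0.0
for v in values:                   # row-at-a-time: one interpreter hop per value
    total_loop += v * 2.0
t1 = time.perf_counter()

total_vec = (values * 2.0).sum()   # batch-at-a-time: one SIMD-friendly pass
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.2f}s  vectorized: {t2 - t1:.2f}s")
```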

Beyond the data warehouse and big data, enterprises run other important kinds of computation, most commonly AI and HPC (high performance computing). The data lake's performance strength is high throughput, while its metadata performance and latency are merely average; high performance computing, however, places strict demands on both. Enterprises therefore also maintain, outside the lake, one or more high-speed file systems (Lustre, BeeGFS, etc.) for such workloads. In essence, the frameworks used in high performance computing are compute engines too, and their input and output data are part of the enterprise's digital assets, so integrating these workloads into the data lake system is an important question. The answer resembles the warehouse's storage-compute separation story, the solutions are interchangeable, and again there are two routes:

  • A hot/cold split, where the high-speed file system uses the data lake as its cold data tier.
  • A cloud-native file system designed on top of the data lake. Such a system presents a file system interface but is really a cache acceleration system, with a "cache layer + data lake" architecture. The cache layer keeps hot data on demand on the compute nodes or on hardware close to them; the lake stores the full dataset and guarantees durability. If data is evicted or lost from the cache, it can simply be reloaded from the lake (see the read-through sketch after this list).
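A minimal read-through sketch of the "cache layer + data lake" idea, with hypothetical bucket and paths; real acceleration layers add eviction, consistency, and distributed coordination on top of this pattern.

```python
import os
import boto3

CACHE_DIR = "/mnt/nvme/cache"
s3 = boto3.client("s3")

def read_through(bucket: str, key: str) -> bytes:
    local = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if os.path.exists(local):                    # cache hit: local NVMe speed
        with open(local, "rb") as f:
            return f.read()
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(local, "wb") as f:                 # populate for the next reader
        f.write(body)
    return body                                  # the lake stays the source of truth

data = read_through("my-lake", "cleansed/clickstream/part-0000.parquet")
```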

The term "lakehouse" was first proposed by Databricks and is still debated in the industry; competitors tend to avoid it, with AWS preferring the phrase modern data architecture (Modern Data Architecture). But whatever the name, the lakehouse represents the next stage of the data lake: in essence, the ultimate one-stop data platform for the enterprise.

This data platform is an all-in-one storage infrastructure that satisfies all of an enterprise's data storage needs, from low-cost storage to high performance. Moreover, the platform extends beyond the scope of the warehouse and big data: data warehouse, big data, AI, HPC, and other compute engines all run on it, consuming and producing data structures they can all understand, so data flows between businesses without barriers.

5.3 Lakehouse Architecture

Based on the discussion so far, we can summarize the lakehouse simply with the data platform formula.

In the storage system, object storage has become the de facto standard for data lake storage, with an ecosystem far more prosperous than any other kind of cloud storage product. For the storage problems object storage cannot solve well on its own, it is paired with an appropriate metadata layer and acceleration layer.

  • Against the data swamp problem, the metadata layer establishes the necessary mechanisms for data quality, metadata management, version management, and data flow, so that businesses across the enterprise can easily use high-quality data.
  • For workloads with higher demands on metadata and latency, such as data warehouse, AI, and HPC, the acceleration layer supplements object storage, generally with a high-speed file system or cache system deployed close to the compute nodes; metadata and data flow automatically between the acceleration layer and the lake. For ease of use, the acceleration layer is also paired with the job scheduling system above it to make data movement smarter and simpler. For example, the scheduler preheats data in advance; once the data has been warmed into the cache, it allocates compute resources and starts the job, which then enjoys access speeds faster than the lake itself.

In the compute engine part there are data warehouse, big data, AI, HPC, and other engines. Data flow is the most basic problem; beyond it, another important problem is scheduling and managing these engines. From a resource perspective they mainly consume CPU, GPU, and other compute resources, so they have a basis for resource sharing, and raising overall utilization means cost savings for users. There are two approaches:

  • One is, for a specific engine, to use a cloud vendor's managed or serverless service instead of self-hosting. The vendor's service has built-in elasticity and pay-as-you-go billing, keeping utilization of the relevant resources in an appropriate range and sidestepping the resource sharing problem.
  • The other is to run the user-maintained engines on a unified scheduling and resource management platform. Here Kubernetes is the most popular choice; if some engine does not yet support deployment on it, that is only a matter of time. Cloud vendors usually also provide optimized Kubernetes distributions or services to choose from.

The interface part depends on the specific engine: where SQL fits the scenario, SQL is the best choice; other scenarios require users to be familiar with the engine's programming interface.

6. Summary

Enterprise data volumes have exploded and business scenarios keep growing more complex, driving continuous change in data platform technology. The two technical routes, data warehouse and data lake, have fully demonstrated their respective strengths and weaknesses in past practice; in recent years they have begun to converge, borrowing each other's strong points and iterating toward the so-called lakehouse or modern data architecture.

The emerging technologies and methods are the crystallization of the collective wisdom of countless practitioners, and openness is the catalyst for all of it. The openness of this field shows in many respects:

  • Data is open. Compute engines are increasingly open, standard data formats are broadly supported, data flows more easily, and businesses pick the most suitable engine for each computing task on demand.
  • Technology is open. Most of the important technologies in the lakehouse architecture exist as open source projects; no single enterprise can monopolize the intellectual property. Vendor distributions and open source versions are interchangeable, and the choice lies with the user. Open technology has also promoted cross-domain integration: different fields borrow each other's methods and techniques, strengthening strengths and shoring up weaknesses for a 1 + 1 > 2 effect.
  • Infrastructure is open. In lakehouse solutions, cloud vendors play an important role, providing object storage, managed big data services, and other infrastructure. This infrastructure is compatible with industry standards and has open source alternatives, so customers can easily build hybrid-cloud or multi-cloud architectures and make better use of the cloud's elasticity.

Against this open backdrop, everyone in the industry, users and platforms alike, has their own thoughts and views on the data platform. We hope to take this opportunity to share some of ours: partly to offer our modest insights to the industry, and partly to leave a record for ourselves to look back on.

The following articles in this series will focus on the two themes of storage and compute, introducing the core technical principles and best practices in the data platform together with Baidu AI Cloud's thinking on these issues, helping readers form a systematic understanding of the data lake and giving them more ideas when building their own data platforms.

---- End ----
Follow the WeChat official account "Baidu AI Cloud Technology Station" so you don't miss the upcoming installments.

