
Application Practice | Data Warehouse Efficiency Improved Across the Board: Building a Data Warehouse on Apache Doris at Tongcheng Digital Technology

2022-07-07 22:06:00 InfoQ

Reading guide: Tongcheng Digital Technology was founded in 2015 and is the tourism-industry financial services platform of Tongcheng Group. In 2020, drawn by Apache Doris's rich data access methods, excellent parallel computing capability, and minimal operations overhead, Tongcheng Digital Technology introduced Apache Doris to build version 2.0 of its data warehouse architecture. This article describes the evolution from architecture 1.0 to 2.0 and our practical experience with Doris. We hope it is helpful.

Author
Uranus, Senior Big Data Engineer, Tongcheng Digital Technology

Business background

Business Introduction

Tongcheng Digital Technology Co., Ltd., formerly Tongcheng Jinfu, is the tourism-industry financial services platform of Tongcheng Group and was officially founded in 2015. With the vision that "digital technology leads the tourism industry," Tongcheng Digital Technology relies on technology to empower China's tourism industry.

At present, our business covers industrial financial services, consumer financial services, fintech, digital technology, and other segments, and has served more than ten million users across 76 cities.

Figure 1.1 Business scenario - Business introduction

Business needs

Our business needs fall into four categories:

  • Dashboards: mainly the real-time business cockpit and T+1 business dashboards.
  • Alerting: mainly risk-control circuit breaking, abnormal fund detection, and traffic monitoring.
  • Analysis: mainly timely data query and analysis, and ad hoc data extraction.
  • Finance: mainly clearing and payment reconciliation.

Figure 1.2 Business scenario - Business needs

We designed our system architecture around these business needs.

Architecture evolution 1.0

Workflow

Figure 2.1 Architecture evolution - Architecture 1.0

Architecture 1.0 was our first-generation architecture, built around StreamSets and Apache Kudu, which were very popular a few years ago.

In this architecture, StreamSets captures database binlogs and writes them to Apache Kudu in real time; the data is then queried through Apache Impala and visualization tools. The pipeline is long, and some StreamSets configurations are hard to reuse. In addition, Apache Kudu has performance bottlenecks on multi-table joins and large-table joins, and places high demands on IO.

The real-time computing flow in the lower half of Figure 2.1 is similar to the upper half. Event-tracking data is sent to Kafka and processed in real time by Flink; the results land in the analysis database and in Hive, where they are joined with other data warehouse tables.

Advantages and disadvantages

Figure 2.2 Architecture evolution - Advantages and disadvantages

Advantages:

  • Architecture 1.0 used the full CDH suite. CDH provides many big data components that integrate with each other and can be put to use right away, and its configuration is relatively flexible.
  • StreamSets supports visual drag-and-drop configuration and development, so developers accepted it readily.

Disadvantages:

  • Too many components were introduced, which raised maintenance costs; when data problems occurred, troubleshooting and repair took a long path.
  • The mix of technologies and the long development pipeline raised the learning cost and skill requirements for data warehouse engineers, who had to switch between tools during development. The development process was not smooth and efficiency suffered.
  • Apache Kudu performs poorly on large-table joins.
  • Because the architecture was built on CDH, the offline and real-time clusters were not separated and competed for resources. When offline batch jobs consumed heavy IO or disk, the timeliness of real-time data could not be guaranteed.
  • Although StreamSets has alerting, its job recovery capability is weak. With many tasks configured, JVM consumption is high and recovery is slow.

Architecture evolution  2.0

Workflow

Because the disadvantages of architecture 1.0 far outweighed its advantages, in 2020 we surveyed many real-time components on the market and came across Apache Doris. After investigation and comparison, we decided to introduce Apache Doris in architecture 2.0.

Figure 3.1 Architecture evolution - Architecture 2.0

After introducing Apache Doris, we made the following changes to the overall architecture:

  • MySQL data is collected into Kafka using Canal's CDC capability. Because Apache Doris integrates well with Kafka, the data can be loaded easily with Routine Load.
  • The original offline computing data link was adjusted slightly. Apache Doris supports importing Hive data through Broker Load, so data from the offline cluster can be loaded directly into Doris.

Selecting Doris

Figure 3.2 Architecture 2.0 - Selecting Doris

During selection, Apache Doris impressed us overall:

  • Data access:
      Doris provides rich data import methods and supports many data sources.
  • Data connection:
      Doris supports JDBC, ODBC, and other connection methods, so it connects easily to BI tools for visualization. Doris also implements the MySQL protocol layer, so it can be accessed directly from a variety of client tools.
  • SQL syntax:
      Doris supports standard SQL with MySQL-compatible syntax, so the learning cost for data warehouse engineers is low.
  • MPP parallel computing:
      Doris's MPP architecture provides excellent parallel computing capability and supports large-table joins well.
  • Most importantly:
      The official Doris documentation is thorough, so users can get started quickly.

During selection we also looked at ClickHouse. ClickHouse makes heavy use of the CPU and performs well on single-table queries, but it performs poorly on multi-table queries and under high QPS.

Weighing these factors, we finally chose Apache Doris.

Doris deployment architecture

Figure 3.3 Architecture 2.0 - Doris deployment architecture

The Apache Doris deployment architecture is extremely simple and consists mainly of FE and BE:

FE is the frontend node. It handles user request access, metadata and cluster management, and query plan generation.

BE is the backend node. It handles data storage and query plan execution.

Doris operations and maintenance are very simple:

In March we did a rolling relocation of the machines in our data center. All 12 Doris node machines were moved within three days. The operation was straightforward, mostly unracking, moving, and re-racking the machines. Scaling FE in and out took little time and used only simple commands such as Add and Drop.

Note: Avoid operating on BE nodes directly with commands such as Drop. When Drop is used to force-remove a node, Doris prompts for manual confirmation, and the data cannot be recovered after a forced removal. We therefore recommend taking nodes offline with Decommission; after data migration completes, the BE node can be added back directly, which is more flexible.
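To illustrate the note above, here is a minimal Python sketch that builds the statements for adding or taking a BE node offline. It is not our production tooling: the helper name `backend_statement` and the host name are hypothetical, and the port defaults to 9050 on the assumption that the BE uses the default heartbeat port. The helper deliberately refuses a forced drop, so the only way to remove a node through it is `ALTER SYSTEM DECOMMISSION BACKEND`.

```python
def backend_statement(action: str, host: str, heartbeat_port: int = 9050) -> str:
    """Build an ALTER SYSTEM statement for a BE node (hypothetical helper).

    Only "add" and "decommission" are allowed; a forced drop is rejected
    on purpose because dropped data cannot be recovered.
    """
    verbs = {
        "add": "ADD BACKEND",
        "decommission": "DECOMMISSION BACKEND",  # migrates data before removal
    }
    if action not in verbs:
        raise ValueError(f"unsupported action: {action!r}")
    return f'ALTER SYSTEM {verbs[action]} "{host}:{heartbeat_port}";'


# "be-node-01" is a placeholder host name.
print(backend_statement("decommission", "be-node-01"))
print(backend_statement("add", "be-node-01"))
```

The generated statements would be run through any MySQL-protocol client connected to an FE node.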

Doris real-time system architecture

Figure 3.4 Doris real-time system architecture

Data sources:
  In the real-time architecture, data comes from business lines such as industrial finance, consumer finance, and risk control, and is collected through Canal and API interfaces.

Data collection:
  Canal collects data through Canal-Admin and sends it to Kafka message queues; Routine Load then loads it into the Doris cluster.

Doris data warehouse:
  The Doris cluster holds a three-layer data warehouse: the DWD detail layer, which uses the Unique model; the DWS summary layer, which uses the Aggregate model; and the ADS application layer.

Data applications:
  The architecture serves real-time dashboards, timely data analysis, and data services.

Features of the new Doris data warehouse

Figure 3.5 Features of the new Doris data warehouse

Data import is simple, with three import methods for different scenarios:

  • Routine Load: used mainly for business data ingestion, as a resident task that consumes Kafka. After a Routine Load job is submitted, Doris keeps a resident process that continuously reads data from Kafka and imports it into Doris.
  • Broker Load: used for offline imports such as basic dimension tables and historical data.
  • Insert Into: used for scheduled batch jobs that process DWD-layer data to produce the DWS and ADS layers.
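As an illustration of the Insert Into path, the following Python sketch builds an `INSERT INTO ... SELECT` statement that rolls DWD detail rows up into a DWS table. The function name and all table and column names are hypothetical; the article does not show our actual batch SQL.

```python
from typing import List


def build_insert_into(target: str, source: str,
                      group_cols: List[str], sum_cols: List[str]) -> str:
    """Build an INSERT INTO ... SELECT that sums sum_cols per group_cols."""
    target_cols = ", ".join(group_cols + sum_cols)
    select_list = ", ".join(group_cols + [f"SUM({c}) AS {c}" for c in sum_cols])
    group_by = ", ".join(group_cols)
    return (
        f"INSERT INTO {target} ({target_cols})\n"
        f"SELECT {select_list}\n"
        f"FROM {source}\n"
        f"GROUP BY {group_by};"
    )


# Placeholder table and column names for a daily order summary.
print(build_insert_into("dws_order_daily", "dwd_order_detail",
                        ["order_date", "product_id"], ["amount"]))
```

A scheduler would run the generated statement periodically, which matches the "scheduled batch job" role described above.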

Doris's data models improve our development efficiency:

  • The Unique model, used when data lands in the DWD layer, effectively prevents duplicates caused by repeated consumption.
  • The Aggregate model is used for aggregation. In Doris, Aggregate supports four aggregation types: Sum, Replace, Min, and Max. Using the Aggregate model during aggregation removes a large part of the SQL code, since we no longer need to write Sum, Min, Max, and similar operations by hand, which makes going from the DWD layer to the DWS/ADS layers much easier.
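The effect of the two models can be pictured with a small Python simulation. This is only a conceptual sketch of the merge semantics, not how Doris implements them internally; the function `merge_rows` and the sample columns are made up for illustration. Rows that share a key are merged column by column using the declared aggregation type, and a Unique-model table behaves as if every value column used Replace.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# The four aggregation types mentioned above, as merge functions (old, new).
AGG_FUNCS: Dict[str, Callable[[Any, Any], Any]] = {
    "SUM": lambda old, new: old + new,
    "REPLACE": lambda old, new: new,
    "MIN": min,
    "MAX": max,
}


def merge_rows(rows: Iterable[Dict[str, Any]], key_cols: List[str],
               value_aggs: Dict[str, str]) -> Dict[Tuple, Dict[str, Any]]:
    """Merge rows with the same key, applying one aggregation per value column."""
    table: Dict[Tuple, Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in table:
            table[key] = {c: row[c] for c in value_aggs}
            continue
        current = table[key]
        for col, agg in value_aggs.items():
            current[col] = AGG_FUNCS[agg](current[col], row[col])
    return table


rows = [
    {"user_id": 1, "amount": 10, "status": "created"},
    {"user_id": 1, "amount": 5, "status": "paid"},
]
# Aggregate-style: amounts are summed, status keeps the latest value.
print(merge_rows(rows, ["user_id"], {"amount": "SUM", "status": "REPLACE"}))
# → {(1,): {'amount': 15, 'status': 'paid'}}
# Unique-style: every value column is replaced, so a replayed row does not duplicate.
print(merge_rows(rows, ["user_id"], {"amount": "REPLACE", "status": "REPLACE"}))
# → {(1,): {'amount': 5, 'status': 'paid'}}
```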

Doris has a low barrier to entry and high query efficiency:

  • It supports the MySQL protocol and standard SQL, and its query syntax is highly MySQL-compatible, which is friendly to analysts.
  • It supports materialized views and Rollup materialized indexes. Under the hood, a materialized view resembles the Cube concept with precomputation and, like Kylin, trades space for time: special tables are generated underneath, and queries that hit a materialized view respond quickly.

Tip: Materialized views are very helpful, but each one requires extra storage space, and using too many reduces data import efficiency.
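For reference, a materialized view that precomputes a sum per group could be generated with a sketch like the one below. The helper `build_materialized_view` and the table, view, and column names are hypothetical, and only the simplest single-table aggregate form is shown.

```python
from typing import List


def build_materialized_view(name: str, table: str,
                            group_cols: List[str], sum_col: str) -> str:
    """Build a CREATE MATERIALIZED VIEW statement that precomputes SUM per group."""
    cols = ", ".join(group_cols)
    return (
        f"CREATE MATERIALIZED VIEW {name} AS\n"
        f"SELECT {cols}, SUM({sum_col})\n"
        f"FROM {table}\n"
        f"GROUP BY {cols};"
    )


# Placeholder names: daily amount per product on a detail table.
print(build_materialized_view("mv_amount_by_day", "dwd_order_detail",
                              ["order_date", "product_id"], "amount"))
```

Each view created this way is kept up to date on every import, which is where the extra storage and import cost mentioned in the tip comes from.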

Doris has a minimal system architecture and low operations cost:

  • The system has only two modules, FE and BE, and does not depend on third-party components such as Zookeeper, so deployment is simple.
  • We configured monitoring for the FE and BE processes so that they are restarted promptly when an exception occurs.

Summary of experience with Doris

Figure 4.1 How to use Doris more comfortably

While using Apache Doris, we collected some experience to help developers use Doris more comfortably.
Developers care most about the following:

  • Development:
      How to get external data into Doris and implement ETL quickly, which determines how fast developers can deliver reports.
  • Scheduling management:
      Developers do not want errors or instability after a job goes live, so task scheduling must be stable and recoverable.
  • Data query:
      The production and office networks are separated, so the office network cannot connect to production directly, and client tools cannot get around the network partition; only a web interface can. How to query and analyze data safely and conveniently became a concern for developers.
  • Cluster management:
      Cluster exceptions should be detected promptly and handled automatically.

In short, we want to build an efficient, high-quality, and highly stable platform.

Doris development optimization

We made several development optimizations in response to these concerns.

Data access

For data access, we semi-automated the work and built a rapid generation component that produces Routine Load scripts from the data source and table. Changing only the Kafka Broker or Topic is enough to produce a Routine Load job quickly. Broker Load jobs work similarly: once the data warehouse source is selected, the script that Broker Load needs is generated right away. Tables must be created before data is loaded into Doris, and this is handled the same way, by generating the CREATE TABLE statement from the source.

Figure 5.1 Data platform - Doris development

All of this relies on the underlying metadata: we fetch the metadata for each data source and use it to generate jobs quickly.
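A generator of this kind can be sketched in a few lines of Python. This is an assumed, simplified version rather than our platform's code: the function `build_routine_load`, the default job naming, and the sample broker, topic, and column names are all hypothetical, and only a few common Routine Load properties (`format`, `kafka_broker_list`, `kafka_topic`) are filled in.

```python
from typing import List, Optional


def build_routine_load(db: str, table: str, columns: List[str],
                       brokers: str, topic: str,
                       job_name: Optional[str] = None, fmt: str = "json") -> str:
    """Build a CREATE ROUTINE LOAD statement from table metadata."""
    job = job_name or f"{table}_routine_load"  # assumed naming convention
    cols = ", ".join(columns)
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f"COLUMNS({cols})\n"
        "PROPERTIES\n"
        "(\n"
        f'    "format" = "{fmt}"\n'
        ")\n"
        "FROM KAFKA\n"
        "(\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}"\n'
        ");"
    )


# Placeholder metadata; in practice the columns would come from the source table.
print(build_routine_load("ods", "orders", ["id", "user_id", "amount"],
                         "kafka-1:9092", "orders_binlog"))
```

With the metadata fixed, producing a job for another source only requires changing the broker list or topic argument, which is the behavior described above.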

Job submission and maintenance

After a job is generated, we wrap the Routine Load submission. Because Routine Load is a resident process, we only need to submit it and its state becomes Running; abnormal states are detected by the monitoring described later.

Figure 5.2 Data platform - Doris development

Monitoring and management

We can query submitted Routine Load jobs and check them for exceptions. We can also add the Routine Load jobs we care about to monitoring, which scans them periodically, raises an alert when a problem occurs, and tries to bring the job back up.
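The scan-and-recover step might look like the sketch below. It assumes each job is a dictionary carrying the `Name` and `State` columns of `SHOW ROUTINE LOAD`; the function `plan_recovery` and the policy it encodes (resume paused jobs, alert on stopped or cancelled ones) are illustrative assumptions, not a description of our monitor.

```python
from typing import Dict, List, Tuple


def plan_recovery(jobs: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Decide what to do for each Routine Load job based on its state.

    Returns (statements to run, job names that need a human).
    """
    resume: List[str] = []
    alert: List[str] = []
    for job in jobs:
        state = job["State"]
        if state == "PAUSED":
            # A paused job can be resumed; a real monitor would first check
            # whether someone paused it on purpose.
            resume.append(f'RESUME ROUTINE LOAD FOR {job["Name"]};')
        elif state in ("STOPPED", "CANCELLED"):
            # These states are final, so the job must be recreated by hand.
            alert.append(job["Name"])
    return resume, alert


jobs = [
    {"Name": "orders_routine_load", "State": "RUNNING"},
    {"Name": "users_routine_load", "State": "PAUSED"},
    {"Name": "risk_routine_load", "State": "CANCELLED"},
]
print(plan_recovery(jobs))
# → (['RESUME ROUTINE LOAD FOR users_routine_load;'], ['risk_routine_load'])
```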

Broker Load jobs can be monitored as well. Because Broker Load label names cannot be repeated, we generate a UUID for each label, which makes the feature easier to use.
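Generating a label this way takes one call to Python's standard `uuid` module. The table-name prefix and the helper name `new_load_label` are assumptions added for readability; the article only states that a UUID is used.

```python
import uuid


def new_load_label(table: str) -> str:
    """Return a label that will not collide with earlier Broker Load labels."""
    # uuid4().hex is 32 hexadecimal characters with no hyphens.
    return f"{table}_{uuid.uuid4().hex}"


# "dim_product" is a placeholder table name; every call yields a new label.
print(new_load_label("dim_product"))
```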

Figure 5.3 Data platform - Doris development

As shown above, Routine Load jobs can be paused and stopped from the page, which helps everyone use and manage the jobs they develop.

In-house query page with integrated Doris Help

Because the production and office network segments are isolated, we can only query through the web. We tried integrating Doris with Hue: Doris supports the MySQL protocol, so Hue can connect to it. But with Hue, everyone could query Doris data through Hue, so security could not be guaranteed and our permission requirements could not be met.

Figure 5.4 Data platform - Doris data query

So we built a query page in our own platform to solve this problem. The left part of the figure lists all tables under each DB; the right part is the query editor and the query results. It is developed in-house and works much like the Navicat client.

We also integrated the Doris Help function to assist people who are unsure how to use Doris. With it, users can search by keyword for syntax and examples.

Even without this integration, Help is available on the web page built into the FE node, which also shows information for the whole cluster. With our in-house query page and integrated Doris Help, people can use it directly and skip the step of connecting to FE with an Admin account.

Doris cluster monitoring page

We also built a Doris cluster monitoring page that shows the status of FE, BE, and Broker nodes. When the cluster becomes abnormal, the monitoring system sends an automatic alert and tries to bring the cluster back up; node health can also be checked on the page.
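The health check behind such a page can be reduced to a simple filter, sketched below. The input shape is an assumption: each node is a dictionary with hypothetical keys `role`, `host`, and `alive`, which a caller would fill in from the results of `SHOW FRONTENDS`, `SHOW BACKENDS`, and `SHOW BROKER`.

```python
from typing import Dict, List, Union


def find_dead_nodes(nodes: List[Dict[str, Union[str, bool]]]) -> List[str]:
    """Return 'role@host' for every node that is not alive."""
    return [f'{n["role"]}@{n["host"]}' for n in nodes if not n["alive"]]


# Placeholder cluster snapshot.
nodes = [
    {"role": "FE", "host": "fe-01", "alive": True},
    {"role": "BE", "host": "be-01", "alive": True},
    {"role": "BE", "host": "be-02", "alive": False},
    {"role": "BROKER", "host": "be-02", "alive": False},
]
print(find_dead_nodes(nodes))
# → ['BE@be-02', 'BROKER@be-02']
```

A monitor would send an alert for each entry returned and then attempt a restart of the corresponding process.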

Figure 5.5 Data platform - Doris cluster monitoring

Applications on top of Doris rely mainly on the APIs and commands that Doris provides. What we did was integrate those commands in a more user-friendly way and present them as pages.

The benefits of the new architecture

Figure 6.1 The benefits of the new architecture

  • Data access:
      Previously, data access through StreamSets required creating Kudu tables by hand. For lack of tooling, creating the table and the task took 20-30 minutes. Now the platform and quickly generated statements bring the access time per table down from 20-30 minutes to 3-5 minutes, a 5-6x improvement in efficiency.
  • Data development:
      In the old architecture, aggregation and similar operations required writing a lot of long SQL. With Doris, we use the built-in Unique and Aggregate models, as well as the Duplicate model that suits log scenarios, which speeds up ETL development considerably.
  • Query analysis:
      Doris has materialized views and Rollup materialized indexes that improve query efficiency, and it applies many optimizations to large-table joins, such as Runtime Filter and other Join and custom optimization strategies. Compared with Doris, Apache Kudu requires deeper tuning experience to use well.
  • Data reports:
      With Kudu, report queries took 1-2 minutes to render, whereas Doris responds in seconds or even milliseconds.
  • Environment maintenance:
      Doris does not have the complexity of the Hadoop ecosystem. The whole pipeline is clear and maintenance costs are far lower than with Hadoop; this ease of operation stood out especially during the cluster migration.

Future outlook

Figure 7.1 Future outlook

  • Try Doris Manager:
      The community has been discussing and promoting Doris Manager. We plan to introduce it and take an active part in cluster maintenance and management with it.
  • Data access based on Flink CDC:
      The current architecture does not use Flink CDC; it still collects data with Canal into Kafka and then into Doris, which is a relatively long link. Flink CDC would streamline the overall architecture, but it still requires writing some code, which is unfriendly to BI staff who want to use it directly. We want data warehouse engineers to need only SQL or on-page operations. In the 3.0 architecture plan, we intend to introduce Flink CDC and extend the upper-layer applications. Our approach to introducing Flink CDC follows the idea that "fast is slow, and slow is fast": the Flink community moves very quickly, and only by learning fully from others' experience can we introduce it smoothly and keep iterating on the architecture as we learn.
  • Follow the community iteration plan:
      The Doris version we run is relatively old, and newer versions have greatly improved memory management and query performance. We will follow the community release cadence to upgrade the cluster and adopt new features.
  • Strengthen supporting systems:
      Our metric system management, such as maintaining report metadata and business metadata, still needs improvement. Data quality monitoring exists today, but platform-wide monitoring and automated data monitoring still need to be strengthened.

Join the community

We welcome more open source enthusiasts to join the Apache Doris community and take part in building it. Besides submitting PRs or Issues on GitHub, you are welcome to take part in the community's day-to-day activities, for example:

Take part in the community's call for articles by writing technical analyses, application practice reports, and similar pieces; speak at Doris community online and offline events; and help answer questions from Doris community users.

Finally, we welcome more open source technology enthusiasts to join the Apache Doris community, grow together, and build the community ecosystem.


SelectDB is an open source technology company that provides the Apache Doris community with a team of full-time engineers, product managers, and support engineers, with the aim of growing the open source community ecosystem and creating an international industry standard for real-time analytical databases. SelectDB, its new-generation cloud-native real-time data warehouse built on Apache Doris, runs on multiple clouds and offers users and customers out-of-the-box capability.

Copyright notice
This article was created by [InfoQ]. Please include the original link when reprinting. Thank you.
https://yzsam.com/2022/188/202207071844216907.html