当前位置:网站首页>Application practice | the efficiency of the data warehouse system has been comprehensively improved! Data warehouse construction based on Apache Doris in Tongcheng digital Department

Application practice | the efficiency of the data warehouse system has been comprehensively improved! Data warehouse construction based on Apache Doris in Tongcheng digital Department

2022-07-08 00:29:00 SelectDB

Reading guide : Tongcheng mathematics department was established in 2015 year , It is the tourism industry financial service platform under Tongcheng Group .2020 year , The same process number department is based on Apache Doris Rich data access methods 、 Excellent parallel computing ability 、 Minimalist operation and maintenance features , introduce Apache Doris Carry out data warehouse structure 2.0 Build . This article details the architecture 1.0 To 2.0 The evolution process and Doris Application practice of , I hope that's helpful .

author Senior engineer of big data in the same Engineering Department Uranus

Business background

Business Introduction

Tongcheng Digital Technology Co., Ltd. is a tourism industry financial service platform under Tongcheng Group , The predecessor is Tongcheng Jinfu , Officially established in 2015 year . The number of subjects in the same process is “ Digital technology leads the tourism industry ” For the vision , Adhere to the power of science and Technology , Enable China's tourism industry .

at present , Our business covers industrial financial services 、 Consumer financial services 、 Financial technology, digital technology and other sectors , The cumulative service covers more than ten million users and 76 city .

 picture

chart 1.1 Business scenario - Business Introduction

Business needs

It mainly includes four categories :

  • Kanban class : It mainly includes business real-time cockpit and T+1 Business Kanban, etc .
  • Early warning : Mainly including risk control fusing 、 Abnormal funds and flow monitoring .
  • Analysis class : It mainly includes timely data query and analysis and temporary data retrieval .
  • Financial class : It mainly includes clearing and payment reconciliation requirements .

 picture

chart 1.2 Business scenario - Business needs

Integrate the above business needs , We have built the system architecture .

Architecture evolution 1.0

Workflow

 picture

chart 2.1 Architecture evolution - framework 1.0

framework 1.0 It was very popular in previous years SteamSets and Apache Kudu As the core of the first generation architecture .

The architecture passes StreamSets Do database Binlog Write in real time after acquisition Apache Kudu in , Finally through Apache Impala And visualization tools to query and use . This process has long architecture links and SteamSets Poor reusability of some configurations , in addition Apache Kudu There are some performance bottlenecks between multi table Association and large table Association , And right IO High requirements .

chart 2.1 The application of real-time computing process in the lower part is similar to that in the upper part , In real-time computing , Buried point data is sent to Kafka It will pass Flink Do real-time calculations , And the calculated result data will fall into the analysis library and Hive The library is used for data warehouse Association .

Advantages and disadvantages

 picture

chart 2.2 Architecture evolution Advantages and disadvantages

advantage :

  • framework 1.0 I chose CDH Family bucket .CDH It provides many big data components , Can be integrated with each other and put into use , At the same time, its configuration is relatively flexible .
  • The use of SteamSets Support visual drag and drop and configuration development , So developers are right SteamSets The degree of acceptance is high ..

Insufficient :

  • Too many components are introduced , Maintenance costs increase ; When there is a problem with the data , The troubleshooting and repair link is relatively long .
  • Various technical architectures and long development links , It improves the learning cost and requirements of warehouse staff , Data warehouse personnel need to convert in different places before developing , The development process is not smooth 、 Reduced development efficiency .
  • Apache Kudu In large table Association Join The performance is not satisfactory .
  • Because the architecture uses CDH structure , Offline cluster and real-time cluster are not separated , Form resources to compete with each other ; In the process of offline batch running IO Or high disk consumption , The timeliness of real-time data cannot be guaranteed .
  • although SteamSets Equipped with early warning capability , But the job recovery ability is still relatively lacking . When configuring multiple tasks JVM The consumption is high , Resulting in slow recovery .

Architecture evolution 2.0

Workflow

Because of the architecture 1.0 The disadvantages far outweigh the advantages , stay 2020 year , We investigated many components in the market for real-time development , Found out Apache Doris, Through investigation and comparison , The final decision will be Apache Doris Introduced architecture 2.0.

 picture

chart 3.1 Architecture evolution - framework 2.0

introduce Apache Doris after , The following modifications have been made to the overall architecture :

  • adopt Canal Of CDC Ability , take MySQL Data collection to Kafka in . because Apache Doris And Kafka The degree of fit is high , It's easy to use Routine Load Load and access data .
  • The data link of the original offline calculation is slightly adjusted . For storage in Hive Data in ,Apahce Doris Supported by Broker Load take Hive Data import , Therefore, the data of offline clusters can be directly loaded into Doris In .

The selection Doris

 picture

chart 3.2 framework 2.0- The selection Doris

In the process of model selection ,Apache Doris The overall performance is amazing :

  • Data access : Provides rich data import methods , It can support the access of many data sources .
  • Data connection : Doris Support JDBC And ODBC And so on , Yes BI The visual display of the tool is relatively friendly , Can easily communicate with BI Tools to connect , in addition Doris Realized MySQL Protocol layer , Through various Client Tools directly access Doris.
  • SQL grammar : Doris Support standards SQL, Grammar to MySQL compatible , The learning cost is low for the warehouse staff ;
  • MPP Parallel computing : Doris be based on MPP The architecture provides excellent parallel computing capability , For large table Join Very good support .
  • The most important point : Doris Official documents are very sound , For users, it's faster to get started .

When investigating the system selection , We also understand ClickHouse,ClickHouse Yes CPU The utilization rate is high , It performs well in single table query , But in multi query high QPS Under the circumstances of poor performance .

Combine the above factors , In the end, we chose Apache Doris.

Doris Deployment architecture

 picture

chart 3.3 framework 2.0-Doris Deployment architecture

Apache Doris The deployment architecture is extremely simple , Mainly FE and BE:

FE It is the front-end node , It mainly carries out the access requested by users 、 Metadata and cluster management and query plan generation .

BE It is the back-end node , Mainly for data storage and query plan generation and execution .

Doris Operation and maintenance is very simple :

3 In June, we carried out a rolling migration of the machines in the computer room ,12 platform Doris All node machines will be migrated within three days , The overall operation is relatively simple , It is mainly used for removing the machine from the rack 、 Move and put on shelves ;FE The expansion and contraction actions don't take much time , Only used Add And Drop Wait for simple instructions .

Particular attention : Try not to use something like Drop Wait for instructions directly to BE To operate . When using Drop When the instruction is forcibly deleted ,Doris You will be prompted and asked to manually confirm whether to delete , Data cannot be recovered after forced deletion . Therefore, it is recommended to use contact mode to offline nodes , After the data migration is completed , You can directly BE The node joins again , More flexible .

Doris Real time system architecture

 picture

chart 3.4 Doris Real time system architecture

data source : In real-time system architecture , The data source comes from industrial finance 、 Consumer finance 、 Risk control data and other business lines , adopt Canal and API Interface for collection .

Data collection : Canal adopt Canal- Admin After data collection , Send data to Kafka In the message queue , Re pass Routine Load Access to Doris colony .

Doris Several positions : Doris The cluster builds three layers of data warehouse , Namely : Used Unique Model DWD Detail level 、 Aggregate Model DWS Summary level and ADS application layer .

Data applications : The architecture is applied to real-time Kanban 、 Data timeliness analysis and data service .

Doris New data warehouse features

 picture

chart 3.5 Doris New data warehouse features

The data import method is simple , According to different scenarios 3 There are two different import methods :

  • Routine Load: It is mainly used for business data access and consumption Kafka The Resident Mission of exists . When we submit Rountine Load When the task ,Doris There will be a resident process for real-time consumption Kafka , Constantly from Kafka Read data from and import it into Doris in .
  • Broker Load: Perform offline data import tasks such as basic dimension tables and historical data .
  • Insert Into: Used for regular batch operation , Responsible for DWD Layer data processing , formation DWS Layers and ADS layer .

Doris Good data model , Improve our development efficiency :

  • Unique Model in DWD When the layer is accessed , It can effectively prevent repeated consumption of data .
  • Aggregate Models are used as aggregations . stay Doris in ,Aggregate Support is like Sum、Replace、Min 、Max 4 Two ways of aggregation model , Used in the process of aggregation Aggregate The underlying model can be reduced by a large part SQL The amount of code , No longer need to do it yourself Sum、Min、Max Wait for the action , From DWD Layer to DWS/ADS Layer is friendly .

Doris Low threshold for use , High query efficiency :

  • Support MySQL agreement , Support standards SQL, Query syntax is highly compatible MySQL, Friendly to analysts .
  • Support materialized views and Rollup Physicochemical index . The bottom layer of materialized view is similar Cube The concept of and the process of precomputation , And Kylin The way of exchanging space for time is similar , Special tables are generated at the bottom , When the materialized view is hit in the query, it will respond quickly .

hot tip : Although materialized views are very helpful , But when used too much , Each materialized view requires additional storage space , Data import will lead to inefficiency .

Doris Minimalist system architecture , Low operation and maintenance cost :

  • The system only has BE and FE Two modules , Do not rely on such as Zookeeper Wait for tripartite components , Simple deployment .
  • in the light of FE and BE The operation of is monitored and configured , When an exception occurs, it will restart in time .

Doris Summary of experience

 picture

chart 4.1 How to use more friendly Doris

In the use of Apache Doris In the process of , We have sorted out some experience , Help developers use more friendly Doris . For developers , The most concerned places are the following :

  • In terms of development : How to access external data Doris And quickly realize ETL Development , This will affect the report output speed of developers .
  • Scheduling management : Developers don't want to go online after the development is completed , There are errors or unstable situations , It is necessary to ensure the stability of task scheduling and scheduling recovery ability .
  • Data query : Due to the separation between production and office network , The office network cannot directly use the connection of the production network , And the network partition cannot be solved through the form of client , Only through Web Form solution , How to query and analyze safely and conveniently has become a concern of developers .
  • Cluster management : When abnormal conditions occur in the cluster, it can be captured in time and handled automatically .

To make a long story short , We hope to build a high efficiency 、 High-quality , high stability The platform of .

Doris Development optimization

According to several issues concerned by developers , We did some development optimization .

Data access

In terms of data access, we have done semi-automatic related work and made rapid generation components , According to the data source / Table to generate Routine Load Script , As long as the Kafka Of Broker or Topic It can be quickly formed by modifying Routine Load Mission .Broker Load Task and Routine Load similar , After selecting the data warehouse source, it can be generated in time Broker Load Required scripts . In the access Doris You need to create a table in advance , Similar operations can be carried out in this regard , Create statements quickly through source generation .

 picture

chart 5.1 Data platform - Doris Development

The above mainly uses the underlying metadata , Get different metadata according to different data sources, and then you can quickly generate tasks .

Submit action and maintenance management

After the task is generated , We are Routine Load Aspects are encapsulated . because Routine Load It's the resident process , We only need to submit again , The state will become Running , If there is an abnormal state, it will be detected , Monitoring will be shown to you later .

 picture

chart 5.2 Data platform - Doris Development

Monitoring and management

We can submit Routine Load Query and check for exceptions , At the same time, we can focus on Routine Load Join the monitoring , The monitoring will automatically scan the task on a regular basis , When a problem occurs, it will prompt and try to pull the task back .

Broker Load Tasks can also be monitored . Aim at Broker Load Label The name cannot be repeated , We take generation UUID How to solve , In order to better help you improve the use experience .

 picture

chart 5.3 Data platform - Doris Development

As shown above , We can do it in Routine Load Pause and terminate in , Help you to better use the developed homework and management .

Self research query page , Integrate Doris Help function

Due to the isolation between production and office network segment , We can only pass Web The query . We have tried to use Hue Integrate Doris Query scheme ,Doris Supported by MySQL Protocol connection to Hue , But if we integrate Hue Words , Everyone can pass Hue Inquire about Doris The data of , Security cannot be guaranteed , Unable to meet our requirements for permission .

 picture

chart 5.4 Data platform -Doris Data query

So we developed a query page in our own platform to solve this problem . The left part of the figure can be based on DB List all the tables below , The right part is the query analysis page and query results , We developed it by ourselves, which is similar to Navicat Client software .

At the same time, we are right Doris Help Functions are integrated , People don't know how to use Doris Provide help when . Through integration Doris Help, We can use the keyword search function to query syntax and examples to solve problems .

Even without integration Doris Help, It can also be in FE The node comes with Web Page to view ,FE The built-in node can view the information of the whole cluster and has Help Functional Web page . After we realize the self research query page and integrate Doris Help after , You can use it directly , To skip the need to use Admin Account connection can be used FE Steps for .

Doris Cluster monitoring page

At the same time, we developed Doris Cluster monitoring page , You can see in the cluster monitoring page FE 、BE as well as Broker Node status of . When an abnormal condition occurs in the cluster , The monitoring system will send an automatic reminder and try to pull up the cluster , At the same time, you can also observe the health of nodes in the form of pages .

 picture

chart 5.5 Data platform - Doris Cluster monitoring

about Doris For upper applications , Mainly depends on Doris Provided API And instruction completion Doris The application action of the upper layer , All we do is to Doris The instructions provided are more user-friendly integration and paged display .

The benefits of the new architecture

 picture

chart 6.1 The benefits of the new architecture

  • Data access : Pass at an early stage SteamSets During the process of data access, it is necessary to manually establish Kudu surface . Due to the lack of tools , The whole process of creating tables and tasks requires 20-30 minute . Now we can realize fast data access through platform and fast construction statement , The access process of each table starts from the previous 20-30 Minutes to the present 3-5 minute , Improved performance 5-6 times .
  • Data development : When performing aggregation or other actions in the early architecture , You need to write a lot of long articles SQL Code . Use Doris after , We can use it directly Doris The built-in Unique、Aggregate And other data models that can well support log scenarios Duplicate Model , stay ETL Greatly accelerate the development process .
  • Query analysis : Doris The bottom layer has materialized views and Rollup Materialized index and other functions , It can improve query efficiency , meanwhile Doris The bottom layer has carried out many optimization strategies for large table Association , Such as Runtime Filter And other things Join And custom optimization strategies . Compare with Doris,Apache Kudu You need more in-depth optimization experience to better use .
  • Data reports : First use Kudu Report query requires 1-2 Minutes to finish rendering , and Doris The response speed is second level or even millisecond level .
  • Environmental maintenance : Doris No, Hadoop Complexity of ecosystem , The whole link is clear , The maintenance cost is much lower than Hadoop, Especially in the process of cluster migration ,Doris The convenience of operation and maintenance is particularly prominent .

Future outlook

 picture

chart 7.1 Future outlook

  • Try to introduce Doris Manager: There are ongoing discussions in the community about Doris Manager Propaganda , In the future, we are also ready to introduce and actively participate in Doris Manager Cluster maintenance and management .
  • Implementation is based on Flink CDC Data access : The current architecture does not introduce Flink CDC , But continue to use Canal Collect to Kafka Then collect Doris Architecture in , The link is relatively long . use Flink CDC Although we can continue to streamline the overall architecture , But you still need to write a certain amount of code , about BI People feel unfriendly when using it directly , We hope that the warehouse staff only need SQL Or complete the operation on the page, you can use . stay 3.0 Architecture Planning , We plan to introduce Flink CDC Function and expand the upper application .Flink CDC The introduction of brings “ Fast is slow , Slow is fast ” The idea of , Of course Flink The development speed of the community is very fast , Only after fully learning from everyone's experience , Can be introduced more friendly , And iterate and optimize the architecture in the process of learning experience .
  • Follow the community iteration plan : What we are using Doris The version is relatively old , Now the new version Doris In memory management 、 Query performance has been greatly improved , In the future, we will follow the community iteration rhythm to upgrade the cluster and reflect new features .
  • Strengthen the construction of relevant systems : Our current indicator system management, such as report metadata 、 The maintenance and management of business metadata still need to be improved . Data quality monitoring , Although it currently includes the function of data quality monitoring , However, the whole platform monitoring and data automation monitoring still need to be strengthened and improved .

Join the community

Welcome more partners who love open source to join Apache Doris Community , Participate in community building , Except in the GitHub Ascending PR or Issue outside , You are also welcome to actively participate in the daily construction of the community , such as :

Join the community Solicitation activities , Perform technical analysis 、 Applied practice and other articles ; Participate as an instructor Doris Community online and offline activities ; actively participate in Doris Questions and answers from community users .

Last , Welcome more open source technology enthusiasts to join us Apache Doris Community , Grow up hand in hand , Build community ecology .

 picture

SelectDB Is an open source technology company , Committed to Apache Doris The community provides a full-time engineer 、 A team of product managers and support engineers , Prosper the open source community ecology , Create an international industry standard in the field of real-time analytical databases . be based on Apache Doris R & D of a new generation of cloud native real-time data warehouse SelectDB, Running on multiple clouds , Provide users and customers with out of the box capability .

Related links :

SelectDB Official website :

https://selectdb.com

Apache Doris Official website :

http://doris.apache.org

Apache Doris Github:

https://github.com/apache/doris

Apache Doris Developer mail group :

[email protected]

原网站

版权声明
本文为[SelectDB]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/189/202207072213358846.html