当前位置:网站首页>Application practice | the efficiency of the data warehouse system has been comprehensively improved! Data warehouse construction based on Apache Doris in Tongcheng digital Department
Application practice | the efficiency of the data warehouse system has been comprehensively improved! Data warehouse construction based on Apache Doris in Tongcheng digital Department
2022-07-08 00:29:00 【SelectDB】
Reading guide : Tongcheng mathematics department was established in 2015 year , It is the tourism industry financial service platform under Tongcheng Group .2020 year , The same process number department is based on Apache Doris Rich data access methods 、 Excellent parallel computing ability 、 Minimalist operation and maintenance features , introduce Apache Doris Carry out data warehouse structure 2.0 Build . This article details the architecture 1.0 To 2.0 The evolution process and Doris Application practice of , I hope that's helpful .
author | Senior engineer of big data in the same Engineering Department Uranus
Business background
Business Introduction
Tongcheng Digital Technology Co., Ltd. is a tourism industry financial service platform under Tongcheng Group , The predecessor is Tongcheng Jinfu , Officially established in 2015 year . The number of subjects in the same process is “ Digital technology leads the tourism industry ” For the vision , Adhere to the power of science and Technology , Enable China's tourism industry .
at present , Our business covers industrial financial services 、 Consumer financial services 、 Financial technology, digital technology and other sectors , The cumulative service covers more than ten million users and 76 city .
chart 1.1 Business scenario - Business Introduction
Business needs
It mainly includes four categories :
- Kanban class : It mainly includes business real-time cockpit and T+1 Business Kanban, etc .
- Early warning : Mainly including risk control fusing 、 Abnormal funds and flow monitoring .
- Analysis class : It mainly includes timely data query and analysis and temporary data retrieval .
- Financial class : It mainly includes clearing and payment reconciliation requirements .
chart 1.2 Business scenario - Business needs
Integrate the above business needs , We have built the system architecture .
Architecture evolution 1.0
Workflow
chart 2.1 Architecture evolution - framework 1.0
framework 1.0 It was very popular in previous years SteamSets and Apache Kudu As the core of the first generation architecture .
The architecture passes StreamSets Do database Binlog Write in real time after acquisition Apache Kudu in , Finally through Apache Impala And visualization tools to query and use . This process has long architecture links and SteamSets Poor reusability of some configurations , in addition Apache Kudu There are some performance bottlenecks between multi table Association and large table Association , And right IO High requirements .
chart 2.1 The application of real-time computing process in the lower part is similar to that in the upper part , In real-time computing , Buried point data is sent to Kafka It will pass Flink Do real-time calculations , And the calculated result data will fall into the analysis library and Hive The library is used for data warehouse Association .
Advantages and disadvantages
chart 2.2 Architecture evolution Advantages and disadvantages
advantage :
- framework 1.0 I chose CDH Family bucket .CDH It provides many big data components , Can be integrated with each other and put into use , At the same time, its configuration is relatively flexible .
- The use of SteamSets Support visual drag and drop and configuration development , So developers are right SteamSets The degree of acceptance is high ..
Insufficient :
- Too many components are introduced , Maintenance costs increase ; When there is a problem with the data , The troubleshooting and repair link is relatively long .
- Various technical architectures and long development links , It improves the learning cost and requirements of warehouse staff , Data warehouse personnel need to convert in different places before developing , The development process is not smooth 、 Reduced development efficiency .
- Apache Kudu In large table Association Join The performance is not satisfactory .
- Because the architecture uses CDH structure , Offline cluster and real-time cluster are not separated , Form resources to compete with each other ; In the process of offline batch running IO Or high disk consumption , The timeliness of real-time data cannot be guaranteed .
- although SteamSets Equipped with early warning capability , But the job recovery ability is still relatively lacking . When configuring multiple tasks JVM The consumption is high , Resulting in slow recovery .
Architecture evolution 2.0
Workflow
Because of the architecture 1.0 The disadvantages far outweigh the advantages , stay 2020 year , We investigated many components in the market for real-time development , Found out Apache Doris, Through investigation and comparison , The final decision will be Apache Doris Introduced architecture 2.0.
chart 3.1 Architecture evolution - framework 2.0
introduce Apache Doris after , The following modifications have been made to the overall architecture :
- adopt Canal Of CDC Ability , take MySQL Data collection to Kafka in . because Apache Doris And Kafka The degree of fit is high , It's easy to use Routine Load Load and access data .
- The data link of the original offline calculation is slightly adjusted . For storage in Hive Data in ,Apahce Doris Supported by Broker Load take Hive Data import , Therefore, the data of offline clusters can be directly loaded into Doris In .
The selection Doris
chart 3.2 framework 2.0- The selection Doris
In the process of model selection ,Apache Doris The overall performance is amazing :
- Data access : Provides rich data import methods , It can support the access of many data sources .
- Data connection : Doris Support JDBC And ODBC And so on , Yes BI The visual display of the tool is relatively friendly , Can easily communicate with BI Tools to connect , in addition Doris Realized MySQL Protocol layer , Through various Client Tools directly access Doris.
- SQL grammar : Doris Support standards SQL, Grammar to MySQL compatible , The learning cost is low for the warehouse staff ;
- MPP Parallel computing : Doris be based on MPP The architecture provides excellent parallel computing capability , For large table Join Very good support .
- The most important point : Doris Official documents are very sound , For users, it's faster to get started .
When investigating the system selection , We also understand ClickHouse,ClickHouse Yes CPU The utilization rate is high , It performs well in single table query , But in multi query high QPS Under the circumstances of poor performance .
Combine the above factors , In the end, we chose Apache Doris.
Doris Deployment architecture
chart 3.3 framework 2.0-Doris Deployment architecture
Apache Doris The deployment architecture is extremely simple , Mainly FE and BE:
FE It is the front-end node , It mainly carries out the access requested by users 、 Metadata and cluster management and query plan generation .
BE It is the back-end node , Mainly for data storage and query plan generation and execution .
Doris Operation and maintenance is very simple :
3 In June, we carried out a rolling migration of the machines in the computer room ,12 platform Doris All node machines will be migrated within three days , The overall operation is relatively simple , It is mainly used for removing the machine from the rack 、 Move and put on shelves ;FE The expansion and contraction actions don't take much time , Only used Add And Drop Wait for simple instructions .
Particular attention : Try not to use something like Drop Wait for instructions directly to BE To operate . When using Drop When the instruction is forcibly deleted ,Doris You will be prompted and asked to manually confirm whether to delete , Data cannot be recovered after forced deletion . Therefore, it is recommended to use contact mode to offline nodes , After the data migration is completed , You can directly BE The node joins again , More flexible .
Doris Real time system architecture
chart 3.4 Doris Real time system architecture
data source : In real-time system architecture , The data source comes from industrial finance 、 Consumer finance 、 Risk control data and other business lines , adopt Canal and API Interface for collection .
Data collection : Canal adopt Canal- Admin After data collection , Send data to Kafka In the message queue , Re pass Routine Load Access to Doris colony .
Doris Several positions : Doris The cluster builds three layers of data warehouse , Namely : Used Unique Model DWD Detail level 、 Aggregate Model DWS Summary level and ADS application layer .
Data applications : The architecture is applied to real-time Kanban 、 Data timeliness analysis and data service .
Doris New data warehouse features
chart 3.5 Doris New data warehouse features
The data import method is simple , According to different scenarios 3 There are two different import methods :
- Routine Load: It is mainly used for business data access and consumption Kafka The Resident Mission of exists . When we submit Rountine Load When the task ,Doris There will be a resident process for real-time consumption Kafka , Constantly from Kafka Read data from and import it into Doris in .
- Broker Load: Perform offline data import tasks such as basic dimension tables and historical data .
- Insert Into: Used for regular batch operation , Responsible for DWD Layer data processing , formation DWS Layers and ADS layer .
Doris Good data model , Improve our development efficiency :
- Unique Model in DWD When the layer is accessed , It can effectively prevent repeated consumption of data .
- Aggregate Models are used as aggregations . stay Doris in ,Aggregate Support is like Sum、Replace、Min 、Max 4 Two ways of aggregation model , Used in the process of aggregation Aggregate The underlying model can be reduced by a large part SQL The amount of code , No longer need to do it yourself Sum、Min、Max Wait for the action , From DWD Layer to DWS/ADS Layer is friendly .
Doris Low threshold for use , High query efficiency :
- Support MySQL agreement , Support standards SQL, Query syntax is highly compatible MySQL, Friendly to analysts .
- Support materialized views and Rollup Physicochemical index . The bottom layer of materialized view is similar Cube The concept of and the process of precomputation , And Kylin The way of exchanging space for time is similar , Special tables are generated at the bottom , When the materialized view is hit in the query, it will respond quickly .
hot tip : Although materialized views are very helpful , But when used too much , Each materialized view requires additional storage space , Data import will lead to inefficiency .
Doris Minimalist system architecture , Low operation and maintenance cost :
- The system only has BE and FE Two modules , Do not rely on such as Zookeeper Wait for tripartite components , Simple deployment .
- in the light of FE and BE The operation of is monitored and configured , When an exception occurs, it will restart in time .
Doris Summary of experience
chart 4.1 How to use more friendly Doris
In the use of Apache Doris In the process of , We have sorted out some experience , Help developers use more friendly Doris . For developers , The most concerned places are the following :
- In terms of development : How to access external data Doris And quickly realize ETL Development , This will affect the report output speed of developers .
- Scheduling management : Developers don't want to go online after the development is completed , There are errors or unstable situations , It is necessary to ensure the stability of task scheduling and scheduling recovery ability .
- Data query : Due to the separation between production and office network , The office network cannot directly use the connection of the production network , And the network partition cannot be solved through the form of client , Only through Web Form solution , How to query and analyze safely and conveniently has become a concern of developers .
- Cluster management : When abnormal conditions occur in the cluster, it can be captured in time and handled automatically .
To make a long story short , We hope to build a high efficiency 、 High-quality , high stability The platform of .
Doris Development optimization
According to several issues concerned by developers , We did some development optimization .
Data access
In terms of data access, we have done semi-automatic related work and made rapid generation components , According to the data source / Table to generate Routine Load Script , As long as the Kafka Of Broker or Topic It can be quickly formed by modifying Routine Load Mission .Broker Load Task and Routine Load similar , After selecting the data warehouse source, it can be generated in time Broker Load Required scripts . In the access Doris You need to create a table in advance , Similar operations can be carried out in this regard , Create statements quickly through source generation .
chart 5.1 Data platform - Doris Development
The above mainly uses the underlying metadata , Get different metadata according to different data sources, and then you can quickly generate tasks .
Submit action and maintenance management
After the task is generated , We are Routine Load Aspects are encapsulated . because Routine Load It's the resident process , We only need to submit again , The state will become Running , If there is an abnormal state, it will be detected , Monitoring will be shown to you later .
chart 5.2 Data platform - Doris Development
Monitoring and management
We can submit Routine Load Query and check for exceptions , At the same time, we can focus on Routine Load Join the monitoring , The monitoring will automatically scan the task on a regular basis , When a problem occurs, it will prompt and try to pull the task back .
Broker Load Tasks can also be monitored . Aim at Broker Load Label The name cannot be repeated , We take generation UUID How to solve , In order to better help you improve the use experience .
chart 5.3 Data platform - Doris Development
As shown above , We can do it in Routine Load Pause and terminate in , Help you to better use the developed homework and management .
Self research query page , Integrate Doris Help function
Due to the isolation between production and office network segment , We can only pass Web The query . We have tried to use Hue Integrate Doris Query scheme ,Doris Supported by MySQL Protocol connection to Hue , But if we integrate Hue Words , Everyone can pass Hue Inquire about Doris The data of , Security cannot be guaranteed , Unable to meet our requirements for permission .
chart 5.4 Data platform -Doris Data query
So we developed a query page in our own platform to solve this problem . The left part of the figure can be based on DB List all the tables below , The right part is the query analysis page and query results , We developed it by ourselves, which is similar to Navicat Client software .
At the same time, we are right Doris Help Functions are integrated , People don't know how to use Doris Provide help when . Through integration Doris Help, We can use the keyword search function to query syntax and examples to solve problems .
Even without integration Doris Help, It can also be in FE The node comes with Web Page to view ,FE The built-in node can view the information of the whole cluster and has Help Functional Web page . After we realize the self research query page and integrate Doris Help after , You can use it directly , To skip the need to use Admin Account connection can be used FE Steps for .
Doris Cluster monitoring page
At the same time, we developed Doris Cluster monitoring page , You can see in the cluster monitoring page FE 、BE as well as Broker Node status of . When an abnormal condition occurs in the cluster , The monitoring system will send an automatic reminder and try to pull up the cluster , At the same time, you can also observe the health of nodes in the form of pages .
chart 5.5 Data platform - Doris Cluster monitoring
about Doris For upper applications , Mainly depends on Doris Provided API And instruction completion Doris The application action of the upper layer , All we do is to Doris The instructions provided are more user-friendly integration and paged display .
The benefits of the new architecture
chart 6.1 The benefits of the new architecture
- Data access : Pass at an early stage SteamSets During the process of data access, it is necessary to manually establish Kudu surface . Due to the lack of tools , The whole process of creating tables and tasks requires 20-30 minute . Now we can realize fast data access through platform and fast construction statement , The access process of each table starts from the previous 20-30 Minutes to the present 3-5 minute , Improved performance 5-6 times .
- Data development : When performing aggregation or other actions in the early architecture , You need to write a lot of long articles SQL Code . Use Doris after , We can use it directly Doris The built-in Unique、Aggregate And other data models that can well support log scenarios Duplicate Model , stay ETL Greatly accelerate the development process .
- Query analysis : Doris The bottom layer has materialized views and Rollup Materialized index and other functions , It can improve query efficiency , meanwhile Doris The bottom layer has carried out many optimization strategies for large table Association , Such as Runtime Filter And other things Join And custom optimization strategies . Compare with Doris,Apache Kudu You need more in-depth optimization experience to better use .
- Data reports : First use Kudu Report query requires 1-2 Minutes to finish rendering , and Doris The response speed is second level or even millisecond level .
- Environmental maintenance : Doris No, Hadoop Complexity of ecosystem , The whole link is clear , The maintenance cost is much lower than Hadoop, Especially in the process of cluster migration ,Doris The convenience of operation and maintenance is particularly prominent .
Future outlook
chart 7.1 Future outlook
- Try to introduce Doris Manager: There are ongoing discussions in the community about Doris Manager Propaganda , In the future, we are also ready to introduce and actively participate in Doris Manager Cluster maintenance and management .
- Implementation is based on Flink CDC Data access : The current architecture does not introduce Flink CDC , But continue to use Canal Collect to Kafka Then collect Doris Architecture in , The link is relatively long . use Flink CDC Although we can continue to streamline the overall architecture , But you still need to write a certain amount of code , about BI People feel unfriendly when using it directly , We hope that the warehouse staff only need SQL Or complete the operation on the page, you can use . stay 3.0 Architecture Planning , We plan to introduce Flink CDC Function and expand the upper application .Flink CDC The introduction of brings “ Fast is slow , Slow is fast ” The idea of , Of course Flink The development speed of the community is very fast , Only after fully learning from everyone's experience , Can be introduced more friendly , And iterate and optimize the architecture in the process of learning experience .
- Follow the community iteration plan : What we are using Doris The version is relatively old , Now the new version Doris In memory management 、 Query performance has been greatly improved , In the future, we will follow the community iteration rhythm to upgrade the cluster and reflect new features .
- Strengthen the construction of relevant systems : Our current indicator system management, such as report metadata 、 The maintenance and management of business metadata still need to be improved . Data quality monitoring , Although it currently includes the function of data quality monitoring , However, the whole platform monitoring and data automation monitoring still need to be strengthened and improved .
Join the community
Welcome more partners who love open source to join Apache Doris Community , Participate in community building , Except in the GitHub Ascending PR or Issue outside , You are also welcome to actively participate in the daily construction of the community , such as :
Join the community Solicitation activities , Perform technical analysis 、 Applied practice and other articles ; Participate as an instructor Doris Community online and offline activities ; actively participate in Doris Questions and answers from community users .
Last , Welcome more open source technology enthusiasts to join us Apache Doris Community , Grow up hand in hand , Build community ecology .
SelectDB Is an open source technology company , Committed to Apache Doris The community provides a full-time engineer 、 A team of product managers and support engineers , Prosper the open source community ecology , Create an international industry standard in the field of real-time analytical databases . be based on Apache Doris R & D of a new generation of cloud native real-time data warehouse SelectDB, Running on multiple clouds , Provide users and customers with out of the box capability .
Related links :
SelectDB Official website :
Apache Doris Official website :
Apache Doris Github:
https://github.com/apache/doris
Apache Doris Developer mail group :
边栏推荐
- 【愚公系列】2022年7月 Go教学课程 006-自动推导类型和输入输出
- Reading notes 004: Wang Yangming's quotations
- C language 001: download, install, create the first C project and execute the first C language program of CodeBlocks
- The difference between -s and -d when downloading packages using NPM
- 【编程题】【Scratch二级】2019.12 绘制十个正方形
- 【GO记录】从零开始GO语言——用GO语言做一个示波器(一)GO语言基础
- 51与蓝牙模块通讯,51驱动蓝牙APP点灯
- 测试流程不完善,又遇到不积极的开发怎么办?
- [programming problem] [scratch Level 2] draw ten squares in December 2019
- “一个优秀程序员可抵五个普通程序员”,差距就在这7个关键点
猜你喜欢
[programming questions] [scratch Level 2] March 2019 garbage classification
某马旅游网站开发(对servlet的优化)
玩转Sonar
51与蓝牙模块通讯,51驱动蓝牙APP点灯
测试流程不完善,又遇到不积极的开发怎么办?
C language 005: common examples
Stm32f1 and stm32cubeide programming example - rotary encoder drive
深潜Kotlin协程(二十二):Flow的处理
Installation and configuration of sublime Text3
“一个优秀程序员可抵五个普通程序员”,差距就在这7个关键点
随机推荐
CoinDesk评波场去中心化进程:让人们看到互联网的未来
5G NR 系统消息
Common selectors are
深潜Kotlin协程(二十三 完结篇):SharedFlow 和 StateFlow
Linkedblockingqueue source code analysis - add and delete
Trust orbtk development issues 2022
How can CSDN indent the first line of a paragraph by 2 characters?
3年经验,面试测试岗20K都拿不到了吗?这么坑?
Development of a horse tourism website (optimization of servlet)
paddle入门-使用LeNet在MNIST实现图像分类方法二
[programming questions] [scratch Level 2] March 2019 garbage classification
玩轉Sonar
“一个优秀程序员可抵五个普通程序员”,差距就在这7个关键点
[programming problem] [scratch Level 2] March 2019 draw a square spiral
Solution to prompt configure: error: curses library not found when configuring and installing crosstool ng tool
QT establish signal slots between different classes and transfer parameters
ABAP ALV LVC template
韦东山第三期课程内容概要
Stm32f1 and stm32cubeide programming example - rotary encoder drive
Summary of weidongshan phase II course content