当前位置:网站首页>Application practice | the efficiency of the data warehouse system has been comprehensively improved! Data warehouse construction based on Apache Doris in Tongcheng digital Department
Application practice | the efficiency of the data warehouse system has been comprehensively improved! Data warehouse construction based on Apache Doris in Tongcheng digital Department
2022-07-08 00:29:00 【SelectDB】
Reading guide : Tongcheng mathematics department was established in 2015 year , It is the tourism industry financial service platform under Tongcheng Group .2020 year , The same process number department is based on Apache Doris Rich data access methods 、 Excellent parallel computing ability 、 Minimalist operation and maintenance features , introduce Apache Doris Carry out data warehouse structure 2.0 Build . This article details the architecture 1.0 To 2.0 The evolution process and Doris Application practice of , I hope that's helpful .
author | Senior engineer of big data in the same Engineering Department Uranus
Business background
Business Introduction
Tongcheng Digital Technology Co., Ltd. is a tourism industry financial service platform under Tongcheng Group , The predecessor is Tongcheng Jinfu , Officially established in 2015 year . The number of subjects in the same process is “ Digital technology leads the tourism industry ” For the vision , Adhere to the power of science and Technology , Enable China's tourism industry .
at present , Our business covers industrial financial services 、 Consumer financial services 、 Financial technology, digital technology and other sectors , The cumulative service covers more than ten million users and 76 city .
chart 1.1 Business scenario - Business Introduction
Business needs
It mainly includes four categories :
- Kanban class : It mainly includes business real-time cockpit and T+1 Business Kanban, etc .
- Early warning : Mainly including risk control fusing 、 Abnormal funds and flow monitoring .
- Analysis class : It mainly includes timely data query and analysis and temporary data retrieval .
- Financial class : It mainly includes clearing and payment reconciliation requirements .
chart 1.2 Business scenario - Business needs
Integrate the above business needs , We have built the system architecture .
Architecture evolution 1.0
Workflow
chart 2.1 Architecture evolution - framework 1.0
framework 1.0 It was very popular in previous years SteamSets and Apache Kudu As the core of the first generation architecture .
The architecture passes StreamSets Do database Binlog Write in real time after acquisition Apache Kudu in , Finally through Apache Impala And visualization tools to query and use . This process has long architecture links and SteamSets Poor reusability of some configurations , in addition Apache Kudu There are some performance bottlenecks between multi table Association and large table Association , And right IO High requirements .
chart 2.1 The application of real-time computing process in the lower part is similar to that in the upper part , In real-time computing , Buried point data is sent to Kafka It will pass Flink Do real-time calculations , And the calculated result data will fall into the analysis library and Hive The library is used for data warehouse Association .
Advantages and disadvantages
chart 2.2 Architecture evolution Advantages and disadvantages
advantage :
- framework 1.0 I chose CDH Family bucket .CDH It provides many big data components , Can be integrated with each other and put into use , At the same time, its configuration is relatively flexible .
- The use of SteamSets Support visual drag and drop and configuration development , So developers are right SteamSets The degree of acceptance is high ..
Insufficient :
- Too many components are introduced , Maintenance costs increase ; When there is a problem with the data , The troubleshooting and repair link is relatively long .
- Various technical architectures and long development links , It improves the learning cost and requirements of warehouse staff , Data warehouse personnel need to convert in different places before developing , The development process is not smooth 、 Reduced development efficiency .
- Apache Kudu In large table Association Join The performance is not satisfactory .
- Because the architecture uses CDH structure , Offline cluster and real-time cluster are not separated , Form resources to compete with each other ; In the process of offline batch running IO Or high disk consumption , The timeliness of real-time data cannot be guaranteed .
- although SteamSets Equipped with early warning capability , But the job recovery ability is still relatively lacking . When configuring multiple tasks JVM The consumption is high , Resulting in slow recovery .
Architecture evolution 2.0
Workflow
Because of the architecture 1.0 The disadvantages far outweigh the advantages , stay 2020 year , We investigated many components in the market for real-time development , Found out Apache Doris, Through investigation and comparison , The final decision will be Apache Doris Introduced architecture 2.0.
chart 3.1 Architecture evolution - framework 2.0
introduce Apache Doris after , The following modifications have been made to the overall architecture :
- adopt Canal Of CDC Ability , take MySQL Data collection to Kafka in . because Apache Doris And Kafka The degree of fit is high , It's easy to use Routine Load Load and access data .
- The data link of the original offline calculation is slightly adjusted . For storage in Hive Data in ,Apahce Doris Supported by Broker Load take Hive Data import , Therefore, the data of offline clusters can be directly loaded into Doris In .
The selection Doris
chart 3.2 framework 2.0- The selection Doris
In the process of model selection ,Apache Doris The overall performance is amazing :
- Data access : Provides rich data import methods , It can support the access of many data sources .
- Data connection : Doris Support JDBC And ODBC And so on , Yes BI The visual display of the tool is relatively friendly , Can easily communicate with BI Tools to connect , in addition Doris Realized MySQL Protocol layer , Through various Client Tools directly access Doris.
- SQL grammar : Doris Support standards SQL, Grammar to MySQL compatible , The learning cost is low for the warehouse staff ;
- MPP Parallel computing : Doris be based on MPP The architecture provides excellent parallel computing capability , For large table Join Very good support .
- The most important point : Doris Official documents are very sound , For users, it's faster to get started .
When investigating the system selection , We also understand ClickHouse,ClickHouse Yes CPU The utilization rate is high , It performs well in single table query , But in multi query high QPS Under the circumstances of poor performance .
Combine the above factors , In the end, we chose Apache Doris.
Doris Deployment architecture
chart 3.3 framework 2.0-Doris Deployment architecture
Apache Doris The deployment architecture is extremely simple , Mainly FE and BE:
FE It is the front-end node , It mainly carries out the access requested by users 、 Metadata and cluster management and query plan generation .
BE It is the back-end node , Mainly for data storage and query plan generation and execution .
Doris Operation and maintenance is very simple :
3 In June, we carried out a rolling migration of the machines in the computer room ,12 platform Doris All node machines will be migrated within three days , The overall operation is relatively simple , It is mainly used for removing the machine from the rack 、 Move and put on shelves ;FE The expansion and contraction actions don't take much time , Only used Add And Drop Wait for simple instructions .
Particular attention : Try not to use something like Drop Wait for instructions directly to BE To operate . When using Drop When the instruction is forcibly deleted ,Doris You will be prompted and asked to manually confirm whether to delete , Data cannot be recovered after forced deletion . Therefore, it is recommended to use contact mode to offline nodes , After the data migration is completed , You can directly BE The node joins again , More flexible .
Doris Real time system architecture
chart 3.4 Doris Real time system architecture
data source : In real-time system architecture , The data source comes from industrial finance 、 Consumer finance 、 Risk control data and other business lines , adopt Canal and API Interface for collection .
Data collection : Canal adopt Canal- Admin After data collection , Send data to Kafka In the message queue , Re pass Routine Load Access to Doris colony .
Doris Several positions : Doris The cluster builds three layers of data warehouse , Namely : Used Unique Model DWD Detail level 、 Aggregate Model DWS Summary level and ADS application layer .
Data applications : The architecture is applied to real-time Kanban 、 Data timeliness analysis and data service .
Doris New data warehouse features
chart 3.5 Doris New data warehouse features
The data import method is simple , According to different scenarios 3 There are two different import methods :
- Routine Load: It is mainly used for business data access and consumption Kafka The Resident Mission of exists . When we submit Rountine Load When the task ,Doris There will be a resident process for real-time consumption Kafka , Constantly from Kafka Read data from and import it into Doris in .
- Broker Load: Perform offline data import tasks such as basic dimension tables and historical data .
- Insert Into: Used for regular batch operation , Responsible for DWD Layer data processing , formation DWS Layers and ADS layer .
Doris Good data model , Improve our development efficiency :
- Unique Model in DWD When the layer is accessed , It can effectively prevent repeated consumption of data .
- Aggregate Models are used as aggregations . stay Doris in ,Aggregate Support is like Sum、Replace、Min 、Max 4 Two ways of aggregation model , Used in the process of aggregation Aggregate The underlying model can be reduced by a large part SQL The amount of code , No longer need to do it yourself Sum、Min、Max Wait for the action , From DWD Layer to DWS/ADS Layer is friendly .
Doris Low threshold for use , High query efficiency :
- Support MySQL agreement , Support standards SQL, Query syntax is highly compatible MySQL, Friendly to analysts .
- Support materialized views and Rollup Physicochemical index . The bottom layer of materialized view is similar Cube The concept of and the process of precomputation , And Kylin The way of exchanging space for time is similar , Special tables are generated at the bottom , When the materialized view is hit in the query, it will respond quickly .
hot tip : Although materialized views are very helpful , But when used too much , Each materialized view requires additional storage space , Data import will lead to inefficiency .
Doris Minimalist system architecture , Low operation and maintenance cost :
- The system only has BE and FE Two modules , Do not rely on such as Zookeeper Wait for tripartite components , Simple deployment .
- in the light of FE and BE The operation of is monitored and configured , When an exception occurs, it will restart in time .
Doris Summary of experience
chart 4.1 How to use more friendly Doris
In the use of Apache Doris In the process of , We have sorted out some experience , Help developers use more friendly Doris . For developers , The most concerned places are the following :
- In terms of development : How to access external data Doris And quickly realize ETL Development , This will affect the report output speed of developers .
- Scheduling management : Developers don't want to go online after the development is completed , There are errors or unstable situations , It is necessary to ensure the stability of task scheduling and scheduling recovery ability .
- Data query : Due to the separation between production and office network , The office network cannot directly use the connection of the production network , And the network partition cannot be solved through the form of client , Only through Web Form solution , How to query and analyze safely and conveniently has become a concern of developers .
- Cluster management : When abnormal conditions occur in the cluster, it can be captured in time and handled automatically .
To make a long story short , We hope to build a high efficiency 、 High-quality , high stability The platform of .
Doris Development optimization
According to several issues concerned by developers , We did some development optimization .
Data access
In terms of data access, we have done semi-automatic related work and made rapid generation components , According to the data source / Table to generate Routine Load Script , As long as the Kafka Of Broker or Topic It can be quickly formed by modifying Routine Load Mission .Broker Load Task and Routine Load similar , After selecting the data warehouse source, it can be generated in time Broker Load Required scripts . In the access Doris You need to create a table in advance , Similar operations can be carried out in this regard , Create statements quickly through source generation .
chart 5.1 Data platform - Doris Development
The above mainly uses the underlying metadata , Get different metadata according to different data sources, and then you can quickly generate tasks .
Submit action and maintenance management
After the task is generated , We are Routine Load Aspects are encapsulated . because Routine Load It's the resident process , We only need to submit again , The state will become Running , If there is an abnormal state, it will be detected , Monitoring will be shown to you later .
chart 5.2 Data platform - Doris Development
Monitoring and management
We can submit Routine Load Query and check for exceptions , At the same time, we can focus on Routine Load Join the monitoring , The monitoring will automatically scan the task on a regular basis , When a problem occurs, it will prompt and try to pull the task back .
Broker Load Tasks can also be monitored . Aim at Broker Load Label The name cannot be repeated , We take generation UUID How to solve , In order to better help you improve the use experience .
chart 5.3 Data platform - Doris Development
As shown above , We can do it in Routine Load Pause and terminate in , Help you to better use the developed homework and management .
Self research query page , Integrate Doris Help function
Due to the isolation between production and office network segment , We can only pass Web The query . We have tried to use Hue Integrate Doris Query scheme ,Doris Supported by MySQL Protocol connection to Hue , But if we integrate Hue Words , Everyone can pass Hue Inquire about Doris The data of , Security cannot be guaranteed , Unable to meet our requirements for permission .
chart 5.4 Data platform -Doris Data query
So we developed a query page in our own platform to solve this problem . The left part of the figure can be based on DB List all the tables below , The right part is the query analysis page and query results , We developed it by ourselves, which is similar to Navicat Client software .
At the same time, we are right Doris Help Functions are integrated , People don't know how to use Doris Provide help when . Through integration Doris Help, We can use the keyword search function to query syntax and examples to solve problems .
Even without integration Doris Help, It can also be in FE The node comes with Web Page to view ,FE The built-in node can view the information of the whole cluster and has Help Functional Web page . After we realize the self research query page and integrate Doris Help after , You can use it directly , To skip the need to use Admin Account connection can be used FE Steps for .
Doris Cluster monitoring page
At the same time, we developed Doris Cluster monitoring page , You can see in the cluster monitoring page FE 、BE as well as Broker Node status of . When an abnormal condition occurs in the cluster , The monitoring system will send an automatic reminder and try to pull up the cluster , At the same time, you can also observe the health of nodes in the form of pages .
chart 5.5 Data platform - Doris Cluster monitoring
about Doris For upper applications , Mainly depends on Doris Provided API And instruction completion Doris The application action of the upper layer , All we do is to Doris The instructions provided are more user-friendly integration and paged display .
The benefits of the new architecture
chart 6.1 The benefits of the new architecture
- Data access : Pass at an early stage SteamSets During the process of data access, it is necessary to manually establish Kudu surface . Due to the lack of tools , The whole process of creating tables and tasks requires 20-30 minute . Now we can realize fast data access through platform and fast construction statement , The access process of each table starts from the previous 20-30 Minutes to the present 3-5 minute , Improved performance 5-6 times .
- Data development : When performing aggregation or other actions in the early architecture , You need to write a lot of long articles SQL Code . Use Doris after , We can use it directly Doris The built-in Unique、Aggregate And other data models that can well support log scenarios Duplicate Model , stay ETL Greatly accelerate the development process .
- Query analysis : Doris The bottom layer has materialized views and Rollup Materialized index and other functions , It can improve query efficiency , meanwhile Doris The bottom layer has carried out many optimization strategies for large table Association , Such as Runtime Filter And other things Join And custom optimization strategies . Compare with Doris,Apache Kudu You need more in-depth optimization experience to better use .
- Data reports : First use Kudu Report query requires 1-2 Minutes to finish rendering , and Doris The response speed is second level or even millisecond level .
- Environmental maintenance : Doris No, Hadoop Complexity of ecosystem , The whole link is clear , The maintenance cost is much lower than Hadoop, Especially in the process of cluster migration ,Doris The convenience of operation and maintenance is particularly prominent .
Future outlook
chart 7.1 Future outlook
- Try to introduce Doris Manager: There are ongoing discussions in the community about Doris Manager Propaganda , In the future, we are also ready to introduce and actively participate in Doris Manager Cluster maintenance and management .
- Implementation is based on Flink CDC Data access : The current architecture does not introduce Flink CDC , But continue to use Canal Collect to Kafka Then collect Doris Architecture in , The link is relatively long . use Flink CDC Although we can continue to streamline the overall architecture , But you still need to write a certain amount of code , about BI People feel unfriendly when using it directly , We hope that the warehouse staff only need SQL Or complete the operation on the page, you can use . stay 3.0 Architecture Planning , We plan to introduce Flink CDC Function and expand the upper application .Flink CDC The introduction of brings “ Fast is slow , Slow is fast ” The idea of , Of course Flink The development speed of the community is very fast , Only after fully learning from everyone's experience , Can be introduced more friendly , And iterate and optimize the architecture in the process of learning experience .
- Follow the community iteration plan : What we are using Doris The version is relatively old , Now the new version Doris In memory management 、 Query performance has been greatly improved , In the future, we will follow the community iteration rhythm to upgrade the cluster and reflect new features .
- Strengthen the construction of relevant systems : Our current indicator system management, such as report metadata 、 The maintenance and management of business metadata still need to be improved . Data quality monitoring , Although it currently includes the function of data quality monitoring , However, the whole platform monitoring and data automation monitoring still need to be strengthened and improved .
Join the community
Welcome more partners who love open source to join Apache Doris Community , Participate in community building , Except in the GitHub Ascending PR or Issue outside , You are also welcome to actively participate in the daily construction of the community , such as :
Join the community Solicitation activities , Perform technical analysis 、 Applied practice and other articles ; Participate as an instructor Doris Community online and offline activities ; actively participate in Doris Questions and answers from community users .
Last , Welcome more open source technology enthusiasts to join us Apache Doris Community , Grow up hand in hand , Build community ecology .
SelectDB Is an open source technology company , Committed to Apache Doris The community provides a full-time engineer 、 A team of product managers and support engineers , Prosper the open source community ecology , Create an international industry standard in the field of real-time analytical databases . be based on Apache Doris R & D of a new generation of cloud native real-time data warehouse SelectDB, Running on multiple clouds , Provide users and customers with out of the box capability .
Related links :
SelectDB Official website :
Apache Doris Official website :
Apache Doris Github:
https://github.com/apache/doris
Apache Doris Developer mail group :
边栏推荐
- 韦东山第二期课程内容概要
- C# 泛型及性能比较
- 应用实践 | 数仓体系效率全面提升!同程数科基于 Apache Doris 的数据仓库建设
- Single machine high concurrency model design
- 接口测试要测试什么?
- Su embedded training - day4
- When creating body middleware, express Is there any difference between setting extended to true and false in urlencoded?
- 2022-07-07:原本数组中都是大于0、小于等于k的数字,是一个单调不减的数组, 其中可能有相等的数字,总体趋势是递增的。 但是其中有些位置的数被替换成了0,我们需要求出所有的把0替换的方案数量:
- [programming problem] [scratch Level 2] draw ten squares in December 2019
- 3 years of experience, can't you get 20K for the interview and test post? Such a hole?
猜你喜欢
QT adds resource files, adds icons for qaction, establishes signal slot functions, and implements
【obs】官方是配置USE_GPU_PRIORITY 效果为TRUE的
Service Mesh介绍,Istio概述
80% of the people answered incorrectly. Does the leaf on the apple logo face left or right?
3 years of experience, can't you get 20K for the interview and test post? Such a hole?
Qt不同类之间建立信号槽,并传递参数
RPA云电脑,让RPA开箱即用算力无限?
Detailed explanation of interview questions: the history of blood and tears in implementing distributed locks with redis
某马旅游网站开发(登录注册退出功能的实现)
【笔记】常见组合滤波电路
随机推荐
Summary of the third course of weidongshan
Service Mesh介绍,Istio概述
去了字节跳动,才知道年薪 40w 的测试工程师有这么多?
[programming questions] [scratch Level 2] March 2019 garbage classification
The method of server defense against DDoS, Hangzhou advanced anti DDoS IP section 103.219.39 x
【愚公系列】2022年7月 Go教学课程 006-自动推导类型和输入输出
Fully automated processing of monthly card shortage data and output of card shortage personnel information
【编程题】【Scratch二级】2019.09 制作蝙蝠冲关游戏
paddle入门-使用LeNet在MNIST实现图像分类方法二
取消select的默认样式的向下箭头和设置select默认字样
CoinDesk评波场去中心化进程:让人们看到互联网的未来
Is 35 really a career crisis? No, my skills are accumulating, and the more I eat, the better
paddle入门-使用LeNet在MNIST实现图像分类方法一
Flask learning record 000: error summary
paddle一个由三个卷积层组成的网络完成cifar10数据集的图像分类任务
[OBS] the official configuration is use_ GPU_ Priority effect is true
What is load balancing? How does DNS achieve load balancing?
Coindesk comments on the decentralization process of the wave field: let people see the future of the Internet
22年秋招心得
2022-07-07:原本数组中都是大于0、小于等于k的数字,是一个单调不减的数组, 其中可能有相等的数字,总体趋势是递增的。 但是其中有些位置的数被替换成了0,我们需要求出所有的把0替换的方案数量: