Application Practice | Data Warehouse Efficiency Comprehensively Improved: Data Warehouse Construction Based on Apache Doris in the Tongcheng Digital Department
2022-07-07 22:06:00 【InfoQ】
Business background
Business Introduction
Business needs
- Dashboards: mainly the real-time business cockpit and T+1 business dashboards.
- Early warning: mainly risk-control circuit breaking, abnormal-fund monitoring, and traffic monitoring.
- Analysis: mainly timely data query and analysis, and ad hoc data retrieval.
- Finance: mainly clearing and payment reconciliation.
Architecture evolution 1.0
Workflow
Advantages and disadvantages
Advantages:
- Architecture 1.0 was built on the CDH suite. CDH provides many big data components that integrate with one another and can be put to use directly, and its configuration is relatively flexible.
- StreamSets supports visual drag-and-drop configuration and development, so developers accepted it readily.
Disadvantages:
- Too many components were introduced, which raised maintenance costs; when a data problem occurred, the troubleshooting and repair path was long.
- The mix of technical architectures and the long development chain raised the learning cost and skill requirements for data warehouse staff, who had to switch between different tools during development. The development process was not smooth and development efficiency dropped.
- Apache Kudu performed poorly on joins between large tables.
- Because the architecture was built on CDH, the offline and real-time clusters were not separated and competed for resources; when offline batch jobs consumed heavy I/O or disk, the timeliness of real-time data could not be guaranteed.
- Although StreamSets provides alerting, its job recovery capability is still weak. With many tasks configured, JVM consumption is high and recovery is slow.
Architecture evolution 2.0
Workflow
- Using Canal's CDC capability, MySQL data is collected into Kafka. Because Apache Doris fits well with Kafka, Routine Load can then be used to load the data into Doris easily.
- The original offline computation link was adjusted slightly. For data stored in Hive, Apache Doris supports importing it through Broker Load, so data from the offline cluster can be loaded directly into Doris.
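The Kafka-to-Doris step above can be sketched as a small helper that renders a Routine Load statement. This is only an illustration, not the team's actual job definition: the database, table, job, topic, broker, and consumer-group names are hypothetical, and it assumes each Kafka message is a flat JSON object (Canal's change-event envelope may need additional mapping, which is not shown).

```python
from typing import Sequence


def build_routine_load(
    db: str,
    table: str,
    job: str,
    topic: str,
    brokers: str,
    columns: Sequence[str],
    group_id: str,
) -> str:
    """Render a CREATE ROUTINE LOAD statement that consumes one Kafka topic."""
    column_list = ", ".join(columns)
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f"COLUMNS({column_list})\n"
        "PROPERTIES (\n"
        '    "format" = "json"\n'
        ")\n"
        "FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}",\n'
        f'    "property.group.id" = "{group_id}"\n'
        ");"
    )


# Hypothetical names throughout; none of them come from the article.
statement = build_routine_load(
    db="ods",
    table="orders",
    job="orders_job",
    topic="orders_topic",
    brokers="kafka1:9092",
    columns=["id", "amount", "status"],
    group_id="doris_orders",
)
print(statement)
```

The rendered statement would be submitted through any MySQL-protocol client connected to Doris; submission is left out so the sketch stays self-contained.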
Why we chose Doris
- Data access: Doris provides rich data import methods and can ingest from many data sources.
- Data connection: Doris supports JDBC, ODBC, and other connection methods, so it connects easily with BI tools for visualization. Doris also implements the MySQL protocol layer, so it can be accessed directly through a variety of client tools.
- SQL syntax: Doris supports standard SQL with MySQL-compatible syntax, so the learning cost for data warehouse staff is low.
- MPP parallel computing: Built on an MPP architecture, Doris provides strong parallel computing capability and handles large-table joins well.
- Most importantly: the official Doris documentation is thorough, so users can get started quickly.
Doris Deployment architecture
Doris Real time system architecture
Doris New data warehouse features
- Routine Load: used mainly for business data ingestion; it exists as a resident job that consumes Kafka. When we submit a Routine Load job, Doris keeps a resident process that continuously reads data from Kafka and imports it into Doris.
- Broker Load: used for offline imports such as basic dimension tables and historical data.
- Insert Into: used for scheduled batch jobs; it processes DWD-layer data to build the DWS and ADS layers.
- Unique model: used when data lands in the DWD layer; it effectively prevents duplicates caused by repeated consumption.
- Aggregate model: used for aggregation. In Doris, the Aggregate model supports aggregation types such as Sum, Replace, Min, and Max. Relying on it during aggregation removes a large share of SQL code, since we no longer need to write Sum, Min, Max, and similar operations ourselves, which makes going from the DWD layer to the DWS/ADS layers easier.
- Supports the MySQL protocol and standard SQL; query syntax is highly compatible with MySQL, which is friendly to analysts.
- Supports materialized views and Rollup materialized indexes. Underneath, a materialized view is similar to the Cube concept and involves precomputation; like Kylin, it trades space for time. Special tables are generated at the bottom layer, and a query that hits the materialized view responds quickly.
- The system has only two modules, FE and BE, and does not depend on third-party components such as ZooKeeper, so deployment is simple.
- FE and BE processes are monitored, and they are restarted promptly when an exception occurs.
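To make the Unique and Aggregate models above more concrete, the following is a minimal in-memory simulation of their merge semantics. It is not Doris code: the function and column names are hypothetical, rows are applied strictly in list order (a simplification), and the Unique model is imitated by applying Replace to every value column.

```python
from typing import Any, Callable, Dict, Iterable, Sequence, Tuple

# Merge functions for the four aggregation types named in the text.
AGGREGATORS: Dict[str, Callable[[Any, Any], Any]] = {
    "SUM": lambda old, new: old + new,
    "REPLACE": lambda old, new: new,
    "MIN": min,
    "MAX": max,
}


def aggregate_rows(
    rows: Iterable[Dict[str, Any]],
    key_columns: Sequence[str],
    value_aggs: Dict[str, str],
) -> Dict[Tuple[Any, ...], Dict[str, Any]]:
    """Merge rows that share the same key, one aggregator per value column."""
    table: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in table:
            # First row for this key is stored as-is.
            table[key] = {col: row[col] for col in value_aggs}
            continue
        merged = table[key]
        for col, agg in value_aggs.items():
            merged[col] = AGGREGATORS[agg](merged[col], row[col])
    return table


# Aggregate-style table: the merge replaces hand-written SUM/MIN/MAX queries.
daily = aggregate_rows(
    rows=[
        {"day": "2022-07-01", "total": 10, "lowest": 10, "highest": 10},
        {"day": "2022-07-01", "total": 25, "lowest": 25, "highest": 25},
        {"day": "2022-07-02", "total": 7, "lowest": 7, "highest": 7},
    ],
    key_columns=["day"],
    value_aggs={"total": "SUM", "lowest": "MIN", "highest": "MAX"},
)

# Unique-style table: a re-consumed or updated row overwrites the earlier one.
deduped = aggregate_rows(
    rows=[
        {"id": 1, "status": "created"},
        {"id": 2, "status": "created"},
        {"id": 1, "status": "paid"},
    ],
    key_columns=["id"],
    value_aggs={"status": "REPLACE"},
)
```

In this sketch the two rows for 2022-07-01 collapse into one row per key, which is the effect that lets the DWD-to-DWS/ADS step avoid explicit aggregation SQL.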
Doris Summary of experience
- Development: how to bring external data into Doris and implement ETL quickly, which affects how fast developers can deliver reports.
- Scheduling management: developers do not want errors or instability to surface after development is complete and jobs go live, so task scheduling must be stable and recoverable.
- Data query: the production network is separated from the office network, so the office network cannot connect to production directly, and a client tool cannot get around the network partition; only a web-based solution can. How to query and analyze safely and conveniently became a concern for developers.
- Cluster management: abnormal conditions in the cluster should be captured promptly and handled automatically.
Doris Development optimization
Data access
Submit action and maintenance management
Monitoring and management
Self-developed query page with integrated Doris Help function
Doris Cluster monitoring page
The benefits of the new architecture
- Data access: In the early stage, ingesting data through StreamSets required creating Kudu tables manually. For lack of tooling, creating the table and the task took 20-30 minutes in total. Now the platform and quick table-creation statement generation make ingestion fast: the time per table has dropped from 20-30 minutes to 3-5 minutes, an improvement of 5-6 times.
- Data development: In the early architecture, aggregation and similar operations required writing a lot of long SQL. With Doris, we can directly use its built-in Unique and Aggregate data models, as well as the Duplicate model that suits log scenarios well, which greatly speeds up ETL development.
- Query analysis: Doris provides materialized views and Rollup materialized indexes, which improve query efficiency. Doris has also implemented many optimizations for large-table joins, such as Runtime Filter and other join and custom optimization strategies. By comparison, Apache Kudu requires deeper tuning experience to use well.
- Data reports: Report queries on Kudu originally took 1-2 minutes to finish rendering, whereas Doris responds within seconds or even milliseconds.
- Environment maintenance: Doris does not have the complexity of the Hadoop ecosystem. The whole link is clear and the maintenance cost is much lower than Hadoop's; the operational convenience of Doris is especially evident during cluster migration.
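The quick table-creation step mentioned under data access could look roughly like the generator below, which turns source column metadata into a Doris Unique-model DDL statement. The article does not describe the platform's actual tool, so everything here is an assumption: the function name, the deliberately small type mapping, and the bucket and replica counts are placeholders.

```python
from typing import Sequence, Tuple

# Hypothetical, deliberately small MySQL-to-Doris type mapping.
TYPE_MAP = {
    "int": "INT",
    "bigint": "BIGINT",
    "varchar": "VARCHAR(255)",
    "datetime": "DATETIME",
    "decimal": "DECIMAL(18, 2)",
}


def build_unique_table_ddl(
    db: str,
    table: str,
    columns: Sequence[Tuple[str, str]],
    key_columns: Sequence[str],
    buckets: int = 10,
    replication_num: int = 3,
) -> str:
    """Render a CREATE TABLE statement using the Unique data model."""
    types = dict(columns)
    for name, source_type in columns:
        if source_type not in TYPE_MAP:
            raise ValueError(f"unmapped type for column {name}: {source_type}")
    for key in key_columns:
        if key not in types:
            raise ValueError(f"key column not found: {key}")
    # Key columns must lead the column list, in the same order as the key.
    ordered = [(key, types[key]) for key in key_columns]
    ordered += [(n, t) for n, t in columns if n not in key_columns]
    column_lines = ",\n".join(f"    `{n}` {TYPE_MAP[t]}" for n, t in ordered)
    key_list = ", ".join(f"`{k}`" for k in key_columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {db}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"UNIQUE KEY({key_list})\n"
        f"DISTRIBUTED BY HASH({key_list}) BUCKETS {buckets}\n"
        f'PROPERTIES ("replication_num" = "{replication_num}");'
    )


# Hypothetical source table; the key column is moved to the front.
ddl = build_unique_table_ddl(
    db="ods",
    table="orders",
    columns=[("status", "varchar"), ("id", "bigint"), ("updated_at", "datetime")],
    key_columns=["id"],
)
print(ddl)
```

Generating the statement from metadata rather than typing it by hand is one plausible way the per-table time could fall from 20-30 minutes to a few minutes.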
Future outlook
- Try introducing Doris Manager: The community has been discussing and promoting Doris Manager, and we plan to adopt it for cluster maintenance and management and to participate actively.
- Implement Flink CDC-based data access: The current architecture does not use Flink CDC; it still uses Canal to collect data into Kafka and then loads it into Doris, so the link is relatively long. Flink CDC would streamline the overall architecture further, but it still requires writing a certain amount of code, which is unfriendly for BI staff to use directly. We want data warehouse staff to need only SQL or on-page operations. In the 3.0 architecture plan, we intend to introduce Flink CDC and extend the upper-layer applications. Introducing Flink CDC reflects the idea that "fast is slow, slow is fast": the Flink community is moving very quickly, and only after fully learning from others' experience can we introduce it smoothly and iterate on the architecture as we learn.
- Follow the community's iteration plan: The Doris version we use is relatively old. Newer versions have improved greatly in memory management and query performance, so we will follow the community's release rhythm to upgrade the cluster and adopt new features.
- Strengthen related systems: Our indicator system management, such as the maintenance and management of report metadata and business metadata, still needs improvement. Data quality monitoring exists today, but platform-wide monitoring and automated data monitoring still need to be strengthened.
Join the community