Xiaomai Technology x Hologres: Building a Highly Available Real-Time Data Warehouse for Ten-Billion-Scale Advertising Data

2022-06-29 17:55:00 【InfoQ】

Authors:

Li Yun, Senior Data Warehouse Development Engineer at Xiaomai and head of data warehousing

Raven, Data Warehouse Development Engineer at Xiaomai

I. Business Introduction

Xiaomai Technology was founded in January 2015. It is a mobile Internet technology company that treats digital leadership as its core advantage and pursues high-quality, self-sustaining business growth. Focused on user value and driven by data, it develops a broad range of mobile applications, including utility tools, casual games, puzzle games, and sports apps. Its vision and mission is to become a leading global growth service platform for developers. Through standardized products and services, Xiaomai aims to give developers full-link solutions, backing them with technology plus service to fuel continued product growth and to help developers of utility tools and casual games improve their products' success rates.

Xiaomai Technology has developed more than 400 products to date. Cumulative downloads and installs exceed 700 million, daily active users number 5 to 10 million, and the daily data volume exceeds 10 billion. Around high-quality apps, user growth, and monetization, the company has used big data technology to build more than 10 application systems, including a commercial monetization system, intelligent promotion, and financial management. As the user base grew exponentially, business teams began asking for real-time, fine-grained data, and the big data systems came under strain. How to better empower business growth through data warehouse construction became a key point to break through.

II. Evolution of Xiaomai's Data Warehouse: From Shence to a Stream-Batch Unified Real-Time Data Warehouse

To meet the business teams' data needs, Xiaomai's big data team started building a data warehouse system early in the company's development. The system has gone through three stages: the original Shence stage, the offline data warehouse, and today's stable stream-batch unified real-time data warehouse. At each stage, business and technical challenges drove continued iteration and optimization of the warehouse so that it could keep up with rapid business growth. The development of Xiaomai's big data platform is described below:

1. Shence stage

In the earliest stage, the business systems were built on Shence, and apps sent their data directly to it. At first, only in-app behavior data and advertising data were visible. Analytical capability was limited and inflexible: processing could not be customized, and the data could not be integrated and analyzed together with third-party data, so further business requirements could not be met. Because the business was still in its infancy, the data platform was built mainly to satisfy existing needs; when the business had a special requirement, a corresponding analysis system was set up separately.

2. Offline data warehouse (introducing MaxCompute)

As the company's business developed, the number of users served kept growing, data volume grew exponentially, and the number of business systems multiplied. These systems were fragmented, and system stability began to face serious challenges. Given the limitations of Shence, the team introduced Alibaba Cloud MaxCompute, DataWorks, and a distributed database (hereinafter "the DB") to build an offline data warehouse. The main flow was:

Data is pulled from the Shence compass server via JDBC and synchronized offline to MaxCompute through DataWorks;

In MaxCompute, the data goes through four-layer data warehouse modeling (ODS, DWD, DWS, ADS), and the result data is synchronized offline to the DB through DataWorks;

The DB serves the various analysis needs of the business systems.

The offline data warehouse basically met the analysis and decision-making needs of every role in the company, but as the business continued to grow, the following problems appeared:

Every system pulled data from Shence in the same JDBC-based way, so the pipeline depended too heavily on the third-party Shence service and was too tightly coupled to it. When Shence had a problem, the entire calculation process could not continue, and the business's agile analysis needs could not be met. After Shence recovered, data had to be rerun manually, which wasted a great deal of manpower.

After the statistics were computed, loading the results into the database was slow, which significantly lengthened the running time of the whole link. Meanwhile, the business demanded ever more timely data computation, which this stage could not support.

Data volume was growing exponentially and analysis dimensions kept increasing, so the result data was approaching the granularity of detail data. The existing query engine, the DB, could not support multi-dimensional analysis at this scale. The biggest challenge was delivering low-latency, high-QPS query analysis over behavioral data at the ten-billion scale.

To work around the query engine's inability to handle large data volumes, a lot of pre-computation was performed on the data, which caused computational redundancy and rising costs.

At the same time, the growing number of systems drove O&M and development costs up linearly, so the business's various demands could not be met quickly.

Because of these pain points, data output frequently slowed down or stalled, which seriously affected business decisions and drew frequent complaints from business departments. The technical department urgently needed a solution.

3. Stream-batch unified real-time data warehouse (introducing Hologres + Flink)

To better address these business demands, the third stage introduced Alibaba Cloud's Hologres and Flink, replaced the DB with Hologres, and built a stream-batch unified real-time data warehouse. The main data links are as follows:

1. Log data and business data in Kafka are written to MaxCompute in real time through DataWorks real-time synchronization and land in the ODS layer. For businesses with data timeliness requirements, the data is written directly to Flink, which performs real-time ETL and then writes the results to Hologres.

2. Third-party data is synchronized offline to MaxCompute through DataWorks. The data warehouse layers (ODS, DWD, DWS, ADS) are built in MaxCompute, and the processed data is written directly to Hologres.

3. Hologres stores both real-time and offline data, connects directly to upper-layer applications, and serves the varied query needs of the business systems, forming the stream-batch unified real-time data warehouse.

The Hologres + Flink + MaxCompute stream-batch unified real-time data warehouse built in this third stage has successfully supported many of Xiaomai Technology's businesses, including data operations, BI, data interfaces, and the business center. The benefits of the new architecture are:

Clearer data structure: each data layer has its own scope, which makes tables easier for the business to locate and understand.

Data lineage: a business table provided for business use may be derived from many source tables. If one of the source tables has a problem, we can locate it quickly and accurately and clearly understand the impact scope of each table.

Less redundant development: with standardized data layering, developing shared middle-layer data reduces duplicate computation and improves the reuse of each business table.

Simplified complex problems: a complex business process is split into several steps, and each layer handles a single step, which is simpler and easier to understand. It also makes data accuracy easier to maintain: when data goes wrong, there is no need to repair all of it, only to start the repair from the step where the problem occurred. This is somewhat similar to the fault-tolerance mechanism of Spark RDDs.

Reduced business impact: the business may change frequently, and with this layering a business change does not require re-ingesting the data each time.

Data is more real-time, so business decisions can be made faster.

Data is decoupled from third parties, making the system more robust.

III. Why Hologres?

Choosing Hologres was the conclusion of our research and testing across many aspects. Below we explain why we chose Hologres, from the perspectives of technology and usage scenarios, in the context of our business.

1. High-performance writes and extremely fast complex queries

We first ran performance validation on both the DB and Hologres, focusing on queries and writes, because the database's biggest bottlenecks in the offline data warehouse stage were query performance and write performance.

Query performance: we validated query performance with both simple and complex SQL drawn from actual business scenarios. Without optimization, the two performed about the same in the early tests. After table design and underlying optimization for Hologres, we measured roughly a 4x improvement with Hologres. We will continue performance tuning with colleagues from Alibaba.

Write performance: in the previous DB environment, writing from MaxCompute to the DB took a long time (about an hour for 100 million rows). Once query traffic picked up, write performance slowed several-fold and sometimes caused downtime. Writing MaxCompute data to Hologres, by contrast, is very fast: importing 100 million rows completes in a little over 10 seconds.

2. Support for multiple analysis scenarios

The stream-batch unified real-time data warehouse built with MaxCompute + Hologres + Flink has enriched our application scenarios, mainly:

Real-time data warehouse: because Hologres integrates well with Flink, data is collected in real time, computed by Flink in real time, and written directly to Hologres. This makes it possible to build real-time dashboards, real-time monitoring and alerting, real-time recommendation, real-time training, and other applications that respond quickly to business needs.

MaxCompute query acceleration: Hologres can query MaxCompute data directly through foreign tables. If higher performance is required, the data can be imported into Hologres for faster query processing. With the foreign-table approach, offline data can be queried and analyzed without exporting it.
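The two access paths can be sketched in Hologres SQL as follows. This is a minimal sketch under assumptions, not Xiaomai's actual setup: the MaxCompute project name xm_project, the foreign server name odps_server, the internal table public.holo_income_copy, and all column names are hypothetical and used only for illustration.

```sql
-- Path 1: map a MaxCompute table as a foreign table and query it in place.
-- xm_project is a hypothetical MaxCompute project; odps_server is assumed
-- to be the foreign server that points at MaxCompute.
IMPORT FOREIGN SCHEMA xm_project LIMIT TO (income_dt_test)
FROM SERVER odps_server INTO public
OPTIONS (if_table_exist 'update');

-- Query the offline data directly, without exporting it first.
SELECT product_id, COUNT(*) AS row_cnt
FROM public.income_dt_test
GROUP BY product_id;

-- Path 2: copy the data into a Hologres internal table for faster queries.
-- public.holo_income_copy and its columns are hypothetical; the table is
-- assumed to exist already.
INSERT INTO public.holo_income_copy (product_id, device_id, income)
SELECT product_id, device_id, income
FROM public.income_dt_test;
```

The first path avoids any data movement, while the second trades a copy step for lower query latency.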

Fit for advertising analysis scenarios: Hologres offers a rich set of analysis functions, such as retention analysis and funnel analysis functions. These are well suited to advertising scenarios and can be used directly without secondary development on our side.

In summary, both the performance and the supported usage scenarios match our company's business needs well.

IV. Best Practices for Ten-Billion-Scale User Behavior Analysis

User behavior refers to the actions users take in a product. Analyzing user behavior supports decisions on the next operational strategy and gives direction for product iteration and development. User behavior analysis is a very common scenario at Internet companies, but the core pain point for most businesses is that the volume of user data is large, the computation logic is complex, and computing performance is insufficient, so results often cannot be obtained in time and the next decision is affected.

In Xiaomai's advertising crowd analysis scenario, the data volume is in the tens of billions, and there are many join queries between large tables with hundreds of millions of rows. The previous system struggled with these computations and was often questioned by the business. On the current system, index design and performance tuning of the Hologres tables have produced a very noticeable performance gain. The implementation is described below.

The user behavior analysis flow is as follows:

1. MaxCompute stores the income table income_dt_test, which is scheduled hourly into the Hologres result table holo_ad_income_dt_test.

2. Hologres stores the user behavior table holo_dws_usr_label_df, which is written by periodically scheduled MaxCompute jobs.

3. The two tables are joined in Hologres with an analysis SQL query to perform crowd analysis.
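The sample query itself does not appear in this text, so the following is only a hypothetical sketch of what such a crowd analysis join might look like. The two table names and the fields product_id, device_id, ad_id, and position_id come from this article; the user_label, income, and ds columns and the filter values are assumptions.

```sql
-- Hypothetical crowd analysis: ad income and distinct devices per user label.
-- user_label, income, and ds are assumed column names.
SELECT
    b.user_label,
    a.ad_id,
    a.position_id,
    COUNT(DISTINCT a.device_id) AS device_cnt,
    SUM(a.income)               AS total_income
FROM holo_ad_income_dt_test AS a
JOIN holo_dws_usr_label_df  AS b
  ON  a.product_id = b.product_id
  AND a.device_id  = b.device_id
WHERE a.product_id = 'demo_product'   -- placeholder value
  AND a.ds = '20220629'               -- placeholder partition value
GROUP BY b.user_label, a.ad_id, a.position_id;
```

The join condition uses product_id and device_id together, which is the column pair the article uses as the distribution key for both tables.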

Based on the business scenario, the following optimizations were applied to the tables and the SQL:

1. Because the user income table and the user behavior table need to be joined, the distribution field distribution_key is set so that records with the same key are assigned to the same shard. This minimizes shuffle and allows a Local Join wherever possible. Setting the following distribution keys greatly improved the speed of join queries:

CALL set_table_property('holo_dws_usr_label_df', 'distribution_key', 'product_id,device_id');
CALL set_table_property('holo_ad_income_dt_test', 'distribution_key', 'product_id,device_id');

2. Because report filters frequently use the three fields product_id, ad_id, and position_id, and bitmap_columns is suited to equality queries, these three fields are set as bitmap_columns:

CALL SET_TABLE_PROPERTY('public.holo_ad_income_dt_test', 'bitmap_columns', '"product_id:on","ad_id:on","position_id:on"');

3. The daily incremental data is roughly estimated at about 1 million records, so the table is set up as a partitioned table to improve query speed. Partitioning is not recommended when the data volume is small; otherwise query performance will suffer.

4. Deduplicated user counts rely heavily on count(distinct a.device_id), which consumes a lot of resources. We therefore use APPROX_COUNT_DISTINCT(a.device_id) instead, which improves performance considerably at the cost of some precision. The precision can be adjusted with the parameter set hg_experimental_approx_count_distinct_precision=20.
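As a minimal sketch of this fourth optimization, the exact and approximate forms can be compared side by side. The ds column and its value are hypothetical; the function and parameter names are the ones given in the text.

```sql
-- Exact deduplication: accurate but resource-intensive on large tables.
SELECT a.product_id, COUNT(DISTINCT a.device_id) AS device_cnt
FROM holo_ad_income_dt_test AS a
WHERE a.ds = '20220629'          -- placeholder partition value
GROUP BY a.product_id;

-- Approximate deduplication: set the precision for this session first,
-- then replace the exact aggregate.
SET hg_experimental_approx_count_distinct_precision = 20;

SELECT a.product_id, APPROX_COUNT_DISTINCT(a.device_id) AS device_cnt_approx
FROM holo_ad_income_dt_test AS a
WHERE a.ds = '20220629'          -- placeholder partition value
GROUP BY a.product_id;
```

Running both forms on the same partition is one way to check whether the error is acceptable before switching a report to the approximate form.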

With these optimizations to the table structure and SQL, our advertising crowd analysis now responds within seconds, which greatly improves computing efficiency and lets us respond quickly to business needs.
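Taken together, the table-level settings in this section could be expressed at table creation time roughly as follows. This is a sketch under assumptions: the column types, the income measure column, the ds partition column, and the child table name are hypothetical, and the bitmap_columns value is written as a plain comma-separated list.

```sql
BEGIN;
CREATE TABLE public.holo_ad_income_dt_test (
    product_id  text NOT NULL,
    device_id   text NOT NULL,
    ad_id       text,
    position_id text,
    income      numeric(18, 4),   -- hypothetical measure column
    ds          text NOT NULL     -- hypothetical partition column (day)
) PARTITION BY LIST (ds);

-- Same distribution key as the user behavior table, so joins can stay local to a shard.
CALL set_table_property('public.holo_ad_income_dt_test', 'distribution_key', 'product_id,device_id');
-- Bitmap indexes for the columns used in equality filters.
CALL set_table_property('public.holo_ad_income_dt_test', 'bitmap_columns', 'product_id,ad_id,position_id');
COMMIT;

-- One child table per day (hypothetical naming scheme).
CREATE TABLE public.holo_ad_income_dt_test_20220629
    PARTITION OF public.holo_ad_income_dt_test
    FOR VALUES IN ('20220629');
```

The property calls sit in the same transaction as CREATE TABLE so that the distribution key is fixed when the table is created.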

V. High Availability Through Hologres Read/Write Separation

1. Background: reads and writes were not separated and affected each other

As more and more businesses migrated to Hologres, write tasks became increasingly frequent, and during peak periods the instance began to see query exceptions and write task errors. The main reasons were:

Around 10 a.m. every day is the write peak for offline (T+1) tasks. A large number of report statistics tasks converge in this window, and their writes to Hologres take up a lot of resources.

Some write tasks carry especially large data volumes: the daily incremental result data has reached hundreds of millions of rows, so writes run for a long time and keep holding resources. Some result tables also have too many fields, more than 1,000, which consumes even more resources.

While writes are running, some MaxCompute tasks also read Hologres external tables, which increases the number of connections and affects other tasks.

The reporting window is also the peak period for business queries, so a large number of queries run at the same time as a large number of writes, and they interfere with each other.

Write tasks had an automatic retry mechanism. Whenever an OOM, timeout, or other error occurred, the task automatically reran and occupied resources again, so write failures spread across more and more tasks.

2. Optimization approach: Hologres shared-storage instance deployment

Given this situation, we adjusted and optimized the Hologres instance by configuring Hologres shared-storage multi-instance deployment to separate reads from writes. The single read-write instance was changed into a primary instance plus a read-only secondary instance, with both instances sharing the same storage:

The business is divided into modules, and the read-only queries of the report backend, Tableau, and production business modules are migrated to the read-only instance.

Synchronization tasks and a small number of read/write tasks remain on the read-write primary instance. Data for different modules is stored in different schemas for easier management.

[Figure: deployment before the adjustment]

[Figure: deployment after the adjustment]

We also made several other optimizations based on the current state of the business, including:

1. Increase the session-level timeout for large write tasks: set statement_timeout = 'Xmin';

2. Run ANALYZE on the foreign tables and internal tables before writing so that updated statistics speed up the write;

3. Disable the automatic retry mechanism for Hologres write tasks to avoid affecting subsequent write tasks;

4. Reduce unnecessary MaxCompute reads of Hologres external table data to lower connection usage;

5. For some tables with large data volumes, stagger the write and query peaks by moving the writes to other time periods.
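Items 1 and 2 can be combined at the start of a write session, for example as below. This is a sketch only: the 30min value is an arbitrary stand-in for the unspecified Xmin, and the table names public.mc_result_foreign and public.holo_result and their columns are hypothetical.

```sql
-- Raise the timeout for this session only; '30min' is an illustrative value.
SET statement_timeout = '30min';

-- Refresh statistics on the source foreign table and the target internal table.
ANALYZE public.mc_result_foreign;   -- hypothetical foreign table over MaxCompute
ANALYZE public.holo_result;         -- hypothetical Hologres internal table

-- Then run the large write in the same session.
INSERT INTO public.holo_result (product_id, device_id, income)
SELECT product_id, device_id, income
FROM public.mc_result_foreign;
```

Because SET applies to the current session, other sessions and their queries keep the default timeout.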

3. Results: significantly improved system stability

After deploying Hologres read/write-separated instances and applying the related write optimizations, write tasks no longer affect query tasks. Reports and other systems can provide stable query service, write task resources are used and allocated more reasonably, and write exceptions such as OOM no longer occur. System service stability has improved greatly.

Going forward, we will try splitting the businesses of different modules across different read-only instances to further improve service stability and give service users a better experience.

VI. Business Value

The stream-batch unified real-time data warehouse platform built with Hologres + Flink + MaxCompute supports many of Xiaomai's application scenarios, including monitoring dashboards, DMP crowd intelligence, and financial analysis. The significant business benefits include:

1. Shared data for upper-layer services

With data shared, the platform provides services outward. Each line of business no longer needs to build its own pipeline and can quickly obtain data support from the platform, which reduces data silos.

2. Second-level response for complex queries over hundreds of millions of rows

With Hologres's excellent query performance, combined with table design and SQL optimization, report response times have improved greatly. Even complex join queries over large tables with hundreds of millions of rows, such as user profiling and behavior analysis, return results quickly, which has been recognized by the business.

3. Strong stability through read/write separation

With Hologres shared-storage instance deployment, the business achieves read/write separation while using only one copy of storage. This ensures system stability without adding cost pressure.

Learn more about Hologres:

https://www.aliyun.com/product/bigdata/hologram

Copyright notice
This article was created by [InfoQ]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/180/202206291748171760.html