Revealed for the first time: why ByteDance's data platform did not choose a “pure middle platform” model

2022-06-12 02:29:00 Deep learning and python

Guest | Luo Xuan

Editor | Xue Liang

“Every time scale grows tenfold, many architectural design points have to be readjusted.”

Faced with personalized, diverse data, and with the data silos and business silos inside an enterprise, an infrastructure that can handle massive amounts of data goes a long way toward mining and analyzing the information that matters for business development. It helps the enterprise make data-driven decisions sooner and ship products that fit what users and customers need more quickly.

Driven by business needs, the ByteDance data platform team spent seven years building and iterating on a data platform. The total volume of data it manages passed the EB level several years ago; event-tracking traffic at the daily evening peak exceeds 1 billion TPS; and a single job of more than 100,000 cores needs thousands of machines to run.

Scale like this is rare in the industry. To cope with it, the ByteDance data platform team did not adopt the traditional data middle-platform model. It used a “middle platform + BP system” model instead, so that the middle platform does not drift away from business needs. The BP mechanism is an innovation: similar to HRBP, BPs are managed and allocated centrally and work inside the individual businesses. Compared with a “pure middle-platform system”, the advantage of the data BP system is that it sits closer to the business, avoiding the risk that the middle platform drifts from business needs or reinvents the wheel. Compared with a “pure BU system”, its biggest advantage is high leverage: the platform's capabilities are easy to pass on to the businesses.

While planning the ArchSummit Global Architect Summit, scheduled for March 24-25, 2022 in Beijing, I interviewed Luo Xuan, head of the ByteDance data platform, and asked him to walk through how the platform was built and the technical details behind it. Luo Xuan joined ByteDance in 2014 and built the big data platform from scratch, leading the team to create a full-link set of platform products covering data collection, construction, governance, and application. He advocates data-driven business and, through the data BP model, provides agile support to major business lines such as Toutiao, TikTok, Xigua Video, and Nuverse. He has extensive experience in big data architecture, products, governance, security and privacy, and organizational design. His answers follow.

InfoQ: As head of the ByteDance data platform, could you look back on how the platform was built? What stages of evolution has it gone through, and what was the background to each upgrade?

Luo Xuan: The way ByteDance built its data platform may differ from other companies. All of our construction and evolution logic revolves around how to support the business in an agile, efficient way and promote growth. So throughout the platform's history, the premise behind each optimization is the same question: with the business developing rapidly, what capabilities do we need in order to support and drive sustained growth?

From 2014 to today, it can be roughly divided into the following stages:

  • Primitive stage: Hive + email reports, heavy use of A/B testing (2014)

I joined ByteDance in 2014, so I will only describe what happened from 2014 onward. Before that, only one or two engineers worked on related matters part time, so we were basically starting from scratch. When I joined there was only a single Hive and the most basic reports, covering only DAU, usage duration, and the like, and reports were delivered only by email. It was a very primitive state. Interestingly, though, we were already using A/B testing heavily at that point; it was our earliest relatively mature system. I believe that differs from the order in which most companies build things. At that stage we felt the most important thing was to make the business measurable and to iterate through very fast trial and error.

  • Basic capability building period: self-built products quickly replace commercial products

In 2015-2016 the business grew quickly and needed more reports, more metrics, and more flexible analysis. In 2015 Toutiao's daily active users passed the tens of millions. The growing data volume put higher demands on the engines' processing capacity, and we also began to consider timeliness, interactivity, and so on. At this point we used Spark and Storm for data processing.

By 2017 the volume of business data, with TikTok as the leading example, kept expanding and kept challenging the limits of our capabilities. The problems of growing too fast were obvious. Resources often arrived more slowly than the business grew: we were short of machines, rack space, and even data centers. We frequently optimized every stage of the data pipeline, not only because of cost but, more often, because resources were insufficient and we had no choice. Optimizing for data volume and analysis efficiency became our first breakthrough point. We tried many options, such as Presto, Kylin, Druid, and ClickHouse, and did a great deal of secondary development and deep optimization on top of these open source engines. That investment continues today, and it has given us experience with some engines, such as ClickHouse, that is relatively leading in the industry.

Beyond engine technology, we also began building business-facing data products. These include Finder (Volcano Engine growth analysis), which now also serves external enterprises. That year it replaced the commercial product Amplitude and began to cover all of the company's business lines. We did a calculation at the time: across the whole product line it could save the company hundreds of millions in costs every year, and at today's data volume the figure would be much larger. The same period also produced core platform capabilities such as the data development platform, metadata management, and task dependency scheduling.

The company's range of businesses also began to broaden in this period, with TikTok, Huoshan short video, Xigua, and others, and the need for a shared middle platform began to emerge.

  • Productization and organization forming: innovation in architecture and organization, with continuous upgrades to platform capability

In 2018 and 2019, ByteDance launched new businesses at a noticeably faster pace. As a middle-platform team, how to support this steady stream of increasingly diverse businesses quickly and efficiently became a very important question.

We made some innovations at the organizational level and set up the data BP mechanism. BP stands for Business Partner. Similar to HRBP, the organization is centralized, with unified management and allocation, while execution is distributed across the businesses to solve their problems. The advantage is that although the BP teams support different kinds of business lines above them, underneath they all build on the same platform capabilities. They share a similar skill stack, so learning and using the tools and engines is efficient and smooth.

As the providers of solutions built on the data platform's capabilities, data BP staff report into the data platform organization and are trained and allocated centrally. Learning from one another's experience, they get to know the middle platform's capabilities, combine them flexibly according to each business's characteristics, and deliver comprehensive data solutions. This also keeps work reusable and makes it less likely that wheels are reinvented. In day-to-day work they are embedded in different business lines, sit with their business colleagues, see themselves as part of that business line, and succeed together with the business.

At the data product level, we started paying more and more attention to “productization”: focusing on experience and lowering the barrier to entry rather than providing only basic capabilities, so that a wider range of roles in the company can work in a data-driven way. Our ABI product “Fengshen” was launched in this period and has become a data product that almost everyone at ByteDance uses. The internal saying “A/B is a belief, Fengshen is a habit” also dates from this time.

  • ToB service stage: the “0987” quantitative service standard, creating value for external enterprises

By 2020 we had two major groups of customers. One is ByteDance's own business lines, which we serve with data BP as the interface. The other is external enterprises, for whom we create value as outside customers.

Inside ByteDance, as we supported more and more product lines, we set a more quantitative service standard for the data BP model, called “0987”. The four digits mean: 0 incidents against the core stability SLA indicators, a 90% demand fulfilment rate, data warehouse coverage of 80% of analysis needs, and user satisfaction of 70%. We hold ourselves to this high standard when serving internal businesses. It is also a self-regulating mechanism that keeps us from becoming self-congratulatory and drifting away from business needs and value.
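
As an illustration only, the four “0987” targets can be written down as simple threshold checks. The sketch below is not ByteDance's tooling: the class name, field names, and the choice to express rates as fractions between 0 and 1 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceMetrics:
    """Hypothetical quarterly figures for one supported business line."""
    sla_incidents: int          # incidents against core stability SLA indicators
    demand_fulfilment: float    # share of business requests delivered, 0..1
    warehouse_coverage: float   # share of analysis needs served by the warehouse, 0..1
    user_satisfaction: float    # satisfaction score, 0..1


def check_0987(m: ServiceMetrics) -> dict:
    """Return pass/fail for each digit of the 0-9-8-7 standard."""
    return {
        "0": m.sla_incidents == 0,
        "9": m.demand_fulfilment >= 0.90,
        "8": m.warehouse_coverage >= 0.80,
        "7": m.user_satisfaction >= 0.70,
    }
```

A business line with no incidents, 93% fulfilment, 78% coverage, and 71% satisfaction would pass three of the four checks and fail only the “8”.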

As for external customers, we had actually been exploring the ToB market since 2019. In 2020 ToB was elevated to a company strategy, and the company “Beijing Volcano Engine Technology Co., Ltd.” was incorporated. Volcano Engine is ByteDance's enterprise-level technology service platform. The data platform is an important part of it, the big data segment, and investment continues to grow. We package the products and experience that have worked well in internal service into data suites and offer them through Volcano Engine. So far we have launched two major suites, technology engine and marketing growth, and we have some good benchmark customers. We are also considering whether the data BP solution capability, experience, and methodology can help external customers enjoy the same level of data service as TikTok, and we have begun some experiments in that direction.

InfoQ: As you just said, the platform architecture was not fixed from the start. Continuous architecture upgrades are rarely smooth sailing. Did the ByteDance data platform take any detours as its architecture evolved? Could you give an example?

Luo Xuan: I wouldn't call them detours. Along the path of technical evolution, the core problems to be solved change, and when the problem changes the solution is likely to change too. Anyone who has been through architecture evolution knows that every time scale grows tenfold, many architecture design points have to be adjusted. And because we are changing the wheels on a moving train, we sometimes have to make trade-offs on factors such as ROI. For example, the underlying query engine of Finder, our user behavior analysis product, went through a fairly large change.

At the start of that exploration, when we selected the technology at the end of 2016, we weighed query speed, performance, stability, and other factors and decided Kylin best fit our needs at the time. Its strength is speed: it can respond in milliseconds. But data has to be pre-aggregated, the amount of computation is large, and dimensions and measures have to be defined in advance. We used some techniques that eased these problems for a while. But as the product expanded into retention and conversion analysis, this architecture struggled to deliver interactive response times.

To gain flexibility, we quickly ran some experiments with Spark: keeping the raw data, dictionary encoding, sharding by user ID, tiered caching, and so on. However, given how fast the business was growing, we needed a solution that pushed resources and performance further, and after a series of tests we chose ClickHouse as the base query engine. ClickHouse was far less popular then than it is now, but we judged that its performance optimization for this kind of scenario was excellent and that its lean feature set was of high quality, which made it a very good foundation. While fitting it to real business scenarios we made many deep optimizations and custom modifications. Today we run the largest ClickHouse deployment in China: more than 15,000 nodes in total, more than 600 PB of data under management, and a largest single cluster of more than 2,400 nodes, supporting interactive data analysis for tens of thousands of employees every day.
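
Two of the techniques listed above, dictionary encoding and sharding by user ID, can be sketched in a few lines. This is a generic illustration of the ideas, not the team's Spark or ClickHouse implementation; the names `DictionaryEncoder`, `shard_for`, and `load`, and the use of CRC32 as the hash, are assumptions made for the example.

```python
import zlib
from collections import defaultdict


class DictionaryEncoder:
    """Map repeated strings (e.g. event names) to small integer codes."""

    def __init__(self):
        self._codes = {}    # value -> code
        self._values = []   # code -> value

    def encode(self, value: str) -> int:
        code = self._codes.get(value)
        if code is None:
            code = len(self._values)
            self._codes[value] = code
            self._values.append(value)
        return code

    def decode(self, code: int) -> str:
        return self._values[code]


def shard_for(user_id: str, num_shards: int) -> int:
    """Stable hash so that all rows of one user land on the same shard."""
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


def load(events, num_shards):
    """Store (user_id, event_name) rows as (user_id, code), grouped by shard."""
    encoder = DictionaryEncoder()
    shards = defaultdict(list)
    for user_id, event_name in events:
        row = (user_id, encoder.encode(event_name))
        shards[shard_for(user_id, num_shards)].append(row)
    return encoder, shards
```

Keeping each user's rows on one shard is what lets per-user analyses such as retention or funnels run shard-locally without shuffling data between nodes, and integer codes make the stored rows smaller and cheaper to compare.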

This year we also launched an enterprise edition of ClickHouse called ByteHouse. In addition to upgrades to core capabilities such as a self-developed table engine, extended data types, and hot/cold data separation, its real-time write capability has more than tripled compared with native ClickHouse.

InfoQ: What data scale does this architecture currently support? What challenges does processing at that scale bring, and how do you handle them?

Luo Xuan: The total volume of data managed by the platform passed the EB level several years ago. In terms of real-time traffic, event-tracking traffic at the daily evening peak exceeds 1 billion TPS, and a single job of more than 100,000 cores needs thousands of machines to run. Scale like this is rare in the industry, and it naturally brings challenges in performance, scalability, and real-time behavior. Some of the query engine optimizations mentioned earlier were driven by exactly this. Add in the diversity and complexity of the business, and we ran into many challenges in scheduling large numbers of tasks, operations and maintenance, resource optimization, and data governance.

For example, the average number of data processing tasks we run per day is now in the millions. From a scheduling point of view, dependencies are complex and deeply layered, and to meet timeliness requirements a task must be triggered quickly once its upstream dependencies are ready. With our self-developed distributed scheduling system we achieved second-level scheduling. It also provides a tiered labelling mechanism for tasks. Combined with the SLA sign-off system and several ways of controlling task resources, it allocates resources as sensibly as possible and uses priority weights to protect the SLA fulfilment rate. It can also use a task's history to flag unreasonably configured tasks and suggest configuration optimizations. Without this, operating such a large number of tasks would easily become a disaster.
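
The core idea, releasing a task as soon as its upstream dependencies finish and letting higher-priority tasks go first among those that are ready, can be sketched with a dependency counter and a priority queue. This is a single-process toy, not the distributed scheduler described in the interview; the function name `schedule` and the convention that a lower number means higher priority are assumptions.

```python
import heapq
from collections import defaultdict


def schedule(tasks, deps, priority):
    """Return an execution order.

    tasks:    iterable of task names
    deps:     dict mapping a task to the set of upstream tasks it waits for
    priority: dict mapping a task to an int (lower runs first when ready)
    """
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    downstream = defaultdict(list)
    for task, upstreams in deps.items():
        for up in upstreams:
            downstream[up].append(task)

    # Tasks with no unmet dependencies are ready immediately.
    ready = [(priority[t], t) for t, n in remaining.items() if n == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, task = heapq.heappop(ready)
        order.append(task)
        # Finishing a task may unblock its downstream tasks.
        for child in downstream[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                heapq.heappush(ready, (priority[child], child))

    if len(order) != len(remaining):
        raise ValueError("dependency cycle detected")
    return order
```

With tasks `ods -> dwd_core -> report` and `ods -> dwd_misc`, giving `report` a higher priority than `dwd_misc` makes the report run as soon as `dwd_core` finishes, ahead of the lower-tier task that was already waiting.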

InfoQ: Beyond scale and performance, managing data well is another problem that cannot be avoided, especially for a company like ByteDance with many businesses and ever more types of data. How do you approach it?

Luo Xuan: We usually call it data governance; the meaning is similar. When data volume is large and diversity is high, it really is especially important.

Overall, data governance is a long-term process. Our own practice falls into two stages:

In the first stage, for our main businesses, we set up a data governance committee and, in a democratic-centralist fashion, carried out targeted diagnosis and remediation to produce benchmark results. At the same time, we turned the best practices that emerged into reusable architecture, processes, and products, lowering the barrier to governance and making it replicable.

In the second stage, the middle-platform governance capabilities distilled in the first stage are continuously passed on to newer, innovative businesses, so that businesses can govern their data in a distributed, autonomous way without all depending on one specific team. New requirements come back to us in the process, and we use them to keep polishing the governance products.

This mechanism has been running steadily. It has helped us reach a relatively high standard of data governance and save substantial cost and resources. Because it has been exercised across many different kinds of business, the governance products and methodology generalize well. We lower the barrier as far as possible through productization so that data teams supporting different businesses can govern themselves; you could say we do data governance in a more agile way. By contrast, some companies run it more like a “top-down project” that relies on top-level decisions throughout. Our approach partly reflects company culture, and partly our belief in making data widely accessible: make the product tools good enough and keep the barrier as low as possible.

InfoQ: You have mentioned agility several times. Is it a characteristic of the ByteDance data platform? Where does it show?

Luo Xuan: First of all, ByteDance itself is an agile company. It is a characteristic of the data platform too: what we pursue is supporting business growth with agility and efficiency. It shows in several ways:

  • Organizational agility: compared with the “traditional” middle-platform model, our BP model is an innovation that supports the business more efficiently.
  • Agile consumption: by continuously upgrading and optimizing the technology engines, complex analysis over PB-scale data gets second-level responses, and data becomes available within seconds of being generated, so the business can consume data and read its numbers faster.
  • Agile decision making: this is ByteDance's characteristic A/B testing culture. “When in doubt, A/B it”: objectivity replaces subjectivity, helping the front line decide quickly instead of relying on lengthy layers of sign-off. It is also why our A/B product runs tens of thousands of experiments concurrently every day.
  • Service agility: ByteDance's businesses grow very fast and their models are very diverse, so we must be able to onboard and serve a new business quickly. Services and tool products are deeply integrated, and while keeping satisfaction high we can usually take on a new business and start providing basic capabilities within a week.
  • Agile implementation: this shows in the distributed data governance just described. We want small teams to be able to roll governance out quickly, without spending a lot of time building supporting organizations and systems: low impact on the business, a good fit, and fast.
  • Iterative agility: ByteDance changes very quickly, and the challenge for us is to iterate fast enough to keep up, which makes our overall iteration more agile. Product history shows it too: at the end of 2016 our behavior analysis product was still in iterative technology selection, and by 2017 it could cover internal demand and replace a more mature commercial product.

InfoQ: One form of agility is organizational agility, which sets you apart from other data platforms. Could you go further into the data BP model?

Luo Xuan: The concept of the BP model was covered in the earlier answer. Compared with a “pure middle-platform system”, the advantage of the data BP system is that it sits closer to the business. We sit next to the business to provide service and proactively measure its satisfaction, avoiding the risk that the middle platform drifts from business needs or reinvents the wheel. Compared with a “pure BU system”, the biggest advantage is high leverage: the platform's capabilities are easy to pass on. Data BP staff are not fighting alone; they have a strong team and strong platform product tooling behind them. When a business's growth curve is steep, or when strategic priorities shift, data BP can coordinate resources very quickly. The business support experience that BPs accumulate is also easier to share and consolidate across product lines, and it ultimately feeds back into platform products and methodology.

We pushed for the data BP system for two reasons. First, as the business grew, generic platform products and technical support alone could no longer meet its needs; they had to be combined with business characteristics to provide comprehensive solutions and the ability to implement them. Second, we wanted to combine the strengths and offset the weaknesses of a pure middle-platform organization and a pure business closed loop, pursuing reuse while maximizing organizational efficiency. Judging by the results over the years it has worked well. Problems still arise, but the business teams broadly approve. In a recent survey, the overall NPS across dozens of businesses reached 70, which is a relatively high value both inside the company and in the industry.

InfoQ: Many capabilities have been mentioned. Could you summarize the current architecture of the ByteDance data platform?

Luo Xuan: At a coarse granularity, the data platform has two parts: the platform capability layer and the solution layer.

The platform capability layer is mainly our general product and technology capabilities. It includes:

  • Data engines: LAS, the lakehouse engine used at large scale inside ByteDance, and the OLAP engine ByteHouse. ByteHouse was launched only in August, and its performance and scale are leading in China;
  • Data construction: mainly DataLeap, a one-stop data development and governance platform that integrates data definition, collection, verification, routing, and management;
  • Data applications, which fall into two groups:
    • Products for general analysis: ABI (agile BI product, known internally as Fengshen), Finder (behavior insight analysis product, known internally as TEA), Gaia (a product for building data portals, with which a business can assemble a self-service, modular portal), CDP (user data platform, known internally as Mirror, which holds a wide range of analytical tags), and Tester (A/B experiment platform, known internally as Libra);
    • Insight products for specific business scenarios, such as Hot Treasure (known internally as Pugna; scenario-specific insight for different businesses, such as the TikTok hot list), the management cockpit (used by business management to monitor core indicators), and security and compliance products.

The solution layer is our data BP model. On one hand, the data BP teams rely on the platform capabilities to provide data solutions for different businesses. On the other, they bring back further development requirements from the business, so the platform capabilities are continuously iterated and optimized.

InfoQ: We have covered many technical challenges and developments. Technology and business are closely linked and reinforce each other. From the data side, have you met any extreme challenges while supporting the business? Could you give an example?

Luo Xuan: Of course. Technology ultimately shows its value through the business, and only complex business scenarios bring sufficiently hard technical challenges.

Take one special scenario. During TikTok's 2021 Spring Festival Gala campaign, peak traffic reached several times the daily level. We had to provide a range of real-time metrics. Internally they guided real-time updates to the campaign strategy, such as budget decisions on the red-envelope amounts for later rounds. Externally, real-time battle-report data was sent to the Gala venue and to various media. This placed very high demands on timeliness, stability, metric accuracy, and architectural fault tolerance, yet the whole Spring Festival Gala project had only 27 days, which added further difficulty and pressure.

First, on the traffic collection side, we had a good foundation: all of ByteDance's traffic data is collected and controlled on one unified traffic platform. For the Gala red-envelope project we additionally strengthened disaster tolerance, prepared a three-data-center disaster recovery plan, and supported one-click failover. For peak traffic, we worked with the relevant teams to support server-side rate limiting and to keep clients from retrying. For flexible degradation under different loads, we also supported event sampling and active degradation.
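
One common way to implement event sampling under load is to hash a stable identifier into a bucket and keep only the buckets below a load-dependent threshold, while always keeping events marked as core. The sketch below shows that general technique; it is not the mechanism ByteDance used, and the load levels, sampling rates, and the names `KEEP_PER_10K` and `keep_event` are hypothetical.

```python
import zlib

# Hypothetical policy: how many of every 10,000 hash buckets to keep per load level.
KEEP_PER_10K = {"normal": 10_000, "high": 5_000, "critical": 1_000}
BUCKETS = 10_000


def keep_event(device_id: str, load_level: str, is_core: bool = False) -> bool:
    """Decide whether to keep a tracking event at the current load level."""
    if is_core:
        # Core events (e.g. those feeding key metrics) are never sampled out.
        return True
    # The same device always hashes to the same bucket, so its events are kept
    # or dropped together, which keeps per-device sequences intact.
    bucket = zlib.crc32(device_id.encode("utf-8")) % BUCKETS
    return bucket < KEEP_PER_10K[load_level]
```

Because the thresholds are nested, every device kept at the "critical" level is also kept at "high" and "normal", so tightening or relaxing the policy never reshuffles which devices are in the sample.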

Next, for real-time metrics, we had already built up a relatively mature stack with Flink as the real-time computing engine and ByteHouse, LAS, and others as analysis engines. For the Gala's real-time decision and battle-report needs we used two different technical architectures. One is a compute-centric architecture based on Flink, which computes the final metrics in the stream. The other is a storage-centric architecture based on ByteHouse, which writes detailed rows into the storage layer in real time and aggregates the final metrics at query time. Both architectures also had redundancy across two data centers and two links for disaster recovery.
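
The difference between the two architectures can be shown with a toy in-memory model: one path keeps only running totals as events arrive, the other stores every row and aggregates when asked. The classes below are purely illustrative stand-ins and use neither Flink nor ByteHouse APIs; the class names and the (minute, amount, channel) event shape are assumptions.

```python
class StreamingAggregator:
    """Compute-centric path: update the final metric as each event arrives."""

    def __init__(self):
        self.totals = {}  # minute -> running sum

    def on_event(self, minute: int, amount: int) -> None:
        self.totals[minute] = self.totals.get(minute, 0) + amount

    def query(self, minute: int) -> int:
        # Reading is a lookup; the work was done at write time.
        return self.totals.get(minute, 0)


class DetailStore:
    """Storage-centric path: keep every row, aggregate at query time."""

    def __init__(self):
        self.rows = []  # (minute, amount, channel)

    def on_event(self, minute: int, amount: int, channel: str) -> None:
        self.rows.append((minute, amount, channel))

    def query(self, minute: int, channel=None) -> int:
        # Reading scans the detail rows, so new filters need no new pipeline.
        return sum(
            a for m, a, c in self.rows
            if m == minute and (channel is None or c == channel)
        )
```

Both paths give the same per-minute total, which is what makes them usable as independent cross-checks. The streaming path answers predefined metrics with minimal read cost, while the detail path can also answer questions nobody planned for, such as a breakdown by channel.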

Finally, in the offline scenario we also need strong tiered guarantees and data governance. During business peaks we have to move a large amount of offline resources to online business systems while still ensuring the offline data warehouse produces its output on time. Only then can product teams and analysts review the previous day's campaign in detail and guide the next move. That requires flexible tiered allocation, degradation, and rapid recovery across hundreds of thousands of data tables and millions of data processing tasks. We achieved this, and the capabilities involved have been built into the DataLeap product.

InfoQ: ByteDance has many self-developed data application products. How do you think about self-development for big data infrastructure?

Luo Xuan: In terms of evolution path there are basically three stages: 1. using open source; 2. secondary development on top of open source; 3. self-developed.

At the very beginning the goal is to solve business problems, and the open source community offers many good basic solutions, such as SparkSQL, ClickHouse, and Airflow. We first try to use them directly; that is stage 1. As business complexity grows, we hit bottlenecks in scalability, ease of use, and vertical customization and optimization. At that point we make a technical judgment. If the open source community's core direction and its medium- and long-term direction match our expectations, we take stage 2, as with SparkSQL and ClickHouse. Otherwise we go straight to stage 3, as with the data task scheduling system. For some systems the open source community has no good option, and we go directly to stage 3 from the start, as with the A/B testing system. And when a stage 2 system accumulates too many changes, it sometimes drifts toward stage 3.

Today we are in a mixed state of 2 and 3. Along the way we have contributed some specific changes back to the open source community. We are also considering open sourcing some mature self-developed systems as a whole, to give back to a wider range of developers. Internal discussion is active, so stay tuned.

InfoQ: What are your plans for ToB, and how will they be coordinated with the technical evolution inside ByteDance?

Luo Xuan: The overall idea is that we insist on internal and external unity: one product and technology system serves all businesses inside and outside the company. This has several advantages. First, we eat our own dog food. Products and technology are polished by large internal volumes and diverse scenarios, so what we offer external customers is more mature, and it is the same product and technology that helped ByteDance succeed internally.

Second, it gives internal service a longer-term, broader view and more outside perspectives. For example, at an early stage we consider how much demand the external market has for a technology. If it is only a small, customized scenario, we solve it with a small investment or external procurement. If demand is broad, we invest heavily and aim to lead the industry.

Third, it improves cost efficiency, because resources and experience can be reused. In practice there will be some differences between versions of a product, but these come from different scenarios and stages of development rather than from a distinction between internal and external customers. Examples are the differences in technical form that come with businesses of different sizes, and the trade-off between ease of operation and functional complexity. It is similar to the Pro and Lite editions of many software products.

InfoQ: Finally, which technical directions are you currently watching? What capabilities should future big data developers have?

Luo Xuan: The big data directions I am watching most are real-time processing, intelligence, and security and privacy compliance. Real-time focuses on real-time data warehouses, unified stream and batch processing, and related technologies. Intelligence centers on intelligent materialized views, query optimizers combined with machine learning, augmented analytics, intelligent Q&A, and so on. For privacy compliance, I watch how policy trends drive the evolution of technology and architecture, including sensitive data discovery, multi-party computation, data localization, and permission optimization.

For big data developers thinking about their future, I think the first thing is a solid grounding in computer fundamentals; that is a universal capability. Specific to big data, one characteristic is the large number of open source components. Big data developers should be familiar with what these components do, and learning them is a good process in itself. Another characteristic is that you must find scenarios and environments with large data scale to practice and learn in, because the technology is fundamentally different from the small-data case, where there is little challenge. On that foundation, then pay attention to cutting-edge directions.

Copyright notice
This article was created by [Deep learning and python]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/03/202203011145393623.html