
Methodology and practice of real-time feature computing platform architecture


Author | Lu Mian, core PMC member of OpenMLDB, the open-source machine learning database from 4Paradigm

Planning | Liu Yan

In the closed loop of machine learning from development to deployment, real-time feature computation is a key step that performs real-time feature processing on data. Because of its strict timeliness requirements, once a data scientist has finished developing a feature script offline, the engineering team typically has to spend a lot of effort optimizing it before it can go live. On the other hand, because offline development and engineering deployment are two separate processes, verifying the consistency of online and offline computation becomes a necessary step, and it costs a great deal of time and manpower. Starting from these two pain points, this article describes the optimization goal of a real-time feature computation system architecture, namely launching immediately after development, and the architecture design principles that serve this goal. Finally, based on OpenMLDB, an open-source real-time feature computation solution, it describes the corresponding architecture design and optimizations in practice.

Background introduction

Machine learning closed loop

Today, machine learning applications have accumulated a wide range of use cases across virtually every industry. In summary, the whole life cycle of machine learning from development to deployment can be described in general terms by the figure below (Figure-1).

Figure-1: Machine learning closed loop

As Figure-1 shows, along the horizontal dimension the whole machine learning process is divided into two complementary phases: offline development and online serving. Along the vertical dimension, the carrier of information value goes through a transformation from data to features and then to models.

  • Data: raw information, for example transaction records containing the amount, time, merchant name, and so on.
  • Features: more expressive information produced by computation over the raw data, which helps produce higher-quality models, for example the average consumption amount of a customer over the past three months. The engineering problems of feature computation are the focus of this article.
  • Models: tens of thousands, or even hundreds of millions, of implicit rules generated from features; they describe the essential patterns of the data from an ultra-high-dimensional perspective and provide the ability to make predictions from data.

Today, data and models have been discussed thoroughly and have de facto industry-standard treatments. For features, however, the industry has not yet formed a unified methodology or standard tooling. The main reason is that in the early stage of AI adoption, attention was focused on perception applications based on deep learning, for which the feature engineering process is relatively standard. Today, however, decision-making scenarios (such as risk control and personalized recommendation) are being adopted by enterprises in large numbers.

For decision-making scenarios, the processing logic of feature engineering is flexible and complex, so a standardized methodology and standard tools have not yet formed in this area. This is the area this article focuses on: by walking through the design methodology and the architecture design practice, it aims to build a deep understanding of real-time feature computation systems and their typical usage process.

Real-time feature computation

This article mainly focuses on real-time feature computation with very strict timeliness requirements, where the end-to-end latency of a query is generally on the order of tens of milliseconds. The common computation pattern for real-time features is: when an event occurs, a time window is formed by looking back a certain period from the current point in time, and the relevant aggregations are computed over the data in that window.

Figure-2 below lists a typical real-time feature computation scenario in the risk-control domain. It uses three time windows of ten days, one hour, and five minutes, and performs different aggregations over each window.

Figure-2: An example of typical real-time feature computation in the risk-control domain
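To make this computation pattern concrete, the following is a minimal sketch, in standard SQL window syntax, of the kind of features described in Figure-2. All table and column names are hypothetical, and interval-based RANGE frames are not spelled identically in every SQL engine, so treat this as an illustration rather than a reference query.

```sql
-- Minimal sketch: for each incoming transaction, aggregate the user's recent
-- history over three windows anchored at the event time (10 days / 1 hour /
-- 5 minutes), in the spirit of the Figure-2 example. Names are hypothetical.
SELECT
  trans_id,
  SUM(amount) OVER w_10d AS total_amount_10d,  -- total spend in the last 10 days
  AVG(amount) OVER w_10d AS avg_amount_10d,    -- average spend in the last 10 days
  COUNT(*)    OVER w_1h  AS trans_count_1h,    -- number of transactions in the last hour
  MAX(amount) OVER w_5m  AS max_amount_5m      -- largest single amount in the last 5 minutes
FROM transactions
WINDOW
  w_10d AS (PARTITION BY user_id ORDER BY trans_time
            RANGE BETWEEN INTERVAL '10' DAY   PRECEDING AND CURRENT ROW),
  w_1h  AS (PARTITION BY user_id ORDER BY trans_time
            RANGE BETWEEN INTERVAL '1' HOUR   PRECEDING AND CURRENT ROW),
  w_5m  AS (PARTITION BY user_id ORDER BY trans_time
            RANGE BETWEEN INTERVAL '5' MINUTE PRECEDING AND CURRENT ROW);
```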

Real-time feature computation is showing its importance in more and more scenarios today. Its essence is to capture the characteristics of the data in the most recent period of time and provide strong support for fast decision-making. This article targets real-time feature computation and explains the related design ideas and architecture.

Architecture for online-offline computation consistency

Pain point: two development processes plus online-offline consistency verification

Today, without a proper methodology and tool chain, developing and deploying a piece of real-time feature computation logic involves three main steps: offline feature script development, online feature code refactoring, and verification that the online and offline computation logic is consistent. The relationship between the three is shown in Figure-3 below.

Figure-3: The full process from development to deployment for real-time feature computation, in the absence of appropriate tools

Table-1 below lists the main tasks and participants of these three steps. Several key points can be seen from Table-1:

  • Data scientists are generally used to Python and similar data analysis tools, so the feature scripts they develop usually cannot meet the requirements of online real-time feature computation: performance and operational targets such as low latency, high throughput, and high availability cannot be satisfied.
  • Because data scientists and the engineering team are two teams using two tool chains and developing two systems, a consistency check between the two systems becomes essential and very important.
  • According to a large number of production cases, consistency verification involves cross-team communication, requirement alignment, and repeated testing and confirmation; its labor cost is often the highest of the three steps.

Table-1: The main steps from development to deployment for real-time feature computation, in the absence of appropriate tools

There are many reasons why online and offline computation logic can diverge, for example:

  • Unequal tool capabilities. Today, Python is the preferred tool of most data scientists; the engineering team, by contrast, usually tries to translate the Python script onto high-performance databases. The two tools are therefore not equal in expressive power. When SQL cannot express what is needed, the computation logic may be compromised, or high-performance languages such as C/C++ may be used to fill the gap.
  • Cognitive gaps in requirement communication. Data scientists and the engineering team may understand the definition and processing of data differently. Varo Bank, an online bank in the United States, described an inconsistency they ran into when putting real-time features online without the right tools (see their engineering team's blog post Feature Store: Challenges and Considerations). In the online environment, the engineering team naturally assumed that "account balance" means the balance of the real-time account; but for the data scientists, constructing a "real-time account balance" from offline data is actually very complicated, so they used a simpler definition, namely the balance of the account at the end of the previous day. The two sides clearly understood the account balance differently, which directly caused the online and offline computation logic to diverge.

The necessity of verifying online-offline computation consistency, and its huge labor cost, force us to re-examine the whole process of feature computation from development to deployment. A more rational life-cycle methodology, together with a corresponding architecture design, is needed to efficiently support the rapidly growing number and scale of machine learning applications being put into production today.

The goal: launch immediately after development

We have come to realize that online-offline consistency verification is the bottleneck of putting the whole system into production. Ideally, if the overall process is to be improved, we want an efficient process in which development is deployment, as shown in Figure-4 below.

In this optimized process, the data scientist's script can be deployed online immediately, without a second round of code refactoring and without additional online-offline consistency verification. If a methodology based on this process can be put into practice, it will greatly streamline the overall path of real-time features from development to production, and the labor cost can be cut significantly, from roughly 8 person-months in the past to about 1 person-month.

Figure-4: The optimization goal for the real-time feature development cycle: development is deployment

Technical requirements

To achieve the optimization goal of launching immediately after development, while still guaranteeing high-performance real-time computation, the overall architecture must satisfy the following technical requirements:

  • Requirement 1: low latency and high concurrency for online real-time feature computation. If, as in the optimized process (Figure-4), the data scientist's script is to go online directly, then the engineering problems of online computation must be handled very carefully. The main requirement is low-latency, high-concurrency real-time computation; in addition, reliability, scalability, disaster recovery, and operability also need special attention in a real enterprise production environment. Clearly, none of these can be satisfied by directly deploying a feature computation script written by a data scientist in Python.
  • Requirement 2: a unified online-offline programming interface. To reduce the overall development-to-deployment cost, from the external user's perspective the system should expose a single unified programming interface, rather than two different programming interfaces as shown in Table-1. With a unified programming interface, no code refactoring is needed to get a script online.
  • Requirement 3: a guarantee of online-offline computation consistency. Our optimization goal is to eliminate the additional, costly online-offline consistency verification. How to guarantee, inside the system, that online and offline computations are consistent is therefore a problem that has to be solved.

Abstract architecture

Figure-5: The abstract architecture of a real-time feature platform that launches immediately after development

To satisfy the three technical requirements described above, we build the abstract architecture shown in Figure-5. As the diagram shows, this abstract architecture contains three modules that together address the technical challenges we face.

The following table lists each module's functions and the technical requirement it addresses.

Table-2: The core modules and functions of the real-time feature computing platform architecture

Architecture design practice of OpenMLDB

Based on the abstract architecture of Figure-5 analyzed above, and the core module functions listed in Table-2, this section introduces the architecture practice of OpenMLDB.

OpenMLDB (https://github.com/4paradigm/OpenMLDB) is an open-source machine learning database that focuses on building efficient solutions for feature computation scenarios.

The architecture of OpenMLDB follows the abstract architecture of Figure-5 and implements the specific functions either by optimizing existing open-source software or through in-house development. The resulting concrete architecture is shown in Figure-6 below.

Figure-6: The overall architecture of OpenMLDB

As the architecture diagram in Figure-6 shows, OpenMLDB consists of several key modules, explained below:

  • SQL(+): OpenMLDB exposes SQL as its unified external interface. Because standard SQL is not optimized for feature computation operations (for example, operations over time-series windows), OpenMLDB extends standard SQL with syntax and functions that are friendlier to feature computation (see the sketch after this list).
  • Consistent execution plan generator: this is the core module that guarantees online-offline computation consistency. It consists mainly of SQL syntax tree parsing and an LLVM-based execution plan generation module. For a given SQL statement, the unified execution plan generator translates it into different execution plans optimized for online and offline execution respectively, while guaranteeing that the two compute consistent results.
  • Distributed batch-processing SQL engine, Spark(+): the batch SQL engine used for offline development. OpenMLDB performs source-level secondary development on Spark to efficiently support the extended SQL syntax for feature computation. Note that the batch engine has no data storage requirements of its own, so there is no dedicated storage engine on this side; it simply reads data from offline data sources for computation.
  • Distributed time-series database: the core real-time computation capability is carried by two components, a storage engine and a real-time SQL engine, which together form a distributed high-performance time-series database. The SQL engine is a high-performance kernel written in C++ and developed in house. The storage engine holds the most recent window data needed for feature computation (i.e., the material data in Figure-2). Note that the time-series data here has a life-cycle concept (TTL, Time-To-Live): if the feature computation logic only needs the last three months of data, data older than three months is automatically purged. There are two choices of storage engine:
    • Built-in in-memory storage engine: to optimize latency and throughput for online processing, OpenMLDB uses a memory-based storage scheme by default, built on a double-layered skip list index structure. This data structure is particularly suitable for quickly locating a key and then reading the data under that key sorted by timestamp. The in-memory index reaches millisecond-level latency for time-series lookups [1], significantly outperforming commercial in-memory databases.
    • RocksDB-based disk storage engine: users who are less sensitive to performance but want to reduce memory cost can choose the disk storage engine based on RocksDB.

By connecting the core components above, OpenMLDB achieves the final optimization goal: development is deployment.
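To make the SQL(+) extension concrete, here is a minimal sketch of a windowed feature query using the extended window syntax described in the OpenMLDB documentation (ROWS_RANGE with time-unit offsets). The table and column names are hypothetical, and the exact syntax should be checked against the documentation of the OpenMLDB version in use.

```sql
-- Sketch of the feature-friendly window extension: a per-user time window
-- covering the last 10 days, expressed with ROWS_RANGE and a time-unit offset.
-- Table and column names are hypothetical.
SELECT
  user_id,
  SUM(amount)   OVER w_10d AS total_amount_10d,
  COUNT(amount) OVER w_10d AS trans_count_10d
FROM transactions
WINDOW w_10d AS (
  PARTITION BY user_id
  ORDER BY trans_time
  ROWS_RANGE BETWEEN 10d PRECEDING AND CURRENT ROW
);
```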

Figure-7 below summarizes the overall OpenMLDB workflow from offline development to online deployment. Comparing it with the target process of Figure-4, we can see that with OpenMLDB, the path from feature development to deployment puts the core idea of "development is deployment" into practice.

Figure-7: The OpenMLDB workflow
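As a rough sketch of this workflow in the OpenMLDB CLI, assuming the execution-mode switch and the DEPLOY statement as described in the project documentation (the feature query and all names are hypothetical), the same SQL is first run in offline mode during development and then published online as a service:

```sql
-- 1. Offline development: run and iterate on the feature script against
--    offline data.
SET @@execute_mode = 'offline';
SELECT user_id,
       SUM(amount) OVER w_10d AS total_amount_10d
FROM transactions
WINDOW w_10d AS (PARTITION BY user_id ORDER BY trans_time
                 ROWS_RANGE BETWEEN 10d PRECEDING AND CURRENT ROW);

-- 2. Online deployment: publish the same SQL as a low-latency feature service.
--    No code refactoring and no separate consistency verification is needed.
SET @@execute_mode = 'online';
DEPLOY fraud_features
SELECT user_id,
       SUM(amount) OVER w_10d AS total_amount_10d
FROM transactions
WINDOW w_10d AS (PARTITION BY user_id ORDER BY trans_time
                 ROWS_RANGE BETWEEN 10d PRECEDING AND CURRENT ROW);
```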

For more information about OpenMLDB, see:

  • Official website :https://openmldb.ai/
  • GitHub: https://github.com/4paradigm/OpenMLDB
  • Docs: https://openmldb.ai/docs/zh/

Summary

This article summarized the engineering challenges of building a real-time feature computation platform, and the optimization goal that the industry expects for the path from offline development to online deployment. Based on this goal, it described the methodology and principles of the architecture design. Finally, it introduced the overall architecture of OpenMLDB, an open-source solution built as a practice of this design methodology toward the optimization goal.

References:

[1] Cheng Chen, Jun Yang, Mian Lu, Taize Wang, Zhao Zheng, Yuqiang Chen, Wenyuan Dai, Bingsheng He, Weng-Fai Wong, Guoan Wu, Yuping Zhao, and Andy Rudoff. Optimizing in-memory database engine for AI-powered on-line decision augmentation using persistent memory. International Conference on Very Large Data Bases (VLDB) 2021.

About the author:

Lu Mian graduated from the computer science department of the Hong Kong University of Science and Technology. He is a core PMC member of the OpenMLDB community and works at 4Paradigm as Tech Lead of the database team and the high-performance computing team.
