Methodology and practice of real-time feature computing platform architecture
2022-06-11 15:29:00 [Deep learning and python]
Author | Lu Mian, core PMC member of OpenMLDB, the open-source machine learning database from 4Paradigm
Planning | Liu Yan
In the closed loop of machine learning from development to production, real-time feature computation is an important link: it performs real-time feature processing on the data. Because of its strict timeliness requirements, after a data scientist completes the offline development of a feature script, the engineering team usually has to invest substantial optimization work before it can go online. In addition, because offline development and engineering launch are two separate processes, verifying the consistency of online and offline computation becomes a necessary step, and it costs a great deal of time and manpower. Starting from these two pain points, this article describes the optimization goal of a real-time feature computing system architecture, namely launching immediately after development, and the architecture design principles that serve this goal. Finally, based on the open-source real-time feature computing solution OpenMLDB, it describes the corresponding architecture design and the optimizations made in practice.
Background introduction
Machine learning closed loop
Today, machine learning applications have accumulated a wide range of use cases across industries. In summary, the whole life cycle of machine learning, from development to production, can be described at a high level by the figure below (Figure-1).
Figure-1: Machine learning closed loop
As Figure-1 shows, along the horizontal dimension the machine learning workflow splits into two complementary processes: offline development and online serving. Along the vertical dimension, the carrier of information value is transformed from data, to features, and then to models.
- Data: the raw information, for example transaction records, including amount, time, merchant name, and so on.
- Features: more expressive information computed from the raw data, which helps produce higher-quality models; for example, a customer's average spending over the past three months. The engineering problems of feature computation are the focus of this article.
- Model: the tens of thousands or even hundreds of millions of implicit rules generated from features, describing the essential patterns of the data in a very high-dimensional space, including the ability to make predictions from data.
Today, data and models have been discussed thoroughly and enjoy de facto industry-standard treatment. Features, however, still lack a unified methodology and tooling across the industry. The main reason is that, in the early stage of AI adoption, attention was focused on perception applications based on deep learning, for which the feature engineering process is relatively standard. Today, however, decision-making scenarios (such as risk control and personalized recommendation) are being adopted widely in enterprises.
For decision-making scenarios, the logic of feature engineering is relatively flexible and complex, so standardized methodology and tools have not yet formed in this area. This is the area this article focuses on: by elaborating the design methodology and the architecture design practice, it aims to give a thorough understanding of real-time feature computing systems and their typical usage workflow.
Real-time feature computation
This article focuses on real-time feature computation with very strong timeliness requirements, where the end-to-end latency of a query is typically on the order of tens of milliseconds. The common computation pattern for real-time features is: when an event occurs, a time window is formed by looking backward from the current point in time, and the relevant aggregations are computed within that window.
Figure-2 below shows a typical real-time feature computation scenario in the risk-control domain. It uses three time windows, of ten days, one hour, and five minutes, and performs different aggregations over each window.
Figure-2: Examples of typical real-time feature computation in the risk-control domain
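To make the window pattern concrete, the sketch below expresses a ten-day aggregation using standard SQL window syntax. It is only a minimal illustration: the table t_trx and its columns are hypothetical rather than taken from the Figure-2 scenario, and interval syntax varies slightly across SQL dialects.

```sql
-- Minimal sketch of a windowed real-time feature (hypothetical schema).
-- For each transaction row, aggregate the same user's transactions in the
-- preceding 10 days, up to and including the current row.
SELECT
    user_id,
    trx_time,
    SUM(amount) OVER w AS sum_amount_10d,   -- total spend over the last 10 days
    COUNT(*)    OVER w AS trx_count_10d     -- number of transactions over the last 10 days
FROM t_trx
WINDOW w AS (
    PARTITION BY user_id
    ORDER BY trx_time
    RANGE BETWEEN INTERVAL '10' DAY PRECEDING AND CURRENT ROW
);
```

Online, the same window logic is evaluated for a single incoming event; offline, it is evaluated over every historical row to build training samples. This duality is exactly where the consistency problem discussed below comes from.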
Real-time feature computation is showing its importance in more and more scenarios today. Its essence is to capture the characteristics of the data in the most recent time period and thereby provide strong support for fast decision-making. The rest of this article explains the related design concepts and architecture for real-time feature computation.
Architecture for online-offline computation consistency
Pain point: two development processes plus online-offline consistency verification
Today, without a proper methodology and tool chain, developing and launching a piece of real-time feature computation logic involves three main steps: offline feature script development, online feature code refactoring, and verification that the online and offline computation logic is consistent. The relationship among the three is shown in Figure-3 below.
Figure-3: The full process from development to launch of real-time feature computation, in the absence of appropriate tools
The main tasks and participants of these three steps are listed in Table-1 below, from which several key points can be seen:
- Because data scientists generally work with Python and similar data analysis tools, what they develop usually cannot meet the online requirements of real-time feature computation: performance and operational requirements such as low latency, high throughput, and high availability cannot be satisfied.
- Because data scientists and the engineering team are two teams developing two systems with two tool chains, consistency verification between the two systems becomes essential and very important.
- Experience from a large number of production cases shows that consistency verification involves team communication, requirement alignment, repeated testing and confirmation, and so on; its labor cost is often the highest of the three steps.
Table-1: The main steps from development to launch of real-time feature computation, in the absence of appropriate tools
There are many causes of inconsistency between online and offline computation logic, for example:
- Unequal tool capabilities. Today, Python is the preferred tool of most data scientists; in contrast, engineering teams usually turn to high-performance databases and translate the Python scripts into SQL. The two tools are not equal in expressiveness, so when SQL cannot express what is needed, the computation logic may be compromised, or high-performance languages such as C/C++ are brought in to supplement the missing capabilities.
- Cognitive gaps in requirement communication. Data scientists and the engineering team may understand the definition and processing of data differently. Varo Bank, an online bank in the United States, described an inconsistency they ran into when putting real-time features online without the right tools (see their engineering team's blog post, Feature Store: Challenges and Considerations). In the online environment, the engineering team naturally assumed that "account balance" meant the balance in the real-time account; but for the data scientists, reconstructing a real-time account balance from offline data is actually very complicated, so they used a simpler definition, namely the balance at the end of the previous day. The two sides clearly understood "account balance" differently, which directly caused the online and offline computation logic to diverge (a sketch of the two definitions follows below).
The necessity of online-offline consistency verification, together with its huge labor cost, forces us to re-examine the whole process of feature computation from development to launch. A more rational life-cycle methodology, and an architecture design to match, are needed to efficiently support the rapidly growing number and scale of machine learning applications today.
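To illustrate how such a cognitive gap turns into a computational difference, the sketch below contrasts the two balance definitions in SQL. The table t_account_trx and its columns are hypothetical and are not taken from the Varo Bank post; this is only a minimal illustration of the gap.

```sql
-- Hypothetical schema: t_account_trx(account_id, amount, trx_time),
-- where amount is signed (credits positive, debits negative).

-- Engineering-team definition: balance up to the moment of the request.
SELECT account_id, SUM(amount) AS balance_realtime
FROM t_account_trx
WHERE trx_time <= CURRENT_TIMESTAMP
GROUP BY account_id;

-- Data-science definition: balance as of the end of yesterday.
SELECT account_id, SUM(amount) AS balance_eod_yesterday
FROM t_account_trx
WHERE trx_time < CURRENT_DATE       -- excludes all of today's transactions
GROUP BY account_id;
```

Both queries are reasonable on their own; the inconsistency appears only when one of them is used for training and the other for serving.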
The goal: launch immediately after development
We have come to realize that online-offline consistency verification is the bottleneck of the whole implementation. Ideally, then, if the overall process is to be improved, we want an efficient flow in which development is launch, as shown in Figure-4 below.
In this optimized process, the data scientist's script can be deployed online immediately, without a second round of code refactoring and without additional online-offline consistency verification. If a methodology based on this process can be put into practice, it will greatly improve the overall path of real-time features from development to production, and the labor cost can be cut from roughly 8 person-months to 1 person-month.
Figure-4: The optimization goal of the real-time feature computation development cycle: development is launch
Technical requirements
To achieve the goal of launching immediately after development, while still guaranteeing high-performance real-time computation, the overall architecture must satisfy the following technical requirements:
- Requirement 1: low latency and high concurrency for online real-time feature computation. If we expect, in the optimized process (Figure-4), that the data scientist's script can go online directly, then a series of engineering problems of online computing must be handled very carefully. The main requirement is low-latency, high-concurrency real-time computation; in addition, reliability, scalability, disaster recovery, and operability also need special attention in real enterprise production environments. Clearly, these conditions cannot be met by taking the Python feature script written by a data scientist and putting it online as-is.
- Requirement 2: a unified online-offline programming interface. To reduce the overall cost from development to launch, we expect that, from the external user's perspective, the whole system exposes a single programming interface, rather than two different ones as in Table-1. With a unified programming interface, the script no longer needs to be refactored in order to go online (a brief sketch follows this list).
- Requirement 3: a guarantee of online-offline computation consistency. Our optimization goal is to eliminate the additional, expensive online-offline consistency verification, so how to guarantee the consistency of online and offline computation inside the system becomes a problem that must be solved.
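As a sketch of what a unified interface looks like in practice, the feature definition below is written once and, conceptually, executed in two ways; the schema is hypothetical and the execution modes are described only in the comments, since the concrete commands depend on the platform (OpenMLDB's own workflow is shown later in this article).

```sql
-- One feature definition, written once (hypothetical schema).
-- Offline: a batch engine runs it over months of historical events to
--          generate training samples.
-- Online:  a real-time engine runs the same text for a single incoming
--          request row, reading only that key's recent window data.
SELECT
    user_id,
    COUNT(amount) OVER w AS trx_count_1h   -- number of transactions in the last hour
FROM t_trx
WINDOW w AS (
    PARTITION BY user_id
    ORDER BY trx_time
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
);
```

Because both modes execute the same text, Requirement 3 reduces to making the two engines interpret that text identically, which is exactly what the consistent execution plan generator described below is responsible for.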
Abstract architecture
Figure-5: The abstract architecture of a real-time feature platform that launches immediately after development
To satisfy the three technical requirements described above, we constructed the abstract architecture shown in Figure-5. As the diagram shows, the abstract architecture contains three modules, which together address the technical challenges we face.
The table below lists the functions of each module and the technical requirement it addresses.
Table-2: The core modules and functions of the real-time feature computing platform architecture
Architecture design practice of OpenMLDB
Based on the abstract architecture in Figure-5 analyzed above and the core module functions listed in Table-2, we now introduce the architecture practice of OpenMLDB.
OpenMLDB (https://github.com/4paradigm/OpenMLDB) is an open-source machine learning database that focuses on building efficient solutions for feature computing scenarios.
The architecture of OpenMLDB follows the abstract architecture of Figure-5; the concrete functions are implemented either by optimizing existing open-source software or through in-house development. The resulting architecture is shown in Figure-6 below.
Figure-6: The overall architecture of OpenMLDB
As the architecture diagram in Figure-6 shows, OpenMLDB consists of several key modules, explained as follows:
- SQL(+): OpenMLDB exposes SQL as its unified external interface. Because standard SQL is not optimized for feature-computation operations (for example, operations on time-series windows), OpenMLDB supports additional feature-computation-friendly syntax and functions on top of standard SQL (a hedged example is sketched after this list).
- Consistent execution plan generator: this is the core module that guarantees the consistency of online and offline computation logic. It mainly contains a SQL syntax-tree parser and an LLVM-based execution plan generator. For a given SQL statement, the unified execution plan generator translates it into different execution plans optimized for offline and online execution respectively, while guaranteeing that the two computations remain consistent.
- Distributed batch-processing SQL engine Spark(+): the batch SQL engine used for offline development. OpenMLDB applies source-level secondary optimization to Spark so that the extended SQL syntax for feature computation is supported efficiently. Note that because the batch engine has no data-storage requirements of its own, there is no dedicated storage engine in this path; data is simply read from offline data sources for computation.
- Distributed time-series database: the core real-time computation capability is carried by two core modules, a storage engine and a real-time SQL engine, which together form a distributed high-performance time-series database. The SQL engine is a high-performance kernel written in C++ by the development team. The storage engine keeps the latest window data required for feature computation (that is, the material data in Figure-2). Note that the time-series data here has a data life cycle (TTL, Time-To-Live): if the feature computation logic only needs the last three months of data, older data beyond three months is purged automatically. There are two choices of storage engine:
  - An in-memory storage engine developed in-house (built-in): to optimize latency and throughput for online processing, OpenMLDB adopts a memory-based storage scheme by default, built on a double-layered skip list index structure. This data structure is particularly well suited to quickly locating, under a given key, data sorted by timestamp. The lookup latency of this in-memory index over time-series data can reach the millisecond level [1], and its performance is considerably better than commercial in-memory databases.
  - An external-memory storage engine based on RocksDB: users who are less sensitive to performance but want to reduce memory cost can choose the disk-based storage engine built on RocksDB.
By connecting the core components above, OpenMLDB achieves the final optimization goal of launching immediately after development.
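As a sketch of the extended syntax mentioned in the SQL(+) module above, the query below uses a time-span window frame of the kind OpenMLDB adds for feature computation. The table t_trx is hypothetical, and the exact keyword and frame syntax (shown here as ROWS_RANGE, based on the OpenMLDB documentation) should be confirmed against the version you use.

```sql
-- Sketch of a feature-friendly window extension (hypothetical schema).
-- The frame is defined by a time span (10d) rather than a row count,
-- which matches the "last N days/hours/minutes" windows of Figure-2.
SELECT
    user_id,
    SUM(amount)   OVER w AS sum_amount_10d,
    COUNT(amount) OVER w AS trx_count_10d
FROM t_trx
WINDOW w AS (
    PARTITION BY user_id
    ORDER BY trx_time
    ROWS_RANGE BETWEEN 10d PRECEDING AND CURRENT ROW
);
```

The same statement can be handed to the consistent execution plan generator, which compiles it once for the Spark(+) batch engine and once for the real-time engine backed by the time-series storage.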
Figure-7 below summarizes the overall usage flow of OpenMLDB from offline development to online deployment. Comparing it with the target process of Figure-4, we can see that, for the path from feature development to production, OpenMLDB puts the core idea of "development is launch" into practice well.
Figure-7: OpenMLDB usage flow
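As a rough outline of that flow, the commands below follow the offline-then-deploy pattern described in the OpenMLDB quickstart: develop and validate the feature SQL offline, then deploy the same SQL online as a service. The table name, file paths, and deployment name are hypothetical, and command details may differ between OpenMLDB versions, so treat this as a sketch rather than a copy-paste script.

```sql
-- 1. Offline development: run the feature SQL over historical data to
--    produce training samples (hypothetical table and paths).
SET @@execute_mode = 'offline';
LOAD DATA INFILE '/data/history_trx.csv' INTO TABLE t_trx;
SELECT user_id,
       SUM(amount) OVER w AS sum_amount_10d
FROM t_trx
WINDOW w AS (PARTITION BY user_id ORDER BY trx_time
             ROWS_RANGE BETWEEN 10d PRECEDING AND CURRENT ROW)
INTO OUTFILE '/data/training_features/';

-- 2. Online deployment: the same SQL text is deployed as a low-latency
--    service, with no rewriting and no extra consistency verification.
SET @@execute_mode = 'online';
DEPLOY demo_feature_service
SELECT user_id,
       SUM(amount) OVER w AS sum_amount_10d
FROM t_trx
WINDOW w AS (PARTITION BY user_id ORDER BY trx_time
             ROWS_RANGE BETWEEN 10d PRECEDING AND CURRENT ROW);
```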
For more information about OpenMLDB, see:
- Official website :https://openmldb.ai/
- GitHub: https://github.com/4paradigm/OpenMLDB
- Docs: https://openmldb.ai/docs/zh/
Summary
This article has summarized the engineering challenges of building a real-time feature computing platform and the optimization goal the industry expects, namely going from offline development straight to launch. Based on this goal, it described the methodology and principles of the architecture design, and finally introduced the overall architecture of OpenMLDB, an open-source solution that practices this design methodology toward the optimization goal.
References:
[1] Cheng Chen, Jun Yang, Mian Lu, Taize Wang, Zhao Zheng, Yuqiang Chen, Wenyuan Dai, Bingsheng He, Weng-Fai Wong, Guoan Wu, Yuping Zhao, and Andy Rudoff. Optimizing in-memory database engine for AI-powered on-line decision augmentation using persistent memory. International Conference on Very Large Data Bases (VLDB) 2021.
About the author:
Lu Mian graduated from the computer science department of the Hong Kong University of Science and Technology. He is currently a core PMC member of the OpenMLDB community and works at 4Paradigm as Tech Lead of the database team and the high-performance computing team.