当前位置:网站首页>When we are doing flow batch integration, what are we doing?
When we are doing flow batch integration, what are we doing?
2022-07-03 13:10:00 【Big data sheep said】
1. Preface
This article mainly shares the background of the integration of streaming and batching that bloggers understand at present , The problem I want to solve , And the ideas that may be realized in the future , And introduce it with several cases . throw away a brick in order to get a gem , Let us not only stay in the integration of flow and batch , But to think more deeply about the reasons behind .
2. background
Before introducing the integration of flow and batch , First, let's take a look at the engines commonly used in the field of flow and batch :
Batch task : Commonly used Hive、Spark.
Flow task : Commonly used Flink.Spark Streaming And Storm Currently, the usage rate in the streaming scenario will be less than Flink.
3. What problem led to the concept of integrating flow and batch ?
A premise : In the production scenario , When indicators of the same caliber use flow tasks to produce real-time data , Offline data is produced with batch tasks , Will consider whether it is necessary to integrate flow and batch . If an indicator only needs to output offline , What about the integration of flow and batch ?
An angle : Bloggers think , The integration of flow and batch should be considered from the perspective of flow , To put the results of the flow task in the batch field ( Or in the form of batch data ) Reuse , Not just on the side of the engine ,API Interface level unification . This thinking is similar to Ali in the figure below (From FFA 2020) The point of view of the said problem is similar , Bloggers understand that real-time reuse in the offline field may be an abstraction of the problems listed by Alibaba . Because if it can be reused , The three problems in the figure below do not exist !
Problem solved : On the basis of the above premise and thinking angle , Bloggers think , At present, the most important thing to be solved in the integration of flow and batch is to solve the quality problem of flow task output data , This is also the premise that stream data can be reused in batch scenarios . Used to Flink Students who do real-time data development should have encountered Flink When producing data , There will always be some exceptions ( For example, the use of windows may lead to loss of numbers ) Lead to and offline Hive、Spark There are some slight differences in the output data , In this way, it is impossible to reuse real-time data in the offline field . Bloggers understand , The key to the integration of flow and batch is to solve this problem , Others are in resource conservation 、 The advantages in improving human efficiency are based on the added value .
4. So what is the cause of data quality problems in flow tasks , What are the common scenarios ?
Bloggers think , At present, the most important reason is the data quality problem caused by data disorder .
There are two common scenarios in the real-time field :
The first is Flink The scene of task opening window . give an example , One opened TUMBLE WINDOW Of Flink Mission , Encounter serious data disorder ( The maximum disorder of user configuration 、 Parameters such as allowable delay cannot be solved ), Then the task will throw away the data , This scenario will lead to differences between real-time data and offline data .
The second is the scenario of real-time dimension table Association . If the data of the fact table comes first , The data in the dimension table cannot be associated . Thus, it is different from offline .
Of course, there are other scenes , Here is not a list .
5. Want to solve the above data quality problems , What are the feasible ideas ?
- Idealized thinking : With TUMBLE WINDOW For example , TUMBLE WINDOW Our original intention is to produce constant results ( namely append flow ), Therefore, data with great delay cannot be processed , Then we can TUMBLE WINDOW Use GROUP AGG(retract flow 、 Or called CDC Pattern ) Replace to calculate . When there is late data ,GROUP AGG It will handle normally and withdraw the last result , Issue the new results of recalculation . But the problem with this method is if we want to use CDC Mode to run tasks , We need the whole link to CDC The mode to run , Including computing engines 、 Message queue 、OLAP Engine, etc , But also to protect Exactly-once.( But when it comes to CDC Did you think of the data Lake ? This may also be a follow-up development direction ). Then take Ali (From FFA 2020) One minute mentioned \ Examples of hourly cumulative indicators , Let's see how Ali does it . Alibaba actually uses GROUP AGG Do the calculations ( But I don't know whether to use the following link CDC The way it works ).
minute / Hourly cumulative indicator
- Ali's idea (From FFA 2020): As shown in the figure below , Scenario 1 is if the input source of stream batch integration is different , Batch task scheduling correction results are required , Scenario 2 is if the flow batch results are the same , Don't run the batch task . In the first case, there is nothing to say ; But in the second case , Here is a brief analysis of : We know that the premise of verifying the same flow batch results is , Run a batch of tasks and produce results. Take the initiative to compare the results with those of flow tasks , But in scenario 2, the batch task is actually not running !!! So what can be thought of here is the need to be in advance 、 In the matter 、 After the event, a lot of monitoring is done to ensure that the overall process of flow task output has no problems , So as to ensure that we can achieve and Expected batch tasks The results are the same .
Comparison of new and old R & D modes
summary : The first idea above is relatively idealized , Basically, we think from the perspective that the data produced by flow tasks can be reused in batch mode , Leaving aside the batch tasks, the implementation of this process . The second kind of Ali FFA 2020 In comparison, the link software and hardware conditions are not so high , Bloggers think it is more feasible .
6. summary
This paper mainly introduces the following three parts :
The birth of flow batch integration is to solve the problem that the same indicator is offline 、 The difference of real-time task output data ( Data quality )
The root cause of data differences is data disorder
If you want to solve this problem , Idealization is full link CDC, For more operational ideas, please refer to Alibaba FFA 2020
Please pay attention to what you like + give the thumbs-up + Look again .
Previous recommendation
[
flink sql Know why ( 6、 ... and )| flink sql Appointment calcite( Just read this one )
](http://mp.weixin.qq.com/s?__biz=MzkxNjA1MzM5OQ==&mid=2247489112&idx=1&sn=21e86dab0e20da211c28cd0963b75ee2&chksm=c1549aa0f62313b6674833cd376b2a694752a154a63532ec9446c9c3013ef97f2d57b4e2eb64&scene=21#wechat_redirect)
[
flink sql Know why ( 5、 ... and )| Customize protobuf format
](http://mp.weixin.qq.com/s?__biz=MzkxNjA1MzM5OQ==&mid=2247488994&idx=1&sn=20236350b1c8cfc4ec5055687b35603d&chksm=c154991af623100c46c0ed224a8264be08235ab30c9f191df7400e69a8ee873a3b74859fb0b7&scene=21#wechat_redirect)
[
flink sql Know why ( Four )| sql api Type system
](http://mp.weixin.qq.com/s?__biz=MzkxNjA1MzM5OQ==&mid=2247488788&idx=1&sn=0127fd4037788762a0401313b43b0ea5&chksm=c15499ecf62310fa747c530f722e631570a1b0469af2a693e9f48d3a660aa2c15e610653fe8c&scene=21#wechat_redirect)
[
flink sql Know why ( 3、 ... and )| Customize redis Data summary ( Source code attached )
](http://mp.weixin.qq.com/s?__biz=MzkxNjA1MzM5OQ==&mid=2247488720&idx=1&sn=5695e3691b55a7e40814d0e455dbe92a&chksm=c1549828f623113e9959a382f98dc9033997dd4bdcb127f9fb2fbea046545b527233d4c3510e&scene=21#wechat_redirect)
[
flink sql Know why ( Two )| Customize redis Data dimension table ( Source code attached )
](http://mp.weixin.qq.com/s?__biz=MzkxNjA1MzM5OQ==&mid=2247488635&idx=1&sn=41817a078ef456fb036e94072b2383ff&chksm=c1549883f623119559c47047c6d2a9540531e0e6f0b58b155ef9da17e37e32a9c486fe50f8e3&scene=21#wechat_redirect)
[
flink sql Know why ( One )| source\sink principle
](http://mp.weixin.qq.com/s?__biz=MzkxNjA1MzM5OQ==&mid=2247488486&idx=1&sn=b9bdb56e44631145c8cc6354a093e7c0&chksm=c1549f1ef623160834e3c5661c155ec421699fc18c57f2c63ba14d33bab1d37c5930fdce016b&scene=21#wechat_redirect)
边栏推荐
- 35道MySQL面试必问题图解,这样也太好理解了吧
- Brief introduction to mvcc
- 【数据库原理及应用教程(第4版|微课版)陈志泊】【第三章习题】
- Detailed explanation of the most complete constraintlayout in history
- Sword finger offer 15 Number of 1 in binary
- C graphical tutorial (Fourth Edition)_ Chapter 13 entrustment: delegatesamplep245
- 解决 System has not been booted with systemd as init system (PID 1). Can‘t operate.
- 2022-02-11 heap sorting and recursion
- SSH登录服务器发送提醒
- 【习题六】【数据库原理】
猜你喜欢
Integer case study of packaging
Finite State Machine FSM
2022-02-09 survey of incluxdb cluster
解决 System has not been booted with systemd as init system (PID 1). Can‘t operate.
[Database Principle and Application Tutorial (4th Edition | wechat Edition) Chen Zhibo] [Chapter 6 exercises]
Huffman coding experiment report
【数据库原理复习题】
2022-02-11 heap sorting and recursion
sitesCMS v3.1.0发布,上线微信小程序
【R】 [density clustering, hierarchical clustering, expectation maximization clustering]
随机推荐
2022-02-09 survey of incluxdb cluster
Xctf mobile--app3 problem solving
C graphical tutorial (Fourth Edition)_ Chapter 15 interface: interfacesamplep271
[judgment question] [short answer question] [Database Principle]
Analysis of the influence of voltage loop on PFC system performance
Logback 日志框架
【数据库原理及应用教程(第4版|微课版)陈志泊】【SQLServer2012综合练习】
C graphical tutorial (Fourth Edition)_ Chapter 20 asynchronous programming: examples - cases without asynchronous
Sitescms v3.0.2 release, upgrade jfinal and other dependencies
【习题五】【数据库原理】
Sword finger offer 17 Print from 1 to the maximum n digits
luoguP3694邦邦的大合唱站队
[exercice 7] [principe de la base de données]
【数据库原理及应用教程(第4版|微课版)陈志泊】【第五章习题】
Mysqlbetween implementation selects the data range between two values
Fabric. JS three methods of changing pictures (including changing pictures in the group and caching)
My creation anniversary: the fifth anniversary
2022-02-10 introduction to the design of incluxdb storage engine TSM
How to stand out quickly when you are new to the workplace?
基于同步坐标变换的谐波电流检测