当前位置:网站首页>From real-time computing to streaming data warehouse, where will Flink go next?
From real-time computing to streaming data warehouse, where will Flink go next?
2022-06-11 13:04:00 【Deep learning and python】
The guest | Zhang Jiao
edit | Jia Yaning
Xiaomi from 2019 Introduced in Flink And deal with the requirements related to real-time computing , From the first access version 1.7 Up to date 1.14, Accumulated upgraded and updated 6 It's a big version , At present, it has been connected, including data acquisition 、 Information flow advertising 、 Search recommendations 、 User portrait 、 Of all business lines of the group, including finance 3000+ Mission , Average daily processing 10 One trillion + The news of , And set up... At home and abroad 10+ colony .
that , Xiaomi is introducing Flink What challenges have you encountered after ? how ?Flink Where will it go in the end ? Based on curiosity about the above questions , We found Zhang Jiao, senior development engineer of Xiaomi big data department , He is Apache Flink Contributor, Mainly responsible for Xiaomi Flink Research and development of kernel layer of real-time computing framework .
At the same time, teacher Zhang Jiao has also been online QCon+ Case study club 「Flink Landing practice in real-time computing application scenario 」 Lecturer on the topic , So we are aiming at Flink I interviewed Mr. Zhang Jiao about his doubts and curiosity , Let's take a look at the teacher's thinking .
InfoQ: What kind of work have you been in charge of recently ?
Zhang Jiao : I'm currently in Xiaomi Computing Platform Department , Mainly responsible for the development and maintenance of Xiaomi real-time computing platform Flink Work related to the framework kernel , Including the development of new internal features 、 User support 、Flink Community participation 、 Daily maintenance of the framework, etc . At present, it is mainly aimed at users within the group , Provide data processing capability and support for data developers in all business lines of the company , Better enable the development of business .
We open source in the community Flink On the basis of , According to the characteristics and personalized needs of Xiaomi group , More customized development has been carried out , Some of these problems and functions will be encountered in internal use scenarios , Therefore, it is maintained on the internal version ; There are also common scenarios that companies will encounter and need , We will also feed it back to the community . Of course, we will also share our experience, lessons and technologies in development and maintenance , Achieve common progress .
InfoQ: In your daily work , Have you encountered any impressive challenges ?
Zhang Jiao : In the use of Flink In the process of , In fact, I think there are two kinds of problems that are more prominent :
- Flink The operation and maintenance problems of , Especially the version upgrade problem Community Flink The version upgrade iteration is relatively fast , Basically, a larger version will be released every three to four months , However, due to the internal version, there are many problems that cannot be merged into the community patch, In addition, the compatibility between the current versions is not very good , Include checkpoint Compatibility is not guaranteed across versions 、API Sometimes there will be changes, etc , Lead to Flink Cluster upgrading becomes a very painful thing , The interior even has to be maintained at the same time Flink Multiple versions of . at present , We still did a lot of things , Including interior patch Combing and integrating 、 Automatic detection of upgrade compatibility 、checkpoint Compatible processing development and so on . The community has also noticed this situation , In addition to standardization Flink Out of the release cycle , The old and new versions are also updated in the new version API Compatibility is guaranteed . It cannot be said that this problem has been completely solved by these means , But at least to some extent, it reduces the complexity and cost of upgrading .
- Flink The problem of consumption Actually Flink The introduction to is relatively simple , Even for Flink Less familiar users , By reading documents or looking up materials on the Internet, or reading examples of typical scenarios provided by us , Generally, you can realize your needs in a few days , Especially from Flink 1.9 Began to SQL Our support is becoming more and more perfect , Users can basically rely on pure code without writing any code at all SQL You can complete the requirements , But after running, users may encounter the problem of message backlog , There are many possible reasons , Include Source Slow reading speed 、 Complex message processing logic leads to slow consumption 、 Slow state reading speed affects processing performance 、Sink Slow write speed 、 The message has a short traffic peak 、 There are exceptions and other problems in message processing . There are many solutions , Such as parameter adjustment 、 Increase job parallelism 、 Check the upstream and downstream read-write bottlenecks 、 Optimize message processing logic, etc . Of course , Optimizing processing logic is still difficult , In especial Flink SQL The processing code is automatically generated by the framework , It is still difficult to optimize , Generally, we will provide some common optimizations as parameters to users who need them to configure manually .
InfoQ:Flink In recent years, the integration of flow and batch has been emphasized , What are its current application scenarios in Xiaomi ? Share your practice and exploration .
Zhang Jiao : In fact, there are some differences between batch processing and stream processing , For example, stream processing is event driven , In other words, after the event comes, it can be driven to process , But batch processing generally needs a scheduling system to schedule , Used to control the trigger time and conditions, etc , At present, it is combined with the hummingbird dispatching system inside Xiaomi +Flink+Iceberg Data Lake Technology , We have already made it through last year Flink The whole process of batch flow processing .
In this year , Xiaomi is vigorously promoting the integration of flow and batch , Usage scenarios include new business and direct use of new scenarios Flink Batch flow integrated development , It also includes the old scene, switching its batch scene to Flink in , Realization Flink A framework completes all its computing scenarios . at present , We mainly promote the use of... In the warehouse team , adopt Flink Batch stream fusion capability provided , add Iceberg Batch stream read / write interface , Greatly simplify the data ETL link .
The advantage of flow batch integration is that it can use a set of code to complete business logic , And because the batch flow processing of the same framework uses the same API Solve the problem of business caliber , This not only improves the development efficiency of the business , It also eliminates the problem of data quality caused by inconsistent caliber , For business, it can focus more on the implementation of business rather than the selection of computing engine 、 Quality assurance of comparative data .
InfoQ: How to see Flink The topic has matured in real-time computing , Do you think Flink What iterations will there be , How will it develop ?
Zhang Jiao : For now , After all these years of development ,Flink It has actually become a de facto standard in real-time computing , At present, the existing functions can basically solve the real-time computing requirements of all scenarios . therefore , next step Flink The starting point of may be :
- Force off-line computing field Completely unified computing framework , Even realize that users can completely not distinguish between real-time and offline computing scenarios , Reduce the learning cost of users and the operation and maintenance cost of bottom developers and the company to maintain the two frameworks . Of course , This is not a simple thing , If users and companies do not have significant benefits in doing this , It is difficult to promote from the business level . Besides , The idea of batch stream processing separately has been deeply rooted in the hearts of the people , It is not a simple thing to change the business side's thinking , All this depends on the continuous progress of technology and the proof of time .
- Force OLAP Data query field at present OLAP There are still many query engines , In a state of contention among a hundred schools of thought , The results of real-time calculation need to be collected to the downstream storage system and queried based on the data , If you can directly optimize and query Flink Calculated results of , And reuse at the same time Flink Computing power , So can it provide higher query real-time , And save storage costs ? Of course , There will be more work to be done , For example, query performance optimization , Can we solve the storage problem 、 The interaction between calculation and query, etc , We can see with joy Flink In fact, it is constantly pushing in this direction . in general , I personally think Flink Will not be satisfied with the achievements in the field of real-time computing , There will be more and better functions continuously launched , And promote the continuous development of the whole community .
InfoQ: What do you think of the newly proposed concept of streaming data warehouse ? What application scenarios will it have in the future ?
Zhang Jiao : Streaming data warehouse is mainly to solve the problem of offline and real-time integration in the development of data warehouse , At present, the vast majority of data warehouse development is still in use Lambda framework , That is, real-time data is generated through the real-time link to solve online analysis scenarios with high real-time requirements , The offline link is used to correct the historical data to ensure the correctness and integrity of the data , This not only affects the development efficiency , Increased development costs , At the same time, it also brings the problem of data caliber .
meanwhile , Because real-time links are generally used Kafka Wait for the message queue to be used as intermediate storage , although Kafka The writing efficiency is relatively high , But there is limited storage time , Only append is supported, not update , I won't support it OLAP Inquiry and so on . and Flink By introducing Dynamic table( Dynamic table ) To store hierarchical data generated by real-time links , The intermediate results can be directly stored and analyzed , Without introducing third-party storage and OLAP engine , It really realizes the integration of intermediate storage and computing .
For the moment ,Flink+Iceberg The data Lake solution can be seen as an alternative , Although additional third-party components are introduced Iceberg, But by Iceberg Supported transaction read-write capability and row level update mechanism , It can basically solve most of the scenarios encountered in this aspect .
InfoQ: Last , Yes, I want to understand and apply Flink Say something to your little friend !
Zhang Jiao : I personally think , From shallow to deep is the best way to learn a new technology , Try to use it first and then understand the principle , Know what it is and then know why , Personally, I don't recommend looking at the source code directly , actually Flink The amount of code of the framework source code is still relatively large , If you can't look at it with questions , It's probably impossible to hold on , Even if you stick to it , The effect may not be ideal , In the end, it may be the effect of getting twice the result with half the effort .
Of course, everyone's learning ideas and habits may be different , But as far as I'm concerned , See more Flink Official community documents and FLIP Improvement proposal , Get involved in the community issue Discussion and code review, It's all very helpful . Last , Continuous learning is worth having , I'm sure I won't regret .
Author's brief introduction
Zhang Jiao millet Senior Development Engineer of big data department
Apache Flink Contributor, Now he is the Senior Software Engineer of Xiaomi big data department , Mainly responsible for Xiaomi Flink Research and development of kernel layer of real-time computing framework , Former senior R & D Engineer of JD big data department 、 Senior R & D Engineer of iqiyi business intelligence department , He has many years of experience in big data real-time processing and has accumulated rich working experience .
边栏推荐
- 启封easy QF PDA帮助企业提升ERP的管理水平
- 字节真的是宇宙尽头吗?
- Usage of instr function in Oracle Database
- Add function drop-down multiple selections to display the selected personnel
- TeaTalk·Online 演讲实录 | 圆满完结!安全上云,选对数据迁移策略很重要
- 五年官司终败诉,万亿爬虫大军蠢蠢欲动
- 关于分布式锁的续命问题——基于Redis实现的分布式锁
- 美容院管理系统如何解决门店运营的三大难题?
- openharmony标准系统之app手动签名
- 经营养生理疗馆要注意什么问题?
猜你喜欢

How does the beauty salon management system solve the three problems of store operation?

求你了,不要再在对外接口中使用枚举类型了!

Why are the current membership warehouse stores bursting out collectively?
![[filter] design of time-varying Wiener filter based on MATLAB [including Matlab source code 1870]](/img/1a/7b80f3d81c1f4773194cffa77fdfae.png)
[filter] design of time-varying Wiener filter based on MATLAB [including Matlab source code 1870]

4. Locksupport and thread interruption
![[clearos] install the clearos system](/img/fe/8080c96ea18eb9afd4c4cff2c98b27.jpg)
[clearos] install the clearos system

Security mechanism of verification code in seckill

@Controller和RequestMapping如何解析的

室内场馆现代化的三大要点

Quel projecteur 4K est le meilleur rapport qualité - prix, quand bex3 pro met en évidence 128g Storage 618 vaut la peine de voir
随机推荐
马斯克称自己不喜欢做CEO,更想做技术和设计;吴恩达的《机器学习》课程即将关闭注册|极客头条...
刚高考完有些迷茫不知道做些什么?谈一谈我的看法
[filter] design of time-varying Wiener filter based on MATLAB [including Matlab source code 1870]
为什么现在的会员制仓储店都集体爆发了?
Search without data after paged browsing
How about Lenovo Xiaoxin 520? Which is more worth buying than dangbei D3x?
五年官司终败诉,万亿爬虫大军蠢蠢欲动
#61. Two point answer
.net core 抛异常对性能影响的求证之路
Master-slave replication of MySQL
What are the elements of running a gymnasium?
PADS使用之绘制原理图
Evolution of e-commerce development
Which 4K projector is the most cost-effective? When the Bay X3 Pro highlights the 128G storage 618, it is worth seeing
[arcgis] City relevance analysis
启封easy QF PDA帮助企业提升ERP的管理水平
In the list of 618 projector hedging brands in 2022, dangbei projection ranked top 1 in the hedging rate of idle fish
How to synchronize openstack RDO source to local for offline installation
Venue floor efficiency is so low? The key lies in these two aspects
[background interaction] select to bind the data transferred in the background