当前位置:网站首页>Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension tables
Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension tables
2022-06-24 12:28:00 【Data stack dtinsight】
One 、 Expand FlinkSQL Realize the integration of flow and dimension table join
Two 、 Why expand FlinkSQL?
1、 Real time computing needs to be completely SQL turn
SQL Is the most widely used language in data processing . It allows users to simply state their business logic . Big data batch computing uses SQL Very common , But support SQL There's not much real-time computing . Actually , use SQL Developing real-time tasks can greatly reduce the threshold of data development , In kangaroo cloud stack - Real time computing module , We decided to achieve complete SQL turn .
The data are calculated by SQL The advantages of
declarative . Users just need to express what I want , As for how to calculate, it's a matter of system , Users don't care .
Auto tuning . The query optimizer can provide SQL Generate the most available execution plan . Users don't need to understand it , You can automatically enjoy the performance improvement brought by the optimizer .
Easy to understand . Many people in different industries and fields understand SQL,SQL The threshold of learning is very low , use SQL As a cross team development language, it can greatly improve efficiency .
Stable .SQL It's a language with decades of history , It's a very stable language , There are few changes . So when we upgrade the engine version , Even replace it with another engine , Can be compatible with 、 Upgrade smoothly .
Reference link :https://blog.csdn.net/weixin_33827965/article/details/86723623
2、 Real time computing also needs the integration of flow and dimension table JOIN
In the world of real-time computing, it's not just streams and streams JOIN, We also need the flow and dimension table JOIN. In the last year , Kangaroo cloud stack V3.0 During version development , The latest version at that time ——flink1.6 in FlinkSQL, Have already put SQL The advantages of applying to Flink In the engine , But it doesn't support flow and dimension table JOIN.
FlinkSQL On 2017 year 7 In March, it began to open streaming computing services to Alibaba group , Although it's a very young product , But to double 11 Thousands of jobs have been supported during this period , In double 11 period ,Blink The peak of job processing has reached 5+ Billion per second , Among them, only Flink SQL The total peak value of job processing is 3 Billion / second .
Reference link :https://yq.aliyun.com/articles/457438
Let's first explain what a dimension table is ; Dimension tables are dynamic tables , The data stored in the table may not change , It's also possible to update regularly , But the update frequency is not very frequent . In business development, general dimension table data is stored in relational database, such as mysql,oracle etc. , It may also be stored in hbase,redis etc. nosql database .
3、 ... and 、FlinkSQL Realize the integration of flow and dimension table join Step by step
1、 use Flink api Realize the function of dimension table
To realize the function of dimension table, we need to use Flink Aysnc I/O This function , Alibaba contributed to Apache Flink Of .
Async I/O Alibaba contributed to the community , On 1.2 Version to introduce , The main purpose is to solve the problem that the network delay becomes the bottleneck of the system when interacting with the external system .
See this article for details :http://wuchong.me/blog/2017/05/17/flink-internals-async-io/
Corresponding to Flink Of api Namely RichAsyncFunction This abstract class , In the implementation of this abstract class, the open( initialization ),asyncInvoke( Data asynchronous call ),close( Some operations stopped ) Method , The main thing is to realize asyncInvoke The method inside .
Flow and dimension table join There are two problems :
1) The first is performance .
Because if the velocity is very fast , Every piece of data needs to go to the dimension table join, But the data of dimension table exists in a third-party storage system , If you have real-time access to a third-party storage system , Not only join The performance will be poor , Every time I go online io; It also puts a lot of pressure on third-party storage systems , It's possible that the third-party storage system will fail .
So the solution is to cache the data in the dimension table , It can be fully cached , This is mainly due to the fact that the dimension table data is not large , The other one is LRU cache , When the dimension table has a large amount of data .
2) The second problem is that the data delayed by the stream is associated with the previous dimension table data .
This involves storing snapshot data for dimension table data , So this scene uses HBase It's more suitable to make dimension table , because HBase It's built to support multiple versions of data .
2、 Analytic flow and dimension table join Of SQL The grammar is transformed into the underlying FlinkAPI
because FlinkSQL Most of it has been done SQL scene , We can't be parsing SQL All the grammar of , Turning him to the bottom FlinkAPI.
So what we're doing is parsing SQL grammar , To find the join Is there a dimension table in the table , If you have a dimension table , Then we'll take this join The statements of the dimension table are separated , use Flink Of TableAPI and StreamAPi Generate new DataStream, I'm putting this DataStream And other tables are doing join So that we can use SQL To realize the integration of flow and dimension table join The grammar .
SQL The tool of analysis is to use Apache calcite,Flink It's also done with this framework SQL Analytic . So all grammars can be parsed .
1)DEMO SQL
insert
into
MyResult
select
d.channel,
d.info
from
( select a.*,b.info
from
MyTable a
join sideTable b
on a.channel=b.name
where a.channel = 'xc2’
and a.pv=10 ) as d 2)Calcite analysis Insert into sentence , Split out sub statements
select a.*,b.info from MyTable a join sideTable b on a.channel=b.name
where a.channel = 'xc2' and a.pv=10select d.channel, d.info from d
insert into MyResult
3) Calcite Continue to parse select sentence
old: select a.*,b.info from MyTable a join sideTable b on a.channel=b.name where a.channel = 'xc2' and a.pv=10
Counting stack It's Yun Yuansheng. — Station Data Center PaaS, We are github and gitee There's an interesting open source project on :FlinkX,FlinkX It's based on Flink Batch flow unified data synchronization tool , It can collect static data , It can also collect real-time changing data , It's global 、 isomerism 、 Batch stream integrated data synchronization engine . If you like, please give us some star!star!star!
github Open source project :https://github.com/DTStack/flinkx
gitee Open source project :https://gitee.com/dtstack_dev_0/flinkx
边栏推荐
- Can Tencent's tendis take the place of redis?
- 怎么可以打新债 开户是安全的吗
- Tencent Youtu, together with Tencent security Tianyu and wechat, jointly launched an infringement protection scheme
- Deep parsing and implementation of redis pub/sub publish subscribe mode message queue
- Basic path test of software test on the function of the previous day
- 【数字IC/FPGA】Booth乘法器
- Programmer: after 5 years in a company with comfortable environment, do you want to continue to cook frogs in warm water or change jobs?
- ahk实现闹钟
- The operation and maintenance boss laughed at me. Don't you know that?
- How to develop mRNA vaccine? 27+ pancreatic cancer antigen and immune subtype analysis to tell you the answer!
猜你喜欢

Install Kali on the U disk and persist it

GTest从入门到入门

Insurance app aging service evaluation analysis 2022 issue 06

How can a shell script (.Sh file) not automatically close or flash back after execution?
![[live review] battle code pioneer phase 7: how third-party application developers contribute to open source](/img/fa/e52bd8a1a404a759ef6ba88e8da0f0.png)
[live review] battle code pioneer phase 7: how third-party application developers contribute to open source
[Architect (Part 41)] installation of server development and connection to redis database

Programmers spend most of their time not writing code, but...

Opencv learning notes - loading and saving images

《梦华录》要大结局了,看超前点映不如先来学学它!

ArrayList # sublist these four holes, you get caught accidentally
随机推荐
The world's largest meat processor has been "blackmailed", how many industries will blackmail virus poison?
Pipeline shared library
11+! Methylation modification patterns based on m6A regulatory factors in colon cancer are characterized by different tumor microenvironment immune spectra
Opencv learning notes - cv:: mat class
Listed JD Logistics: breaking through again
Use go to process millions of requests per minute
炒伦敦金短线稳定赚钱技巧?在哪里炒伦敦金安全靠谱?
I'm in Shenzhen. Where can I open an account? Is it safe to open an account online now?
Programmer: after 5 years in a company with comfortable environment, do you want to continue to cook frogs in warm water or change jobs?
2021-06-02: given the head node of a search binary tree, it will be transformed into an ordered two-way linked list with head and tail connected.
Fbnet/fbnetv2/fbnetv3: Facebook's lightweight network exploration in NAS | lightweight network
Based on am335x development board arm cortex-a8 -- acontis EtherCAT master station development case
RTMP streaming platform easydss video on demand interface search bar development label fuzzy query process introduction
Opencv learning notes -- Separation of color channels and multi-channel mixing
怎么可以打新债 开户是安全的吗
Kubernetes best practice: graceful termination
9+!通过深度学习从结直肠癌的组织学中预测淋巴结状态
Tencent cloud and the ICT Institute launched the preparation of the "cloud native open source white paper" to deeply interpret cloud native
Use the object selection tool to quickly create a selection in Adobe Photoshop
Ten thousand campus developers play AI in a fancy way. It's enough to see this picture!