当前位置:网站首页>Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension tables

Data stack technology sharing: open source · data stack - extend flinksql to realize the join of flow and dimension tables

2022-06-24 12:28:00 Data stack dtinsight

One 、 Expand FlinkSQL Realize the integration of flow and dimension table join

Two 、 Why expand FlinkSQL?

1、 Real time computing needs to be completely SQL turn

SQL Is the most widely used language in data processing . It allows users to simply state their business logic . Big data batch computing uses SQL Very common , But support SQL There's not much real-time computing . Actually , use SQL Developing real-time tasks can greatly reduce the threshold of data development , In kangaroo cloud stack - Real time computing module , We decided to achieve complete SQL turn .

The data are calculated by SQL The advantages of

declarative . Users just need to express what I want , As for how to calculate, it's a matter of system , Users don't care .

Auto tuning . The query optimizer can provide SQL Generate the most available execution plan . Users don't need to understand it , You can automatically enjoy the performance improvement brought by the optimizer .

Easy to understand . Many people in different industries and fields understand SQL,SQL The threshold of learning is very low , use SQL As a cross team development language, it can greatly improve efficiency .

Stable .SQL It's a language with decades of history , It's a very stable language , There are few changes . So when we upgrade the engine version , Even replace it with another engine , Can be compatible with 、 Upgrade smoothly .

Reference link :https://blog.csdn.net/weixin_33827965/article/details/86723623

2、 Real time computing also needs the integration of flow and dimension table JOIN

In the world of real-time computing, it's not just streams and streams JOIN, We also need the flow and dimension table JOIN. In the last year , Kangaroo cloud stack V3.0 During version development , The latest version at that time ——flink1.6 in FlinkSQL, Have already put SQL The advantages of applying to Flink In the engine , But it doesn't support flow and dimension table JOIN.

FlinkSQL On 2017 year 7 In March, it began to open streaming computing services to Alibaba group , Although it's a very young product , But to double 11 Thousands of jobs have been supported during this period , In double 11 period ,Blink The peak of job processing has reached 5+ Billion per second , Among them, only Flink SQL The total peak value of job processing is 3 Billion / second .

Reference link :https://yq.aliyun.com/articles/457438

Let's first explain what a dimension table is ; Dimension tables are dynamic tables , The data stored in the table may not change , It's also possible to update regularly , But the update frequency is not very frequent . In business development, general dimension table data is stored in relational database, such as mysql,oracle etc. , It may also be stored in hbase,redis etc. nosql database .

3、 ... and 、FlinkSQL Realize the integration of flow and dimension table join Step by step

1、 use Flink api Realize the function of dimension table

To realize the function of dimension table, we need to use Flink Aysnc I/O This function , Alibaba contributed to Apache Flink Of .

Async I/O Alibaba contributed to the community , On 1.2 Version to introduce , The main purpose is to solve the problem that the network delay becomes the bottleneck of the system when interacting with the external system .

See this article for details :http://wuchong.me/blog/2017/05/17/flink-internals-async-io/

Corresponding to Flink Of api Namely RichAsyncFunction This abstract class , In the implementation of this abstract class, the open( initialization ),asyncInvoke( Data asynchronous call ),close( Some operations stopped ) Method , The main thing is to realize asyncInvoke The method inside .

Flow and dimension table join There are two problems :

1) The first is performance .

Because if the velocity is very fast , Every piece of data needs to go to the dimension table join, But the data of dimension table exists in a third-party storage system , If you have real-time access to a third-party storage system , Not only join The performance will be poor , Every time I go online io; It also puts a lot of pressure on third-party storage systems , It's possible that the third-party storage system will fail .

So the solution is to cache the data in the dimension table , It can be fully cached , This is mainly due to the fact that the dimension table data is not large , The other one is LRU cache , When the dimension table has a large amount of data .

2) The second problem is that the data delayed by the stream is associated with the previous dimension table data .

This involves storing snapshot data for dimension table data , So this scene uses HBase It's more suitable to make dimension table , because HBase It's built to support multiple versions of data .

2、 Analytic flow and dimension table join Of SQL The grammar is transformed into the underlying FlinkAPI

because FlinkSQL Most of it has been done SQL scene , We can't be parsing SQL All the grammar of , Turning him to the bottom FlinkAPI.

So what we're doing is parsing SQL grammar , To find the join Is there a dimension table in the table , If you have a dimension table , Then we'll take this join The statements of the dimension table are separated , use Flink Of TableAPI and StreamAPi Generate new DataStream, I'm putting this DataStream And other tables are doing join So that we can use SQL To realize the integration of flow and dimension table join The grammar .

SQL The tool of analysis is to use Apache calcite,Flink It's also done with this framework SQL Analytic . So all grammars can be parsed .

1)DEMO SQL

insert 
into 
      MyResult 
      select 
            d.channel, 
            d.info 
      from 
             (       select a.*,b.info 
              from 
                       MyTable a 
              join sideTable b 
                       on a.channel=b.name     
              where a.channel = 'xc2’ 
                          and a.pv=10     ) as d 

2)Calcite analysis Insert into sentence , Split out sub statements

select a.*,b.info from MyTable a join sideTable b on a.channel=b.name 
      where a.channel = 'xc2' and a.pv=10
select d.channel, d.info from d

insert into MyResult

3) Calcite Continue to parse select sentence

old: select a.*,b.info from MyTable a join sideTable b on a.channel=b.name 
 where a.channel = 'xc2' and a.pv=10

Counting stack It's Yun Yuansheng. — Station Data Center PaaS, We are github and gitee There's an interesting open source project on :FlinkX,FlinkX It's based on Flink Batch flow unified data synchronization tool , It can collect static data , It can also collect real-time changing data , It's global 、 isomerism 、 Batch stream integrated data synchronization engine . If you like, please give us some star!star!star!

github Open source project :https://github.com/DTStack/flinkx

gitee Open source project :https://gitee.com/dtstack_dev_0/flinkx

原网站

版权声明
本文为[Data stack dtinsight]所创,转载请带上原文链接,感谢
https://yzsam.com/2021/06/20210601192520371v.html

随机推荐