当前位置：网站首页>[collect first and use it sooner or later] 100 Flink high-frequency interview questions series (I)

[collect first and use it sooner or later] 100 Flink high-frequency interview questions series (I)

2022-06-11 17:51:00 【Big data Institute】

【 Collect first , Sooner or later 】100 individual Flink High frequency interview question series （ One ）

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions
We are committed to building the most comprehensive big data interview topic database in the whole network

1、Flink How to do stress testing and monitoring ？

Refer to the answer ： The pressure we usually encounter comes from the following aspects ：
（1） If the speed of data stream generation is too fast , If the downstream operators can't consume it , It will create back pressure . Back pressure monitoring can be used Flink Web UI(localhost:8081) To visualize monitoring , Once you call the police, you know . In general, the back pressure problem may be caused by sink This The operator is not optimized well , Just do some optimization . For example, if you write ElasticSearch, It can be changed to batch write , It can be adjusted Big ElasticSearch Queue size and so on .
（2） Set up watermark The maximum delay time of this parameter , If the setting is too large , It may cause Memory pressure . You can set the maximum delay time to be smaller , Then send the late element to the side output stream . Update the results later . perhaps Use similar to RocksDB This state back end , RocksDB It will open up extra heap storage space , but IO It's going to slow down , Trade off required .
（3） And if the length of the sliding window is too long , And if the sliding distance is very short ,Flink Performance of It's going to go down a lot . We mainly By time slicing , Store each element in only one “ Overlapping windows ”, In this way, the writing of state in window processing can be reduced .

2、 How do you reasonably evaluate Flink The parallelism of the task ？

Refer to the answer ：Flink The task parallelism is reasonable. Generally, the pressure measurement is evaluated according to the peak flow , And leave a certain amount of space according to the cluster load buffer resources .

If the data source already exists , Then you can directly consume and test
If the data source does not exist , It is necessary to make pressure measurement data for testing

For one Flink In terms of tasks , In general, you can set the parallelism in fine granularity in the following ways ：

source Parallelism configuration ： With kafka For example ,source The parallelism of is generally set to kafka Corresponding topic The number of partitions
transform（ such as flatmap、map、filter Equal operator ） Configuration of parallelism ： These operators usually don't do too heavy operations , Parallelism can be compared with source bring into correspondence with , So that operators can do forward To transmit data , No network transmission
keyby The subsequent processing operator ： It is suggested that the maximum parallelism is an integer multiple of the operator parallelism , In this way, we can make keyGroup It's the same , This makes the data relatively uniform shuffle To the downstream operator , The picture below shows shuffle Strategy
sink Configuration of parallelism ：sink Is where data flows downstream , According to sink The amount of data and the pressure resistance of downstream services are evaluated . If sink yes kafka, Can be set as kafka Corresponding topic The number of partitions . Be careful sink The degree of parallelism is better than kafka partition In multiples , Otherwise, it may appear, such as kafka partition Uneven data . But most of the time sink Operator parallelism does not need to be specially set , It only needs to be the same as the parallelism of the whole task .

3、Flink How to guarantee Exactly-once Semantic ？

Refer to the answer ：Flink The end-to-end consistency semantics can be achieved by implementing two-phase commit and state saving . It is divided into the following steps ：

Start business （beginTransaction）： Create a temporary folder , To write data into this folder
Pre submission （preCommit）： Write cached data in memory to a file and close
Formal submission （commit）： Put the previously written temporary files in the target directory . This means there will be some delay in the final data
discarded （abort）： Discard temporary files

If the failure occurs after the pre submission is successful , Before formal submission . You can submit pre submitted data based on the status , You can also delete pre submitted data .

4、Flink How to deal with late data

Refer to the answer ：Flink in WaterMark and Window The mechanism solves the disorder problem of streaming data , For data that is out of order due to delay , According to eventTime Conduct business processing , For delayed data Flink There are solutions of their own , The main way is to give a delay time , The processing delay data can still be accepted in this time range ：

Set the time allowed to delay It's through allowedLateness(lateness: Time) Set up
Save delay data It's through sideOutputLateData(outputTag: OutputTag[T]) preservation
Get delay data It's through DataStream.getSideOutput(tag: OutputTag[X]) obtain

5、Flink Do you understand the restart strategy of ？

Refer to the answer ：Flink Support different restart strategies , These restart strategies control job How to restart after failure ：

1、 Fixed delay restart strategy

The fixed delay restart policy will try to restart a given number of times Job, If the maximum number of reboots is exceeded ,Job It will eventually fail . Between two consecutive restart attempts , The restart policy will wait for a fixed time .

2、 Failure rate restart strategy

Failure rate restart strategy in Job If it fails, it will restart , But beyond the failure rate ,Job Will eventually be judged as a failure . Between two consecutive restart attempts , The restart policy will wait for a fixed time .

3、 No restart policy

Job Direct failure , No attempt to restart .

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions
We are committed to building the most comprehensive big data interview topic database in the whole network

原网站

版权声明
本文为[Big data Institute]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/162/202206111740273893.html

当前位置：网站首页>[collect first and use it sooner or later] 100 Flink high-frequency interview questions series (I)

[collect first and use it sooner or later] 100 Flink high-frequency interview questions series (I)

【 Collect first , Sooner or later 】100 individual Flink High frequency interview question series （ One ）

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions
We are committed to building the most comprehensive big data interview topic database in the whole network

1、Flink How to do stress testing and monitoring ？

2、 How do you reasonably evaluate Flink The parallelism of the task ？

3、Flink How to guarantee Exactly-once Semantic ？

4、Flink How to deal with late data

5、Flink Do you understand the restart strategy of ？

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions
We are committed to building the most comprehensive big data interview topic database in the whole network

边栏推荐

猜你喜欢

随机推荐

当前位置：网站首页>[collect first and use it sooner or later] 100 Flink high-frequency interview questions series (I)

[collect first and use it sooner or later] 100 Flink high-frequency interview questions series (I)

【 Collect first , Sooner or later 】100 individual Flink High frequency interview question series （ One ）

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions We are committed to building the most comprehensive big data interview topic database in the whole network

1、Flink How to do stress testing and monitoring ？

2、 How do you reasonably evaluate Flink The parallelism of the task ？

3、Flink How to guarantee Exactly-once Semantic ？

4、Flink How to deal with late data

5、Flink Do you understand the restart strategy of ？

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions We are committed to building the most comprehensive big data interview topic database in the whole network

边栏推荐

猜你喜欢

随机推荐

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions
We are committed to building the most comprehensive big data interview topic database in the whole network

Continuous sharing is useful 、 valuable 、 Selected high-quality big data interview questions
We are committed to building the most comprehensive big data interview topic database in the whole network