
[Collect Now, Use Sooner or Later] 49 High-Frequency Flink Interview Questions (Part I)

2022-06-11 17:51:00 Big data Institute


Continuously sharing useful, valuable, hand-picked big data interview questions.
We are committed to building the most comprehensive big data interview question bank on the web.

I. Have you run into backpressure while developing Flink jobs? How did you troubleshoot it?

1. Causes of backpressure

Backpressure typically shows up during big promotions or other popular events. In those scenarios a sudden traffic spike causes data to pile up faster than downstream operators can process it, and the overall throughput of the job cannot keep up.

2. How to monitor backpressure

Backpressure can be detected through the Flink Web UI.

To measure it, Flink samples each task thread on the TaskManager every 50 ms, 100 samples per measurement. The results are reported to the JobManager, which computes the backpressure ratio (the fraction of samples in which the task was blocked) and displays it.

The ratio maps to a backpressure level as follows:

Level   Ratio range            Meaning
OK      0    <= Ratio <= 0.10  normal
LOW     0.10 <  Ratio <= 0.5   moderate backpressure
HIGH    0.5  <  Ratio <= 1     severe backpressure
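As a minimal sketch of that mapping (the class and method names here are my own, not Flink's):

```java
public class BackpressureLevel {
    // Classify a sampled backpressure ratio (blocked samples / total samples)
    // using the thresholds from the table above.
    public static String classify(double ratio) {
        if (ratio <= 0.10) return "OK";    // normal
        if (ratio <= 0.5)  return "LOW";   // moderate backpressure
        return "HIGH";                     // severe backpressure
    }
}
```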

3. Locating and handling backpressure

When backpressure occurs, the problem can be located from the following three angles.

a. Data skew

In the Flink web UI you can inspect the volume of data processed by each subtask. Under data skew, one or a few subtasks clearly process far more data than the others. This usually happens when aggregation operators such as KeyBy are used without accounting for possible hot keys. The fix is for the user to preprocess the hot keys that cause the skew.

b. Garbage collection (GC)

Misconfigured TaskManager garbage-collection parameters can cause severe GC problems. Enable GC logging with the -XX:+PrintGCDetails JVM flag and inspect the GC logs.

c. The code itself

A user who does not understand an operator's implementation may misuse it and cause performance problems. Check CPU and memory usage on the machines running the job to narrow down the offending node.

II. How do you handle data skew in a production environment?

1. Data skew mainly has two causes

On the business side, there are severe data hotspots: for example, on a real-estate website, page views of second-hand homes in first-tier cities such as Beijing and Shanghai far exceed those of other regions. On the technical side, heavy use of KeyBy, GroupBy, and similar operations without any special handling of the grouping keys leads to hot-key problems.

2. Solutions

On the business side, avoid hot keys by design: for example, split hot cities such as Shanghai and Beijing from non-hot cities into different partitions and process them separately. On the technical side, when a hotspot appears, adjust the plan and scatter (salt) the original keys instead of aggregating them directly. Flink itself also offers techniques to mitigate data skew.

3. Flink data-skew scenarios and solutions

(1) Two-stage aggregation to fix KeyBy hotspots

a. Scatter the grouping key, e.g. by appending a random suffix

b. Aggregate on the scattered key

c. Strip the suffix to restore the original key

d. Run a second KeyBy on the original key to compute the final result and emit it downstream

The code looks like this (Record, RecordCount, CountAggregate, and CountProcessFunction are helper classes of the job):

DataStream<String> sourceStream = ...;
DataStream<RecordCount> resultStream = sourceStream
    .map(line -> {
        // Scatter the hot key by appending a random suffix
        Record record = JSON.parseObject(line, Record.class);
        record.setType(record.getType() + "_" + new Random().nextInt(100));
        return record;
    })
    // First aggregation, keyed by the scattered key
    .keyBy(Record::getType)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new CountAggregate())
    .map(count -> {
        // Strip the random suffix to restore the original key
        String key = count.getKey().substring(0, count.getKey().indexOf("_"));
        return new RecordCount(key, count.getCount());
    })
    // Second aggregation, keyed by the restored original key
    .keyBy(RecordCount::getKey)
    .process(new CountProcessFunction());
resultStream.addSink(...);
env.execute(...);
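The salting-and-merging round trip can be simulated without a cluster. The sketch below (plain Java; the class and method names are hypothetical, not Flink API) shows that per-key counts survive the scatter/merge:

```java
import java.util.*;

public class TwoStageAgg {
    // Stage 1: count by salted key, i.e. key + "_" + random bucket,
    // so one hot key is spread across several parallel groups.
    public static Map<String, Long> stageOne(List<String> events, int buckets) {
        Map<String, Long> counts = new HashMap<>();
        Random rnd = new Random(42); // seeded only for reproducibility
        for (String key : events) {
            String salted = key + "_" + rnd.nextInt(buckets);
            counts.merge(salted, 1L, Long::sum);
        }
        return counts;
    }

    // Stage 2: strip the salt and merge the partial counts per original key.
    public static Map<String, Long> stageTwo(Map<String, Long> partial) {
        Map<String, Long> merged = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String original = e.getKey().substring(0, e.getKey().lastIndexOf('_'));
            merged.merge(original, e.getValue(), Long::sum);
        }
        return merged;
    }
}
```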

(2) GroupBy + aggregation hotspots

With Flink SQL, nest the query in two layers: the inner layer scatters the data randomly into a number of buckets (e.g. 100) to dilute the hotspot (the bucket count can be chosen to fit the business), and the outer layer merges the partial aggregates.

select date,
       city_id,
       sum(pv) as pv
from (
    select date,
           city_id,
           floor(rand() * 100) as bucket,  -- scatter randomly into 100 buckets
           sum(count) as pv
    from kafka_table
    group by date,
             city_id,
             floor(rand() * 100)
) tmp
group by date,
         city_id;

(3) Skew caused by a mismatch between Flink consumer parallelism and the number of Kafka partitions

When consuming Kafka, the recommendation is to make the consumer parallelism equal to the number of Kafka partitions or an integer multiple of it, i.e. Flink consumer parallelism = Kafka partition count * n (n = 1, 2, 3, ...).
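To see why a mismatch causes skew: Kafka partitions are handed to consumer subtasks in a roughly round-robin (modulo) fashion, so when the parallelism does not divide the partition count evenly, some subtasks read one more partition than others. The sketch below is a simplification of that assignment, not Flink's exact algorithm:

```java
public class PartitionSpread {
    // Round-robin assignment: subtask i receives every partition p with
    // p % parallelism == i. Uneven division means uneven load (skew).
    public static int[] partitionsPerSubtask(int partitions, int parallelism) {
        int[] counts = new int[parallelism];
        for (int p = 0; p < partitions; p++) {
            counts[p % parallelism]++;
        }
        return counts;
    }
}
```

For example, 6 partitions over 4 subtasks gives two subtasks double the load of the other two, while 8 over 4 is perfectly balanced.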

III. Can a single Flink job contain both an event-time window and a processing-time window?

Conclusion: yes, a single Flink job can contain an event-time window and a processing-time window at the same time.

Some readers then ask: why do we usually set a Flink job to either event-time semantics or processing-time semantics?

Indeed, in production our Flink jobs rarely contain windows with both time semantics at once.

So how do we explain the conclusion above? From two angles:

1. We do not actually need to bind a Flink job to one particular time semantic. For an event-time window we only need to supply a watermark and keep it advancing, and the window will keep triggering. Processing time is even simpler: the window operator just fires on a fixed interval of local time. Either way, a time window fires as long as its own trigger condition is met.

2. Flink's implementation supports it as well. Flink manages timers with a component called TimerService; we can register event-time and processing-time timers at the same time, and Flink decides on its own whether each timer's trigger condition is met, calling back the window function when it is.

Example requirement: the data source is a user heartbeat log (uid, time, type). Compute DAU for Android and iOS separately, emitting at minute granularity the count accumulated from midnight of the current day up to the current minute.
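The dual-timer idea can be sketched with a tiny stand-alone registry. This is plain Java with no Flink dependency and only illustrates the behavior described above; Flink's real TimerService is far richer, so treat the class and method names as hypothetical:

```java
import java.util.*;

public class SimpleTimerService {
    private final TreeSet<Long> eventTimeTimers = new TreeSet<>();
    private final TreeSet<Long> processingTimeTimers = new TreeSet<>();

    public void registerEventTimeTimer(long ts)      { eventTimeTimers.add(ts); }
    public void registerProcessingTimeTimer(long ts) { processingTimeTimers.add(ts); }

    // Advancing the watermark fires every event-time timer at or before it.
    public List<Long> advanceWatermark(long watermark) {
        List<Long> fired = new ArrayList<>();
        while (!eventTimeTimers.isEmpty() && eventTimeTimers.first() <= watermark) {
            fired.add(eventTimeTimers.pollFirst());
        }
        return fired;
    }

    // Advancing the wall clock fires processing-time timers, independently
    // of the watermark: both families coexist in one service.
    public List<Long> advanceProcessingTime(long now) {
        List<Long> fired = new ArrayList<>();
        while (!processingTimeTimers.isEmpty() && processingTimeTimers.first() <= now) {
            fired.add(processingTimeTimers.pollFirst());
        }
        return fired;
    }
}
```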

Implementation 1: CUMULATE window

SELECT window_start,
       window_end,
       platform,
       sum(bucket_dau) as dau
FROM (
    SELECT window_start,
           window_end,
           platform,
           count(distinct uid) as bucket_dau
    FROM TABLE(
        CUMULATE(
            TABLE user_log,
            DESCRIPTOR(time),
            INTERVAL '60' SECOND,
            INTERVAL '1' DAY))
    GROUP BY window_start,
             window_end,
             platform,
             MOD(HASH_CODE(uid), 1024)
) tmp
GROUP BY window_start,
         window_end,
         platform
• Advantage: for this kind of requirement the curve can be traced back perfectly.

• Disadvantage: if data arrives out of order across the large window there is a risk of losing counts; and because output is driven by the watermark, results are emitted with some delay.
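The MOD(HASH_CODE(uid), 1024) trick splits one huge distinct-count state into 1024 smaller buckets whose partial counts are summed in the outer query. A plain-Java illustration of the bucketing (Flink SQL's HASH_CODE is not guaranteed to match Java's String.hashCode, so this only demonstrates the idea):

```java
public class DauBucket {
    // Map each uid deterministically into one of `buckets` groups so the
    // distinct-count state is spread across many keys instead of one.
    public static int bucket(String uid, int buckets) {
        return Math.floorMod(uid.hashCode(), buckets); // always non-negative
    }
}
```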

Implementation 2: Deduplicate

-- enable minibatch if needed
select platform,
       count(1) as dau,
       max(time) as time
from (
    select uid,
           platform,
           time,
           row_number() over (
               partition by uid, platform, time / 24 / 3600 / 1000
               order by time desc) as rn
    from source
) tmp
where rn = 1
group by platform

Advantage: fast to compute.

Disadvantage: after a failover the curve does not replay well, and cube-style (multi-dimensional) computation is not supported.
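The partition expression time / 24 / 3600 / 1000 relies on successive integer division to turn an epoch-millisecond timestamp into a day index, so row_number() deduplicates each (uid, platform) pair once per day. A plain-Java check of that arithmetic:

```java
public class DayBucket {
    // Same successive integer divisions as in the SQL above: for positive
    // integers, floor(floor(x/a)/b) == floor(x/(a*b)), so this equals
    // epochMillis / 86_400_000, i.e. the day index since the epoch.
    public static long bucket(long epochMillis) {
        return epochMillis / 24 / 3600 / 1000;
    }
}
```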

Implementation 3: group agg

-- enable minibatch if needed
SELECT max(time) as time,
       platform,
       sum(bucket_dau) as dau
FROM (
    SELECT max(time) as time,
           platform,
           count(distinct uid) as bucket_dau
    FROM source
    GROUP BY platform,
             MOD(HASH_CODE(uid), 1024)
) t
GROUP BY platform

Advantage: fast to compute, and cube computation is supported.

Disadvantage: after a failover the curve does not replay well.

IV. How does Flink unify batch and stream processing?

Flink supports both stream and batch processing through a single underlying streaming engine.

On top of that streaming engine, Flink provides the following mechanisms:

1. Checkpointing and state: for fault-tolerant, stateful processing;

2. Watermarks: to implement an event-time clock;

3. Windows and triggers: to bound the scope of computation and define when results are emitted.

On the same streaming engine, Flink layers additional mechanisms for efficient batch processing:

1. Backtracking-based scheduling and recovery: introduced by Microsoft's Dryad and now used in almost every batch processor;

2. Special in-memory data structures for hashing and sorting, which spill part of the data from memory to disk when needed;

3. An optimizer that minimizes the time to produce results.

V. A Flink job has high latency. How would you go about fixing it?

In Flink's web UI we can see which operators and tasks are backpressured. The main levers are resource tuning and operator tuning.

Resource tuning means adjusting operator parallelism (parallelism), CPU (cores), heap memory (heap_memory), and similar parameters.

Job parameter tuning covers the parallelism settings, state configuration, and checkpoint configuration.
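As an illustration only, those knobs typically map to Flink configuration keys like the following; the values are placeholders to adapt to your workload, not recommendations:

```yaml
# Illustrative flink-conf.yaml fragment
parallelism.default: 8                 # default operator parallelism
taskmanager.numberOfTaskSlots: 4       # slots (and thus CPU share) per TaskManager
taskmanager.memory.process.size: 4g    # total memory of one TaskManager process
state.backend: rocksdb                 # state backend for large state
execution.checkpointing.interval: 60s  # checkpoint frequency
```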


Original article by Big Data Institute; please include the original link when reposting:
https://yzsam.com/2022/162/202206111740274146.html