Spark Tuning (I): from HQL to code

2022-07-05 11:13:00 【InfoQ】

Hello everyone, I'm Huai Jinyu, a big data newbie with two little money-eaters at home, Jia and Jia. I can write code and I can tutor homework: an all-around dad.
If you like my article, please [follow] + [like] + [comment]. Your triple support is what keeps me going, and I look forward to growing with you~

1. Cause

In day-to-day big data processing, the most common outputs are maximums, minimums, sums, and averages. The usual way to get them is to write an HQL query: group first, then add a max or a sum.

SELECT id,name,
 max(score1),
 sum(score2),
 avg(score3)
 FROM table
 GROUP BY id,name

Of course, if the conditions are more complicated, for example an added if check, the SQL just gets a little longer, but it can still be written.

Local testing was OK, and the test environment barely coped. It felt fine, as if it were only a matter of giving it more resources.

But once it went to production, I was dumbfounded.

Snappy compression, raw data: 500G
Records: 280 billion
First stage shuffle write: 800G
Estimated run time of the next stage: 8 hours

I used 1.5T of memory and a concurrency of 200. Memory overflowed frequently, tasks timed out, and GC pauses were long.

2. Starting to optimize

When the SQL has a problem, the first reaction is to tune how resources are allocated and used.

--conf spark.storage.memoryFraction=0.7

Many tasks were hitting heartbeat timeouts.

--conf spark.executor.heartbeatInterval=240

Task serialization was taking too long.

--conf spark.locality.wait=60

GC was taking a long time, so I tuned the JVM parameters.

-XX:+UseG1GC

Spark appeared to be merging tasks, so I added a repartition to force them apart.

dataset.repartition(20000)

This series of optimizations had some effect, but not much: the last step still took hours to run.

Looking at the SQL more carefully, I wondered whether Spark's underlying execution of several aggregates such as max and min has to traverse the data multiple times when the data volume is large.

3. Solving the problem

The final decision: rewrite it in code and run it again.

Dataset<Row> ds = spark.sql(sql);
ds.javaRDD().mapPartitionsToPair(
    // Transform each row into a (key, value) pair.
    // The grouping key is built as a Tuple2.
    // The value caches the differences that need to be aggregated later.
).reduceByKey(
    // Compare the two values to keep the maximum and the minimum.
    // The sum aggregations add up the cached differences directly.
    // The final result is produced in a single pass.
)
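The outline above leaves the merge step abstract. Below is a minimal sketch of that idea in plain Java, under stated assumptions: `SinglePassAgg`, `Row`, `Stats`, and `aggregate` are hypothetical names that do not come from the original job, and a `HashMap` stands in for the cluster so that the example runs without Spark. It covers the max, sum, and avg from the HQL example. Each row becomes a small accumulator, and two accumulators for the same key are combined by one associative function, so every row is visited once.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SinglePassAgg {

    /** One input record; the field names mirror the HQL example. */
    static final class Row {
        final long id;
        final String name;
        final double score1, score2, score3;

        Row(long id, String name, double score1, double score2, double score3) {
            this.id = id;
            this.name = name;
            this.score1 = score1;
            this.score2 = score2;
            this.score3 = score3;
        }
    }

    /** Running aggregates for one (id, name) group. */
    static final class Stats {
        final double maxScore1;
        final double sumScore2;
        final double sumScore3; // kept with count so avg(score3) can be derived at the end
        final long count;

        Stats(double maxScore1, double sumScore2, double sumScore3, long count) {
            this.maxScore1 = maxScore1;
            this.sumScore2 = sumScore2;
            this.sumScore3 = sumScore3;
            this.count = count;
        }

        /** Accumulator for a single row. */
        static Stats of(Row r) {
            return new Stats(r.score1, r.score2, r.score3, 1L);
        }

        /** Associative, commutative merge: the kind of function reduceByKey expects. */
        Stats merge(Stats other) {
            return new Stats(
                Math.max(maxScore1, other.maxScore1),
                sumScore2 + other.sumScore2,
                sumScore3 + other.sumScore3,
                count + other.count);
        }

        double avgScore3() {
            return sumScore3 / count;
        }
    }

    /** Local stand-in for reduceByKey: folds all rows sharing a key with Stats::merge. */
    static Map<String, Stats> aggregate(List<Row> rows) {
        Map<String, Stats> acc = new HashMap<>();
        for (Row r : rows) {
            String key = r.id + "|" + r.name; // a Tuple2 key in the Spark version
            acc.merge(key, Stats.of(r), Stats::merge);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(1, "alice", 80, 10, 3),
            new Row(1, "alice", 90, 5, 5),
            new Row(2, "bob", 70, 1, 4));
        aggregate(rows).forEach((k, s) -> System.out.println(
            k + " max=" + s.maxScore1 + " sum=" + s.sumScore2 + " avg=" + s.avgScore3()));
    }
}
```

In a Spark job, a function like `Stats::merge` could be handed to `JavaPairRDD.reduceByKey`, with `Stats` made `Serializable` so it can cross the shuffle. The average is carried as a sum plus a count because averages cannot be merged directly, whereas sums and counts can.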

Heart racing and hands trembling, I submitted the task and left work. Whatever will be, will be; the results could wait until tomorrow.

The end result: an hour and a half. The efficiency is acceptable and memory stays under control. With a sensible resource allocation, the number of executors can be raised further to increase parallelism.

4. Summary

For relatively complex HQL operations, especially on raw data, you must take the data volume into account. Once the data grows past a certain size, adding resources alone will not get the job through, and the room left for tuning shrinks.

Closing

You can follow the official account 【 Huai Jin holds Yu's Jia and Jia 】 to find out how to download the resources.

Copyright notice
This article was created by [InfoQ]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/186/202207051057462499.html