Spark SQL task performance optimization (basic)
2022-07-02 07:17:00 【Software development heart】
Why optimize tasks
- For the project, optimization saves machine computing resources, and resources are time and money.
- Execution time may drop significantly. For long chains of dependent tasks this reduces waiting time, especially when the optimized task is upstream, which makes data delivery more stable.
- High-frequency data tasks stay timely.
Spark UI
The numbered items refer to the navigation bar at the top of the Spark UI:
- 1 is the Jobs page. It shows all the jobs the current application was broken into, and the time the executors spent executing each action.
- 2 is the Stages page. It shows all the stages of the application. Stages are split at wide dependencies, so they are finer grained than jobs.
- 3 is the Storage page. Anything we cache or persist shows up here, so you can see how much cache the application currently uses.
- 4 is the Environment page. It shows the environment Spark currently depends on, such as the JDK and libraries.
- 5 is the Executors page. It shows the memory used by each executor, along with shuffle input and output data.
- 6 is the name of the application. If setAppName is used in the code, that name is shown here.
- 7 is the main area of the Jobs page.
Jobs page
This page shows information about the jobs that were executed. An action operator, such as count or insert into, starts a job.
**In Spark, RDD computations fall into two categories: transformations and actions.** Only an action triggers the actual RDD computation; count and insert into are examples of actions that do so.
The page also shows the submission time and duration of every job, its number of stages and successful stages, and its number of tasks and successful tasks.
Find the longest-running job
- The Duration column on the Jobs page supports sorting. Sorting shows which job runs longest, which makes it easier to drill down into its stages and tasks.
- Watch the gaps between jobs to find out whether there is an unreasonable number of files or a problem with cluster RPC.
- If no Spark job is particularly slow, look at the interval between jobs, that is, the difference between the completion time of one job and the start time of the next. Such gaps can also be the performance bottleneck. In that case, use the driver log to narrow things down further. (With too many files, writing files becomes time-consuming as soon as the cluster is under slightly higher load.)
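The gap check described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the submission and completion times of each job have been copied from the Jobs page by hand. The `Job` tuple, the `job_gaps` helper, and the sample timestamps are hypothetical and not part of Spark.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class Job(NamedTuple):
    job_id: int
    submitted: datetime   # submission time shown on the Jobs page
    completed: datetime   # submission time plus duration


def job_gaps(jobs: List[Job]) -> List[Tuple[int, int, float]]:
    """Return (previous job id, next job id, gap in seconds), largest gap first."""
    ordered = sorted(jobs, key=lambda j: j.submitted)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = (nxt.submitted - prev.completed).total_seconds()
        gaps.append((prev.job_id, nxt.job_id, max(gap, 0.0)))  # overlapping jobs count as no gap
    return sorted(gaps, key=lambda g: g[2], reverse=True)


# Hypothetical sample data, not taken from a real application.
t = datetime.fromisoformat
jobs = [
    Job(0, t("2021-11-05 14:00:00"), t("2021-11-05 14:02:00")),
    Job(1, t("2021-11-05 14:02:05"), t("2021-11-05 14:03:00")),
    Job(2, t("2021-11-05 14:10:00"), t("2021-11-05 14:11:30")),
]
for prev_id, next_id, seconds in job_gaps(jobs):
    print(f"job {prev_id} -> job {next_id}: {seconds:.0f}s idle")
```

A large gap at the top of the list marks the time window to look up in the driver log.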
Stage list page
In Spark, jobs are split by action operations. The next level down is the stage, which is split at wide dependencies.
The page displays the RDD dependency graph. From the SQL you can find the corresponding RDD logic; pay most attention to Exchange (which produces a shuffle). Note: group by and join statements produce an Exchange.
A stage sits between every two shuffles, such as those produced by join and group by.
Here you can see the stage information that the Spark task is finally converted into: completed stages, active stages, and skipped stages.
For each stage you can see:
- the number of tasks and the number of successful tasks
- Input: the size of the data read from Hadoop or Spark storage
- Output: the size of the data written to Hadoop
- Shuffle Read: the size of the shuffle data read by the stage
- Shuffle Write: the size of the data the stage writes to disk for a future stage to read
Generally speaking, the number of tasks (partitions) equals spark.sql.shuffle.partitions. When Spark reads data from Hive or Kudu, the number of tasks generally corresponds to the number of partitions in the source table.
task list page
Focus on data distribution
The task metrics for a stage show how data is distributed across all of its tasks, which helps find data skew (a task with a long duration and a large amount of data).
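One way to make "long duration, large amount of data" concrete is to compare each task against the median of its stage. The sketch below assumes the per-task record counts have been read off the task list by hand. The `skew_report` helper and the threshold of 5 times the median are illustrative assumptions, not a Spark rule.

```python
from statistics import median
from typing import Dict, List


def skew_report(task_records: List[int], factor: float = 5.0) -> Dict[str, object]:
    """Flag tasks whose record count exceeds `factor` times the stage median."""
    mid = median(task_records)
    skewed = [i for i, n in enumerate(task_records) if mid > 0 and n > factor * mid]
    return {"median": mid, "max": max(task_records), "skewed_tasks": skewed}


# Hypothetical per-task record counts for one stage.
records = [1000, 1200, 900, 1100, 250000, 1050]
report = skew_report(records)
print(report)
```

The same comparison works for task duration or shuffle read size instead of record counts.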
Executors page
This page is used often. On the one hand, it shows whether data skew has occurred on any executor. On the other hand, it lets you analyze whether the application has produced a large amount of shuffle, and whether the shuffle data volume can be reduced through data locality or by reducing data transfer.
Viewing the SQL execution plan
Scan: reads data from Hive or Kudu
Filter: filtering, including removing missing values and conditional filtering
Project: projection; selects the columns you want
HashAggregate: aggregation
WholeStageCodegen: whole-stage code generation, which fuses multiple processing steps into a single code module; this is Spark's newer SQL code generation model
Exchange: data repartitioning
Sort: sorts the data by a key
SortMergeJoin: the join strategy for joining a large table with another large table
- shuffle stage: repartition the two large tables by the join key; corresponds to Exchange in the execution plan
- sort stage: within each partition, sort the data of the two tables separately; corresponds to Sort in the execution plan
- merge stage: perform the join on the two sorted partitions.
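The sort and merge stages can be illustrated with a small pure-Python inner join over one partition. This is a sketch of the general sort-merge technique, not Spark's implementation. The shuffle stage is left out because both inputs are assumed to sit in the same partition already, and the `sort_merge_join` helper and sample rows are hypothetical.

```python
from typing import Any, List, Tuple

Row = Tuple[Any, Any]  # (join key, payload)


def sort_merge_join(left: List[Row], right: List[Row]) -> List[Tuple[Any, Any, Any]]:
    """Inner join of two lists of (key, value) rows within one partition."""
    # sort stage: order both sides by the join key
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])

    out = []
    i = j = 0
    # merge stage: advance two cursors over the sorted inputs
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # find the run of rows with this key on the right side
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # pair every left row carrying this key with that run
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


orders = [(2, "o-b"), (1, "o-a"), (2, "o-c"), (4, "o-d")]
users = [(3, "carol"), (2, "bob"), (1, "alice")]
print(sort_merge_join(orders, users))
```

Because both sides are sorted, each cursor only moves forward, so the merge makes a single pass over each input apart from the repeated reads of equal-key runs.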

Performance tuning also follows the 80/20 rule: the common 20% of optimization methods cover roughly 80% of the tasks that need optimizing. In daily work, runtime performance also has to be balanced against development efficiency and the readability of the logic.
Slow table reads (I/O)
File format problem: change non-splittable storage formats to ORC (the default storage format online).
Storage formats such as Parquet and ORC read and write by column. Most of the time we do not need every field, so columnar storage reduces the amount of data read each time. Columnar storage also supports predicate pushdown, which reduces reads by skipping data based on the value ranges recorded in the files.
The scan volume is too large: every partition holds a full snapshot, and no partition is selected up front.
A common mistake is to treat the partition column as an ordinary business filter instead of selecting the partition up front.
Avoid reading the same large table multiple times; most of the resources go to the I/O of reading the large table.
Solution: read the table once and rewrite the logic with case when conditions, using where (condition1 = A or condition2 = B).
Is cacheTable('') feasible when the table is read several times? Caching a large table is itself time-consuming, and we also have to consider whether the memory settings are sufficient for the cache.
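The "read once, branch with case when" rewrite can be tried out without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL, since the SQL involved is plain standard SQL. The `orders` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT, channel TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("paid", "app", 10), ("paid", "web", 20), ("refund", "app", 5), ("new", "web", 7)],
)

# Two passes over the table: one scan per metric.
paid_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
).fetchone()[0]
app_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE channel = 'app'"
).fetchone()[0]

# One pass: the WHERE clause keeps rows needed by either metric,
# and CASE WHEN routes each row to the metric(s) it belongs to.
single_pass = conn.execute(
    """
    SELECT SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_total,
           SUM(CASE WHEN channel = 'app' THEN amount ELSE 0 END) AS app_total
    FROM orders
    WHERE status = 'paid' OR channel = 'app'
    """
).fetchone()

print((paid_total, app_total), single_pass)
```

Both forms return the same totals; the single-pass form scans the table once, which is where the saving comes from on a large table.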
Another common case aggregates the same large table by different group by fields and then combines the results with union all, in order to merge metrics from different dimensions.
Solution: use the GROUPING SETS clause of group by (it is a grouping clause, not a window function) to avoid reading the table more than once. It reuses the shuffle data of the group by, which also reduces the number of shuffle operations.
Filter as early as possible
After obtaining the initial RDD, filter out the data you do not need as soon as possible. This reduces memory use and improves the efficiency of the Spark job.
A common case: table A is joined with table B, table A is filtered, but table B is not.
Reduce writes to disk
Frequently writing in-memory results to disk is time-consuming. Intermediate results can be passed along through a temporary view instead. Note that the execution plan shows a temporary view to be logic reuse, not memory reuse, so the logic is evaluated again wherever the view is used. Logic that is worth reusing can be materialized as a small table with cache table; remember to uncache it when it is no longer needed.
cache table cache_t1 as select ....;
uncache table cache_t1;
Reuse of computation
Computation is reused through the execution strategy. The most expensive operation in Spark is the shuffle itself. Spark's bucketed storage can materialize the bucketing information of a table. When tables that are bucketed the same way are used, the shuffle can be reused at run time and no further shuffle is needed. This speeds up join, group by, and over(), which are among the most common operations in production.
Make good use of cache
In Spark computations it is not recommended to call cache directly on an RDD (for RDDs, cache() uses the MEMORY_ONLY level): if a lot of data is cached, memory may overflow. Use persist instead and set the storage level to MEMORY_AND_DISK, so that data is cached on disk when memory runs short. In addition, design the code sensibly and use broadcast and cache in moderation. Broadcasting too much data puts pressure on the network, and cache that is not released in time ties up memory. Generally speaking, consider caching when your code reuses an RDD, and call unpersist promptly once it is no longer needed.
Data skew
Data skew is a common cause of poor Spark SQL performance. It means that one partition holds far more data than the other partitions, so a few tasks run much longer than the rest and drag down the running time of the whole SQL.
- Filter useless data at the source
- Handle it manually
- Scatter the skewed keys and aggregate in two stages (for aggregations), or split out the skewed keys, process them separately, and union the results
- Handle special keys: map null values to specific keys so that they are distributed across different nodes, or exclude null values from processing
- Broadcast smaller tables (those under the default size threshold, spark.sql.autoBroadcastJoinThreshold, 10 MB by default), or raise the BroadcastHashJoin threshold. In some scenarios this turns a SortMergeJoin into a BroadcastHashJoin and avoids the data skew caused by the shuffle
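The "scatter the skewed key, aggregate in two stages" item can be shown in miniature with plain Python. This is a sketch of the salting idea only, using a count aggregation. The `salted_count` helper, the number of salt buckets, and the sample keys are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Tuple


def salted_count(
    keys: Iterable[str], salt_buckets: int = 4, seed: int = 0
) -> Tuple[Dict[str, int], Counter]:
    """Count keys in two stages: first by (key, salt), then by key."""
    rng = random.Random(seed)
    # stage 1: a random salt spreads one hot key over several partial groups
    partial = Counter((key, rng.randrange(salt_buckets)) for key in keys)
    # stage 2: drop the salt and merge the partial counts per original key
    final: Dict[str, int] = {}
    for (key, _salt), count in partial.items():
        final[key] = final.get(key, 0) + count
    return final, partial


# Hypothetical skewed input: one hot key and two small ones.
keys = ["hot"] * 1000 + ["a"] * 3 + ["b"] * 2
final, partial = salted_count(keys)
print(final)
print(max(partial.values()))  # largest partial group after salting
```

In Spark SQL the same shape is a group by on the key plus a salt column, followed by a second group by on the key alone. The largest group in the first stage is smaller than the hot key's full count, which is what relieves the slow task.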
Heavy disk spill (shuffle stage)
During a shuffle, records with the same key on each node are first written to local disk files, and other nodes then pull the records for the same key from those files over the network. When records with the same key are brought to one node for aggregation, there may be too many of them to fit in memory, so they spill to disk files. A shuffle can therefore involve a large amount of disk I/O as well as network transfer, and disk I/O and network transfer are the main reasons for poor shuffle performance.
Avoid shuffle operations where possible (which may sound like stating the obvious).
Increase the memory parameter --executor-memory appropriately.
Tune the number of shuffle partitions, spark.sql.shuffle.partitions; in SQL, add set spark.sql.shuffle.partitions=200;
To be continued