当前位置:网站首页>如何通过 dba_hist_active_sess_history 分析数据库历史性能问题
如何通过 dba_hist_active_sess_history 分析数据库历史性能问题
2022-06-21 16:56:00 【用户7185440】
如何通过 dba_hist_active_sess_history 分析数据库历史性能问题
背景
在很多情况下,当数据库发生性能问题的时候,我们并没有机会来收集足够的诊断信息,比如system state dump或者hang analyze,甚至问题发生的时候DBA根本不在场。这给我们诊断问题带来很大的困难。那么在这种情况下,我们是否能在事后收集一些信息来分析问题的原因呢?
在Oracle 10G或者更高版本上,答案是肯定的。本文我们将介绍一种通过dba_hist_active_sess_history的数据来分析问题的一种方法。适用于
Oracle 10G或更高版本的单机或rac,本文适用于任何平台。详情
Active Session: on the CPU or was waiting for an event that didn't belong to the Idle wait class
在Oracle 10G中,我们引入了AWR和ASH采样机制,有一个视图gv$active_session_history会每秒钟将数据库所有节点的Active Session采样一次,
而dba_hist_active_sess_history则会将gv$active_session_history里的数据每10秒采样一次并持久化保存。
基于这个特征,我们可以通过分析dba_hist_active_sess_history的Session采样情况,来定位问题发生的准确时间范围,并且可以观察每个采样点的top event和top holder。下面通过一个例子来详细说明。dba_hist_active_sess_history版
1. Dump出问题期间的ASH数据:
为了不影响生产系统,我们可以将问题大概期间的ASH数据export出来在测试机上分析。
基于dba_hist_active_sess_history创建一个新表t_ash,然后将其通过exp/imp导入到测试机。在发生问题的数据库上执行exp:
SQL> conn user/passwd
SQL> create table t_ash as select * from dba_hist_active_sess_history where SAMPLE_TIME between TO_TIMESTAMP ('<time_begin>', 'YYYY-MM-DD HH24:MI:SS') and TO_TIMESTAMP ('<time_end>', 'YYYY-MM-DD HH24:MI:SS');
$ exp user/passwd file=t_ash.dmp tables=(t_ash) log=t_ash.exp.log
然后导入到测试机:
$ imp user/passwd file=t_ash.dmp log=t_ash.imp.log2. 验证导出的ASH时间范围:
建议采用Oracle SQL Developer来查询以防止输出结果折行不便于观察。
set line 200 pages 1000
col sample_time for a25
col event for a40
alter session set nls_timestamp_format='yyyy-mm-dd hh24:mi:ss.ff';
select
t.dbid, t.instance_number, min(sample_time), max(sample_time), count(*) session_count
from t_ash t
group by t.dbid, t.instance_number
order by dbid, instance_number;
INSTANCE_NUMBER MIN(SAMPLE_TIME) MAX(SAMPLE_TIME) SESSION_COUNT
1 2015-03-26 21:00:04.278 2015-03-26 22:59:48.387 2171
2 2015-03-26 21:02:12.047 2015-03-26 22:59:42.584 36
从以上输出可知该数据库共2个节点,采样时间共2小时,节点1的采样比节点2要多很多,问题可能发生在节点1上。3. 确认问题发生的精确时间范围:
参考如下脚本:
col event for a35
set lines 222
set pages 9999
select
dbid, instance_number, sample_id, sample_time, count(*) session_count
from t_ash t
group by dbid, instance_number, sample_id, sample_time
order by dbid, instance_number, sample_time;
INSTANCE_NUMBER SAMPLE_ID SAMPLE_TIME SESSION_COUNT
1 36402900 2015-03-26 22:02:50.985 4
1 36402910 2015-03-26 22:03:01.095 1
1 36402920 2015-03-26 22:03:11.195 1
1 36402930 2015-03-26 22:03:21.966 21
1 36402940 2015-03-26 22:03:32.116 102
1 36402950 2015-03-26 22:03:42.226 181
1 36402960 2015-03-26 22:03:52.326 200
1 36402970 2015-03-26 22:04:02.446 227
1 36402980 2015-03-26 22:04:12.566 242
1 36402990 2015-03-26 22:04:22.666 259
1 36403000 2015-03-26 22:04:32.846 289
1 36403010 2015-03-26 22:04:42.966 147
1 36403020 2015-03-26 22:04:53.076 2
1 36403030 2015-03-26 22:05:03.186 4
1 36403040 2015-03-26 22:05:13.296 1
1 36403050 2015-03-26 22:05:23.398 1
注意观察以上输出的每个采样点的active session的数量,数量突然变多往往意味着问题发生了。从以上输出可以确定问题发生的精确时间在2015-03-26 22:03:21 ~ 22:04:42,问题持续了大约1.5分钟。
注意: 观察以上的输出有无断档,比如某些时间没有采样。4. 确定每个采样点的top n event:
在这里我们指定的是top 2 event 并且注掉了采样时间以观察所有采样点的情况。如果数据量较多,您也可以通过开启sample_time的注释来观察某个时间段的情况。注意最后一列session_count指的是该采样点上的等待该event的session数量。
col event for a35
set lines 1000
set pages 9999
select t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.event,
t.session_state,
t.c session_count
from (select t.*,
rank() over(partition by dbid, instance_number, sample_time order by c desc) r
from (select /*+ parallel 2 */
t.*,
count(*) over(partition by dbid, instance_number, sample_time, event) c,
row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
from t_ash t
/*where sample_time >
to_timestamp('2013-11-17 13:59:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2013-11-17 14:10:00',
'yyyy-mm-dd hh24:mi:ss')*/
) t
where r1 = 1) t
where r < 3
order by dbid, instance_number, sample_time, r;
SAMPLE_ID SAMPLE_TIME INSTANCE_NUMBER EVENT SESSION_STATE SESSION_COUNT
36402900 22:02:50.985 1 ON CPU 3
36402900 22:02:50.985 1 db file sequential read WAITING 1
36402910 22:03:01.095 1 ON CPU 1
36402920 22:03:11.195 1 db file parallel read WAITING 1
36402930 22:03:21.966 1 cursor: pin S wait on X WAITING 11
36402930 22:03:21.966 1 latch: shared pool WAITING 4
36402940 22:03:32.116 1 cursor: pin S wait on X WAITING 83
36402940 22:03:32.116 1 SGA: allocation forcing component growth WAITING 16
36402950 22:03:42.226 1 cursor: pin S wait on X WAITING 161
36402950 22:03:42.226 1 SGA: allocation forcing component growth WAITING 17
36402960 22:03:52.326 1 cursor: pin S wait on X WAITING 177
36402960 22:03:52.326 1 SGA: allocation forcing component growth WAITING 20
36402970 22:04:02.446 1 cursor: pin S wait on X WAITING 204
36402970 22:04:02.446 1 SGA: allocation forcing component growth WAITING 20
36402980 22:04:12.566 1 cursor: pin S wait on X WAITING 219
36402980 22:04:12.566 1 SGA: allocation forcing component growth WAITING 20
36402990 22:04:22.666 1 cursor: pin S wait on X WAITING 236
36402990 22:04:22.666 1 SGA: allocation forcing component growth WAITING 20
36403000 22:04:32.846 1 cursor: pin S wait on X WAITING 265
36403000 22:04:32.846 1 SGA: allocation forcing component growth WAITING 20
36403010 22:04:42.966 1 enq: US - contention WAITING 69
36403010 22:04:42.966 1 latch: row cache objects WAITING 56
36403020 22:04:53.076 1 db file scattered read WAITING 1
36403020 22:04:53.076 1 db file sequential read WAITING 1
从以上输出我们可以发现问题期间最严重的等待为cursor: pin S wait on X,高峰期等待该event的session数达到了265个,其次为SGA: allocation forcing component growth,高峰期session为20个。
***注意:
1) 再次确认以上输出有无断档,是否有某些时间没有采样。
2) 注意那些session_state为ON CPU的输出,比较ON CPU的进程个数与您的OS物理CPU的个数,如果接近或者超过物理CPU个数,那么您还需要检查OS当时的CPU资源状况,
比如OSWatcher/NMON等工具,高的CPU Run Queue可能引发该问题,当然也可能是问题的结果,需要结合OSWatcher和ASH的时间顺序来验证。5. 观察每个采样点的等待链: (更详细显示参考:附1)
其原理为通过dba_hist_active_sess_history. blocking_session记录的holder来通过connect by级联查询,找出最终的holder. 在RAC环境中,每个节点的ASH采样的时间很多情况下并不是一致的,
因此您可以通过将本SQL的第二段注释的sample_time稍作修改让不同节点相差1秒的采样时间可以比较(注意最好也将partition by中的sample_time做相应修改)。
该输出中isleaf=1的都是最终holder,而iscycle=1的代表死锁了(也就是在同一个采样点中a等待b,b等待c,而c又等待a,这种情况如果持续发生,那么尤其值得关注)。采用如下查询能观察到阻塞链。
col event for a35
set lines 1000
set pages 9999
select /*+ parallel 8 */
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.sql_id,
t.session_type,
t.event,
t.session_state,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_status
from t_ash t
/*where sample_time >
to_timestamp('2013-11-17 13:55:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2013-11-17 14:10:00',
'yyyy-mm-dd hh24:mi:ss')*/
start with blocking_session is not null
/*start with event = 'buffer busy waits'*/
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
and prior blocking_inst_id = instance_number
and prior blocking_session = session_id
and prior blocking_session_serial# = session_serial#
order siblings by dbid, sample_time;
LV ISLEAF ISCYCLE SAMPLE_TIME INSTANCE_NUMBER SESSION_ID SQL_ID EVENT SESSION_STATE BLOCKING_INST_ID BLOCKING_SESSION BLOCKING_SESSION_STATUS
1 0 0 22:04:32.846 1 1259 3ajt2htrmb83y cursor: WAITING 1 537 VALID
2 1 0 22:04:32.846 1 537 3ajt2htrmb83y SGA: WAITING UNKNOWN
***注意:
为了输出便于阅读,我们将等待event做了简写,下同。从上面的输出可见,在相同的采样点上(22:04:32.846),节点1 session 1259 在等待cursor: pin S wait on X,其被节点1 session 537 阻塞,
而节点1 session 537 又在等待SGA: allocation forcing component growth,并且ASH没有采集到其holder,因此这里cursor: pin S wait on X只是一个表面现象,
问题的原因在于SGA: allocation forcing component growth6. 基于第5步的原理来找出每个采样点的最终top holder:
比如如下SQL列出了每个采样点top 2的blocker session,并且计算了其最终阻塞的session数(参考blocking_session_count)
col event for a35
set lines 1000
set pages 9999
select t.lv,
t.iscycle,
t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.sql_id,
t.session_type,
t.event,
t.seq#,
t.session_state,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_status,
t.c blocking_session_count
from (select t.*,
row_number() over(partition by dbid, instance_number, sample_time order by c desc) r
from (select t.*,
count(*) over(partition by dbid, instance_number, sample_time, session_id) c,
row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1
from (select
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.*
from t_ash t
/*where sample_time >
to_timestamp('2013-11-17 13:55:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2013-11-17 14:10:00',
'yyyy-mm-dd hh24:mi:ss')*/
start with blocking_session is not null
/*start with event = 'buffer busy waits'*/
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
and prior blocking_inst_id = instance_number
and prior blocking_session = session_id
and prior
blocking_session_serial# = session_serial#) t
where t.isleaf = 1) t
where r1 = 1) t
where r < 3
order by dbid, sample_time, r;
SAMPLE_TIME INSTANCE_NUMBER SESSION_ID SQL_ID EVENT SEQ# SESSION_STATE BLOCKING_SESSION_STATUS BLOCKING_SESSION_COUNT
22:03:32.116 1 1136 1p4vyw2jan43d SGA: 1140 WAITING UNKNOWN 82
22:03:32.116 1 413 9g51p4bt1n7kz SGA: 7646 WAITING UNKNOWN 2
22:03:42.226 1 1136 1p4vyw2jan43d SGA: 1645 WAITING UNKNOWN 154
22:03:42.226 1 537 3ajt2htrmb83y SGA: 48412 WAITING UNKNOWN 4
22:03:52.326 1 1136 1p4vyw2jan43d SGA: 2150 WAITING UNKNOWN 165
22:03:52.326 1 537 3ajt2htrmb83y SGA: 48917 WAITING UNKNOWN 8
22:04:02.446 1 1136 1p4vyw2jan43d SGA: 2656 WAITING UNKNOWN 184
22:04:02.446 1 537 3ajt2htrmb83y SGA: 49423 WAITING UNKNOWN 10
22:04:12.566 1 1136 1p4vyw2jan43d SGA: 3162 WAITING UNKNOWN 187
22:04:12.566 1 2472 SGA: 1421 WAITING UNKNOWN 15
22:04:22.666 1 1136 1p4vyw2jan43d SGA: 3667 WAITING UNKNOWN 193
22:04:22.666 1 2472 SGA: 1926 WAITING UNKNOWN 25
22:04:32.846 1 1136 1p4vyw2jan43d SGA: 4176 WAITING UNKNOWN 196
22:04:32.846 1 2472 SGA: 2434 WAITING UNKNOWN 48
从以上输出,比如第一行,代表在22:03:32.116,节点1的session 1136 最终阻塞了82个session. 顺着时间往下看,可见节点1的session 1136 是问题期间最严重的holder,
它在每个采样点都阻塞了100多个session,并且它持续等待SGA: allocation forcing component growth,
注意观察其seq#您会发现该event的seq#在不断变化,表明该session并未完全hang住,由于时间正好发生在夜间22:00左右,这显然是由于自动收集统计信息job导致shared memory resize造成,
因此可以结合Scheduler/MMAN的trace以及dba_hist_memory_resize_ops的输出进一步确定问题。
**注意:
1) blocking_session_count 指某一个holder最终阻塞的session数,比如 a <- b<- c (a被b阻塞,b又被c阻塞),只计算c阻塞了1个session,因为中间的b可能在不同的阻塞链中发生重复。
2) 如果最终的holder没有被ash采样(一般因为该holder处于空闲),比如 a<- c 并且b<- c (a被c阻塞,并且b也被c阻塞),但是c没有采样,那么以上脚本无法将c统计到最终holder里,这可能会导致一些遗漏。
3) 注意比较blocking_session_count的数量与第3步查询的每个采样点的总session_count数,如果每个采样点的blocking_session_count数远小于总session_count数,那表明大部分session并未记载holder,因此本查询的结果并不能代表什么。
4) 在Oracle 10g中,ASH并没有blocking_inst_id列,在以上所有的脚本中,您只需要去掉该列即可,因此10g的ASH一般只能用于诊断单节点的问题。其他关于ASH的应用
除了通过ASH数据来找holder以外,我们还能用它来获取很多信息(基于数据库版本有所不同):
比如通过PGA_ALLOCATED列来计算每个采样点的最大PGA,合计PGA以分析ora-4030/Memory Swap相关问题;
通过TEMP_SPACE_ALLOCATED列来分析临时表空间使用情况;
通过IN_PARSE/IN_HARD_PARSE/IN_SQL_EXECUTION列来分析SQL处于parse还是执行阶段;
通过CURRENT_OBJ#/CURRENT_FILE#/CURRENT_BLOCK#来确定I/O相关等待发生的对象.
附1:
col event for a35
set lines 1000
set pages 9999
select /*+ parallel 8 */
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.session_serial#,
t.session_type,
t.session_state,
t.SQL_OPNAME,
t.sql_id,
T.SQL_PLAN_HASH_VALUE,
t.SQL_PLAN_LINE_ID,
T.SQL_CHILD_NUMBER,
T.SQL_EXEC_ID,
T.SQL_EXEC_START,
T.IN_BIND,
t.IN_PARSE,
t.IN_HARD_PARSE,
t.IN_SQL_EXECUTION,
t.IN_PLSQL_COMPILATION,
t.IN_PLSQL_EXECUTION,
t.IN_SEQUENCE_LOAD,
T.IN_PLSQL_RPC,
t.PROGRAM,
t.MACHINE,
t.PGA_ALLOCATED,
t.TEMP_SPACE_ALLOCATED,
t.DELTA_READ_IO_BYTES,
t.DELTA_WRITE_IO_BYTES,
t.TM_DELTA_CPU_TIME,
t.event,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_serial#,
t.blocking_session_status,
t.WAIT_TIME,
t.TIME_WAITED,
t.CURRENT_OBJ#,
t.CURRENT_ROW#,
t.xid
from m_ash_202204162 t
start with event = 'buffer busy waits'
--start with blocking_session is not null
connect by nocycle prior dbid = dbid and prior blocking_session = session_id and prior blocking_session_serial# = session_serial#
and prior sample_time = sample_time and prior blocking_inst_id = instance_number order siblings by dbid, sample_time;附2: 查看某个具体session的采样等待信息
col event for a35
set lines 1000
set pages 9999
select /*+ parallel 4 */
t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.session_serial#,
t.session_type,
t.session_state,
t.SQL_OPNAME,
t.sql_id,
T.SQL_PLAN_HASH_VALUE,
t.SQL_PLAN_LINE_ID,
T.SQL_CHILD_NUMBER,
T.SQL_EXEC_ID,
T.SQL_EXEC_START,
T.IN_BIND,
t.IN_PARSE,
t.IN_HARD_PARSE,
t.IN_SQL_EXECUTION,
t.IN_PLSQL_COMPILATION,
t.IN_PLSQL_EXECUTION,
t.IN_SEQUENCE_LOAD,
T.IN_PLSQL_RPC,
t.PROGRAM,
t.MACHINE,
t.PGA_ALLOCATED,
t.TEMP_SPACE_ALLOCATED,
t.DELTA_READ_IO_BYTES,
t.DELTA_WRITE_IO_BYTES,
t.TM_DELTA_CPU_TIME,
t.event,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_serial#,
t.blocking_session_status,
t.WAIT_TIME,
t.TIME_WAITED,
t.CURRENT_OBJ#,
t.CURRENT_ROW#,
t.xid
from m_ash_202204162 t
where session_id = 10501
order by dbid, instance_number,sample_time;附3:查看Sql文本
select * from dba_hist_sqltext where sql_id='bxgrjr8wa7g7x';附4:查看sql执行详细信息参考文档:
分析sql历史执行信息.md
边栏推荐
- 服务端socket程序
- RMB 18billion, a super master fund was born in Suzhou
- 有哪些好用的工作汇报工具
- Lei Jun's hundreds of billions of mistakes?
- 案例驱动 :从入门到掌握Shell编程详细指南
- 论文解读(USIB)《Towards Explanation for Unsupervised Graph-Level Representation Learning》
- RT thread persimmon pie M7 Quanzhi f133 DDR running xboot
- 剑指 Offer 37. 序列化二叉树
- AttributeError: module ‘cv2‘ has no attribute ‘gapi_ wip_ gst_ GStreamerPipeline‘
- Laravel实现软删除
猜你喜欢

Viewing technological changes through Huawei Corps (IV): interactive media (Music)

【微服务|Nacos】快速实现nacos的配置中心功能,并完成配置更新和版本迭代
![[pwn基础]Pwntools学习](/img/67/6fb8a9628d5b750b0396f083aaeb91.png)
[pwn基础]Pwntools学习

Encryption crash, is there a future for Web3 games? In depth discussion of 5 papers

Server socket program
![LeetCode 1108 IP地址无效化[暴力] HERODING的LeetCode之路](/img/c6/d3eb6cee92b1c0848bf3d3b58b8b22.png)
LeetCode 1108 IP地址无效化[暴力] HERODING的LeetCode之路

RMB 18billion, a super master fund was born in Suzhou

System V IPC与POSIX IPC

企业高管收入杂谈

Redis6.0 new features (Part 1)
随机推荐
Node输出方式
Vue.js+Node.js全栈开发教程:连接MySQL
编写第一个C#应用程序——Hello, C#
[technical management] assembly number and sword team
快速失败和安全失败的区别
POSIX共享内存
PHP的empty,isset和is_null区别
如何使用 DATAX 以 UPSERT 语义更新下游 ORACLE 数据库中的数据
数字经济王宁教你如何正确地选择短期投资
基于AM4377的EtherCAT主站控制stm32从站
EtherCAT igh master station controls Esther servo to return to zero
Development of digital collection system and construction of NFT artwork trading platform
CentOS使用composer install 报错 - phpunitphpunit 8
力扣461.汉明距离
PHP连接Mysql8.0报错:Illuminate\Database\QueryException
Differences between WCDMA and LTE
Byte traffic business experience: realize as soon as possible, sew money bags, and sell all in goods
Excess rlsp
主动学习(Active Learning) 概述、策略和不确定性度量
180亿,苏州诞生一个超级母基金