Reposted from my CSDN blog.
Symptom
A new video-playback freeze: during a burn-in test switching among 4 video channels every 10 seconds, the test app's picture froze, although it was still possible to switch to other screens.
CPU load
Checking the load showed that a thread in the misbehaving app was pegging a CPU core, with most of the time spent in the kernel:
```
User 2%, System 23%, IOW 0%, IRQ 0%
User 30 + Nice 0 + Sys 283 + Idle 899 + IOW 0 + IRQ 0 + SIRQ 0 = 1212

  PID   TID PR CPU% S      VSS     RSS PCY UID    Thread          Proc
 2768 29663  1  24% R 1398368K 182332K  fg u0_a53 Thread-13095407 com.test33.demo
32103 32103  2   0% R    2364K    948K  fg root   top             top
   77    77  0   0% S       0K      0K  fg root   fb-vsync
  193   222  0   0% S  171340K  15512K  fg system mali-renderer   /system/bin/surfaceflinger
  190   190  0   0% S    2472K    792K  fg root   lmkd            /system/bin/lmkd
    8     8  0   0% S       0K      0K  fg root   rcu_preempt
    9     9  0   0% S       0K      0K  fg root   rcu_bh
   10    10  3   0% S       0K      0K  fg root   rcu_sched
```
debuggerd
Use debuggerd to obtain the process's backtrace and register state:
```
ABI: 'arm'
pid: 2768, tid: 29663, name: Thread-13095407  >>> com.test33.demo <<<
signal 19 (SIGSTOP), code 0 (SI_USER), fault addr --------
    r0 00000000  r1 00001000  r2 00000098  r3 9c29b000
    r4 9c19f000  r5 9c29c564  r6 9c29c000  r7 b8b65468
    r8 9c1a0000  r9 9c1a0000  sl 9c19f000  fp ffffffff
    ip b8b65468  sp 9c29c558  lr b4c8982b  pc b4c89832  cpsr 80070030
...
backtrace:
    #00 pc 0034d832  /system/lib/libart.so (_ZN3art6Thread25InstallImplicitProtectionEv+233)
    #01 pc 0034da73  /system/lib/libart.so (_ZN3art6Thread12InitStackHwmEv+210)
    #02 pc 0034dca1  /system/lib/libart.so (_ZN3art6Thread4InitEPNS_10ThreadListEPNS_9JavaVMExtEPNS_9JNIEnvExtE+112)
    #03 pc 00354b65  /system/lib/libart.so (_ZN3art6Thread6AttachEPKcbP8_jobjectb+136)
    #04 pc 0032fc4f  /system/lib/libart.so (_ZN3art7Runtime19AttachCurrentThreadEPKcbP8_jobjectb+14)
    #05 pc 002586cb  /system/lib/libart.so (_ZN3art3JII19AttachCurrentThreadEP7_JavaVMPP7_JNIEnvPv+158)
    #06 pc 00089335  /data/app/com.test33.demo-1/lib/arm/libPlayCtrl.so (_ZN13CHKMediaCodec10OutputDataERl+64)
...
memory map:
    9c19f000-9c29cfff rw- 0 fe000  [stack:29663]
```
The source code and its disassembly at the faulting location:
```cpp
for (uint8_t* p = stack_top; p >= pregion; p -= kPageSize) {
  dont_optimize_this = *p;
}
```
```
34d826: f557 eb52   blx    a4ecc <mprotect@plt>
34d82a: 42b4        cmp    r4, r6
34d82c: bf98        it     ls
34d82e: 4633        movls  r3, r6
34d830: d807        bhi.n  34d842 <_ZN3art6Thread25InstallImplicitProtectionEv+0xfa>
34d832: 781a        ldrb   r2, [r3, #0]
```
The current PC is 0x34d832. That instruction loads a byte from the address held in r3 into r2; r3 points into the thread's stack, and the register contents look perfectly normal.
PERF
Now let's look at what the kernel was doing at that moment, and why it consumed so much CPU. Profiling the faulty thread with perf gives:
```
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ..............................
#
    29.60%  com.test33.demo  [kernel.kallsyms]  [k] do_page_fault
            |
            --- do_page_fault
                do_DataAbort
                __dabt_usr
                art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const

    26.83%  com.test33.demo  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
            |
            --- _raw_spin_unlock_irqrestore
               |
               |--52.88%-- do_page_fault
               |           do_DataAbort
               |           __dabt_usr
               |           art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
               |
               |--46.95%-- __down_read_trylock
               |           do_page_fault
               |           do_DataAbort
               |           __dabt_usr
               |           art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
                --0.17%-- [...]

     9.65%  com.test33.demo  libart.so          [.] art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
            |
            --- art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const

     7.86%  com.test33.demo  [kernel.kallsyms]  [k] _raw_spin_unlock
            |
            --- _raw_spin_unlock
                handle_pte_fault
                __handle_mm_fault
                handle_mm_fault
                do_page_fault
                do_DataAbort
                __dabt_usr
                art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
```
It looks like the program touched a stack page that was no longer resident, which triggered the kernel's page-fault handling to allocate a replacement page. But why did the kernel stay stuck in the page-fault path, never returning?
Free system memory was still plentiful, so a failed allocation can be ruled out.
GDB
Attaching with gdb and rewriting r3 shows the pattern: if r3 is made to point into a different page, the current instruction executes and the PC advances to the next one; if it still points into the faulty page, execution stays stuck on the current instruction:
```
(gdb) info register
r0             0x0      0
r1             0x1000   4096
r2             0x98     152
r3             0x9c29b000       2619977728
r4             0x9c19f000       2618945536
r5             0x9c29c564       2619983204
r6             0x9c29c000       2619981824
r7             0xb8b65468       3098956904
r8             0x9c1a0000       2618949632
r9             0x9c1a0000       2618949632
r10            0x9c19f000       2618945536
r11            0xffffffff       4294967295
r12            0xb8b65468       3098956904
sp             0x9c29c558       0x9c29c558
lr             0xb4c8982b       -1261922261
pc             0xb4c89832       0xb4c89832
cpsr           0x80070030       -2147024848
(gdb) set $r3=0x9c29c000
(gdb) si
0xb4c89834 in ?? ()
```
With r3 changed to point into another page, si executes successfully:
```
(gdb) set $r3=0x9c29bccc
(gdb) info register
r0             0x0      0
r1             0x1000   4096
r2             0x98     152
r3             0x9c29bccc       2619981004
r4             0x9c19f000       2618945536
r5             0x9c29c564       2619983204
r6             0x9c29c000       2619981824
r7             0xb8b65468       3098956904
r8             0x9c1a0000       2618949632
r9             0x9c1a0000       2618949632
r10            0x9c19f000       2618945536
r11            0xffffffff       4294967295
r12            0xb8b65468       3098956904
sp             0x9c29c558       0x9c29c558
lr             0xb4c8982b       -1261922261
pc             0xb4c89832       0xb4c89832
cpsr           0x80070030       -2147024848
(gdb) si
^C
Program received signal SIGINT, Interrupt.
```
With r3 set to a different address within the same faulty page, si hangs and has to be interrupted.
So certain individual pages were broken. But what was the root cause?
Workaround
I gave the customer a patch that disables the kernel's CMA borrowing mechanism (movable allocations falling back into CMA pageblocks); with it applied, long burn-in runs no longer reproduced the hang.
```diff
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf6c928..aaf8342 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -933,7 +933,7 @@ static int fallbacks[MIGRATE_TYPES][4] = {
 	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
 	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
 #ifdef CONFIG_CMA
-	[MIGRATE_MOVABLE]     = { MIGRATE_CMA,         MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
+	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
 	[MIGRATE_CMA]         = { MIGRATE_RESERVE }, /* Never used */
 #else
 	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
@@ -1360,7 +1360,7 @@ void free_hot_cold_page(struct page *page, int cold)
 	 * excessively into the page allocator
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
-		if (unlikely(is_migrate_isolate(migratetype))) {
+		if (unlikely(is_migrate_isolate(migratetype)) || is_migrate_cma(migratetype)) {
 			free_one_page(zone, page, 0, migratetype);
 			goto out;
 		}
```
Speculation
My guess: the app had borrowed a page from the CMA region as part of its thread stack. Video playback needs CMA memory, and when CMA runs short the kernel reclaims the pages it lent out, including the one behind the app's stack. When the app later touched that stack address, the page was gone, so the kernel's page-fault path tried to bring one back. Perhaps the code the app was executing was special in that it had to use that original page, but the page was now in use as a video buffer and could not be reclaimed, so the kernel stayed trapped in the page-fault path.
This guess is a stretch, but I had no more time to dig into the problem, so it will have to stay at that for now.