Star64_linux/drivers/gpu/drm/amd/amdkfd
Dennis Li a9a83a92d0 drm/kfd: fix a system crash issue during GPU recovery
The crash log is shown below:

[Thu Aug 20 23:18:14 2020] general protection fault: 0000 [#1] SMP NOPTI
[Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G           OE     5.4.0-42-generic #46~18.04.1-Ubuntu
[Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
[Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 [amdgpu]
[Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
[Thu Aug 20 23:18:14 2020] RSP: 0018:ffffb29b52f6fc90 EFLAGS: 00010213
[Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: ffff8a0d45ff3c00 RCX: ffff8a2d83e41038
[Thu Aug 20 23:18:14 2020] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8a0e2e4178c0
[Thu Aug 20 23:18:14 2020] RBP: ffffb29b52f6fcb0 R08: 0000000000001b64 R09: 0000000000000004
[Thu Aug 20 23:18:14 2020] R10: ffffb29b52f6fb78 R11: 0000000000000001 R12: ffff8a0d45ff3d28
[Thu Aug 20 23:18:14 2020] R13: ffff8a2d83e41028 R14: 0000000000000000 R15: 0000000000000000
[Thu Aug 20 23:18:14 2020] FS:  0000000000000000(0000) GS:ffff8a0e2e400000(0000) knlGS:0000000000000000
[Thu Aug 20 23:18:14 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Aug 20 23:18:14 2020] CR2: 000055c783c0e6a8 CR3: 00000034a1284000 CR4: 0000000000340ee0
[Thu Aug 20 23:18:14 2020] Call Trace:
[Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kfd_suspend_all_processes+0x60/0xf0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu]
[Thu Aug 20 23:18:14 2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu]
[Thu Aug 20 23:18:14 2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[Thu Aug 20 23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[Thu Aug 20 23:18:14 2020]  process_one_work+0x20f/0x400
[Thu Aug 20 23:18:14 2020]  worker_thread+0x34/0x410

When the GPU hangs, a user process fails to create a compute queue, and
the queue's struct object is freed later. However, the driver wrongly
adds this queue to the queue list of the process. kfd_process_evict_queues
then accesses the freed memory, which causes a system crash.

v2:
A failure of execute_queues should probably not be reported to
the caller of create_queue, because the queue was already created.
Therefore, ignore the return value from execute_queues.
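
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the driver's code: `struct queue`, `struct process_queues`, `user_create_queue`, and `evict_queues` are hypothetical stand-ins, and `execute_queues` and `create_queue` here are simplified functions that merely share names with the ones mentioned in the commit message. The point it tries to show is that once a queue is linked into the process's list, `create_queue` should not report failure, because the caller would then free an object that the list still references.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct queue {
    int id;
    struct queue *next;      /* link in the per-process queue list */
};

struct process_queues {
    struct queue *head;      /* list walked during eviction */
    int count;
    bool gpu_hung;           /* models a hang: mapping queues fails */
};

/* Models execute_queues: fails while the GPU is hung. */
static int execute_queues(struct process_queues *pq)
{
    return pq->gpu_hung ? -1 : 0;
}

/*
 * Pattern from the v2 note: after the queue is linked into the list it
 * counts as created, so a later execute_queues failure is not propagated.
 * Returning that error instead would make the caller free a queue that
 * is still reachable from the list.
 */
static int create_queue(struct process_queues *pq, struct queue *q)
{
    assert(pq != NULL && q != NULL);

    q->next = pq->head;
    pq->head = q;
    pq->count++;

    (void)execute_queues(pq);   /* result intentionally ignored */
    return 0;
}

/* Caller: frees the queue only if creation is reported as failed. */
static struct queue *user_create_queue(struct process_queues *pq, int id)
{
    struct queue *q = calloc(1, sizeof(*q));

    if (!q)
        return NULL;
    q->id = id;

    if (create_queue(pq, q) != 0) {
        free(q);
        return NULL;
    }
    return q;
}

/* Models eviction: visits every queue on the list and counts them. */
static int evict_queues(const struct process_queues *pq)
{
    int visited = 0;

    for (const struct queue *q = pq->head; q != NULL; q = q->next)
        visited++;
    return visited;
}
```

In this model, calling `user_create_queue` while `gpu_hung` is set still leaves a valid queue on the list, so a later `evict_queues` walk only touches live memory.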

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:23:18 -04:00
cik_event_interrupt.c drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
cik_int.h
cik_regs.h
cwsr_trap_handler.h drm/amdkfd: Fix spurious debug exception on gfx10 2020-08-10 17:26:51 -04:00
cwsr_trap_handler_gfx8.asm
cwsr_trap_handler_gfx9.asm
cwsr_trap_handler_gfx10.asm drm/amdkfd: Fix spurious debug exception on gfx10 2020-08-10 17:26:51 -04:00
Kconfig
kfd_chardev.c drm/amdkfd: implement the dGPU fallback path for apu (v6) 2020-08-26 16:40:17 -04:00
kfd_crat.c drm/amdkfd: implement the dGPU fallback path for apu (v6) 2020-08-26 16:40:17 -04:00
kfd_crat.h
kfd_dbgdev.c
kfd_dbgdev.h
kfd_dbgmgr.c
kfd_dbgmgr.h
kfd_debugfs.c
kfd_device.c drm/amdkfd: Add GPU reset SMI event 2020-08-31 14:40:03 -04:00
kfd_device_queue_manager.c drm/kfd: fix a system crash issue during GPU recovery 2020-09-15 17:23:18 -04:00
kfd_device_queue_manager.h drm/amdkfd: sparse: Fix warning in reading SDMA counters 2020-08-24 12:22:32 -04:00
kfd_device_queue_manager_cik.c
kfd_device_queue_manager_v9.c drm/amdkfd: implement the dGPU fallback path for apu (v6) 2020-08-26 16:40:17 -04:00
kfd_device_queue_manager_v10.c
kfd_device_queue_manager_vi.c
kfd_doorbell.c
kfd_events.c
kfd_events.h
kfd_flat_memory.c drm/amdkfd: implement the dGPU fallback path for apu (v6) 2020-08-26 16:40:17 -04:00
kfd_int_process_v9.c drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
kfd_interrupt.c
kfd_iommu.c drm/amdkfd: implement the dGPU fallback path for apu (v6) 2020-08-26 16:40:17 -04:00
kfd_iommu.h
kfd_kernel_queue.c
kfd_kernel_queue.h
kfd_module.c
kfd_mqd_manager.c
kfd_mqd_manager.h
kfd_mqd_manager_cik.c
kfd_mqd_manager_v9.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_mqd_manager_v10.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_mqd_manager_vi.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_packet_manager.c drm/amdkfd: Support navy_flounder KFD 2020-07-15 12:46:55 -04:00
kfd_packet_manager_v9.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_packet_manager_vi.c drm/amdkfd: Update hardware scheduling time quanta 2020-07-02 12:02:55 -04:00
kfd_pasid.c drm/amdkfd: Remove redundant kfd2kgd interface lookup 2020-07-08 09:02:54 -04:00
kfd_pm4_headers.h
kfd_pm4_headers_ai.h
kfd_pm4_headers_diq.h
kfd_pm4_headers_vi.h
kfd_pm4_opcodes.h
kfd_priv.h drm/amdkfd: Add GPU reset SMI event 2020-08-31 14:40:03 -04:00
kfd_process.c drm/amdkfd: sparse: Fix warning in reading SDMA counters 2020-08-24 12:22:32 -04:00
kfd_process_queue_manager.c
kfd_queue.c
kfd_smi_events.c drm/amdkfd: Add GPU reset SMI event 2020-08-31 14:40:03 -04:00
kfd_smi_events.h drm/amdkfd: Add GPU reset SMI event 2020-08-31 14:40:03 -04:00
kfd_topology.c drm/amdkfd: fix set kfd node ras properties value 2020-08-26 16:40:19 -04:00
kfd_topology.h
Makefile drm/amdkfd: Provide SMI events watch 2020-07-15 13:27:34 -04:00
soc15_int.h