Java process not responding but resuming after I do a thread dump with jstack -F

【Question title】: Java process not responding but resuming after I do a thread dump with jstack -F
【Posted】: 2021-07-15 12:50:17
【Question】:

I am running into a strange problem: a Java process gets stuck (once or twice a day), and it only resumes after I run:

jstack -F ${PID}

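While the Java process is stuck, trying to take a thread dump with jcmd fails with an AttachNotSupportedException.

I have only managed to take a thread dump with jstack -F, using a JDK whose version matches the JRE that was used to start the Java process.

The only explanation I can think of is that the OS scheduler is not giving the Java process any CPU time, and that running jstack -F somehow forces it to let the process run again?

Any help would be appreciated.

Regards,

Cristi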

UPDATE-1

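It happened again today. The first thing I checked was the memory in use on that box (99.1%). After that I ran jmap -heap, and after the heap dump the process resumed without any problems. The heap dump is attached.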

jmap -heap 7703
Attaching to process ID 7703, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.162-b12

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 536870912 (512.0MB)
   NewSize                  = 89128960 (85.0MB)
   MaxNewSize               = 178782208 (170.5MB)
   OldSize                  = 179306496 (171.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 143130624 (136.5MB)
   used     = 73244792 (69.85167694091797MB)
   free     = 69885832 (66.64832305908203MB)
   51.1733897003062% used
From Space:
   capacity = 17825792 (17.0MB)
   used     = 8176960 (7.79815673828125MB)
   free     = 9648832 (9.20184326171875MB)
   45.871510225183826% used
To Space:
   capacity = 17825792 (17.0MB)
   used     = 0 (0.0MB)
   free     = 17825792 (17.0MB)
   0.0% used
PS Old Generation
   capacity = 243269632 (232.0MB)
   used     = 23534032 (22.443801879882812MB)
   free     = 219735600 (209.5561981201172MB)
   9.674052534432247% used

25964 interned Strings occupying 2759784 bytes.

UPDATE-2

After enabling GC logging, this is the tail of the GC log at the moment the process freezes.

2020-09-02T06:51:11.286+0000: 86020.549: Total time for which application threads were stopped: 0.0001978 seconds, Stopping threads took: 0.0000666 seconds
2020-09-02T06:51:11.286+0000: 86020.550: Application time: 0.0000610 seconds
2020-09-02T06:51:11.286+0000: 86020.550: Total time for which application threads were stopped: 0.0001793 seconds, Stopping threads took: 0.0000589 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Application time: 0.0003371 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Total time for which application threads were stopped: 0.0001749 seconds, Stopping threads took: 0.0000283 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Application time: 0.0001277 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Total time for which application threads were stopped: 0.0001554 seconds, Stopping threads took: 0.0000364 seconds
2020-09-02T06:51:11.287+0000: 86020.551: Application time: 0.0000400 seconds
2020-09-02T06:51:11.287+0000: 86020.551: Total time for which application threads were stopped: 0.0001082 seconds, Stopping threads took: 0.0000158 seconds
2020-09-02T06:51:11.288+0000: 86020.552: Application time: 0.0010649 seconds
2020-09-02T06:51:11.288+0000: 86020.552: Total time for which application threads were stopped: 0.0001945 seconds, Stopping threads took: 0.0000571 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0001078 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Total time for which application threads were stopped: 0.0001852 seconds, Stopping threads took: 0.0000336 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0000366 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Total time for which application threads were stopped: 0.0000910 seconds, Stopping threads took: 0.0000180 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0000412 seconds
2020-09-02T06:51:11.289+0000: 86020.553: Total time for which application threads were
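
One way to check whether a long GC or safepoint pause precedes the freeze is to scan the log for the largest "application threads were stopped" value. The following is a minimal sketch, not part of the original post: the class name `GcPauseScan` and the method `maxPauseSeconds` are hypothetical, and it assumes the log uses exactly the line format shown above (as written with `-XX:+PrintGCApplicationStoppedTime` on Java 8).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post): finds the longest
// "application threads were stopped" pause in a Java 8 style GC log.
final class GcPauseScan {
    // Matches only the "stopped:" value, not the "Stopping threads took:" value.
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9]+\\.[0-9]+) seconds");

    static double maxPauseSeconds(List<String> lines) {
        double max = 0.0;
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                max = Math.max(max, Double.parseDouble(m.group(1)));
            }
        }
        return max;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the GC log path is passed as the first argument.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.printf("longest stop: %.7f s%n", maxPauseSeconds(lines));
    }
}
```

If the longest pause stays in the sub-millisecond range, as it does in every complete line of the tail above, a long stop-the-world pause becomes a less likely explanation for the freeze, although the log alone cannot rule it out.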

【Comments on the question】:

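It looks like I can also resume the process by sending kill -SIGCONT ${PID}, which suggests the kernel may have sent the process a SIGSTOP, possibly because of a lack of resources. The machine on which I see this problem is under heavy load.
At first this looked like the problem described in stackoverflow.com/questions/34251580/… but that bug appears to be fixed in the kernel version I am currently using: [root@hostname /]# rpm -q --changelog kernel-2.6.32-754.29.1.el6.x86_64 | grep 'get_futex_key_refs' - [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1167405]
We have been struggling with a similar problem, and it was indeed a kernel bug. Upgrading to Linux 4.x helped.
One possibility is that a full GC was running at that time. You may want to switch the GC to CMS and take a thread dump while it happens, to see whether some code is creating many objects or whether there is a memory leak with objects that are never collected.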
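
The SIGSTOP hypothesis from the first comment can be checked the next time the process hangs by reading the `State:` line of `/proc/<pid>/status`, where Linux reports `T (stopped)` for a process stopped by a signal. The sketch below is illustrative only: the class `ProcState` and its methods are hypothetical names, and it assumes a Linux host where `/proc` is readable for the target PID.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper (not from the original thread): reports the kernel's
// view of a process state, e.g. R (running), S (sleeping), T (stopped).
final class ProcState {
    // Returns the one-letter state code from the lines of /proc/<pid>/status,
    // or '?' if no State: line is present.
    static char stateCode(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("State:")) {
                String rest = line.substring("State:".length()).trim();
                if (!rest.isEmpty()) {
                    return rest.charAt(0);
                }
            }
        }
        return '?';
    }

    static boolean isStopped(List<String> statusLines) {
        return stateCode(statusLines) == 'T';
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the PID of the hung Java process is the first argument.
        List<String> lines = Files.readAllLines(Paths.get("/proc", args[0], "status"));
        System.out.println("state=" + stateCode(lines) + " stopped=" + isStopped(lines));
    }
}
```

A `T` state while the process is unresponsive would be consistent with it having been stopped by a signal. A state of `S` or `D` would point elsewhere, for example at blocked threads or I/O.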

Tags: java unix jvm kernel freeze

【Answer 1】:

If you get an AttachNotSupportedException when taking a thread dump with jcmd, try running jcmd as the same user that the Java process runs as. See com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not responding or HotSpot VM not loaded
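
To see whether the user running jcmd matches the owner of the target process, one option on Linux is to compare the owner of `/proc/<pid>` with the current user name. This is a rough sketch under stated assumptions: the class `AttachOwnerCheck` and its methods are hypothetical, and it assumes both accounts resolve to names (in containers without a passwd entry the comparison may be misleading).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not from the original answer): compares the owner of
// a process directory under /proc with the user running this program.
final class AttachOwnerCheck {
    // Returns the owner name of the given path as reported by the file system.
    static String ownerOf(Path path) throws IOException {
        return Files.getOwner(path).getName();
    }

    static boolean sameUser(String processOwner, String currentUser) {
        return processOwner != null && processOwner.equals(currentUser);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the target PID is passed as the first argument (Linux only).
        String owner = ownerOf(Paths.get("/proc", args[0]));
        String me = System.getProperty("user.name");
        if (sameUser(owner, me)) {
            System.out.println("same user: " + me);
        } else {
            System.out.println("process owner is " + owner + ", current user is " + me);
        }
    }
}
```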

【Comments】: