【Posted】: 2020-04-30 22:49:45
【Question】:
I need some advice on a problem I am seeing with GitLab pipelines running on k8s 1.14. Many jobs fail with exit code 137, which I found means the container was killed abruptly.
Cluster information:
Kubernetes version: 1.14
Cloud in use: AWS EKS
Nodes: C5.4xLarge
After digging deeper, I found the following logs:
**kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
**kubelet: E0114 03:37:08.653132** 4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes
**kubelet: W0114 03:37:23.240990** 4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up
**kubelet: W0114 00:15:51.106881** 4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.106907** 4781 container_gc.go:85] attempting to delete unused containers
**kubelet: I0114 00:15:51.116286** 4781 image_gc_manager.go:317] attempting to delete unused images
**kubelet: I0114 00:15:51.130499** 4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.130648** 4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:
1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)
The pods are then terminated, which produces the exit code 137.
Can anyone help me understand the cause and suggest a possible fix?
Thank you :)
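As a side note on the exit code itself: by the usual shell convention, a status above 128 means the process was killed by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). The sketch below only decodes that convention. The function name `describe_exit_code` is made up for illustration, and the code cannot tell who sent the signal (the OOM killer, a kubelet eviction, or something else).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the common 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform; fall back to the number.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))
```

On Linux, signal 9 is SIGKILL, which is consistent with a pod being killed forcibly rather than exiting with an application error.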
【Comments】:
-
>> Exit code 137 means "out of memory". From the logs above, image garbage collection was triggered because the default threshold was exceeded: --image-gc-high-threshold=90 and --image-gc-low-threshold=80
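To make the threshold arithmetic concrete, here is a rough sketch that assumes the amount to free is simply the used bytes minus the low-threshold share of the filesystem capacity. The function names are hypothetical, and the capacity is back-calculated from the logged figures (95% usage, 80% low threshold, 3022784921 bytes to free); it is not stated anywhere in the logs, and the logged percentage is rounded, so treat the result as an estimate.

```python
def bytes_to_free(capacity: int, used: int, low_threshold_pct: int) -> int:
    """Bytes that must be reclaimed to bring usage down to the low threshold."""
    target_used = capacity * low_threshold_pct // 100
    return max(0, used - target_used)


def estimate_capacity(to_free: int, usage_pct: float, low_threshold_pct: float) -> float:
    """Back out the filesystem capacity from a logged 'trying to free' amount."""
    return to_free / ((usage_pct - low_threshold_pct) / 100)


# Figures taken from the kubelet log line in the question.
logged_to_free = 3022784921
capacity = estimate_capacity(logged_to_free, usage_pct=95, low_threshold_pct=80)
print(f"estimated image filesystem capacity: {capacity / 1e9:.1f} GB")
```

Under these assumptions the image filesystem comes out at roughly 20 GB, so the 15-point gap between 95% and 80% is about 3 GB that image garbage collection wanted to free but could not.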
-
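Hey @D.T. Yes. Can you explain how to keep the pods from being terminated? I checked memory and they have 20G available, and I checked the nodes for memory and disk pressure and they have plenty of headroom. I don't understand why pods are being terminated to reclaim ephemeral storage.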
-
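"Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%)." > "failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes". Can you add some disk space? Also, do you have any quotas set?
kubectl describe quota
-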
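@PjoterS No quotas or limit ranges are applied. I have already increased the disk space to 50GB. I confirmed there is no disk pressure by looking at "Taints" and "Events" in the output of "kubectl describe nodes". I also checked the output of "kubectl top nodes" to see whether memory and CPU were under pressure, but they appear to be under control.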
Tags: linux kubernetes kubernetes-pod amazon-eks