【Question Title】: Kafka producer/consumer are opening too many file descriptors
【Posted】: 2025-12-23 15:40:07
【Question】:

We have a 3-node Kafka cluster deployment with 5 topics and 6 partitions per topic. We have configured a replication factor of 3, and we are seeing a very strange issue: the number of open file descriptors has exceeded the ulimit (50K for our application).

As per the lsof command and our analysis:

1. There are about 15K established connections from the Kafka producers/consumers towards the brokers, and at the same time we observed thousands of entries for the Kafka 'admin-client-network-thread' in the thread dump:

"admin-client-network-thread" #224398 daemon prio=5 os_prio=0 tid=0x00007f12ca119800 nid=0x5363 runnable [0x00007f12c4db8000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000005e0603238> (a sun.nio.ch.Util$3)
- locked <0x00000005e0603228> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000005e0602f08> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.kafka.common.network.Selector.select(Selector.java:672)
at org.apache.kafka.common.network.Selector.poll(Selector.java:396)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:238)
- locked <0x00000005e0602dc0> (a org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:205)
at kafka.admin.AdminClient$$anon$1.run(AdminClient.scala:61)
at java.lang.Thread.run(Thread.java:748)


2. As per the lsof output, we observed about 35K entries for pipes and event polls:

java    5441 app  374r     FIFO                0,9      0t0  22415240 pipe
java    5441 app  375w     FIFO                0,9      0t0  22415240 pipe
java    5441 app  376u  a_inode               0,10        0      6379 [eventpoll]
java    5441 app  377r     FIFO                0,9      0t0  22473333 pipe
java    5441 app  378r     FIFO                0,9      0t0  28054726 pipe
java    5441 app  379r     FIFO                0,9      0t0  22415241 pipe
java    5441 app  380w     FIFO                0,9      0t0  22415241 pipe
java    5441 app  381u  a_inode               0,10        0      6379 [eventpoll]
java    5441 app  382w     FIFO                0,9      0t0  22473333 pipe
java    5441 app  383u  a_inode               0,10        0      6379 [eventpoll]
java    5441 app  384u  a_inode               0,10        0      6379 [eventpoll]
java    5441 app  385r     FIFO                0,9      0t0  40216087 pipe
java    5441 app  386r     FIFO                0,9      0t0  22483470 pipe
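The pipe/[eventpoll] pattern above is characteristic of java.nio selectors: on Linux, each Selector in this JDK is typically backed by one epoll instance plus a wakeup pipe, i.e. roughly three descriptors per selector, so ~35K such entries correspond to on the order of 11-12K live selectors (roughly one per leaked network thread). A minimal, Linux-only sketch that makes this visible via /proc/self/fd (the exact per-selector count is an assumption about the epoll selector implementation):

```java
import java.io.File;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

public class SelectorFdDemo {
    // Count this process's open file descriptors (Linux-only, via procfs).
    static int openFds() {
        File[] fds = new File("/proc/self/fd").listFiles();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) throws Exception {
        int before = openFds();
        List<Selector> selectors = new ArrayList<>();
        // Each Selector is backed by an epoll instance plus a wakeup pipe,
        // so it typically shows up in lsof as one [eventpoll] entry and
        // two pipe entries (read end + write end).
        for (int i = 0; i < 100; i++) {
            selectors.add(Selector.open());
        }
        int during = openFds();
        System.out.println("fds added by 100 selectors: " + (during - before));
        for (Selector s : selectors) {
            s.close(); // releases the epoll fd and both pipe ends
        }
        System.out.println("fds still held after close: " + (openFds() - before));
    }
}
```

If the first number printed is about three per selector and drops back near zero after close(), the lsof pattern in the question is consistent with selectors (and hence their owning threads) never being closed.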


Setup details:
Apache Kafka client: 1.0.1
Kafka version: 1.0.1
OpenJDK: java-1.8.0-openjdk-1.8.0.222.b10-1
CentOS version: CentOS Linux release 7.6.1810

Note: after restarting the VM, the file descriptor count cleared and returned to a normal value of about 1000. A few seconds later the count started to increase again, and it reaches 50K (the limit) after about one week even when the system is idle.

【Comments】:

    Tags: apache-kafka centos kafka-consumer-api kafka-producer-api


    【Solution 1】:

    This problem is caused by using the deprecated kafka.admin.AdminClient API. Instead, org.apache.kafka.clients.admin.AdminClient can be used to fetch the same kind of information from Kafka. The new API has equivalent methods and provides the same functionality as the legacy one.

    With the legacy API (kafka.admin.AdminClient), many daemon threads named 'admin-client-network-thread' show up in the thread dump. The legacy API does not manage its network thread well: a new 'admin-client-network-thread' daemon thread is created for every call, and none of them ever terminates. That is why such a huge number of file descriptors is observed at both the process and the system level.
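    The leak pattern itself can be modeled with plain JDK classes. In the sketch below, LeakyClient is a hypothetical stand-in for the legacy client (it is not Kafka's API): each instance starts a daemon thread that never terminates unless close() is called, so creating one per call leaks threads, while try-with-resources guarantees cleanup — the same discipline that org.apache.kafka.clients.admin.AdminClient (which is AutoCloseable) requires.

```java
public class AdminClientLeakDemo {
    // Hypothetical stand-in for the legacy kafka.admin.AdminClient:
    // each instance starts a daemon network thread that runs until close().
    static class LeakyClient implements AutoCloseable {
        private volatile boolean running = true;
        private final Thread networkThread;

        LeakyClient() {
            networkThread = new Thread(() -> {
                while (running) {
                    try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                }
            }, "admin-client-network-thread");
            networkThread.setDaemon(true);
            networkThread.start();
        }

        @Override
        public void close() {
            running = false;
            networkThread.interrupt();
        }
    }

    // Count live threads whose name matches the leaked pattern.
    static long adminThreadCount() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith("admin-client-network-thread"))
                .count();
    }

    public static void main(String[] args) throws Exception {
        // Anti-pattern: one client per call, never closed -> threads pile up
        // (and in the real client, their selector/pipe fds stay open too).
        for (int i = 0; i < 5; i++) {
            new LeakyClient();
        }
        System.out.println("leaked threads: " + adminThreadCount());

        // Correct pattern: try-with-resources guarantees close().
        try (LeakyClient client = new LeakyClient()) {
            // ... use the client ...
        }
    }
}
```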

    【Discussion】:

      【Solution 2】:

      The broker creates and maintains a file handle for every log segment file and every network connection. If the broker hosts many partitions and those partitions have many log segment files, the total can be very large. The same applies to network connections.

      The number of segment files is the number of partitions multiplied by a factor that depends on the retention policy. The default retention policy is to roll a new segment after one week (or 1GB, whichever comes first) and to delete a segment once all of its data is older than one week.
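      As a rough, back-of-the-envelope estimate for the setup in the question (5 topics x 6 partitions, replication factor 3, spread over 3 brokers), assuming a hypothetical 7 segments per partition and 3 handles per segment (.log, .index, .timeindex — the number of segments is an assumption, not from the question):

```java
public class SegmentHandleEstimate {
    // Estimate broker-side file handles for log segments, assuming
    // partition replicas are spread evenly across brokers.
    static long brokerSegmentHandles(int topics, int partitionsPerTopic,
                                     int replicationFactor, int brokers,
                                     int segmentsPerPartition,
                                     int filesPerSegment) {
        long replicasPerBroker =
                (long) topics * partitionsPerTopic * replicationFactor / brokers;
        return replicasPerBroker * segmentsPerPartition * filesPerSegment;
    }

    public static void main(String[] args) {
        // 5 topics x 6 partitions x RF 3 = 90 replicas, i.e. 30 per broker.
        // 30 replicas x 7 segments x 3 files = 630 handles.
        System.out.println(brokerSegmentHandles(5, 6, 3, 3, 7, 3)); // prints 630
    }
}
```

      A few hundred broker-side handles is far below the 50K observed in the question, which again points at the client-side thread/selector leak rather than the broker's segment files.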

      I do not immediately see how setting a larger file-max could degrade performance, but page-cache misses do matter.

      【Discussion】: