Getting BusyPoolException in com.datastax.spark.connector.writer.QueryExecutor: what am I doing wrong?

【Question title】: Getting BusyPoolException in com.datastax.spark.connector.writer.QueryExecutor: what am I doing wrong?
【Posted】: 2020-01-11 21:53:30
【Question description】:

I am using spark-sql-2.4.1, spark-cassandra-connector_2.11-2.4.1, Java 8, and Apache Cassandra 3.0.

I run spark-submit on my Spark cluster with the following settings to load 2 billion records.

--executor-cores 3 
--executor-memory 9g 
--num-executors 5 
--driver-cores 2 
--driver-memory 4g

I use the following configuration:

cassandra.concurrent.writes=1500
cassandra.output.batch.size.rows=10
cassandra.output.batch.size.bytes=2048
cassandra.output.batch.grouping.key=partition 
cassandra.output.consistency.level=LOCAL_QUORUM
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128

The job takes about 2 hours, which is really long.

When I check the logs, I see the warning com.datastax.spark.connector.writer.QueryExecutor - BusyPoolException.

How can I fix this?

【Question discussion】:

Tags: apache-spark cassandra apache-spark-sql datastax-java-driver spark-cassandra-connector

【Solution 1】:

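The value of cassandra.concurrent.writes is incorrect: it means you are sending 1500 batches concurrently. By default, however, the Java driver allows 1024 simultaneous requests per connection. In general, setting this parameter too high can overload the nodes, which in turn causes task retries.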
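As a rough illustration of the mismatch described above, the following Python sketch compares a requested write concurrency with the 1024-request default mentioned in the answer. The function `check_concurrent_writes` and the constant `DRIVER_DEFAULT_MAX_REQUESTS` are hypothetical names, not part of the connector or the driver; this is only a sanity check one might run before submitting a job, assuming the driver's pooling options are left at their defaults.

```python
# Default number of simultaneous requests cited in the answer above.
# Assumption: the driver's pooling options have not been changed.
DRIVER_DEFAULT_MAX_REQUESTS = 1024


def check_concurrent_writes(concurrent_writes, limit=DRIVER_DEFAULT_MAX_REQUESTS):
    """Return a warning string if the setting exceeds the limit, else None."""
    if concurrent_writes > limit:
        return (
            f"concurrent writes ({concurrent_writes}) exceed the driver limit "
            f"of {limit}; this may lead to BusyPoolException and task retries"
        )
    return None


# The value from the question is above the default limit, so a warning is returned.
print(check_concurrent_writes(1500))
# A much smaller value passes the check and yields None.
print(check_concurrent_writes(5))
```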

Other settings are also incorrect: if you specify cassandra.output.batch.size.rows, its value overrides cassandra.output.batch.size.bytes. For details, see the corresponding section of the Spark Cassandra Connector reference.
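The precedence rule above can be sketched as a small Python function. It is a hypothetical helper (`effective_batch_limit` is not a connector API) that only mirrors the rule as stated in the answer: a row count, when given, wins over the byte size.

```python
def effective_batch_limit(size_rows=None, size_bytes=None):
    """Mirror the rule from the answer: rows, if set, override bytes.

    Returns a (unit, value) tuple, or None when neither limit is given.
    """
    if size_rows is not None:
        return ("rows", size_rows)
    if size_bytes is not None:
        return ("bytes", size_bytes)
    return None


# With the question's settings, the 2048-byte limit has no effect.
print(effective_batch_limit(size_rows=10, size_bytes=2048))  # ('rows', 10)
# Dropping the row setting lets the byte limit apply.
print(effective_batch_limit(size_bytes=2048))  # ('bytes', 2048)
```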

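One aspect of performance tuning is having the right number of Spark partitions so that you achieve good parallelism, but this really depends on your code, how many nodes are in the Cassandra cluster, and so on.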

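P.S. Also note that configuration parameters should start with spark.cassandra., not just cassandra. If you specify them in the latter form, they are ignored and the default values are used.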
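To make the prefix rule concrete, here is a minimal Python sketch that separates keys lacking the `spark.cassandra.` prefix from the rest and renders the recognized settings as `--conf` arguments for spark-submit. The helper names `split_by_prefix` and `to_conf_args` are hypothetical. The prefixed key names in the usage example are believed to be the ones documented for the 2.4 line of the connector (for example `spark.cassandra.output.concurrent.writes`); treat them as an assumption and check them against the connector reference. The values are placeholders, not tuned recommendations.

```python
PREFIX = "spark.cassandra."


def split_by_prefix(settings):
    """Split settings into (recognized, ignored) dicts based on the prefix."""
    recognized = {k: v for k, v in settings.items() if k.startswith(PREFIX)}
    ignored = {k: v for k, v in settings.items() if not k.startswith(PREFIX)}
    return recognized, ignored


def to_conf_args(settings):
    """Render settings as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


settings = {
    # Written as in the question: lacks the prefix, so it would be ignored.
    "cassandra.output.consistency.level": "LOCAL_QUORUM",
    # Assumed connector key names; values are placeholders.
    "spark.cassandra.output.concurrent.writes": 5,
    "spark.cassandra.output.batch.grouping.key": "partition",
}

recognized, ignored = split_by_prefix(settings)
print(sorted(ignored))           # keys that need the prefix added
print(to_conf_args(recognized))  # arguments to pass to spark-submit
```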

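【Discussion】:

@BdLearner The partition key should be chosen based on the access patterns used to read data from Cassandra. This really depends on your use case, and it is hard to say in general. I recommend taking the DS220 course on academy.datastax.com, which has examples of how to design partition keys correctly.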