Error with training logistic regression model on Apache Spark. SPARK-5063

【Question title】: Error with training logistic regression model on Apache Spark. SPARK-5063
【Posted】: 2015-11-18 16:55:37
【Question description】:

I am trying to build a logistic regression model with Apache Spark. Here is the code.

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

parsedData = raw_data.map(mapper)  # mapper returns a LabeledPoint (label plus feature vector)
featureVectors = parsedData.map(lambda point: point.features)  # extract the feature vectors from the parsed data
scaler = StandardScaler(True, True).fit(featureVectors)  # fit a standardization model (withMean=True, withStd=True)
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features)))  # transform the features to zero mean and unit standard deviation
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations=10)

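But I get this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I am not sure how to work around this. Any help would be appreciated.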

【Question discussion】:

Tags: python apache-spark pyspark apache-spark-mllib logistic-regression

【Solution 1】:

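The problem you see is pretty much the same as the one I described in "How to use Java/Scala function from an action or a transformation?". In PySpark, scaler.transform has to call a Scala function, and that call needs access to the SparkContext. Because you invoke it inside the lambda passed to map, it runs on the workers, where the SparkContext is not available, hence the error.

The standard way to handle this is to process only the required part of your data on the driver side as a whole RDD, and then zip the results.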

labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(...)
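
The steps above can be wrapped into one helper. This is only a sketch: the function name scale_labeled_points and its parameter names are made up for illustration, and it assumes parsed_data is an RDD of LabeledPoint as in the question. Both fit and transform are called with whole RDDs, so nothing inside a lambda touches the scaler.

```python
def scale_labeled_points(parsed_data, with_mean=True, with_std=True):
    """Return an RDD of LabeledPoint with standardized features.

    Hypothetical helper: `parsed_data` is assumed to be an RDD of
    LabeledPoint, as produced by `raw_data.map(mapper)` in the question.
    """
    # Imported lazily so that merely defining the helper does not require Spark.
    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.regression import LabeledPoint

    labels = parsed_data.map(lambda point: point.label)
    features = parsed_data.map(lambda point: point.features)

    # fit() and transform() both receive whole RDDs, so the calls into
    # the JVM are made from the driver and never from inside a closure.
    scaler = StandardScaler(withMean=with_mean, withStd=with_std).fit(features)
    scaled_features = scaler.transform(features)

    # zip() pairs elements by position. Both RDDs come from the same parent
    # through map(), so partition counts and sizes should line up.
    return labels.zip(scaled_features).map(
        lambda pair: LabeledPoint(pair[0], pair[1]))
```

Usage would look like scaledData = scale_labeled_points(parsedData). Since parsed_data is traversed more than once, calling parsedData.cache() beforehand may avoid recomputing the mapper.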

If you don't plan to implement your own methods based on MLlib components, it may be easier to use the high-level ML API.
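
For reference, a rough sketch of what the high-level route could look like. The function name build_scaling_lr_pipeline and the column names "features", "label" and "scaledFeatures" are assumptions for illustration. It expects a DataFrame with a vector column and a numeric label column rather than an RDD of LabeledPoint.

```python
def build_scaling_lr_pipeline(features_col="features", label_col="label"):
    """Build a Pipeline that standardizes features, then fits logistic regression.

    Hypothetical helper; the column names are placeholders.
    """
    # Imported lazily so that merely defining the helper does not require Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StandardScaler

    # Scale each feature to zero mean and unit standard deviation.
    scaler = StandardScaler(inputCol=features_col, outputCol="scaledFeatures",
                            withMean=True, withStd=True)
    # Train on the scaled column produced by the previous stage.
    lr = LogisticRegression(featuresCol="scaledFeatures", labelCol=label_col,
                            maxIter=10)
    return Pipeline(stages=[scaler, lr])
```

With a training DataFrame df, model = build_scaling_lr_pipeline().fit(df) would fit both stages, and model.transform(df) would apply scaling and prediction together, so there is no need to call the scaler inside a lambda.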

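Edit:

There are two possible problems here.

At this point LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multiclass classification, you can replace it with LogisticRegressionWithLBFGS.
StandardScaler supports only dense vectors when centering (withMean=True, as in the question), so it has limited applications.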
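
To check whether the first point applies, one option is to look at the distinct labels before training. The helper below is an illustrative sketch (find_non_binary_labels is not part of Spark). It works on a plain list, which could be obtained with parsedData.map(lambda point: point.label).distinct().collect().

```python
def find_non_binary_labels(distinct_labels):
    """Return the sorted labels that are neither 0.0 nor 1.0.

    An empty result means the labels fit a binary (0/1) classifier.
    """
    allowed = {0.0, 1.0}
    return sorted(set(float(label) for label in distinct_labels) - allowed)


# Plain-Python example with labels from a four-class problem.
print(find_non_binary_labels([0.0, 1.0, 2.0, 3.0]))  # [2.0, 3.0]
```

If the result is not empty, the data is not a 0/1 problem. In that case LogisticRegressionWithLBFGS.train with its numClasses argument set to the number of classes is probably the better fit, assuming a Spark version whose Python API exposes that argument.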

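【Discussion】:

It gives this error. I have never seen this error before.
It works just fine on 1.4.1. I'll download 1.3.1 later and check whether I can reproduce the problem. StandardScaler won't work on sparse data, but that doesn't look like the issue here.
This solution sounds logical and correct to me, which is why I am surprised by this error.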
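I think I know what is going on. Your data does not appear to be binary, but LogisticRegressionWithSGD expects a binary classification problem. Could you check the logs for this message?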
Spark seems to be complaining about the input, but that seems to be it.