sparkR 1.6: How to predict probability when modeling with glm (binomial family)

【Question title】: sparkR 1.6: How to predict probability when modeling with glm (binomial family)
【Posted】: 2016-06-25 19:03:11
【Question description】:

I just installed sparkR 1.6.1 on CentOS, without Hadoop. My code for modeling data with a discrete 'TARGET' value is as follows:

# 'tr' is an R data frame with 104 numeric columns and one TARGET column
#    TARGET column is either 0 or 1
# Convert 'tr' to a Spark data frame
train <- createDataFrame(sqlContext, tr)

# 'test' is an R data frame without the TARGET column
# Convert 'test' to a Spark data frame
te <- createDataFrame(sqlContext, test)

# Use SparkR's glm to model the data
model <- glm(TARGET ~ ., data = train, family = "binomial")

# Make predictions
predictions <- predict(model, newData = te)

I am able to get the predicted success or failure as follows (I hope this is correct):

modelPrediction <- select(predictions, "prediction")
head(modelPrediction)

  prediction
1          0
2          0
3          0
4          0
5          0
6          0

But when I try to get the probabilities, I get the following:

modelPrediction <- select(predictions, "probability")
head(modelPrediction)

                probability
1 <environment: 0x6188e1c0>
2 <environment: 0x61894b88>
3 <environment: 0x6189a620>
4 <environment: 0x618a00b8>
5 <environment: 0x618a5b50>
6 <environment: 0x618ac550>

Please help me get the probability values for the test rows. Thanks.

【Question comments】:

Please include the output of head(predictions).

Tags: sparkr

【Solution 1】:

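Background: when your R code asks the Spark backend for the result of some computation, Spark performs the computation and serializes the result. The result is then deserialized on the R side, which gives you an R object.

On the Spark backend, it works like this: if the type of the object to be returned is one of Character, String, Long, Float, Double, Integer, Boolean, Date, TimeStamp, or an Array of these, etc., the backend serializes the object. If the type matches none of these, it simply assigns the object an id, stores the object in memory against that id, and sends the id to the R client. (JVMObjectTracker in RBackendHandler is responsible for keeping track of JVM objects on the Spark backend.) On the R side, the id is deserialized into an object of class jobj. (See the writeObject method of SerDe.scala for the complete picture of what is serialized up front and what is not.)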
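The serialize-or-proxy decision described above can be pictured with a toy model in plain R. This is only a sketch of the idea and not Spark's actual code: `tracker`, `write_object`, and `resolve` are hypothetical names, and the `is.atomic()` check is a simplified stand-in for the type matching that writeObject in SerDe.scala performs.

```r
# Toy model of the backend's decision (hypothetical names, not Spark code).
tracker <- new.env()  # stands in for JVMObjectTracker

write_object <- function(x) {
  # Simplifying assumption: atomic R vectors play the role of the
  # serializable types (String, Double, Integer, Boolean, arrays of them).
  if (is.atomic(x)) {
    return(list(kind = "value", payload = x))
  }
  # Anything else stays on the "backend"; only an id travels to the client.
  id <- as.character(length(ls(tracker)) + 1L)
  assign(id, x, envir = tracker)
  list(kind = "jobj", payload = id)
}

# Look up the stored object behind a proxy id.
resolve <- function(proxy) get(proxy$payload, envir = tracker)

sent_double <- write_object(0.75)
# A list stands in for a DenseVector, which is not a serializable type here.
sent_vector <- write_object(list(values = c(0.25, 0.75)))
```

In this sketch `sent_double` carries the number itself, while `sent_vector` carries only an id. The values behind that id can be reached only by going back to the tracker, which is the role a method call on the backend plays in the real setup.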

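Now, on the R side, if you look at the objects in the probability column of the predictions data frame, you will see that their class is jobj. As mentioned, objects of this class act as proxies for the actual Java objects held on the Spark cluster. In this particular case the backing Java class is org.apache.spark.mllib.linalg.DenseVector. It is a vector because it contains the probability for each class. And because this vector is not one of the serializable types supported by the SerDe class, the Spark backend just returns the jobj proxy and keeps the DenseVector objects in memory so that they can be operated on later.

With that background, you should be able to get the probability values on your R frontend by invoking methods on these DenseVector objects. As of now, I think this is the only way. The following code works for the iris dataset: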

irisDf <- createDataFrame(sqlContext, iris)
irisDf$target <- irisDf$Species == 'setosa'
model <- glm(target ~ . , data = irisDf, family = "binomial")
summary(model)
predictions <- predict(model, newData = irisDf)
modelPrediction <- select(predictions, "probability")
localPredictions <- SparkR:::as.data.frame(predictions)

getValFrmDenseVector <- function(x) {
    # Binary classification, so the vector holds exactly two elements
    a <- SparkR:::callJMethod(x$probability, "apply", as.integer(0))
    b <- SparkR:::callJMethod(x$probability, "apply", as.integer(1))
    c(a, b)
}

t(apply(localPredictions, 1, FUN=getValFrmDenseVector))

With this, I get the following probability output for the two classes:

        [,1]         [,2]
1   3.036612e-15 1.000000e+00
2   5.919287e-12 1.000000e+00
3   7.831827e-14 1.000000e+00
4   7.712003e-13 1.000000e+00
5   4.427117e-16 1.000000e+00
6   3.816329e-16 1.000000e+00
[...]

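Note: functions prefixed with SparkR::: are not exported in the SparkR package namespace, so bear in mind that you are coding against a package-private implementation. (But I don't really see how this could be achieved otherwise, unless Spark provides public API support for it.)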
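One possible variation is to loop over the probability column directly instead of calling apply() over rows (apply() first coerces the whole data frame to a matrix), and to label the two result columns. The sketch below is written in plain R so that it runs without Spark: `mock_vector` and `call_apply` are hypothetical stand-ins for the jobj proxies and for SparkR:::callJMethod(obj, "apply", as.integer(i)), and the column names `p0` and `p1` are arbitrary.

```r
# Mock proxy: an environment holding two class probabilities
# (hypothetical stand-in for a jobj that points at a DenseVector).
mock_vector <- function(p) {
  e <- new.env()
  e$values <- c(1 - p, p)
  e
}

# Hypothetical stand-in for SparkR:::callJMethod(obj, "apply", as.integer(i));
# the index is 0-based, as on the JVM side.
call_apply <- function(obj, i) obj$values[i + 1L]

# Small local data frame with a list column of mock proxies.
localPredictions <- data.frame(id = 1:3)
localPredictions$probability <- lapply(c(0.1, 0.5, 0.9), mock_vector)

# One row per observation, one column per class.
probs <- t(vapply(
  localPredictions$probability,
  function(v) c(call_apply(v, 0L), call_apply(v, 1L)),
  numeric(2)
))
colnames(probs) <- c("p0", "p1")
probDf <- as.data.frame(probs)
```

With real jobj values, `call_apply` would be replaced by the SparkR:::callJMethod call from the answer, so this variant still depends on the non-exported API.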

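【Comments】:

Thanks for the detailed explanation. The proposed solution does work. As you rightly mentioned, callJMethod() has no help page in the sparkR shell; it is part of sparkRBackend.R. Meanwhile, I was also able to get the results using the following code: modelPrediction %>% showDF(numRows = 78000, truncate = FALSE); sink()
Cool, thanks. I'll take a look at the showDF function too. (But that still doesn't let me refer to the probability values from code, as opposed to printing them.)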