RevoScaleR: rxPredict, the number of parameters does not match the number of variables

【Question title】: RevoScaleR: rxPredict, the number of parameters does not match the number of variables
【Posted】: 2016-08-05 13:29:02
【Question】:

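I set up R Server for myself by following Microsoft's "Data Science End to End Walkthrough", and their example runs fine.

The example (New York taxi data) uses non-categorical variables (distance, taxi fare, and so on) to predict a categorical variable (1 or 0, indicating whether a tip was paid).

I am trying to predict a similar binary output from categorical input variables using linear regression (the rxLinMod function), and I get an error.

The error says that the number of parameters does not match the number of variables, but it looks to me as if the "number of variables" is really the number of levels within each factor (variable).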

Steps to reproduce

Create a table named example in SQL Server:

USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
    [Person] [nvarchar](max) NULL,
    [City] [nvarchar](max) NULL,
    [Bin] [integer] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];

Insert the data:

insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
insert into [dbo].[example] values ('Ringo','Mexico City',0);

I also use a SQL function that returns the variables in table form, because the Microsoft example needs one. Create the function formatAsTable:

USE [my_database];
GO
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO
-- CREATE FUNCTION must be the first statement in its batch, hence the GO separators
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
  RETURN
  (
  -- Add the SELECT statement with parameter references here
  SELECT
    @City AS City,
    @Person AS Person
  );

We now have a table with two categorical variables, Person and City.

Let's start predicting. In R, run the following:

library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, 
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# SQL query that fetches the training data
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr, 
                                colClasses = c(City = "factor",Bin="logical",Person="factor"
                                ),
                                rowsPerRead=500)

Now build the linear regression model.

isWonObj <- rxLinMod(Bin ~ City+Person,data = inDataSource)

Look at the model object:

isWonObj

Note that it looks like this:

...
Total independent variables: 11 (Including number dropped: 3)
...

Coefficients:
                           Bin
(Intercept)       6.666667e-01
City=London      -1.666667e-01
City=New York     4.450074e-16
City=Liverpool    3.333333e-01
City=Paris        4.720871e-16
City=Sydney      -1.666667e-01
City=Mexico City       Dropped
Person=John      -1.489756e-16
Person=Paul      -3.333333e-01
Person=George          Dropped
Person=Ringo           Dropped

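It says there are 11 variables, which is fine: that is the intercept plus the sum of the levels in the two factors (6 for City, 4 for Person).

Now, when I try to predict the Bin value from City and Person, I get an error.

First I format the City and Person I want to predict as a table. Then I use that as the input for the prediction.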

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))

If you inspect the pred object, it looks as expected:

> head(pred)
    City Person
1 London George

Now, when I try to predict, I get an error.

scoredOutput <- RxSqlServerData(
  connectionString = connStr,
  table = "binaryOutput"
)

rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput, 
          predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)

The error says:

INTERNAL ERROR: In rxPredict, the number of parameters does not match the number of  variables: 3 vs. 11.

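I can see where the 11 comes from, but I only supplied 2 values to the prediction query, so I can't see where the 3 comes from, or why it is a problem.

Any help is appreciated!

【Comments】:

Tags: sql-server r revolution-r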

【Answer 1】:

Are you sure that specifying colInfo solves the problem? This looks like a general issue in rxPredict rather than something specific to using rxPredict with SQL Server:

# lm() and predict() don't have a problem with missing factor levels ("two" in this case):
fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
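# stringsAsFactors = TRUE keeps fac a factor on R >= 4.0, where the default changed
predictionData = data.frame(fac = c("one", "three", "one", "one"), stringsAsFactors = TRUE)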
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1

# rxLinMod() and rxPredict() behave differently:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown:
# "INTERNAL ERROR: In rxPredict, the number of parameters does not match
# the number of  variables: 3 vs. 4."
# checkFactorLevels = FALSE doesn't help here, it actually seems to just
# check the order of factor levels.
levels(predictionData$fac) <- c("two", "three", "one")
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown (twice):
# ERROR:order of factor levels in the data are inconsistent with
# the order of the model coefficients:fac = two versus fac = one. Set
# checkFactorLevels = FALSE to ignore.
rxPred <- rxPredict(rxModel, data = predictionData, checkFactorLevels = FALSE, writeModelVars = TRUE)
rxPred
#   val_Pred    fac
#1  1           two
#2  3           three
#3  1           two
#4  1           two
# This looks suspicious at best. While the prediction values are still
# correct if you look only at the order of the records in trainingData,
# the model variables are messed up.
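
A possible reading of the fac column in that last output, sketched in base R only (no RevoScaleR involved): assigning to levels() renames the existing levels by position and leaves the underlying integer codes alone, so "one" becomes "two" in the data itself. Whether rxPredict then simply uses those codes when checkFactorLevels = FALSE is an assumption here, not something confirmed by the documentation. The names relabelled and recoded are just illustrative.

```r
# Base R only: `levels<-` renames levels by position; it does not recode values.
predictionData <- data.frame(fac = c("one", "three", "one", "one"),
                             stringsAsFactors = TRUE)
levels(predictionData$fac)       # "one" "three" (alphabetical order)
as.integer(predictionData$fac)   # 1 2 1 1

relabelled <- predictionData$fac
levels(relabelled) <- c("two", "three", "one")
as.character(relabelled)         # "two" "three" "two" "two"
as.integer(relabelled)           # still 1 2 1 1

# factor(..., levels = ) keeps the values and only changes the integer coding.
recoded <- factor(as.character(predictionData$fac),
                  levels = c("two", "three", "one"))
as.character(recoded)            # "one" "three" "one" "one"
as.integer(recoded)              # 3 2 3 3
```

So if the goal is to give the prediction data a particular level order without changing its values, factor(x, levels = ...) is the safer base R tool.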

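In my scenario, I have one factor with about 10,000 levels (known only when the model is created) and several factors with about 5 levels each (known before the model is created). It seems impossible to specify the levels for all of them in the "right" order when calling rxPredict().

【Comments】:

I have edited my original answer to address your question. If I understood it correctly, it should help.

【Answer 2】:

The answer seems to be consistent with how R handles factor variables, but the error message could have distinguished more clearly between factors, levels, variables, and parameters.

It appears that the inputs used to generate a prediction cannot simply be characters or factors without levels. They need to have the same levels as the factors for the same variables that were used when the model was fitted.

So, the following lines: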

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      , colClasses = c(City = "factor",Person="factor"))

...should be replaced with:

sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"

column_information<-list(
  City=list(type="factor",levels=c("London","New York","Liverpool","Paris","Sydney","Mexico City")),
  Person=list(type="factor",levels=c("John","Paul","George","Ringo")),
  Bin=list(type="logical")
)

pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
                      ,colInfo=column_information,
                      stringsAsFactors=FALSE)

I have seen other examples with categorical variables that seem to work without this, but perhaps the levels were still there.
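
For what it's worth, here is one way to read the "3 vs. 11" in the error, as a base R sketch. It assumes that rxPredict counts one slot for the intercept plus one per factor level; that is inferred only from the two error messages in this thread (3 vs. 11 here, 3 vs. 4 in the other answer), not from documented behaviour. count_slots, train and score are hypothetical names made up for the illustration.

```r
# Hypothetical helper: one slot for the intercept plus one per factor level,
# mirroring the layout of the coefficient table (dropped entries included).
count_slots <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  1L + sum(vapply(df[is_fac], nlevels, integer(1)))
}

# Training-like data: 6 City levels and 4 Person levels.
train <- data.frame(
  City   = factor(c("London", "New York", "Liverpool", "Paris", "Sydney", "Mexico City")),
  Person = factor(c("John", "Paul", "George", "Ringo", "John", "Paul"))
)
# A single scoring row whose factors only know the one value they contain.
score <- data.frame(City = factor("London"), Person = factor("George"))

count_slots(train)  # 11
count_slots(score)  # 3

# Giving the scoring data the training levels makes the two counts agree.
score$City   <- factor(as.character(score$City),   levels = levels(train$City))
score$Person <- factor(as.character(score$Person), levels = levels(train$Person))
count_slots(score)  # 11
```

Under that reading, a one-row query with colClasses = "factor" yields factors with a single level each, which would explain the 3.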

I hope this saves someone the time I wasted!

Edit in response to SLSvenR

I think my comment about needing the same levels as the training set still holds.

fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
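# stringsAsFactors = TRUE keeps fac a factor on R >= 4.0, where the default changed
predictionData = data.frame(fac = c("one", "three", "one", "one"), stringsAsFactors = TRUE)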
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1

levels(predictionData$fac)<-levels(trainingData$fac)
# rxLinMod() and rxPredict() behave differently:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE,checkFactorLevels = TRUE)
rxPred
# This result appears correct to me.

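I can't comment on whether this is good or bad, but it looks like one way around the problem is to apply the training data's levels to the test set, which I assume you can do on the fly.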
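
A minimal base R sketch of doing that on the fly, under the assumption that both data sets are available as local data frames. align_levels is a hypothetical helper, not part of RevoScaleR. It rebuilds each factor with factor(..., levels = ) rather than assigning to levels(), because the assignment in the snippet above only gives the right result when the prediction data's own levels happen to line up with the start of the training levels (as "one", "three" does with "one", "three", "two").

```r
# Hypothetical helper (not part of RevoScaleR): recode every factor column that
# new_data shares with train_data so that it uses the training levels.
align_levels <- function(new_data, train_data) {
  for (col in intersect(names(new_data), names(train_data))) {
    if (is.factor(train_data[[col]])) {
      # Going through as.character() recodes by value; values never seen in
      # training become NA instead of being silently renamed.
      new_data[[col]] <- factor(as.character(new_data[[col]]),
                                levels = levels(train_data[[col]]))
    }
  }
  new_data
}

trainingData <- data.frame(fac = c("one", "two", "three"), val = c(1, 2, 3),
                           stringsAsFactors = TRUE)
# "four" is deliberately a value the training data never contained.
predictionData <- data.frame(fac = c("one", "three", "one", "four"),
                             stringsAsFactors = TRUE)

aligned <- align_levels(predictionData, trainingData)
levels(aligned$fac)         # identical to levels(trainingData$fac)
as.character(aligned$fac)   # "one" "three" "one" NA
```

The NA for the unseen value is a design choice in this sketch: it makes the mismatch visible before scoring rather than letting it map onto some other level.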

【Comments】:

【Answer 3】:

Although just setting the factor levels (... levels(predictionData$fac)

rxSetComputeContext("local")

sqlPredictQueryDS

predictQueryDS = rxImport(sqlPredictQueryDS)

if ("Artikelnummer" %in% colnames(predictQueryDS)) { predictQueryDS

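Besides setting the required factor levels, rxFactors also reorders the factor indexes. I'm not saying the colInfo solution is wrong; perhaps it just doesn't work for factors with "too many" levels.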

【Comments】:

Good point. Problems like this are why I have avoided RevoScaleR wherever possible ever since!