[Question title]: user defined summaryFunction in caret, logloss
[Posted]: 2020-06-26 03:49:29
[Question description]:

Using the caret package, I cannot get the user-defined summary function below to work. It is supposed to compute the log loss, but I keep getting an error that the logloss metric was not found. Here is a reproducible example:

data <- data.frame('target' = sample(c('Y','N'),100,replace = T), 'X1' = runif(100), 'X2' = runif(100))

log.loss2 <- function(data, lev = NULL, model = NULL) {
  logloss = -sum(data$obs*log(data$Y) + (1-data$obs)*log(1-data$Y))/length(data$obs)
  names(logloss) <- c('LL')
  logloss
}

fitControl <- trainControl(method="cv",number=1, classProbs = T, summaryFunction = log.loss2)

my.grid <- expand.grid(.decay = c(0.05), .size = c(2))

fit.nnet2 <- train(target ~., data = data,
                  method = "nnet", maxit = 500, metric = 'LL',
                  tuneGrid = my.grid, verbose = T)

[Discussion]:

    Tags: r r-caret


    [Solution 1]:

    The error is because you did not include trControl = fitControl in the train call. However, fixing that gives you another error, caused by data$obs and data$pred being factors. They need to be converted to numeric, which gives 1 and 2, and then have 1 subtracted to get the required 0 and 1:
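A minimal sketch of that conversion (toy vector, not from the question): R stores factor levels internally as the integers 1, 2, ..., so subtracting 1 yields the 0/1 coding the loss formula expects.

```r
# Hypothetical factor with the same levels as the question's target column
obs <- factor(c("N", "Y", "Y", "N"))
as.numeric(obs)       # 1 2 2 1 -- internal integer codes of the levels
as.numeric(obs) - 1   # 0 1 1 0 -- the 0/1 coding the log-loss formula needs
```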

    log.loss2 <- function(data, lev = NULL, model = NULL) {
      data$pred <- as.numeric(data$pred)-1
      data$obs <- as.numeric(data$obs)-1 
      logloss = -sum(data$obs*log(data$Y) + (1-data$obs)*log(1-data$Y))/length(data$obs)
      names(logloss) <- c('LL')
      logloss
    }
    
    fitControl <- trainControl(method="cv",number=1, classProbs = T, summaryFunction = log.loss2)
    
    fit.nnet2 <- train(target ~., data = data,
                       method = "nnet", maxit = 500, metric = "LL" ,
                       tuneGrid = my.grid, verbose = T, trControl = fitControl,
                       maximize = FALSE)
    #output
    Neural Network 
    
    100 samples
      2 predictor
      2 classes: 'N', 'Y' 
    
    No pre-processing
    Resampling: Cross-Validated (1 fold) 
    Summary of sample sizes: 0 
    Resampling results:
    
      LL       
      0.6931472
    
    Tuning parameter 'size' was held constant at a value of 2
    Tuning parameter 'decay' was held constant at a value of 0.05
    

    A few things to note:

    This loss function only works for data that has N/Y as the classes, since the probability is hard-coded as data$Y; a better approach is to look up the class names and use them. In addition, it is good practice to bound the probability values, since log(0) is -Inf:

    LogLoss <- function (data, lev = NULL, model = NULL) 
      { 
        obs <- data[, "obs"]
        cls <- levels(obs) #find class names
        probs <- data[, cls[2]] #use second class name
        probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
        logPreds <- log(probs)        
        log1Preds <- log(1 - probs)
        real <- (as.numeric(data$obs) - 1)
        out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
        names(out) <- c("LogLoss")
        out
      }
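As a quick hand-check of the formula above (hypothetical probabilities, not caret output): with true classes coded 0/1 and predicted P(Y) of 0.2 and 0.7, the mean negative log-likelihood works out to roughly 0.29.

```r
# Hypothetical two-sample check of the log-loss formula
p <- c(0.2, 0.7)   # predicted probability of the second class ("Y")
y <- c(0, 1)       # true classes N, Y coded as 0/1
-mean(y * log(p) + (1 - y) * log(1 - p))   # ~0.2899
```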
    

    [Discussion]:

    • This is perfect! Thank you very much, I was running into both of these errors, and I really appreciate you catching the follow-up problem as well
    • @missuse: I think we also have to add the argument maximize = FALSE in train(), since a lower log loss is better?
    [Solution 2]:

    @missuse has already answered the question, but I would like to add a weights option to the logloss function:

    # Cross-entropy error function
    LogLoss <- function(pred, true, eps = 1e-15, weights = NULL) {
      # Bound the results
      pred = pmin(pmax(pred, eps), 1 - eps)
    
      if (is.null(weights)) {
        return(-(sum(
          true * log(pred) + (1 - true) * log(1 - pred)
        )) / length(true))
      } else{
        return(-weighted.mean(true * log(pred) + (1 - true) * log(1 - pred), weights))
      }
    }
    
    # Caret train weighted logloss summary function
    caret_logloss <- function(data, lev = NULL, model = NULL) {
      cls <- levels(data$obs) #find class names
      loss <- LogLoss(
        pred = data[, cls[2]],
        true = as.numeric(data$obs) - 1,
        weights = data$weights
      )
      names(loss) <- c('MyLogLoss')
      loss
    }
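A quick sanity check on the weighted variant (hypothetical numbers, same two-sample setup as above): with equal weights, weighted.mean reduces to mean, so the result matches the unweighted loss; unequal weights shift the average toward the up-weighted sample.

```r
# Hypothetical per-sample log-likelihood terms for true classes 0/1
# and predicted P(Y) of 0.2 and 0.7
p <- c(0.2, 0.7)
y <- c(0, 1)
ll <- y * log(p) + (1 - y) * log(1 - p)
-weighted.mean(ll, c(1, 1))   # ~0.2899, same as -mean(ll)
-weighted.mean(ll, c(3, 1))   # ~0.2565, pulled toward the better-predicted first sample
```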
    

    [Discussion]:

    • It should be cls <- levels(data$obs) in caret_logloss