【Question title】: Removing all special characters from an entire dataframe but keeping factor level definitions
【Posted】: 2020-03-13 13:03:34
【Question】:

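I'm trying to remove special characters such as "-", "/", ")", "(" etc. from my data frame entirely. However, my data frame contains only a single observation, because it is the input to a model that will be used in production. I have explicitly defined the factor levels of the data frame.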

I tried the following:

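library(magrittr)  # provides the %>% pipe used below

# Replace runs of whitespace and the characters ( ) / - with underscores
sanitize_string <- function(string) {
  gsub("\\s+", "_", string) %>%
    gsub("[(]", "_", .) %>%
    gsub("[)]", "_", .) %>%
    gsub("[/]", "_", .) %>%
    gsub("[-]", "_", .)
}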

Then:

 df <- as.data.frame(lapply(df, function(dataframe) sapply(dataframe, sanitize_string)), stringsAsFactors=FALSE)

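But when I do this, I lose my factor levels: each factor is treated as having only one level. This causes problems when I try to get predictions from my model, because sparse.model.matrix requires every factor to have 2 or more levels, whereas in production only a single observation will ever be sent.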
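A minimal sketch of the behaviour described above, using a hypothetical one-row data frame (the column subset and the "Small Hatchback" level are made up for illustration) and a base-R equivalent of sanitize_string so that the snippet needs no packages:

```r
# Hypothetical one-row data frame with explicitly declared factor levels
df <- data.frame(
  fuel_type = factor("P", levels = c("D", "P")),
  vehicle_type = factor(
    "Mid Exec Saloon/Estate/Coupe",
    levels = c("Mid Exec Saloon/Estate/Coupe", "Small Hatchback")
  )
)

# Base-R stand-in for the sanitizer: whitespace runs and ( ) / - become "_"
sanitize_string <- function(string) gsub("\\s+|[()/-]", "_", string)

# The conversion from the question: each column is rebuilt from the
# character values that are present, so unused levels are discarded
converted <- as.data.frame(
  lapply(df, function(column) sapply(column, sanitize_string)),
  stringsAsFactors = FALSE
)

nlevels(df$fuel_type)                 # levels declared up front
is.factor(converted$fuel_type)        # column is now plain character
nlevels(factor(converted$fuel_type))  # re-factoring sees only the observed value
```

Because only one row exists, any later conversion back to a factor can only see the single observed value.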

Thanks.

Here is my data frame:

 $ children_under16                : Factor w/ 2 levels "No","Yes": 1
 $ ft_employment_status            : Factor w/ 5 levels "Employed","Full-Time Education(Student)",..: 1
 $ fuel_type                       : Factor w/ 2 levels "D","P": 2
 $ homeowner                       : Factor w/ 2 levels "FALSE","TRUE": 2
 $ marital_status                  : Factor w/ 6 levels "Married","Separated",..: 1
 $ overnight_loc                   : Factor w/ 7 levels "In a private Driveway",..: NA
 $ usage_type                      : Factor w/ 3 levels "CLASS_1","SDPC",..: 1
 $ licence_type                    : Factor w/ 3 levels "UK","European",..: 1
 $ yad_relationship_to_policyholder: Factor w/ 8 levels "Spouse","No_YAD",..: 1
 $ A                          : Factor w/ 7 levels "1","2","5","3",..: 1
 $ B                          : Factor w/ 19 levels "C","E","Q","D",..: 1
 $ C                           : Factor w/ 63 levels "11","19","58",..: 1
 $ region                          : Factor w/ 12 levels "Yorkshire and The Humber",..: 1
 $ D                      : Factor w/ 28 levels "Semi-Detached Suburbia",..: 27
 $ E                   : Factor w/ 77 levels "Families in Terraces and Flats",..: 77
 $ F                 : Factor w/ 9 levels "Suburbanites",..: 1
 $ industry_band                   : Factor w/ 18 levels "13","14","15",..: 14
 $ occ_band_goco                   : Factor w/ 17 levels "0","1","2","3",..: 2
 $ transmission                    : Factor w/ 2 levels "A","M": 2
 $ vehicle_make                    : Factor w/ 19 levels "OTHER","AUDI",..: 1
 $ vehicle_type           : Factor w/ 17 levels "Mid Exec Saloon/Estate/Coupe",..: 1
 $ rural_urban                     : Factor w/ 19 levels "Urban major conurbation",..: 2
 $ water_company                   : Factor w/ 23 levels "Affinity Water",..: 23
 $ seats                           : Factor w/ 6 levels "-99","2","4",..:


【Comments】:

  • Could you post head(df) and str(df)?
  • Can you provide a sample of your data? I'd like to reproduce the problem.

Tags: r gsub xgboost model.matrix


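【Solution 1】:

You can sanitize the levels of the factors rather than the columns. This preserves the order of the levels, though it will cause errors if your sanitizing takes two distinct levels and makes them the same. I'd just use a for loop: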

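# Rename the levels of every factor column in place; the data values,
# the number of levels and their order are left untouched
for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    levels(df[[i]]) <- sanitize_string(levels(df[[i]]))
  }
}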

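I can't test this on the structure you posted, but if you run into problems, please share some data with dput() so that I can copy/paste it (e.g. dput(df[1:10, ]), or some other small subset that illustrates the problem), and I'll be happy to test and improve it.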
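As a rough illustration of the loop above, here is a sketch on a hypothetical one-row data frame. The two columns borrow level names visible in the posted str() output, but the level sets are shortened and are not the real data. The sanitizer is a base-R stand-in for sanitize_string, and the anyDuplicated() guard is an addition for illustration: assigning duplicated names with `levels<-` makes R merge those levels, so the sketch stops instead.

```r
# Base-R stand-in for the sanitizer: whitespace runs and ( ) / - become "_"
sanitize_string <- function(string) gsub("\\s+|[()/-]", "_", string)

# Hypothetical single observation with explicitly declared levels
df <- data.frame(
  ft_employment_status = factor(
    "Employed",
    levels = c("Employed", "Full-Time Education(Student)")
  ),
  seats = factor("2", levels = c("-99", "2", "4"))
)

for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    new_levels <- sanitize_string(levels(df[[i]]))
    # Guard (an assumption, not part of the loop above): refuse to
    # continue if two distinct levels would collapse into one name
    if (anyDuplicated(new_levels) > 0) {
      stop("Sanitizing would merge levels in column ", names(df)[i])
    }
    levels(df[[i]]) <- new_levels
  }
}

levels(df$ft_employment_status)  # "Employed" "Full_Time_Education_Student_"
levels(df$seats)                 # "_99" "2" "4"
```

Each column keeps its full set of levels even though only one row is present, which is what sparse.model.matrix needs.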
