【Question title】: Cannot remove leading/trailing white space with gsub or trimws
【Posted】: 2018-03-16 08:43:40
【Question】:

I am working on a dataset that needs a lot of cleaning. I have a subject name from which I can't seem to remove the leading whitespace.

Sample data:

Data <- dput(Data)
structure(list(Teacher = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher", 
"Please.rate.teacher: Jane.Doe.Overall.rating.for.teacher"), class = "factor"), 
    Overall_Rating = c(5L, 4L, 5L, 4L, 4L, 5L, 4L, 4L, 4L, 4L, 
    3L, 5L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L)), .Names = c("Teacher", 
"Overall_Rating"), class = "data.frame", row.names = c(NA, -22L
))

My cleaning attempt:

library(dplyr)  # provides %>% and mutate()

Data_clean <- Data %>%
  mutate(Teacher = as.character(Teacher),
         Teacher = gsub("Please.rate.teacher|.Overall.rating.for.teacher|[:]", "", Teacher),
         Teacher = gsub("[.]", " ", Teacher),
         Teacher = trimws(Teacher),
         Teacher = tolower(Teacher),
         Teacher = tools::toTitleCase(Teacher))

This leaves leading and trailing whitespace behind, which also breaks the title casing of the second name:

unique(Data_clean$Teacher)
[1] "John Doe " " jane Doe"

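The first name still has a trailing space, and the second has a leading space.

How can I remove them?

【Comments】:

  • Look up ?trimws
  • I do call trimws before changing the case
  • Sorry, I should have read more carefully. I've added a solution below, please take a look.

Tags: r data-cleaning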


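【Solution 1】:

I suspect your data contains non-ASCII whitespace, such as "\u00A0" (a no-break space). The trimws function only removes ASCII whitespace characters (space, tab, newline, carriage return).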

Try running utf8::utf8_print(unique(Data_clean$Teacher), utf8 = FALSE) to see whether that is the case.
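
If the utf8 package is not installed, a similar check can be done in base R by looking at the code points directly. This is only a minimal sketch: the vector `x` is typed in by hand to mimic the symptom (the real column may differ), and `has_nbsp` is a name chosen for illustration.

```r
# Hand-built stand-in for unique(Data_clean$Teacher); an assumption, not real data.
x <- c("John Doe\u00a0", "\u00a0jane Doe")

# utf8ToInt() takes a single string and returns its Unicode code points.
code_points <- lapply(x, utf8ToInt)

# 160 is U+00A0 (no-break space); an ordinary ASCII space would be 32.
has_nbsp <- vapply(code_points, function(cp) 160L %in% cp, logical(1))
print(has_nbsp)
#> [1] TRUE TRUE
```

Any element flagged TRUE contains a character that a plain trimws(x) call will leave in place.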

To handle non-ASCII whitespace, replace trimws(x) in your code with

gsub("(^[[:space:]]*)|([[:space:]]*$)", "", x)
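
In case `[[:space:]]` does not match the no-break space on a given setup (the asker reports below that this replacement did not help), a more explicit option is to convert U+00A0 to an ordinary space first and then trim as usual. The following is a sketch under that assumption; `trim_nbsp` is a hypothetical helper name, and `x` is a hand-built vector rather than the real data.

```r
# Hypothetical helper (not from the original answers): turn no-break spaces
# into ordinary spaces, then let trimws() strip them.
trim_nbsp <- function(x) {
  # fixed = TRUE matches the character literally, so no regex classes are involved
  x <- gsub("\u00A0", " ", x, fixed = TRUE)
  trimws(x)
}

x <- c("John Doe\u00a0", "\u00a0jane Doe")
cleaned <- trim_nbsp(x)
print(cleaned)
#> [1] "John Doe" "jane Doe"
```

This only targets U+00A0; other Unicode spaces would need to be added in the same way. On R 3.6.0 or later, trimws() also accepts a `whitespace` argument, and the ?trimws help page shows whitespace = "[\\h\\v]" for Unicode whitespace.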

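【Comments】:

  • Your suspicion was correct! utf8::utf8_print(unique(Data_clean$Teacher), utf8 = FALSE) [1] "John Doe\u00a0" "\u00a0jane Doe"
  • Glad to hear it. Did replacing the trimws call as I suggested solve your problem?
  • Unfortunately, no. I still get the same result

【Solution 2】:

Here is a fully reproducible example using stringr::str_trim in particular, since I'm not sure why trimws isn't working for you. The code you posted gives me the same output as below, correctly converting the names to title case and removing the whitespace.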

data <- structure(list(Teacher = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
                                     1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher", 
              "Please.rate.teacher: Jane.Doe.Overall.rating.for.teacher"), class = "factor"), 
Overall_Rating = c(5L, 4L, 5L, 4L, 4L, 5L, 4L, 4L, 4L, 4L, 
                   3L, 5L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L)), .Names = c("Teacher", 
                                                                                "Overall_Rating"), class = "data.frame", row.names = c(NA, -22L
                                                                                ))

library(tidyverse)
data %>%
  mutate(
    Teacher = Teacher %>%
      str_remove_all("Please.rate.teacher:|.Overall.rating.for.teacher") %>%
      str_replace_all("\\.", " ") %>%
      str_trim() %>%
      str_to_title()
  ) %>%
  `[[`(1) %>%
  unique()
#> [1] "John Doe" "Jane Doe"

Created on 2018-03-15 by the reprex package (v0.2.0).

【Comments】:

  • Thanks! This works well, and I like that it uses tidyverse

【Solution 3】:

How about this?

Data_clean <- Data %>%
     mutate(Teacher = gsub("Please.rate.teacher|\\s*\\.Overall.rating.for.teacher|:", "", Teacher),
            Teacher = gsub("\\.", " ", Teacher),
            Teacher = trimws(Teacher),
            Teacher = tolower(Teacher), Teacher = tools::toTitleCase(Teacher))

unique(Data_clean$Teacher);
#[1] "John Doe" "Jane Doe"

Explanation: the pattern also removes the optional (>= 0) whitespace characters that appear before ".Overall.rating..." in Teacher.
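
To make the effect of the extra `\\s*` concrete, here is the first gsub step on its own, applied to the two factor labels from the sample data. The names `teacher_labels` and `step1` are used only for illustration.

```r
# The two factor labels from the sample data, typed with ordinary ASCII spaces.
teacher_labels <- c(
  "Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher",
  "Please.rate.teacher: Jane.Doe.Overall.rating.for.teacher"
)

# \\s* consumes any whitespace sitting directly before ".Overall...", so the
# space after "DOE" is removed together with the suffix.
step1 <- gsub("Please.rate.teacher|\\s*\\.Overall.rating.for.teacher|:", "",
              teacher_labels)
print(step1)
#> [1] ".JOHN.DOE" " Jane.Doe"
```

The leading "." and the leading space that remain are handled by the later gsub("\\.", " ", ...) and trimws() steps.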


Sample data

Data <- structure(list(Teacher = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher",
"Please.rate.teacher: Jane.Doe.Overall.rating.for.teacher"), class = "factor"),
    Overall_Rating = c(5L, 4L, 5L, 4L, 4L, 5L, 4L, 4L, 4L, 4L,
    3L, 5L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L)), .Names = c("Teacher",
"Overall_Rating"), class = "data.frame", row.names = c(NA, -22L
))

【Comments】:

  • Unfortunately, I get the same result: [1] "John Doe " " jane Doe"
  • Hm? That's odd. I can't reproduce your output. Could you double-check? I've included the output of unique(Data_clean$Teacher).