【Question title】: Creating an R data.frame column based on the difference between two character columns
【Posted】: 2016-09-27 01:51:31
【Question】:

I have a data.frame, df, with two columns: one holds the song title and the other holds the title and artist combined. I want to create a separate artist field. The first three rows are shown here:

title                               titleArtist
I'll Never Smile Again  I'll Never Smile Again TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS
Imagination         Imagination GLENN MILLER & HIS ORCHESTRA / RAY EBERLE
The Breeze And I    The Breeze And I JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY

This code works without problems on this subset of the data:

library(stringr)
library(dplyr)

 df %>% 
 head(3) %>% 
 mutate(artist=str_to_title(str_trim(str_replace(titleArtist,title,"")))) %>% 
 select(artist,title)

 artist                                                         title
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again
2                  Jimmy Dorsey & His Orchestra / Bob Eberly       The Breeze And I
3                  Glenn Miller & His Orchestra / Ray Eberle            Imagination

But when I apply it to thousands of rows, I get the error

Error: Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

#or for part of the mutation

df$artist <-str_replace(df$titleArtist,df$title,"")

Error in stri_replace_first_regex(string, pattern, replacement, opts_regex =    attr(pattern,  : 
 Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

I removed all parentheses from the columns, and the code seemed to work for a while before I got the error

Error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

Is another special character causing the problem, or is it something else?

TIA

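【Comments】:

  • Does traceback() give any information about what triggers the error?
  • Does using gsub or sub raise the same error as str_replace? I see you have a / in titleArtist; could it also appear in title? Without access to the data it is hard to really analyse this.
  • Check whether any of your titles and/or artists are empty. You may have to use ifelse()
  • Thanks for the suggestions. traceback(), for me at least, does not provide any meaningful information, such as the row number of the first error. There are also '/' characters in the title (this happens when a record has two A sides). I successfully replaced them with '&' but still got the same error, though whether it is related to '&' or something else I do not know. Is there a list of forbidden characters besides '(' and '/', any of which could be causing the problem?
  • @dww. I have uploaded the data to googlesheets docs.google.com/spreadsheets/d/…

Tags: r string dataframe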


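【Answer 1】:

Your general problem is that str_replace treats your title values as regular expressions, so special characters other than parentheses can also cause errors. The stringi library, which stringr wraps and simplifies, allows finer-grained control, including treating arguments as fixed strings instead of regular expressions. I don't have your original data, but this works when I throw in some characters that would otherwise cause errors: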

library(dplyr)
library(stringi)


df = data_frame(title = c("I'll Never Smile Again (",  "Imagination.*", "The Breeze And I(?>="),
           titleArtist = c("I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS",
                            "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
                            "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"))

df %>%
  mutate(artist=stri_trans_totitle(stri_trim(stri_replace_first_fixed(titleArtist,title,"")))) %>% 
  select(artist,title)

Result:

Source: local data frame [3 x 2]

artist                     title
(chr)                     (chr)
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again (
2                  Glenn Miller & His Orchestra / Ray Eberle             Imagination.*
3                  Jimmy Dorsey & His Orchestra / Bob Eberly      The Breeze And I(?>=
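
If an extra dependency is not wanted, the same idea can be sketched in base R. This is a different technique from the stri_replace_first_fixed call above: it assumes that titleArtist always begins with title (as in the rows shown in the question) and strips that prefix by position, so no pattern is compiled at all. strip_prefix is a hypothetical helper name, and the sample vectors are made up for illustration.

```r
# Sample data containing characters that are special in regular expressions
title <- c("I'll Never Smile Again (", "Imagination.*", "The Breeze And I(?>=")
titleArtist <- c(
  "I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS",
  "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
  "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"
)

# Hypothetical helper: drop `prefix` from the start of `full` when it is present
strip_prefix <- function(full, prefix) {
  has_prefix <- startsWith(full, prefix)  # literal comparison, no regex involved
  rest <- ifelse(has_prefix, substring(full, nchar(prefix) + 1), full)
  trimws(rest)                            # remove the leftover separating space
}

artist <- strip_prefix(titleArtist, title)
print(artist)
```

Rows whose titleArtist does not start with title are returned unchanged (apart from trimming), which makes mismatches easy to spot. Title casing is left out here; it can be applied afterwards, for example with str_to_title as in the question.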

【Comments】:

  • I noticed that stringr::str_replace(titleArtist, fixed(title), "") is equivalent to stringi::stri_replace_first_fixed(titleArtist, title, "")
  • Works like a charm. Thanks for the solution and the explanation
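
The equivalence mentioned in the first comment can be checked on a small made-up vector. This is a minimal sketch that assumes the stringr package (and stringi, which it depends on) is installed; fixed() tells str_replace to compare the pattern literally rather than compile it as a regular expression.

```r
library(stringr)

# Made-up titles containing regex metacharacters
title <- c("Imagination.*", "The Breeze And I(?>=")
titleArtist <- c(
  "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
  "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"
)

# stringr with fixed(): each title is matched as a literal string
via_stringr <- str_replace(titleArtist, fixed(title), "")

# The stringi call used in the answer
via_stringi <- stringi::stri_replace_first_fixed(titleArtist, title, "")

# TRUE when both approaches give the same result
print(identical(via_stringr, via_stringi))

# Same clean-up steps as in the question
artist <- str_to_title(str_trim(via_stringr))
print(artist)
```

With fixed(), the original dplyr pipeline from the question needs only a one-word change inside mutate().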
【Answer 2】:
 df <- data.frame(ID=11:13, T_A=c('a/b','b/c','x/y'))  # T_A Title/Artist 
   ID T_A
 1 11 a/b
 2 12 b/c
 3 13 x/y

 # Title Artist are separated by /
 > within(df, T_A<-data.frame(do.call('rbind', strsplit(as.character(T_A), '/', fixed=TRUE))))
  ID T_A.X1 T_A.X2
 1 11      a      b
 2 12      b      c
 3 13      x      y

【Comments】:

  • Thanks, but I am not looking to split the column on '/'. For row 1, I would want to split the titleArtist column into "I'll Never Smile Again" and "TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA"