【Question Title】:How can I remove all characters between two other recurring characters in a string using R?
【Posted】:2019-05-15 06:33:41
【Question Description】:

The following code successfully retrieves the text I need, before I use gsub to help "clean" it up.

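library(RCurl)  # getURL()
library(XML)    # htmlTreeParse(), xpathApply(), xmlValue()

am1<-getURL("url.com")
ami1<-htmlTreeParse(am1, useInternalNodes = TRUE)
# Collect the text of every <td> cell
ami1.tree.parse<- unlist(xpathApply(ami1, path = '//td', fun = xmlValue))
# Join all cells except the first and last into one string
ami1.txt<-NULL
  for (i in 2:(length(ami1.tree.parse)-1)) {
    ami1.txt<-paste(ami1.txt, as.character(ami1.tree.parse[i]), sep = ' ')
  }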

Problem

I'm unable to remove the questions in their entirety from the interview text. For example, the text looks like this:

[1] "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."

To be absolutely clear, this is what I need to get from the text above:

[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."

I have tried:

 ami1.txt<-gsub("Q.[^?]+H:", "",ami1.txt)
 ami1.txt<-gsub("Q.[^?]+H: ", "",ami1.txt)
 ami1.txt<-gsub("Q.*H:", "",ami1.txt)

This comes down to my not having a grasp of regex, but if anyone can point me in the right direction, I would appreciate it.
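For reference, a minimal sketch of what the attempted patterns do, run on a shortened, hypothetical sample rather than the scraped page. The names `x`, `attempt1` and `attempt3` are made up for illustration.

```r
# Shortened, hypothetical sample with two question/answer pairs
x <- "Q. How are you?JOE SMITH: Fine.Q. And now?JOE SMITH: Still fine."

# Attempts 1 and 2: [^?]+ cannot cross the "?" that sits between "Q." and "H:",
# so the pattern never matches and the string comes back unchanged.
attempt1 <- gsub("Q.[^?]+H:", "", x)
identical(attempt1, x)   # TRUE

# Attempt 3: .* is greedy, so a single match runs from the first "Q" to the
# last "H:", which swallows every answer except the final one.
attempt3 <- gsub("Q.*H:", "", x)
attempt3                 # " Still fine."
```

So the first two patterns are too strict (they forbid the question mark that every question contains) and the third is too permissive.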

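Alas, I lied; the text is apparently a bit more complicated. I have appended the more complicated elements to the end of the text above, as shown below. Some of the "questions" (Q.) begin with a statement: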

 str2<-"Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively.Q. That's interesting. When would you consider speaking to her?JOE SMITH: Probably, tomorrow. Q. That sounds good. How do you feel now? Better than before?JOE SMITH: Yeah I'm feeling alright."

The task remains the same, and akrun's answer gets me close:

 trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))
 [1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively. Probably, tomorrow.  Better than before? Yeah I'm feeling alright."

Final update

akrun's answer:

 trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))

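I'm not entirely sure why the answer above doesn't remove everything between "Q" and the last question mark, but alas. After revising my question above, I realized that what I really wanted was for everything from "Q" to ":" to be removed. So I used this tool to help me understand where my grasp of regex went wrong. I ran the following to clear all characters between "Q" and ":".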

 gsub("Q[^:]+\\?|[A-Z ]+:", "", str2)
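A small sketch of the difference between the two patterns, using a shortened, hypothetical question that contains two question marks. The names `x`, `first_qmark` and `up_to_colon` are made up for illustration.

```r
# Hypothetical one-pair sample whose "question" contains two "?"
x <- "Q. Good. How do you feel now? Better than before?JOE SMITH: Yeah I'm feeling alright."

# [^?]+ stops at the FIRST "?", so the second part of the question survives
first_qmark <- trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", x))
first_qmark   # "Better than before? Yeah I'm feeling alright."

# [^:]+ may run up to the next ":"; the regex then backs off to the LAST "?"
# before that colon, so the whole question is removed
up_to_colon <- trimws(gsub("Q[^:]+\\?|[A-Z ]+:", "", x))
up_to_colon   # "Yeah I'm feeling alright."
```

On str2 the second pattern still leaves a double space before the last answer, because the source has a space in front of that "Q."; something like gsub(" +", " ", ...) would collapse it. The pattern also assumes no question contains a colon and no answer contains a capital "Q".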

【Question Comments】:

    Tags: r regex string text gsub


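    【Solution 1】:

    We can match a Q followed by one or more characters that are not a ? ([^?]+) followed by a question mark, or (|) a run of uppercase letters and spaces followed by a :, and replace each match with an empty string (""). If there are leading/trailing spaces, use trimws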

    trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str1))
    #[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."
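To make the two halves of the alternation easier to follow, here is a sketch that applies them one at a time to a shortened, hypothetical sample. The names `x`, `step1`, `step2` and `combined` are made up for illustration.

```r
# Shortened, hypothetical sample with two question/answer pairs
x <- "Q. How are you?JOE SMITH: Fine.Q. And now?JOE SMITH: Still fine."

# Left alternative: "Q", then non-"?" characters, then the "?" itself
step1 <- gsub("Q[^?]+\\?", "", x)
step1   # "JOE SMITH: Fine.JOE SMITH: Still fine."

# Right alternative: a run of capitals and spaces ending in ":" (the speaker label)
step2 <- gsub("[A-Z ]+:", "", step1)
step2   # " Fine. Still fine."

# Both alternatives in one pass, as in the answer
combined <- gsub("Q[^?]+\\?|[A-Z ]+:", "", x)
trimws(combined)   # "Fine. Still fine."
```

The space that follows each "JOE SMITH:" is not part of the match, which is why the result has a leading space and why trimws is applied at the end.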
    

    Data

    str1 <- "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."
    

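    【Comments】:

    • Thanks. It works; I just added an edit above because I lied to you about some of the questions. Some of the "questions" contain additional sentences or questions that don't get replaced.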
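For questions that contain extra sentences or several question marks, one possible alternative (not from the answer above) is to split the text on everything from "Q." through the next colon and keep what is left. This is only a sketch on a shortened, hypothetical sample. It assumes that no question contains a colon and that "Q." never appears inside an answer. The names `x`, `answers` and `joined` are made up for illustration.

```r
# Shortened, hypothetical sample; the second question contains two "?"
x <- "Q. How are you?JOE SMITH: Fine. Q. Good. And now? Really?JOE SMITH: Still fine."

# Separator: optional whitespace, "Q.", anything up to the first ":",
# then optional whitespace after the colon
answers <- strsplit(x, "\\s*Q\\.[^:]*:\\s*")[[1]]

# A match at the very start of the string yields a leading "" element; drop it
answers <- answers[nzchar(answers)]
answers   # "Fine."  "Still fine."

# Join the answers back into a single string
joined <- paste(answers, collapse = " ")
joined    # "Fine. Still fine."
```

Keeping the answers as a vector before joining also makes it easy to inspect or count them individually.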