【Question title】: Replace tabs and line breaks in R
【Posted】: 2018-11-17 12:59:33
【Question】:

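I'm cleaning a large text file to read into R. Almost every line is separated by a tab, but some long quotes also contain line breaks. I'm using the tabs to split the document into a data frame with a speaker column and a comments column. The line breaks ruin my formatting, because R treats every line as a new speaker and then reports the speaker as NA when it finds no tab. Here is an example of what I have: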

Interviewer: How are you?

Subject: I'm just incredibly frustrated.
*NA* Really, R is frustrating me.
*NA* But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

This is what I want:

Interviewer: How are you?

Subject: I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

This is how I read the document in:

atas <- stri_read_lines("ATAS2.txt") %>% str_replace_all("\t", "TABS_TO_BE_DELETED")

(I use that arbitrary string because R kept erasing the tabs whenever I turned the text into a data frame, FYI.)

Now, to remove the line breaks, I have tried:

atas2 <- gsub("\r?\n|\r", " ", atas) 

atas2 <- str_replace_all(atas, "\n" , " ")

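I also can't just strip all special characters or formatting to solve this. If I had to remove all non-alphanumeric characters, I would still need to keep the tabs (at least long enough to put some obscure string in their place so I can split on it later), as well as ?.[]():.

I'd like R to ignore those line breaks or somehow merge the lines together. The one caveat to simply merging every line that has no tab is that some lines stand on their own, have no speaker, and need to stay without attribution in the speaker column, for example (but not limited to):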

(Laughter)

Interview 41

[Inaudible cross-talk]

Thanks for any help you can provide!

【Comments】:

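  • Are you trying to read tab-delimited data?
  • @Onyambu, no, it is not tab-delimited. The transcription software we use to record the interviews automatically inserts a tab between the speaker and their comment. In a few cases people transcribed by hand and there are no tabs, but in 90% of cases the tabs put spacing into the document, even though the document itself is not tab-delimited.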

Tags: r string text line-breaks stringr


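【Solution 1】:

You could take a slightly different approach and do something like this. Note that you usually have to double-escape special characters in R regular expressions (the first backslash escapes the second one).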

library(stringr)

#read in text as a single string
text <- "Interviewer: How are you?
Subject: I'm just incredibly frustrated. 
    Really, R is frustrating me. 
    But maybe someone has a solution for me?
Interviewer: Fortunately, I have an answer for you."

#add `#` markers to separate text before and after speaker followed by colon 
text2 <- str_replace_all(text, "(\\w+?\\:)", "#\\1#")

#split at markers, remove first blank element, and cast as a 2-column data frame
text3 <- as.data.frame(matrix(str_split(text2, "#")[[1]][-1], ncol=2, byrow=TRUE))

#remove line breaks, tabs etc
text3$V2 <- str_replace_all(text3$V2, "[\\r\\n\\t]+", " ")

#remove excessive white space
text3$V2 <- str_trim(str_replace_all(text3$V2, "\\s+", " "))

text3
            V1                                                                                                    V2
1 Interviewer:                                                                                          How are you?
2     Subject: I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?
3 Interviewer:                                                                Fortunately, I have an answer for you.
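
The comment "read in text as a single string" matters here, and it probably also explains why the attempts in the question seemed to do nothing. `readLines()` returns one element per line with the line terminators already removed (as far as I can tell, `stri_read_lines()` behaves the same way), so there is no "\n" left to replace. A minimal sketch using base R only, where the temporary file and its contents are made up for illustration:

```r
# Sketch only: the file and its contents are invented for this example.
path <- tempfile(fileext = ".txt")
writeLines(c("Interviewer:\tHow are you?",
             "Subject:\tI'm just incredibly frustrated.",
             "Really, R is frustrating me.",
             "Interviewer:\tFortunately, I have an answer for you."), path)

# readLines() drops the line terminators, so no element contains "\n".
lines <- readLines(path)
no_breaks_in_lines <- !any(grepl("\n", lines, fixed = TRUE))

# Collapsing the lines into one string puts the line breaks back,
# which is the form the code above expects in `text`.
text <- paste(lines, collapse = "\n")
breaks_in_text <- grepl("\n", text, fixed = TRUE)

unlink(path)
```

After this, `text` is a single string that a regular expression containing `\\n` can actually match against.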


    【Solution 2】:

    If the output is as shown by Andrew Gustar, you can do this:

    read.csv(text = gsub("\\n(?!\\w+:)", "", text, perl = TRUE), sep = ":", header = FALSE)
               V1                                                                                                     V2
    1 Interviewer                                                                                           How are you?
    2     Subject  I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?
    3 Interviewer                                                                 Fortunately, I have an answer for you.
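
Neither answer covers the standalone lines from the question, such as "(Laughter)", which should end up with no speaker. One possible way to handle them is sketched below in base R. It rests on several assumptions that are not in the original posts: speaker lines always contain a tab, standalone lines can be recognised by a pattern (the pattern here is only a guess based on the three examples given), and every other line continues the record before it.

```r
# Hypothetical input; "\t" separates speaker and comment as in the question.
lines <- c("Interviewer:\tHow are you?",
           "",
           "Subject:\tI'm just incredibly frustrated.",
           "Really, R is frustrating me.",
           "But maybe someone has a solution for me?",
           "(Laughter)",
           "Interviewer:\tFortunately, I have an answer for you.")

# Drop empty lines so they are not glued onto the previous record.
lines <- lines[nzchar(trimws(lines))]

has_tab <- grepl("\t", lines, fixed = TRUE)
# Assumed pattern for lines that stand on their own; extend as needed.
standalone <- grepl("^(\\(.*\\)|\\[.*\\]|Interview [0-9]+)$", lines)

# A new record starts at every speaker line and every standalone line;
# all remaining lines are continuations of the record before them.
group <- cumsum(has_tab | standalone)
merged <- unname(vapply(split(lines, group), paste, character(1), collapse = " "))

# Split each record at the tab; records without a tab get NA as speaker.
parts <- strsplit(merged, "\t", fixed = TRUE)
speaker <- vapply(parts, function(p) if (length(p) > 1) p[1] else NA_character_,
                  character(1))
comment <- vapply(parts, function(p) if (length(p) > 1) paste(p[-1], collapse = " ") else p[1],
                  character(1))

result <- data.frame(speaker, comment, stringsAsFactors = FALSE)
```

With the sample input, `result` has four rows, and the "(Laughter)" row has `NA` in the speaker column. Hand-transcribed lines that lack a tab would be treated as continuations, so they would need their own rule.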
    
