Replace tabs and line breaks in R

【Question title】: replace tabs and line break R
【Posted】: 2018-11-17 12:59:33
【Question】:

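I am cleaning up a large text file to read into R. Almost every line is separated by a tab, but some long quotes also contain line breaks. I am using the tabs to split the document into a data frame with a speaker column and a comments column, and these line breaks are ruining my formatting: R treats each line as a new speaker, then reports the speaker as NA when it finds no tab. Below is an example of what I have: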

Interviewer: How are you?

Subject: I'm just incredibly frustrated.
*NA* Really, R is frustrating me.
*NA* But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

This is what I want:

Interviewer: How are you?

Subject: I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?

Interviewer: Fortunately, I have an answer for you.

This is how I read in the document:

atas <- stri_read_lines("ATAS2.txt") %>% str_replace_all("\t", "TABS_TO_BE_DELETED")

(FYI, I use that random string because R kept erasing the tabs when I turned the text document into a data frame.)

Now, to remove the line breaks, I have tried:

atas2 <- gsub("\r?\n|\r", " ", atas)

and

atas2 <- str_replace_all(atas, "\n" , " ")

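I also can't just strip all special characters or formatting to solve this. If I have to remove all non-alphanumeric characters, I need to keep the tabs (at least long enough to put some obscure string in their place so I can split on it later), as well as ?, ., [], () and :.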

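I would like it to ignore those line breaks or somehow merge the lines together. The one caveat to simply merging every line that has no match is that some lines stand on their own with no speaker, and these need to have no attribution in the speaker column, such as (but not limited to):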

(Laughter)

Interview 41

[Inaudible cross-talk]

Thanks for any help you can offer!

【Comments】:

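Are you trying to read tab-delimited data?
@Onyambu, no, it isn't tab-delimited. The transcription software we use to record interviews automatically inserts a tab between the speaker and their comment. In a few cases people transcribed by hand and there are no tabs. About 90% of the time the tabs are there as spacing in the document, but the document itself is not a tab-delimited file.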

Tags: r string text line-breaks stringr

【Solution 1】:

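You could take a slightly different approach and do something like this. Note that you usually have to double-escape special characters in R regular expressions (the first backslash is there to escape the second).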

library(stringr)

#read in text as a single string
text <- "Interviewer: How are you?
Subject: I'm just incredibly frustrated. 
    Really, R is frustrating me. 
    But maybe someone has a solution for me?
Interviewer: Fortunately, I have an answer for you."

#add `#` markers to separate text before and after speaker followed by colon 
text2 <- str_replace_all(text, "(\\w+?\\:)", "#\\1#")

#split at markers, remove first blank element, and cast as a 2-column data frame
text3 <- as.data.frame(matrix(str_split(text2, "#")[[1]][-1], ncol=2, byrow=TRUE))

#remove line breaks, tabs etc
text3$V2 <- str_replace_all(text3$V2, "[\\r\\n\\t]+", " ")

#remove excessive white space
text3$V2 <- str_trim(str_replace_all(text3$V2, "\\s+", " "))

text3
            V1                                                                                                    V2
1 Interviewer:                                                                                          How are you?
2     Subject: I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?
3 Interviewer:                                                                Fortunately, I have an answer for you.
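
On the double escaping mentioned at the top of this answer, here is a minimal illustration in base R only. The sample strings are made up for demonstration and are not part of the original answer.

```r
# In R source code, "\\." is a two-character string: a backslash and a dot.
# The regex engine then reads those two characters as "a literal dot".
pattern <- "\\."
nchar(pattern)                        # 2

# Escaped dot: only the literal "." characters are removed.
escaped <- gsub("\\.", "", "a.b.c")   # "abc"

# Unescaped dot: "." matches any character, so everything is removed.
unescaped <- gsub(".", "", "a.b.c")   # ""

# "\n" is already a real newline character inside the R string, while
# "\\n" reaches the regex engine as backslash + n, which it interprets
# as a newline. Both forms therefore match a line break.
one <- gsub("\n", " ", "x\ny")        # "x y"
two <- gsub("\\n", " ", "x\ny")       # "x y"
```

The same rule is why the patterns in this answer are written as "\\w", "\\s" and "[\\r\\n\\t]+" rather than with single backslashes.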


【Solution 2】:

If the output looks like what Andrew Gustar shows, you could do:

read.csv(text = gsub("\\n(?!\\w+:)", "", text, perl = TRUE), sep = ":", header = FALSE)
           V1                                                                                                     V2
1 Interviewer                                                                                           How are you?
2     Subject  I'm just incredibly frustrated. Really, R is frustrating me. But maybe someone has a solution for me?
3 Interviewer                                                                 Fortunately, I have an answer for you.
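
To see what the negative lookahead in this pattern does, here is a small sketch in base R. The `lines` vector is invented for illustration and stands in for what `readLines()` might return. Because `readLines()` drops the line breaks, the lines are first rejoined with "\n". A space is used as the replacement here (the answer uses an empty string) purely so that the merged words do not run together.

```r
# Hypothetical input: one speaker line, one continuation line, one speaker line.
lines <- c("A: one", "two", "B: three")

# Rejoin into a single string so there are newlines to match against.
sample_text <- paste(lines, collapse = "\n")

# "\\n(?!\\w+:)" matches a newline only when it is NOT followed by
# word characters and a colon, i.e. not followed by a speaker label.
# perl = TRUE is needed because lookaheads are a PCRE feature.
merged <- gsub("\\n(?!\\w+:)", " ", sample_text, perl = TRUE)

# The newline before "two" is replaced; the one before "B:" is kept.
merged_lines <- strsplit(merged, "\n", fixed = TRUE)[[1]]
merged_lines   # "A: one two" "B: three"
```

One limitation of this pattern: a continuation line that happens to begin with a word followed by a colon (for example "Note: ...") would be treated as a new speaker and left unmerged.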
