Splitting text using R

[Question title]: Splitting text using R
[Posted]: 2015-03-14 13:10:42
[Question description]:

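I have a data frame with a character variable that contains long paragraphs, which I need to split at positions determined by certain phrases. The problem is that in many cases these phrases run straight into the preceding word, with no separator.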

Here is what I am doing:

data  <- readLines(n=2)
= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name 
= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.

df  <- as.data.frame(data)

delim  <- c("= DAY 1 CHALLENGES = ",
            "= DAY 2 CHALLENGES = ",
            "= DAY 3 CHALLENGES = ",
            "= DAY 4 CHALLENGES = ",
            "= DAY 5 CHALLENGES = ",
            "= DAY 6 CHALLENGES = ")

y  <- data.frame(do.call('rbind',
                         strsplit(as.character(df$data), delim, fixed = FALSE)))
y
                               X1
1                                
2 = DAY 1 CHALLENGES = very high.
                                                                                    X2
1 syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name 
2                               Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.

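I would like to get the text that follows each = DAY x CHALLENGES = marker, up to the next such marker, as a separate variable.

Thanks!

Update, using the suggested approach: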

> a  <- scan(file ="~/Desktop/alm/a.txt", what="")
Read 1 item
> a
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. = DAY 5 CHALLENGES = Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 = DAY 6 CHALLENGES = Bla."
> b  <- scan(file ="~/Desktop/alm/b.txt", what="")
Read 1 item
> b
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 ?= DAY 6 CHALLENGES = Bla."
> c <- c(a,b)
> df  <- as.data.frame(c)
> lst <- strsplit(gsub(" (?=\\= DAY)", ".", c, perl=TRUE), 
+                 '(?<=[.)])(?=\\=)', perl=TRUE)
> out <-  do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
Warning message:
In (function (..., deparse.level = 1)  :
  number of rows of result is not a multiple of vector length (arg 2)
> out
     [,1]                                                                                                                                                                                                                                                                                
[1,] "very high."                                                                                                                                                                                                                                                                        
[2,] "Rank understand."                                                                                                                                                                                                                                                                  
[3,] "buy...."                                                                                                                                                                                                                                                                           
[4,] "result.."                                                                                                                                                                                                                                                                          
[5,] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5."
[6,] "Bla."                                                                                                                                                                                                                                                                              
     [,2]              
[1,] "very high."      
[2,] "Rank understand."
[3,] "buy...."         
[4,] "Bla." #this is not the value from the input file           
[5,] "very high." #this is missing in the input file, yet a value is getting output      
[6,] "Rank understand." #incorrect recognition of ?= DAY 6 CHALLENGES =; the same happens with := and != or similar

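The problems are pointed out in the comments of the output above. An indication of missing values would be useful, rather than having unrelated values inserted.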
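A note on the warning above: `cbind` recycles shorter vectors up to the length of the longest one, which is most likely where the unexpected values in the second column come from. A minimal sketch of padding with `NA` instead (the list `pieces` is made-up illustration data, not the actual input files):

```r
# Hypothetical split results of unequal length
pieces <- list(c("very high.", "Rank understand.", "buy....", "Bla."),
               c("very high.", "Rank understand."))

# Length of the longest element
n <- max(vapply(pieces, length, integer(1)))

# Extending a vector with `length<-` fills the new slots with NA
padded <- lapply(pieces, function(v) {
  length(v) <- n
  v
})

# All columns now have the same length, so nothing is recycled
out <- do.call(cbind, padded)
out
```

Padding only appends `NA` at the end of a shorter column; it does not tell you which marker was missing, so values after a gap can still sit in the wrong row.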

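[Question discussion]:

You will have to do better than that; please give us some sample input and some sample output.
You can refer to this link: stackoverflow.com/questions/5963269/…
@StefanPetkov Updated the post again. As I mentioned before, I have already spent quite some time on this post. If this does not work, you probably have not provided all the patterns, as I suggested...
I think it works, except for the cases with ":= DAY x =". Then it fails. But "?= DAY x =" works fine now.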

Tags: r strsplit

[Solution 1]:

This may help:

library(stringr)
str_extract_all(df$data, '= [A-Za-z]+ \\d+ [A-Za-z]+ = [A-Za-z ]+(\\.+| ;-\\)| \\.+| +)')
#[[1]]
#[1] "= DAY 1 CHALLENGES = syndicated." "= DAY 2 CHALLENGES = Red Sea."   
#[3] "= DAY 3 CHALLENGES = framework."  "= DAY 4 CHALLENGES = Did ;-)"    
#[5] "= DAY 5 CHALLENGES = Paste ..."   "= DAY 6 CHALLENGES = Name "      

#[[2]]
#[1] "= DAY 1 CHALLENGES = very high."      
#[2] "= DAY 2 CHALLENGES = Rank understand."
#[3] "= DAY 3 CHALLENGES = buy...."         
#[4] "= DAY 4 CHALLENGES = result."         
#[5] "= DAY 5 CHALLENGES = coffee."         
#[6] "= DAY 6 CHALLENGES = Bla."

Or using strsplit:

 lst <- strsplit(as.character(df$data), '(?<=[.)])(?=\\=)', perl=TRUE)
 lst
 #[[1]]
 #[1] "= DAY 1 CHALLENGES = syndicated." "= DAY 2 CHALLENGES = Red Sea."   
 #[3] "= DAY 3 CHALLENGES = framework."  "= DAY 4 CHALLENGES = Did ;-)"    
 #[5] "= DAY 5 CHALLENGES = Paste ..."   "= DAY 6 CHALLENGES = Name "      

 #[[2]]
 #[1] "= DAY 1 CHALLENGES = very high."      
 #[2] "= DAY 2 CHALLENGES = Rank understand."
 #[3] "= DAY 3 CHALLENGES = buy...."         
 #[4] "= DAY 4 CHALLENGES = result."         
 #[5] "= DAY 5 CHALLENGES = coffee."         
 #[6] "= DAY 6 CHALLENGES = Bla."

If you want to extract the strings syndicated., very high., etc.:

  do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
  #       [,1]          [,2]              
  #[1,] "syndicated." "very high."      
  #[2,] "Red Sea."    "Rank understand."
  #[3,] "framework."  "buy...."         
  #[4,] "Did ;-)"     "result."         
  #[5,] "Paste ..."   "coffee."         
  #[6,] "Name "       "Bla."
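
Regarding the original attempt in the question: when `split` has more than one element, `strsplit` recycles it along `x`, so the first string is split only on the first pattern and the second string only on the second. A sketch that folds all markers into a single regular expression instead (shortened sample data; it assumes every marker has the form `= DAY <number> CHALLENGES = `):

```r
# Shortened version of the sample data from the question
x <- c("= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.",
       "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....")

# One pattern that matches any of the markers (assumed marker format)
marker <- "= DAY [0-9]+ CHALLENGES = "
parts <- strsplit(x, marker)

# A match at the start of a string yields a leading "", so drop empty pieces
parts <- lapply(parts, function(p) p[nzchar(p)])

# One row per input string, one column per marker
res <- do.call(rbind, parts)
res
```

Because the markers themselves are the split pattern, this does not depend on which character (".", ")", "?", ":") happens to precede a marker. It still assumes every string contains the same number of markers.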

Update

Based on the updated string "a":

  lst <- strsplit(gsub(" (?=\\= DAY)", ".", a, perl=TRUE), 
                         '(?<=[.)])(?=\\=)', perl=TRUE)
  out <-  do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
  out[,1]
  #[1] "very high."                                                                                                                                                                                                                                                                        
  #[2] "Rank understand."                                                                                                                                                                                                                                                                  
  #[3] "buy...."                                                                                                                                                                                                                                                                           
  #[4] "result.."                                                                                                                                                                                                                                                                          
  #[5] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5."
  #[6] "Bla."

Update 2

I tried again on c (renaming the object to c1, because c is a function in R):

  c1 <- c(a,b)
  c2 <- gsub("( |\\?)(?=\\= DAY)|\\.com. (?=DAY)", " .", c1, perl=TRUE)
  lst <- strsplit(c2, '(?<=[.)])(?=(\\=|DAY))', perl=TRUE)
  lst2 <- lapply(lst, function(x) unname(unlist(tapply(x,
      gsub('.*?DAY (\\d+).*', '\\1', x), FUN=paste, collapse= ' '))))
  out <- do.call(cbind,lapply(lst2, function(x)
       sub('^=[^=:]+(\\=|:) ', '', sub('^(?=DAY)', '= ', x, perl=TRUE))))

  out[,1]
  #[1] "very high."                                                                                                                                                                                                                                                                      
  #[2] "Rank understand."                                                                                                                                                                                                                                                                
  #[3] "buy...."                                                                                                                                                                                                                                                                         
  #[4] "result. ."                                                                                                                                                                                                                                                                       
  #[5] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc . DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 ."
  #[6] "Bla."                                                                                               

 out[,2]
 #[1] "very high."                                                                                                                                                                             
 #[2] "Rank understand."                                                                                                                                                                       
 #[3] "buy...."                                                                                                                                                                                
 #[4] "result. Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc ."                                                                                                       
 #[5] "Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5  ."
 #[6] "Bla."

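[Discussion]:

I realize I am probably being a nuisance by now, but I will keep asking. The last update works, but it revealed that pieces are missing in my data. For example, if = DAY 6 CHALLENGES = is missing from one entry, the extraction order for the whole dataset gets scrambled... Can this be remedied somehow? Also, the strsplit() approach almost works, but it breaks when there is := DAY... or ?= DAY...
@StefanPetkov Please make sure to update your original post with all the possible scenarios. That way it is easier to solve everything in one go, rather than being surprised each time.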
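
One possible way to keep the rows aligned when a marker is missing is to read the day number out of each marker and use it as the row index, leaving `NA` where a day does not occur. This is only a sketch under assumptions: the helper name `split_by_day` and the sample strings are hypothetical, every marker is assumed to look like `= DAY <number> CHALLENGES = `, and each day is assumed to appear at most once per string.

```r
# Hypothetical helper: one row per day, one column per input string
split_by_day <- function(x, n_days = 6) {
  marker <- "= DAY [0-9]+ CHALLENGES = "   # assumed marker format
  out <- matrix(NA_character_, nrow = n_days, ncol = length(x),
                dimnames = list(paste0("DAY", seq_len(n_days)), NULL))
  for (j in seq_along(x)) {
    m <- gregexpr(marker, x[j])
    heads <- regmatches(x[j], m)[[1]]                    # the markers found
    days <- as.integer(gsub("[^0-9]", "", heads))        # their day numbers
    # Text between markers; the first piece precedes the first marker
    pieces <- regmatches(x[j], m, invert = TRUE)[[1]][-1]
    keep <- days <= n_days                               # ignore unexpected days
    out[days[keep], j] <- trimws(pieces[keep])
  }
  out
}

# Made-up input: the second string has no DAY 2 marker and a "?" before DAY 3
x <- c("= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand. = DAY 3 CHALLENGES = buy....",
       "= DAY 1 CHALLENGES = very high.?= DAY 3 CHALLENGES = Bla.")
res <- split_by_day(x, n_days = 3)
res
```

Since the split happens on the marker itself, a preceding "?" or ":" simply stays at the end of the previous piece instead of breaking the split. Variants such as `DAY 5 CHALLENGE:` are not matched by this pattern and remain part of the surrounding text.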