【Title】: R - Remove all line breaks between repeating character
【Posted】: 2021-11-29 01:14:00
【Question】:

I'm currently doing data cleaning for sentiment analysis, working with a large dataset of news articles in data-frame form. I need to be able to analyze one article per row of the data frame, so I'm looking for a way to remove the line breaks between the first "======" and the second "======", repeated throughout the whole data frame. Additionally, after the content is "collapsed into itself", I'd like to keep the publisher and date columns.

df <- matrix(c(
  "======",             "NA",          "NA",
  "Daily Bugle Dec 31", "Daily Bugle", "Dec 31",
  "Wookies are",        "NA",          "NA",
  ". recreationally",   "NA",          "NA",
  "using drugs at a",   "NA",          "NA",
  "higher rate than",   "NA",          "NA",
  "ever before.",       "NA",          "NA",
  "======",             "NA",          "NA"
), ncol = 3, byrow = TRUE)
colnames(df) <- c("content","publisher","date")
df <- as.data.frame(df)
df[ df == "NA" ] <- NA

Which gives this:

content             publisher    date
======              <NA>         <NA>
Daily Bugle Dec 31  Daily Bugle  Dec 31
Wookies are         <NA>         <NA>
. recreationally    <NA>         <NA>
using drugs at a    <NA>         <NA>
higher rate than    <NA>         <NA>
ever before.        <NA>         <NA>
======              <NA>         <NA>

And I want something like this:

content                                             publisher    date
======
Wookies are recreationally using drugs at a hig...  Daily Bugle  Dec 31
======
Article 2
======
Article 3
======

Hopefully that's clear. I'm fairly new to R.

【Comments】:

  • You can improve your chances of finding help here by adding a minimal reproducible example. Adding an MRE and an example of the desired output (as code, not as tables or pictures) makes it much easier for others to find and test answers to your question. That way you help others help you! P.S. Here is a good overview on how to ask a good question.
  • Thanks for the tip, Dario! I'm new here, so all help is appreciated. I'll edit this into a better version of the question.
  • At some point you'll want gsub('[\\.]', '', df1$content), as '.' won't contribute much to sentiment analysis.

Tags: r string dataframe


【Solution 1】:
  • Every article starts with '===', so that can be used to number the articles.
  • Drop the first value of content for each article.
  • Keep the first value of publisher and date.
library(dplyr)

df %>%
  mutate(article_no = cumsum(grepl('===', content))) %>%
  filter(!grepl('===', content)) %>%
  group_by(article_no) %>%
  summarise(content = paste0(content[-1], collapse = ''), 
            publisher = publisher[1], 
            date = date[1])

#  article_no content                                                                 publisher   date  
#       <int> <chr>                                                                   <chr>       <chr> 
#1          1 Wookies are. recreationallyusing drugs at ahigher rate thanever before. Daily Bugle Dec 31

【Discussion】:

  • For further sentiment analysis, should summarise use collapse = ' ', i.e. a space?
  • I love this site! I found this a more efficient way of solving the problem than my approach above, but both work great. Thank you so much! You and Marek are both invited to my future wedding.
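Following up on the comments above, here is a small variant of this pipeline (a sketch, not part of the answer as posted): collapsing with a space instead of '' and stripping the stray periods, as suggested in the question's comments, gives cleaner text for sentiment analysis.

```r
library(dplyr)

# Rebuild the single-article example from the question
df <- data.frame(
  content   = c("======", "Daily Bugle Dec 31", "Wookies are",
                ". recreationally", "using drugs at a",
                "higher rate than", "ever before.", "======"),
  publisher = c(NA, "Daily Bugle", rep(NA, 6)),
  date      = c(NA, "Dec 31", rep(NA, 6))
)

result <- df %>%
  mutate(article_no = cumsum(grepl("===", content))) %>%
  filter(!grepl("===", content)) %>%
  group_by(article_no) %>%
  summarise(
    # collapse with a space instead of '', then drop the periods
    content   = gsub("\\.", "", paste(content[-1], collapse = " ")),
    publisher = publisher[1],
    date      = date[1]
  ) %>%
  # squeeze the double spaces left behind by the removed periods
  mutate(content = trimws(gsub("\\s+", " ", content)))

result$content
# "Wookies are recreationally using drugs at a higher rate than ever before"
```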
【Solution 2】:

To help you, first I need to prepare some data.

library(tidyverse)
articles = read.table(
  header = TRUE,sep = ",",text="
content,publisher,date
======,NA,NA
Daily News Dec 27,Daily News,Dec 27
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 28,Daily News,Dec 28
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 30,Daily News,Dec 30
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily Bugle Dec 31,Daily Bugle,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Weekly News Dec 31,Weekly News,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA") %>%
  as_tibble() %>% 
  mutate(publisher = ifelse(publisher=="NA", NA, publisher),
         date = ifelse(date=="NA", NA, date))
articles

Output:

# A tibble: 52 x 3
   content           publisher  date  
   <chr>             <chr>      <chr> 
 1 ======            NA         NA    
 2 Daily News Dec 27 Daily News Dec 27
 3 Wookies are       NA         NA    
 4 . recreationally  NA         NA    
 5 using drugs at a  NA         NA    
 6 higher rate than  NA         NA    
 7 using drugs at a  NA         NA    
 8 higher rate than  NA         NA    
 9 using drugs at a  NA         NA    
10 higher rate than  NA         NA    
# ... with 42 more rows

I hope this matches the format of your data. To me, these are five articles.

Now let's add a conversion function and a simple mutate.

fConvert = function(data) tibble(
  publisher = data$publisher[2],
  date = data$date[2],
  content = data %>% slice(3:(nrow(.)-1)) %>% 
    pull(content) %>% paste(collapse = " ")
)

articles %>% mutate(
  idArticle = ifelse(!is.na(publisher),1, 0) %>% 
    cumsum() %>% lead(default=.[length(.)]) 
) %>% group_by(idArticle) %>% 
  nest() %>% 
  group_modify(~fConvert(.x$data[[1]]))

Output:

# A tibble: 5 x 4
# Groups:   idArticle [5]
  idArticle publisher   date   content                                                                                            
      <dbl> <chr>       <chr>  <chr>                                                                                              
1         1 Daily News  Dec 27 Wookies are . recreationally using drugs at a higher rate than using drugs at a higher rate than u~
2         2 Daily News  Dec 28 Wookies are . recreationally using drugs at a higher rate than ever before. ever before. ever befo~
3         3 Daily News  Dec 30 Wookies are . recreationally using drugs at a higher rate than ever before. ever before.           
4         4 Daily Bugle Dec 31 Wookies are . recreationally using drugs at a higher rate than ever before.                        
5         5 Weekly News Dec 31 Wookies are . recreationally higher rate than ever before.     

As you can see, I was able to extract all five articles despite their varying lengths and merge all the rows into a single content value. I hope this is what you meant.
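For comparison, the same separator-counting idea can be sketched in base R, without dplyr or tidyr. This assumes, as in the sample data above, that every article is wrapped in its own pair of '======' rows and that the first row inside each article carries the publisher and date; the small two-article sample below is made up for illustration.

```r
# A small made-up two-article sample in the same shape as the data above
sample_articles <- data.frame(
  content   = c("======", "Daily News Dec 27", "Wookies are", "ever before.",
                "======",
                "======", "Daily Bugle Dec 31", "Higher rate", "than ever.",
                "======"),
  publisher = c(NA, "Daily News", NA, NA, NA,
                NA, "Daily Bugle", NA, NA, NA),
  date      = c(NA, "Dec 27", NA, NA, NA,
                NA, "Dec 31", NA, NA, NA)
)

sep  <- grepl("^======$", sample_articles$content)
grp  <- cumsum(sep)               # how many separators seen so far
keep <- !sep                      # rows that belong to some article
art  <- (grp[keep] + 1) %/% 2     # pairs of separators -> article number

result <- do.call(rbind, lapply(split(which(keep), art), function(idx) {
  data.frame(
    publisher = sample_articles$publisher[idx[1]],  # header row of the article
    date      = sample_articles$date[idx[1]],
    content   = paste(sample_articles$content[idx[-1]], collapse = " ")
  )
}))

result$content
# "Wookies are ever before." "Higher rate than ever."
```

The key design choice is the same as in both answers: turn the separator rows into a running group id with cumsum, then aggregate per group.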

【Discussion】:

  • Thank you so much for your help! I tried it and it works great! I would upvote your comment, but apparently you need at least 15 reputation to do that.
  • Welcome to Stack Overflow! I'm glad I could help. I understand; a lot of things about Stack Overflow itself are confusing at first. I started only a few months ago myself and remember exactly how confused I was: reputation points, replying to comments, flags, badges, and so on. It's easy to get things wrong. As for the "this answer was useful" flag, it's actually not about me, but you do need 15 reputation points for it.
  • However, you can always change your mind and mark a different answer as accepted; that doesn't require 15 reputation points. Of course, I'm not trying to push you into anything. Decide for yourself what is clearest and most useful to you. If you want to earn reputation points quickly, also try the other sites, e.g. Cross Validated or any other StackExchange expert community.