【Question title】: Count Frequency of Words in Text and create Plot
【Posted】: 2017-08-03 13:29:57
【Question description】:

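I have a data frame with 40802 gene names, and another data frame with information on 14000 articles. The article information contains Article, Abstract, Day, Month, Year.

I have converted the date into a normal format and the abstract to character.

I want a plot with time on the x-axis showing how often each gene name appears in the abstracts. For example: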

| Date       | Gene Name | Frequency |
|------------|-----------|-----------|
| 2017-03-20 | GAPDH     | 5         |
| 2017-03-21 | AKT       | 6         |

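Basically, I want to know which gene names were published most frequently in the last 100 days, and have a timeline to see how those gene names evolve. Something like a trend.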

library(RISmed)

##Search topic - can be anything relevant to protein expression.
##Multiple search terms not tested yet

search_topic <- 'protein expression'

##Evaluate the query with reldate = days before today, retmax = maximum number of returned results

search_query <- EUtilsSummary(search_topic, retmax=15000, reldate = 100)
##explore the outcome

summary(search_query)

##get the ids for all the queries to get the articles

QueryId(search_query)
##get all the records associated with the ID - THIS TAKES LOOONG TIME

records<- EUtilsGet(search_query)

##Analyze the structure
str(records)

summary(records)

##Create a data frame with article/abstract/date

pubmed_data <- data.frame('Title'=ArticleTitle(records),'Abstract'=AbstractText(records),
                             "Day"=DayPubmed(records), "Month" = MonthPubmed(records), "Year"=YearPubmed(records))
##explore the data
head(pubmed_data,1)
##gene names
genename <- read.csv("genename.csv", header = T, stringsAsFactors = F)

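##remove any NA titles (logical indexing also works when no title is NA)

pubmed <- pubmed_data[!is.na(pubmed_data$Title), ]
##Coerce the date to YYYY-MM-DD (include the year, otherwise as.Date() assumes the current one)

pubmed$Date <- as.Date( paste( pubmed$Year , pubmed$Month , pubmed$Day , sep = "-" ) , format = "%Y-%m-%d" )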

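I have been reading a lot, but I can't work out how to find genename[1,1] in pubmed$Abstract and count how many times it appears over time. I want to plot a graph where X is the last 100 days, the line plot is the frequency of the gene name, and the legend is the gene name, so that the trend can be observed.

I would really appreciate any ideas on how to do this.

I tried tm and many other things, but I keep hitting a wall. Is my concept wrong?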

【Question discussion】:

    Tags: r text bioinformatics biometrics mining


    【Solution 1】:
    # from: https://stackoverflow.com/questions/45485701/count-frequency-of-words-in-text-and-create-plot
    # get some text
    txt <- c("I have a list of data frame with 40802 gene names and I have data frame with 14000 article information. 
    The article information contains Article, Abstract, Day, Month, Year.I have transformed the date into normal format, 
    and the abstract as character. I want to have a plot of X in time, and the frequency of the gene names appears in the abstract.
    Basically, I want to know the gene names most frequently published in the last 100 days and have a timeline to see the evolution of said genenames. 
    Something like a trend.")
    
    # cut the text into short chunks (about 20 characters each) for an example data frame
    txt <- strwrap(x = txt,width = 20)
    # create some data frame
    pubmed_data <- data.frame(Title=abbreviate(names.arg = txt,minlength = 5,method = "left.kept",named = F),Abstract=txt,stringsAsFactors = F)
    pubmed_data
    
    # tm package
    library(tm)
    wrds <- termFreq(doc = pubmed_data$Abstract,control = list(tolower=TRUE,removePunctuation=TRUE,removeNumbers=TRUE))
    wrds <- sort(unclass(wrds),decreasing = T)
    wrds <- data.frame(tokens=names(wrds),n=as.integer(wrds))
    wrds$tokens <- reorder(wrds$tokens,wrds$n)
    
    library(ggplot2)
    ggplot(data = wrds,aes(x = tokens,y = n,fill=n))+geom_bar(stat="identity")+scale_y_continuous(breaks = 1:max(wrds$n))+
      coord_flip()
    
    
    # tidy packages 
    library(tidytext)
    library(dplyr)
    wrds2 <- pubmed_data %>% select(-Title) %>% unnest_tokens(input = "Abstract",output = "tokens",to_lower = T) %>% 
      filter(grepl(pattern="\\D+",x=.$tokens)) %>% group_by(tokens) %>%
      count %>% ungroup %>% mutate(tokens=reorder(tokens,n))
    
    ggplot(data = wrds2,aes(x = tokens,y = n,fill=n))+geom_bar(stat="identity")+scale_y_continuous(breaks = 1:max(wrds2$n))+
      coord_flip()
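
The code above counts every token in the text, whereas the question asks for counts of specific gene names per date. A minimal sketch of that step follows, in base R with an optional ggplot2 line plot. It is not part of the original answer. The toy `pubmed` data frame, the `genes` vector and the helper `count_gene()` are made up for illustration. The sketch also assumes gene symbols are plain alphanumeric strings (no regex metacharacters) and that matching should be case-sensitive and whole-word.

```r
# Toy stand-in for the real data; every value here is invented for illustration.
pubmed <- data.frame(
  Date = as.Date(c("2017-03-20", "2017-03-20", "2017-03-21", "2017-03-21")),
  Abstract = c(
    "GAPDH was used as a loading control; GAPDH levels were stable.",
    "AKT phosphorylation increased after treatment.",
    "AKT and GAPDH expression were compared. AKT was higher.",
    NA
  ),
  stringsAsFactors = FALSE
)
genes <- c("GAPDH", "AKT", "TP53")

# Hypothetical helper: number of whole-word matches of one gene symbol
# in each abstract. Missing abstracts count as zero matches.
count_gene <- function(gene, text) {
  text[is.na(text)] <- ""
  hits <- gregexpr(paste0("\\b", gene, "\\b"), text, perl = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One row per (abstract, gene), then sum the matches per date and gene.
counts <- do.call(rbind, lapply(genes, function(g) {
  data.frame(Date = pubmed$Date, Gene = g,
             n = count_gene(g, pubmed$Abstract),
             stringsAsFactors = FALSE)
}))
freq <- aggregate(n ~ Date + Gene, data = counts, FUN = sum)
freq <- freq[freq$n > 0, ]

# Keep only the most frequently mentioned genes so the legend stays readable.
totals <- tapply(freq$n, freq$Gene, sum)
top_genes <- names(head(sort(totals, decreasing = TRUE), 10))
trend <- freq[freq$Gene %in% top_genes, ]

# Optional line plot: one line per gene, date on the x-axis.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(trend, aes(x = Date, y = n, colour = Gene)) +
    geom_line() + geom_point() +
    labs(x = "Date", y = "Mentions in abstracts", colour = "Gene name")
  if (interactive()) print(p)
}
```

With the real data, `pubmed` would be the data frame built in the question and `genes` would come from the gene name file, for example its first column if that is where the symbols are stored. Looping over roughly 40000 symbols and 14000 abstracts this way may be slow. Tokenising the abstracts once and keeping only the tokens that appear in the gene list is a possible alternative.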
    

    【Discussion】:
