[Title]: Executing a script on a list of files in a directory in R
[Posted]: 2013-12-21 16:45:12
[Question]:

I have a routine that lets me convert a PDF file to a TXT file using R. How can I apply it to a whole directory of PDF files that I want to convert to TXT?

Here is my current code, which only works on a single URL pointing at a PDF document:

# download pdftotxt from 
# ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-win-3.03.zip
# and extract to your program files folder

# here is a pdf for mining
url <- "http://www.noisyroom.net/blog/RomneySpeech072912.pdf"
dest <- tempfile(fileext = ".pdf")
download.file(url, dest, mode = "wb")

# set path to pdftotxt.exe and convert pdf to text
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
system(paste0('"', exe, '" "', dest, '"'), wait = FALSE)

# get txt-file name and open it  
filetxt <- sub("\\.pdf$", ".txt", dest)  # anchor the pattern so only the extension is replaced
shell.exec(filetxt); shell.exec(filetxt)    # strangely the first try always throws an error..


# do something with it, i.e. a simple word cloud 
library(tm)
library(wordcloud)
library(Rstem)

txt <- readLines(filetxt) # don't mind warning..

txt <- tolower(txt)
txt <- removeWords(txt, c("\\f", stopwords()))

corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

# Stem words
d$stem <- wordStem(row.names(d), language = "english")

# and put words to column, otherwise they would be lost when aggregating
d$word <- row.names(d)

# remove web address (very long string):
d <- d[nchar(row.names(d)) < 20, ]

# aggregate frequency by word stem and
# keep first words..
agg_freq <- aggregate(freq ~ stem, data = d, sum)
agg_word <- aggregate(word ~ stem, data = d, function(x) x[1])

d <- cbind(freq = agg_freq[, 2], agg_word)

# sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]

# print wordcloud:
wordcloud(d$word, d$freq)

# remove files
file.remove(dir(tempdir(), full.names = TRUE)) # remove files

[Comments]:

  • lapply and list.files?
  • There are several threads on this already. This one is very close to what you're after: stackoverflow.com/questions/20083454/run-every-file-in-a-folder/… You should turn your script into a function and pass it to sapply.
  • @RomanLuštrik Thanks for the hint! But how do I apply this method to a directory of files rather than a vector of URLs?
  • Find the files via list.files and pass the result (you will probably want the full.names argument) to sapply. You will need to modify the crawlPDFs function slightly: there is no need to download the files.
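The approach from the last comment can be sketched like this (a minimal sketch: convertPDF is a hypothetical name for a function wrapping the single-file code above, and "C:/pdfs" is an assumed directory, not part of the original post):

# find all pdf files in a directory; full.names = TRUE returns full paths
pdfs <- list.files("C:/pdfs", pattern = "\\.pdf$", full.names = TRUE)

# apply the conversion function to each file
sapply(pdfs, FUN = convertPDF)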

Tags: r pdf directory text-mining tm


[Solution 1]:

If you have a list (really a vector) of URLs of the files you want to process, you can turn your procedure into a function and apply that function to each URL. Try something like this:

crawlPDFs <- function(x) {
  # x is a character string to the url on the web
  url <- x
  dest <- tempfile(fileext = ".pdf")
  download.file(url, dest, mode = "wb")

  # set path to pdftotxt.exe and convert pdf to text
  exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
  system(paste0('"', exe, '" "', dest, '"'), wait = FALSE)

  # get txt-file name and open it  
  filetxt <- sub("\\.pdf$", ".txt", dest)  # anchor the pattern so only the extension is replaced
  shell.exec(filetxt); shell.exec(filetxt)    # strangely the first try always throws an error..


  # do something with it, i.e. a simple word cloud 
  library(tm)
  library(wordcloud)
  library(Rstem)

  txt <- readLines(filetxt) # don't mind warning..

  txt <- tolower(txt)
  txt <- removeWords(txt, c("\\f", stopwords()))

  corpus <- Corpus(VectorSource(txt))
  corpus <- tm_map(corpus, removePunctuation)
  tdm <- TermDocumentMatrix(corpus)
  m <- as.matrix(tdm)
  d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

  # Stem words
  d$stem <- wordStem(row.names(d), language = "english")

  # and put words to column, otherwise they would be lost when aggregating
  d$word <- row.names(d)

  # remove web address (very long string):
  d <- d[nchar(row.names(d)) < 20, ]

  # aggregate frequency by word stem and
  # keep first words..
  agg_freq <- aggregate(freq ~ stem, data = d, sum)
  agg_word <- aggregate(word ~ stem, data = d, function(x) x[1])

  d <- cbind(freq = agg_freq[, 2], agg_word)

  # sort by frequency
  d <- d[order(d$freq, decreasing = TRUE), ]

  # print wordcloud:
  wordcloud(d$word, d$freq)

  # remove files
  file.remove(dir(tempdir(), full.names = TRUE)) # remove files
}

sapply(list.of.urls, FUN = crawlPDFs) 

list.of.urls can be a character vector or a list in which each element is a character string, i.e. the URL of a PDF.
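To run this over a directory of local PDFs instead of URLs, the download step can simply be dropped, as suggested in the comments. A minimal sketch (convertLocalPDF and the directory "C:/pdfs" are illustrative assumptions, not part of the original answer):

# variant of crawlPDFs for files already on disk: x is a file path, not a URL
convertLocalPDF <- function(x) {
  exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
  system(paste0('"', exe, '" "', x, '"'), wait = FALSE)
  # pdftotext writes the .txt file next to the source pdf;
  # return that path so the caller can read it afterwards
  sub("\\.pdf$", ".txt", x)
}

pdfs <- list.files("C:/pdfs", pattern = "\\.pdf$", full.names = TRUE)
txtfiles <- sapply(pdfs, convertLocalPDF)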

[Discussion]:
