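[Posted]: 2017-04-21 15:20:44
[Question]:
I am trying to build an automated process that extracts tables from annual PDF reports. Ideally, I could take each year's report, extract the data from its tables, combine all years into one large data frame, and then analyze it. Here is what I have so far (focusing on just one year's report):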
library(pdftools)
library(data.table)
library(dplyr)
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/State%20Expenditure%20Report%20(Fiscal%202014-2016)%20-%20S.pdf", "nasbo14_16.pdf", mode = "wb")
txt14_16 <- pdf_text("nasbo14_16.pdf")
## convert txt14_16 to data frame for analyzing
data <- toString(txt14_16[56])
data <- read.table(text = data, sep = "\n", as.is = TRUE)
data <- data[-c(1, 2, 3, 4, 5, 6, 7, 14, 20, 26, 34, 47, 52, 58, 65, 66, 67), ]
data <- gsub("[,]", "", data)
data <- gsub("[$]", "", data)
data <- gsub("\\s+", ",", gsub("^\\s+|\\s+$", "",data))
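My problem is turning this raw table data into a data frame that has one row per state and one column for each of its values. I'm sure the solution is simple, but I'm fairly new to R! Any help?
Edit: All of these solutions are great and work well. However, when I tried another year's report, I got some errors: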
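To make the target shape concrete, here is a minimal base R sketch of one way to get one row per state. The sample lines and the `value1`, `value2`, ... column names are hypothetical (the figures are invented, not taken from the report), and it assumes that columns are separated by at least two spaces and that every row has the same number of fields. Splitting on runs of two or more spaces, rather than on any whitespace, keeps multi-word state names such as "New Hampshire" in one piece.

```r
# Hypothetical sample lines standing in for one page of pdf_text() output
# after splitting on "\n"; the numbers are invented for illustration.
lines <- c(
  "  Connecticut        $3,584      $0     $513",
  "  New Hampshire*      1,005       0      171",
  "  Rhode Island        1,100      12      200"
)

# Split each trimmed line on runs of 2+ spaces, so a single space inside
# a state name does not start a new field.
parts <- strsplit(trimws(lines), "\\s{2,}")

# Assumption: every row has the same number of fields.
mat <- do.call(rbind, parts)

# First field is the state name (drop footnote asterisks);
# the remaining fields are numbers with "$" and "," stripped.
state <- sub("*", "", mat[, 1], fixed = TRUE)
nums <- apply(mat[, -1, drop = FALSE], 2,
              function(x) as.numeric(gsub("[$,]", "", x)))

df <- data.frame(state = state, nums, stringsAsFactors = FALSE)
names(df)[-1] <- paste0("value", seq_len(ncol(df) - 1))
df
```

On a real page, the header, subtotal, and footnote lines would still need to be dropped first, as the `data[-c(...)]` step above does.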
: ' 0' does not exist in current working directory ('C:/Users/joshua_hanson/Documents').
after trying this code for the next report:
## convert txt09_11 to data frame for analyzing
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/2010%20State%20Expenditure%20Report.pdf", "nasbo09_11.pdf", mode = "wb")
txt09_11 <- pdf_text("nasbo09_11.pdf")
df <- txt09_11[54] %>%
read_lines() %>% # separate lines
grep('^\\s{2}\\w', ., value = TRUE) %>% # select lines with states, which start with space, space, letter
paste(collapse = '\n') %>% # recombine
read_fwf(fwf_empty(.)) %>% # read as fixed-width file
mutate_at(-1, parse_number) %>% # make numbers numbers
mutate(X1 = sub('*', '', X1, fixed = TRUE)) # get rid of asterisks in state names
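One possible cause of the "does not exist in current working directory" message, offered as an assumption rather than a confirmed diagnosis: readr only treats a string as literal data when it contains at least one newline, and otherwise interprets it as a file path. If the `grep()` pattern matches at most one line on this page, the pasted string has no newline. The base R sketch below checks for that on hypothetical page lines (invented for illustration, not taken from the 2010 report) whose state rows are indented differently from the 2014-2016 layout.

```r
# Hypothetical lines standing in for strsplit(txt09_11[54], "\n")[[1]];
# invented for illustration, with state rows indented by five spaces.
page_lines <- c(
  "    Table 5",
  "     Connecticut     3,584     0",
  "     Maine           1,005     0",
  "  0"
)

# The pattern from the pipeline above: exactly two leading spaces, then a
# word character (which also matches digits).
kept <- grep("^\\s{2}\\w", page_lines, value = TRUE)
pasted <- paste(kept, collapse = "\n")

# Diagnostics: how many lines survived, and does the result contain a newline?
n_kept <- length(kept)
has_newline <- grepl("\n", pasted, fixed = TRUE)

# A looser pattern (any leading whitespace, then a letter) as one possible
# adjustment; header lines it also catches would need separate filtering.
kept_loose <- grep("^\\s+[A-Za-z]", page_lines, value = TRUE)
```

If `has_newline` is `FALSE` on the real page, the filter pattern rather than `read_fwf()` is the thing to adjust; printing `txt09_11[54]` with `cat()` shows the actual indentation.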
[Comments]:
Tags: r dataframe tostring gsub read.table