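【Question title】:Why can't I clean pdf table and rename columns as a function?
【Posted】:2020-08-28 09:48:49
【Question description】:

I figured out how to scrape this PDF, but I have a lot of these files to go through. My intention is to set this up as a function that imports the data from all of the PDF files (the PDFs cover several years), then use rbind() to build one data table that I can write out as a CSV.

This works.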

library(tidyverse)
library(tabulizer)

#import the data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")

#create data frame
cleanNvsen <- do.call(rbind, jan16s_raw)
cleanNvsen2 <-as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])

#rename all of the columns
names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"

#check to see if it worked
head(cleanNvsen2)

But this produces a 1 x 1 data frame.

library(tidyverse)
library(tabulizer)

#load data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")

#create function to create data frame and then rename 
clean <- function(x) {
cleanNvsen <- do.call(rbind, x)
cleanNvsen2 <-as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])

names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"
}

x2 <- clean(jan16s_raw)

head(x2)

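I would really like to get this working so that I can give R the URLs and then run the clean function I created. I have dozens of files to process.

【Question discussion】:

    Tags: r pdf scrape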


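    【Solution 1】:

    You can write the clean function so that it extracts the data and renames the columns. The version in the question returns only a single value because an R function returns the value of its last expression, and there that is the assignment names(cleanNvsen2)[8] <- "Total", which evaluates to "Total"; returning the data frame explicitly fixes this. The columns can also be renamed in one step rather than one at a time.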

    clean <- function(url) {
      jan16s_raw <- extract_tables(url)
      #create data frame
      cleanNvsen <- do.call(rbind, jan16s_raw)
      cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
      #rename all of the columns
      names(cleanNvsen2) <- c("District", "Democrat", "Independent American", 
                      "Libertarian","Nonpartisan","Other","Republican","Total")
    
      return(cleanNvsen2)
    }
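
To see why ending the function with the data frame matters, here is a minimal sketch that needs neither tabulizer nor a network connection. The matrix `m` and both function names are made up for illustration; `m` only stands in for the combined matrix that `do.call(rbind, ...)` would produce.

```r
# Hypothetical stand-in for the combined table: two header rows, two data rows
m <- matrix(c("hdr", "hdr", "1", "2",
              "hdr", "hdr", "10", "20"), ncol = 2)

# Ends with an assignment, so the function returns the assigned value
clean_no_return <- function(x) {
  df <- as.data.frame(x[3:nrow(x), ])
  names(df)[2] <- "Total"
}

# Ends with the data frame itself, so that is what the caller receives
clean_with_return <- function(x) {
  df <- as.data.frame(x[3:nrow(x), ])
  names(df) <- c("District", "Total")
  df
}

bad  <- clean_no_return(m)
good <- clean_with_return(m)

print(bad)    # the string assigned on the last line, not a data frame
print(good)   # the renamed data frame
```

Writing `return(df)` or just `df` as the last line is equivalent here; the point is only that the last evaluated expression must be the object you want back.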
    

    Create a vector containing all of the URLs you want to extract data from.

    list_of_urls <- c('https://www.nvsos.gov/sos/home/showdocument?id=4062', 
                      'https://www.nvsos.gov/sos/home/showdocument?id=4064')
    

    Then call the clean function for each URL and combine the results.

    all_data <- purrr::map_df(list_of_urls, clean)
    #OR
    #all_data <- do.call(rbind, lapply(list_of_urls, clean))
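
Since the goal in the question is a single CSV, the combine step can be followed by a type conversion and `write.csv()`. The sketch below is only an illustration under assumptions: `clean_tables()` is a hypothetical variant of `clean()` that takes already-extracted matrices instead of a URL, the two "files" are made-up data with three columns, and it assumes the extracted cells arrive as text that may contain thousands separators.

```r
# Hypothetical variant of clean(): works on a list of character matrices
clean_tables <- function(tables) {
  combined <- do.call(rbind, tables)
  # Drop the two header rows; drop = FALSE keeps a matrix even for one row
  df <- as.data.frame(combined[3:nrow(combined), , drop = FALSE],
                      stringsAsFactors = FALSE)
  names(df) <- c("District", "Democrat", "Total")
  df
}

# Made-up stand-ins for two PDFs (the second one spans two pages)
file_a <- list(matrix(c("h", "h", "D1", "h", "h", "10", "h", "h", "25"), ncol = 3))
file_b <- list(
  matrix(c("h", "h", "D2", "h", "h", "4", "h", "h", "12"), ncol = 3),
  matrix(c("D3", "7", "9"), ncol = 3)
)

# Clean each file and stack the results into one data frame
all_data <- do.call(rbind, lapply(list(file_a, file_b), clean_tables))

# Convert the count columns from text to numbers, stripping any commas
count_cols <- setdiff(names(all_data), "District")
all_data[count_cols] <- lapply(all_data[count_cols],
                               function(x) as.numeric(gsub(",", "", x)))

# Write the combined table to a CSV (a temporary file in this sketch)
out_file <- tempfile(fileext = ".csv")
write.csv(all_data, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
```

With real data you would replace the stand-ins with the result of calling `clean` on each URL and point `out_file` at the path you actually want.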
    

    【Discussion】:
