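Along with having a flexible toolkit, data science also regularly requires out-of-the-box thinking (at least in my profession it does).
But first, a word about PDF files.
I don't think they are what you think they are. "Bold" (or "italic", etc.) is not "metadata". You should spend some time reading up on PDF files, since they're complex, nasty, evil things that you're likely to encounter frequently when working with data. Read https://stackoverflow.com/a/19777953/1457051 to see what finding bold text actually entails (follow the link to the 1.8.x Java pdfbox solution).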
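To make the "bold is not metadata" point a bit more concrete: in practice, boldness usually has to be inferred from which font a run of text uses. Here is a minimal sketch of that kind of inference, assuming you already have a list of font names. The names below are made up for illustration, and matching on "bold" is only a heuristic, not how any particular tool decides.

```r
# Hypothetical font names of the kind a PDF's font table might list
fonts <- c(
  "ABCDEF+TimesNewRomanPS-BoldMT",
  "ABCDEF+TimesNewRomanPSMT",
  "GHIJKL+Arial-BoldItalicMT",
  "Helvetica"
)

# Embedded font subsets carry a six-letter prefix plus "+"; strip it
base_names <- sub("^[A-Z]{6}\\+", "", fonts)

# Heuristic only: call a font bold when its name mentions "bold"
looks_bold <- grepl("bold", base_names, ignore.case = TRUE)

print(data.frame(font = base_names, looks_bold = looks_bold))
```

A font can render heavy without saying so in its name, which is part of why this is nastier than it first looks.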
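Back to our irregularly scheduled answer
While I'm one of the YUGEst proponents of R, not everything needs to be done or should be done in R. Sure, we'll eventually use R to get your bold text, but we'll use a helper command-line utility to do so.
The pdftools package is based on the poppler library. It comes with the source, so "I'm just an R user" folks may not have the full poppler toolset on their system.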
Mac folks can use Homebrew to install it (once you have Homebrew itself installed):
brew install poppler
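Linux folks know how to do things. Windows folks are lost forever (there are poppler binaries for you, but your time would be better spent switching to a real operating system).
Once you do that, you can use the code below to achieve your goal.
First, we'll make a helper function with lots of safety bumper guards: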
#' Uses the command-line pdftohtml function from the poppler library
#' to convert a PDF to HTML and then read it in with xml2::read_html()
#'
#' @md
#' @param path the path to the file; [path.expand()] will be run on this value
#' @param extra_args extra command-line arguments to be passed to `pdftohtml`.
#' They should be supplied as you would supply arguments to the `args`
#' parameter of [system2()].
read_pdf_as_html <- function(path, extra_args=character()) {

  # make sure poppler/pdftohtml is installed
  pdftohtml <- Sys.which("pdftohtml")
  if (pdftohtml == "") {
    stop("The pdftohtml command-line utility must be installed.", call.=FALSE)
  }

  # make sure the file exists
  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  # we're going to do the conversion in a temp directory space
  td <- tempfile(fileext = "_dir")
  dir.create(td)
  on.exit(unlink(td, recursive=TRUE), add=TRUE)

  # save our current working directory
  curwd <- getwd()
  on.exit(setwd(curwd), add=TRUE)

  # move to the temp space
  setwd(td)
  file.copy(path, td)

  # collect the extra arguments
  c(
    "-i" # ignore images
  ) -> args

  # "r-doc" is the output prefix; the page content ends up in r-docs.html
  args <- c(args, extra_args, basename(path), "r-doc")

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  # we'll let stderr display so you can debug errors
  system2(
    command = pdftohtml,
    args = args,
    stdout = TRUE
  ) -> res

  # the last line of stdout looks like "Page-N"; keep just the N
  res <- gsub("^Page-", "", res[length(res)])
  message("Converted ", res, " pages")

  # this will need to be changed if poppler ever does anything different
  xml2::read_html("r-docs.html")

}
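The `extra_args` parameter is spliced in between the `-i` flag and the input file name. Here is a small sketch of what the final argument vector looks like when you only want the first few pages. `build_pdftohtml_args()` is a hypothetical helper written just for this illustration (it only builds the vector and runs nothing), and the file path is made up; `-f` and `-l` are pdftohtml's first-page and last-page options.

```r
# Hypothetical helper mirroring how read_pdf_as_html() assembles its
# pdftohtml arguments; it builds the vector but does not run anything.
build_pdftohtml_args <- function(path, extra_args = character()) {
  c(
    "-i",            # ignore images, as in the function above
    extra_args,      # caller-supplied options are spliced in here
    basename(path),  # input PDF (the real function copies it to a temp dir)
    "r-doc"          # output prefix
  )
}

# Made-up path for illustration; restrict conversion to pages 1 through 5
args <- build_pdftohtml_args(
  "~/Data/some-file.pdf",
  extra_args = c("-f", "1", "-l", "5")
)

print(args)
```

With the real function, the equivalent call would pass `extra_args = c("-f", "1", "-l", "5")` to `read_pdf_as_html()`.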
Now, we'll use it:
doc <- read_pdf_as_html("~/Data/Mulla__Indian_Contract_Act2018-11-12_01-00.PDF")
bold_tags <- html_nodes(doc, xpath=".//b")
bold_words <- html_text(bold_tags)
head(bold_words, 20)
## [1] "Preamble"
## [2] "WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;"
## [3] "History"
## [4] "Ancient and Medieval Period"
## [5] "The Introduction of English Law Into India"
## [6] "Mofussal Courts"
## [7] "Legislation"
## [8] "The Indian Contract Act 1872"
## [9] "The Making of the Act"
## [10] "Law of Contract Until 1950"
## [11] "The Law of Contract after 1950"
## [12] "Amendments to This Act"
## [13] "Other Laws Affecting Contracts and Enforcement"
## [14] "Recommendations of the Indian Law Commission"
## [15] "Section 1."
## [16] "Short title"
## [17] "Extent, Commencement."
## [18] "Enactments Repealed."
## [19] "Applicability of the Act"
## [20] "Scheme of the Act"
length(bold_words)
## [1] 1939
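No Java required at all, and you've got your bold words.
If you really do want to go the pdfbox-app route, as Ralf suggested, you can use this wrapper to make it easier to work with: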
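If you want to see the extraction step in isolation, here is a minimal sketch that runs the same `.//b` XPath against a tiny hand-written HTML fragment. The fragment is a made-up stand-in for converter output, not real pdftohtml output. It uses xml2 directly (`xml_find_all()` and `xml_text()`) rather than the rvest helpers, and the trimming and empty-string filtering are an extra cleanup step I'm assuming you'd want, not something the code above does.

```r
library(xml2)

# Made-up fragment standing in for the converted document
html <- '<html><body>
<b>Preamble</b><br/>
WHEREAS it is expedient to define certain parts of the law;<br/>
<b>History </b><br/>
<i>Not bold</i><br/>
<b> </b><br/>
</body></html>'

doc <- read_html(html)

# Every <b> element anywhere in the document
bold_nodes <- xml_find_all(doc, ".//b")

# Pull the text, trim stray whitespace, and drop empty runs
bold_text <- trimws(xml_text(bold_nodes))
bold_text <- bold_text[nzchar(bold_text)]

print(bold_text)
```

Only the text wrapped in `<b>` survives; the plain and italic lines are skipped, and the whitespace-only bold run is filtered out.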
read_pdf_as_html_with_pdfbox <- function(path) {

  java <- Sys.which("java")
  if (java == "") {
    stop("Java binary is not on the system PATH.", call.=FALSE)
  }

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(httr, warn.conflicts = FALSE, quietly = TRUE)
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # download the pdfbox "app" if not installed
  if (!dir.exists("~/.pdfboxjars")) {
    message("~/.pdfboxjars not found. Creating it and downloading pdfbox-app jar...")
    dir.create("~/.pdfboxjars")
    # NOTE: the central.maven.org host has since been retired; if this
    # download fails, the same path under https://repo1.maven.org/maven2/
    # should serve the jar
    httr::GET(
      url = "http://central.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.12/pdfbox-app-2.0.12.jar",
      httr::write_disk(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
      httr::progress()
    ) -> res
    httr::stop_for_status(res)
  }

  # we're going to write the converted HTML to a temp file
  tf <- tempfile(fileext = ".html")
  on.exit(unlink(tf), add=TRUE)

  c(
    "-jar",
    path.expand(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
    "ExtractText",
    "-html",
    path,
    tf
  ) -> args

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  system2(
    command = java,
    args = args
  ) -> res

  xml2::read_html(tf)

}