Extracting a keyword from an image using R

【Question title】: Extracting Keyword from an image using R
【Posted】: 2019-05-13 16:35:48
【Question】:

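Suppose I have a PDF file that contains an invoice, and the invoice is stored as an image inside the PDF. If I want to extract the keyword "total", how can I do that?

So far I have come up with the following code: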

library(tesseract)  # provides ocr()

curl::curl_download("https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf", "wordpress-pdf-invoice-plugin-sample.pdf")
# Text layer of the PDF, if it has one
orig <- pdftools::pdf_text("wordpress-pdf-invoice-plugin-sample.pdf")
# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("wordpress-pdf-invoice-plugin-sample.pdf", format = 'tiff', pages = 1, dpi = 400)
# Extract text from the rendered image
text <- ocr(img_file)
# Delete the temporary image file
unlink(img_file)
cat(text)

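The code above extracts text from the image, but it leaves out the text that sits inside the table. Also, if I only want to extract "Invoice Number" and "Total Due $93.50", how can I do that in R? I would be very grateful if someone could help me with this.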

【Question comments】:

Tags: r pdf ocr tesseract keyword

【Solution 1】:

Use the tabulizer package:

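library(tabulizer)
library(dplyr)
library(data.table)

# extract_tables() returns the tables it detects in the PDF
out <- extract_tables("https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf")

# Convert to a data.table and keep only the rows of interest
out <- as.data.table(out)
out %>% filter(V1 %in% c("Invoice Number", "Total Due"))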

              V1       V2
1 Invoice Number INV-3337
2      Total Due   $93.50
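
If the PDF yields more than one table, the same lookup can be run over each table in turn. The following is only a rough sketch in base R, assuming the tables arrive as a list of character matrices with the key in the first column and the value in the second. The `tables` object is a stand-in built by hand, `find_rows` is a hypothetical helper, and every value other than the two rows shown above is made up for illustration.

```r
# Stand-in for a list of character matrices such as extract_tables() returns.
# Only the "Invoice Number" and "Total Due" rows come from the answer above;
# the other rows are invented placeholders.
tables <- list(
  matrix(c("Invoice Number", "INV-3337",
           "Order Number",   "12345"), ncol = 2, byrow = TRUE),
  matrix(c("Sub Total", "$85.00",
           "Total Due", "$93.50"), ncol = 2, byrow = TRUE)
)

# Hypothetical helper: collect the rows whose first column equals a keyword.
# Assumes every table has at least two columns.
find_rows <- function(tables, keywords) {
  hits <- lapply(tables, function(m) {
    m <- as.matrix(m)
    # Compare on trimmed text so stray padding does not hide a match
    m[trimws(m[, 1]) %in% keywords, 1:2, drop = FALSE]
  })
  result <- as.data.frame(do.call(rbind, hits), stringsAsFactors = FALSE)
  names(result) <- c("key", "value")
  result
}

found <- find_rows(tables, c("Invoice Number", "Total Due"))
print(found)
```

Matching is exact here, so a table cell such as "Total Due:" would need the trailing colon stripped before the comparison.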

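【Comments】:

This answer is helpful for the current case. But what if the invoice number and total due are not in a table, and the text only exists as an image in the PDF file? How can we extract the keywords then?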
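
One possible approach for the image-only case is to run OCR as in the question and then search the recognized text line by line for the keywords. The sketch below uses base R only and assumes that each keyword and its value end up on the same OCR line, which real OCR output does not guarantee. The `ocr_text` string is a hand-written stand-in for the result of `ocr(img_file)`, not actual OCR output, and `extract_fields` is a hypothetical helper name.

```r
# Stand-in for the string returned by ocr(img_file) in the question.
# This text is written by hand for illustration; it is not real OCR output.
ocr_text <- paste(
  "Invoice",
  "Invoice Number INV-3337",
  "Order Number 12345",
  "Sub Total $85.00",
  "Total Due $93.50",
  sep = "\n"
)

# Hypothetical helper: for each keyword, return the text that follows it
# on the first line starting with that keyword (case-insensitive).
extract_fields <- function(text, keywords) {
  lines <- trimws(strsplit(text, "\n", fixed = TRUE)[[1]])
  vapply(keywords, function(key) {
    hit <- lines[startsWith(tolower(lines), tolower(key))]
    if (length(hit) == 0) return(NA_character_)
    # Drop the keyword itself, then any spaces or colon that separate it
    # from the value
    rest <- substring(hit[1], nchar(key) + 1)
    trimws(sub("^[[:space:]:]+", "", rest))
  }, character(1))
}

fields <- extract_fields(ocr_text, c("Invoice Number", "Total Due"))
print(fields)
```

With real scans the recognized text may contain misread characters or split a label and its value across lines, so the prefix match would likely need loosening, for example with approximate matching via `agrep`.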