【Question title】: Extract data from a pdf into a table [closed]
【Posted】: 2022-01-18 18:31:11
【Question description】:

pdf example

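I want to extract species information from a large pdf file (example in the image) into a list, with each species as a row and the metadata as columns. Is there a way to do this in python or R?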

【Question discussion】:

    Tags: python r pdf pdf-extraction


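    【Solution 1】:

    Another approach is to simply use the pdftools library.

    My solution has two parts:

    1. Put 1 paragraph (one species) into one row of a data.frame
    2. Separate the text information into meta.data columns

    Part 1: build a data frame with 1 species per row: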

    library(pdftools)  # pdf_text()
    library(stringr)   # str_split()

    # path of the pdf:
    file_name <- "species_info.pdf"
    # read the text in the pdf (pdf_text() returns one string per page):
    species.raw.text <- pdf_text(pdf = file_name, opw = "", upw = "")
    # split each page on blank lines, so that each part corresponds to 1 species,
    # then flatten the per-page list into a single character vector
    species.raw.text <- unlist(str_split(species.raw.text, "\n\n"))
    # convert the vector into a data.frame, i.e. each row = 1 species,
    # with a single column named raw.text
    species.df <- data.frame(raw.text = species.raw.text, stringsAsFactors = FALSE)
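
    The splitting step can be tried without a PDF. The sketch below uses base R strsplit() on two made-up page strings that stand in for the output of pdf_text(); the species text is invented for illustration. It assumes records are separated by a blank line and never cross a page break; a record that did cross one would end up cut in two.

```r
# Two made-up "pages", standing in for the output of pdf_text()
# (one string per page). The text is invented for illustration.
pages <- c(
  "Species A\nGulf of Suez: -\n\nSpecies B\nGulf of Suez: Egypt.",
  "Species C\nGulf of Suez: -"
)

# Splitting a vector of pages gives a list with one element per page
pieces <- strsplit(pages, "\n\n", fixed = TRUE)

# Flatten the list so that every record becomes one element
records <- unlist(pieces)

# One row per record, in a single column named raw.text
toy.df <- data.frame(raw.text = records, stringsAsFactors = FALSE)
print(nrow(toy.df))
```

    Flattening with unlist() is what makes the conversion to a data frame work when pages hold different numbers of records.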
    

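    Part 2: extract the information from the raw text into columns:

    For this I used the dplyr and tidyr libraries; the separate() function comes from tidyr. I assume every species has the same kinds of information, i.e.

    • Species name
    • Gulf of Suez:
    • Gulf of Aqaba:
    • Red Sea main basin:
    • General distribution:
    • Remark:

    I suggest this code to get what you want: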

    library(dplyr)  # %>%
    library(tidyr)  # separate()
    # replace the `\n` with spaces so each record is a single line
    species.df$raw.text <- gsub("\n", " ", species.df$raw.text)
    # get the meta.data: split on each field label in turn
    species.df <- species.df %>%
      separate(
        col = raw.text, sep = "Gulf of Suez:",
        into = c("species.name", "rest")) %>%
      separate(
        col = rest, sep = "Gulf of Aqaba:",
        into = c("Gulf.of.Suez", "rest")) %>%
      separate(
        col = rest, sep = "Red Sea main basin:",
        into = c("Gulf.of.Aqaba", "rest")) %>%
      separate(
        col = rest, sep = "General distribution:",
        into = c("Red.Sea.main.basin", "rest")) %>%
      # "Remark:" is optional; fill = "right" gives NA when it is absent
      separate(
        col = rest, sep = "Remark:", fill = "right",
        into = c("General.distribution", "Remark"))
    
    species.name | Gulf.of.Suez | Gulf.of.Aqaba | Red.Sea.main.basin | General.distribution | Remark
    Carcharhinus albimarginatus (Rüppell 1837) | - | Israel (Baranes 2013). | Egypt (Rüppell 1837, as Carcharias albimarginatus), Sudan (Ninni 1931), Saudi Arabia (Spaet & Berumen 2015). | Red Sea, Indo-Pacific: East Africa east to Panama. | NA
    Carcharhinus altimus (Springer 1950) | - | Egypt (Baranes & Ben-Tuvia 1978a), Israel (Baranes & Golani 1993). | Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas. | NA
    Carcharhinus amboinensis (Müller & Henle 1839) | - | - | Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas, but not eastern Pacific. | NA
    Carcharhinus brevipinna (Müller & Henle 1839) | Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna). | - | Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna and Carcharhinus maculipinnis), Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas, but not in the eastern Pacific. | Not a Lessepsian migrant as previously reported by Ben-Tuvia (1966) (see Golani et al. 2002).
    Carcharhinus falciformis (Müller & Henle 1839) | - | - | Egypt (Gohar & Mazhar 1964, as Carcharhinus menisorrah), Saudi Arabia (Klausewitz 1959a, as Carcharhinus menisorrah; Spaet & Berumen 2015). | Circumglobal in tropical seas. | NA
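
    When a label is missing from a record, separate() cannot split on it and the remaining columns come out as NA, so it may help to check the records before splitting. A minimal base R sketch, using two made-up records (the second one lacks "Gulf of Aqaba:"); the names has.label and incomplete are only illustrative:

```r
# Field labels, in the order used by the separate() calls
labels <- c("Gulf of Suez:", "Gulf of Aqaba:", "Red Sea main basin:",
            "General distribution:")

# Made-up records for illustration; the second lacks "Gulf of Aqaba:"
raw.text <- c(
  "Sp. one Gulf of Suez: - Gulf of Aqaba: - Red Sea main basin: Egypt. General distribution: Red Sea.",
  "Sp. two Gulf of Suez: - Red Sea main basin: Egypt. General distribution: Red Sea."
)

# One logical column per label: TRUE where the record contains the label
has.label <- do.call(
  cbind,
  lapply(labels, function(l) grepl(l, raw.text, fixed = TRUE))
)

# Row indices of the records that miss at least one label
incomplete <- which(rowSums(!has.label) > 0)
print(incomplete)
```

    Records flagged this way can be inspected by hand before running the separate() chain on the rest.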

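    【Discussion】:

    • Thanks for your help. The document is organized by family name (all caps; I added another image to the original post). Do you know how to handle that?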
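
    One possible way to handle such headings, sketched in base R under the assumptions that each family name sits on its own line, consists only of capital letters, and that the first line is a heading. The input lines and the names is.family and tagged are made up for illustration and are not taken from the actual PDF:

```r
# Made-up lines for illustration: family headings in all caps,
# each followed by the species that belong to it
lines <- c("CARCHARHINIDAE", "Species one", "Species two",
           "SPHYRNIDAE", "Species three")

# A line counts as a family heading if it contains only capital letters
is.family <- grepl("^[A-Z]+$", lines)

# Carry the most recent heading forward to every following line
family <- lines[is.family][cumsum(is.family)]

# Keep only the species lines, each tagged with its family
tagged <- data.frame(family = family[!is.family],
                     species = lines[!is.family],
                     stringsAsFactors = FALSE)
print(tagged)
```

    The family column can then be kept alongside raw.text before the fields are separated.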