How to drop captions while extracting text from PDF? Answers

【Question title】: How to drop captions while extracting text from PDF?
【Posted】: 2019-11-27 02:40:43
【Question description】:

I am trying to run LDA on a set of PDF files to identify the main topics in those files. I can extract the data from the PDFs using pdfminer.

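Question 1: The captions and descriptions of the figures and images in the PDF are of no use to me. How can I remove these unwanted parts from the extracted text?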

Question 2: Before I run the LDA model, I want to remove all line breaks and punctuation from the text.

The code I am using to extract the data is as follows:

from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal
from pdfminer.layout import LTFigure
from pdfminer.pdfinterp import PDFPageInterpreter
import gensim
from gensim import corpora
from pprint import pprint
document = open('C:/Users/kaurj/Desktop/File1.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(document):
    interpreter.process_page(page)
    # Layout of the page that was just processed
    layout = device.get_result()

    # Print the text of every horizontal text box on this page
    for element in layout:
        if isinstance(element, LTTextBoxHorizontal):
            values = element.get_text()
            print(values)

The File1 used in the code is embedded here:

https://onedrive.live.com/embed?cid=DA6170EA591F0D07&resid=DA6170EA591F0D07%21106&authkey=ALua6WdCD7Ct6zo&em=2

【Question comments】:

Please post the code you have tried and the errors.

Tags: python pdf text-extraction

【Solution 1】:

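If the captions themselves follow a pattern (as they do in scientific texts), you can remove them with a regular expression. See this link for a quick overview and this one to try out a regular expression that matches the pattern. I assume the captions start with "Figure", followed by a number and a string of indeterminate length. The end of the caption is the tricky part: it is most likely a newline or some other indicator, and the details depend on the parser and the documents you use.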
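The regular-expression idea above can be sketched as follows. This is only a minimal illustration, not a tested solution for the asker's file: it assumes each caption arrives as its own text block (for example, one string per text box returned by `get_text()`), and that captions start with "Figure", "Fig." or "Table", then a number, then a colon or period. The names `CAPTION_RE`, `is_caption` and `drop_captions` are made up for this example.

```python
import re

# Assumed caption pattern: a block that starts with "Figure 3:", "Fig. 3." or
# "Table 2." (label, number, then a colon or period). Adjust it to your documents.
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.|Table)\s*\d+\s*[:.]", re.IGNORECASE)


def is_caption(block: str) -> bool:
    """Return True if the text block looks like a figure or table caption."""
    return CAPTION_RE.match(block) is not None


def drop_captions(blocks):
    """Keep only the blocks that do not look like captions."""
    return [block for block in blocks if not is_caption(block)]


# Hypothetical text blocks, standing in for the strings extracted from a PDF page.
blocks = [
    "Topic models summarise large document collections.\n",
    "Figure 1: Distribution of topics per document.\n",
    "As Figure 1 shows, most documents cover two topics.\n",
    "Table 2. Top ten words per topic.\n",
]
kept = drop_captions(blocks)
print(kept)
```

Because the pattern is anchored at the start of a block and requires punctuation after the number, a sentence that merely mentions "Figure 1" is kept. Captions that share a text block with body text would need a different approach.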

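To clean the text you have several options. Gensim has some tools, and so does NLTK. The simplest version is to use replace, a built-in Python string method: textdocument.replace("\n", ""), repeated for every character you want to swap for another (or, in this case, for "", i.e. nothing). Personally, I would recommend the clean-text package, which is very flexible and does most of the work for you.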
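For the standard-library route mentioned above, a minimal sketch might look like this. It is an illustration under stated assumptions: the function name `strip_newlines_and_punct` is invented for this example, only ASCII punctuation from `string.punctuation` is removed, and line breaks are replaced with a space rather than with "" so that words on adjacent lines do not run together.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_newlines_and_punct(text: str) -> str:
    """Remove line breaks and ASCII punctuation, then collapse repeated spaces."""
    # Turn line breaks into spaces so neighbouring words stay separate.
    text = text.replace("\r", " ").replace("\n", " ")
    # Delete punctuation in a single pass instead of one replace() per character.
    text = text.translate(_PUNCT_TABLE)
    # split() with no arguments splits on any run of whitespace.
    return " ".join(text.split())


sample = "Topic models,\nsuch as LDA, are useful!\nThey find themes."
cleaned = strip_newlines_and_punct(sample)
print(cleaned)  # Topic models such as LDA are useful They find themes
```

Using `str.translate` avoids chaining one `replace` call per punctuation mark. Non-ASCII punctuation such as curly quotes is left untouched by this sketch.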

An example:

from cleantext import clean

text = """I am a sample text.
I have -many- weird characters, such as , . # and some numbers,
4335 and 12 more.
Here is a newline character \n and a $ sign.
Some words are CAPITALIZED and this is an email address: hello@example.com"""


clean(text,
        fix_unicode=True,               # fix various unicode errors
        lower=True,                     # lowercase text
        no_line_breaks=True,           # strip line breaks 
        no_emails=True,                # replace all email addresses with a special token
        no_numbers=True,               # replace all numbers with a special token
        no_digits=True,                # replace all digits with a special token
        no_currency_symbols=True,      # replace all currency symbols with a special token
        no_punct=True,                 # fully remove punctuation
        replace_with_email="",
        replace_with_number="",
        replace_with_digit="",
        replace_with_currency_symbol="",
        lang="en") 

Out[3]: 'i am a sample text i have many weird characters such as and some numbers and more here is a newline character and a sign some words are capitalized and this is an email address'

【Comments】: