Extracting specific text from a PDF using Python

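【Question title】: Extraction of specific text from PDF using Python
【Posted】: 2020-05-10 09:56:18
【Question description】:

Is it possible to extract specific text from a PDF using Python?

Test case: I have a PDF file of more than 10 pages, and I need to extract specific labels together with the values associated with them. For example: user:value, user id:value. These values are what I need to extract.

I can already read all the pages; now I want to pull out only the specific text.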

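【Comments】:

Does this answer your question? How to extract text from pdf in python 3.7.3
As a new user, please also take the tour and read How to Ask. In particular, questions that can be answered with yes or no are usually poor questions.
You can convert the PDF to XML or JSON and then use the lib-xml library or the json library to extract whatever you need from it.

Tags: python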

【Solution 1】:

If you can already read the PDF and store its text in a string, you can do the following:

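import re  # Import the regex module

# Sample text standing in for the string read from the PDF.
# Note the space after each colon: the pattern below requires it.
pdf_text = """
user: John
user: Doe
user id: 2
user id: 4
"""

# re.findall returns a list of all strings matching the specified pattern
results = re.findall(r'user:\s\w+', pdf_text)
print(results)  # ['user: John', 'user: Doe']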

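This essentially means: find every match that starts with the string 'user:', followed by one whitespace character '\s', then by one or more word characters '\w' (letters, digits, and underscores), where '+' keeps matching until the next character no longer fits.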
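Text extracted from a PDF does not always keep a space after the colon (the question itself writes the format as user:value). The following is a minimal sketch, not part of the original answer, that compares '\s' with '\s*' (zero or more whitespace characters). The sample strings are made up for illustration.

```python
import re

# Hypothetical sample lines, with and without whitespace after the colon.
samples = ["user:John", "user: Doe", "user:   Ann"]

strict = re.compile(r'user:\s\w+')    # requires exactly one whitespace character
lenient = re.compile(r'user:\s*\w+')  # allows zero or more whitespace characters

strict_hits = [s for s in samples if strict.search(s)]
lenient_hits = [s for s in samples if lenient.search(s)]

print(strict_hits)   # ['user: Doe']
print(lenient_hits)  # ['user:John', 'user: Doe', 'user:   Ann']
```

If the spacing in your extracted text varies, the lenient form is probably the safer starting point.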

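If you only want the "value" part back, use r'user:\s(\w+)', which tells the regex engine to capture the string matched by '\w+' as a group. When the pattern contains a group, findall returns a list of the group matches rather than the full matches, so the result is: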

results = re.findall(r'user:\s(\w+)', pdf_text)
print(results)  # ['John', 'Doe']
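The question asks for both the user and user id values. One possible way to collect them in a single pass is to capture the label and the value as two groups, in which case findall returns (label, value) tuples. This is a sketch under the assumption that each pair sits on its own line; the sample text and the name `values` are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical text, standing in for what was extracted from the PDF.
pdf_text = """
user: John
user id: 2
user: Doe
user id: 4
"""

# Group 1 is the label ("user" or "user id"), group 2 is the value.
# re.MULTILINE makes ^ match at the start of every line.
pattern = re.compile(r'^(user(?: id)?)\s*:\s*(\w+)', re.MULTILINE)

values = defaultdict(list)
for key, value in pattern.findall(pdf_text):
    values[key].append(value)

print(dict(values))  # {'user': ['John', 'Doe'], 'user id': ['2', '4']}
```

Values containing spaces or punctuation would need a wider pattern than '\w+', for example '.+' to take the rest of the line.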

See the regex module documentation: https://docs.python.org/3/library/re.html

If you want to do something more complex, other functions such as finditer() will also help.
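As a rough illustration of finditer(), the sketch below walks a list of page strings and records on which page, and at which character offset, each user id was found. The `pages` list is a hypothetical stand-in for text you have already read from the PDF, one string per page.

```python
import re

# Hypothetical per-page text, e.g. one string per PDF page.
pages = [
    "Report\nuser: John\nuser id: 2\n",
    "Nothing relevant here\n",
    "user: Doe\nuser id: 4\n",
]

# A named group lets the value be read back by name from the match object.
pattern = re.compile(r'user id:\s*(?P<user_id>\d+)')

found = []
for page_number, text in enumerate(pages, start=1):
    for match in pattern.finditer(text):
        # match.start() is the character offset of the match within this page
        found.append((page_number, match.group('user_id'), match.start()))

for page_number, user_id, offset in found:
    print(f"page {page_number}: user id {user_id} at offset {offset}")
```

Unlike findall, finditer yields match objects, so positions and named groups stay available for each hit.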

This regex guide may also be helpful: https://www.regexbuddy.com/regex.html?wlr=1

【Comments】: