Extracting text from paragraphs using Python

【Question title】: Extracting text from paragraphs using Python
【Posted】: 2019-08-14 14:56:17
【Question】:

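I am working on a project where we want to extract the company name, city, state, and dollar amount from a block of text within a paragraph. This information is usually at the start of the paragraph. So far I have been using a regular expression to find the first dollar sign (which is the amount we want to extract) and to take the text between each comma, since we know the order in which the fields arrive. For example: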

company name, city, state, amount $123,456,653

We have run into cases where there may be X number of companies, each followed by its city and state, and then the dollar amount.

Example: company name 1, city, state, company name 2, city, state, amount $123,456,653

There may be cases where the company name is given, but the next piece of information is not the city but rather "company name as xxx".

Example: company name 1, company name 1 longer, city, state, amount $123,456,653

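Finally, we have seen cases where the text first states how many companies are being awarded the dollar amount and then lists all of the company names.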

Example (snippet): Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);xxxxxxxxxxxxxx

Usually (70-80% of the time) the paragraph looks like this:

L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 fixed-price-incentive,xxxxxxxxxx

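I am wondering whether anyone has suggestions for a Python library or a better way to extract this specific text. I have considered implementing some kind of API that would take the extracted values (after splitting on commas) and check whether each one is a city or a state. That would tell us where we are in the list of fields and what is likely to come next (for example, a state).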
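The lookup idea described above could be sketched roughly as follows. This is a minimal illustration, not a complete solution: `US_STATES` and `split_records` are hypothetical names, the state set is deliberately partial, and the sample text is made up in the shape of the examples in the question. A field that is a known state closes one record, the field before it is taken as the city, and everything earlier is treated as the company name.

```python
# Hypothetical, partial lookup table; a real one would list every state and territory.
US_STATES = {
    "Ohio", "Washington", "Georgia", "Nevada", "Florida", "Texas",
    "New York", "Maryland", "District of Columbia",
}

def split_records(fields):
    """Group comma-separated fields into (company, city, state) tuples.

    A state name is only accepted as a state when at least two fields
    (company and city) precede it in the current record, so a city that
    is also a state name (Washington, District of Columbia) is not
    mistaken for the end of a record.
    """
    records = []
    start = 0
    for i, field in enumerate(fields):
        if field in US_STATES and i - start >= 2:
            company = ", ".join(fields[start:i - 1])
            records.append((company, fields[i - 1], field))
            start = i + 1
    return records

# Made-up sample in the shape described in the question.
text = ("company name 1, company name 1 longer, Millersville, Maryland, "
        "company name 2, Seattle, Washington, amount $123,456,653")

head = text.split("$", 1)[0]  # keep only the text before the first dollar sign
fields = [f.strip() for f in head.split(",") if f.strip()]
print(split_records(fields))
```

Anchoring on the state means a company name that itself contains commas stays together, which a fixed position-based split cannot do. It would still fail when a company name is exactly a state name or when the state is abbreviated.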

Here is the regular expression I am currently using: r'([^$]*),.*?\$([0-9,]+)'
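For reference, here is a small sketch of what that expression captures on the typical sentence quoted earlier (with the trailing placeholder characters left out). The variable names are only illustrative. Group 1 is everything up to the last comma before the first dollar sign, and group 2 is the run of digits and commas after it.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'([^$]*),.*?\$([0-9,]+)')

sentence = ("L-3 Chesapeake Sciences Corp., Millersville, Maryland, "
            "is being awarded a $43,094,331 fixed-price-incentive,")

match = pattern.search(sentence)

# Split the leading text on commas and trim the surrounding spaces.
fields = [part.strip() for part in match.group(1).split(",")]

# Removing the commas also drops a trailing comma if the amount is
# directly followed by one in the text.
amount = int(match.group(2).replace(",", ""))

print(fields)   # ['L-3 Chesapeake Sciences Corp.', 'Millersville', 'Maryland']
print(amount)   # 43094331
```

This works for the simple layout, but the plain comma split assumes exactly one company whose name contains no commas, which is the limitation the other examples in the question run into.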

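【Comments】:

Wow. This is ambitious. Personally, I doubt regex will work well here, because regex needs some kind of standardization. If the fields come in different orders, especially where city names are concerned, this will be difficult. First, you should post more examples. Second, it would help if you posted the output you want...
Second, in your 70-80% example, is "L-3" typical? Within the paragraph blob you need something that anchors the regex, so that you know what is captured in the group is a company name rather than some other words.
@FailSafe A text-analysis library might be overkill. I still think the best approach is to combine an n-grams database with regular expressions.
Lol, you and me both. I think I can capture what I need 70-80% of the time, but the rest may require manual input from the user.
I see you updated the readme, thanks ;)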

Tags: python regex python-3.x

【Solution 1】:

You could design an expression to capture the companies listed in the paragraph, for example:

(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)\s+\(\s*([a-z0-9]{13};?)\s*\)

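and add or remove boundaries as needed. The list of state names in the alternation is one such boundary and would need to be extended to cover other states.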

Test

import re

string = """
Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);

"""

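# Groups: company name, city, state, contract number (13 characters, optional trailing ';')
expression = r'(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)\s+\(\s*([a-z0-9]{13};?)\s*\)'
# findall returns one tuple of the four groups for each match
matches = re.findall(expression, string)

print(matches)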

Output

[(' ABX Air Inc.', ' Wilmington', 'Ohio', 'HTC71119DC002'), (' Air Transport International Inc.', ' Wilmington', 'Ohio', 'HTC71119DC003'), (' Alaska Airlines Inc.', ' Seattle', 'Washington', 'HTC71119DC004'), (' Allegiant Air LLC', ' Las Vegas', 'Nevada', 'HTC71119DC005'), (' American Airlines', ' Fort Worth', 'Texas', 'HTC71119DC006'), (' Amerijet International Inc.', ' Fort Lauderdale', 'Florida', 'HTC71119DC007'), (' Atlas Air Inc.', ' Purchase', 'New York', 'HTC71119DC008;'), (' Delta Air Lines Inc.', ' Atlanta', 'Georgia', 'HTC71119DC009'), (' Federal Express Corp.', ' Washington', 'District of Columbia', 'HTC71119DC010')]

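If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you like, you can also see in this link how it matches against some sample inputs.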
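The tuples in the output above keep a leading space on the company and city, and one contract number keeps a stray semicolon (from "(HTC71119DC008;)" in the source text). A minimal post-processing sketch follows, using the same expression on a shortened version of the sample; the `clean` helper and the dictionary keys are illustrative names, not part of the original answer.

```python
import re

# Same expression as in the answer.
expression = (r'(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*'
              r'(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)'
              r'\s+\(\s*([a-z0-9]{13};?)\s*\)')

# Shortened sample taken from the paragraph in the question.
text = ("contracts with an estimated value of $284,932,621: "
        "ABX Air Inc., Wilmington, Ohio (HTC71119DC002); "
        "Atlas Air Inc., Purchase, New York (HTC71119DC008;) "
        "Federal Express Corp., Washington, District of Columbia (HTC71119DC010);")

def clean(match_tuple):
    """Trim whitespace and drop a stray ';' from the contract number."""
    company, city, state, contract = (part.strip() for part in match_tuple)
    return {
        "company": company,
        "city": city,
        "state": state,
        "contract": contract.rstrip(";"),
    }

records = [clean(m) for m in re.findall(expression, text)]
for record in records:
    print(record)
```

Cleaning after the match keeps the expression itself unchanged, so it can still be pasted into regex101.com as-is.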

【Comments】: