How to use regular expressions to find a value in a pandas dataframe column: answers

[Question title]: How to use regular expressions to find a value in a pandas dataframe column
[Posted]: 2021-05-16 13:53:53
[Question description]:

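I have a PDF document that reports COVID-19 numbers for the counties of a state in a table. I am using camelot to read the table into a pandas dataframe, and I extract the values in each row based on the value in the first column (the county name). To do this I use Boolean indexing, as described here: How do I sum values in a column that match a given condition using pandas?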

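I use the extracted data to report COVID-19 statistics for the subset of counties in the report that my organization is interested in. I also extract the statewide total, but the producers of the PDF cannot decide whether to call that row "Gesamt" ("Total") or "Gesamtergebnis" ("Total result"). After camelot extracts the table from the PDF, the dataframe I am working with looks like this: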

                0        1       2        3
...
9        A County   13.789   (+22)  1.566,0
10      My County   16.581   (+45)  3.040,0
11   Their County    7.445   (+15)  2.821,6
... 
55         Gesamt  304.950  (+820)  2.747,2

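The code below works if they use "Gesamt". I would like to write it so that it also works if they use "Gesamtergebnis". I cannot rely on the total row ("Gesamt" or "Gesamtergebnis") always being in the same position.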

# Open LGA reports for yesterday and the day before
# TO DO: Sometimes the LGA report is named COVID_Lagebericht_LGA_yymmdd.pdf or it ends in _01
#        Add in a try/else statement to compensate for this
rptyes = f'Reports_LGA/{yday_yymmdd}_COVID_Tagesbericht_LGA.pdf'
rptdbf = f'Reports_LGA/{daybef_yymmdd}_COVID_Tagesbericht_LGA.pdf'

# Read the LGA reports into dataframes.
dfyes = camelot.read_pdf(rptyes, pages='2', flavor='stream')
dfdbf = camelot.read_pdf(rptdbf, pages='2', flavor='stream')

# Extract the statewide 7-D-I
# TO DO: Sometimes the last line says "Gesamt", sometimes "Gesamtergebnis" or something else.
#        Add in some sort of error checking or try/else statement or regular expression to compensate
landindexyes = lambda land: dfyes[0].df.loc[dfyes[0].df[0] == land].index[0]
landindexdbf = lambda land: dfdbf[0].df.loc[dfdbf[0].df[0] == land].index[0]
land = 'Gesamt'
bwname = 'Baden-Württemberg'
bwcases = int(dfyes[0].df.loc[landindexyes(land), 1].replace('.',''))
bwcasesdiff = dfyes[0].df.loc[landindexyes(land), 2]
bwdeaths = int(dfyes[0].df.loc[landindexyes(land), 4].replace('.',''))
bwdeathsdiff = dfyes[0].df.loc[landindexyes(land), 5]
bw7diyes = float(dfyes[0].df.loc[landindexyes(land), 7].replace(',','.'))
bw7didbf = float(dfdbf[0].df.loc[landindexdbf(land), 7].replace(',','.'))
bw7didiff = bw7diyes - bw7didbf
rptrowsbw = [bwname, bwcases, bwcasesdiff, bwdeaths, bwdeathsdiff, bw7diyes, bw7didbf]
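
A side note on the number conversions in the snippet above: `.replace(',','.')` on its own only works while the value has no thousands separator. A value such as `2.747,2` (as in column 3 of the excerpt) would become `2.747.2`, which `float()` rejects. The following is a minimal sketch of two helpers that handle both cases; the names `parse_german_int` and `parse_german_float` are hypothetical and not part of the original code.

```python
def parse_german_int(text):
    """Convert a German-formatted integer such as '304.950' to 304950."""
    # Dots are thousands separators, so they are simply dropped.
    return int(text.strip().replace(".", ""))


def parse_german_float(text):
    """Convert a German-formatted decimal such as '2.747,2' to 2747.2."""
    # Drop the thousands dots first, then turn the decimal comma into a dot.
    return float(text.strip().replace(".", "").replace(",", "."))


print(parse_german_int("304.950"))
print(parse_german_float("2.747,2"))
```

With helpers like these, the `int(...)` and `float(...)` calls above could be replaced one for one, assuming every cell is a plain German-formatted number.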

How can I use a regular expression to match either "Gesamt" or "Gesamtergebnis" in the variable passed to the lambda expressions "landindexyes" and "landindexdbf"?

If a regular expression is not the way to go, I am open to other suggestions. I suppose an if/else would work, but I do not think it would be as elegant.

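[Question comments]:

I believe you can use the extract or contains method.
The contains() method might work, as @Ena suggested, but I do not understand how to use it for the landindexyes(land) part of the code.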

Tags: python pandas re

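[Solution 1]:

Unfortunately I cannot see your dataframe, so I cannot write a line that is 100% correct. I would point you to the first answer here: Filtering DataFrame by finding exact word (not combined) in a column of strings.

So in your case it would be something like this: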

df[df["column_name"].str.contains(r'(?:\s|^)Gesamt(?:\s|$)') == True]

or

df[df["column_name"].str.contains(r'(?:\s|^)Gesamtergebnis(?:\s|$)') == True]
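
Since the question asks for a row index rather than a filtered dataframe, here is a minimal sketch of how the two patterns above might be folded into one and used in place of the `== land` comparison. The toy dataframe, the constant `TOTAL_PATTERN` and the helper `find_total_index` are assumptions made for illustration; with camelot, the frame would be `dfyes[0].df`.

```python
import pandas as pd

# Toy stand-in for the table camelot extracts (values taken from the question).
df = pd.DataFrame({
    0: ["A County", "My County", "Their County", "Gesamtergebnis"],
    1: ["13.789", "16.581", "7.445", "304.950"],
})

# "Gesamt", optionally followed by "ergebnis", with nothing else in the cell
# apart from surrounding whitespace.
TOTAL_PATTERN = r"^\s*Gesamt(?:ergebnis)?\s*$"


def find_total_index(frame, column=0, pattern=TOTAL_PATTERN):
    """Return the index label of the first row whose `column` matches `pattern`."""
    mask = frame[column].str.contains(pattern, regex=True, na=False)
    if not mask.any():
        raise LookupError(f"no row in column {column!r} matches {pattern!r}")
    return frame.index[mask][0]


total_idx = find_total_index(df)
total_cases = int(df.loc[total_idx, 1].replace(".", ""))
print(total_idx, total_cases)
```

Raising an error when nothing matches is a design choice: it makes a third, unexpected spelling fail loudly instead of surfacing later as an `IndexError` on `.index[0]`.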

If you are not sure the spelling in your dataset is consistent, you could try a matching algorithm such as Fuzzy Wuzzy: https://www.datacamp.com/community/tutorials/fuzzy-string-python.
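
To give a rough idea of what fuzzy matching looks like, here is a small sketch that uses `difflib` from the Python standard library as a stand-in for Fuzzy Wuzzy, so it is not that library's API. The label list, the deliberate typo, the helper `is_total_label` and the cutoff of 0.8 are all assumptions chosen for illustration.

```python
import difflib

# First-column labels as they might come out of the PDF; the last one has a typo.
labels = ["A County", "My County", "Their County", "Gesamtergebniss"]

# Spellings that should count as the statewide total row.
candidates = ["Gesamt", "Gesamtergebnis"]


def is_total_label(label, cutoff=0.8):
    """True if `label` is close enough to one of the known total-row spellings."""
    matches = difflib.get_close_matches(label.strip(), candidates, n=1, cutoff=cutoff)
    return bool(matches)


total_rows = [i for i, label in enumerate(labels) if is_total_label(label)]
print(total_rows)
```

The cutoff would need tuning against real reports: too low and a county name could be mistaken for the total row, too high and a misspelling slips through.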

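Edit (from the comments): RegEx would slow the code down considerably, so how about changing all "Gesamtergebnis" values in the column to "Gesamt" instead? You could use something like this in your TODO section: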

df_name['column_name'] = df_name['column_name'].str.replace('Gesamtergebnis','Gesamt')

Then carry on with your code.
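
For completeness, a minimal sketch of the normalize-then-compare approach on a toy dataframe (the data is an assumption; with camelot the frame would be `dfyes[0].df`). Passing `regex=False` makes the replacement a plain literal substitution, and the extra `str.strip()` is an added precaution against stray whitespace from PDF extraction.

```python
import pandas as pd

# Toy stand-in for the extracted table.
df = pd.DataFrame({
    0: ["A County", "My County", "Gesamtergebnis"],
    1: ["13.789", "16.581", "304.950"],
})

# Normalize the label column: trim whitespace, then map the long spelling
# onto the short one with a literal (non-regex) replacement.
df[0] = df[0].str.strip().str.replace("Gesamtergebnis", "Gesamt", regex=False)

# The original equality-based lookup now works unchanged.
land = "Gesamt"
landindex = df.loc[df[0] == land].index[0]
bwcases = int(df.loc[landindex, 1].replace(".", ""))
print(landindex, bwcases)
```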

[Comments]:

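I edited my original question to add an excerpt of the dataframe I am working with. The row I am interested in is the only one whose value in that column starts with "Gesamt", so I tried: ``` bwcases = int(dfyes[0].df.loc[dfyes[0].df.loc[dfyes[0].df[0].str.contains(r'(^)Gesamt') ], 1].replace('.','')) ``` but it gives the following error: ValueError: Cannot index with multidimensional key
RegEx would slow the code down considerably, so how about changing all "Gesamtergebnis" values in the column to "Gesamt" instead? You could use something like this in your TODO section: df_name['column_name'] = df_name['column_name'].str.replace('Gesamtergebnis','Gesamt') and then carry on with your code.
You could also replace 'gesamt' with 'Gesamt' and so on, and clean the data that way.
I will try the str.replace() method. I finally got it working with a non-RegEx solution. I noticed that the Gesamtergebnis or Gesamt row is most often the second-to-last row, so I assigned (df_name[0].shape[0] - 2) to a variable and used that variable as the row coordinate. However, I think your suggestion of replacing the value with "Gesamt" would be more reliable.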
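
On the ValueError reported in the first comment above: the attempt nests one `.loc` inside another, so the inner `.loc[mask]` returns a whole dataframe, which is then passed as the row indexer of the outer `.loc`. A sketch of what seems to go wrong and one possible fix, using a toy dataframe as an assumption:

```python
import pandas as pd

# Toy stand-in for the extracted table.
df = pd.DataFrame({
    0: ["A County", "My County", "Gesamt"],
    1: ["13.789", "16.581", "304.950"],
})

# Boolean mask: True for rows whose first column starts with "Gesamt".
mask = df[0].str.contains(r"^Gesamt", na=False)

# Nested .loc: the inner call yields a DataFrame, which is two-dimensional
# and therefore not accepted as a row indexer.
nested_error = None
try:
    df.loc[df.loc[mask], 1]
except ValueError as err:
    nested_error = str(err)
print(nested_error)

# Pass the mask directly. The result is a Series, so take its first element
# to get a plain string before calling str.replace on it.
matches = df.loc[mask, 1]
bwcases = int(matches.iloc[0].replace(".", ""))
print(bwcases)
```

Note the `.iloc[0]` step: calling `.replace('.', '')` directly on the Series would invoke `Series.replace`, which substitutes whole cell values rather than characters inside them.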