Python - Extract multiple values from string in pandas df

【Question title】: Python - Extract multiple values from string in pandas df
【Posted】: 2019-11-21 00:32:33
【Question】:

I have searched for an answer to the following problem but have not found one. I have a large dataset that looks like this small example:

df =

A  B
1  I bought 3 apples in 2013
3  I went to the store in 2020 and got milk
1  In 2015 and 2019 I went on holiday to Spain
2  When I was 17, in 2014 I got a new car
3  I got my present in 2018 and it broke down in 2019

I want to extract all values > 1950 and get this as the final result:

A  B                                                    C
1  I bought 3 apples in 2013                            2013
3  I went to the store in 2020 and got milk             2020
1  In 2015 and 2019 I went on holiday to Spain          2015_2019
2  When I was 17, in 2014 I got a new car               2014
3  I got my present in 2018 and it broke down in 2019   2018_2019

I tried extracting the values first, but got no further:

df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())

But all I get are error messages (I only started using Python and working with text a few weeks ago). Can someone help me?

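【Comments】:

Should 1950 be included? Do you also want to extract numbers with more digits, such as 19555?
You can use this
@WiktorStribiżew I haven't got that far yet, but my thinking was: since I need the year in which it happened, filtering for numbers > 1950 after extracting them would give me the years and lose the other useless values.
I would use something like df["C"] = df["B"].str.findall(r'(?<!\d)(?:19[5-9]\d|[2-9]\d{3}|\d{5,})(?!\d)').str.join('_'), which also includes 1950 and numbers with 5 or more digits.
If you only need 4-digit years, remove |\d{5,} from the above. To exclude 1950, add (?!1950) / (?!1950(?!\d)) right after (?<!\d). Only use it if your input is a complete mess.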
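As a rough sketch of the 4-digit variant described in the comment above (the pattern with |\d{5,} removed, so 1950 is still included), the following rebuilds the question's sample data and applies it. The variable name year_pattern is chosen here only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [1, 3, 1, 2, 3],
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})

# 4-digit numbers from 1950 to 9999 that are not part of a longer digit run.
year_pattern = r"(?<!\d)(?:19[5-9]\d|[2-9]\d{3})(?!\d)"

# str.findall gives one list of matches per row; str.join glues each list.
df["C"] = df["B"].str.findall(year_pattern).str.join("_")
print(df["C"].tolist())
# ['2013', '2020', '2015_2019', '2014', '2018_2019']
```

Rows without any match end up with an empty string in C, since joining an empty list yields ''.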

Tags: python regex pandas

【Solution 1】:

Here is one way, using str.findall and then joining the items of each resulting list that are greater than 1950:

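# Find every run of digits in each row (one list of strings per row).
s = df["B"].str.findall(r'\d+')
# Keep only the numbers above 1950 and join them with underscores.
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))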

   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019

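【Discussion】:

So, I have one more question. What if I only want to keep the earliest year?
Try min @lotw
Yes, I got that. My question is how to do it neatly. Right now I have: df2 = df['C'].str.split('_', expand=True) df2 = df2.fillna(0).astype(int) df2.columns = ['C{}'.format(col) for col in df2.columns] df = df.join(df2) This is a rather heavy workaround that splits C again. I would like it to take the smallest number directly.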
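One possible way to take the earliest year directly, without splitting C again, is to apply min to the filtered numbers of each row. This is only a sketch under the question's assumptions: the column name C_min and the threshold of 1950 are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Column B from the question's sample data.
df = pd.DataFrame({
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})

# All digit runs per row, as lists of strings.
numbers = df["B"].str.findall(r"\d+")

# Smallest number above 1950 in each row; None when a row has no such number.
df["C_min"] = numbers.apply(
    lambda nums: min((int(n) for n in nums if int(n) > 1950), default=None)
)
print(df["C_min"].tolist())
# [2013, 2020, 2015, 2014, 2018]
```

The default=None argument keeps min from raising ValueError on rows that contain no qualifying number.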

【Solution 2】:

Using a single regex pattern (taking into account your comment that you "need the year in which it happened"):

In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')

In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))

In [270]: df
Out[270]: 
   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019
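The pattern above excludes 1950 itself, but it is only anchored on the left, so it would also pick up the first four digits of a longer number. The sketch below compares it with a hypothetical variant, pat_strict, that adds a trailing \b. The test sentence is made up for illustration.

```python
import re

# Pattern from the answer: 1951-1999 or 2000-9999, with \b only on the left.
pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')
# Hypothetical variant: a trailing \b rejects matches inside longer digit runs.
pat_strict = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})\b')

# Made-up sentence with an excluded year (1950) and a 5-digit number (20199).
text = "Born in 1950, moved in 1951, order 20199 shipped in 2019"

print(pat.findall(text))         # ['1951', '2019', '2019']
print(pat_strict.findall(text))  # ['1951', '2019']
```

On the question's sample data both patterns give the same result, so the stricter form only matters if the text may contain longer numbers.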

【Discussion】: