Python Pandas Regex specific string [duplicate]: answers

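【Question title】: Python Pandas Regex specific string [duplicate]
【Posted】: 2018-08-12 10:43:15
【Question】:

I want to go through a column of records (directory paths stored as strings) and extract the record ID that appears inside parentheses. In some cases, however, the text in parentheses is not a record ID and needs to be ignored.

Code: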

df1['Doc ID'] = df['Folder Path'].str.extract(r'.*\((.*)\).*', expand=True)  # this does not ignore instances with (2018-03) or (yyyy-mm)

I also tried:

df1['Doc ID'] = df['Folder Path'].str.extract(r'\((?!date_format)([^()]+)\)', expand=True)  # this does not ignore (Data Only)

   Folder Path                                        Doc ID
1  /report/support + admin. (256)/ Global (2018-03)   (256)               # ignores (2018-03)
2  /reports/limit/sector(139)/2017                    (139)
3  /reports/sector/region(147,189 and 132)/2018       (147, 189 and 132)
4  /reports/support.(Data Only)/Region (2558)         (2558)              # ignores (Data Only)

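【Comments】:

So what is your desired output? And how do you tell a record ID from a non-record?
Hi! My desired output is shown in the "Doc ID" column. Record IDs can be told apart from non-record IDs as follows: they are not "(Data Only)", and they are not "(yyyy-mm)" or any other date format. I'm not sure how to combine these two conditions into a single expression that looks for data inside parentheses.
Are those really the only two cases, the words (Data Only) and the date format? Or is (Data Only) just a placeholder standing for other text?
Unfortunately I have no background in parsing with regular expressions (do you have any suggestions for a good place to learn?). The data in this file is a real mess, and this is the best way to identify the record IDs.
(Data Only) actually appears literally in the strings. In about 50% of the records that contain it there is an associated record ID, and in the other half there is not. I'll run this through the script now, thanks for your help!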

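Tags: python regex pandas

【Solution 1】:

This uses a negative lookahead to skip "(Data Only)", and a character class that excludes hyphens to skip the date format: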

(\((?!Data Only)[^\-]+\))
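
As a rough illustration of how the two parts of this pattern behave, the sketch below runs it with the standard `re` module against three short strings. The strings and the `samples` and `results` names are made up for the illustration and are not part of the original answer.

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(r'(\((?!Data Only)[^\-]+\))')

# Illustrative inputs (assumed, not from the original data).
samples = [
    '(Data Only)',  # rejected by the negative lookahead (?!Data Only)
    '(2013-08)',    # rejected because [^\-]+ cannot cross the hyphen
    '(123)',        # accepted: no "Data Only", no hyphen
]

# Map each sample to whether the pattern finds a match in it.
results = {s: bool(pattern.search(s)) for s in samples}

for s, matched in results.items():
    print(f'{s!r}: {"match" if matched else "no match"}')
```

The lookahead only inspects the text right after the opening parenthesis, while the hyphen exclusion applies to everything up to the closing parenthesis.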

Setup:

import pandas as pd

df = pd.DataFrame(
    {'Path': ['(Data Only) text (1, 2 and 3)',
    '(2013-08) foo (123)',
    '(Data Only) bar (1,2,3,4,5 and 6)']}
)

                                Path
0      (Data Only) text (1, 2 and 3)
1                (2013-08) foo (123)
2  (Data Only) bar (1,2,3,4,5 and 6)

Using str.extract:

df.Path.str.extract(r'(\((?!Data Only)[^\-]+\))', expand=True)

                   0
0       (1, 2 and 3)
1              (123)
2  (1,2,3,4,5 and 6)
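
For reference, here is a minimal sketch that applies the same idea to the four sample paths from the question. It also tries a stricter variant, which is an assumption of this sketch and not part of the answer: `[^()]+` keeps a match inside a single pair of parentheses, and a second lookahead rejects only values shaped like `yyyy-mm`. The `loose`, `strict`, and `tricky` names and the path in `tricky` are hypothetical.

```python
import pandas as pd

# The four sample paths from the question.
df = pd.DataFrame({'Folder Path': [
    '/report/support + admin. (256)/ Global (2018-03)',
    '/reports/limit/sector(139)/2017',
    '/reports/sector/region(147,189 and 132)/2018',
    '/reports/support.(Data Only)/Region (2558)',
]})

# Pattern from the answer above.
loose = r'(\((?!Data Only)[^\-]+\))'

# Stricter variant (assumption): stay inside one pair of parentheses and
# reject only "(Data Only)" and "(yyyy-mm)" style values.
strict = r'(\((?!Data Only\))(?!\d{4}-\d{2}\))[^()]+\))'

# expand=False returns a Series because there is a single capture group.
df['Doc ID'] = df['Folder Path'].str.extract(strict, expand=False)
print(df)

# Hypothetical path where the record ID comes before "(Data Only)":
# the loose pattern runs past the first closing parenthesis.
tricky = pd.Series(['/a (256)/ b (Data Only)'])
loose_result = tricky.str.extract(loose, expand=False)[0]
strict_result = tricky.str.extract(strict, expand=False)[0]
print(loose_result)   # (256)/ b (Data Only)
print(strict_result)  # (256)
```

Both patterns give the same result on the question's four rows. The difference only shows when a record ID is followed later in the same path by another parenthesized group without a hyphen.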

【Comments】: