Extracting a string from between two strings in a DataFrame

【Question title】: Extracting a string from between two strings in a DataFrame
【Posted】: 2022-11-02 16:57:39
【Question description】:

I am trying to extract a value from my DataFrame. I have a column ['Desc'] that contains strings in the following format:

_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-

I want to extract the string between 'it_' and '$$$'.

I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
 
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
 
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help.

【Question comments】:

You need "str.find" instead of "find", and "str.slice" for the last line.
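
As a minimal sketch of what that comment suggests, the index-based approach can be made to work through the `.str` accessor. The sample data is taken from the question and the column name `DESC` follows the question's code. One assumption to flag: `Series.str.slice` takes scalar bounds, so the per-row slice below uses plain Python slicing in a list comprehension instead. Note also that the extra `+ 1` in the original attempt would skip the first character of the result.

```python
import pandas as pd

df = pd.DataFrame({
    "DESC": ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]
})

sub1 = "it_"
sub2 = "$$$"

# .str.find works element-wise and returns one index per row
# (it returns -1 for rows where the substring is missing).
start = df["DESC"].str.find(sub1) + len(sub1)
stop = df["DESC"].str.find(sub2)

# The bounds differ per row, so slice each string individually.
df["results"] = [s[a:b] for s, a, b in zip(df["DESC"], start, stop)]
print(df)
```

Rows that lack either delimiter would need separate handling, because a `-1` index silently produces a wrong slice.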

Tags: python pandas dataframe data-analysis data-preprocessing

【Solution 1】:

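You can try a regular expression pattern. It matches the cases you listed here, but I cannot guarantee that it will generalize to every possible pattern.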

import re

string = "_000it_ZZZ$$$-"
# (?<=it) requires "it" just before the match; (?<!\W) makes the match end on a word character
p = re.compile(r"(?<=it)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['_ZZZ']

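The pattern finds it in the string and captures what follows, ending the match at the last word character before the trailing non-word characters ($$$- here). Note that the result keeps the leading underscore.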

【Comments】:

Can I edit it so that it stops once a word character appears? For example, if $$$ were SSS and ZZZ were 123.
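
One way to handle the case raised in this comment, sketched here under assumptions, is to stop relying on word versus non-word characters and match the two delimiters literally. The helper name `between` is hypothetical and is not part of the original answer. `re.escape` makes characters such as `$` match literally, and the non-greedy `(.*?)` stops at the first occurrence of the end delimiter.

```python
import re

def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # Escape both delimiters so regex metacharacters like "$" are literal.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(between("_000it_123SSS-", "it_", "SSS"))  # 123
print(between("_000it_ZZZ$$$-", "it_", "$$$"))  # ZZZ
```

Because the delimiters are passed in as plain strings, the same helper covers both the original `$$$` data and the `SSS` variant.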

【Solution 2】:

You can use str.extract to get the desired output in a new column.

import pandas as pd

df = pd.DataFrame({
    'DESC' : ["_000it_ZZZ$$$-","_0780it_ZBZT$$$-","_011it_BB$$$-","_000it_CCCC$$$-"]
})

# "$" is a regex metacharacter, so the literal dollar signs must be escaped
pat = r"(?<=it_)(.*)(?=\$\$\$-$)"
df['results'] = df['DESC'].str.extract(pat)
print(df)

               DESC results
0    _000it_ZZZ$$$-     ZZZ
1  _0780it_ZBZT$$$-    ZBZT
2     _011it_BB$$$-      BB
3   _000it_CCCC$$$-    CCCC

You can check the regex pattern on Regex101 for more details.
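
As a possible simplification, not taken from the original answer: `str.extract` returns only the capture group, so the lookarounds are not strictly required. The sketch below matches the delimiters directly and passes `expand=False` so that a Series rather than a one-column DataFrame is returned. It assumes the same sample data as above.

```python
import pandas as pd

df = pd.DataFrame({
    "DESC": ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]
})

# Only the parenthesised group is returned; "it_" and the escaped "$$$"
# anchor the match but are not part of the result.
df["results"] = df["DESC"].str.extract(r"it_(.*?)\$\$\$", expand=False)
print(df)
```

Rows that do not match the pattern come back as NaN, which makes missing delimiters easy to spot.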

【Comments】: