RegEx for extracting all characters between a special character and a word

【Question title】: RegEx for extracting all characters between a special character and a word
【Posted】: 2019-04-27 21:30:05
【Question description】:

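I am trying to extract all characters (usually several words, including spaces) between the special character > and the word pattern .myword in my pandas DataFrame.

I tried the following, but it only captures the single word attached to .myword: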

df['my_column'] = df['text'].str.findall(r'(\w+.myword)')

Some example strings:

str1 = '123abc >I want this1.myword'  # extract: I want this1.myword
str2 = '123<>I want this2.myword<>'  # extract: I want this2.myword

【Question discussion】:

In general: >(.*?)\.myword or (?<=>).+?(?=\.myword)
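
A minimal sketch of how the two patterns from this comment behave on the sample strings, using Python's standard `re` module. The helper name `extract_between` is made up for illustration. Note that both patterns return the text without the trailing `.myword`, so it would have to be appended if it is wanted in the result.

```python
import re

# Capture-group variant: the text between '>' and '.myword' lands in group 1.
CAPTURE = re.compile(r'>(.*?)\.myword')
# Lookaround variant: the whole match is the text between the two delimiters.
LOOKAROUND = re.compile(r'(?<=>).+?(?=\.myword)')


def extract_between(text):
    """Return the text between the first '>' and the next '.myword', or None."""
    m = CAPTURE.search(text)
    return m.group(1) if m else None


samples = [
    "123abc >I want this1.myword",
    "123<>I want this2.myword<>",
]

for s in samples:
    # Both variants should agree on these inputs.
    print(extract_between(s), "|", LOOKAROUND.search(s).group(0))
```

On a DataFrame column, the capture-group pattern could presumably be passed to `df['text'].str.extract(...)`, which returns the contents of the group.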

Tags: python regex pandas

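【Solution 1】:

First, an unescaped dot . matches any character, so you need to escape it in your regex: \. Otherwise, the regex would also find a match in a string such as:
123>Iwantthis!myword # extracts Iwantthis!myword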

Second, you have to allow whitespace characters inside the captured group: \s.

I think this should do the job for you: r'([\w\s]+\.myword)'
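
As a quick check, here is a small sketch that runs this pattern over the question's sample strings with the standard `re` module. The function name `find_phrases` is hypothetical. One caveat under this approach: the character class only covers word characters and whitespace, so punctuation inside the phrase would cut the match short.

```python
import re

# Word or whitespace characters, followed by a literal '.myword'.
pattern = re.compile(r'([\w\s]+\.myword)')


def find_phrases(text):
    """Return every run of word/whitespace characters that ends in '.myword'."""
    return pattern.findall(text)


print(find_phrases("123abc >I want this1.myword"))   # ['I want this1.myword']
print(find_phrases("123<>I want this2.myword<>"))    # ['I want this2.myword']
# With the dot escaped, '!myword' no longer matches.
print(find_phrases("123>Iwantthis!myword"))          # []
```

The same pattern can be dropped into the `df['text'].str.findall(...)` call from the question.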

【Discussion】:

【Solution 2】:

$ grep -Po '(?<=>)[^<$]+' <<EOF
123abc >I want this1.myword
123<>I want this2.myword<>
EOF

I want this1.myword
I want this2.myword

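(?<=>) is a positive lookbehind: the match must be preceded by >
[^<$] is a negated character set: it matches any character except < and $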
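
Since the question is about Python, here is a rough equivalent of the grep call using the standard `re` module. It is only a sketch: it takes the first match per line, and, like the grep version, it does not look for `.myword` at all but simply takes everything after `>` up to the next `<` or the end of the line.

```python
import re

# (?<=>)  positive lookbehind: the match must be preceded by '>'
# [^<$]+  one or more characters that are neither '<' nor a literal '$'
pattern = re.compile(r'(?<=>)[^<$]+')

lines = [
    "123abc >I want this1.myword",
    "123<>I want this2.myword<>",
]

for line in lines:
    m = pattern.search(line)
    print(m.group(0) if m else None)
```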

【Discussion】:

【Solution 3】:

Instead of using a regular expression, I would define a dedicated function that extracts the substring:

Code

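def substring(original_string):
    # Index of the first '>' (-1 if it is missing)
    start = original_string.find(">")
    # Index of the first '.myword' after that '>' (-1 if it is missing)
    end = original_string.find(".myword", start + 1)

    if start > -1 and end > -1:
        return original_string[start + 1:end]
    return None


# Apply the function to every value in the 'text' column
df['my_column'] = df['text'].apply(substring)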
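
The function above returns the text without the `.myword` suffix, whereas the expected output in the question includes it. A possible variant that keeps the suffix is sketched below; the name `substring_with_suffix` and its parameters are made up for illustration, and it relies on `str.partition` rather than `find`.

```python
def substring_with_suffix(original_string, start_marker=">", end_marker=".myword"):
    """Return the text after the first start_marker, up to and including end_marker."""
    # partition() splits around the first occurrence of the separator;
    # the middle element is empty when the separator is not found.
    _, found_start, rest = original_string.partition(start_marker)
    if not found_start:
        return None
    body, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return None
    return body + end_marker


print(substring_with_suffix("123abc >I want this1.myword"))  # I want this1.myword
print(substring_with_suffix("123<>I want this2.myword<>"))   # I want this2.myword
print(substring_with_suffix("no markers here"))              # None
```

It can be applied to the column the same way, by passing the function to `df['text'].apply(...)`.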

【Discussion】: