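Answers: Return rows with string content that don't have words exceeding a certain max length, while retaining and filtering out words with certain content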

【Question title】: Return rows w/ string content that don't have words exceeding a certain max length while retaining and filtering out words w/ certain content
【Posted】: 2020-11-05 03:42:23
【Question description】:

This is my dataframe

Input

        qid                     question_stemmed    target  question_length total_words
443216  56da6b6875d686b48fde    mathfracint1x53x5 tantanboxedint1x01x2 sumvarp...   1   589 40
163583  1ffca149bd0a19cd714c    mathoverbracesumvartheta8infty vecfracsumkappa...   1   498 31
522266  663c7523d48f5ee66a3e    httpgooglecom check out the content of the www..    0   449 66
522379  756678d3d48f5ee66a3e    mark had a great day he plans to go fishing with.   0   310 23

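I am using the logic below to return only those records from df whose question_text column has

no word longer than 15 characters (note: word length, not string length) (using negation)
no word containing numeric values when the condition above is true (using negation)
while making sure that words with http or www values are retained (with the 2 conditions above still holding)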

df = df[(~df['question_stemmed'].str.len() > 15) & (~df['question_stemmed'].str.contains(r'[0-9]')) & (df.question_stemmed.str.match('^[^\http]*$'))]

This raises the error: error: bad escape \h at position 3

Expected output

        qid                     question_stemmed     target    question_length  total_words
522266  663c7523d48f5ee66a3e    httpgooglecom check out the content of the www..    0   449 66
522379  756678d3d48f5ee66a3e    mark had a great day he plans to go fishing with.   0   310 23

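Also, I would like to know whether the logic above can satisfy all 3 conditions. Any help is appreciated.

【Question discussion】:

The error is caused by the \h escape; there is no such escape sequence. Can you clarify? So, do you want to skip the first two checks for URLs? Could you provide the expected output for the df above?
@WiktorStribiżew - I added the expected output. Hope that makes it clear. I basically want to filter out all rows containing words that are longer than 15 characters and have numeric content in them (e.g. mathfracint1x53x5), while making sure I don't filter out words in the string content that contain http or www values
Do you really want to analyze question_stemmed column values such as 56da6b6875d686b48fde?
That is just the qid. The main content to analyze is the "question_stemmed" column (pardon my bad formatting :/). Let me try to change it to make it more readable :D
Try df = df[~df['question_stemmed'].str.contains(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')]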
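
The `bad escape \h` error discussed in the comments above can be reproduced with the standard `re` module alone. The sketch below is only an illustration: it uses a raw string for the asker's pattern (which yields the same regex), and the variable names `error_message` and `fixed` are made up for the example.

```python
import re

# The asker's third pattern. \h is not a valid regex escape,
# so compiling it fails before any matching happens.
try:
    re.compile(r'^[^\http]*$')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Dropping the backslash makes the pattern compile, but [^http] is a
# character class: it rejects the individual letters h, t and p,
# not the substring "http".
fixed = re.compile(r'^[^http]*$')
print(bool(fixed.match("mark")))  # no h, t or p in the string
print(bool(fixed.match("the")))   # contains t and h
```

So removing the backslash silences the error without producing a check for the substring `http`, which is probably not what was intended.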

Tags: python regex pandas string

【Solution 1】:

I suggest using

df = df[~df['question_stemmed'].str.contains(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')]

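See the regex demo

Details

(?<!\S) - a whitespace character or the start of the string must come immediately before the current position
(?!\S*(?:http|www\.)) - immediately to the right of the current position, there must not be 0 or more non-whitespace characters followed by the substring http or www.
\S{15} - fifteen non-whitespace characters.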
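
As a minimal sketch of how the suggested filter behaves, the snippet below applies it to a small dataframe. The first four strings are the visible (truncated) prefixes from the question's input; the fifth row is a made-up example added here to exercise the http exemption on a long word. The names `pattern`, `mask` and `filtered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "question_stemmed": [
        "mathfracint1x53x5 tantanboxedint1x01x2 sumvarp",
        "mathoverbracesumvartheta8infty vecfracsumkappa",
        "httpgooglecom check out the content of the www.",
        "mark had a great day he plans to go fishing with",
        # Hypothetical row (not in the question): a long word containing http.
        "httpstackoverflowcomquestions is a long url",
    ]
})

pattern = r'(?<!\S)(?!\S*(?:http|www\.))\S{15}'

# True for rows that contain a word of 15+ characters
# that does not include "http" or "www."
mask = df["question_stemmed"].str.contains(pattern)

# Keep only the rows without such a word.
filtered = df[~mask]
print(filtered)
```

Two details are worth keeping in mind. `\S{15}` also matches the first fifteen characters of any longer word, so words of exactly 15 characters are filtered out too; if those should be kept, `\S{16}` would be the threshold. The pattern also does not test for digits, so a long word without digits is removed as well.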

【Discussion】: