Issue extracting "mentions" from Twitter data using regex: question and answers

【Question title】: Issue extracting "mentions" from Twitter data using regex
【Posted】: 2021-02-04 00:21:32
【Question description】:

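I am trying to extract the mentions from tweets pulled from Twitter, e.g. @Google or @Apple.

This is my code so far. It extracts the mentions from one column and stores them in a new column.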

df_bdtu['mentions'] = df_bdtu['tweet_text'].str.findall('(?:^|\s)[＠ @]{1}([^\s#<>[\]|{}]+)')

It works for the most part, but I am running into problems with a few edge cases, such as this tweet:

Check out @Dreams_n_Songs and give them a follow! I can't recommend their hoodies enough!Shop now  ????…

The mentions stored in the mentions column, shown below, are incorrect, because for some reason they include the emojis.

['Dreams_n_Songs', '????…']

Another problem occurs when a . comes right before a mention, as in this example:

.@ChelseaFC, @FCBayern, @VfL_Wolfsburg and more are among the latest names to be confirmed at -…

The resulting mentions do not include the first mention.

[FCBayern,, VfL_Wolfsburg]

How would I fix the regex to handle these cases?

【Question discussion】:

【Solution 1】:

You can use

[＠@]([^][\s#<>|{}]+)

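See the regex demo. Two things change. The (?:^|\s) group, which requires the start of the string or a whitespace character right before the match, is removed, so a mention preceded by . is no longer skipped. The literal space is also removed from the [＠ @] character class. With that space in the class, two consecutive spaces followed by any non-whitespace text (such as the emojis) match as if the second space were an @ sign.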
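To make the difference concrete, here is a minimal sketch that runs both patterns over the two tweets from the question using only the standard `re` module. The variable names (`ORIGINAL`, `FIXED`, `tweet_emoji`, `tweet_dot`) are illustrative and not part of the original post, and the emoji text is kept in the garbled form in which it appears in the question.

```python
import re

# Pattern from the question: requires start-of-string or whitespace before
# the sign, and the character class accidentally contains a literal space.
ORIGINAL = r'(?:^|\s)[＠ @]{1}([^\s#<>[\]|{}]+)'
# Pattern from the answer: no leading group, no space in the class.
FIXED = r'[＠@]([^][\s#<>|{}]+)'

tweet_emoji = ("Check out @Dreams_n_Songs and give them a follow! "
               "I can't recommend their hoodies enough!Shop now  ????…")
tweet_dot = (".@ChelseaFC, @FCBayern, @VfL_Wolfsburg and more are among "
             "the latest names to be confirmed at -…")

original_emoji = re.findall(ORIGINAL, tweet_emoji)
fixed_emoji = re.findall(FIXED, tweet_emoji)
original_dot = re.findall(ORIGINAL, tweet_dot)
fixed_dot = re.findall(FIXED, tweet_dot)

print(original_emoji)  # ['Dreams_n_Songs', '????…']
print(fixed_emoji)     # ['Dreams_n_Songs']
print(original_dot)    # ['FCBayern,', 'VfL_Wolfsburg']
print(fixed_dot)       # ['ChelseaFC,', 'FCBayern,', 'VfL_Wolfsburg']
```

Note that the trailing comma is still captured by both patterns, because a comma is not in the excluded character set.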

In your Pandas code, you can use it like this:

df_bdtu['mentions'] = df_bdtu['tweet_text'].str.findall(r'[＠@]([^][\s#<>|{}]+)')

Note the r'...' raw string literal notation.
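As a small illustration of why the raw string prefix matters for regex patterns in general, the sketch below compares a plain and a raw literal. The `\b` escape is chosen here only because Python gives it a meaning of its own in plain strings. It is not part of the pattern above, and the sample text is made up.

```python
import re

# In a plain string, '\b' is the backspace character (one character).
# In a raw string, r'\b' is a backslash followed by 'b', which the regex
# engine interprets as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'

plain_len, raw_len = len(plain), len(raw)
plain_hits = re.findall(plain, 'a cat sat')
raw_hits = re.findall(raw, 'a cat sat')

print(plain_len, raw_len)  # 5 7
print(plain_hits)          # []
print(raw_hits)            # ['cat']
```

Escapes such as `\s` currently pass through a plain string unchanged, but Python reports them as invalid escape sequences, so using a raw string for the pattern avoids both the warning and surprises like the one above.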

【Discussion】: