
【Question title】: Fast way to extract hashtags, user mentions and urls from tweet text?
【Posted】: 2014-02-07 19:53:44
【Question】:

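I am trying to find a fast way to build, for each tweet text, an array of: 1- hashtags, 2- user mentions, 3- URLs. The tweet texts are stored in a CSV file.

My current approach takes a long time to process, and I wonder whether I can optimize my code a bit. I will show the regex for each match type, but to avoid posting long code I will only show how I match hashtags. The same technique applies to URLs and user mentions.

Here it is: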

public static String hashtagRegex = "^#\\w+|\\s#\\w+";
public static Pattern hashtagPattern = Pattern.compile(hashtagRegex);

public static String urlRegex = "http+://[\\S]+|https+://[\\S]+";
public static Pattern urlPattern = Pattern.compile(urlRegex);

public static String mentionRegex = "^@\\w+|\\s@\\w+";
public static Pattern mentionPattern = Pattern.compile(mentionRegex);

public static String[] getHashtag(String text) {
   String hashtags[];
   matcher = hashtagPattern.matcher(tweet.getText());

    if ( matcher.find() ) {
        hashtags = new String[matcher.groupCount()];
        for ( int i = 0; matcher.find(); i++ ) {
            // Also I'm getting an ArrayIndexOutOfBoundsException
            hashtags[i] = matcher.group().replace(" ", "").replace("#", "");
        }
    }

   return hashtags;

}

【Comments on the question】:

Tags: java regex twitter

【Solution 1】:

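Matcher#groupCount gives you the number of capturing groups in the pattern, not the number of matches. That is why you get an ArrayIndexOutOfBoundsException (in your example the pattern has no capturing groups, so the array is created with length zero). You probably want to collect the matches in a List, which grows dynamically, rather than in an array.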
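A minimal sketch of the List-based version, reusing the hashtag pattern from the question. The class name HashtagExtractor and the method name getHashtags are placeholders chosen for illustration. Two further details of the original snippet are handled here as well: find() is called only in the loop condition (the extra find() inside the if would consume the first match before the loop starts), and the matcher runs on the text parameter instead of the undeclared tweet variable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagExtractor {
    // Same pattern as in the question: '#' at the start of the text or after whitespace.
    private static final Pattern HASHTAG_PATTERN = Pattern.compile("^#\\w+|\\s#\\w+");

    // Returns every hashtag found in the text, without the leading '#'.
    static List<String> getHashtags(String text) {
        List<String> hashtags = new ArrayList<>();
        Matcher matcher = HASHTAG_PATTERN.matcher(text);
        // Each call to find() advances to the next match, so call it only here.
        while (matcher.find()) {
            // trim() drops the whitespace matched by \s, substring(1) drops the '#'.
            hashtags.add(matcher.group().trim().substring(1));
        }
        return hashtags;
    }
}
```

If an array is really needed by the caller, the result can be converted at the end with hashtags.toArray(new String[0]).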

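One (potential) speed-up could be to tokenize the text on whitespace and then check whether each token starts with a fragment such as http, @ or #. That way you can avoid regular expressions entirely. (I have not profiled this, so I cannot say what the performance impact is.)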
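The tokenizing idea could look roughly like the sketch below. It is only one possible reading of the suggestion and has not been benchmarked against the regex version. TweetTokenScanner and its three lists are hypothetical names, and StringTokenizer is used here because String.split("\\s+") would bring a regular expression back in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TweetTokenScanner {
    // Hypothetical result holder: one list per entity type.
    final List<String> hashtags = new ArrayList<>();
    final List<String> mentions = new ArrayList<>();
    final List<String> urls = new ArrayList<>();

    static TweetTokenScanner scan(String text) {
        TweetTokenScanner result = new TweetTokenScanner();
        // By default StringTokenizer splits on space, tab, newline, CR and form feed.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("http://") || token.startsWith("https://")) {
                result.urls.add(token);
            } else if (token.length() > 1 && token.charAt(0) == '#') {
                result.hashtags.add(token.substring(1)); // drop the '#'
            } else if (token.length() > 1 && token.charAt(0) == '@') {
                result.mentions.add(token.substring(1)); // drop the '@'
            }
        }
        return result;
    }
}
```

One difference from the regex in the question: a token keeps any trailing punctuation, so "#java," is stored as "java," whereas \w+ would stop at the comma. Whether that matters depends on the data.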

【Comments】:

Code that tokenizes on whitespace is much simpler than using regular expressions, so definitely an upvote!