Custom word tokenizer: answers

【Question title】: Custom word tokenizer
【Posted】: 2016-04-06 19:20:03
【Question】:

I am analysing Twitter data for sentiment analysis, and I need to tokenize the tweets first.

Take this as a sample tweet:

tweet = "Barça, que más veces ha jugado contra 10 en la historia https://twitter.com/7WUjZrMJah #UCL"

nltk.word_tokenize() can tokenize the tweet, but it breaks up links and hashtags.

word_tokenize(tweet)

>>> ['Bar\xc3\xa7a', ',', 'que', 'm\xc3\xa1s', 'veces', 'ha', 'jugado', 'contra', '10', 'en', 'la', 'historia', 'https', ':', '//twitter.com/7WUjZrMJah', '#', 'UCL']

The Unicode characters stay intact, but the link is broken. So I wrote a custom regex tokenizer:

import re

emoticons = r'(?:[:;=\^\-oO][\-_\.]?[\)\(\]\[\-DPOp_\^\\\/])'

regex_tweets = [
    emoticons,
    r'<[^>]+>',                    # HTML tags
    r'(?:@[\w\d_]+)',              # @-mentions
    r'(?:\#[\w]+)',                # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r"(?:[a-z][a-z'\-_]+[a-z])",   # words with - and '
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r'(?:[\w_]+)',                 # other words
    r'(?:\S)'                      # any other non-space character
]

# join the alternatives into one pattern; earlier entries take priority
tokens_re = re.compile(r'(' + '|'.join(regex_tweets) + ')', re.IGNORECASE | re.VERBOSE)
tokens_re.findall(tweet)

>>> ['Bar', '\xc3', '\xa7', 'a', ',', 'que', 'm', '\xc3', '\xa1', 's', 'veces', 'ha', 'jugado', 'contra', '10', 'en', 'la', 'historia', 'https://twitter.com/7WUjZrMJah', '#UCL']

Now the hashtag and the link come out the way I want, but the tokenizer breaks at Unicode characters (for example Barça -> ['Bar', '\xc3', '\xa7', 'a'] instead of ['Bar\xc3\xa7a']).
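
The split above is what one would expect when the regex runs over raw UTF-8 bytes rather than decoded text. The following is a minimal sketch of that effect, assuming Python 3 (the thread itself appears to use Python 2.7); the shortened tweet and the variable names are illustrative only.

```python
import re

tweet = "Barça, que más veces ha jugado"

# Assumption: Python 3, where str is Unicode text and bytes is raw UTF-8.
raw = tweet.encode("utf-8")

# On bytes, \w only matches ASCII letters, digits and underscore, so the two
# bytes that encode "ç" (0xC3 0xA7) end the current token.
byte_tokens = re.findall(rb"\w+", raw)

# On decoded text, \w matches any Unicode word character, accents included.
text_tokens = re.findall(r"\w+", raw.decode("utf-8"))

print(byte_tokens)
print(text_tokens)
```

In the byte version "Barça" comes back as two pieces, b'Bar' and b'a', while the decoded version keeps it as one token.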

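Is there a way to combine the two? Or a regular expression that handles Unicode characters?

I have also tried TweetTokenizer from the nltk.tokenize library, but it was not very useful.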

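【Question comments】:

You also need to specify the re.U flag: re.IGNORECASE | re.VERBOSE | re.UNICODE. Also note that [\w\d_]+ is equivalent to \w+. In addition, (?:(?:\d+,?)+(?:\.?\d+)?) looks fragile.
Is this Python 2.7? Did you encode the input text as UTF8?
No, it is like this.
@WiktorStribiżew I had not done that, and now it seems to work the way I want. It still breaks on a few characters, but it is better than before! Thanks!
You could post your solution, by the way.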
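
Two of the remarks in the first comment can be checked directly. The sketch below, assuming Python 3, shows that [\w\d_]+ and \w+ match the same text, and that the number pattern from the question keeps a trailing comma. The strict_number pattern is a hypothetical alternative for illustration, not something proposed in the thread.

```python
import re

# \d and _ are already part of \w, so both patterns match the same text.
mention_long = re.compile(r"@[\w\d_]+")
mention_short = re.compile(r"@\w+")

# The number pattern from the question: the optional comma may end the match.
loose_number = re.compile(r"(?:(?:\d+,?)+(?:\.?\d+)?)")

# Hypothetical stricter variant: a comma must be followed by three digits,
# and a decimal point must be followed by at least one digit.
strict_number = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

mention_text = "@user_1 hola"
number_text = "scored 10, then 1,234.5 points"

print(mention_long.findall(mention_text), mention_short.findall(mention_text))
print(loose_number.findall(number_text))
print(strict_number.findall(number_text))
```

On this sample the loose pattern returns '10,' with the comma attached, while the stricter variant returns '10'; both keep '1,234.5' whole.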

Tags: python regex tokenize tweepy

【Solution 1】:

If I declare the string as a unicode string, most Unicode characters no longer break. It still splits many words, but the results are better.

# coding=utf-8
import re

tweet = u"Barça, que más veces ha jugado contra 10 en la historia https://twitter.com/7WUjZrMJah #UCL"

emoticons = r'(?:[:;=\^\-oO][\-_\.]?[\)\(\]\[\-DPOp_\^\\\/])'

regex_tweets = [
    emoticons,
    r'<[^>]+>',                    # HTML tags
    r'(?:@[\w\d_]+)',              # @-mentions
    r'(?:\#[\w]+)',                # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r"(?:[a-z][a-z'\-_]+[a-z])",   # words with - and ' (ASCII letters only)
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r'(?:[\w_]+)',                 # other words
    r'(?:\S)'                      # any other non-space character
]

# re.UNICODE (suggested in the comments) lets \w match accented letters such as ç and á
tokens_re = re.compile(r'(' + '|'.join(regex_tweets) + ')', re.IGNORECASE | re.VERBOSE | re.UNICODE)
tokens_re.findall(tweet)

>>> [u'Bar', u'\xe7a', u',', u'que', u'm\xe1s', u'veces', u'ha', u'jugado', u'contra', u'10', u'en', u'la', u'historia', u'https://twitter.com/7WUjZrMJah', u'#UCL']

It still tokenizes Barça as [u'Bar', u'\xe7a'], which is better than ['Bar', '\xc3', '\xa7', 'a'] but still not the original term ['Bar\xc3\xa7a']. It does work for many expressions, though.
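
The remaining split of Barça looks like a consequence of the "words with - and '" alternative: it only accepts a-z, and because it is tried before \w+, it claims "Bar" and stops at "ç". A possible fix is to make that alternative Unicode-aware. The sketch below assumes Python 3, where str patterns are Unicode-aware by default; it uses a reduced pattern list (the emoticon and HTML alternatives are left out for brevity), and the names LETTER, PATTERNS, TOKEN_RE and tokenize are hypothetical, not from the original answer.

```python
import re

tweet = ("Barça, que más veces ha jugado contra 10 en la historia "
         "https://twitter.com/7WUjZrMJah #UCL")

# [^\W\d_] reads as "a word character that is not a digit or underscore",
# which amounts to any Unicode letter, so ç and á are treated like a-z.
LETTER = r"[^\W\d_]"

# Order matters: earlier alternatives win when several could match.
PATTERNS = [
    r"@\w+",                                                 # @-mentions
    r"#\w+",                                                 # hashtags
    r"http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+",  # URLs
    LETTER + r"(?:" + LETTER + r"|['\-_])+" + LETTER,        # words with - and '
    r"(?:\d+,?)+(?:\.?\d+)?",                                # numbers (as in the question)
    r"\w+",                                                  # other words
    r"\S",                                                   # any other non-space character
]

# Wrap each alternative in a non-capturing group so findall returns whole matches.
TOKEN_RE = re.compile("|".join("(?:%s)" % p for p in PATTERNS), re.IGNORECASE)


def tokenize(text):
    """Return the list of tokens found in text, in order of appearance."""
    return TOKEN_RE.findall(text)


tokens = tokenize(tweet)
print(tokens)
```

With this change "Barça" and "más" each come back as a single token, and the URL and hashtag are still kept whole. On Python 2 the same idea would additionally need a unicode input string and the re.UNICODE flag.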

【Comments】: