Parsing a tweet to extract hashtags into an array

【Question title】: Parsing a tweet to extract hashtags into an array
【Posted】: 2010-03-27 02:25:09
【Question】:

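I am having a hard time taking the information in a tweet, hashtags included, and pulling each hashtag into an array using Python. I am embarrassed to even show what I have been trying so far.

For example: "I love #stackoverflow because #people are very #helpful!"

This should pull the 3 hashtags into an array.

【Discussion】:

The type you are interested in is called a "list". Python does have something called an "array", in the module of the same name, but it is rarely used.

Tags: python arrays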

【Solution 1】:

A simple regex should do the job:

>>> import re
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> re.findall(r"#(\w+)", s)
['stackoverflow', 'people', 'helpful']

Note that, as suggested in other answers, this may also find non-hashtags, such as a hash location in a URL:

>>> re.findall(r"#(\w+)", "http://example.org/#comments")
['comments']

Therefore, another simple solution would be the following (it removes duplicates as a bonus):

>>> def extract_hash_tags(s):
...    return set(part[1:] for part in s.split() if part.startswith('#'))
...
>>> extract_hash_tags("#test http://example.org/#comments #test")
set(['test'])
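
One way to address both caveats at once (URL anchors picked up by the regex, and punctuation clinging to tokens produced by split()) is to split on whitespace first and then match the regex only at the start of each token. The following is a minimal Python 3 sketch, not part of the original answer; the names HASHTAG_AT_START and extract_hash_tags_strict are made up for illustration.

```python
import re

# Hypothetical helper names; the pattern is the same r"#(\w+)" used above.
HASHTAG_AT_START = re.compile(r"#(\w+)")


def extract_hash_tags_strict(text):
    """Return the set of hashtags that begin a whitespace-separated token."""
    tags = set()
    for token in text.split():
        # match() anchors at the start of the token, so "http://x/#frag" is
        # skipped and "#tag," yields only "tag".
        found = HASHTAG_AT_START.match(token)
        if found:
            tags.add(found.group(1))
    return tags


print(extract_hash_tags_strict("#test, http://example.org/#comments #test!"))
```

A hashtag glued to a preceding character, such as (#test), is ignored by this sketch because the token does not start with #. Whether that is acceptable depends on the tweets being processed.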

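【Discussion】:

Your simple solution includes too many characters: for example, if a comma follows the hashtag, it ends up being included in the hashtag.
You should add re.UNICODE to work with unicode hashtags.
It does not handle the exception for digit-only tags such as "#123". I use r'\B#\w*[a-zA-Z]+\w*'.
Might be useful: here is how to time these functions: import timeit; timeit.timeit('re.findall(r"#(\w+)", s)', setup="import re; s='I love #stackoverflow because #people are very #helpful!'", number=10000); timeit.timeit('extract_hash_tags(s)', setup="import re; s='I love #stackoverflow because #people are very #helpful!'; from main import extract_hash_tags", number=10000). On my machine, the regex was faster. See docs.python.org/2/library/timeit.html#timeit-examples
That was my own naive approach too, and then I googled for hashtags... #this-is-not-a-hashtag #nufsaid.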
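
The timing comparison suggested in the comments can be sketched as a self-contained Python 3 script. This assumes Python 3.5 or later, where timeit.timeit accepts a globals argument, which avoids the "from main import ..." setup string. No result is claimed here; timings vary by machine and input.

```python
import re
import timeit

s = "I love #stackoverflow because #people are very #helpful!"


def extract_hash_tags(text):
    # Same split-based approach as in the answer above.
    return set(part[1:] for part in text.split() if part.startswith("#"))


# Names the timed statements need; passed explicitly instead of a setup string.
namespace = {"re": re, "s": s, "extract_hash_tags": extract_hash_tags}

regex_seconds = timeit.timeit(r're.findall(r"#(\w+)", s)', globals=namespace, number=10000)
split_seconds = timeit.timeit("extract_hash_tags(s)", globals=namespace, number=10000)

print("regex:", regex_seconds)
print("split:", split_seconds)
```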

【Solution 2】:

>>> s="I love #stackoverflow because #people are very #helpful!"
>>> [i  for i in s.split() if i.startswith("#") ]
['#stackoverflow', '#people', '#helpful!']

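【Discussion】:

I think this is better than using the regex from the accepted answer. This way, things like example.com/index.html#anchor_link are not flagged as hashtags.
How would you solve a similar problem using pandas? stackoverflow.com/questions/38044375/…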

【Solution 3】:

The best Twitter hashtag regex:

>>> import re
>>> text = "#promovolt #1st # promovolt #123"
>>> re.findall(r'\B#\w*[a-zA-Z]+\w*', text)
['#promovolt', '#1st']
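
To see what each part of this pattern contributes, here is a small Python 3 sketch run over a few inputs. The sample strings other than the first are illustrative assumptions, not from the answer. \B only matches where # does not directly follow a word character, and [a-zA-Z]+ requires at least one ASCII letter in the tag.

```python
import re

# The pattern from the answer, compiled once for reuse.
HASHTAG = re.compile(r"\B#\w*[a-zA-Z]+\w*")

samples = [
    "#promovolt #1st # promovolt #123",  # input from the answer
    "user#tag",                          # '#' right after a word character
    "http://example.org/#comments",      # '#' right after '/', a non-word character
]
for sample in samples:
    print(repr(sample), "->", HASHTAG.findall(sample))
```

The second sample yields no match, while the URL fragment in the third is still reported as '#comments', because '/' is not a word character. Links therefore need to be filtered separately if they can occur in the input.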

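【Discussion】:

Isn't #1st an invalid hashtag?
Hi @benzkji, as you can see above, #1st is a valid Twitter hashtag. #123 is an invalid Twitter hashtag.
Indeed! I was looking at the first Google result and took it for granted... hashtags.org/platforms/twitter/… Again, your solution is the shortest and it works. Upvoted!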

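【Solution 4】:

Suppose you have to retrieve your #Hashtags from a sentence full of punctuation. Say that #stackoverflow, #people and #helpful are terminated by different symbols, and you want to retrieve them from text while avoiding repetitions: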

>>> text = "I love #stackoverflow, because #people... are very #helpful! Are they really #helpful??? Yes #people in #stackoverflow are really really #helpful!!!"

If you try set([i for i in text.split() if i.startswith("#")]) alone, you will get:

>>> set(['#helpful???',
 '#people',
 '#stackoverflow,',
 '#stackoverflow',
 '#helpful!!!',
 '#helpful!',
 '#people...'])

which in my mind is redundant. A better solution uses a regular expression from the re module:

>>> import re
>>> set([re.sub(r"(\W+)$", "", j) for j in set([i for i in text.split() if i.startswith("#")])])
>>> set(['#people', '#helpful', '#stackoverflow'])

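Now that is OK for me.

EDIT: UNICODE #Hashtags

Add the re.UNICODE flag if you want to delete punctuation but still keep letters with accents, apostrophes and other unicode-encoded characters. This may be important if the #Hashtags are expected to be in languages other than English. Maybe this is only an Italian guy's nightmare, maybe not! ;-)

For example: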

>>> text = u"I love #stackoverflòw, because #peoplè... are very #helpfùl! Are they really #helpfùl??? Yes #peoplè in #stackoverflòw are really really #helpfùl!!!"

will be unicode-encoded as:

>>> u'I love #stackoverfl\xf2w, because #peopl\xe8... are very #helpf\xf9l! Are they really #helpf\xf9l??? Yes #peopl\xe8 in #stackoverfl\xf2w are really really #helpf\xf9l!!!'

and you can retrieve your (correctly encoded) #Hashtags this way:

>>> set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])
>>> set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])

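EDITx2: UNICODE #Hashtags and control for # repetitions

If you want to control for multiple repetitions of the # symbol, as in the following (forgive me if the text example has become almost unreadable):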

>>> text = u"I love ###stackoverflòw, because ##################peoplè... are very ####helpfùl! Are they really ##helpfùl??? Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!"
>>> u'I love ###stackoverfl\xf2w, because ##################peopl\xe8... are very ####helpf\xf9l! Are they really ##helpf\xf9l??? Yes ###peopl\xe8 in ######stackoverfl\xf2w are really really ######helpf\xf9l!!!'

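then you should replace these multiple occurrences with a single #. A possible solution is to introduce another nested set() construction, in which the sub() function replaces runs of # with a single #: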

>>> set([re.sub(r"#+", "#", k) for k in set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])])
>>> set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
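
The nested set() comprehensions can be hard to follow, so here is a hedged Python 3 sketch of the same steps written as a loop. The function name unique_hashtags and the shortened sample text are illustrative assumptions. In Python 3, \W is already Unicode-aware for str patterns, so the re.UNICODE flag is not needed. Unlike the one-liner, this sketch also skips tokens that consist only of # signs or punctuation.

```python
import re


def unique_hashtags(text):
    """Collect hashtags, dropping trailing punctuation and repeated # signs."""
    tags = set()
    for word in text.split():
        if not word.startswith("#"):
            continue
        word = re.sub(r"\W+$", "", word)  # strip trailing non-word characters
        word = re.sub(r"#+", "#", word)   # collapse runs of # into a single #
        if len(word) > 1:                 # skip tokens that were only # or punctuation
            tags.add(word)
    return tags


text = "I love ###stackoverflòw, because ##peoplè... are very ####helpfùl! Really ##helpfùl???"
print(sorted(unique_hashtags(text)))
```

Punctuation inside a token, as in #hash,tag or #what-ever, is left untouched by this sketch, just as in the original one-liner.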

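【Discussion】:

Your solution looks like the most complete one. But dashes and special characters still stay inside the hashtag? So you would get #hash,tag and #what-ever as tags...? Is there a simple solution?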

【Solution 5】:

AndiDog's answer will be thrown off by links and other things, so you may want to filter those out first. After that, use this code:

import re

# Python 2 syntax: the ur'' prefix is not accepted by Python 3.
UTF_CHARS = ur'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = ur'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)

This may seem like overkill, but it has been converted from http://github.com/mzsanford/twitter-text-java. It will handle about 99% of all hashtags the same way Twitter handles them.

For more converted Twitter regexes, check out: http://github.com/BonsaiDen/Atarashii/blob/master/atarashii/usr/share/pyshared/atarashii/formatter.py

EDIT:
Check out: http://github.com/BonsaiDen/AtarashiiFormat
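
For readers on Python 3, the same pattern can be sketched as follows. This is an adaptation, not the answer's original code. It assumes Python 3.3 or later, where the re module understands \uXXXX escapes inside str patterns, so plain raw strings replace the ur'' literals. The re.UNICODE flag is dropped because it is the default for str patterns, and the first two groups are made non-capturing so that findall() returns only the tag text.

```python
import re

# Same character ranges as in the answer, written as a plain raw string.
UTF_CHARS = r"a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff"
# Non-capturing groups for the prefix and the hash sign: findall() then
# returns only the tag text instead of 3-tuples.
TAG_EXP = r"(?:^|[^0-9A-Z&/]+)(?:#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)" % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.IGNORECASE)

print(TAG_REGEX.findall("I love #stackoverflow because #people are very #helpful!"))
print(TAG_REGEX.findall("http://example.org/#comments #123 #1st"))
```

The first call prints ['stackoverflow', 'people', 'helpful']. The second prints ['1st']: the URL fragment is skipped because '/' is excluded from the allowed prefix characters, and #123 is rejected because a tag must contain at least one letter or underscore.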

【Discussion】:

You can use (?:^|[^0-9A-Z&/]+)(?:#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*) to extract only the hashtag text itself, instead of a tuple containing the whitespace, the hash sign and the text.

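【Solution 6】:

A simple gist (better than the chosen answer): https://gist.github.com/mahmoud/237eb20108b5805aed5f. It also works with unicode hashtags.

【Discussion】:

Yes, it is definitely better, because the accepted answer extracts test) from (#test), whereas this gist returns test, as expected.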

【Solution 7】:

hashtags = [word for word in tweet.split() if word[0] == "#"]

【Discussion】:

You mean ==, not =. (Also, word.startswith("#") is preferable to word[0] == "#".)

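【Solution 8】:

I had a lot of problems with unicode languages.

I have seen many ways to extract hashtags, but found that none of them handles every case.

So I wrote a small piece of Python code that handles most cases. It works for me.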

def get_hashtagslist(string):
    ret = []
    s=''
    hashtag = False
    for char in string:
        if char=='#':
            hashtag = True
            if s:
                ret.append(s)
                s=''           
            continue

        # Keep only the prefix of the hashtag when one of these characters follows it (e.g. '#happy,but i..' yields only 'happy').
        if hashtag and char in [' ','.',',','(',')',':','{','}'] and s:
            ret.append(s)
            s=''
            hashtag=False 

        if hashtag:
            s+=char

    if s:
        ret.append(s)

    # Deduplicate and keep only tags that are between 2 and 19 characters long.
    return list(set([word for word in ret if len(word) > 1 and len(word) < 20]))

【Discussion】:

【Solution 9】:

I extract hashtags in a silly but effective way.

def retrive(s):
    indice_t = []
    tags = []
    tmp_str = ''
    s = s.strip()
    for i in range(len(s)):
        if s[i] == "#":
            indice_t.append(i)
    for i in range(len(indice_t)):
        index = indice_t[i]
        if i == len(indice_t)-1:
            boundary = len(s)
        else:
            boundary = indice_t[i+1]
        index += 1
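        while index < boundary:
            # Stop at the first punctuation or whitespace character.
            if s[index] in "`~!@#$%^&*()-_=+[]{}|\\:;'"",.<>?/ \n\t":
                break
            tmp_str += s[index]
            index += 1
        # Record the tag (skipping empty ones) and reset the buffer for the next '#'.
        if tmp_str != '':
            tags.append(tmp_str)
            tmp_str = ''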
    return tags

【Discussion】:

Please remove the junk comments.