【Question title】: Python - Extract hashtags from text; end at punctuation
【Posted】: 2017-04-18 15:05:57
【Question description】:

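For my programming class, I have to write a function according to the following description:

The parameter is a tweet. This function should return a list containing all of the hashtags in the tweet, in the order they appear in the tweet. Each hashtag in the returned list should have the initial hash symbol removed, and hashtags should be unique. (If a tweet uses the same hashtag twice, it is included in the list only once. The order of the hashtags should match the order of the first occurrence of each tag in the tweet.)

I am not sure how to make a hashtag end when punctuation is encountered (see the second doctest example). My current code does not output anything: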

def extract(start, tweet):
    """ (str, str) -> list of str

    Return a list of strings containing all words that start with a specified character.

    >>> extract('@', "Make America Great Again, vote @RealDonaldTrump")
    ['RealDonaldTrump']
    >>> extract('#', "Vote Hillary! #ImWithHer #TrumpsNotMyPresident")
    ['ImWithHer', 'TrumpsNotMyPresident']
    """

    words = tweet.split()
    return [word[1:] for word in words if word[0] == start]

def strip_punctuation(s):
    """ (str) -> str

    Return a string, stripped of its punctuation.

    >>> strip_punctuation("Trump's in the lead... damn!")
    'Trumps in the lead damn'
    """
    return ''.join(c for c in s if c not in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

def extract_hashtags(tweet):
    """ (str) -> list of str

    Return a list of strings containing all unique hashtags in a tweet.
    Outputted in order of appearance.

    >>> extract_hashtags("I stand with Trump! #MakeAmericaGreatAgain #MAGA #TrumpTrain")
    ['MakeAmericaGreatAgain', 'MAGA', 'TrumpTrain']
    >>> extract_hashtags("NEVER TRUMP. I'm with HER. Does #this! work?")
    ['this']
    """

    hashtags = extract('#', tweet)

    no_duplicates = []

    for item in hashtags:
        if item not in no_duplicates and item.isalnum():
            no_duplicates.append(item)

    result = []
    for hash in no_duplicates:
        for char in hash:
            if char.isalnum() == False and char != '#':
                hash == hash[:char.index()]
                result.append()
    return result

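I'm quite lost at this point; any help would be appreciated. Thank you in advance.

Note: We are not allowed to use regular expressions or import any modules.

【Question discussion】:

  • Well... if you need to end at punctuation, and there aren't that many punctuation characters, why not check whether the next character is punctuation?

Tags: python twitter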


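【Solution 1】:

You do look a bit lost. The key to solving this kind of problem is to divide it into smaller parts, solve those, and then combine the results. You've already got every piece you need...: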

def extract_hashtags(tweet):
    # strip the punctuation on the tags you've extracted (directly)
    hashtags = [strip_punctuation(tag) for tag in extract('#', tweet)]
    # hashtags is now a list of hash-tags without any punctuation, but possibly with duplicates

    result = []
    for tag in hashtags:
        if tag not in result:  # check that we haven't seen the tag already (we know it doesn't contain punctuation at this point)
            result.append(tag)
    return result
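
Note that stripping punctuation removes it from anywhere in the tag, so a word such as "#this!now" would become "thisnow" rather than stopping at the "!". Below is a minimal sketch of a variant that cuts each tag at its first non-alphanumeric character, without regular expressions or imports. It assumes that "punctuation" means any character for which str.isalnum() is false (so an underscore also ends a tag), and the name extract_hashtags_truncating is hypothetical, not from the original post:

```python
def extract_hashtags_truncating(tweet):
    """Return unique hashtags in order of first appearance, each cut at
    the first non-alphanumeric character (hypothetical helper)."""
    result = []
    for word in tweet.split():
        if not word.startswith('#'):
            continue
        tag = ''
        for char in word[1:]:
            if not char.isalnum():
                break  # assumption: any non-alphanumeric character ends the tag
            tag += char
        # skip empty tags (e.g. a lone '#') and tags already seen
        if tag and tag not in result:
            result.append(tag)
    return result


print(extract_hashtags_truncating("NEVER TRUMP. I'm with HER. Does #this! work?"))
print(extract_hashtags_truncating("#MAGA again #MAGA, and #this!now"))
```

The two calls should print ['this'] and ['MAGA', 'this'] respectively. Whether truncating or stripping is the intended behavior depends on how the assignment defines the end of a hashtag.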

PS: This is a problem very well suited to a regular-expression solution, but if you want a fast strip_punctuation, you can use:

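def strip_punctuation(s):
    # Python 3: build a deletion table with str.maketrans (a built-in, no import needed).
    # The Python 2 equivalent was s.translate(None, chars).
    return s.translate(str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'))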
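
For comparison, and outside the assignment's restrictions (which rule out regular expressions and imports), a regex-based version might look like the sketch below. The function name extract_hashtags_re is hypothetical, and the pattern assumes a hashtag is a "#" at the start of a whitespace-separated word followed by ASCII letters and digits:

```python
import re


def extract_hashtags_re(tweet):
    # (?<!\S) requires the '#' to begin a whitespace-delimited word;
    # [A-Za-z0-9]+ captures the ASCII letters and digits that follow it.
    tags = re.findall(r'(?<!\S)#([A-Za-z0-9]+)', tweet)
    result = []
    for tag in tags:
        if tag not in result:  # keep only the first occurrence of each tag
            result.append(tag)
    return result


print(extract_hashtags_re("Does #this! work? #this #MAGA"))
```

The call above should print ['this', 'MAGA']. A "#" in the middle of a word (as in "a#b") is deliberately not treated as a hashtag here, matching the word-based behavior of extract.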

【Discussion】:
