Python loop that gets HTML tags returns an empty list instead of the tags

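【Question title】: Python loop that gets HTML tags returning empty list instead of tags
【Posted】: 2016-10-07 20:24:12
【Question】:

I am trying to write a function that goes through a list of HTML tags, stored as individual characters, and returns the tag names. For example, it would go through the list below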

['<', 'h', 't', 'm', 'l', '>', '<', 'h', 'e', 'a', 'd', '>', '<', 'm', 'e', 't', 'a', '>']

and return a list like this

['html','head','meta']

However, when I run the function below, it returns an empty list []

def getTag(htmlList):
    tagList=[]
    for iterate, character in enumerate(htmlList):
        tagAppend = ''
        if character=='<':
            for index, word in enumerate(htmlList):
                if index>iterate:
                    if character=='>':
                        tagList.append(tagAppend)
                        break
                    tagAppend += character

    return tagList

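This program seems to make sense to me. It creates an empty list (tagList), then iterates through a list (htmlList) like the first one I posted.

While iterating, if it encounters a '<', it adds the following characters to tagAppend and stops when it encounters a '>'. It then adds tagAppend to tagList, clears tagAppend, and repeats the loop.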

Tags: python html list if-statement for-loop

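【Solution 1】:

This looks overly complicated. Instead, join the list into a single string, remove the left angle brackets, then split on the right angle bracket, remembering to discard empty strings: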

def get_tag(l):
    return [item for item in ''.join(l).replace('<','').split('>') if item]

Result:

>>> l = ['<', 'h', 't', 'm', 'l', '>', '<', 'h', 'e', 'a', 'd', '>', '<', 'm', 'e', 't', 'a', '>']
>>> get_tag(l)
['html', 'head', 'meta']
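
To see why the one-liner works, the intermediate values can be inspected step by step. This is only an illustrative sketch; the names `chars`, `joined`, `stripped`, `parts`, and `tags` are made up for this example and do not come from the answer.

```python
chars = ['<', 'h', 't', 'm', 'l', '>', '<', 'h', 'e', 'a', 'd', '>',
         '<', 'm', 'e', 't', 'a', '>']

joined = ''.join(chars)               # '<html><head><meta>'
stripped = joined.replace('<', '')    # 'html>head>meta>'
parts = stripped.split('>')           # ['html', 'head', 'meta', '']
tags = [item for item in parts if item]  # drop the trailing empty string

print(tags)
```

The trailing empty string comes from the final '>' in the input, which is why the list comprehension filters on `if item`.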

【Solution 2】:

I think the re module would be a good choice here.

import re

def get_tag(l):
    return re.findall(r'<([a-z]+)>', ''.join(l))

>>> get_tag(l)
['html', 'head', 'meta']
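
Note that the pattern `[a-z]+` only matches tag names made of lowercase letters, so a tag such as `<h1>` or `<DIV>` would be skipped. If such tags can appear in the input (an assumption, since the question only shows lowercase names), one possible variation is sketched below. The function name `get_tag_relaxed` and the sample data are hypothetical.

```python
import re

def get_tag_relaxed(chars):
    # A letter followed by optional letters or digits, matched case-insensitively.
    pattern = r'<([a-z][a-z0-9]*)>'
    return re.findall(pattern, ''.join(chars), flags=re.IGNORECASE)

sample = list('<html><H1><br>')
print(get_tag_relaxed(sample))  # ['html', 'H1', 'br']
```

This still ignores tags with attributes and closing tags such as `</p>`; it only widens the set of plain tag names that are recognised.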

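【Solution 3】:

Your code is almost correct. You only need to replace every occurrence of character in the inner loop with word; as written, word is never used in that inner loop, so the inner loop keeps testing the outer loop's character (which is always '<') and never sees a '>':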

        ...
        for index, word in enumerate(htmlList):
            if index > iterate:
                if word == '>':   # here
                    tagList.append(tagAppend)
                    break
                tagAppend += word # here
        ...

You can also do this without enumerate and without the nested for loop, as follows:

def get_tag(htmlList):
    tag_list = []
    for x in htmlList:
        if x == '<':
            tag = ''
            continue
        elif x == '>':
            tag_list.append(tag)
            continue
        tag += x
    return tag_list

【Solution 4】:

I assume this is just an exercise for learning purposes. In general, Python has better tools for parsing HTML (https://www.crummy.com/software/BeautifulSoup/) or strings (https://docs.python.org/2/library/re.html).

def getTag(htmlList):
    tagList=[]
    for iterate, character in enumerate(htmlList):
        tagAppend = ''
        if character=='<':
            for index, word in enumerate(htmlList):
                if index>iterate:
                    # use word here otherwise this will never be True
                    if word=='>':
                        tagList.append(tagAppend)
                        break
                    # and here
                    tagAppend += word

    return tagList

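The key mistake was using character instead of word. Apart from that, I think it would work, although inefficiently.

We can also simplify it. There is no need for nested for loops.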

def getTag(htmlList):
    tagList=[]
    tag = ""
    for character in htmlList:
        if character == "<":
            tag = ""
        elif character == ">":
            tagList.append(tag)
        else:
            tag += character  # str has no append(); concatenate instead

    return tagList

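The version above has some serious problems, depending on the constraints on the input data. It may be worthwhile to think it through carefully and see whether you can find them.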
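
As a hint, consider input containing a '>' that has no matching '<', or a '<' that is never closed; a simple loop like this does not handle either case deliberately. The sketch below is one possible more defensive variant, not the only reasonable behaviour. The name `get_tags_strict`, and the decisions to ignore characters outside a tag and to discard unclosed tags, are assumptions made for this illustration.

```python
def get_tags_strict(html_chars):
    tags = []
    tag = None  # None means we are currently outside a tag
    for ch in html_chars:
        if ch == '<':
            tag = ''            # start (or restart) collecting a tag name
        elif ch == '>':
            if tag:             # skip a stray '>' and empty tags like '<>'
                tags.append(tag)
            tag = None          # we are outside a tag again
        elif tag is not None:
            tag += ch           # only collect characters inside '<...>'
    return tags

print(get_tags_strict(list('<p>hi><b>')))  # ['p', 'b']
print(get_tags_strict(list('a><i><u')))    # ['i']
```

Whether stray text should be ignored or reported as an error depends on what the input is allowed to contain, which is exactly the constraint worth pinning down first.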

We can also use built-ins such as split and join to great effect, as shown in another answer.
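
Coming back to the dedicated parsers mentioned at the start of this answer: for real HTML with attributes, closing tags, or comments, a parser is usually more robust than a character loop. The sketch below uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so that it runs without extra installs. It assumes Python 3, and the class name `TagCollector` is hypothetical.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; the parser passes the name lower-cased.
        self.tags.append(tag)

chars = list('<html><head><meta charset="utf-8">')
collector = TagCollector()
collector.feed(''.join(chars))
collector.close()
print(collector.tags)  # ['html', 'head', 'meta']
```

Unlike the loops above, this handles the attribute on `meta` without any extra work.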
