How to store the HTML within an opening and closing tag with Python: answers

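【Question title】: How to store the HTML within an opening and closing tag with Python
【Posted】: 2018-01-28 06:15:40
【Question】:

I am reading an HTML document and want to store the HTML nested inside a div tag with a certain name, while preserving its structure (spacing). The goal is to convert an HTML document into React components. I am struggling with how to store the structure of the nested HTML and how to find the correct closing tag for the div, which marks that everything nested inside it becomes a React component (div class='rc-componentname' is the opening tag). Any help would be greatly appreciated. Thanks!

Edit: I think regular expressions are the best way to solve this. I have not used regex before, so if that is correct, could someone point me in the right direction on which expression to use in this case?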

import os

components = []

class react_template():
    def __init__(self, component_name): # add nested html as second element
        self.Import = "import React, { Component } from 'react';"
        self.Class = "class " + component_name + " extends Component {"
        self.Render = "render() {"
        self.Return = "return "  # nested HTML and the closing braces still need to be appended
        self.Export = "export default " + component_name + ";"

def react(component):
    r = react_template(component)

    component_dir = os.path.join('components', component)
    os.makedirs(component_dir, exist_ok=True) # create components/<component> folder

    # write by path instead of os.chdir, so repeated calls do not nest folders
    with open(os.path.join(component_dir, component + '.js'), 'w', encoding='utf-8') as f: # create js component file
        for j_code in r.__dict__.values():
            f.write(j_code + '\n')


def process_html():
    with open('file.html', 'r') as f:
        for line in f:
            if 'rc-' in line:
                char_soup = list(line)
                for index, char in enumerate(char_soup):
                    if line[index:index+3] == 'rc-': # slicing cannot run past the end of the line
                        sliced_soup = char_soup[index+3:]
                        c_slice_index = sliced_soup.index("\'")
                        component = "".join(sliced_soup[:c_slice_index])
                        components.append(component)
                        innerHTML(sliced_soup)
                        # react(component)

def innerHTML(sliced_soup): # work in progress
    first_closing = sliced_soup.index(">")
    sliced_soup = "".join(sliced_soup[first_closing:]).split(" ")


def generate_components(components):
    for c in components:
        react(c)


if __name__ == "__main__":
    process_html()

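【Question comments】:

Tags: python parsing html-parsing

【Solution 1】:

I see you use the word soup in your code... maybe you have already tried BeautifulSoup and did not like it? If you have not tried it yet, I would recommend looking at BeautifulSoup instead of trying to parse HTML with regex. Although regex can be sufficient for a single tag or even a handful of tags, markup languages are deceptively simple: once tags nest, a pattern can no longer reliably pair an opening tag with its matching closing tag. BeautifulSoup is a great library that makes working with markup much easier.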

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

This will let you treat the entire HTML document as a single object and enables you to:

# create a list of specific elements as objects
soup.find_all('div')

# find a specific element by id
soup.find(id="custom-header")
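
As a minimal sketch of how this could apply to the `rc-` divs from the question, assuming the `bs4` package is installed: the sample markup and the helper name `extract_components` are made up for illustration and are not from the original post. `decode_contents()` returns the markup between a tag's opening tag and its matching closing tag as a string.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample input; the real file.html from the question is not shown.
html = """<body>
<div class='rc-header'>
  <h1>Title</h1>
  <div>
    <p>nested</p>
  </div>
</div>
<div class='other'>ignored</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")


def extract_components(soup):
    """Map each component name to the inner HTML of its rc-* div."""
    components = {}
    for div in soup.find_all("div", class_=re.compile(r"^rc-")):
        # A tag's class attribute is a list of names; pick the rc-* entry
        # and strip the prefix to get the component name.
        rc_class = next(c for c in div["class"] if c.startswith("rc-"))
        name = rc_class[len("rc-"):]
        # Serialize everything between the opening tag and its matching
        # closing tag, nested elements included.
        components[name] = div.decode_contents()
    return components


components = extract_components(soup)
print(components["header"])
```

Because the parser tracks nesting, the inner `<div>` does not end the component early, which is the part that is hard to get right with a regular expression. With the `html.parser` backend the whitespace between tags should come back as it appears in the input, but that is worth checking against your own files.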

【Comments】:

Great suggestion. I have not tried bs4 yet, but it seems to match what I named my variables, soup, haha. Thanks!