Making a pandas dataframe out of HTML

【Question title】: Making a pandas dataframe out of HTML
【Posted】: 2021-09-05 21:17:06
【Question】:

I am trying to convert Word documents (.docx) into a dataframe. The docx files are first converted to HTML using the following:

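import os
import re

import mammoth

# Fill in the working directory and file name (values kept as given in the question)
os.chdir('C:jan_2021')  # os.chdir returns None, so its result is not assigned
filename = "newsupdatedocx"
regex = '\xc2\xb7'  # defined in the question but not used below

with open(filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    text = result.value  # The generated HTML
    text2 = re.sub('[|•●]', " ", text)  # Replace pipe and bullet characters with spaces
    with open('output.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(text2)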

This gives the following HTML output:

print(prettify)
<html>
 <body>
  <p>
   Newsupdate of date 01-01-2021
  </p>
  <h1>
   Header - worldwide news - category nr 1.
  </h1>
  <h2>
   Header - title of article nr. 1
  </h2>
  <p>
   Source: economist, google, NYTimes
  </p>
  <ul>
   <li>
    First bullet point related to article 1
   </li>
   <li>
    Second bullet point
   </li>
  </ul>
 </body>
</html>

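As you can see, converting the document to HTML gives it a structure that can be analysed accordingly. Now I want to convert this into a dataframe. I want to build a list of all the elements so that I can iterate over it and check whether each item is an <h1> element, an <h2> element, or a list item. Ultimately I want a dataframe with the following columns: date, news type, article title, the sources listed before the list items, and finally the items (bullet points).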

The first step is to get all the elements into a list, so is there a function that can convert this HTML into a list?

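【Comments】:

You can use beautifulsoup to convert an HTML table into a pandas dataframe.
The problem is that, unfortunately, it does not find any tables. The Word file is not in table format; it has a simple top-down structure without any columns.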

Tags: python html pandas

【Solution 1】:

I found the answer, using the following code:

tags = soup.find_all(['p', 'h1', 'h2', 'li'])
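
The line above presumably relies on `soup` being a BeautifulSoup object built from the converted HTML, which the answer does not show. Below is a minimal sketch of how the whole step might look, from parsing to a dataframe with the columns described in the question. The sample HTML is copied from the question's output; the `tags_to_rows` helper, the column names, the `Source:` prefix check and the date regex are illustrative assumptions, not part of the original answer.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Sample HTML mirroring the structure shown in the question.
html = """
<html><body>
<p>Newsupdate of date 01-01-2021</p>
<h1>Header - worldwide news - category nr 1.</h1>
<h2>Header - title of article nr. 1</h2>
<p>Source: economist, google, NYTimes</p>
<ul>
<li>First bullet point related to article 1</li>
<li>Second bullet point</li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Passing a list of names to find_all returns the matching tags in document order.
tags = soup.find_all(["p", "h1", "h2", "li"])


def tags_to_rows(tags):
    """Hypothetical helper: walk the tags top to bottom, one row per bullet point."""
    rows = []
    date = category = title = source = None
    for tag in tags:
        text = tag.get_text(strip=True)
        if tag.name == "h1":
            # A new news category resets the article-level fields.
            category = text
            title = source = None
        elif tag.name == "h2":
            title = text
            source = None
        elif tag.name == "p":
            if text.startswith("Source:"):
                source = text[len("Source:"):].strip()
            else:
                # Assumed date format dd-mm-yyyy, as in the sample output.
                match = re.search(r"\d{2}-\d{2}-\d{4}", text)
                if match:
                    date = match.group(0)
        elif tag.name == "li":
            rows.append({
                "date": date,
                "news_type": category,
                "article_title": title,
                "source": source,
                "item": text,
            })
    return rows


df = pd.DataFrame(tags_to_rows(tags))
print(df)
```

Each `<li>` becomes one row and inherits the most recent date, `<h1>`, `<h2>` and source seen above it, which matches the top-down layout of the document. If the real files use a different date format or source label, the two checks in the `p` branch would need adjusting.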

【Comments】: