【Question Title】: Beautiful Soup, XML to Pandas DataFrame
【Posted】: 2021-08-08 15:46:51
【Question】:

I'm a machine-learning beginner exploring a dataset for my NLP project. I got the data from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html. I'm trying to create a pandas DataFrame by parsing the XML data, and I also want to add a label (1) to each positive review. Could someone help me with the code? Sample output is shown below:

from bs4 import BeautifulSoup
positive_reviews = BeautifulSoup(open('/content/drive/MyDrive/sorted_data_acl/electronics/positive.review', encoding='utf-8').read())
positive_reviews = positive_reviews.findAll('review_text')
positive_reviews[0]



<review_text>
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.

I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

As always, Amazon had it to me in &lt;2 business days
</review_text>
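To make the goal concrete, here is a minimal sketch of the kind of DataFrame the question asks for, with the label 1 attached to positive reviews. The sample text below is a stand-in for the dataset's pseudo-XML files (normally read from e.g. `electronics/positive.review` with `encoding='utf-8'`):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the raw contents of a positive.review file,
# which contains a flat sequence of <review_text> elements.
raw = """
<review_text>
Great battery life and easy to set up.
</review_text>
<review_text>
Arrived quickly and works as advertised.
</review_text>
"""

soup = BeautifulSoup(raw, 'html.parser')

# One row per review; label 1 marks every review from the positive file.
df = pd.DataFrame(
    [{'review_text': r.get_text().strip(), 'label': 1}
     for r in soup.find_all('review_text')]
)
```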

【Question Discussion】:

Tags: python pandas beautifulsoup nlp


    【Solution 1】:
    • The main problem is that the reviews are pseudo-XML
    • Download the tar.gz file and extract it
    • Build a dictionary of all the files
    • Workaround for the pseudo-XML: insert a document element into the string representation of each document
    • It is then a simple case of using list/dict comprehensions to generate the pandas constructor format
    • dfs is a dictionary of DataFrames ready to use
    import requests
    from pathlib import Path
    from tarfile import TarFile
    from bs4 import BeautifulSoup
    import io
    import pandas as pd
    
    # download tar with pseudo XML...
    url = "http://www.cs.jhu.edu/%7Emdredze/datasets/sentiment/domain_sentiment_data.tar.gz"
    fn = Path.cwd().joinpath(url.split("/")[-1])
    if not fn.exists():
        r = requests.get(url, stream=True)
        with open(fn, 'wb') as f:
            for chunk in r.raw.stream(1024, decode_content=False):
                if chunk:
                    f.write(chunk)
    
    # untar downloaded file and generate a dictionary of all files
    TarFile.open(fn, "r:gz").extractall()
    files = {f"{p.parent.name}/{p.name}":p for p in Path.cwd().joinpath("sorted_data_acl").glob("**/*") if p.is_file()}
    
    # convert all files into dataframes in a dict
    dfs = {}
    for file in files.keys():
        with open(files[file]) as f: text = f.read()
        # pseudo XML: the lack of a root element stops it from being well formed
        # force one in...
        soup = BeautifulSoup(f"<root>{text}</root>", "xml")
        # simple case of each review is a row and each child element is a column
        dfs[file] = pd.DataFrame([{c.name:c.text.strip("\n") for c in r.children if c.name} for r in soup.find_all("review")])
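The solution builds the DataFrames but does not add the sentiment label the question asked for. One hedged follow-up, assuming the dataset's file naming (positive.review / negative.review), is to derive the label from each key in `dfs` and concatenate. The toy `dfs` below stands in for the dict built above:

```python
import pandas as pd

# Stand-in for the dfs dict built above; real keys look like
# "electronics/positive.review".
dfs = {
    "electronics/positive.review": pd.DataFrame({"review_text": ["good"]}),
    "electronics/negative.review": pd.DataFrame({"review_text": ["bad"]}),
}

# Derive label 1/0 from the file name, then combine into one frame.
labelled = []
for name, df in dfs.items():
    if name.endswith("positive.review"):
        labelled.append(df.assign(label=1))
    elif name.endswith("negative.review"):
        labelled.append(df.assign(label=0))

reviews = pd.concat(labelled, ignore_index=True)
```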
    

    【Discussion】:

    • Thank you so much, it works! Even though the code is a bit complex for me :D, since I'm not yet familiar with the pathlib module. But thanks a lot for your time. :)
    • The key part of the solution is inserting a document element so the XML is well formed. Using pathlib just lets the computer do the work of managing the files :-)