【Question Title】: Beautiful Soup, XML to Pandas DataFrame
【Posted】: 2021-08-08 15:46:51
【Question】:

I'm a machine-learning beginner exploring a dataset for my NLP project. I got the data from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html. I'm trying to create a pandas DataFrame by parsing the XML data, and I also want to add a label (1) to each positive review. Could someone help me with the code? Sample output is shown below:

from bs4 import BeautifulSoup
positive_reviews = BeautifulSoup(open('/content/drive/MyDrive/sorted_data_acl/electronics/positive.review', encoding='utf-8').read())
positive_reviews = positive_reviews.findAll('review_text')
positive_reviews[0]



<review_text>
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.

I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

As always, Amazon had it to me in &lt;2 business days
</review_text>
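To make the goal concrete, here is a minimal sketch of the kind of DataFrame the question asks for, with the label 1 attached to positive reviews. The sample text below is a stand-in for the dataset's pseudo-XML files (normally read from e.g. `electronics/positive.review` with `encoding='utf-8'`):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for the raw contents of a positive.review file,
# which contains a flat sequence of <review_text> elements.
raw = """
<review_text>
Great battery life and easy to set up.
</review_text>
<review_text>
Arrived quickly and works as advertised.
</review_text>
"""

soup = BeautifulSoup(raw, 'html.parser')

# One row per review; label 1 marks every review from the positive file.
df = pd.DataFrame(
    [{'review_text': r.get_text().strip(), 'label': 1}
     for r in soup.find_all('review_text')]
)
```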

【Question Discussion】:

Tags: python pandas beautifulsoup nlp


    【Solution 1】:
    • The main problem is that the reviews are pseudo-XML
    • Download the tar.gz file and extract it
    • Build a dictionary of all the files
    • Workaround for the pseudo-XML: insert a document element into the string representation of each document
    • It is then a simple case of using list/dict comprehensions to generate the pandas constructor format
    • dfs is a dictionary of DataFrames ready to use
    import requests
    from pathlib import Path
    from tarfile import TarFile
    from bs4 import BeautifulSoup
    import io
    import pandas as pd
    
    # download tar with pseudo XML...
    url = "http://www.cs.jhu.edu/%7Emdredze/datasets/sentiment/domain_sentiment_data.tar.gz"
    fn = Path.cwd().joinpath(url.split("/")[-1])
    if not fn.exists():
        r = requests.get(url, stream=True)
        with open(fn, 'wb') as f:
            for chunk in r.raw.stream(1024, decode_content=False):
                if chunk:
                    f.write(chunk)
    
    # untar downloaded file and generate a dictionary of all files
    TarFile.open(fn, "r:gz").extractall()
    files = {f"{p.parent.name}/{p.name}":p for p in Path.cwd().joinpath("sorted_data_acl").glob("**/*") if p.is_file()}
    
    # convert all files into dataframes in a dict
    dfs = {}
    for file in files.keys():
        with open(files[file]) as f: text = f.read()
        # pseudo XML: the lack of a root element stops it from being well formed
        # force one in...
        soup = BeautifulSoup(f"<root>{text}</root>", "xml")
        # simple case of each review is a row and each child element is a column
        dfs[file] = pd.DataFrame([{c.name:c.text.strip("\n") for c in r.children if c.name} for r in soup.find_all("review")])
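The solution builds the DataFrames but does not add the sentiment label the question asked for. One hedged follow-up, assuming the dataset's file naming (positive.review / negative.review), is to derive the label from each key in `dfs` and concatenate. The toy `dfs` below stands in for the dict built above:

```python
import pandas as pd

# Stand-in for the dfs dict built above; real keys look like
# "electronics/positive.review".
dfs = {
    "electronics/positive.review": pd.DataFrame({"review_text": ["good"]}),
    "electronics/negative.review": pd.DataFrame({"review_text": ["bad"]}),
}

# Derive label 1/0 from the file name, then combine into one frame.
labelled = []
for name, df in dfs.items():
    if name.endswith("positive.review"):
        labelled.append(df.assign(label=1))
    elif name.endswith("negative.review"):
        labelled.append(df.assign(label=0))

reviews = pd.concat(labelled, ignore_index=True)
```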
    

    【Discussion】:

    • Thank you so much, it works! Even though the code is a bit complex for me :D, since I'm not yet familiar with the pathlib module. But thanks a lot for your time. :)
    • The key part of the solution is inserting a document element so the XML is well formed. Using pathlib just lets the computer do the work of managing the files :-)