【Question title】: Python web scraping unstructured table
【Posted】: 2020-11-04 01:49:25
【Question】:

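I'm trying to extract some information from a table on a web page, but the table is unstructured: the headers sit in the rows, with each value in the cell next to its header, like this (I apologize for not disclosing the web page):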

<table class="table-detail">
            <tbody>
                <tr>
                    <td colspan="4" class="noborder">General Information
                    </td>
                </tr>
                <tr>
                    <th>Full name</th>
                    <td>
                        James Smith
                    </td>
                    <th>Year of birth</th>
                    <td>1992</td>
                </tr>
                <tr>
                    <th>Gender</th>
                    <td>Male</td>
                </tr>
                <tr>
                    <th>Place of birth</th>
                    <td>TTexas, USA</td>
                    <td>&nbsp;</td>
                    <td>&nbsp;</td>
                </tr>
                <tr>
                    <th>Address</th>
                    <td>Texas, USA</td>
                    <td>&nbsp;</td>
                    <td></td>
                </tr>

Currently, I can extract the table with this script:

import pandas as pd
import requests

url = "https://example.com"  # placeholder (the real URL was not disclosed); requests needs the scheme

r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()

df.to_csv('myfile.csv',encoding='utf-8-sig')

The table basically looks like this:

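However, I'm a bit confused about how to implement this in Python. I can't seem to wrap my head around getting the data. The result I want looks like this: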

Any help would be greatly appreciated. Thank you very much.

【Question discussion】:

    Tags: html python-3.x pandas web-scraping python-requests


    【Solution 1】:

    You can use BeautifulSoup to parse the HTML. For example:

    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    txt = '''<table class="table-detail">
                <tbody>
                    <tr>
                        <td colspan="4" class="noborder">General Information
                        </td>
                    </tr>
                    <tr>
                        <th>Full name</th>
                        <td>
                            James Smith
                        </td>
                        <th>Year of birth</th>
                        <td>1992</td>
                    </tr>
                    <tr>
                        <th>Gender</th>
                        <td>Male</td>
                    </tr>
                    <tr>
                        <th>Place of birth</th>
                        <td>TTexas, USA</td>
                        <td>&nbsp;</td>
                        <td>&nbsp;</td>
                    </tr>
                    <tr>
                        <th>Address</th>
                        <td>Texas, USA</td>
                        <td>&nbsp;</td>
                        <td></td>
                    </tr>'''
    
    
    soup = BeautifulSoup(txt, 'html.parser')
    
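    # Build one record: every <th> that is immediately followed by a <td>
    # contributes a header -> value pair (the title cell has no <th>, so it is skipped).
    row = {}
    for h in soup.select('th:has(+td)'):
        row[h.text] = h.find_next('td').get_text(strip=True)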
    
    df = pd.DataFrame([row])
    print(df)
    

    Prints:

         Full name Year of birth Gender Place of birth     Address
    0  James Smith          1992   Male    TTexas, USA  Texas, USA
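
If the real page holds more than one table of this kind, the same idea can be wrapped in a function that returns one record per table. The following is only a sketch under that assumption: the function name `detail_tables_to_rows` and the shortened `sample_html` are made up for illustration, and the `table-detail` class is taken from the HTML in the question. It uses the plain `th + td` adjacent-sibling selector and looks back to the header from each value cell, which amounts to the same pairing as above.

```python
from bs4 import BeautifulSoup


def detail_tables_to_rows(html):
    """Return one {header: value} dict per <table class="table-detail">."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.select("table.table-detail"):
        row = {}
        # "th + td" matches only a <td> that immediately follows a <th>,
        # so the title cell and the empty padding cells are skipped.
        for td in table.select("th + td"):
            th = td.find_previous_sibling("th")
            row[th.get_text(strip=True)] = td.get_text(strip=True)
        rows.append(row)
    return rows


# Shortened, hypothetical sample modelled on the table in the question.
sample_html = """
<table class="table-detail">
  <tr><td colspan="4" class="noborder">General Information</td></tr>
  <tr><th>Full name</th><td> James Smith </td><th>Year of birth</th><td>1992</td></tr>
  <tr><th>Gender</th><td>Male</td></tr>
  <tr><th>Address</th><td>Texas, USA</td><td>&nbsp;</td><td></td></tr>
</table>
"""

rows = detail_tables_to_rows(sample_html)
print(rows)
```

For a live page, the text returned by `requests.get(url).text` could be passed in place of `sample_html`, and `pd.DataFrame(rows)` would then give one DataFrame row per table, ready for `to_csv`.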
    

    【Discussion】:

    • I had been struggling with this problem. Your code works like a charm. Thank you so much for your help!