How to scrape data from an HTML table in Python (answers)

[Question title]: How to scrape data from html table in python
[Posted]: 2017-08-09 09:12:41
[Question description]:

<tr class="even">
<td><strong><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'>DEED<br/>
2016002023</a></strong></td>
<td><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'><b> Recording Date: </b>01/12/2016 08:05:17 AM&nbsp;&nbsp;&nbsp;<b>Book Page: </b> <table cellspacing=0 width="100%"><tr><td width="50%"  valign="top"><b>Grantor:</b> ARELLANO ISAIAS</td><td width="50%"  valign="top"><b>Grantee:</b> ARELLANO ISAIAS, ARELLANO ALICIA</td></tr></table>
<b>Number Pages:</b> 3<br></a></td>
<td></td>
<td></td></tr>

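I am new to Python and scraping. Please help me scrape the data from this table. To log in, go to the public login, then enter the "from" and "to" dates.

Data model: the data model has columns in this exact order and case: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column will be either "Grantor" or "Grantee", depending on where the name is assigned. If the grantor or grantee has multiple names, give each name its own row and copy the record date, doc number, doc type, role, and apn.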

https://crarecords.sonomacounty.ca.gov/recorder/eagleweb/docSearchResults.jsp?searchId=0

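[Question comments]:

I want to extract these things. Data model: the data model has columns in this exact order and case: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column will be either "Grantor" or "Grantee", depending on where the name is assigned. If the grantor or grantee has multiple names, give each name its own row and copy the record date, doc number, doc type, role, and apn. If you have questions about how to structure the csv result, please ask me.
This looks like a secured site that requires credentials; all I get is "You must be logged in to access the requested page". Can you copy the html table into your question?
OK, wait while I take a screenshot
I pasted the code

Tags: python html python-3.x web-scraping beautifulsoup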

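[Solution 1]:

I know this is an old question, but an underrated trick for this kind of task is pandas' read_clipboard function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html

As far as I know it hands the clipboard text to read_csv rather than parsing any HTML itself, and for simple cases the interface is very easy to use. Consider this short script: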

# 1. Go to a website, e.g. https://www.wunderground.com/hurricane/hurrarchive.asp?region=ep
# 2. Highlight the table of data, e.g. of Hurricanes in the East Pacific
# 3. Copy the text from your browser
# 4. Run this script: the data will be available as a dataframe
import pandas as pd
df = pd.read_clipboard()
print(df)

Of course, this solution requires user interaction, but I have often found it useful when there is no convenient CSV download or API endpoint.
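To see what happens to the copied text without needing a clipboard, the same parsing step can be reproduced by hand. This is only a sketch: it assumes the browser copies the table as tab-separated text, and the sample rows below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical text, shaped like a table copied from a browser
# (tab-separated columns, one row per line)
copied = "Name\tYear\tMax Wind\nAlpha\t2015\t90\nBeta\t2016\t120\n"

# read_clipboard passes the clipboard text on to read_csv;
# here the text is fed to read_csv directly
df = pd.read_csv(io.StringIO(copied), sep="\t")
print(df)
```

If a real table comes out misaligned, passing an explicit sep to read_clipboard is worth trying, since keyword arguments are forwarded to read_csv.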

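[Comments]:

[Solution 2]:

The HTML you posted does not contain all of the column fields listed in your data model. For the fields it does contain, though, the following builds a Python dictionary from which you can fill in the data model's fields: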

import urllib.request
from bs4 import BeautifulSoup

url = "the_url_of_webpage_to_scrape" # Replace with the URL of your webpage

# Download the page
with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

# First result row (despite the name, this is a <tr>, not the whole table)
table = soup.find("tr", attrs={"class":"even"})

# Labels from the <b> tags, e.g. "Recording Date", "Grantor", "Grantee"
btags = [str(b.text).strip().strip(':') for b in table.find_all("b")]

# Text that follows each label, with non-breaking spaces removed
bsibs = [str(b.next_sibling.replace(u'\xa0', '')).strip() for b in table.find_all('b')]

data = dict(zip(btags, bsibs))

data_model = {"record_date": None, "doc_number": None, "doc_type": None, "role": None, "name": None, "apn": None, "transfer_amount": None, "county": None, "state": None}

data_model["record_date"] = data['Recording Date']
# Note: this stores the grantee names under 'role'; in the data model they
# belong under 'name', with 'role' set to "Grantee"
data_model['role'] = data['Grantee']

print(data_model)

Output:

{'apn': None,
 'county': None,
 'doc_number': None,
 'doc_type': None,
 'name': None,
 'record_date': '01/12/2016 08:05:17 AM',
 'role': 'ARELLANO ISAIAS, ARELLANO ALICIA',
 'state': None,
 'transfer_amount': None}
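The doc_number and doc_type fields stay empty in that output, although the first cell of the posted row appears to hold both: the link text reads "DEED", a line break, then "2016002023". A minimal sketch of pulling them out, assuming that layout (type first, number second) holds for every row; the names cell_html, doc_type and doc_number are only illustrative:

```python
from bs4 import BeautifulSoup

# First cell of the row posted in the question
cell_html = """<td><strong><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'>DEED<br/>
2016002023</a></strong></td>"""

soup = BeautifulSoup(cell_html, "html.parser")
link = soup.find("strong").find("a")

# stripped_strings yields the text pieces on either side of the <br/>,
# with surrounding whitespace removed
doc_type, doc_number = list(link.stripped_strings)

print(doc_type)
print(doc_number)
```

The tuple unpacking fails loudly if a cell does not contain exactly two text pieces, which is probably preferable to silently shifting values into the wrong column.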

You can then read individual fields like this:

print(data_model['record_date']) # 01/12/2016 08:05:17 AM
print(data_model['role'])        # ARELLANO ISAIAS, ARELLANO ALICIA
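The question also asks for one row per name, with the role column set to "Grantor" or "Grantee". One possible way to get there from the extracted strings is sketched below. It assumes names are separated by commas, as in the sample row, and that the doc type and number have already been read from the row's first cell; expand_rows is a hypothetical helper, and the fields that the posted HTML does not show are left empty.

```python
import csv
import io

# Column order required by the data model in the question
COLUMNS = ["record_date", "doc_number", "doc_type", "role", "name",
           "apn", "transfer_amount", "county", "state"]


def expand_rows(record_date, doc_number, doc_type, grantors, grantees):
    """Return one data-model row per grantor and grantee name."""
    rows = []
    for role, names in (("Grantor", grantors), ("Grantee", grantees)):
        for name in names.split(","):
            name = name.strip()
            if not name:
                continue
            rows.append({
                "record_date": record_date,
                "doc_number": doc_number,
                "doc_type": doc_type,
                "role": role,
                "name": name,
                # Not present in the posted HTML, so left empty here
                "apn": "", "transfer_amount": "", "county": "", "state": "",
            })
    return rows


rows = expand_rows("01/12/2016 08:05:17 AM", "2016002023", "DEED",
                   "ARELLANO ISAIAS", "ARELLANO ISAIAS, ARELLANO ALICIA")

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="")
# to produce a file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

For the sample row this gives three rows: one Grantor row and two Grantee rows, all sharing the same record date, doc number and doc type.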

Hope this helps.

[Comments]: