Scrape HTML Table in Python: Answers

【Question Title】: Scrape HTML Table in Python
【Posted】: 2020-02-25 17:07:59
【Question Description】:

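I'm trying to scrape the SEC report pages to get some basic information about a few stock tickers.

Here is an example URL for Apple - https://sec.report/CIK/0000320193

The page contains a "Company Details" table with the basic information. I basically just want to scrape the IRS number, the state of incorporation, and the address.

I'd be fine with simply scraping that table and saving it to a pandas DataFrame. I'm very new to web scraping, so I'm looking for some tips to get this done! My code is below, but once I extract the panel body, I don't know where to go. Thanks, all!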

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
# Use the session created above so the request actually goes through it
page = session.get('https://sec.report/CIK/0000051143.html', headers=headers)
page.content

soup = BeautifulSoup(page.content, 'html.parser')

# Returns a list of every element whose class is "panel-body"
soup.find_all(class_='panel-body')

【Question Comments】:

"Once I extract the panel body, I don't know where to go." That's pretty vague; can you be more specific?

Tags: python web web-scraping beautifulsoup scrape

【Solution 1】:

Try using the lxml package instead of BeautifulSoup; I find it easier to locate elements with XPath expressions:

import requests
from lxml import html

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
# Use the session created above so the request actually goes through it
page = session.get('https://sec.report/CIK/0000051143', headers=headers)

# Parse the response body into an element tree
raw_html = html.fromstring(page.text)

# Each XPath picks the row whose cell contains the label, then reads the second cell.
# Indexing with [0] raises IndexError if the XPath matches nothing.
irs = raw_html.xpath('//tr[./td[contains(text(),"IRS Number")]]/td[2]/text()')[0]

state_incorp = raw_html.xpath('//tr[./td[contains(text(),"State of Incorporation")]]/td[2]/text()')[0]

address = raw_html.xpath('//tr[./td[contains(text(),"Business Address")]]/td[2]/text()')[0]
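
Since the question asks for the whole table in a pandas DataFrame, here is a minimal sketch of one way to do that while staying with BeautifulSoup. The HTML string is a made-up stand-in for the "Company Details" panel (the real page's markup and values may differ), and `table_to_frame` is a hypothetical helper name, not part of any library.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the "Company Details" panel;
# the real page's structure and values are assumptions here.
SAMPLE_HTML = """
<div class="panel-body">
  <table>
    <tr><td>IRS Number</td><td>000000000</td></tr>
    <tr><td>State of Incorporation</td><td>XX</td></tr>
    <tr><td>Business Address</td><td>1 Example Road</td></tr>
  </table>
</div>
"""


def table_to_frame(html_text):
    """Collect every two-cell row found under a panel body into a DataFrame."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for panel in soup.find_all(class_="panel-body"):
        for tr in panel.find_all("tr"):
            cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
            if len(cells) == 2:  # keep only label/value rows
                rows.append(cells)
    return pd.DataFrame(rows, columns=["field", "value"])


details = table_to_frame(SAMPLE_HTML)
# Index by label so single values can be looked up by name
lookup = details.set_index("field")["value"]
print(lookup["IRS Number"])
```

With a live page, `page.content` from the request would be passed in place of `SAMPLE_HTML`. Whether the real rows have exactly two cells is an assumption to check against the actual HTML.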

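【Comments】:

Thank you! When I run the code, I get an IndexError: "list index out of range". Are you seeing the same thing?
That's because the XPath fails and doesn't find the element; the example above works for me.
That's odd. I wonder why the XPath fails for me but works for you.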
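
The IndexError reported in the comments comes from indexing `[0]` into an empty XPath result. Below is a hedged sketch of a more defensive lookup that returns `None` instead of raising. `first_text` is a hypothetical helper, and the sample markup is invented for illustration. Why the XPath matched nothing for one user is not established in this thread; one possibility (an assumption) is that the server returned a different page, so calling `page.raise_for_status()` and inspecting `page.text` before parsing may help narrow it down.

```python
from lxml import html

# Invented markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<html><body>
  <table>
    <tr><td>IRS Number</td><td>000000000</td></tr>
    <tr><td>State of Incorporation</td><td>XX</td></tr>
  </table>
</body></html>
"""


def first_text(tree, label):
    """Return the value cell of the row whose cell contains `label`, or None."""
    # $label is an XPath variable, filled in from the keyword argument below
    xpath = '//tr[./td[contains(text(), $label)]]/td[2]/text()'
    matches = tree.xpath(xpath, label=label)
    return matches[0].strip() if matches else None


tree = html.fromstring(SAMPLE_HTML)
irs = first_text(tree, "IRS Number")
address = first_text(tree, "Business Address")  # no such row in the sample
print(irs, address)
```

A `None` result then signals a missing row explicitly, which is easier to debug than an IndexError raised in the middle of the script.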