【Posted】:2019-04-23 05:43:31
【Question】:
I am trying to scrape historical weather data from the Weather Underground page "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html". I have the following code:
import pandas as pd
page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)
It produces the following error:
Traceback (most recent call last):
  File "weather_station_scrapping.py", line 11, in <module>
    result = pd.read_html(page_link)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
    displayed_only=displayed_only)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
    raise_with_traceback(retained)
  File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No tables found
The page clearly contains a table, yet read_html does not pick it up. I then tried Selenium, so that the page is loaded in a browser before I read it:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")
head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')
list_rows = []
for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()
Now the problem is that it cannot find the 'tr' elements. I would appreciate any suggestions.
【Comments】:
-
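The table does not exist in the page's initial HTML; it is loaded asynchronously after the rest of the page. Pandas does not wait for the page to load JavaScript-rendered content. You will probably need some kind of browser automation (such as Selenium) to load the page before trying to parse it.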
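To illustrate the point of this comment, here is a minimal offline sketch. Both HTML strings are hypothetical stand-ins (not copied from the real Weather Underground page): the first mimics what the server sends before any script runs, the second mimics the DOM after the table has been rendered. The helper name `count_tables` is also made up for this example.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup for illustration; not copied from the real page.
# What the server sends: an empty placeholder that a script fills in later.
INITIAL_HTML = """
<html><body>
  <div id="history_table_container"></div>
  <script src="app.js"></script>
</body></html>
"""

# What the DOM could look like once the script has run in a browser.
RENDERED_HTML = """
<html><body>
  <table id="history_table">
    <thead><tr><th>Time</th><th>Temperature</th></tr></thead>
    <tbody><tr><td>12:04 AM</td><td>31.2</td></tr></tbody>
  </table>
</body></html>
"""


def count_tables(html: str) -> int:
    """Return how many tables read_html finds, or 0 if it finds none."""
    try:
        return len(pd.read_html(StringIO(html)))
    except ValueError:  # pandas raises "No tables found" in this case
        return 0


print(count_tables(INITIAL_HTML))
print(count_tables(RENDERED_HTML))
```

Passing a URL to read_html only ever sees the first kind of document, which would explain the ValueError in the question even though the table is visible in a browser.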
-
Hi, I have tried using Selenium but I am still running into problems. If possible, would you mind looking at my edit and suggesting anything?
-
Use a different selector:
df=pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]
See my answer posted below.
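One way to combine this one-liner with an explicit wait is sketched below. It is only a sketch under assumptions: the element id `history_table` comes from the question and may no longer match the live page, the 30-second timeout is arbitrary, and the function names `table_html_to_dataframe` and `fetch_history_table` are made up for this example. It uses the Selenium 4 locator style (`find_element(By.ID, ...)`), since the `find_element_by_*` helpers used above were removed in later Selenium releases.

```python
from io import StringIO

import pandas as pd


def table_html_to_dataframe(table_html: str) -> pd.DataFrame:
    """Parse the outerHTML of a single <table> element into a DataFrame."""
    return pd.read_html(StringIO(table_html))[0]


def fetch_history_table(url: str, timeout: float = 30) -> pd.DataFrame:
    """Load the page in Firefox, wait for table rows, then parse the table."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until at least one body row is present in the DOM;
        # a TimeoutException is raised if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#history_table tbody tr")
            )
        )
        table = driver.find_element(By.ID, "history_table")
        return table_html_to_dataframe(table.get_attribute("outerHTML"))
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

With the `page_link` variable from the question, usage would be `df = fetch_history_table(page_link)`. Handing the rendered table to pandas as a whole avoids looping over `tr` and `td` elements by hand.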
Tags: python html pandas parsing web-scraping