How to read different tables from an HTML page using Python? Answers

【Question title】: How to read different tables from a HTML page using python?
【Posted】: 2021-07-28 19:47:08
【Question description】:

I am reading a table from the following HTML page:

http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=1040645

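I can read the table on the first page with the code below, but the results continue over several pages. How can I read the tables on the following pages as well? I want to pull all the records from the table, however many pages there are.

Here is my attempt: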

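import requests
import pandas as pd
from io import StringIO

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=1040645'
html = requests.get(url).text
# read_html returns one DataFrame per <table> on the page;
# wrapping the markup in StringIO avoids the literal-HTML deprecation warning
df_list = pd.read_html(StringIO(html), header=0)
df = df_list[3]

df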

Any help is appreciated. Thanks.

【Question discussion】:

Tags: python html web-scraping datatables

【Solution 1】:

Try:

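import requests
import pandas as pd
from io import StringIO

# allcount appears to be the index of the first record shown on a page
url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=1040645&allcount={}"
page = 1
out = []
while True:
    try:
        t = requests.get(url.format(page), timeout=1).text
        # the records are in the fourth table on the page
        df = pd.read_html(StringIO(t))[3]
        print("Page:", page)
        # row 0 is the header row (read_html was called without header=0), so drop it
        df = df.loc[1:, :]
        if len(df) == 0:
            # no records left: we are past the last page
            break
        out.append(df)
        page += 25  # move to the first record of the next page
    except requests.exceptions.ReadTimeout:
        # the request timed out: retry the same offset
        continue

df = pd.concat(out)
df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
print(df)
df.to_csv("data.csv", index=False)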

Prints:

                          NUMBER NUMBER                                   TYPE   FILE DATE
1                    ALT 867-90*    NaN                             ALTERATION  00/00/0000
2                    ALT 1078-83    NaN                             ALTERATION  00/00/0000
3                    ALT 2164-33    NaN                             ALTERATION  00/00/1933
4                    ALT 1307-67    NaN                             ALTERATION  00/00/1967
5                    ALT 1307-67    NaN                             ALTERATION  00/00/1967
6                     ALT 853-68    NaN                             ALTERATION  00/00/1968
7                    ALT 312-71P    NaN                             ALTERATION  00/00/1971

...

It also saves the data to data.csv (shown in a LibreOffice screenshot in the original answer).
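The loop in this answer retries the same request indefinitely whenever it hits a read timeout, so it never finishes if the server stops responding. A minimal sketch of a bounded retry is shown here. The helper name fetch_with_retries and its parameters are hypothetical, not part of the answer, and the fetch function is supplied by the caller so the helper does not depend on any particular HTTP library.

```python
import time
from typing import Callable, Optional, Tuple, Type


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> str:
    """Call fetch(url), retrying on the listed exceptions, at most max_attempts times."""
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except retry_on as exc:
            last_error = exc
            # Pause before the next attempt, but not after the final one.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

With requests, one way to use it would be fetch_with_retries(lambda u: requests.get(u, timeout=10).text, url.format(page), retry_on=(requests.exceptions.ReadTimeout,)). The retry_on argument has to be passed explicitly in that case, because the requests timeout exceptions do not derive from the built-in TimeoutError.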

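【Discussion】:

Works well, as expected. Thanks for your help. Just one quick question: what does page += 25 actually do?
@TahsinAlam It increments the page counter to advance the URL to the next page...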
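To make the effect of page += 25 concrete, the sketch below generates the URLs the loop would request. It assumes, as the loop in Solution 1 implies, that allcount is the 1-based index of the first record on a page and that each page holds 25 records. The names page_urls and PAGE_SIZE are hypothetical.

```python
from itertools import count, islice

BASE_URL = (
    "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet"
    "?requestid=1&allbin=1040645&allcount={}"
)
PAGE_SIZE = 25  # assumption: inferred from the `page += 25` step in the answer


def page_urls(base_url: str = BASE_URL, page_size: int = PAGE_SIZE):
    """Yield URLs for offsets 1, 1 + page_size, 1 + 2 * page_size, ... without end."""
    for offset in count(start=1, step=page_size):
        yield base_url.format(offset)


# The generator is infinite, so take only the first three URLs for illustration.
first_three = list(islice(page_urls(), 3))
for u in first_three:
    print(u)
```

The three URLs end in allcount=1, allcount=26 and allcount=51, so each request starts 25 records after the previous one. The generator never stops on its own; the caller still has to stop when a page comes back with no records, as the answer's loop does.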

【Solution 2】:

A Selenium answer.

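from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url_base = r'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=1040645'

# LINK_TO_YOUR_CHROME_DRIVER is a placeholder for the path to your chromedriver binary
# (Selenium 4 takes the path through a Service object)
driver = webdriver.Chrome(service=Service(LINK_TO_YOUR_CHROME_DRIVER))

driver.get(url_base)

output = []

while True:
    # parse the page currently loaded in the browser; the records are in the fourth table
    df_temp = pd.read_html(StringIO(driver.page_source), header=0)[3]
    output.append(df_temp)

    # find_elements returns an empty list when the page has no "next" button
    next_page = driver.find_elements(By.NAME, 'next')
    if len(next_page) < 1:
        print("Complete")
        break
    else:
        next_page[0].click()

output = pd.concat(output)

driver.quit()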

【Discussion】: