【Title】: Can I scrape a table from an HTML file in Python?
【Posted】: 2020-09-03 05:56:26
【Question】:

I want to scrape a table from this text file text_file; the table I want is SUMMARY CONSOLIDATED FINANCIAL AND OTHER DATA. The markup that BeautifulSoup gives me looks like The Origin Code. My code is attached below; can anyone tell me where it goes wrong?

import re

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r'https://www.sec.gov/Archives/edgar/data/1181232/000104746903038553/a2123752z424b4.htm'

filing_url = requests.get(url)
content = filing_url.text
soup = BeautifulSoup(content, 'lxml')

tables = soup.find_all(text=re.compile('SUMMARY CONSOLIDATED FINANCIAL AND OTHER DATA'))

n_columns = 0
n_rows = 0
column_names = []
for table in tables:
    for row in table.find_next('table').find_all('tr'):

        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows += 1
            if n_columns == 0:
                # Set the number of columns for the table
                n_columns = len(td_tags)

        # Handle column names if we find them
        th_tags = row.find_all('th')
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on column titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0, n_columns)
    df = pd.DataFrame(columns=columns,
                      index=range(0, n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker, column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1

    print(df)
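One thing worth checking in the code above: `soup.find_all(text=...)` returns `NavigableString` nodes rather than `Tag` objects. A string node supports `find_next`, but it has no `find_all`, so the second loop's `table.find_all('tr')` raises an `AttributeError`; the table has to be reached with `find_next('table')` first, as the first loop already does. A minimal sketch of that navigation, using the built-in `html.parser` and a made-up snippet of HTML in place of the filing:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the filing: a caption followed by a table.
html = """
<p>SUMMARY CONSOLIDATED FINANCIAL AND OTHER DATA</p>
<table><tr><td>Revenue</td><td>100</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# string= (the newer spelling of text=) matches text nodes, so each hit
# is a NavigableString, not a <table> Tag.
hit = soup.find_all(string=re.compile("SUMMARY CONSOLIDATED"))[0]

# Navigate from the string node to the following <table> Tag,
# then iterate its rows as usual.
table = hit.find_next("table")
rows = table.find_all("tr")
print(len(rows))  # 1
```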

【Comments】:

    Tags: python web beautifulsoup screen-scraping


    【Solution 1】:

    In this particular case, you can simplify things significantly with pandas:

    import pandas as pd
    url = 'https://www.sec.gov/Archives/edgar/data/1181232/000104746903038553/a2123752z424b4.htm'
    
    tables = pd.read_html(url)
    #there are more than 100 tables on that page, so you have to narrow it down
    
    targets = []
    for t in tables:
        if 'Unaudited' in str(t.columns):
            targets.append(t)
    targets[0] #only two meet that requirement, and the first is your target
    

    The output is your target table.
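As a side note, `pandas.read_html` also accepts a `match` argument (a string or compiled regex) that keeps only tables whose text matches it, which can replace the manual filtering loop above. A minimal, self-contained sketch with made-up HTML standing in for the SEC filing:

```python
from io import StringIO

import pandas as pd

# Two small tables; only the second contains the word "Unaudited".
html = """
<table><tr><th>Year</th><th>Revenue</th></tr>
<tr><td>2002</td><td>100</td></tr></table>
<table><tr><th>Unaudited</th><th>Total</th></tr>
<tr><td>2003</td><td>200</td></tr></table>
"""

# match= filters tables server-side of the loop: only tables whose text
# matches the pattern are parsed into DataFrames.
tables = pd.read_html(StringIO(html), match="Unaudited")
print(len(tables))  # 1
```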

    【Discussion】:
