Title: How to obtain a link inside of a DIV nested in a TD with BeautifulSoup
Posted: 2021-08-16 01:50:10
Question:

Problem: The "easy" way of getting table information with Pandas (pd.read_html()) does not work for my use case.

It just pulls what I assume is the tag text, which confuses this newbie. What I need is, at a minimum, the link (to the PDF) text.

The table is obtained from an ASPX page via Requests/BeautifulSoup. I can get that table into a Pandas DataFrame without issue.

If you use the link below, please copy and paste it so the referrer is removed. With my luck, some IT person would change the code and break the script sooner than necessary. lol

Link to page (you will have to search manually, using the variables defined in the script)
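
To see why pd.read_html() falls short here, consider a minimal sketch (the HTML snippet below is invented, but mirrors the table's structure): get_text() flattens a cell to its label text, while the href lives on the nested <a> tag.

from bs4 import BeautifulSoup

snippet = ('<table><tr><td><div>'
           '<a href="/Report/123">Docket Sheet</a>'
           '</div></td></tr></table>')
cell = BeautifulSoup(snippet, 'lxml').find('td')

print(cell.get_text())         # Docket Sheet   <- all pd.read_html() keeps
print(cell.find('a')['href'])  # /Report/123    <- the part that gets lost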

Scraper.py:

import requests
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup as bs

# User-defined variables
SearchBy = 'DateFiled'
FiledStartDate = '2020-01-01'
FiledEndDate = '2020-01-01'
County = 'Luzerne'
MDJSCourtOffice = 'MDJ-11-1-01'

host = "ujsportal.pacourts.us"
base_url = "https://" + host
search_url = base_url + "/CaseSearch"

# Headers are required. Do not change.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': host,
    'Origin': base_url,
    'Referer': search_url,
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) '
                  'Gecko/20100101 Firefox/88.0'
}

# Open a session and make an initial request to obtain the proper cookies
ses = requests.session()
req = ses.get(search_url, headers=headers)

# Get required hidden token so we can search
tree = html.fromstring(req.content)
veri_token = tree.xpath("/html/body/div[3]/div[2]/div/form/input/@value")[0]
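# NOTE: the absolute XPath above is brittle. A name-based query such as
#   tree.xpath("//input[@name='__RequestVerificationToken']/@value")[0]
# should survive page-layout changes, assuming the hidden field keeps that name.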

# Import search criteria from user-defined variables
payload = {
    'SearchBy': SearchBy,
    'AdvanceSearch': 'true',
    'FiledStartDate': FiledStartDate,
    'FiledEndDate': FiledEndDate,
    'County': County,
    'MDJSCourtOffice': MDJSCourtOffice,
    '__RequestVerificationToken': veri_token
}

# Make search request
results = ses.post(
    search_url,
    data=payload,
    headers=headers
)

# Save html page to disk
with open("tmp/test_draft1.html", "w") as f:
    f.write(results.text)

# Open the local HTML page for processing (handle named fh so it does not
# shadow the lxml html import above)
with open("tmp/test_draft1.html") as fh:
    page = bs(fh, 'lxml')

table = page.find('table', {'id': 'caseSearchResultGrid'})

# Save table as separate HTML for later audit
with open("tmp/test_draft1_table.html", "w") as f:
    f.write(table.prettify())


# Remove unneeded tags so we don't have to do it in Pandas
def clean_tags(table):
    for tag in table.select('div.bubble-text'):
        tag.decompose()
    for tag in table.select('div.modal'):
        tag.decompose()
    for tag in table.find_all(['th', 'tr', 'td'], class_="display-none"):
        tag.decompose()
    for tag in table.select('tfoot'):
        tag.decompose()


clean_tags(table)

# Start constructing dataset
columns = table.find('thead').find_all('th')
column_names = [c.get_text() for c in columns]

table_rows = table.find('tbody').find_all('tr')

case_info = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [c.get_text() for c in td]  # distinct name; don't shadow the outer tr
    case_info.append(row)

# Forward dataset to Pandas for analysis
df = pd.DataFrame(case_info, columns=column_names)
df.columns.values[16] = "Docket URL"

if SearchBy == 'DateFiled':
    df.drop(columns=['Event Type',
            'Event Status', 'Event Date', 'Event Location'], inplace=True)

print(df)
exit("Scrape Complete!")

This pulls the docket PDF links themselves into a separate list, but I cannot get the cells to update correctly.

for row in table_rows:
    row_processed = []
    cells = row.find_all("td")
    if len(cells) == 17:
        docket_url = base_url + cells[16].find('a')['href']
        row_processed.append(docket_url)
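    # NOTE: row_processed is re-created on every pass and never used afterwards,
    # so the collected URLs are thrown away instead of ending up in case_info.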

Current print(df) output (truncated snippet):

              Docket Number  ...                 Docket URL
0  MJ-11101-CR-0000001-2020  ...  Docket SheetCourt Summary
1  MJ-11101-CR-0000003-2020  ...  Docket SheetCourt Summary
2  MJ-11101-CR-0000006-2020  ...  Docket SheetCourt Summary
3  MJ-11101-NT-0000081-2020  ...  Docket SheetCourt Summary

Desired print(df) output (truncated snippet):

              Docket Number  ...                 Docket URL
0  MJ-11101-CR-0000001-2020  ...  https://link/to/docketPDF
1  MJ-11101-CR-0000003-2020  ...  https://link/to/docketPDF
2  MJ-11101-CR-0000006-2020  ...  https://link/to/docketPDF
3  MJ-11101-NT-0000081-2020  ...  https://link/to/docketPDF

Comments:

  • When I print(df) I get the following output: [Empty DataFrame Columns: [Info Sheet] Index: []]
  • "I have a feeling I'll have to parse each column in BS and send it to Pandas." You should do exactly that
  • If you can, could you provide the source? I tried it with the table snippet provided, loaded it, and got the same results @MendelG got.
  • @MendelG, so much for the KISS approach. Haha. If I were writing the web page, I would hardly put a div inside a table. But I assume the same thing would happen with an image inside a cell. I'm generally new to scraping and Python, but after some further searching I came up with the for-loop option.
  • Yes, you just need splited_html_page.forEach(regex find)

Tags: python pandas web-scraping beautifulsoup


Solution 1:

OK, with the help and guidance of QHarr in the comments on the original question, I came up with a solution. As with anything coding-related, I'm sure this is not the only answer.

Anyway... after trying and failing to integrate the two iteration loops the way I wanted, concatenating two Pandas DataFrames did the trick.

Case info data:

case_info = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [c.get_text() for c in td]
    case_info.append(row)

Docket URLs (builds a list of one-item lists holding the URLs):

docket_urls = []
for drow in table_rows:
    docket_sheets = []
    cells = drow.find_all("td")
    if len(cells) == 17:
        docket_url = base_url + cells[16].find('a')['href']
        docket_sheets.append(docket_url)
    docket_urls.append(docket_sheets)
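
The loop above appends an empty inner list for rows without a link, which keeps the row count aligned with case_info for the concat below. A minimal refactor of the same idea (the helper name is my own; the walrus operator needs Python 3.8+):

def docket_url(row):
    """Return [full URL] for a 17-cell row with a docket link, else []."""
    cells = row.find_all('td')
    if len(cells) == 17 and (a := cells[16].find('a')):
        return [base_url + a['href']]
    return []

docket_urls = [docket_url(drow) for drow in table_rows]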

DataFrames:

# Import case info dataset to Pandas
df_case_info = pd.DataFrame(case_info, columns=column_names)
df_case_info.columns.values[16] = "Docket Text"  # Rename col = easy to drop

df_case_info.drop(columns=['Docket Text'], inplace=True)

if SearchBy == 'DateFiled':
    df_case_info.drop(columns=['Event Type',
            'Event Status', 'Event Date', 'Event Location'], inplace=True)

# Import docket URLs into Pandas (columns must be a list, not a set)
df_docket_urls = pd.DataFrame(docket_urls, columns=['Docket URL'])

# Concatenate both DataFrames into one
df_mdj = pd.concat([df_case_info, df_docket_urls], axis=1)
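
From here the combined frame can be inspected or exported as usual, for example (the file name is just illustrative):

print(df_mdj.head())
df_mdj.to_csv("tmp/mdj_cases.csv", index=False)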

I would still like to learn how to do it the way I originally planned, but if it ain't broke, don't fix it, right?
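
For reference, here is a sketch of what that single-pass version might look like: it swaps the link cell's text for the full URL while each row is built (assuming the same table_rows and base_url as above, and Python 3.8+ for the walrus operator):

case_info = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [c.get_text() for c in cells]
    # Replace the combined link text ("Docket SheetCourt Summary") with the URL
    if len(cells) == 17 and (a := cells[16].find('a')):
        row[16] = base_url + a['href']
    case_info.append(row)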

Thanks to those who helped. This is what I call Suckcess. :)

Discussion:
