[Posted]: 2021-08-16 01:50:10
[Problem description]:
Problem: the "easy" way to get table info with Pandas (pd.read_html()) doesn't work for my use case.
It just extracts what I assume is the tag text, which has this newbie confused. What I need, at a minimum, is the link (to the pdf) rather than its text.
The table is obtained from an ASPX page via Requests/BeautifulSoup. I can get that table into a Pandas DataFrame without issue.
If you use the link below, please copy and paste it so the referrer is stripped. With my luck, some IT guy would change the code and break the script sooner than necessary. lol
Link to page (you'll have to search manually using the variables defined in the script).
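Side note for anyone reading later: pandas 1.5 added an extract_links parameter to read_html that keeps the href targets instead of throwing them away. A minimal, self-contained sketch; the table markup below is made up to resemble one row of the results grid:

```python
from io import StringIO

import pandas as pd

# Made-up markup modeled on the results grid (not the real page).
html_doc = """
<table id="caseSearchResultGrid">
  <thead><tr><th>Docket Number</th><th>Links</th></tr></thead>
  <tbody><tr>
    <td>MJ-11101-CR-0000001-2020</td>
    <td><a href="/Report/DocketSheet?id=1">Docket Sheet</a></td>
  </tr></tbody>
</table>
"""

# extract_links="body" (pandas >= 1.5) turns every body cell into a
# (text, href) tuple; cells without a link come back as (text, None).
df = pd.read_html(StringIO(html_doc), extract_links="body")[0]
print(df.iloc[0, 1])  # ('Docket Sheet', '/Report/DocketSheet?id=1')
```

That still leaves the (text, href) tuples to unpack, but it avoids losing the URLs in the first place.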
Scraper.py:
import requests
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup as bs

# User-defined variables
SearchBy = 'DateFiled'
FiledStartDate = '2020-01-01'
FiledEndDate = '2020-01-01'
County = 'Luzerne'
MDJSCourtOffice = 'MDJ-11-1-01'

host = "ujsportal.pacourts.us"
base_url = "https://" + host
search_url = base_url + "/CaseSearch"

# Headers are required. Do not change.
# (Implicit string concatenation avoids the stray whitespace that a
# backslash continuation would embed inside the header values.)
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': host,
    'Origin': base_url,
    'Referer': search_url,
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) '
                  'Gecko/20100101 Firefox/88.0'
}

# Open session request to obtain proper cookies
ses = requests.session()
req = ses.get(search_url, headers=headers)

# Get required hidden token so we can search
tree = html.fromstring(req.content)
veri_token = tree.xpath("/html/body/div[3]/div[2]/div/form/input/@value")[0]

# Import search criteria from user-defined variables
payload = {
    'SearchBy': SearchBy,
    'AdvanceSearch': 'true',
    'FiledStartDate': FiledStartDate,
    'FiledEndDate': FiledEndDate,
    'County': County,
    'MDJSCourtOffice': MDJSCourtOffice,
    '__RequestVerificationToken': veri_token
}

# Make search request
results = ses.post(search_url, data=payload, headers=headers)

# Save HTML page to disk
with open("tmp/test_draft1.html", "w") as f:
    f.write(results.text)

# Open local HTML page for processing
# (Named html_file so it does not shadow the lxml "html" import.)
with open("tmp/test_draft1.html") as html_file:
    page = bs(html_file, 'lxml')
table = page.find('table', {'id': 'caseSearchResultGrid'})

# Save table as separate HTML for later audit
with open("tmp/test_draft1_table.html", "w") as f:
    f.write(table.prettify())

# Remove unneeded tags so we don't have to do it in Pandas
def clean_tags(table):
    for tag in table.select('div.bubble-text'):
        tag.decompose()
    for tag in table.select('div.modal'):
        tag.decompose()
    for tag in table.find_all(['th', 'tr', 'td'], class_="display-none"):
        tag.decompose()
    for tag in table.select('tfoot'):
        tag.decompose()

clean_tags(table)

# Start constructing dataset
columns = table.find('thead').find_all('th')
column_names = [c.get_text() for c in columns]
table_rows = table.find('tbody').find_all('tr')

case_info = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [td.get_text() for td in cells]  # renamed so the loop variable
    case_info.append(row)                  # no longer shadows tr

# Forward dataset to Pandas for analysis
df = pd.DataFrame(case_info, columns=column_names)
df.columns.values[16] = "Docket URL"
if SearchBy == 'DateFiled':
    df.drop(columns=['Event Type', 'Event Status',
                     'Event Date', 'Event Location'], inplace=True)
print(df)
exit("Scrape Complete!")
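What I'm ultimately after is a per-cell substitution while the rows are being built: keep the text for normal cells, but swap in the href wherever a cell contains an anchor. A sketch of that idea; the sample row markup, base_url value, and column layout here are made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up single-row sample resembling the results grid.
base_url = "https://ujsportal.pacourts.us"
sample = """
<table><tr>
  <td>MJ-11101-CR-0000001-2020</td>
  <td><a href="/Report/DocketSheet?id=1">Docket Sheet</a></td>
</tr></table>
"""
tr = BeautifulSoup(sample, "html.parser").find("tr")

row = []
for td in tr.find_all("td"):
    link = td.find("a")
    if link is not None:
        row.append(base_url + link["href"])  # keep the URL, not the label
    else:
        row.append(td.get_text(strip=True))

print(row)
# ['MJ-11101-CR-0000001-2020',
#  'https://ujsportal.pacourts.us/Report/DocketSheet?id=1']
```

Building each row this way means the DataFrame never sees the "Docket SheetCourt Summary" label text at all.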
This pulls the docket pdf links themselves into a separate list just fine, but it can't update the cells properly:
for row in table_rows:
    row_processed = []
    cells = row.find_all("td")
    if len(cells) == 17:
        docket_url = base_url + cells[16].find('a')['href']
        row_processed.append(docket_url)
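If I could collect one absolute URL per row with that loop, I think a single column assignment would overwrite the label text. A sketch with stand-in data (the frame contents and URLs below are made up):

```python
import pandas as pd

# Stand-in for the scraped frame, with the label text the scraper
# currently produces in the "Docket URL" column.
df = pd.DataFrame({
    "Docket Number": ["MJ-11101-CR-0000001-2020",
                      "MJ-11101-CR-0000003-2020"],
    "Docket URL": ["Docket SheetCourt Summary",
                   "Docket SheetCourt Summary"],
})

# Stand-in for the URLs collected one-per-row by the loop.
docket_urls = [
    "https://ujsportal.pacourts.us/Report/DocketSheet?id=1",
    "https://ujsportal.pacourts.us/Report/DocketSheet?id=2",
]

# The list length must equal len(df); pandas assigns by position.
df["Docket URL"] = docket_urls
print(df)
```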
Current truncated print(df) output snippet:
Docket Number ... Docket URL
0 MJ-11101-CR-0000001-2020 ... Docket SheetCourt Summary
1 MJ-11101-CR-0000003-2020 ... Docket SheetCourt Summary
2 MJ-11101-CR-0000006-2020 ... Docket SheetCourt Summary
3 MJ-11101-NT-0000081-2020 ... Docket SheetCourt Summary
Desired truncated print(df) output snippet:
Docket Number ... Docket URL
0 MJ-11101-CR-0000001-2020 ... https://link/to/docketPDF
1 MJ-11101-CR-0000003-2020 ... https://link/to/docketPDF
2 MJ-11101-CR-0000006-2020 ... https://link/to/docketPDF
3 MJ-11101-NT-0000081-2020 ... https://link/to/docketPDF
[Comments]:
- When I print(df) I get the following output: [Empty DataFrame Columns: [Info Sheet] Index: []]
- "I have a feeling I'm going to have to parse each column in BS and send it to Pandas." You should do that.
- Can you provide the source, if possible? I tried the data with the provided table snippet, loaded it, and got the same result @MendelG got.
- @MendelG, so much for the KISS approach. Haha. If I were writing the web page, I would hardly ever put a div inside a table. But I think the same thing would happen if there were an image inside a cell. I'm new to scraping and Python in general, but after some further searching I came up with the for-loop option.
- Yes, you just need splited_html_page.forEach(regex find)
Tags: python pandas web-scraping beautifulsoup