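【Posted】: 2020-03-17 08:39:08
【Question】:
Using BeautifulSoup in Python, I am trying to extract this table.
First, I have to enter a "Start Date" and an "End Date" so that the page shows the data for that period. The pagination links I found by inspecting the HTML page are as follows.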
<ul class="pager"><li class="pager-current first">1</li>
<li class="pager-item"><a title="Go to page 2" href="/shop/finance-manager/mprequest?page=1&shop%2Ffinance-manager%2Fmprequest=">2</a></li>
<li class="pager-item"><a title="Go to page 3" href="/shop/finance-manager/mprequest?page=2&shop%2Ffinance-manager%2Fmprequest=">3</a></li>
<li class="pager-item"><a title="Go to page 4" href="/shop/finance-manager/mprequest?page=3&shop%2Ffinance-manager%2Fmprequest=">4</a></li>
<li class="pager-item"><a title="Go to page 5" href="/shop/finance-manager/mprequest?page=4&shop%2Ffinance-manager%2Fmprequest=">5</a></li>
<li class="pager-item"><a title="Go to page 6" href="/shop/finance-manager/mprequest?page=5&shop%2Ffinance-manager%2Fmprequest=">6</a></li>
<li class="pager-item"><a title="Go to page 7" href="/shop/finance-manager/mprequest?page=6&shop%2Ffinance-manager%2Fmprequest=">7</a></li>
<li class="pager-item"><a title="Go to page 8" href="/shop/finance-manager/mprequest?page=7&shop%2Ffinance-manager%2Fmprequest=">8</a></li>
<li class="pager-item"><a title="Go to page 9" href="/shop/finance-manager/mprequest?page=8&shop%2Ffinance-manager%2Fmprequest=">9</a></li>
<li class="pager-ellipsis">…</li>
<li class="pager-next"><a title="Go to next page" href="/shop/finance-manager/mprequest?page=1&shop%2Ffinance-manager%2Fmprequest=">next ›</a></li>
<li class="pager-last last"><a title="Go to last page" href="/shop/finance-manager/mprequest?page=11&shop%2Ffinance-manager%2Fmprequest=">last »</a></li>
</ul>
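The page links here all sit under "pager-item", but the actual number of pages can be read from the "pager-last last" element (i.e. 11). So I need code that runs over all 11 of these pages (probably with a for loop).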
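To make the "read the page count from pager-last" step concrete, here is a minimal sketch that pulls the `page` value out of the last-page link. The function name `last_page_index` is made up for illustration, and it assumes the pager markup looks like the snippet above. Note that the `page` parameter appears to be zero-based (the "Go to page 2" link uses `page=1`), so `page=11` may actually correspond to a twelfth page.

```python
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup


def last_page_index(html):
    """Return the `page` value of the "last" pager link, or 0 if there is no pager.

    Hypothetical helper: assumes a <li class="pager-last ..."> that wraps an
    <a href="...?page=N&..."> link, as in the pager markup shown above.
    """
    soup = BeautifulSoup(html, "html.parser")
    last_item = soup.find("li", class_="pager-last")
    if last_item is None or last_item.find("a") is None:
        return 0  # single page, so no pager is rendered
    href = last_item.find("a")["href"]
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])
```

If the value really is zero-based, the pages to request would be `range(last_page_index(html) + 1)`.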
This is the html portion I intend to scrape.
This is my code for scraping a single page, which works fine.
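from bs4 import BeautifulSoup as bs
import pandas as pd

# Total_Fin_Page is the response for the finance page, fetched earlier (not shown here)
Beautiful_Fin_Page = bs(Total_Fin_Page.content, 'lxml')
OrderID_Container = Beautiful_Fin_Page('tbody')  # shorthand for find_all('tbody')
Table = {
    "Transaction Number": [],
    "Sale Order": [],
    "Return Sale Order": [],
    "Requisition Date": [],
    "Requisition Time": [],
    "Customer Name": []
}
for orders in OrderID_Container:
    if orders.find('tr') is not None:
        # data rows carry the class "odd" or "even"
        trs = orders.find_all('tr', {'class': ['odd', 'even']})
        for tr in trs:
            td = tr.find_all('td')
            print(td)
            Table["Transaction Number"].append(td[0].text)
            # Assumption: columns 1 and 2 hold the two sale-order fields
            Table["Sale Order"].append(td[1].text)
            Table["Return Sale Order"].append(td[2].text)
            Table["Requisition Date"].append(td[3].text)
            Table["Requisition Time"].append(td[4].text)
            Table["Customer Name"].append(td[5].text)
# every list must have the same length, otherwise pd.DataFrame raises ValueError
df = pd.DataFrame(Table)
print(df)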
So could you share how to use a for loop to extract the data of the whole table across all the pages?
【Discussion】:
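One possible shape for the loop, offered as a sketch rather than a tested solution for this site: parse each page with the same row logic as the single-page code, and request the pages by changing only the `page` query parameter. The names `parse_rows`, `scrape_all_pages` and `fetch_page` are hypothetical, the column positions are taken from the question's code, and it is an assumption that the date filter and login carry over between requests (for example through a `requests.Session`).

```python
from bs4 import BeautifulSoup

# Column name -> <td> position, taken from the single-page code in the question.
COLUMNS = {
    "Transaction Number": 0,
    "Requisition Date": 3,
    "Requisition Time": 4,
    "Customer Name": 5,
}


def parse_rows(html):
    """Return one dict per data row (<tr class="odd"> or <tr class="even">)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_=["odd", "even"]):
        tds = tr.find_all("td")
        if len(tds) <= max(COLUMNS.values()):
            continue  # skip rows that do not have all expected cells
        rows.append({name: tds[i].get_text(strip=True) for name, i in COLUMNS.items()})
    return rows


def scrape_all_pages(fetch_page, last_page):
    """Collect rows from pages 0..last_page (inclusive).

    `fetch_page` is any callable that takes a page index and returns that
    page's HTML as a string, so the loop itself needs no network access.
    """
    all_rows = []
    for page in range(last_page + 1):
        all_rows.extend(parse_rows(fetch_page(page)))
    return all_rows


# Hypothetical usage, assuming `session` is a logged-in requests.Session and
# BASE_URL is the full address of /shop/finance-manager/mprequest:
#   rows = scrape_all_pages(
#       lambda p: session.get(BASE_URL, params={"page": p}).text,
#       last_page=11,
#   )
#   df = pd.DataFrame(rows)
```

Passing the fetch step in as a callable keeps the parsing separate from the HTTP details, so the loop can be checked against saved HTML before it is pointed at the live site.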
Tags: python html web-scraping pagination