【Question Title】: Extracting a table by web scraping using Python
【Posted】: 2020-03-17 08:39:08
【Question Description】:

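Using BeautifulSoup in Python, I am trying to extract this table.

First, I have to enter a "From Date" and a "To Date" to collect the required data for that period. The pager links I found by inspecting the HTML page are as follows.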

<ul class="pager"><li class="pager-current first">1</li>
<li class="pager-item"><a title="Go to page 2" href="/shop/finance-manager/mprequest?page=1&amp;shop%2Ffinance-manager%2Fmprequest=">2</a></li>
<li class="pager-item"><a title="Go to page 3" href="/shop/finance-manager/mprequest?page=2&amp;shop%2Ffinance-manager%2Fmprequest=">3</a></li>
<li class="pager-item"><a title="Go to page 4" href="/shop/finance-manager/mprequest?page=3&amp;shop%2Ffinance-manager%2Fmprequest=">4</a></li>
<li class="pager-item"><a title="Go to page 5" href="/shop/finance-manager/mprequest?page=4&amp;shop%2Ffinance-manager%2Fmprequest=">5</a></li>
<li class="pager-item"><a title="Go to page 6" href="/shop/finance-manager/mprequest?page=5&amp;shop%2Ffinance-manager%2Fmprequest=">6</a></li>
<li class="pager-item"><a title="Go to page 7" href="/shop/finance-manager/mprequest?page=6&amp;shop%2Ffinance-manager%2Fmprequest=">7</a></li>
<li class="pager-item"><a title="Go to page 8" href="/shop/finance-manager/mprequest?page=7&amp;shop%2Ffinance-manager%2Fmprequest=">8</a></li>
<li class="pager-item"><a title="Go to page 9" href="/shop/finance-manager/mprequest?page=8&amp;shop%2Ffinance-manager%2Fmprequest=">9</a></li>
<li class="pager-ellipsis">…</li>
<li class="pager-next"><a title="Go to next page" href="/shop/finance-manager/mprequest?page=1&amp;shop%2Ffinance-manager%2Fmprequest=">next ›</a></li>
<li class="pager-last last"><a title="Go to last page" href="/shop/finance-manager/mprequest?page=11&amp;shop%2Ffinance-manager%2Fmprequest=">last »</a></li>
</ul>

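The page links here are all under "pager-item", but the actual number of pages can be read from the "pager-last last" element (here, 11). So I have to run code that works across all of these pages (probably with a for loop).

This is the HTML portion I intend to scrape.

Here is my code for scraping a single page, which works fine: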

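import pandas as pd
from bs4 import BeautifulSoup as bs

# Total_Fin_Page is the response object for the page, fetched earlier
Beautiful_Fin_Page = bs(Total_Fin_Page.content, 'lxml')
OrderID_Container = Beautiful_Fin_Page.find_all('tbody')

Table = {
    "Transaction Number": [],
    "Sale Order": [],
    "Return Sale Order": [],
    "Requisition Date": [],
    "Requisition Time": [],
    "Customer Name": [],
}

for orders in OrderID_Container:
    if orders.find('tr') is not None:
        trs = orders.find_all('tr', {'class': ['odd', 'even']})
        for tr in trs:
            td = tr.find_all('td')

            # Every list needs one value per row, or pd.DataFrame raises ValueError
            Table["Transaction Number"].append(td[0].text)
            Table["Sale Order"].append(td[1].text)         # column index assumed
            Table["Return Sale Order"].append(td[2].text)  # column index assumed
            Table["Requisition Date"].append(td[3].text)
            Table["Requisition Time"].append(td[4].text)
            Table["Customer Name"].append(td[5].text)

df = pd.DataFrame(Table)
print(df)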

So could you share how to use a for loop to extract the complete table data across all the paginated pages?
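Before looping over pages, it may help to wrap the single-page logic in a function that takes HTML and returns a DataFrame. The following is only a sketch under assumptions: the function name `parse_orders`, the `COLUMNS` list, and the sample markup are hypothetical, and the mapping of cell position to field is inferred from the code above rather than confirmed against the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Column names in the order the <td> cells are read.
# Which cell holds which field is an assumption about the real table.
COLUMNS = [
    "Transaction Number",
    "Sale Order",
    "Return Sale Order",
    "Requisition Date",
    "Requisition Time",
    "Customer Name",
]


def parse_orders(html):
    """Return one DataFrame row per <tr class="odd"> or <tr class="even"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tbody tr.odd, tbody tr.even"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < len(COLUMNS):
            continue  # skip rows that do not have every expected cell
        rows.append(dict(zip(COLUMNS, cells)))
    return pd.DataFrame(rows, columns=COLUMNS)


# Made-up sample markup, only to exercise the function.
SAMPLE_HTML = """
<table><tbody>
<tr class="odd"><td>T-1</td><td>SO-1</td><td></td><td>2020-03-01</td><td>10:15</td><td>Alice</td></tr>
<tr class="even"><td>T-2</td><td>SO-2</td><td>RSO-2</td><td>2020-03-02</td><td>11:30</td><td>Bob</td></tr>
</tbody></table>
"""

df = parse_orders(SAMPLE_HTML)
print(df)
```

Building a list of row dictionaries keeps every column the same length by construction, which avoids the mismatched-list error that a dictionary of separate lists can run into.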

【Question Discussion】:

    Tags: python html web-scraping pagination


    【Solution 1】:

    I assume you have some base URL that you are scraping from.

    The for loop you need is:

    # baseurl is the site's scheme and host
    for i in range(12):  # pages 0 through 11
        url = f"{baseurl}/shop/finance-manager/mprequest?page={i}&shop%2Ffinance-manager%2Fmprequest="
        your_function_to_extract_the_data_you_need(url)
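To end up with one table rather than one per page, the per-page results can be collected and concatenated. This is a minimal sketch, not the answerer's code: `build_page_url`, `scrape_all_pages`, and the `fetch_html` / `parse_page` callables are hypothetical names, and the stand-in functions at the bottom exist only so the example runs without network access.

```python
import pandas as pd


def build_page_url(base_url, page):
    """Build the listing URL for a zero-based page index, mirroring the pager hrefs."""
    return (
        f"{base_url}/shop/finance-manager/mprequest"
        f"?page={page}&shop%2Ffinance-manager%2Fmprequest="
    )


def scrape_all_pages(base_url, last_page, fetch_html, parse_page):
    """Fetch pages 0..last_page and stack the per-page tables into one DataFrame."""
    frames = []
    for page in range(last_page + 1):
        html = fetch_html(build_page_url(base_url, page))
        frames.append(parse_page(html))
    return pd.concat(frames, ignore_index=True)


# Stand-ins so the sketch runs offline (both are hypothetical).
def fake_fetch(url):
    return url  # pretend the "HTML" is just the URL


def fake_parse(html):
    return pd.DataFrame({"source": [html]})


result = scrape_all_pages("https://example.com", 2, fake_fetch, fake_parse)
print(result)
```

In real use, `fetch_html` would presumably wrap an HTTP request (for example, a function that returns the response text from a logged-in `requests` session), and `parse_page` would be the single-page table extraction.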
    

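    【Discussion】:

    • Thanks for sharing. But the thing is, you never know in advance what the value in "range" should be, because the number of pages generated depends entirely on the date range I select first. It is completely variable. I need to create a variable that reads the loop count from the "pager-next" or "pager-last last" class, whichever is available.
    • @OmarHasib Then try selecting the element with a CSS selector. I don't remember the exact syntax right now, but the selector should be li.pager-last.last; then get the href of its link and extract the number with this regex: (?<=page=).*(?=&)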
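The selector-plus-regex idea from the comment above could look roughly like this with BeautifulSoup. It is a sketch under assumptions: `last_page_index` is a hypothetical helper, the pager markup is a shortened copy of the one in the question, and the regex is narrowed to `(?<=page=)\d+` (digits only) instead of the `.*(?=&)` form from the comment. Returning 0 when no "last" link exists assumes that a missing pager means a single page of results.

```python
import re

from bs4 import BeautifulSoup


def last_page_index(html):
    """Return the zero-based index of the last page, or 0 if there is no 'last' link."""
    soup = BeautifulSoup(html, "html.parser")
    # The href lives on the <a> inside the <li class="pager-last last">.
    link = soup.select_one("li.pager-last.last a")
    if link is None:
        return 0  # assumption: no pager means only one page
    match = re.search(r"(?<=page=)\d+", link.get("href", ""))
    return int(match.group()) if match else 0


# Shortened copy of the pager markup from the question.
PAGER_HTML = """
<ul class="pager"><li class="pager-current first">1</li>
<li class="pager-item"><a title="Go to page 2" href="/shop/finance-manager/mprequest?page=1&amp;shop%2Ffinance-manager%2Fmprequest=">2</a></li>
<li class="pager-next"><a title="Go to next page" href="/shop/finance-manager/mprequest?page=1&amp;shop%2Ffinance-manager%2Fmprequest=">next ›</a></li>
<li class="pager-last last"><a title="Go to last page" href="/shop/finance-manager/mprequest?page=11&amp;shop%2Ffinance-manager%2Fmprequest=">last »</a></li>
</ul>
"""

last = last_page_index(PAGER_HTML)
print(last)

# The loop bound then follows from the page itself instead of a hard-coded 12.
for page in range(last + 1):
    pass  # fetch and parse page number `page` here
```

The pager should be read from the first page of results, fetched after the date range has been submitted, so that the count reflects the selected dates.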