【Question Title】: How to extract the maximum number of results from pagination with BeautifulSoup?
【Posted】: 2022-01-21 20:16:57
【Question】:

I am trying to select the pagination section and want to extract the maximum number of results, 2143.

numbers = contents.find(name="div", attrs={"class": "pagination"})
print(numbers.attrs)
print(numbers)
print(numbers.get_text(' ', strip=True))

This code gives me the following result:

    {'class': ['pagination']}
    <div class="pagination"><span>Showing 1-30 of 2143</span><ul><li><div class="prev"></div></li><li><span class="disabled">1</span></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":2}' data-page="2" data-remote="true" href="/san-francisco-ca/dentists?page=2">2</a></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":3}' data-page="3" data-remote="true" href="/san-francisco-ca/dentists?page=3">3</a></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":4}' data-page="4" data-remote="true" href="/san-francisco-ca/dentists?page=4">4</a></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":5}' data-page="5" data-remote="true" href="/san-francisco-ca/dentists?page=5">5</a></li><li><a class="next ajax-page" data-analytics='{"click_id":132}' data-page="2" data-remote="true" href="/san-francisco-ca/dentists?page=2">Next</a></li></ul></div>
    Showing 1-30 of 2143 1 2 3 4 5 Next

How can I extract only 2143 from this text?

Showing 1-30 of 2143 1 2 3 4 5 Next

【Question Comments】:

    Tags: python web-scraping beautifulsoup


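    【Solution 1】:

    Select a more specific tag. One option is to chain conditions with CSS selectors: select the first <span> that is a direct child of the <div> with class pagination, split its text on spaces, and take the last element of the list: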

    soup.select_one('div.pagination > span').text.split(' ')[-1]
    

    Example

    html = '''<div class="pagination"><span>Showing 1-30 of 2143</span><ul><li><div class="prev"></div></li><li><span class="disabled">1</span></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":2}' data-page="2" data-remote="true" href="/san-francisco-ca/dentists?page=2">2</a></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":3}' data-page="3" data-remote="true" href="/san-francisco-ca/dentists?page=3">3</a></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":4}' data-page="4" data-remote="true" href="/san-francisco-ca/dentists?page=4">4</a></li><li><a data-analytics='{"click_id":132,"module":1,"listing_page":5}' data-page="5" data-remote="true" href="/san-francisco-ca/dentists?page=5">5</a></li><li><a class="next ajax-page" data-analytics='{"click_id":132}' data-page="2" data-remote="true" href="/san-francisco-ca/dentists?page=2">Next</a></li></ul></div>'''
    
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package

    print(soup.select_one('div.pagination > span').text.split(' ')[-1])
    

    Output

    2143
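
The split-based approach returns a string and depends on the exact wording of the label. As a minimal sketch of a more defensive variant, the snippet below uses a regular expression and converts the result to an integer. The helper name `extract_total`, the shortened sample markup, the built-in `html.parser` backend, and the handling of thousands separators are assumptions for illustration and are not part of the original answer.

```python
import re
from bs4 import BeautifulSoup


def extract_total(html):
    """Return the total from a 'Showing X-Y of N' label, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Same selector as the answer: first <span> directly under div.pagination
    label = soup.select_one("div.pagination > span")
    if label is None:
        return None
    # Capture the digits (and optional commas) that follow the word "of"
    match = re.search(r"of\s+([\d,]+)", label.get_text(strip=True))
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Shortened version of the markup from the question
sample = '<div class="pagination"><span>Showing 1-30 of 2143</span><ul><li><span class="disabled">1</span></li></ul></div>'
total = extract_total(sample)
print(total)
```

Returning None when the element or the pattern is missing lets the caller decide how to handle a page without pagination, instead of failing with an AttributeError or IndexError.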
    

    【Comments】:

    • This code: soup.select_one('div.pagination > span').text.split(' ')[3] works like the code above.
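
For the label in the question, index 3 and index -1 point at the same word, but they diverge as soon as the wording changes. The short sketch below illustrates this with plain strings. The two variant labels are hypothetical and do not come from the original page.

```python
label = "Showing 1-30 of 2143"
words = label.split(" ")
print(words[3], words[-1])

# Hypothetical label with a trailing word: only the fixed index still finds the total
trailing = "Showing 1-30 of 2143 results".split(" ")
print(trailing[3], trailing[-1])

# Hypothetical label with an extra leading word: only the last index still finds the total
leading = "Showing results 1-30 of 2143".split(" ")
print(leading[3], leading[-1])
```

Both indexing styles are position-dependent, so whichever one is used should be checked against the actual label text of the target site.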
    【Solution 2】:

    Instead of numbers.get_text, find the "span", get its text, rsplit it once, and take the second element:

    out = numbers.find('span').text.rsplit(' ', 1)[1]
    

    Output:

    '2143'
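
To make the rsplit step easier to follow, here is a self-contained sketch that rebuilds the `numbers` object from the question and prints the intermediate list. The shortened sample markup, the `html.parser` backend, and the final conversion to `int` are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question
html = '<div class="pagination"><span>Showing 1-30 of 2143</span><ul><li><span class="disabled">1</span></li></ul></div>'
contents = BeautifulSoup(html, "html.parser")
numbers = contents.find(name="div", attrs={"class": "pagination"})

# find() returns the first matching descendant, here the "Showing ..." span
label = numbers.find("span").text

# rsplit with maxsplit=1 splits once, starting from the right-hand side
parts = label.rsplit(" ", 1)
total = int(parts[1])

print(parts)
print(total)
```

Because the split happens only once from the right, everything before the last space stays together in the first element and the total is always the second element.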
    

    【Comments】:
