【Title】: How to parse the next page with Beautiful Soup?
【Posted】: 2016-03-04 12:55:58
【Question】:

I'm using the code below to parse a page and follow its "next page" link:

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parseNextThemeUrl(url):
    ret = []
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, PARSER)
    next_link = soup.find('a', class_='pager_next')
    if next_link:
        next_url = urljoin(url, next_link.get('href'))
        for r in parseNextThemeUrl(next_url):
            ret.append(r)
    else:
        ret.append(url)
    return ret

But I get the error below. How can I follow the next-page link when one exists?

Traceback (most recent call last):
html = urllib.request.urlopen(url)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 456, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
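For reference, this traceback is what `urlopen` raises when it is handed a list rather than a URL string: any argument that is not a `str` is treated as a `Request` object, and setting `.timeout` on a list fails. A minimal reproduction (no network access happens, since the error is raised before any connection is opened):

```python
import urllib.request

try:
    # Passing a list where a URL string is expected
    urllib.request.urlopen(["http://example.com"])
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'timeout'
```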

【Comments】:

  • Can you give us the URL? Without knowing the page, we can't say much for sure.
  • http://003.b2btoys.net/en/ProductList.aspx?Class1=12 http://003.b2btoys.net/en/ProductList.aspx?PageIndex=2&Class1=13&Class2=0&type=&keyWord=

Tags: html python-3.x web-scraping bs4


【Solution 1】:

My own answer is below:

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parseNextThemeUrl(url):
    urls = [url]  # always include the current page
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    new_page = soup.find('a', class_='pager_next')

    if new_page:
        # resolve the (possibly relative) href against the current URL
        new_url = urljoin(url, new_page.get('href'))
        for url1 in parseNextThemeUrl(new_url):
            urls.append(url1)
    return urls
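The recursion above runs one level deeper per page, so a very long pagination chain can hit Python's recursion limit. The same collection can be done iteratively. A sketch, not from the original answer: `fetch_next` is a hypothetical hook that, given a page URL, returns the next page's absolute URL or `None` (with BeautifulSoup it would `urlopen` the page, `find('a', class_='pager_next')`, and `urljoin` the href):

```python
def collect_page_urls(start_url, fetch_next):
    """Follow 'next page' links, returning every page URL visited.

    fetch_next(url) must return the next page's absolute URL,
    or None when there is no further page.
    """
    urls = [start_url]
    seen = {start_url}
    url = start_url
    while True:
        nxt = fetch_next(url)
        if nxt is None or nxt in seen:  # end of chain, or a pagination loop
            break
        urls.append(nxt)
        seen.add(nxt)
        url = nxt
    return urls
```

The `seen` set also guards against sites whose last page links back to the first, which would make the recursive version loop forever.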

【Discussion】:
