How to make a crawler with bs4 to scrape a website

【Question title】: How to make a crawler to scrape a website with bs4
【Posted】: 2018-09-01 14:47:09
【Question description】:

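I wrote a script to scrape the quotes and author names from quotes to scrape. In this project I use requests to fetch the page source and bs4 to parse the HTML. I use a while loop to follow the pagination link to the next page, but I want my code to stop running when there are no more pages. My code works, but it never stops running.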

Here is my code:

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    url = 'http://quotes.toscrape.com'
    r = requests.get(url)
    soup = bs(r.text,'html.parser')
    quotes = soup.find_all('span',attrs={"class":"text"})
    authors = soup.find_all('small',attrs={"class":"author"})
    p_link = soup.find('a',text="Next")

    condition = True
    while condition:
        with open('quotes.txt','a') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text+' '+authors[i].text+'\n')
        if p_link not in soup:
            condition = False
            page += 1
            url = 'http://quotes.toscrape.com/page/{}'.format(page)
            r = requests.get(url)
            soup = bs(r.text,'html.parser')
            quotes = soup.find_all('span',attrs={"class":"text"})
            authors = soup.find_all('small',attrs={"class":"author"})
            condition = True
        else:
            condition = False

    print('done')


scrape()

【Question discussion】:

Tags: python python-3.x web-scraping beautifulsoup request

【Solution 1】:

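Because p_link is never in soup, the check p_link not in soup is always true and the loop never stops. I see two reasons for this.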

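1. You search for it by the text "Next", but the actual link text appears to be "Next" + space + right arrow, so the search finds nothing.
2. The tag carries an "href" attribute pointing to the next page, and that value is different on every page.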
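
To illustrate the two reasons above, here is a small offline sketch. The HTML string is only an assumed, simplified stand-in for the site's pager markup (it is not copied from the question), and it uses string=, the current name of the text= argument in Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Assumed, simplified pager markup: the link text is "Next " followed by an arrow in a <span>.
html = (
    '<ul class="pager"><li class="next">'
    '<a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>'
    '</li></ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# The <a> tag has two children (a string and a <span>), so its .string is None
# and an exact match on "Next" finds nothing.
p_link = soup.find('a', string="Next")
print(p_link)              # None

# "x in soup" only looks at the direct children of soup, and None is never one of them,
# so this check is always True and the loop in the question never ends.
print(p_link not in soup)  # True
```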

Also, inside the while loop, setting condition to False in the first if block makes no difference. You set it back to True at the end of that block anyway.

So...

Instead of searching by the text Next, use:

soup.find('li',attrs={"class":"next"})

For the condition, use:

if soup.find('li',attrs={"class":"next"}) is None:
    condition = False
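
As a quick check of that condition, the sketch below wraps it in a helper and runs it on two hand-written snippets. The helper name has_next_page and both HTML strings are hypothetical, made up for illustration rather than taken from the site.

```python
from bs4 import BeautifulSoup

def has_next_page(soup):
    """Return True when the parsed page contains an <li class="next"> element."""
    return soup.find('li', attrs={"class": "next"}) is not None

# Assumed pager markup for a middle page (has a next item) and the last page (has none).
middle_page = BeautifulSoup(
    '<ul class="pager"><li class="previous"><a href="/page/1/">Previous</a></li>'
    '<li class="next"><a href="/page/3/">Next</a></li></ul>',
    'html.parser')
last_page = BeautifulSoup(
    '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>',
    'html.parser')

print(has_next_page(middle_page))  # True
print(has_next_page(last_page))    # False
```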

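Finally, if you also want to write the quotes from the last page, I suggest putting the "write to file" part at the end. Or avoid it altogether, like this: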

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    while True:

        if page == 1:
            url = 'http://quotes.toscrape.com'
        else:
            url = 'http://quotes.toscrape.com/page/{}'.format(page)

        r = requests.get(url)
        soup = bs(r.text,'html.parser')

        quotes = soup.find_all('span',attrs={"class":"text"})
        authors = soup.find_all('small',attrs={"class":"author"})

        # Open the file as UTF-8 and write the element text, not the encoded tag bytes.
        with open('quotes.txt','a',encoding='utf-8') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text+' '+authors[i].text+'\n')

        if soup.find('li',attrs={"class":"next"}) is None:
            break

        page+=1

    print('done')


scrape()
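
Since the next item carries an href that points to the following page, another option is to follow that link instead of counting pages. This is only a sketch under assumptions: crawl, fetch and FAKE_SITE are hypothetical names, and the two fake pages exist only so the loop can be tried without network access. With the real site you could pass something like lambda url: requests.get(url).text as fetch.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://quotes.toscrape.com'

def crawl(fetch, start_url=BASE_URL):
    """Collect (quote, author) pairs, following the next link until none is left.

    fetch is a hypothetical callable that takes a URL and returns the page HTML.
    """
    results = []
    url = start_url
    while url is not None:
        soup = BeautifulSoup(fetch(url), 'html.parser')
        quotes = soup.find_all('span', attrs={"class": "text"})
        authors = soup.find_all('small', attrs={"class": "author"})
        results.extend((q.text, a.text) for q, a in zip(quotes, authors))

        next_item = soup.find('li', attrs={"class": "next"})
        if next_item is None:
            url = None  # no next item: this was the last page
        else:
            # Resolve the relative href against the current URL.
            url = urljoin(url, next_item.a['href'])
    return results

# Two fake pages (assumed markup) standing in for the real site.
FAKE_SITE = {
    'http://quotes.toscrape.com':
        '<span class="text">Quote one</span><small class="author">Author A</small>'
        '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>',
    'http://quotes.toscrape.com/page/2/':
        '<span class="text">Quote two</span><small class="author">Author B</small>',
}

pages = crawl(FAKE_SITE.__getitem__)
print(pages)  # [('Quote one', 'Author A'), ('Quote two', 'Author B')]
```

Passing the fetch function in keeps the loop itself free of network calls, which makes the stop condition easy to test on its own.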

【Discussion】: