How can I scrape prices from the next pages? Answers

【Question title】: How can I scrape prices from the next pages?
【Posted】: 2019-11-08 12:15:27
【Question】:

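I am new to Python and web scraping. I wrote some code using requests and BeautifulSoup. One piece scrapes the price, name, and link of each product. It works well and looks like this: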

from bs4 import BeautifulSoup
import requests

urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1"
source = requests.get(urls).text
soup = BeautifulSoup(source, 'lxml')

for figcaption in soup.find_all('figcaption'):
    price = figcaption.div.text
    name = figcaption.find('a', class_='title').text
    link = figcaption.find('a', class_='title')['href']

    print(price)
    print(name)
    print(link)

The other piece builds the URLs of the remaining pages I need to scrape. When I print() them, the URLs it produces are also correct:

x = 0
counter = 1

for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    counter += 1
    x += 1
    print(urls)

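But when I combine the two, so that the code scrapes one page, switches to the next URL, and scrapes that, it just gives me the first page's data 70 times. Please guide me through this. The whole code is as follows: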

from bs4 import BeautifulSoup
import requests

x = 0
counter = 1
for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    source = requests.get(urls).text
    soup = BeautifulSoup(source, 'lxml')
    counter += 1
    x += 1
    print(urls)

    for figcaption in soup.find_all('figcaption'):
        price = figcaption.div.text
        name = figcaption.find('a', class_='title').text
        link = figcaption.find('a', class_='title')['href']

        print(price)
        print()
        print(name)
        print()
        print(link)

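【Comments】:

Secondly, the reason you are not getting the other pages is that you don't have it in your for loop.
I believe I do have them in my for loop. The way the code was displayed on the site was confusing. I'll try to make it clearer...
There is no need to increment x, since it is the loop variable. Also, counter can be dropped entirely: just write urls = "......" + str(x+1)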

Tags: python web-scraping beautifulsoup python-requests

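【Solution 1】:

Setting x = 0 and then adding 1 to it is redundant, because the loop already iterates x over range(0, 70). I'm also not sure why you have counter, since you don't need that either. The code further down shows how you would do it.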
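As a minimal sketch of just that simplification, the page URLs from the question can be built from the loop variable alone. The names `BASE_URL` and `page_urls` are made up for this illustration, and this only tidies the loop; it does not by itself change what the server returns.

```python
# Build the 70 page URLs from the loop variable only; no separate counter is needed.
BASE_URL = (
    "https://www.meisamatr.com/fa/product/cat/"
    "2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html"
    "#/pagesize-24/order-new/stock-1/page-"
)

# range(1, 71) yields the page numbers 1 through 70 directly.
page_urls = [BASE_URL + str(page) for page in range(1, 71)]

print(page_urls[0])
print(page_urls[-1])
```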

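However, I don't think the problem is the iteration or the loop; it is the URL itself. If you manually open the two pages listed below, the content does not change: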

https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1

and

https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-2
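One way to see why these two addresses behave the same: everything after the `#` is a URL fragment, which is handled on the client side and is not sent to the server as part of an HTTP request. The sketch below uses the standard library's `urllib.parse.urldefrag` to split the fragment off; the variable names are chosen for this illustration only. That the site's JavaScript reads the fragment to load the right page is an assumption, consistent with the site being dynamic.

```python
from urllib.parse import urldefrag

BASE = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html"
page_1 = BASE + "#/pagesize-24/order-new/stock-1/page-1"
page_2 = BASE + "#/pagesize-24/order-new/stock-1/page-2"

# urldefrag returns the URL without its fragment, plus the fragment itself.
url_1, fragment_1 = urldefrag(page_1)
url_2, fragment_2 = urldefrag(page_2)

# The part a server would actually receive is identical for both pages.
print(url_1 == url_2)
# Only the fragments differ, and those stay in the client.
print(fragment_1)
print(fragment_2)
```

So a plain HTTP client asking for either address ends up requesting the same document, which matches the symptom of getting the first page 70 times.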

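Because the site loads its content dynamically, you need to find a different way to iterate page by page, or work out the exact URL the page data comes from. So try: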

from bs4 import BeautifulSoup
import requests

for x in range(0, 70):
    try:
        # URL the site itself requests when paging (found via dev tools -> Network -> XHR)
        urls = 'https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html&pagesize[]=24&order[]=new&stock[]=1&page[]=' + str(x + 1) + '&ajax=ok?_=1561559181560'
        source = requests.get(urls).text
        soup = BeautifulSoup(source, 'lxml')

        print('Page: %s' % (x + 1))

        # Each product sits in its own <figcaption> element
        for figcaption in soup.find_all('figcaption'):
            price = figcaption.find('span', {'class': 'new_price'}).text.strip()
            name = figcaption.find('a', class_='title').text
            link = figcaption.find('a', class_='title')['href']

            print('%s\n%s\n%s' % (price, name, link))
    except Exception:
        # Stop at the first failed request or unexpected markup
        break

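You can find that link by opening the site and looking in the developer tools (Ctrl+Shift+I, or right-click and choose "Inspect") -> Network -> XHR.

When I did that and then physically clicked through to the next page, I could see how the data was being rendered and found the reference URL.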
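The reference URL ends in `_=1561559181560`, which is not explained here. As an assumption rather than something confirmed for this site: a 13-digit value like this looks like a Unix timestamp in milliseconds, the kind of cache-busting parameter that jQuery appends to AJAX GET requests when caching is turned off. The sketch below only decodes the number; `decode_cache_buster` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def decode_cache_buster(value_ms: int) -> datetime:
    """Interpret value_ms as milliseconds since the Unix epoch (an assumption)."""
    # Integer division drops the millisecond part and avoids float rounding.
    return datetime.fromtimestamp(value_ms // 1000, tz=timezone.utc)


stamp = decode_cache_buster(1561559181560)
print(stamp.date())
```

If that reading is right, the value only records when the request was made in the browser. Whether the server accepts the URL without it has not been checked here.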

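【Comments】:

Great coding! Thanks. Works like a charm. I don't understand the '&ajax=ok?_=1561559181560' part, though.
Ah, yes. Good question. When I get a chance tomorrow I'll answer/address that and show where it comes from.
@Noshad70, OK, I've added how to find that URL for you. Please be sure to also accept the answer as the solution.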