【Title】: Is there an easier way to use a dictionary to scrape multiple webpages?
【Posted】: 2016-06-21 19:37:57
【Question】:

I'm trying to scrape Yellow Pages using requests. I know I don't need to log in to get the data on these pages, but I want the practice of logging in to a website.

Is there a way to use `s.get()` to scrape multiple URLs at once? This is how my code is currently laid out, but it seems like there should be a simpler way, so that I don't have to write five extra lines of code every time I want to add a new page.

This code works for me, but it seems far too long.

import requests
from bs4 import BeautifulSoup
import requests.cookies

s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true"
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

csrf = soup.find("input", value=True)["value"]

USERNAME = 'myusername'
PASSWORD = 'mypassword'

cj = s.cookies
requests.utils.dict_from_cookiejar(cj)

login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)
s.post(url, data=login_data, headers={'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"})

targeted_page = s.get('http://m.yp.com/search?search_term=restaurants&search_type=category', cookies=cj)

targeted_soup = BeautifulSoup(targeted_page.content, "lxml")

targeted_soup.prettify()

for record in targeted_soup.findAll('div'):
    print(record.text)

targeted_page_2 = s.get('http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA', cookies=cj)

targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

targeted_soup_2.prettify()

for data in targeted_soup_2.findAll('div'):
    print(data.text)

When I try to use a dictionary like this, I get a traceback I don't understand.

import requests
from bs4 import BeautifulSoup
import requests.cookies

s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true"
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

csrf = soup.find("input", value=True)["value"]

USERNAME = 'myusername'
PASSWORD = 'mypassword'

login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)
s.post(url, data=login_data, headers={'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"})

targeted_pages = {'http://m.yp.com/search?search_term=restaurants&search_type=category',
                  'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA'
                  }
targeted_page = s.get(targeted_pages)

targeted_soup = BeautifulSoup(targeted_page.content, "lxml")

targeted_soup.prettify()

for record in targeted_soup.findAll('div'):
    print(record.text)

targeted_page_2 = s.get('http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA')

targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")

targeted_soup_2.prettify()

The error:

raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '{'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA', 'http://m.yp.com/search?search_term=restaurants&search_type=category'}'

I'm new to Python and the requests module, and I don't understand why using a dictionary in this format doesn't work. Thanks for any input.

【Discussion】:

    Tags: python-3.x dictionary web-scraping beautifulsoup python-requests


    【Solution 1】:

    First of all, you have a *set*, not a dict. If you want to request each URL, you need to iterate over it: `requests.get` takes a single URL as its first argument, not a set or any other iterable of URLs:

    targeted_pages = {'http://m.yp.com/search?search_term=restaurants&search_type=category',
                      'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA'
                      }
    for target in targeted_pages:
        targeted_page = s.get(target)
        targeted_soup = BeautifulSoup(targeted_page.content, "lxml")
        for record in targeted_soup.findAll('div'):
            print(record.text)
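Since the question asks about using a dictionary: an actual dict (label to URL) also works here and keeps each result labelled. A minimal sketch, where the `restaurants`/`gas_stations` keys and the `scrape_divs` helper name are illustrative, not from the original code:

```python
import requests
from bs4 import BeautifulSoup

# The two search URLs from the question, keyed by an illustrative label.
targeted_pages = {
    'restaurants': 'http://m.yp.com/search?search_term=restaurants&search_type=category',
    'gas_stations': 'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA',
}

def scrape_divs(session, pages, parser="lxml"):
    """Fetch each labelled URL and return {label: [text of each div]}."""
    results = {}
    for label, url in pages.items():
        page = session.get(url)
        soup = BeautifulSoup(page.content, parser)
        results[label] = [div.text for div in soup.find_all('div')]
    return results

# Usage (makes real network requests):
# s = requests.Session()
# for label, texts in scrape_divs(s, targeted_pages).items():
#     print(label, len(texts))
```

Adding a new page is then one more line in the dict rather than five more lines of code.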
    

    【Comments】:

    • @user6326823, no worries. Just pass the headers with each request so that you're sending the User-Agent, etc.
    • OK, cool! Thanks for the help. I do have another question: it seems like it logs me in and out every time I go to a new link, because when I print(s.cookies) it prints two identical-looking lines of cookies. Does that mean it's logging me in and out? I just want to make sure I don't look like a bot. I'll accept this answer since it works well.
    • Yes, using a Session should keep you logged in, but it really depends on the server.
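On the comment about passing the headers with each request: a `requests.Session` can also carry them for you, so the User-Agent rides along on every subsequent request automatically. A minimal sketch of both options:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

s = requests.Session()

# Option 1: attach the headers to the session once; every s.get()/s.post()
# made afterwards sends them automatically.
s.headers.update(headers)

# Option 2: pass them explicitly on an individual request instead:
# s.get(url, headers=headers)
```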