Scrape HTML data using BeautifulSoup and Python

【Question title】: Scrape html data using beautifulsoup and Python
【Posted】: 2020-06-03 23:24:46
【Question】:

I am trying to scrape school names from the following URL: https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1.

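I want to scrape 10 pages, hence the for loop. I have never used BeautifulSoup before, and the documentation has not resolved my problem. Ultimately, I want to scrape the h2 elements, since that is where the school names are. Below is the small amount of code I have. Any help would be greatly appreciated. Thanks!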

import bs4 as bs
import requests

numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

names = []
for number in numbers:
    resp = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page='+number)
    soup = bs.BeautifulSoup(resp.text, "lxml")
    school_names = soup.find('div', {'class':'"search-results"'})
    for school_name in school_names:
        school = school_name.find('h2')
        if school:
            print (school.text)

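【Comments】:

What is your question/error?
The problem I see is a 403 Forbidden. Could it be caused by the User-Agent?
I added print(resp.text) after the request and got <head><title>403 Forbidden</title></head>, which is your first problem. You need to read up on authenticating with requests. It should go without saying, but if you need more help, do not post your username/password here!
@CCebrian makes a good point. I ran resp = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page='+number, headers={"user-agent":"Mozilla/5.0"}) and got your web page. On to the next problem with the code...
This time I got "Access to this page has been denied because we believe you are using automation tools to browse the website." Ouch! Because it's true! You will need to research how to get around that. In the meantime, you can open the page in a browser, save it, and practice your web scraping on the file.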
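As a rough illustration of practicing on saved HTML, here is a minimal sketch of the parsing step alone. It assumes the school names sit in h2 elements inside a div with class search-results, as the question's code implies; the sample markup and the extract_school_names helper are made up for illustration, and the real page may be structured differently. Note that the class value is written without the extra embedded quotes used in the question's code.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a saved results page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <div class="search-results">
    <div class="card"><h2>Example High School A</h2></div>
    <div class="card"><h2>Example High School B</h2></div>
  </div>
  <h2>Footer heading outside the results</h2>
</body></html>
"""


def extract_school_names(html):
    """Return the text of every <h2> inside the div with class "search-results"."""
    soup = BeautifulSoup(html, "html.parser")
    # The class value has no embedded quotes, unlike '"search-results"'.
    container = soup.find("div", class_="search-results")
    if container is None:
        # Blocked page or a different layout: nothing to extract.
        return []
    # find_all searches all descendants, so nested wrappers are fine.
    return [h2.get_text(strip=True) for h2 in container.find_all("h2")]


names = extract_school_names(SAMPLE_HTML)
print(names)
```

To try it on a page saved from the browser, read the file into a string and pass it in, for example with open("page1.html", encoding="utf-8") as f: names = extract_school_names(f.read()), where page1.html is a hypothetical file name.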

Tags: python html beautifulsoup screen-scraping

【Solution 1】:

Try this, passing headers with the request. Using https://curl.trillworks.com/ as a helper, I got:

import requests

headers = {
    'authority': 'fonts.gstatic.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'sec-fetch-dest': 'font',
    'accept': '*/*',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '_pxhd=120bcbd3ded2e33c1496a0ff505f52a169b1f9c1db59a881c1cd00495b9442ee:62dfdf81-5341-11ea-95d7-e144631f0943; xid=6fef7398-e61d-46d2-be72-ee8e8fecc13d; navigation=%7B%22location%22%3A%7s%22%3A%7B%22colleges%22%3A%22%2Fs%2Findiana%2F%22%2C%22graduate-schools%22%3A%22%2Fs%2Findiana%2F%22%2C%22k12%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-live%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-work%22%3A%22%2Fs%2Findiana%2F%22%7D%7D; experiments=%5E%5E%5E%24%5D; recentlyViewed=entityHistory%7CsearchHistory%7CentityName%7CIndiana%7CentityGuid%7Cad8b4b4c-f8d2-4015-8b22-c0f002a720bb%7CentityType%7CState%7CentityFragment%7Cindiana%5E%5E%5E%240%7C%40%5D%7C1%7C%40%242%7C3%7C4%7C5%7C6%7C7%7C8%7C9%5D%5D%5D; hintSeenLately=second_hint',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'Sec-Fetch-Dest': 'image',
    'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'no-cors',
    'Referer': 'https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1',
    'Accept-Language': 'en-US,en;q=0.9',
    'x-client-data': 'CI+2yQEIorbJAQjBtskBCKmdygEIy67KAQi8sMoBCJa1ygEIm7XKAQjstcoBCI66ygEIsL3KARirpMoB',
    'referer': 'https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700',
    'origin': 'https://www.niche.com',
    'Origin': 'https://www.niche.com',
}

params = (
    ('page', '1'),
)

response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/', headers=headers, params=params)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1', headers=headers)

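This gave me a 200 instead of a 403. Of course, the headers above are verbose (I copied them from the browser); you could use trial and error to see which headers are actually needed to guarantee a 200 OK (my guess is only a few).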
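The trial-and-error idea can be sketched as a greedy loop: drop one header at a time and keep the removal whenever the status is still 200. This is only a sketch under stated assumptions. The minimize_headers and fake_fetch_status names are hypothetical, and the fake server rule (user-agent plus cookie required) is invented so the example runs offline; it says nothing about what the real site checks.

```python
def minimize_headers(fetch_status, headers, ok_status=200):
    """Greedily drop headers that are not needed to get ok_status.

    fetch_status is any callable that takes a headers dict and returns
    an HTTP status code.
    """
    needed = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if fetch_status(trial) == ok_status:
            needed = trial  # still works without this header, so drop it
    return needed


# Stand-in for a real server (an assumption for illustration): it answers
# 200 only when both a user-agent and a cookie header are present.
def fake_fetch_status(headers):
    lowered = {k.lower() for k in headers}
    return 200 if {"user-agent", "cookie"} <= lowered else 403


all_headers = {
    "user-agent": "Mozilla/5.0",
    "cookie": "session=example",
    "pragma": "no-cache",
    "accept-language": "en-US,en;q=0.9",
}
minimal = minimize_headers(fake_fetch_status, all_headers)
print(minimal)
```

Against a live site, fetch_status could be something like lambda h: requests.get(url, headers=h).status_code. Each trial is a real request, so pausing between attempts is advisable, and a site with bot detection may answer inconsistently from one request to the next.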

【Comments】:

【Solution 2】:

The page you are trying to scrape contains a CAPTCHA, which makes collecting data difficult. Take a look at this link:

https://sqa.stackexchange.com/questions/17022/how-to-fill-captcha-using-test-automation

【Comments】: