【Question title】:Web scraping results in 403 Forbidden Error
【Posted】:2018-07-23 04:54:30
【Question】:

I'm trying to use BeautifulSoup to get each company's earnings from SeekingAlpha. However, the site appears to detect that a web scraper is being used. I get "HTTP Error 403: Forbidden".

The page I'm trying to scrape is: https://seekingalpha.com/symbol/AMAT/earnings

Does anyone know what can be done to get around this?

【Question comments】:

    Tags: python python-3.x web-scraping beautifulsoup


    【Solution 1】:

    You should try setting User-Agent as one of the request headers. The value can be that of any known browser.

    Example:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
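
A minimal sketch of how that header might be sent with the requests library. The helper names `browser_headers` and `fetch` and the `timeout` value are illustrative assumptions, not part of the answer, and a site may still block the request for other reasons.

```python
# The User-Agent value is the example string quoted in the answer above.
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
)


def browser_headers(user_agent=USER_AGENT):
    """Return request headers that present the given User-Agent (hypothetical helper)."""
    return {'User-Agent': user_agent}


def fetch(url, timeout=10):
    """Fetch url with the browser-like headers and return the response body."""
    # Third-party library; imported here so browser_headers() works without it.
    import requests

    response = requests.get(url, headers=browser_headers(), timeout=timeout)
    # Raises requests.HTTPError if the server still answers 403 or another error status.
    response.raise_for_status()
    return response.text
```

Calling `fetch('https://seekingalpha.com/symbol/AMAT/earnings')` would then return the page HTML as a string, which can be passed to BeautifulSoup.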

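    【Comments】:

    • How would you do that?
    【Solution 2】:

    I was able to access the site's content through a proxy found here:

    https://free-proxy-list.net/

    Then, building the request with the requests module, you can scrape the site: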

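    import re

    import requests
    from bs4 import BeautifulSoup as soup

    # requests picks a proxy by URL scheme, so this 'http' entry alone is not
    # applied to an https:// URL; an 'https' entry would be needed for that.
    r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', proxies={'http':'50.207.31.221:80'}).text
    # Raw string so that \$ reaches the regex engine unchanged.
    revenues = re.findall(r'Revenue of \$[a-zA-Z0-9.]+', r)
    s = soup(r, 'lxml')
    titles = [span.text for span in s.find_all('span', {'class': 'title-period'})]
    eps = [span.text for span in s.find_all('span', {'class': 'eps'})]
    # Collected but not used in the rows built below.
    deciding = [span.text for span in s.find_all('span', {'class': re.compile('green|red')})]
    # eps is zipped in twice, which is why every output row repeats the EPS string.
    results = [list(row) for row in zip(titles, eps, revenues, eps)]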
    

    Output:

    [[u'Q4: 11-16-17', u'EPS of $0.93 beat by $0.02', u'Revenue of $3.97B', u'EPS of $0.93 beat by $0.02'], [u'Q3: 08-17-17', u'EPS of $0.86 beat by $0.02', u'Revenue of $3.74B', u'EPS of $0.86 beat by $0.02'], [u'Q2: 05-18-17', u'EPS of $0.79 beat by $0.03', u'Revenue of $3.55B', u'EPS of $0.79 beat by $0.03'], [u'Q1: 02-15-17', u'EPS of $0.67 beat by $0.01', u'Revenue of $3.28B', u'EPS of $0.67 beat by $0.01'], [u'Q4: 11-17-16', u'EPS of $0.66 beat by $0.01', u'Revenue of $3.30B', u'EPS of $0.66 beat by $0.01'], [u'Q3: 08-18-16', u'EPS of $0.50 beat by $0.02', u'Revenue of $2.82B', u'EPS of $0.50 beat by $0.02'], [u'Q2: 05-19-16', u'EPS of $0.34 beat by $0.02', u'Revenue of $2.45B', u'EPS of $0.34 beat by $0.02'], [u'Q1: 02-18-16', u'EPS of $0.26 beat by $0.01', u'Revenue of $2.26B', u'EPS of $0.26 beat by $0.01'], [u'Q4: 11-12-15', u'EPS of $0.29  in-line ', u'Revenue of $2.37B', u'EPS of $0.29  in-line '], [u'Q3: 08-13-15', u'EPS of $0.33  in-line ', u'Revenue of $2.49B', u'EPS of $0.33  in-line '], [u'Q2: 05-14-15', u'EPS of $0.29 beat by $0.01', u'Revenue of $2.44B', u'EPS of $0.29 beat by $0.01'], [u'Q1: 02-11-15', u'EPS of $0.27  in-line ', u'Revenue of $2.36B', u'EPS of $0.27  in-line '], [u'Q4: 11-13-14', u'EPS of $0.27  in-line ', u'Revenue of $2.26B', u'EPS of $0.27  in-line '], [u'Q3: 08-14-14', u'EPS of $0.28 beat by $0.01', u'Revenue of $2.27B', u'EPS of $0.28 beat by $0.01'], [u'Q2: 05-15-14', u'EPS of $0.28  in-line ', u'Revenue of $2.35B', u'EPS of $0.28  in-line '], [u'Q1: 02-11-14', u'EPS of $0.23 beat by $0.01', u'Revenue of $2.19B', u'EPS of $0.23 beat by $0.01']]
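
Each row above packs several values into one string. A minimal sketch of splitting them into separate fields follows, assuming the strings keep exactly the phrasing shown in the output ("Q4: 11-16-17", "EPS of $0.93 beat by $0.02", "Revenue of $3.97B"). The function name `parse_row` and the field names are illustrative, and other phrasings on the site would need their own patterns.

```python
import re

# Patterns written against the phrasing seen in the output above (an assumption).
TITLE_RE = re.compile(r'(?P<quarter>Q[1-4]): (?P<date>\d{2}-\d{2}-\d{2})')
EPS_RE = re.compile(
    r'EPS of \$(?P<eps>[\d.]+)\s+(?P<verdict>[A-Za-z -]+?)\s*(?:\$(?P<delta>[\d.]+))?\s*$'
)
REVENUE_RE = re.compile(r'Revenue of \$(?P<amount>[\d.]+)(?P<unit>[A-Za-z]?)')


def parse_row(title, eps_text, revenue_text):
    """Split one scraped row into named fields; return None if any part does not match."""
    t = TITLE_RE.search(title)
    e = EPS_RE.search(eps_text)
    r = REVENUE_RE.search(revenue_text)
    if not (t and e and r):
        return None
    return {
        'quarter': t.group('quarter'),
        'date': t.group('date'),
        'eps': float(e.group('eps')),
        'eps_verdict': e.group('verdict'),
        # "in-line" rows carry no dollar difference, so the delta is None there.
        'eps_delta': float(e.group('delta')) if e.group('delta') else None,
        'revenue': float(r.group('amount')),
        'revenue_unit': r.group('unit'),
    }
```

Applied to each row of `results`, for example `parse_row(row[0], row[1], row[2])`, this yields one dictionary per quarter that can be loaded into a table for analysis.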
    

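    【Comments】:

    • Thanks. The solution is very elegant. I just need to figure out how to get the other information on that page, such as the quarter dates, EPS, and so on.
    • @user172839 What other information are you looking for?
    • I just need all the column information in that table
    • @user172839 The quarters and the EPS?
    • For example, for the first row of that table I want "Q4: 11-16-17, EPS of $0.93 beat by $0.02, revenue of $3.97B (+203%) beat by $30M". Is there an easy way to list them separately? (Sorry, I'm new to Python.) My end goal is to scrape a large number of companies from that list so that I can analyze the results.
    【Solution 3】:

    For anyone using PyQuery: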

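    from pyquery import PyQuery as pq
    import requests  # not called directly; pyquery fetches URLs with requests when it is installed


    # Keyword arguments such as proxies are forwarded to the underlying HTTP call.
    page = pq('https://seekingalpha.com/article/4151372-tesla-fools-media-model-s-model-x-demand', proxies={'http':'34.231.147.235:8080'})
    print(page)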
    
    • (Uses proxy details from https://free-proxy-list.net/)
    • Make sure you are using the Requests library and not urllib. Do not try to load the page with 'urlopen'.
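
The answer does not say why 'urlopen' should be avoided. One plausible reason, stated here as an assumption, is that urllib announces itself with a default User-Agent of the form Python-urllib/3.x, which a site can easily filter. The sketch below inspects that default and builds a request that overrides it. The function names are illustrative.

```python
import urllib.request


def default_urllib_user_agent():
    """Return the User-Agent that urllib's default opener sends."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs attached to every request.
    return dict(opener.addheaders).get('User-agent')


def request_with_user_agent(url, user_agent):
    """Build (but do not send) a urllib request carrying a custom User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})
```

Whether overriding the header is enough for a particular site is not something this sketch can confirm.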

    【Comments】:

    • "Do not try to load the page with 'urlopen'" Why?