【Posted】:2019-04-10 10:08:55
【Question】:
I want to scrape content from the following site:
https://www.morningstar.com/stocks/xnys/mmm/quote.html
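From there I want to click Executive, then Board of Directors, and then scrape the Biography > Profile of each director. Ideally, the end result would contain the biography of each of the 12 board members.
I am trying to do this with BeautifulSoup, but I cannot reach that nested div.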
from bs4 import BeautifulSoup as soup
import re, time
import csv
from selenium import webdriver
def get_directors(_html):
    _names = [i.text for i in soup(_html, 'html.parser').find_all('div', {'class':'name ng-binding'})]
    return _names[_names.index('Compensation for all Key Executives')+1:-1]
_board = {}
d = webdriver.Chrome('/Users/tS0u/Downloads/chromedriver')
d.get('https://www.morningstar.com/stocks/xnys/mmm/quote.html')
time.sleep(5)
_exec = d.find_elements_by_class_name("mds-button")
_exec[8].click()
time.sleep(3)
d.find_element_by_link_text("Board of Directors").click()
time.sleep(3)
full_directors = d.find_elements_by_class_name('person-row')[19:31]
for _name, _link in zip(get_directors(d.page_source), full_directors):
    _link.click()
    time.sleep(3)
    d.find_element_by_link_text("Profile").click()
    time.sleep(3)
    _board[_name] = soup(d.page_source, 'html.parser').find_all('div', {'class':'biography'})[-1].text
    _link.click()
    time.sleep(3)
print(_board)
with open('filename.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([['name', 'biography'], *map(list, _board.items())])
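The last step of the script above writes the name-to-biography dictionary to a CSV file. As a minimal sketch of that step in isolation: the function name `write_biographies` and the sample data in the usage note are hypothetical, and opening the file with `newline=''` follows the recommendation in the `csv` module documentation rather than anything in the original script.

```python
import csv


def write_biographies(board, path):
    """Write a {name: biography} mapping to `path` as a two-column CSV."""
    # newline='' leaves line endings to the csv module, which avoids
    # blank rows on Windows and keeps multi-line biographies intact.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "biography"])  # header row
        writer.writerows(board.items())         # one (name, biography) pair per row
```

For example, `write_biographies({"Jane Doe": "Line one,\nline two"}, "filename.csv")` produces a header row followed by one quoted record; the comma and the line break inside the biography are preserved because the writer quotes the field.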
Using selenium and following @Ajax1234's approach, I get the following error.
Traceback (most recent call last):
  File "/Users/tS0u/Desktop/morningstar_stackoverflowanswer.py", line 21, in <module>
    d.find_element_by_link_text("Profile").click()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
    return self._parent.execute(command, params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error:
Element <a href="#" ng-click="subTab.tabSelect(tabItem, $event, item)"
data-linkbinding="profile" class="ng-binding" label-
short="...">Profile</a> is not clickable at point (57, 595). Other
element would receive the click: <div id="_evidon_banner"
class="evidon-banner" style="position: fixed; display: flex; align-
items: center; width: 100%; background: rgb(239, 239, 239); font-size:
14px; color: rgb(0, 0, 0); z-index: 2147000001; padding: 10px 0px;
font-family: UniversNextMorningStarW04, Arial, Helvetica, sans-serif;
border-top: 2px solid rgb(153, 153, 153); bottom: 0px;">...</div>
(Session info: chrome=70.0.3538.77)
(Driver info: chromedriver=2.43.600229
(3fae4d0cda5334b4f533bede5a4787f7b832d052),platform=Mac OS X 10.12.6 x86_64)
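The message above says that the click at point (57, 595) would land on the fixed-position `_evidon_banner` div instead of the Profile link. One possible workaround, sketched here under assumptions, is to remove that overlay through JavaScript before clicking, or to trigger the click through JavaScript so the covered position does not matter. The helper names `remove_overlay` and `js_click` are hypothetical; the only WebDriver method they rely on is `execute_script`, and whether the site's Angular handlers react to a scripted click has not been verified.

```python
def remove_overlay(driver, element_id):
    """Remove the element with the given id from the page, if it exists.

    `driver` is a Selenium WebDriver; only `execute_script` is used.
    Returns True if an element was removed, False otherwise.
    """
    script = (
        "var el = document.getElementById(arguments[0]);"
        "if (el) { el.parentNode.removeChild(el); return true; }"
        "return false;"
    )
    return bool(driver.execute_script(script, element_id))


def js_click(driver, element):
    """Click `element` via JavaScript, skipping the browser's check
    that nothing else covers the element at the click position."""
    driver.execute_script("arguments[0].click();", element)


# Intended use with the script above (not executed here):
#   remove_overlay(d, "_evidon_banner")
#   d.find_element_by_link_text("Profile").click()
# or, leaving the banner in place:
#   js_click(d, d.find_element_by_link_text("Profile"))
```

Removing the banner keeps the later clicks as ordinary user-like clicks, while `js_click` is the blunter option because it also bypasses visibility checks that can catch real problems.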
Error when exporting to csv:
Traceback (most recent call last):
  File "/Users/tS0u/Desktop/morningstar_stackoverflowanswer.py", line 22, in <module>
    d.find_element_by_link_text("Profile").click()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
    return self._parent.execute(command, params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error:
Element <a href="#" ng-click="subTab.tabSelect(tabItem, $event, item)"
data-linkbinding="profile" class="ng-binding" label-
short="...">Profile</a> is not clickable at point (57, 595). Other
element would receive the click: <div id="_evidon_banner"
class="evidon-banner" style="position: fixed; display: flex; align-
items: center; width: 100%; background: rgb(239, 239, 239); font-size:
14px; color: rgb(0, 0, 0); z-index: 2147000001; padding: 10px 0px;
font-family: UniversNextMorningStarW04, Arial, Helvetica, sans-serif;
border-top: 2px solid rgb(153, 153, 153); bottom: 0px;">...</div>
Either way, I really appreciate you taking the time to look at my problem.
【Comments】:
-
If the content you want to scrape is triggered by JavaScript actions, you will probably need a scraper or selenium to perform the clicks and similar interactions.
-
Just checked: you need to use scrapy or selenium to scrape this. This should help you get started: medium.com/@hoppy/…
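As the comments point out, the content is produced by JavaScript after each click, so it may not be in the DOM the moment the click returns. A minimal sketch of a polling helper that could stand in for the fixed `time.sleep` calls: `wait_until` and its `timeout` and `poll` parameters are hypothetical names, and the helper is hand-rolled with the standard library only. Selenium ships its own `WebDriverWait` for this purpose, which is probably the better choice in a real script.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Intended use with a WebDriver `d` (not executed here):
#   rows = wait_until(lambda: d.find_elements_by_class_name('person-row'))
# find_elements_* returns an empty list while nothing matches, which is
# falsy, so polling continues until at least one row has been rendered.
```

Polling returns as soon as the content appears instead of always waiting the full sleep interval, and it fails loudly when the content never shows up.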
Tags: python web-scraping beautifulsoup