【Title】: Web scraping nested divs with BeautifulSoup
【Posted】: 2019-04-10 10:08:55
【Question】:

I want to scrape content from the following site:

https://www.morningstar.com/stocks/xnys/mmm/quote.html

From there I want to click Executive, then Board of Directors, and then scrape the biography from each director's member profile. Ideally, the end result would contain a biography for each of the 12 members of the board.

I am trying to do this with BeautifulSoup, but I cannot access that nested div.
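For reference, once the rendered HTML is in hand, pulling a nested div with BeautifulSoup is straightforward — `find_all` searches the whole tree, not just the top level. A minimal sketch with a static snippet standing in for the live page (the real page is rendered by JavaScript, which is why `requests` alone never sees these divs):

```python
from bs4 import BeautifulSoup

# Static snippet standing in for the JavaScript-rendered page
html = """
<div class="person-row">
  <div class="name ng-binding">Jane Doe</div>
  <div class="biography">Biography of Jane Doe.</div>
</div>
"""

page = BeautifulSoup(html, 'html.parser')
# find_all matches divs at any nesting depth; a multi-word class string
# is matched against the element's exact class attribute value
names = [d.text for d in page.find_all('div', {'class': 'name ng-binding'})]
bios = [d.text.strip() for d in page.find_all('div', {'class': 'biography'})]
print(names, bios)  # ['Jane Doe'] ['Biography of Jane Doe.']
```

So the parsing itself is not the obstacle — the obstacle is that the divs are not present in the HTML until the browser has executed the page's JavaScript.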

from bs4 import BeautifulSoup as soup
import re, time
import csv
from selenium import webdriver

def get_directors(_html):
    _names = [i.text for i in soup(_html, 'html.parser').find_all('div', {'class': 'name ng-binding'})]
    return _names[_names.index('Compensation for all Key Executives')+1:-1]

_board = {}
d = webdriver.Chrome('/Users/tS0u/Downloads/chromedriver')
d.get('https://www.morningstar.com/stocks/xnys/mmm/quote.html')
time.sleep(5)
_exec = d.find_elements_by_class_name("mds-button")
_exec[8].click()
time.sleep(3)
d.find_element_by_link_text("Board of Directors").click()
time.sleep(3)
full_directors = d.find_elements_by_class_name('person-row')[19:31]
for _name, _link in zip(get_directors(d.page_source), full_directors):
    _link.click()
    time.sleep(3)
    d.find_element_by_link_text("Profile").click()
    time.sleep(3)
    _board[_name] = soup(d.page_source, 'html.parser').find_all('div', {'class': 'biography'})[-1].text
    _link.click()
    time.sleep(3)
    print(_board)
    with open('filename.csv', 'w') as f:
        write = csv.writer(f)
        write.writerows([['name', 'biography'], *map(list, _board.items())])

Using selenium and following @Ajax1234's answer, I get the following error:

Traceback (most recent call last):
File "/Users/tS0u/Desktop/morningstar_stackoverflowanswer.py", line 21, in <module>
d.find_element_by_link_text("Profile").click()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: 
Element <a href="#" ng-click="subTab.tabSelect(tabItem, $event, item)" 
data-linkbinding="profile" class="ng-binding" label- 
short="...">Profile</a> is not clickable at point (57, 595). Other 
element would receive the click: <div id="_evidon_banner" 
class="evidon-banner" style="position: fixed; display: flex; align- 
items: center; width: 100%; background: rgb(239, 239, 239); font-size: 
14px; color: rgb(0, 0, 0); z-index: 2147000001; padding: 10px 0px; 
font-family: UniversNextMorningStarW04, Arial, Helvetica, sans-serif; 
border-top: 2px solid rgb(153, 153, 153); bottom: 0px;">...</div>
(Session info: chrome=70.0.3538.77)
(Driver info: chromedriver=2.43.600229 
(3fae4d0cda5334b4f533bede5a4787f7b832d052),platform=Mac OS X 10.12.6 x86_64)

The error when exporting the csv:

Traceback (most recent call last):
File "/Users/tS0u/Desktop/morningstar_stackoverflowanswer.py", line 22, in <module>
d.find_element_by_link_text("Profile").click()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: 
Element <a href="#" ng-click="subTab.tabSelect(tabItem, $event, item)" 
data-linkbinding="profile" class="ng-binding" label- 
short="...">Profile</a> is not clickable at point (57, 595). Other 
element would receive the click: <div id="_evidon_banner" 
class="evidon-banner" style="position: fixed; display: flex; align- 
items: center; width: 100%; background: rgb(239, 239, 239); font-size: 
14px; color: rgb(0, 0, 0); z-index: 2147000001; padding: 10px 0px; 
font-family: UniversNextMorningStarW04, Arial, Helvetica, sans-serif; 
border-top: 2px solid rgb(153, 153, 153); bottom: 0px;">...</div>
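Both tracebacks say the same thing: the Profile link is covered by the cookie-consent banner (`_evidon_banner`), so the click lands on the banner instead. A common workaround is to remove the overlay with JavaScript before clicking — a sketch, assuming the banner id from the traceback above and a helper name of my own:

```python
def dismiss_banner(driver, banner_id='_evidon_banner'):
    """Remove an overlay element so it can no longer intercept clicks.

    `driver` is any object exposing selenium's execute_script; the
    banner id is the one reported in the traceback.
    """
    driver.execute_script(
        "var e = document.getElementById(arguments[0]);"
        "if (e) { e.remove(); }",
        banner_id,
    )
```

Calling `dismiss_banner(d)` once after `d.get(...)` should be enough. An alternative is to click the element via JavaScript — `d.execute_script('arguments[0].click();', element)` — which ignores overlapping elements entirely.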

Either way, I greatly appreciate you taking the time to look at my problem.

【Comments】:

  • If the content you want to scrape is triggered by javascript actions, you will probably need a scraper framework or selenium to perform the clicks.
  • Just looked it up — you will need to use scrapy or selenium to scrape it. This should get you started: medium.com/@hoppy/…

Tags: python web-scraping beautifulsoup


【Solution 1】:

The site is dynamic, so you have to drive a real browser with a tool such as selenium:

from bs4 import BeautifulSoup as soup
import re, time
from selenium import webdriver

def get_directors(_html):
    _names = [i.text for i in soup(_html, 'html.parser').find_all('div', {'class': 'name ng-binding'})]
    return _names[_names.index('Compensation for all Key Executives')+1:-1]

_board = {}
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.morningstar.com/stocks/xnys/mmm/quote.html')
time.sleep(5)
_exec = d.find_elements_by_class_name("mds-button")
_exec[8].click()
time.sleep(3)
d.find_element_by_link_text("Board of Directors").click()
time.sleep(3)
full_directors = d.find_elements_by_class_name('person-row')[19:31]
for _name, _link in zip(get_directors(d.page_source), full_directors):
    _link.click()
    time.sleep(3)
    d.find_element_by_link_text("Profile").click()
    time.sleep(3)
    _board[_name] = soup(d.page_source, 'html.parser').find_all('div', {'class': 'biography'})[-1].text
    _link.click()
    time.sleep(3)

print(_board)

Output (shortened to save space):

{'Inge G. Thulin': '\nBiography\n\n                Mr. Thulin is the Chairman of the Board, President and Chief Executive Officer of 3M Company. Mr. Thulin served as President and Chief Executive Officer of 3M Company from ....', 'Sondra L. Barbour': '\nBiography\n\n                Ms. Barbour is Executive Vice President, Information Systems and Global Solutions, Lockheed Martin Corporation, a high technology aerospace and defense company. Since joini....', 'Thomas K. Brown': '\nBiography\n\n                Mr. Brown is the Retired Group Vice President, Global Purchasing, Ford Motor Company, a global automotive industry leader. Mr. Brown served in various leadership capacities....', 'David B. Dillon': '\nBiography\n\n                —\n            \n....', 'Michael L Eskew': '\nBiography\n\n                Mr. Eskew is the Retired Chairman of the Board and Chief Executive Officer, United Parcel Service, Inc., a provider of specialized transportation and logistics services. Mr....', 'Herbert L. Henkel': '\nBiography\n\n                Mr. Henkel is the Retired Chairman of the Board and Chief Executive Officer, Ingersoll-Rand plc, a manufacturer of industrial products and components. Mr. Henkel retired as....', 'Amy Hood': "\nBiography\n\n                On August 13, 2017, the Board of Directors of 3M Company elected Amy E. Hood to the Company's Board of Directors, effective August 13, 2017. At Microsoft, Hood is responsib....", 'Muhtar Kent': "\nBiography\n\n                Mr. Kent is the Chairman of the Board and Chief Executive Officer, The Coca-Cola Company, the world's largest beverage company. Mr. Kent has held the position of Chairman o....", 'Edward M. Liddy': '\nBiography\n\n                Mr. Liddy is the Retired Chairman of the Board and Chief Executive Officer, The Allstate Corporation, and former Partner at Clayton, Dubilier & Rice, LLC, a private equity ....', 'Dambisa F. 
Moyo': "\nBiography\n\n                On August 12, 2018, the Board of Directors of 3M Company elected Dambisa F. Moyo to the Company's Board of Directors, effective August 12, 2018. Dr. Moyo is the founder and....", 'Gregory R. Page': "\nBiography\n\n                On February 1, 2016, the Board of Directors of 3M Company elected Gregory R. Page to the Company's Board of Directors, effective February 1, 2016. Page previously was Cargi....", 'Patricia A. Woertz': "\nBiography\n\n                On February 1, 2016, the Board of Directors of 3M Company elected Patricia A. Woertz to the Company's Board of Directors, effective at the close of business on February 2, ...."}

Edit:

To write the results to a csv:

import csv
with open('filename.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerows([['name', 'biography'], *map(list, _board.items())])
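As a quick sanity check of the round trip (with hypothetical sample data standing in for the scraped `_board` dict — opening with `newline=''` avoids blank rows on Windows):

```python
import csv

# Hypothetical sample standing in for the scraped _board dict
_board = {'Jane Doe': 'Biography of Jane Doe.'}

with open('filename.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['name', 'biography'], *map(list, _board.items())])

# Read the file back to confirm the header row and one row per director
with open('filename.csv', newline='') as f:
    rows = list(csv.reader(f))

print(rows)  # [['name', 'biography'], ['Jane Doe', 'Biography of Jane Doe.']]
```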

To create a more general solution that can handle different urls (perhaps built from the contents of a list):

def scrape_bios(_driver:webdriver, _url:str) -> dict:
    _driver.get(_url)
    time.sleep(5)
    _exec = _driver.find_elements_by_class_name("mds-button")
    _exec[8].click()
    time.sleep(3)
    _board = {}
    _driver.find_element_by_link_text("Board of Directors").click()
    time.sleep(3)
    full_directors = _driver.find_elements_by_class_name('person-row')[19:31]
    for _name, _link in zip(get_directors(_driver.page_source), full_directors):
        _link.click()
        time.sleep(3)
        _driver.find_element_by_link_text("Profile").click()
        time.sleep(3)
        _board[_name] = soup(_driver.page_source, 'html.parser').find_all('div', {'class': 'biography'})[-1].text
        _link.click()
        time.sleep(3)
    return _board

Now you can iterate over a list of urls:

d = webdriver.Chrome('/path/to/chromedriver')
for url in urls:
  _results = scrape_bios(d, url)
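Note that `_results` is overwritten on each pass of that loop; to keep everything, accumulate the per-url results into one dict keyed by url and flatten at the end. A sketch of just the bookkeeping, with hypothetical sample data standing in for live `scrape_bios` calls:

```python
# Hypothetical sample data standing in for scrape_bios(d, url) results
urls = [
    'https://www.morningstar.com/stocks/xnys/mmm/quote.html',
    'https://www.morningstar.com/stocks/xnys/ibm/quote.html',
]
fake_results = {
    urls[0]: {'Jane Doe': 'Bio A'},
    urls[1]: {'John Roe': 'Bio B'},
}

all_bios = {}
for url in urls:
    all_bios[url] = fake_results[url]  # in the real run: scrape_bios(d, url)

# Flatten to rows suitable for a single csv with a url column
rows = [[url, name, bio]
        for url, board in all_bios.items()
        for name, bio in board.items()]
print(rows)
```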

【Discussion】:
