【Question title】: Get url of link using Python web scraping; requests, requests_html, selenium
【Posted】: 2020-10-28 02:55:13
【Question】:

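I'm new to web scraping and I'm having trouble getting a link from the USGS earthquake "Did You Feel It?" (DYFI) data. The URL I'm trying to get data from is: https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity

I'm trying to automate collecting this data so that I don't have to gather it by hand after every earthquake. The URL of the data I'm trying to pull is consistent except for the earthquake ID, which I have, and a number that doesn't seem to be related to anything, so I figured I could get the URL with web scraping.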
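
To make the "consistent except for the ID and a number" observation concrete, here is a minimal sketch that splits the archive URL quoted at the end of this question into its parts. The pattern is inferred from that single URL, the group names are my own labels, and reading the number as a millisecond Unix timestamp is an assumption, not something the page states.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from one archive URL in this question; the group names
# (product, event_id, source, stamp, filename) are illustrative labels.
PRODUCT_URL = re.compile(
    r"^https://earthquake\.usgs\.gov/archive/product/"
    r"(?P<product>[^/]+)/(?P<event_id>[^/]+)/(?P<source>[^/]+)/"
    r"(?P<stamp>\d+)/(?P<filename>[^/]+)$"
)


def split_product_url(url):
    """Split an archive URL into its parts; return None if it does not match."""
    match = PRODUCT_URL.match(url)
    if match is None:
        return None
    parts = match.groupdict()
    # Assumption: the number is milliseconds since the Unix epoch.
    parts["stamp_utc"] = datetime.fromtimestamp(
        int(parts["stamp"]) / 1000, tz=timezone.utc
    )
    return parts


parts = split_product_url(
    "https://earthquake.usgs.gov/archive/product/dyfi/us7000biji/us/"
    "1601214674370/dyfi_geo_10km.geojson"
)
print(parts["event_id"], parts["stamp"], parts["stamp_utc"].isoformat())
```

If that reading is right, the number changes whenever the product is updated, which would explain why it cannot be derived from the earthquake ID alone and has to be looked up.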

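If you look at the page, there is a drop-down menu called Downloads with the different data products. I'm trying to get the URL for DYFI Geospatial Data, UTM aggregated (10 km spacing), so that I can pull the geojson file with curl.

I don't know much about web scraping or HTML, and most of what I've tried is based on things I found here and on YouTube.

My attempts:

I tried using requests to get the HTML and parsing it with Beautiful Soup, but the page is dynamically generated, so the HTML that comes back doesn't contain what I'm looking for.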

import requests
import bs4 #beautiful soup

res = requests.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(link)

This outputs three links, but not the one I need:

<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and Web Services</a>
<a href="https://angular.io/guide/browser-support">view supported
            browsers</a>
<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and
            Web Services</a>

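I think the USGS site uses JavaScript to populate the Downloads drop-down menu, which is why the plain requests approach doesn't work, so I thought I'd try Selenium. I was hoping it would give me the HTML I can see with the Inspect Element tool, but I haven't had any luck.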

from selenium import webdriver
path = "/Users/jon/Desktop/selenium_webdriver/chromedriver" #path to chromedriver on my machine
driver = webdriver.Chrome(executable_path=path)
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
html_eq = driver.page_source
soup = bs4.BeautifulSoup(html_eq, 'html.parser')
for link in soup.find_all('a'):
    print(link) 

This outputs more links than my first attempt, but still not the one I'm looking for. Here is the output of my Selenium attempt:

<a _ngcontent-fgi-c8="" class="hazdev-site-logo" href="/" title="U.S. Geological Survey"><img _ngcontent-fgi-c8="" alt="U.S. Geological Survey logo" src="assets/usgs-logo.svg"/></a>
<a _ngcontent-fgi-c8="" class="hazdev-jumplink-navigation" href="#site-sectionnav">Jump to Navigation</a>
<a _ngcontent-fgi-c5="" class="up-one-level ng-star-inserted" href="/earthquakes/map/" templatesidenavigation=""> Latest Earthquakes </a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/executive" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Overview </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Interactive Map </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/region-info" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Regional Information </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Impact </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/tellus" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Felt Report - Tell Us! </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted active-link" href="/earthquakes/eventpage/us7000bi0e/dyfi" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Did You Feel It? </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/technical" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Technical </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/origin" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Origin </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/waveforms" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Waveforms </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/feed/v1.0/detail/us7000bi0e.kml" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Download Event KML </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/map/#%7B%22autoUpdate%22%3Afalse%2C%22basemap%22%3A%22terrain%22%2C%22event%22%3A%22us7000bi0e%22%2C%22feed%22%3A%22us7000bi0e%22%2C%22mapposition%22%3A%5B%5B6.104279985601153%2C-85.06432001439885%5D%2C%5B10.603920014398849%2C-80.56467998560115%5D%5D%2C%22search%22%3A%7B%22id%22%3A%22us7000bi0e%22%2C%22isSearch%22%3Atrue%2C%22name%22%3A%22Search%20Results%22%2C%22params%22%3A%7B%22endtime%22%3A%222020-09-25T17%3A46%3A43.975Z%22%2C%22latitude%22%3A8.3541%2C%22longitude%22%3A-82.8145%2C%22maxradiuskm%22%3A250%2C%22minmagnitude%22%3A2%2C%22starttime%22%3A%222020-08-14T17%3A46%3A43.975Z%22%7D%7D%7D" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> View Nearby Seismicity </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Earthquakes </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/hazards/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Hazards </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/data/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Data &amp; Products </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/learn/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Learn </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/monitoring/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Monitoring </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/research/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Research </div></a>
<a _ngcontent-fgi-c18="" class="tell-us-link" href="/earthquakes/eventpage/us7000bi0e/tellus" queryparamshandling="preserve"> Felt Report - Tell Us! </a>
<a _ngcontent-fgi-c22=""> View all dyfi products (1 total) </a>
<a _ngcontent-fgi-c20="" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity"> US </a>
<a _ngcontent-fgi-c18="" aria-current="true" aria-disabled="false" class="mat-tab-link ng-star-inserted mat-tab-label-active" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/zip" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> ZIP Map </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity-vs-distance" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity Vs. Distance </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses-vs-time" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Responses Vs. Time </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> DYFI Responses </a>
<a _ngcontent-fgi-c28="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map?dyfi-responses-10km=true&amp;shakemap-intensity=false"><img _ngcontent-fgi-c28="" alt="DYFI intensity map" src="https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/us7000bi0e_ciim_geo.jpg"/></a>
<a _ngcontent-fgi-c23="" href="/earthquakes/eventpage/us7000bi0e">Overview</a>
<a _ngcontent-fgi-c32="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact"> Impact Summary </a>
<a _ngcontent-fgi-c18="" href="https://earthquake.usgs.gov/data/dyfi/">Scientific Background for Did You Feel It?</a>
<a href="https://earthquake.usgs.gov/data/comcat/contributor/us/">USGS National Earthquake Information Center, PDE</a>
<a _ngcontent-fgi-c7="" href="/data/comcat/"> ANSS Comprehensive Earthquake Catalog (ComCat) Documentation </a>
<a _ngcontent-fgi-c7="" href="/data/comcat/data-eventterms.php"> Technical terms used on event pages </a>
<a _ngcontent-fgi-c11="" href="mailto:lisa%2Behpweb@usgs.gov">Questions or comments?</a>
<a _ngcontent-fgi-c11="" class="facebook ng-star-inserted" href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Facebook">Facebook</a>
<a _ngcontent-fgi-c11="" class="twitter ng-star-inserted" href="https://twitter.com/intent/tweet?url=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity&amp;text=USGS%20%7C%20M 5.3 - 1 km NNW of Manaca Norte, Panama" title="Share using Twitter">Twitter</a>
<a _ngcontent-fgi-c11="" class="email ng-star-inserted" href="mailto:lisa%2Behpweb@usgs.gov?to=&amp;subject=M 5.3 - 1 km NNW of Manaca Norte, Panama&amp;body=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Email">Email</a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/"> Home </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/aboutus/"> About Us </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/contactus/"> Contacts </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/legal.php"> Legal </a>

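I found a YouTube tutorial on web scraping with requests_html that I thought might be useful: https://www.youtube.com/watch?v=MeBU-4Xs2RU I can get the example he gives in the video, which works with a beer website, to run, but I haven't been able to apply it to my situation.

Here is the code I tried: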

from requests_html import HTMLSession

url_usgs = 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity'

s = HTMLSession()  # create the session that the calls below use
r_usgs = s.get(url_usgs)

r_usgs.html.render(sleep=1)

downloads = r_usgs.html.xpath('//*[@id="mat-expansion-panel-header-0"]', first=True)
print(downloads.absolute_links)

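This doesn't return anything. I don't know HTML, so I may have picked the XPath of the wrong element.

If anyone has any ideas on how I can get the URL of the 10 km DYFI data from the Downloads menu (https://earthquake.usgs.gov/archive/product/dyfi/us7000biji/us/1601214674370/dyfi_geo_10km.geojson), or can point me toward more in-depth material on web scraping, I would greatly appreciate it.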
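
One possible route that avoids rendering the page at all: the Selenium output above includes a link to /earthquakes/feed/v1.0/detail/us7000bi0e.kml, and the sketch below assumes the same path with a .geojson extension returns a JSON description of the event that lists each product's files. Both the URL template and the nested layout (properties, then products, then a per-product contents dictionary keyed by file name) are assumptions to check against a real response, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: same path as the KML link on the page, with a .geojson extension.
DETAIL_URL = (
    "https://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/{event_id}.geojson"
)


def fetch_event_detail(event_id):
    """Download and decode the event detail document (needs network access)."""
    with urlopen(DETAIL_URL.format(event_id=event_id), timeout=30) as response:
        return json.load(response)


def find_content_url(detail, product_type, filename):
    """Return the URL of `filename` in the first matching product, else None.

    Assumed layout: detail["properties"]["products"][product_type] is a list
    of products, each holding a "contents" dict keyed by file name.
    """
    products = (
        detail.get("properties", {}).get("products", {}).get(product_type, [])
    )
    for product in products:
        entry = product.get("contents", {}).get(filename)
        if entry and "url" in entry:
            return entry["url"]
    return None
```

Under those assumptions, `find_content_url(fetch_event_detail("us7000biji"), "dyfi", "dyfi_geo_10km.geojson")` would return the address to hand to curl, or None if the event has no such file.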

【Question comments】:

    Tags: python selenium-webdriver web-scraping python-requests python-requests-html


    【Answer 1】:

    You need to click the "Downloads" menu to expand its content.

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    import time
    
    
    driver = webdriver.Chrome()
    driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
    
    # Get a reference to the download menu. This will run before the page has
    # finished loading, so we stick it in a while loop and just keep looping
    # until we're successful.
    while True:
        try:
            download_menu = driver.find_element(By.ID, 'mat-expansion-panel-header-0')
        except NoSuchElementException:
            time.sleep(0.2)
            continue
        else:
            break
    
    # Click on the download menu to expand the content.
    download_menu.click()
    
    # Wait the same way for the expanded panel that holds the download links.
    while True:
        try:
            downloads = driver.find_element(By.ID, 'cdk-accordion-child-0')
        except NoSuchElementException:
            time.sleep(0.2)
            continue
        else:
            break
    
    # Collect every link in the panel whose visible text mentions GeoJSON.
    links = downloads.find_elements(By.CSS_SELECTOR, 'a')
    geojson = [link for link in links if 'geojson' in link.text.lower()]
    
    for link in geojson:
        print(link.text, ':', link.get_attribute('href'))
    
    
    driver.close()
    

    This produces:

    GEOJSON 645.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_zip.geojson
    GEOJSON 844.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_1km.geojson
    GEOJSON 1.0 KB : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_10km.geojson
    

    ...and of course you can inspect the value of the href attribute to find the 10 km data (by looking for the link whose href contains 10km).
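
That last filtering step can be sketched without a browser, using the three hrefs printed above as input. The helper name `pick_href` and its `marker` parameter are illustrative, not part of Selenium.

```python
from urllib.parse import urlparse


def pick_href(hrefs, marker="10km"):
    """Return the first href whose file name contains `marker`, or None."""
    for href in hrefs:
        # Look only at the last path segment so the marker cannot match
        # an event ID or directory name by accident.
        filename = urlparse(href).path.rsplit("/", 1)[-1]
        if marker in filename:
            return href
    return None


base = "https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/"
hrefs = [
    base + "dyfi_zip.geojson",
    base + "dyfi_geo_1km.geojson",
    base + "dyfi_geo_10km.geojson",
]
print(pick_href(hrefs))
```

In the Selenium script, the list would come from calling `link.get_attribute('href')` on each element in `geojson`.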

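The two `while True` loops in the script never give up, so a changed element id would make it hang forever. A minimal sketch of the same retry idea with a deadline follows. The helper `poll` and its parameters are hypothetical names, and Selenium's own `WebDriverWait` covers the same need if you prefer a built-in.

```python
import time


def poll(action, retry_on, timeout=10.0, interval=0.2):
    """Call `action` until it stops raising `retry_on`, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except retry_on:
            # Give up once the deadline has passed; otherwise wait and retry.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"gave up after {timeout} seconds")
            time.sleep(interval)
```

To use it with the script above, pass a lambda that performs the element lookup as `action` and `NoSuchElementException` as `retry_on`. Each loop then becomes a single call that either returns the element or fails with a clear error.
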
    【Comments】:

    • What a legend! Thanks for getting me past that hurdle.