【Title】: How do I retrieve URLs, and data from those URLs, from a list of weblinks?
【Posted】: 2019-12-21 23:49:00
【Question】:

Hello, I'm very new to web scraping. I recently retrieved a list of weblinks, and inside those links are URLs containing table data. I plan to scrape that data, but I can't seem to even get the URLs out. Any kind of help is much appreciated.

"网页链接列表是

https://aviation-safety.net/database/dblist.php?Year=1919

https://aviation-safety.net/database/dblist.php?Year=1920

https://aviation-safety.net/database/dblist.php?Year=1921

https://aviation-safety.net/database/dblist.php?Year=1922

https://aviation-safety.net/database/dblist.php?Year=2019

From the list of links, I intend to:

a. get the URLs contained within those links, for example:

https://aviation-safety.net/database/record.php?id=19190802-0

https://aviation-safety.net/database/record.php?id=19190811-0

https://aviation-safety.net/database/record.php?id=19200223-0

"b. 从每个 URL 内的表中获取数据 (例如,事件日期、事件时间、类型、运营商、注册、msn、首飞、分类)"

    #Get the list of weblinks

    import pandas as pd
    from bs4 import BeautifulSoup
    import requests

    # the headers must be a dict, and the site expects a browser-like
    # User-Agent -- fill in a real one here
    headers = {'User-Agent': 'insert user agent'}

    #start of code

    mainurl = "https://aviation-safety.net/database/"

    def getAndParseURL(mainurl):
        result = requests.get(mainurl, headers=headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable = soup.find_all('a', href=True)
        return datatable

    datatable = getAndParseURL(mainurl)

    #go through the content and grab the URLs

    links = []
    for link in datatable:
        if 'Year' in link['href']:
            url = link['href']
            links.append(mainurl + url)

    #check if links are in dataframe

    df = pd.DataFrame(links, columns=['url'])
    df.head(10)

    #save the links to a csv (index=False keeps the file to a single url column)

    df.to_csv('aviationsafetyyearlinks.csv', index=False)
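
One caveat: `links.append(mainurl + url)` only yields valid URLs while every matching href is relative to `/database/`. The standard-library `urllib.parse.urljoin` resolves both relative and absolute hrefs; a minimal sketch, reusing the `datatable` variable from above:

    from urllib.parse import urljoin

    links = []
    for link in datatable:
        if 'Year' in link['href']:
            # urljoin resolves relative hrefs against mainurl
            # and leaves absolute URLs untouched
            links.append(urljoin(mainurl, link['href']))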


    #from the csv, read each year page and collect the record URLs within it

    contents = []
    df = pd.read_csv('aviationsafetyyearlinks.csv')

    urls = df['url']
    for url in urls:
        page = requests.get(url, headers=headers)  # same headers as above
        soup = BeautifulSoup(page.content, 'html.parser')
        addtable = soup.find_all('a', href=True)
        # keep only the per-incident record links on each year page
        for link in addtable:
            if 'record.php' in link['href']:
                contents.append(link['href'])

I am only able to get the list of weblinks; I can't get the record URLs, let alone the data inside those weblinks. The code keeps printing arrays. I'm not quite sure where my code went wrong. Any help is appreciated, and thanks in advance.

【Question Discussion】:

Tags: python python-3.x web-scraping beautifulsoup


【Solution 1】:

When requesting the pages, add a User-Agent header; the site rejects requests that don't look like they come from a browser.

    headers = {'User-Agent':
               'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    mainurl = "https://aviation-safety.net/database/dblist.php?Year=1919"

    def getAndParseURL(mainurl):
        result = requests.get(mainurl, headers=headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        # CSS selector: anchors whose href contains "database/record"
        datatable = soup.select('a[href*="database/record"]')
        return datatable

    print(getAndParseURL(mainurl))
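
Building on that, a sketch of part (a) at scale: loop over the year pages saved in `aviationsafetyyearlinks.csv` (the filename from the question) and collect every record link, resolving relative hrefs with `urljoin`. The CSS selector is the one used above; everything else is an assumption to verify against the live pages:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    headers = {'User-Agent':
               'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

    record_links = []
    for year_url in pd.read_csv('aviationsafetyyearlinks.csv')['url']:
        result = requests.get(year_url, headers=headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        for a in soup.select('a[href*="database/record"]'):
            # resolve each record href against the year page URL
            record_links.append(urljoin(year_url, a['href']))

    print(len(record_links), record_links[:3])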
    

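For part (b), a hedged sketch of pulling the facts out of one record page. It assumes each record page lays its facts out as two-column table rows (label, value) with captions such as Date, Type, Operator, Registration; that layout, and the exact label strings, should be verified in the browser before relying on this:

    # reuses `headers` from the answer above; the record id is one
    # of the examples from the question
    record_url = 'https://aviation-safety.net/database/record.php?id=19190802-0'
    page = requests.get(record_url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')

    record = {}
    for row in soup.select('table tr'):
        cells = row.find_all(['td', 'th'])
        # treat two-cell rows as label/value pairs, e.g. "Date:" -> "..."
        if len(cells) == 2:
            label = cells[0].get_text(strip=True).rstrip(':')
            record[label] = cells[1].get_text(strip=True)

    print(record)

Collecting one such dict per record URL and passing the list to `pd.DataFrame` would then give the tabular output asked for in part (b).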
【Discussion】:
