[Posted]: 2019-12-21 23:49:00
[Problem description]:
"Hello, I'm very new to web scraping. I recently retrieved a list of weblinks, and within those links are URLs containing table data. I plan to scrape that data, but I can't seem to even get the URLs. Any kind of help is much appreciated."
"The list of weblinks is
https://aviation-safety.net/database/dblist.php?Year=1919
https://aviation-safety.net/database/dblist.php?Year=1920
https://aviation-safety.net/database/dblist.php?Year=1921
https://aviation-safety.net/database/dblist.php?Year=1922
https://aviation-safety.net/database/dblist.php?Year=2019"
"From the list of links, I intend to
a. Get the URLs within those links
https://aviation-safety.net/database/record.php?id=19190802-0
https://aviation-safety.net/database/record.php?id=19190811-0
https://aviation-safety.net/database/record.php?id=19200223-0"
"b. Get the data from the table inside each URL (e.g. incident date, incident time, type, operator, registration, msn, first flight, classification)"
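Step a can be sketched against a small inline HTML fragment, so the link-filtering logic is testable without hitting the live site. The fragment below is hypothetical and only mimics the structure of an aviation-safety.net year page; the key idea is keeping only anchors whose `href` contains `record.php`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking a year page: two record links
# plus a navigation link that should be filtered out.
sample_html = """
<table>
  <tr><td><a href="/database/record.php?id=19190802-0">2-AUG-1919</a></td></tr>
  <tr><td><a href="/database/record.php?id=19190811-0">11-AUG-1919</a></td></tr>
  <tr><td><a href="/database/dblist.php?Year=1920">next year</a></td></tr>
</table>
"""

base = "https://aviation-safety.net"
soup = BeautifulSoup(sample_html, "html.parser")

# Keep only anchors whose href points at an individual record page.
record_urls = [base + a["href"]
               for a in soup.find_all("a", href=True)
               if "record.php" in a["href"]]
print(record_urls)
```

The same filter applied to `soup.find_all('a', href=True)` on a fetched year page yields the `record.php?id=...` URLs listed under step a.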
#Get the list of weblinks
import pandas as pd
from bs4 import BeautifulSoup
import requests

# headers must be a dict mapping header name to value, not a set;
# the placeholder user-agent string is kept as-is
headers = {'User-Agent': 'insert user agent'}

#start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl, headers=headers)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

#save the links to a csv (index=False avoids writing a spurious index column)
df.to_csv('aviationsafetyyearlinks.csv', index=False)

#from the csv read each web-link and get URLs within each link
contents = []
df = pd.read_csv('aviationsafetyyearlinks.csv')
urls = df['url']
for url in urls:
    contents.append(url)

for url in contents:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    addtable = soup.find_all('a', href=True)
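The loop above collects every anchor on the page, including navigation links, which is why the output "keeps showing arrays" of tags. For step b, each record URL must then be fetched and its label/value table read into fields. A sketch against a hypothetical record-page fragment (the class name and labels are assumptions, not the site's confirmed markup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the two-column table on a
# record.php page: a label cell followed by a value cell.
record_html = """
<table>
  <tr><td class="caption">Date:</td><td>Saturday 2 August 1919</td></tr>
  <tr><td class="caption">Type:</td><td>Caproni Ca.48</td></tr>
  <tr><td class="caption">Operator:</td><td>Caproni</td></tr>
</table>
"""

soup = BeautifulSoup(record_html, "html.parser")

# Build a field-name -> value dict from the label/value rows.
fields = {}
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 2:
        label = cells[0].get_text(strip=True).rstrip(":")
        fields[label] = cells[1].get_text(strip=True)

print(fields)
```

Applying this to each fetched record page, and appending the resulting dicts to a list, gives rows that `pd.DataFrame` can turn into the final table.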
"I am only able to get the list of weblinks; I cannot get the URLs within them, let alone the data inside those weblinks. The code keeps showing arrays. I'm not really sure where my code went wrong; any help is appreciated, and thanks in advance."
[Discussion]:
Tags: python python-3.x web-scraping beautifulsoup