【Posted】: 2021-05-13 00:39:16
【Question】:
I have a list of different URLs from which I would like to scrape text with Python. So far I have managed to build a script that returns URLs based on a Google search with keywords, and I now want to scrape the content of those URLs. The problem is that I am currently scraping the entire page, including layout/styling information, while I only want the "visible text". Ultimately, my goal is to extract the names mentioned on all of these URLs and store them in a pandas DataFrame, perhaps even including how often certain names are mentioned, but that is for later. Below is a fairly simple start to my code so far:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
import spacy
import en_core_web_sm
import pandas as pd
url_list = ["https://www.nhtsa.gov/winter-driving-safety", "https://www.safetravelusa.com/", "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/", "https://www.wsdot.com/traffic/passes/stevens/"]
df = pd.DataFrame(url_list, columns = ['url'])
df_Names = []
# load english language model
nlp = en_core_web_sm.load()
# find Names in text
def spacy_entity(df):
    df1 = nlp(df)
    df2 = [[w.text, w.label_] for w in df1.ents]
    return df2

for index, url in df.iterrows():
    print(index)
    print(url)
    sleep(randint(2, 5))
    # print(page)
    req = Request(url[0], headers={"User-Agent": 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, 'html5lib').get_text()
    df_Names.append(spacy_entity(soup))

df["Names"] = df_Names
【Discussion】:
Tags: python dataframe web-scraping beautifulsoup spacy