【发布时间】:2021-11-20 18:34:07
【问题描述】:
我正在尝试抓取这个新闻网站“https://inshorts.com/en/read/national”,我只是得到显示文章的结果,我需要网站上的所有文章包含单词(例如“COVID-19”),并且不必使用“加载更多”按钮。
这是我提供当前文章的代码:
import requests
from bs4 import BeautifulSoup
import pandas as pd
dummy_url="https://inshorts.com/en/read/badminton"
data_dummy=requests.get(dummy_url)
soup=BeautifulSoup(data_dummy.content,'html.parser')
urls=["https://inshorts.com/en/read/national"]
news_data_content,news_data_title,news_data_category,news_data_time=[],[],[],[]
for url in urls:
category=url.split('/')[-1]
data=requests.get(url)
soup=BeautifulSoup(data.content,'html.parser')
news_title=[]
news_content=[]
news_category=[]
news_time = []
for headline,article,time in zip(soup.find_all('div', class_=["news-card-title news-right-box"]),
soup.find_all('div',class_=["news-card-content news-right-box"]),
soup.find_all('div', class_ = ["news-card-author-time news-card-author-time-in-title"])):
news_title.append(headline.find('span',attrs={'itemprop':"headline"}).string)
news_content.append(article.find('div',attrs={'itemprop':"articleBody"}).string)
news_time.append(time.find('span', clas=["date"]))
news_category.append(category)
news_data_title.extend(news_title)
news_data_content.extend(news_content)
news_data_category.extend(news_category)
news_data_time.extend(news_time)
df1=pd.DataFrame(news_data_title,columns=["Title"])
df2=pd.DataFrame(news_data_content,columns=["Content"])
df3=pd.DataFrame(news_data_category,columns=["Category"])
df4=pd.DataFrame(news_data_time, columns=["time"])
df=pd.concat([df1,df2,df3,df4],axis=1)
def name():
a = input("File Name: ")
return a
b = name()
df.to_csv(b + ".csv")
【问题讨论】:
标签: web-scraping beautifulsoup python-requests data-science feed