[Question title]: Web scraping news page with a "load more" button
[Posted]: 2021-11-20 18:34:07
[Question]:

I'm trying to scrape this news site, "https://inshorts.com/en/read/national", but I only get the articles that are initially displayed. I need all articles on the site that contain a given word (e.g. "COVID-19"), without having to click the "Load More" button.

Here is my code, which gives me the currently visible articles:

import requests
from bs4 import BeautifulSoup
import pandas as pd
dummy_url="https://inshorts.com/en/read/badminton"
data_dummy=requests.get(dummy_url)
soup=BeautifulSoup(data_dummy.content,'html.parser')


urls=["https://inshorts.com/en/read/national"]
news_data_content,news_data_title,news_data_category,news_data_time=[],[],[],[]
for url in urls:
  category=url.split('/')[-1]
  data=requests.get(url)
  soup=BeautifulSoup(data.content,'html.parser')
  news_title=[]
  news_content=[]
  news_category=[]
  news_time = []
  for headline,article,time in zip(soup.find_all('div', class_=["news-card-title news-right-box"]),
                            soup.find_all('div',class_=["news-card-content news-right-box"]),
                            soup.find_all('div', class_ = ["news-card-author-time news-card-author-time-in-title"])):
    
    news_title.append(headline.find('span',attrs={'itemprop':"headline"}).string)
    news_content.append(article.find('div',attrs={'itemprop':"articleBody"}).string)
    news_time.append(time.find('span', class_=["date"]).string)  # fixed typo: "clas" -> "class_", and extract the text

    news_category.append(category)
  news_data_title.extend(news_title)
  news_data_content.extend(news_content)
  news_data_category.extend(news_category)  
  news_data_time.extend(news_time)

df1=pd.DataFrame(news_data_title,columns=["Title"])
df2=pd.DataFrame(news_data_content,columns=["Content"])
df3=pd.DataFrame(news_data_category,columns=["Category"])
df4=pd.DataFrame(news_data_time, columns=["time"])
df=pd.concat([df1,df2,df3,df4],axis=1)


def name():
  a = input("File Name: ")
  return a
b = name()
df.to_csv(b + ".csv")
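Once the DataFrame is built, the keyword filtering the question asks for can be done with pandas alone. A minimal sketch (the sample rows below are made up for illustration; in the real script `df` comes from the scraping loop above):

```python
import pandas as pd

# Stand-in for the scraped DataFrame from the code above.
df = pd.DataFrame({
    "Title": ["COVID-19 cases drop", "Budget session begins"],
    "Content": ["Daily COVID-19 numbers fell today.", "Parliament met today."],
})

keyword = "COVID-19"

# Keep rows where the keyword appears in the title or the body,
# case-insensitively; na=False guards against missing text.
mask = (df["Title"].str.contains(keyword, case=False, na=False)
        | df["Content"].str.contains(keyword, case=False, na=False))
filtered = df[mask]

print(filtered["Title"].tolist())  # → ['COVID-19 cases drop']
```

This only filters what has already been scraped, so it still needs to be combined with a way of loading the remaining pages, as in the answer below.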

[Comments]:

    Tags: web-scraping beautifulsoup python-requests data-science feed


    [Solution 1]:

    You can use this example to simulate clicking the Load More button:

    import re
    import requests
    from bs4 import BeautifulSoup
    
    
    url = "https://inshorts.com/en/read/national"
    api_url = "https://inshorts.com/en/ajax/more_news"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
    }
    
    # load first page:
    html_doc = requests.get(url, headers=headers).text
    min_news_id = re.search(r'min_news_id = "([^"]+)"', html_doc).group(1)
    
    pages = 10  # <-- here I limit number of pages to 10
    while pages:
        soup = BeautifulSoup(html_doc, "html.parser")
    
        # search the soup for your articles here
        # ...
    
        # here I just print the headlines:
        for headline in soup.select('[itemprop="headline"]'):
            print(headline.text)
    
        # load next batch of articles:
        data = requests.post(api_url, data={"news_offset": min_news_id}).json()
        html_doc = data["html"]
        min_news_id = data["min_news_id"]
    
        pages -= 1
    

    Prints the headlines from the first 10 pages:

    
    ...
    
    Moeen has done some wonderful things in Test cricket: Root
    There should be an evolution in player-media relationship: Federer
    Swiggy in talks to raise over $500 mn at $10 bn valuation: Reports
    Tesla investors urged to reject Murdoch, Kimbal Musk's re-election
    Doctor dies on Pune-Mumbai Expressway when rolls of paper fall on his car
    2 mothers name newborn girls after Cyclone Gulab in Odisha 
    100 US citizens, permanent residents waiting to leave Afghanistan
    Iran's nuclear programme has crossed all red lines: Israeli PM
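    To answer the original question, the `# search the soup for your articles here` step in the loop can collect only the articles mentioning a keyword. An offline sketch of that parsing step (the sample HTML below is invented for illustration; the real pages and the `"html"` field returned by the `more_news` endpoint use the same `itemprop` attributes):

```python
from bs4 import BeautifulSoup

# Made-up batch of HTML mimicking the itemprop markup used on the site.
html_doc = """
<div><span itemprop="headline">COVID-19 vaccination drive expands</span>
     <div itemprop="articleBody">States received new doses today.</div></div>
<div><span itemprop="headline">Metro line opens</span>
     <div itemprop="articleBody">The new line opened to the public.</div></div>
"""

def matching_articles(html, keyword):
    """Return headlines of articles whose headline or body mentions keyword."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for headline, body in zip(soup.select('[itemprop="headline"]'),
                              soup.select('[itemprop="articleBody"]')):
        text = headline.get_text() + " " + body.get_text()
        if keyword.lower() in text.lower():
            results.append(headline.get_text())
    return results

print(matching_articles(html_doc, "COVID-19"))
# → ['COVID-19 vaccination drive expands']
```

    Calling `matching_articles` on each batch inside the `while pages:` loop and extending one list accumulates every matching article across pages.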
    

    [Discussion]:
