【Title】:Can't seem to access meta tags
【Posted】:2020-12-11 18:42:35
【Question】:

I need to scrape the author and date from news articles, but I can't access certain information in the meta tags.

import requests, random, re, os
from bs4 import BeautifulSoup as bs
import urllib.parse
import time
from newspaper import Article


url = ['https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7']

##WALL STREET JOURNAL
for link in url:

    #Try 1
    #Get the published date -- this is where I have problems. 
    webpage = requests.get(link)
    soup = bs(webpage.text, "html.parser")
    date = soup.find("meta", {"name": "article.published"})
    print(date)



    #Try 2
    #Access date from the <time> tag instead
    for tag in soup.find_all('time', {"class": "timestamp article__timestamp flexbox__flex--1"}):
        date = tag.text
        print(date)





    #Get the author name -- this part works
    article = Article(link, language='en')
    article.download()
    article.parse()
    # print(article.html)

    author = article.authors
    date = article.publish_date
    author = author[0]

    day_month = str("Check Date")
    print(day_month + "," + "," + "," + str(author))

When I print out the soup, I can see the meta tags in the output, so I know they're there, but I can't seem to access them with either approach.

Here's the output I'm currently getting:

None
Check Date,,,Christopher Mims

Any ideas?

【Discussion】:

Tags: web-scraping beautifulsoup newspaper3k


【Solution 1】:

If you don't specify a user agent, the site returns a different page (a 404 not-found page). You can specify any valid user agent, e.g.:

import requests
from bs4 import BeautifulSoup as bs


HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
url = ['https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7']

## WALL STREET JOURNAL
for link in url:

    # Get the published date -- this is where I have problems.
    webpage = requests.get(link, headers=HEADERS)
    soup = bs(webpage.text, "html.parser")
    date = soup.find("meta", {"name": "article.published"})
    print(date['content'])

    # Access date from the <time> tag instead
    for tag in soup.find_all('time', {"class": "timestamp article__timestamp flexbox__flex--1"}):
        date = tag.text
        print(date.strip())
    

Output:

2020-08-22T04:01:00.000Z
Aug. 22, 2020 12:01 am ET
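If you need the published date as a `datetime` object rather than a string, the ISO-8601 value from the meta tag can be parsed with the standard library alone. A minimal sketch (note that `datetime.fromisoformat` before Python 3.11 rejects the trailing `Z`, so it is swapped for an explicit UTC offset first):

```python
from datetime import datetime

# Sample value taken from the "article.published" meta tag above.
raw = "2020-08-22T04:01:00.000Z"

# fromisoformat() in Python < 3.11 does not accept the trailing "Z",
# so replace it with an explicit UTC offset before parsing.
published = datetime.fromisoformat(raw.replace("Z", "+00:00"))

print(published.date())  # 2020-08-22
```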

【Discussion】:

【Solution 2】:

Newspaper has some query-efficiency issues, because it has minor navigation problems locating certain data elements in a target's HTML. It's worth reviewing the target page's HTML first to determine which items can be queried with Newspaper's functions/methods.

The meta tags on the Wall Street Journal contain the author's name, the article's headline, summary, publish date, and keywords, all of which can be extracted without using BeautifulSoup.

    from newspaper import Article
    from newspaper import Config
    
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = user_agent
    
    url = 'https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7'
    article = Article(url, config=config)
    article.download()
    article.parse()
    article_meta_data = article.meta_data
    
    article_published_date = str({value for (key, value) in article_meta_data.items() if key == 'article.published'})
    print(article_published_date)
    
    article_author = sorted({value for (key, value) in article_meta_data.items() if key == 'author'})
    print(article_author)
    
    article_title = {value for (key, value) in article_meta_data.items() if key == 'article.headline'}
    print(article_title)
    
    article_summary = {value for (key, value) in article_meta_data.items() if key == 'article.summary'}
    print(article_summary)
    
    keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'news_keywords'})
    article_keywords = sorted(keywords.lower().split(','))
    print(article_keywords)
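Since `meta_data` behaves like a dictionary, a plain `.get()` lookup is a simpler alternative to the single-key set comprehensions above. An offline sketch of the pattern, using a plain dict with assumed keys (the values are illustrative, not taken from the live article):

```python
# Stand-in for article.meta_data; the keys mirror the WSJ meta tags
# queried above, the values are purely illustrative.
meta = {
    'article.published': '2020-08-22T04:01:00.000Z',
    'author': 'Christopher Mims',
    'news_keywords': 'covid-19,workers,economy',
}

# A direct .get() replaces the one-key set comprehension and returns
# None (instead of an empty set) when the tag is missing.
print(meta.get('article.published'))   # 2020-08-22T04:01:00.000Z
print(meta.get('article.summary'))     # None
```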
    

I hope this answer helps you.

P.S. BeautifulSoup is a dependency of Newspaper, so it can be imported like this:

    from newspaper.utils import BeautifulSoup
    

【Discussion】:

• Wow! Great way to solve the problem. Thanks for the help. This will speed up some of the work I'm doing.
• @SethMcCombie You're welcome. If this answer solved your problem, please accept it as the solution to your question.
• @SethMcCombie Since you're using Newspaper, you might find my usage documentation helpful -- github.com/johnbumgarner/newspaper3_usage_overview