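Scrape news from Google News: answers

【Question title】: Scrape news from google news
【Posted】: 2015-08-03 02:10:59
【Question description】:

This may look like a duplicate of other questions about scraping content from news.google.com, but it is not: those questions only ask for the entire HTML source, not for the URLs of the articles.

I am trying to write two functions that fetch news from news.google.com, either the top stories or stories matching whatever the user types, e.g.: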

>>> news top
> <5 url of top stories in news.google.com>

or

>>> news london
> <5 london related news url from news.google.com>

This is the code I have so far (I am not very familiar with scraping or requests, so I don't know how to proceed):

import requests
from lxml import html

def get_news(user_define_input):
    try:
        response = requests.get("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=test&oq="+format(user_define_input[1]))
    except:
        print ("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    # /text() returns text nodes only, not the link URLs
    news = tree.xpath("//div[@class='l _HId']/text()")
    print (news)

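I do realize that /text() does not get the URLs, but I don't know how to get them, hence the question.

If you like, you can add this to make the output look nicer:

news = "<anything>".join(news)

To clarify, user_define_input[0] will be "news" from the user input, and user_define_input[1] will be the search term, e.g. "london", so all results should be related to London. If you are willing to take the time to make my other function fetch all the top stories from news.google.com, I would really appreciate it! :) (The code should be similar, so I won't post anything about it here.)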
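
A side note on the request URL: in the snippet above, q is fixed to test and the user's term is appended to oq instead. The post does not say whether that is intentional. The following is a minimal sketch, assuming the search term is meant to go into q; build_news_url and SEARCH_ENDPOINT are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the helper name below is hypothetical.
SEARCH_ENDPOINT = "https://www.google.com/search"


def build_news_url(user_define_input):
    """Build a news search URL from tokens such as ['news', 'london']."""
    # Assumption: everything after the 'news' command word is the search term.
    query = " ".join(user_define_input[1:])
    params = {"hl": "en", "gl": "us", "tbm": "nws", "q": query}
    # urlencode escapes spaces and other special characters in the term.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_news_url("news london".split()))
```

The resulting string can be passed to requests.get in place of the hand-built URL; multi-word terms such as "new york" are escaped automatically.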

Code after the help I received (still not working):

def get_news(user_define_input):
    try:
        response = requests.get("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=test&oq="+format(user_define_input[1]))
    except:
        print ("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")
    for url in url_to_news:
        print(url)
    summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")
    title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")
    print (summary_of_the_new)
    print (title_of_the_new)

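【Comments on the question】:

Were you able to fix it? Could you post the final solution / your working code?
@user2543622 Unfortunately, I don't have a final solution :( Maybe another day :/ I'm currently working on other things.

Tags: python html function python-3.x python-requests

【Solution 1】:

If I understand correctly, what you want is the URL of every news item that appears when the user enters a query, right?

To do that, you will need this XPath expression: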

url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")

It returns a list containing the news URLs.

Since it is a list, a for loop is all you need to iterate over the URLs:

for url in url_to_news:
    print(url)
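
To see the expression at work without hitting the network, here is a small sketch that runs it against an inline HTML string. The markup in SAMPLE is hypothetical: it only mimics the class names used in the expression above, and the real page may be structured differently. extract_urls and its limit parameter are likewise illustrative names.

```python
from lxml import html

# Hypothetical markup that mimics the class names used in the answer;
# the real page may look different.
SAMPLE = """
<html><body>
  <div class="esc-lead-article-title-wrapper">
    <h2 class="esc-lead-article-title">
      <a href="http://example.com/story-1"><span class="titletext">Story one</span></a>
    </h2>
  </div>
  <div class="esc-lead-article-title-wrapper">
    <h2 class="esc-lead-article-title">
      <a href="http://example.com/story-2"><span class="titletext">Story two</span></a>
    </h2>
  </div>
</body></html>
"""


def extract_urls(page_source, limit=5):
    """Return up to `limit` article URLs found in the given HTML source."""
    tree = html.fromstring(page_source)
    # Selecting @href yields the attribute values rather than the link text.
    urls = tree.xpath(
        ".//div[@class='esc-lead-article-title-wrapper']"
        "/h2[@class='esc-lead-article-title']/a/@href"
    )
    return [str(u) for u in urls[:limit]]


if __name__ == "__main__":
    for url in extract_urls(SAMPLE):
        print(url)
```

If the list comes back empty on a live page, the class names in the downloaded HTML most likely differ from the ones in the expression.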

Add-ons:

To get the news summaries, you need the following:

summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")

Finally, the news titles are:

title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")

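After that, you can map all of this information together. If you need further help, please comment on this answer. I answered the question based on my understanding of it.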
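
One possible way to map the pieces together is to walk over each story block and read the URL, title, and summary relative to it, so that a story with a missing field does not shift the others out of alignment. This is only a sketch: the wrapping div with class story is an assumption made for the example (the answer does not name a container element), the SAMPLE markup is hypothetical, and collect_stories is an illustrative name.

```python
from lxml import html

# Hypothetical markup: the 'story' wrapper is assumed for illustration;
# the inner class names are the ones used in the answer.
SAMPLE = """
<html><body>
  <div class="story">
    <div class="esc-lead-article-title-wrapper">
      <h2 class="esc-lead-article-title">
        <a href="http://example.com/a"><span class="titletext">First <b>headline</b></span></a>
      </h2>
    </div>
    <div class="esc-lead-snippet-wrapper">Summary of the first story.</div>
  </div>
  <div class="story">
    <div class="esc-lead-article-title-wrapper">
      <h2 class="esc-lead-article-title">
        <a href="http://example.com/b"><span class="titletext">Second headline</span></a>
      </h2>
    </div>
  </div>
</body></html>
"""


def collect_stories(page_source):
    """Return one dict per story block with its url, title, and summary."""
    tree = html.fromstring(page_source)
    stories = []
    for block in tree.xpath(".//div[@class='story']"):
        links = block.xpath(".//h2[@class='esc-lead-article-title']/a/@href")
        titles = block.xpath(".//span[@class='titletext']")
        snippets = block.xpath(".//div[@class='esc-lead-snippet-wrapper']")
        stories.append({
            "url": str(links[0]) if links else None,
            # text_content() also collects text nested in child tags,
            # which a plain /text() step would skip.
            "title": titles[0].text_content().strip() if titles else None,
            "summary": snippets[0].text_content().strip() if snippets else None,
        })
    return stories


if __name__ == "__main__":
    for story in collect_stories(SAMPLE):
        print(story)
```

Compared with zipping three separately extracted lists, the per-block approach keeps each title attached to its own URL even when a summary is absent.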

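【Comments】:

So I typed "news test", which searches for test. With your code, printing summary_of_the_new gives me [], and the same goes for title_of_the_new. My code is in the edit above.

【Solution 2】:

Check out my implementation at http://mpand.github.io/gnp/

It returns the stories and URLs as JSON objects.

【Comments】:

Please answer stackoverflow.com/questions/36846782/…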