RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

【Question title】: RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited
【Posted】: 2022-01-31 08:54:53
【Question】:

I have been resisting using asyncio in my code, but it might help with some performance issues I am having.

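Here is my scenario:

The end user provides a list of news sites to scrape
Each element is passed to the Article class
Valid articles are passed to the Extraction class
The Extraction class passes the data to the NewsExtraction class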

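90% of the time this flow works flawlessly, but occasionally one of the 12 functions in the NewsExtraction class fails to extract data that is present in the HTML it was given. My code seems to be "stepping on itself", which causes a data element not to be parsed. When I rerun the code, all the elements are parsed correctly.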

The NewsExtraction class has a function, get_article_data_elements, which is called from the Extraction class.

get_article_data_elements calls these items:

published_date = self._extract_article_published_date()
modified_date = self._extract_article_modified_date()
title = self._extract_article_title()
description = self._extract_article_description()
keywords = self._extract_article_key_words()
tags = self._extract_article_tags()
authors = self._extract_article_author()
top_image = self._extract_top_image()
language = self._extract_article_language()
categories = self._extract_article_category()
text = self._extract_textual_content()
url = self._extract_article_url()

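Each of these data elements is used to populate a Python dictionary, which is eventually passed back to the end user.

I have been trying to add asyncio code to the NewsExtraction class, but I keep getting this error message: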

RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

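I have spent the last 3 days trying to solve this. I have gone through dozens of Stack Overflow questions about the error RuntimeWarning: coroutine never awaited. I have also read a lot of articles on using asyncio, but I cannot figure out how to use asyncio with my NewsExtraction class, which is called from the Extraction class.

Can someone give me some pointers on how to fix my problem?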

class NewsExtraction(object):
    """
    This class is used to extract common data elements from a news article
    on xyz
    """

    def __init__(self, url, soup):
        self._url = url
        self._raw_soup = soup


    truncated...


    async def _extract_article_published_date(self):
      """
      This function is designed to extract the publish date for the article being parsed.

      :return: date article was originally published
      :rtype: string
      """
      json_date_published = JSONExtraction(self._url, self._raw_soup).extract_article_published_date()
      if json_date_published is not None:
         if len(json_date_published) != 0:
            return json_date_published
         else:
             return None
      elif json_date_published is None:
           if self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")}):
              date_published = self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")})
              if len(date_published) != 0:
                 return date_published.text
              else:
                logger.info('The HTML tag to extract the publish date for the following article was not found.')
                logger.info(f'Article URL -- {self._url}')
                return None


    truncated...


    async def get_article_data_elements(self):
      """
      This function is designed to extract all the common data elements from a
      news article on xyz.

      :return: dictionary of data elements related to the article
      :rtype: dict
        """
      article_data_elements = {}
      
      # I have tried this:
      published_date = self._extract_article_published_date().__await__()

      # and this
      published_date = self.task(self._extract_article_published_date())
      await published_date

      truncated...

I have also tried using:

if __name__ == "__main__":
    asyncio.run(NewsExtraction.get_article_data_elements())
    # asyncio.run(self.get_article_data_elements())

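I am completely lost on how to use asyncio in my news extraction code.

If there is something wrong with this question, I am happy to delete it and keep reading up on how to use asyncio correctly.

Can anyone give me some advice on how to solve my problem?

Thanks in advance for any guidance on using asyncio.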

【Question comments】:

Tags: python-3.x python-asyncio

【Solution 1】:

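You defined _extract_article_published_date and get_article_data_elements as coroutines, and a coroutine must be await-ed in your code before you can get its result asynchronously.

You can create an instance of NewsExtraction and call this method with the keyword await in front of it. The await hands execution over to other tasks in the loop until the task it is waiting for has finished. Note that no threads or processes are involved: execution is only handed over while no CPU time is being used (await-ing an I/O operation or sleeping).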

if __name__ == '__main__':
    extractor = NewsExtraction(...)
    # this creates the event loop and runs the coroutine
    asyncio.run(extractor.get_article_data_elements())

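Inside _extract_article_published_date you also have to await the coroutines that perform requests over the network. If you use a library for scraping, make sure it uses async/await under the hood; otherwise you will not see a real performance gain from asyncio.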

async def get_article_data_elements(self):
      article_data_elements = {}
      
     # note here that the instance is self
      published_date = await self._extract_article_published_date()

      truncated...
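
The two fragments above can be combined into a small runnable sketch. Everything in it is hypothetical: ArticleExtractor, its two helper methods, and the extract function are stand-ins invented for illustration, not the asker's NewsExtraction code, and asyncio.sleep(0) only marks the spot where real awaitable I/O would go.

```python
import asyncio


class ArticleExtractor:
    """Hypothetical stand-in for NewsExtraction, not the asker's real class."""

    def __init__(self, url, html):
        self._url = url
        self._html = html

    async def _extract_title(self):
        # Placeholder for awaitable I/O (an assumption for this sketch).
        await asyncio.sleep(0)
        start = self._html.find("<title>")
        end = self._html.find("</title>")
        if start == -1 or end == -1:
            return None
        return self._html[start + len("<title>"):end]

    async def _extract_url(self):
        return self._url

    async def get_data_elements(self):
        # Each coroutine method is awaited on the instance (self).
        title = await self._extract_title()
        url = await self._extract_url()
        return {"title": title, "url": url}


def extract(url, html):
    # Build an instance first, then hand the coroutine object to asyncio.run,
    # which creates the event loop and runs the coroutine to completion.
    extractor = ArticleExtractor(url, html)
    return asyncio.run(extractor.get_data_elements())


if __name__ == "__main__":
    print(extract("https://example.com/a", "<html><title>Hi</title></html>"))
```

Calling the method on the class itself, as in asyncio.run(NewsExtraction.get_article_data_elements()), cannot work because no instance is bound to self.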

You should read the asyncio documentation closely to get a better understanding of these features in Python 3.7+.
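
The point that execution is only handed over while a coroutine is awaiting can be illustrated with a small sketch. The names cooperative_step, blocking_step, and run_pair are made up for this example, and sum(range(1000)) is just a stand-in for synchronous work such as parsing. It uses asyncio.gather to run two coroutines in the same loop and records the order in which they start and finish.

```python
import asyncio


async def cooperative_step(name, log):
    log.append(f"{name} start")
    # Awaiting yields control to the event loop, so another task can run.
    await asyncio.sleep(0)
    log.append(f"{name} end")


async def blocking_step(name, log):
    log.append(f"{name} start")
    # No await here: synchronous work keeps the loop busy until it is done.
    sum(range(1000))
    log.append(f"{name} end")


async def run_pair(step):
    # Run two instances of the same step concurrently and record the order.
    log = []
    await asyncio.gather(step("a", log), step("b", log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_pair(cooperative_step)))
    print(asyncio.run(run_pair(blocking_step)))
```

With cooperative_step the two tasks interleave; with blocking_step they run one after the other even though both are declared with async def. An async def that never awaits anything gains nothing from the event loop.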

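【Comments】:

Thanks for your answer. I will most likely mark it as correct as soon as I figure out what is causing this new error: sys:1: RuntimeWarning: coroutine 'NewsExtraction._extract_article_title' was never awaited RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Every time you want to call a coroutine (an async def) you have to put the keyword await in front of it; this schedules it for execution in the loop and waits until the coroutine returns. That is exactly what I describe in the answer. If you want to experiment, copy the "Hello, World" example from the documentation and remove the await keyword; it will emit the same warning you are getting.
Note that this is not an exception, only a warning: the coroutine was never scheduled in the loop, which is probably not what you expected.
Every function starts with async def, and I call them like this: `title = await self._extract_article_title()`
Thanks for your help. I now understand where I made my mistake: I needed to call asyncio.run from another class in my program. I did notice that adding async did not improve my code's performance. I am now looking for something that will.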
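
The experiment suggested in the comments can be sketched as follows. The function names are invented for this example. The sketch assumes CPython, where the warning is issued once the un-awaited coroutine object is garbage collected, which is why gc.collect() is called before the captured warnings are read.

```python
import asyncio
import gc
import warnings


async def say_hello():
    await asyncio.sleep(0)
    return "hello"


async def forgot_await():
    # Calling a coroutine function only creates a coroutine object.
    # Without await, the body of say_hello never runs.
    say_hello()
    gc.collect()
    return None


async def with_await():
    # With await, the coroutine runs and its return value comes back.
    return await say_hello()


def run_and_collect_warnings(coro_fn):
    # Record warnings instead of printing them so they can be inspected.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = asyncio.run(coro_fn())
        gc.collect()
    return result, [str(w.message) for w in caught]


if __name__ == "__main__":
    print(run_and_collect_warnings(forgot_await))
    print(run_and_collect_warnings(with_await))
```

The first call records a "was never awaited" RuntimeWarning and the program keeps running, which matches the comment that this is a warning rather than an exception.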