【Posted】: 2019-06-07 07:01:29
【Problem description】:
I am new to asks and trio in Python, and I was given this sample code. Let me explain: I have a list of URLs, each one a news URL, and each has sub-URLs. The first URL is requested, all the other hrefs are collected and added to a list, and then the articles for every href in that list are fetched. The problem is that sometimes the articles come back empty.
I tried the sample code with a single URL and it works:
import asks
import trio
from goose3 import Goose
import logging as log
from goose3.configuration import ArticleContextPattern
from pprint import pprint
import json
import time
asks.init('trio')
async def extractor(path, htmls, paths, session):
    try:
        r = await session.get(path, timeout=2)
        out = r.content
        htmls.append(out)
        paths.append(path)
    except Exception as e:
        out = str(e)
        htmls.append(out)
        paths.append(path)

async def main(path_list, session):
    htmls = []
    paths = []
    async with trio.open_nursery() as n:
        for path in path_list:
            n.start_soon(extractor, path, htmls, paths, session)
    return htmls, paths

async def run(urls, conns=50):
    s = asks.Session(connections=conns)
    g = Goose()
    htmls, paths = await main(urls, s)
    print(htmls, " ", paths)
    cleaned = []
    for html, path in zip(htmls, paths):
        dic = {}
        dic['url'] = path
        if html is not None:
            try:
                #g.config.known_context_pattern = ArticleContextPattern(attr='class', value='the-post')
                article = g.extract(raw_html=html)
                author = article.authors
                dic['goose_text'] = article.cleaned_text
                #print(article.cleaned_text)
                #dic['goose_date'] = article.publish_datetime
                dic['goose_title'] = article.title
                if author:
                    dic['authors'] = author[0]
                else:
                    dic['authors'] = ''
            except Exception as e:
                # a bare `raise` here made the fallback below unreachable; removed
                print(e)
                log.info('goose found no text using html')
                dic['goose_html'] = html
                dic['goose_text'] = ''
                dic['goose_date'] = None
                dic['goose_title'] = None
                dic['authors'] = ''
        cleaned.append(dic)
    return cleaned

async def real_main():
    sss = '[{"crawl_delay_sec": 0, "name": "mining","goose_text":"","article_date":"","title":"", "story_url": "http://www.mining.com/canalaska-start-drilling-west-mcarthur-uranium-project","url": "http://www.mining.com/tag/latin-america/page/1/"},{"crawl_delay_sec": 0, "name": "mining", "story_url": "http://www.mining.com/web/tesla-fires-sound-alarms-safety-electric-car-batteries", "url": "http://www.mining.com/tag/latin-america/page/1/"}]'
    obj = json.loads(sss)
    pprint(obj)
    articles = []
    for l in obj:
        articles.append(await run([l['story_url']]))
        #await trio.sleep(3)
    pprint(articles)

if __name__ == "__main__":
    trio.run(real_main)
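The loop in real_main calls run() once per story_url. The discussion below suggests collecting all the URLs first and passing them to run() as a single list so they are fetched concurrently. A minimal sketch of just the collection step (the URLs here are placeholders, not the ones from the question):

```python
import json

# Same shape as the `sss` string above, trimmed to the relevant field.
sss = ('[{"story_url": "http://example.com/article-1"},'
       ' {"story_url": "http://example.com/article-2"}]')
obj = json.loads(sss)

# Gather every story_url into one list instead of calling run([url]) per entry.
story_urls = [entry["story_url"] for entry in obj]
print(story_urls)
```

With this list in hand, a single `await run(story_urls)` fetches everything in one concurrent batch.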
Expected: the article data is fetched with nothing missing.
【Problem discussion】:
-
Please fix this example. You pass a single URL to run(), but run() expects a list of URLs.
-
Also, please move trio.run to the top level and make run async. The reason is that the current version of asks requires the session to be called inside Trio's run. -
Thanks for the reply. My problem is that I have a list of hrefs, and each href should yield the article's HTML; that is my expectation, but sometimes the HTML is ['']. Can you tell me whether I need some kind of trio callback so I can be sure the HTML actually gets a value?
-
Sorry, could you show me the changes needed in the trio code? Please change the code to make it top-level and async.
-
OK, inlined. Now please fix the code as per my first comment so that it actually works; we cannot figure out why the code sometimes fails when, as given, it does not run at all.
Tags: python python-trio