【Posted】:2021-03-05 04:27:00
【Problem Description】:
I'm working on a task that requires scraping. I have a dataset of ids, and for each id I need to scrape some additional information. The dataset has about 4 million rows. Here is my code:
import pandas as pd
import numpy as np
import semanticscholar as sch
import time

# dataset with ids
df = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"])

# columns that will be produced
cols = ['id', 'abstract', 'arxivId', 'authors',
        'citationVelocity', 'citations',
        'corpusId', 'doi', 'fieldsOfStudy',
        'influentialCitationCount', 'is_open_access',
        'is_publisher_licensed', 'paperId',
        'references', 'title', 'topics',
        'url', 'venue', 'year']

# a new dataframe that we will append the scraped results to
new_df = pd.DataFrame(columns=cols)

# a counter so we know when every 100000 papers have been scraped
c = 0
i = 0
while i < df.shape[0]:
    try:
        paper = sch.paper(df.id[i], timeout=10)  # scrape the paper
        # merge the id with the fields returned by the API into one row
        row = {'id': df.id[i], **{k: paper.get(k) for k in cols[1:]}}
        new_df = new_df.append(row, ignore_index=True)  # append to the new dataframe
        new_df.to_csv('abstracts_impact.csv', index=False)  # save it
        if i % 100000 == 0:  # to check how much we did
            print(c)
            c += 1
        i += 1
    except Exception:
        time.sleep(60)  # wait a minute, then retry the same id
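One thing worth noting about the loop above: `new_df.append` copies the whole DataFrame on every call, and `to_csv` rewrites the entire file on every iteration, so each id gets slower to process as the frame grows. A minimal sketch of a cheaper sequential pattern, reusing `df`, `cols`, and `sch` from the snippet above (the buffer size is an illustrative choice, not from the original code):

import csv
import time

BUFFER_SIZE = 1000  # illustrative; flush to disk every 1000 papers

with open('abstracts_impact.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=cols)
    writer.writeheader()
    buffer = []
    for pid in df.id:
        try:
            paper = sch.paper(pid, timeout=10)
            buffer.append({'id': pid, **{k: paper.get(k) for k in cols[1:]}})
        except Exception:
            time.sleep(60)  # back off once, then move on to the next id
        if len(buffer) >= BUFFER_SIZE:
            writer.writerows(buffer)  # append only the new rows
            f.flush()
            buffer.clear()
    writer.writerows(buffer)  # write whatever is left at the end

Because the file is opened once and only new rows are appended, the per-id cost stays constant instead of growing with the output size.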
The problem is that the dataset is very large, so this approach doesn't scale. I let it run for 2 days and it scraped about 100,000 ids, then it suddenly froze, and all the saved data were just empty rows. I think the best solution is parallelization and batching, but I've never done either before and I'm not familiar with the concepts. Any help would be appreciated. Thanks!
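For reference, the batching-plus-parallelization idea can be sketched with `concurrent.futures.ThreadPoolExecutor`: the requests are I/O-bound, so threads help, and writing each finished batch to its own checkpoint file means a crash only loses the current batch. The worker count, batch size, retry policy, and checkpoint directory below are illustrative assumptions, not anything from the original code, and the Semantic Scholar API rate-limits clients, so too many workers may get throttled:

import os
import time
import pandas as pd
import semanticscholar as sch
from concurrent.futures import ThreadPoolExecutor

FIELDS = ['id', 'abstract', 'arxivId', 'authors', 'citationVelocity',
          'citations', 'corpusId', 'doi', 'fieldsOfStudy',
          'influentialCitationCount', 'is_open_access',
          'is_publisher_licensed', 'paperId', 'references',
          'title', 'topics', 'url', 'venue', 'year']

def fetch(paper_id, retries=3):
    # retry with exponential backoff instead of sleeping forever
    for attempt in range(retries):
        try:
            paper = sch.paper(paper_id, timeout=10)
            return {'id': paper_id, **{k: paper.get(k) for k in FIELDS[1:]}}
        except Exception:
            time.sleep(2 ** attempt)
    return {'id': paper_id}  # give up, but keep the id so the row isn't silently lost

def scrape(ids, batch_size=1000, workers=8, out_dir='checkpoints'):
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(ids), batch_size):
            out = f'{out_dir}/batch_{start // batch_size:06d}.csv'
            if os.path.exists(out):
                continue  # checkpoint already written: a restarted run skips it
            rows = list(pool.map(fetch, ids[start:start + batch_size]))
            pd.DataFrame(rows, columns=FIELDS).to_csv(out, index=False)

df = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=['id'])
scrape(df.id.tolist())

Because each batch lands in its own file, a crashed run can resume by skipping batches whose checkpoint already exists, and the full result can be reassembled at the end by reading the checkpoint files and concatenating them with `pd.concat`.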
【Discussion】:
Tags: python pandas web-scraping batch-processing