Scraping data in parallel + batch processing

【Question title】: Scraping data parallel + batch processing
【Posted】: 2021-03-05 04:27:00
【Question description】:

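I am working on a task that requires scraping. I have a dataset of ids, and for each id I need to scrape some new information. The dataset has about 4 million rows. Here is my code: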

import pandas as pd
import numpy as np
import semanticscholar as sch
import time

# dataset with ids
df = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"])

# columns that will be produced
cols = ['id', 'abstract', 'arxivId', 'authors', 
        'citationVelocity', 'citations', 
        'corpusId', 'doi', 'fieldsOfStudy', 
        'influentialCitationCount', 'is_open_access', 
        'is_publisher_licensed', 'paperId', 
        'references', 'title', 'topics', 
        'url', 'venue', 'year']

# a new dataframe that we will append the scraped results
new_df = pd.DataFrame(columns=cols)

# a counter so we know when every 100000 papers are scraped
c = 0      
i = 0
while i < df.shape[0]:
    try:
        paper = sch.paper(df.id[i], timeout=10) # scrape the paper
        new_df = new_df.append([df.id[i]]+paper, ignore_index=True) # append to the new dataframe
        new_df.to_csv('abstracts_impact.csv', index=False) # save it 
        if i % 100000 == 0: # to check how much we did
            print(c)
            c += 1
        i += 1
    except:
        time.sleep(60)

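The problem is that the dataset is very large and this approach does not work. I let it run for 2 days; it scraped about 100,000 ids and then suddenly froze, and all the saved data were just empty rows. I think the best solution would be parallelization and batch processing. I have never done this before and I am not familiar with these concepts. Any help would be appreciated. Thank you!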

【Question comments】:

Tags: python pandas web-scraping batch-processing

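【Solution 1】:

Okay, first of all, there is no data :( so I just took example IDs from the semanticscholar documentation. Looking at your code, I can see several problems:

Don't insist on doing everything with pd.DataFrame! DataFrames are great, but they are also slow! You only need the IDs from 'paperIds-1975-2005-2015-2.tsv', so you can read the file with file.readline(), or keep just the values in a list-like array: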

data = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"]).id.values

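From the code flow, my understanding is that you want to save the scraped data into a single CSV file, right? Then why append the data and save the file again and again? Each pass copies the whole DataFrame and rewrites the whole CSV, so the loop gets slower and slower as the data grows!
I don't really understand the purpose of the time.sleep(60) you added. If an error occurs, you should print it and continue; why wait?
To check progress, you can use the tqdm library, which shows a nice progress bar for your code!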
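As an illustration of the first point, here is a minimal sketch of reading the ids without pandas. It assumes the file holds one id per line in the first tab-separated column; the helper name `read_ids` is made up for this example and is not part of any library.

```python
def read_ids(path):
    """Return the first tab-separated field of every non-empty line."""
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                ids.append(line.split("\t")[0])
    return ids
```

Something like `data = read_ids('paperIds-1975-2005-2015-2.tsv')` would then give a plain Python list that can be wrapped in tqdm for a progress bar.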

With these in mind, I modified your code as follows:

import pandas as pd
import semanticscholar as sch
from tqdm import tqdm as TQ # for progress bar

data = ['10.1093/mind/lix.236.433', '10.1093/mind/lix.236.433'] # using list or np.ndarray looks more logical!
print(data)
# >> ['10.1093/mind/lix.236.433', '10.1093/mind/lix.236.433']

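Once this is done, you can go ahead and scrape the data. Before that, though: a pandas DataFrame is basically a dictionary with advanced features. So, for our purposes, we will first collect all the information in a dictionary and then create the DataFrame. I personally prefer this process - it gives me more control if any changes are needed.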

cols = ['id', 'abstract', 'arxivId', 'authors', 'citationVelocity', 'citations',
    'corpusId', 'doi', 'fieldsOfStudy', 'influentialCitationCount', 'is_open_access',
    'is_publisher_licensed', 'paperId', 'references', 'title', 'topics', 'url', 'venue', 'year']

outputData = dict((k, []) for k in cols)

print(outputData)
{'id': [],
 'abstract': [],
 'arxivId': [],
 'authors': [],
 'citationVelocity': [],
 'citations': [],
 'corpusId': [],
 'doi': [],
 'fieldsOfStudy': [],
 'influentialCitationCount': [],
 'is_open_access': [],
 'is_publisher_licensed': [],
 'paperId': [],
 'references': [],
 'title': [],
 'topics': [],
 'url': [],
 'venue': [],
 'year': []}

Now you can simply fetch the data and save it into your DataFrame, as follows:

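for _paperID in TQ(data):
    paper = sch.paper(_paperID, timeout = 10) # scrape the paper
    for key in cols:
        if key == 'id':
            # 'id' is our own input id, not a field of the response
            outputData[key].append(_paperID)
        elif key in paper:
            outputData[key].append(paper[key])
        else:
            # dict.get() never raises KeyError, so test membership explicitly
            outputData[key].append(None) # if there is no data, append None
            print(f"{key} not found for {_paperID}")

# build the DataFrame once and write the CSV once, after the loop
pd.DataFrame(outputData).to_csv('output_file_name.csv', index = False)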

This is the output I got:

【Comments】:

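Thanks for your answer! To answer some of your questions: I want to wait when an error occurs because my internet connection is bad and sometimes drops, and I save the data every time because I don't know when the code will stop and I want to have at least something saved. I would like to use multithreading for this task so the work gets done faster.
Oh, okay. For the internet connection issue, you can keep track of the papers that were not downloaded and retry them later. As for saving the data: as long as your program is running, your variables are kept in memory.
Another thing: because of the GIL, parallel processing in Python has its problems, and I have had trouble running such programs on Windows-based systems before. So I would suggest an alternative. For example, create and save a new file for every x records.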
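The suggestions in these comments (remember the failed ids for a later retry, and write a new file for every x records) could be combined roughly as follows. This is only a sketch under assumptions: `scrape_in_batches`, `fetch_one`, the `fetch` argument and the shortened `COLS` list are hypothetical names for illustration, and `fetch` stands in for whatever call returns a dict for one id. A small thread pool is used inside each batch because threads waiting on the network release the GIL, so the waits can overlap; whether that is acceptable depends on the rate limits of the service being scraped.

```python
import csv
import os
from concurrent.futures import ThreadPoolExecutor

COLS = ['id', 'abstract', 'title', 'year']  # shortened column list, for illustration only


def fetch_one(fetch, paper_id):
    """Call fetch(paper_id); return (paper_id, result), or (paper_id, None) on any error."""
    try:
        return paper_id, fetch(paper_id)
    except Exception:
        return paper_id, None


def scrape_in_batches(ids, fetch, out_dir, batch_size=1000, max_workers=4):
    """Fetch ids batch by batch, write one CSV per batch, return the ids that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # Threads overlap the network waits within one batch; map() keeps input order.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(lambda pid: fetch_one(fetch, pid), batch))
        # One file per batch: a crash loses at most the batch in progress.
        path = os.path.join(out_dir, f"batch_{start // batch_size:05d}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=COLS)
            writer.writeheader()
            for paper_id, paper in results:
                if not paper:  # error or empty response: remember it for a later retry
                    failed.append(paper_id)
                    continue
                row = {key: paper.get(key) for key in COLS}
                row['id'] = paper_id  # our own input id, not a field of the response
                writer.writerow(row)
    return failed
```

With the code from the answer, a call might look like `failed = scrape_in_batches(list(data), lambda pid: sch.paper(pid, timeout=10), 'batches')`, after which `failed` can be passed to the same function again once the connection is back.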