【Question Title】: How to get all results by iterating through has_more
【Posted】: 2019-08-07 17:01:24
【Question】:

I am using the Stack Exchange API to fetch comments from 2000 through August 2019. It looks like I only ever iterate through 2 pages. I'm not sure whether my mistake is in the API parameters or in the iteration itself.

Here is what my code looks like.

import requests
from datetime import datetime
import json
import csv
import os
import pprint

pp = pprint.PrettyPrinter(indent=4)

def write_to_json(data):
    curr_dir = os.getcwd()
    output_file_path = os.path.join(curr_dir, 'so_comment1.json')

    with open(output_file_path, 'w') as outfile:
        json.dump(data, outfile)

def get_comments(fromdate, todate):


    so_url = 'https://api.stackexchange.com/2.2/comments?site=stackoverflow&filter=!1zSn*g7xPU9g6(VDTS7_c&fromdate=' \
        +str(fromdate)+'&todate='+str(todate)+'&pagesize=100'
    headers = {"Content-type": "application/json"}

    resp = requests.get(so_url, headers = headers)

    if resp.status_code != 200:
        print('error: ' + str(resp.status_code))
    else:
        print('Success')

    data = resp.json()
    data1 = resp.json()
    page_num = 1
    if data1['has_more']:
        page_num += 1
        so_url = 'https://api.stackexchange.com/2.2/comments?site=stackoverflow&filter=!1zSn*g7xPU9g6(VDTS7_c&fromdate=' \
            +str(fromdate)+'&todate='+str(todate)+'&pagesize=100&page='+str(page_num)

        resp = requests.get(so_url, headers = headers)

        if resp.status_code != 200:
            print('error: ' + str(resp.status_code))
        else:
            print('Success')

        data1 = resp.json()

        for item in data1['items']:
            data['items'].append(item)

    write_to_json(data)       

def filter_comment_body():
    with open('so_comment1.json') as json_file_so:
        comments = json.load(json_file_so)

        with open('comments1.csv', 'w', encoding='utf-8') as comments_file:
            comments_writer = csv.writer(comments_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

            for item in comments['items']:
                comments_writer.writerow([item['body']])


if __name__ == '__main__':
    # once comments are written to json file(s) stop calling to get_comments
    fromdate = datetime.strptime('Jan 1 2000', '%b %d %Y')
    todate = datetime.strptime('Aug 1 2019', '%b %d %Y')
    # print(datetime.timestamp(fromdate), ' ', datetime.timestamp(todate))
    get_comments(fromdate, todate)
    filter_comment_body()

Given the date range, I assumed I would get around 1000 comments. But I only received 200 comments (2 pages).

【Question Comments】:

  • You should add some code that saves a copy of resp to a file on every iteration. The same goes for data1. That may help you see where the problem lies.
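That debugging suggestion can be sketched as a small helper (the `dump_page` name and the `so_page_N.json` filename pattern are illustrative, not part of the question's code):

```python
import json

def dump_page(data, page_num):
    # Save one page's parsed JSON response to a numbered file so that
    # each iteration of the pagination loop can be inspected afterwards.
    with open('so_page_{}.json'.format(page_num), 'w') as f:
        json.dump(data, f, indent=2)
```

Calling `dump_page(resp.json(), page_num)` right after each request would have made it obvious that only two pages were ever fetched.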

Tags: python-3.x loops stackexchange-api


【Answer 1】:

You requested two pages - and you received two pages.

  1. You fetch the first page
  2. ...then set page_num = 1
  3. Then you check data1['has_more']
    1. If it is True, you increment page_num, download the second page, and return from get_comments.
    2. If not, the code simply returns.

Is that what you intended? I think you meant to keep downloading new pages until data1['has_more'] becomes False.

So the algorithm could look like this:

create an empty list where you want to hold the data
set page_num=1

begin_loop:
    download page number page_num
    append the elements from `data` to the list you created earlier

    if data['has_more'] is False:
        goto return_from_function

    increment page_num
    goto begin_loop

return_from_function:
    process the data in the list created in step 1 and return
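The loop above can be sketched in Python. Splitting pagination from the HTTP call keeps the logic testable; `fetch_all` and `fetch_comments_page` are illustrative names, not part of the original code. Note that the API expects Unix timestamps for `fromdate`/`todate`, so the question's datetime objects would need `.timestamp()` first:

```python
import requests

API_URL = 'https://api.stackexchange.com/2.2/comments'

def fetch_all(fetch_page):
    """Collect items from every page until 'has_more' is False.

    fetch_page(page_num) must return a dict with 'items' and
    'has_more' keys, like a parsed Stack Exchange API response.
    """
    items = []                       # 1. empty list to hold the data
    page_num = 1                     # 2. start at the first page
    while True:
        data = fetch_page(page_num)  # download page number page_num
        items.extend(data['items'])  # keep this page's elements
        if not data['has_more']:     # no further pages: we are done
            return items
        page_num += 1                # otherwise move on to the next page

def fetch_comments_page(page_num, fromdate, todate):
    """One page of /comments; fromdate and todate are Unix timestamps."""
    resp = requests.get(API_URL, params={
        'site': 'stackoverflow',
        'filter': '!1zSn*g7xPU9g6(VDTS7_c',
        'fromdate': int(fromdate),
        'todate': int(todate),
        'pagesize': 100,
        'page': page_num,
    })
    resp.raise_for_status()
    return resp.json()
```

Usage would then be `fetch_all(lambda n: fetch_comments_page(n, fromdate.timestamp(), todate.timestamp()))`. Be aware that for a 19-year range the API will throttle an unauthenticated client well before all pages are fetched, so the loop should also respect any `backoff` field the API returns.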

【Comments】:

  • Silly me. I put a conditional statement where a loop should be. Thanks for pointing that out.