从谷歌分析 API v4 下载批处理报告答案

【问题标题】：Downloading batch reports from google analytics API v4从谷歌分析 API v4 下载批处理报告
【发布时间】：2020-01-16 07:49:14
【问题描述】：

我正在尝试获取一份为期 3 个月的报告，为此，我需要发出多个请求并将结果附加到列表中，因为 API 仅返回每个请求的 100,000 行。 API 返回了一个名为nextPageToken 的变量，我需要将其传递到下一个查询中以获取报告的下一个100,000 行。我很难做到这一点。

这是我的代码：

def initialize_analyticsreporting():
    '''Initializes an Analytics Reporting API V4 service object.

    Returns:
      An authorized Analytics Reporting API V4 service object.
    '''
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)

    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)

    return analytics


list = [] 




def get_report(analytics, pageTokenVariable):
    return analytics.reports().batchGet(
        body={
            'reportRequests': [
                {
                    'viewId': VIEW_ID,
                    'pageSize': 100000,
                    'dateRanges': [{'startDate': '90daysAgo', 'endDate': 'yesterday'}],
                    'metrics': [{'expression': 'ga:adClicks'}, {'expression': 'ga:impressions'}, {'expression': 'ga:adCost'}, {'expression': 'ga:CTR'}, {'expression': 'ga:CPC'}, {'expression': 'ga:costPerTransaction'}, {'expression': 'ga:transactions'}, {'expression': 'ga:transactionsPerSession'}, {'expression': 'ga:pageviews'}, {'expression': 'ga:timeOnPage'}],
                    "pageToken": pageTokenVariable,
                    'dimensions': [{'name': 'ga:adMatchedQuery'}, {'name': 'ga:campaign'}, {'name': 'ga:adGroup'}, {'name': 'ga:adwordsCustomerID'}, {'name': 'ga:date'}],
                    'orderBys': [{'fieldName': 'ga:impressions', 'sortOrder': 'DESCENDING'}],
                    'dimensionFilterClauses': [{

                        'filters': [{

                            'dimension_name': 'ga:adwordsCustomerID',
                            'operator': 'EXACT',
                            'expressions': 'abc',
                            'not': 'True'
                        }]
                    }],
                    'dimensionFilterClauses': [{

                        'filters': [{

                            'dimension_name': 'ga:adMatchedQuery',
                            'operator': 'EXACT',
                            'expressions': '(not set)',
                            'not': 'True'
                        }]
                    }]
                }]
        }
    ).execute()


analytics = initialize_analyticsreporting()
response = get_report(analytics, "0")

for report in response.get('reports', []):
    pagetoken = report.get('nextPageToken', None)
    print(pagetoken)
    #------printing the pagetoken here returns `100,000` which is expected

    columnHeader = report.get('columnHeader', {})
    dimensionHeaders = columnHeader.get('dimensions', [])
    metricHeaders = columnHeader.get(
        'metricHeader', {}).get('metricHeaderEntries', [])
    rows = report.get('data', {}).get('rows', [])

    for row in rows:
        # create dict for each row
        dict = {}
        dimensions = row.get('dimensions', [])
        dateRangeValues = row.get('metrics', [])

        # fill dict with dimension header (key) and dimension value (value)
        for header, dimension in zip(dimensionHeaders, dimensions):
            dict[header] = dimension

        # fill dict with metric header (key) and metric value (value)
        for i, values in enumerate(dateRangeValues):
            for metric, value in zip(metricHeaders, values.get('values')):
                # set int as int, float a float
                if ',' in value or ',' in value:
                    dict[metric.get('name')] = float(value)
                else:
                    dict[metric.get('name')] = float(value)
        list.append(dict)
      # Append that data to a list as a dictionary

# pagination function

    while pagetoken:  # This says while there is info in the nextPageToken get the data, process it and add to the list

        response = get_report(analytics, pagetoken)
        pagetoken = response['reports'][0]['nextPageToken']
        print(pagetoken)
        #------printing the pagetoken here returns `200,000` as is expected but the data being pulled is the same as for the first batch and so on. While in the loop the pagetoken is being incremented but it does not retrieve new data
        for row in rows:
                # create dict for each row
            dict = {}
            dimensions = row.get('dimensions', [])
            dateRangeValues = row.get('metrics', [])

            # fill dict with dimension header (key) and dimension value (value)
            for header, dimension in zip(dimensionHeaders, dimensions):
                dict[header] = dimension

            # fill dict with metric header (key) and metric value (value)
            for i, values in enumerate(dateRangeValues):
                for metric, value in zip(metricHeaders, values.get('values')):
                    # set int as int, float a float
                    if ',' in value or ',' in value:
                        dict[metric.get('name')] = float(value)
                    else:
                        dict[metric.get('name')] = float(value)
            list.append(dict)

df = pd.DataFrame(list)
print(df)  # Append that data to a list as a dictionary
df.to_csv('full_dataset.csv', encoding="utf-8", index=False)

我试图传递 pagetoken 的错误在哪里？

Here is the documentation for the pageToken from Google.

【问题讨论】：

看看这篇关于pagination的文章。您需要查询report 以获得nextPageToken，然后将其作为pageToken 包含在后续请求中。无论您请求多少，API 每次请求最多返回 100,000 行。

标签： python google-analytics-api google-reporting-api

【解决方案1】：

所以您正在更新 pagetoken = response['reports'][0]['nextPageToken'] 中的页面令牌，但您不应该在 while 循环中使用新数据更新 rows 吗？

类似的东西。

    while pagetoken:
        response = get_report(analytics, pagetoken)
        pagetoken = response['reports'][0].get('nextPageToken')
        for report in reponse.get('reports', []):
            rows = report.get('data', {}).get('rows', [])
            for row in rows:

【讨论】：

我不知道，这就是我寻找答案的原因，文档没有正确解释。它只声明更新pageToken。
添加了代码示例。无法对其进行测试，但对我来说 rows 需要更新，否则您将一遍又一遍地处理相同的数据。
现在我在response = get_report(analytics, pagetoken) 线上收到KeyError: 'nextPageToken'。现在数据似乎是正确的，当nextPageToken 不再存在时它不会停止。但为什么？我确实有while pagetoken 条件。
谢谢，成功了！所以据我了解：问题是即使我有权利nextPageToken 我曾经查询过同一份报告？而使用get 和pagetoken 的区别在于如果找不到它就不会抛出错误？
正确。 get 返回 None 而不是引发异常。