【Question Title】: cron job throwing DeadlineExceededError
【Posted】: 2017-08-14 15:34:47
【Question】:

I'm currently working on a Google Cloud project in free-trial mode. I have a cron job that fetches data from a data vendor and stores it in the Datastore. I wrote the fetching code a few weeks ago and everything worked fine, but all of a sudden I started receiving the error "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded" over the last two days. I believe a cron job should only time out after 60 minutes, so any idea why I'm getting this error?

Cron task

def run():
  try:
    config = cron.config
    actual_data_source = config['xxx']['xxxx']
    original_data_source = actual_data_source

    company_list = cron.rest_client.load(config, "companies", '')

    if not company_list:
        logging.info("Company list is empty")
        return "Ok"

    for row in company_list:
        company_repository.save(row, original_data_source, actual_data_source)

    return "OK"
  except Exception:
    logging.exception("run() experienced an error")
    raise

Repository code

def save(dto, org_ds, act_dp):
  try:
    key = 'FIN/%s' % (dto['ticker'])
    company = CompanyInfo(id=key)
    company.stock_code = key
    company.ticker = dto['ticker']
    company.name = dto['name']
    company.original_data_source = org_ds
    company.actual_data_provider = act_dp
    company.put()
    return company
  except Exception:
    logging.exception("company_repository: error occurred saving the company record")
    raise

RestClient

import base64
import json
import logging
from datetime import datetime

from google.appengine.api import urlfetch

def load(config, resource, filter):
  try:
    username = config['xxxx']['xxxx']
    password = config['xxxx']['xxxx']
    headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}

    if filter:
        from_date = filter['from']
        to_date = filter['to']
        ticker = filter['ticker']
        start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
        end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")

    current_page = 1
    data = []

    while True:
      if filter:
        url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
      else:
        url = config['xxxx']["endpoints"][resource] % (current_page)

      response = urlfetch.fetch(
          url=url,
          deadline=60,
          method=urlfetch.GET,
          headers=headers,
          follow_redirects=False,
      )
      if response.status_code != 200:
          logging.error("xxxx GET received status code %d!" % (response.status_code))
          logging.error("error happened for url: %s with headers %s", url, headers)
          return 'Sorry, xxxx API request failed', 500

      db = json.loads(response.content)

      if not db['data']:
          break

      data.extend(db['data'])

      if db['total_pages'] == current_page:
          break

      current_page += 1

    return data
  except Exception:
    logging.exception("Error occurred with xxxx API request")
    raise

【Question Comments】:

  • Assuming you're not being blocked or rate-limited as @momus suggested, consider dispatching a task to perform the save for each iteration of the while loop in the load function. That way you don't have to wait until load finishes before the Datastore updates start. You could also consider using ndb.put_multi instead of calling put() on each entity (see the sketch after these comments).
  • Related (yes, I know it's really a different question): stackoverflow.com/questions/45594018/…
  • Which scaling type and instance class does the service handling these cron requests use?
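
For reference, a minimal sketch of the batching idea from the first comment, assuming CompanyInfo is the ndb model used in the repository code above; save_all is a hypothetical helper, not part of the original code:

from google.appengine.ext import ndb

def save_all(rows, org_ds, act_dp):
    # Build all entities first, then write them in one batched RPC
    # instead of one synchronous put() per company.
    entities = []
    for dto in rows:
        key = 'FIN/%s' % (dto['ticker'])
        company = CompanyInfo(id=key)
        company.stock_code = key
        company.ticker = dto['ticker']
        company.name = dto['name']
        company.original_data_source = org_ds
        company.actual_data_provider = act_dp
        entities.append(company)
    ndb.put_multi(entities)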

Tags: google-app-engine google-cloud-datastore google-app-engine-python


【Solution 1】:

My guess is that this is the same problem as this question, but now with more code: DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded

I've modified your code to write to the Datastore after every urlfetch. If there are more pages, it relaunches itself in a deferred task, which should happen well before the 10-minute timeout.

An uncaught exception in a deferred task causes it to retry, so be mindful of that.
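
If a particular failure should not be retried, the deferred library provides deferred.PermanentTaskFailure to mark the task as done despite the error. A minimal sketch (run_once is a hypothetical wrapper around the run() below, and the choice of ValueError is illustrative):

from google.appengine.ext import deferred

def run_once(current_page):
    try:
        run(current_page)
    except ValueError:
        # Raising PermanentTaskFailure logs the error but tells the
        # deferred library the task is finished, so it is not retried.
        raise deferred.PermanentTaskFailure('unrecoverable input on page %d' % current_page)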

It wasn't clear to me how actual_data_source and original_data_source work, but I think you should be able to adapt that part.

Cron task

from google.appengine.ext import deferred

def run(current_page=0):
  try:
    config = cron.config
    actual_data_source = config['xxx']['xxxx']
    original_data_source = actual_data_source

    data, more = cron.rest_client.load(config, "companies", '', current_page)

    for row in data:
        company_repository.save(row, original_data_source, actual_data_source)

    # fetch the rest in a new deferred task, outside this request's deadline
    if more:
        deferred.defer(run, current_page + 1)
  except Exception as e:
    logging.exception("run() experienced an error: %s" % e)

RestClient

def load(config, resource, filter, current_page):
    try:
        username = config['xxxx']['xxxx']
        password = config['xxxx']['xxxx']
        headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}

        if filter:
            from_date = filter['from']
            to_date = filter['to']
            ticker = filter['ticker']
            start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
            end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")

            url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
        else:
            url = config['xxxx']["endpoints"][resource] % (current_page)

        response = urlfetch.fetch(
            url=url,
            deadline=60,
            method=urlfetch.GET,
            headers=headers,
            follow_redirects=False,
        )
        if response.status_code != 200:
            logging.error("xxxx GET received status code %d!" % (response.status_code))
            logging.error("error happened for url: %s with headers %s", url, headers)
            return [], False

        db = json.loads(response.content)

        # return this page's rows plus a flag for whether more pages remain
        return db['data'], (db['total_pages'] != current_page)

    except Exception as e:
        logging.exception("Error occurred with xxxx API request: %s" % e)
        return [], False
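
One hypothetical way to wire up the entry point so that the cron request only processes the first page and the deferred chain handles the rest (the handler class and URL here are assumptions, not part of the answer's code):

import webapp2

class CronCompaniesHandler(webapp2.RequestHandler):
    def get(self):
        # The cron request itself only handles page 0;
        # deferred.defer(run, n + 1) chains the remaining pages
        # in separate tasks, outside this request's deadline.
        run(current_page=0)
        self.response.write('OK')

app = webapp2.WSGIApplication([('/cron/companies', CronCompaniesHandler)])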

【Comments】:

【Solution 2】:

I would prefer to write this as a comment, but I need more reputation to do that.

1. What happens when you run the actual data fetch directly, instead of through the cron job?
2. Have you tried measuring the time delta from the start to the end of the job? (A sketch follows this list.)
3. Has the number of companies being retrieved increased dramatically?
4. You appear to be doing some form of stock quote aggregation - could it be that the provider has started blocking you?
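
For point 2, a minimal way to capture that time delta without changing the job itself (run_timed is a hypothetical wrapper around the question's run()):

import logging
import time

def run_timed():
    started = time.time()
    try:
        run()
    finally:
        # Logged even if run() raises, so runs that hit the
        # deadline still leave a timing record.
        logging.info("cron run took %.1f seconds", time.time() - started)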

【Comments】:
