BigQuery Javascript UDF fails with "Resources Exceeded"

【Question title】: BigQuery Javascript UDF fails with "Resources Exceeded"
【Posted】: 2016-04-04 13:31:28
【Question】:

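This question may be another instance of

BigQuery UDF memory exceeded error on multiple rows but works fine on single row

but I was advised to post it as a question rather than as an answer.

I am using JavaScript to parse a log file into a table. The JavaScript parsing function is fairly simple. It works on 1M rows but fails on 3M rows. Log files can be much larger than 3M rows, so the failure is a problem.

The function is below.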

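function parseLogRow(row, emit) {
    // Concatenate the three imported columns into one log line.
    var r = (row.logrow ? row.logrow : "") + (row.l2 ? " " + row.l2 : "") + (row.l3 ? " " + row.l3 : "");
    var ts = null;
    var category = null;
    var user = null;
    var message = null;
    var db = null;
    var seconds = null;
    var found = false;
    if (r) {
        // Groups: 1 = timestamp, 3 = category, 4 = user, 5 = db, 7 = optional seconds, 8 = message.
        var m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] ::( \(([\d\.]+)s\))? (.*)/);
        if (m) {
            ts = new Date(m[1]) * 1;  // Date coerced to a number (epoch milliseconds)
            category = m[3] || null;
            user = m[4] || null;
            db = m[5] || null;
            seconds = m[7] || null;
            message = m[8] || null;
            found = true;
        } else {
            // Lines that do not match are kept whole in message.
            message = r;
            found = false;
        }
    }

    emit({
        ts: ts,
        category: category,
        user: user,
        db: db,
        seconds: seconds * 1.0,  // note: null * 1.0 evaluates to 0
        message: message,
        found: found
    });
}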


  bigquery.defineFunction(
    'parseLogRow',                           // Name of the function exported to SQL
    ['logrow',"l2","l3"],                    // Names of input columns
    [
      {'name': 'ts', 'type': 'float'},  // Output schema
      {'name': 'category', 'type': 'string'},
      {'name': 'user', 'type': 'string'},
      {'name': 'db', 'type': 'string'},
      {'name': 'seconds', 'type': 'float'},
      {'name': 'message', 'type': 'string'},
      {'name': 'found', 'type': 'boolean'},
    ],
    parseLogRow                          // Reference to JavaScript UDF
  );
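
The regular expression in parseLogRow can be checked outside BigQuery before running it on millions of rows. The following is a minimal sketch for plain Node.js, not part of the original UDF: matchLogLine is a hypothetical helper, the third sample line (with a "(0.25s)" duration) is made up for illustration, and the timestamp is kept as text rather than converted with new Date.

```javascript
// Same pattern as in parseLogRow, copied unchanged.
var LOG_PATTERN = /^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] ::( \(([\d\.]+)s\))? (.*)/;

// Hypothetical helper: returns the captured fields, or null when the line does not match.
function matchLogLine(line) {
  var m = line.match(LOG_PATTERN);
  if (!m) {
    return null;
  }
  return {
    tsText: m[1],                             // raw timestamp text, not parsed here
    category: m[3] || null,
    user: m[4] || null,
    db: m[5] || null,
    seconds: m[7] ? parseFloat(m[7]) : null,  // kept as null when absent (the UDF emits 0 instead)
    message: m[8] || null
  };
}

var samples = [
  // Header line from the sample data: not expected to match.
  "# Logfile created on 2015-12-29 00:00:09 -0800 by logger.rb/v1.2.7",
  // Regular line from the sample data.
  "2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true)",
  // Made-up line exercising the optional "(Ns)" duration group.
  "2015-12-29 00:00:09.262 -0800 [INFO|7aaa0|] :: (0.25s) Did something"
];

samples.forEach(function (line) {
  console.log(JSON.stringify(matchLogLine(line)));
});
```

Each row is parsed independently of every other row, which is why this part of the job can be distributed.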

I reference the function with this query:

SELECT
    ROW_NUMBER() OVER() as row_num,
    ts,category,user,
    db,seconds,message,found,
FROM parseLogRow((SELECT * FROM[#{dataset}.today]
      LIMIT 1000000
    ))

Some sample data from the "today" table looks like this (as CSV):

logrow,l2,l3
# Logfile created on 2015-12-29 00:00:09 -0800 by logger.rb/v1.2.7,,
2015-12-29 00:00:09.262 -0800 [INFO|7aaa0|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:09.277 -0800 [DEBUG|7aaa0|] :: Restarted gulp process,,
2015-12-29 00:00:09.278 -0800 [INFO|7aaa0|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:19.396 -0800 [INFO|7aaa4|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:19.409 -0800 [DEBUG|7aaa4|] :: Restarted gulp process,,
2015-12-29 00:00:19.410 -0800 [INFO|7aaa4|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.487 -0800 [INFO|7aaa6|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.500 -0800 [DEBUG|7aaa6|] :: Restarted gulp process,,
2015-12-29 00:00:29.500 -0800 [INFO|7aaa6|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.597 -0800 [INFO|7aaa8|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.610 -0800 [DEBUG|7aaa8|] :: Restarted gulp process,,
2015-12-29 00:00:39.611 -0800 [INFO|7aaa8|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:44.659 -0800 [DEBUG|7aaaa|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:49.687 -0800 [INFO|7aaac|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:49.689 -0800 [DEBUG|7aaac|] :: Restarted gulp process,,
2015-12-29 00:00:49.689 -0800 [INFO|7aaac|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.869 -0800 [INFO|7aaae|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.871 -0800 [DEBUG|7aaae|] :: Restarted gulp process,,

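This is a bit of a hack: I import the log as a 3-column table (really just one column) by loading it as CSV with the delimiter set to tab (we normally have no tabs in the log files), and then use the query to turn it into the table I actually want.

I like this pattern because the parsing is fast and distributed (when it works).

The failed job is bigquery-looker:bquijob_260be029_153dd96cfdb. Please contact me if you need a reproducible case.

Any help or suggestions would be appreciated.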

【Comments】:

Tags: google-bigquery udf

【Solution 1】:

Point 1
I don't see a problem with the UDF; it works for me even on 10M rows.
I think the problem is the use of ROW_NUMBER() OVER(). Remove it and the query should work!

SELECT
  ts,category,user,
  db,seconds,message,found,
FROM parseLogRow((SELECT * FROM[#{dataset}.today]
))

Point 2
From a performance standpoint, the query below should run faster (I think). In general, I recommend avoiding UDFs wherever "plain" BQ works just as well.

SELECT 
  REGEXP_EXTRACT(logrow, r'^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[[^|]*\|[^|]*\|[^\]]*\] :: .*') AS ts,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[([^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS category,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|([^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS user,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|([^\]]*)\] :: (?:.*)') AS db,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (.*)') AS message,
  REGEXP_MATCH(logrow, r'^((?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*))') AS found
FROM (
  SELECT logrow +IFNULL(' ' + l2, '') + IFNULL(' ' + l3, '') AS logrow 
  FROM YourTable    
)
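
The patterns in Point 2 each keep exactly one capturing group and turn the others into non-capturing (?:...) groups, so each call returns a single field. As a rough local check, the sketch below emulates that behaviour in plain JavaScript. regexpExtract is a hypothetical stand-in written for this example, not a BigQuery API, and it assumes the patterns behave the same in JavaScript as in BigQuery's regex engine, which seems reasonable for this simple syntax but is not verified here. The second sample line (with a "(0.25s)" duration) is made up.

```javascript
// Hypothetical stand-in for REGEXP_EXTRACT: returns capture group 1, or null when there is no match.
function regexpExtract(text, pattern) {
  var m = text.match(pattern);
  return m ? m[1] : null;
}

// The category and message patterns from the query, rewritten as JavaScript regex literals.
var CATEGORY_PATTERN = /^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[([^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*)/;
var MESSAGE_PATTERN = /^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (.*)/;

// Line taken from the sample data in the question.
var plainLine = "2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true)";
// Made-up line carrying the optional "(Ns)" duration that the UDF pattern handles.
var timedLine = "2015-12-29 00:00:09.262 -0800 [INFO|7aaa0|] :: (0.25s) Did something";

console.log(regexpExtract(plainLine, CATEGORY_PATTERN));
console.log(regexpExtract(plainLine, MESSAGE_PATTERN));
console.log(regexpExtract(timedLine, MESSAGE_PATTERN));
```

One difference from the UDF shows up with the made-up timed line: these patterns have no group for the optional duration, so the "(0.25s)" prefix stays inside the extracted message and no seconds column is produced.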

【Comments】:

Thanks, you're right. It isn't the JavaScript, it's the window function. More information here: stackoverflow.com/questions/33247703/…