【Question title】: BigQuery regex: extracting a number from a URL
【Posted】: 2021-06-29 01:32:08
【Question】:

Hello, I'm trying to use BigQuery to extract the 7-digit numbers 2670782 and 2670788 from the data below.

The description field contains the following data:

is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type 8888888 specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 8888888 software like Aldus PageMaker including versions of Lorem Ipsum.

>> https://hello.com/pudding/answer/2670782?hl=en&ref_topic=7072943
>> https://hello.com/pudding/answer/2670788?hl=en&ref_topic=7072943

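I have a query, but the data contains other 7-digit numbers besides 2670782 and 2670788. So I first want to check that the line starts with ">>" and contains "hello.com", and only then extract the number.

Here is my query, but it also picks up 8888888, which it shouldn't.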

SELECT
 description,
 REGEXP_EXTRACT_ALL(description, r"\/(\d{7})") AS num
FROM
 `table`
WHERE
 REGEXP_CONTAINS(description, r"(>> )")
 AND REGEXP_CONTAINS(description, r"(hello.com)")

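I believe I need a single regex that checks that the line starts with >> and contains hello.com, and then extracts the 7 digits after the /. I'm stuck.
Any help would be greatly appreciated!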

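【Comments】:

  • I'm lost. Is the sample data a single line or multiple lines? What does "if the line starts with >>" mean?
  • In KaBoom's answer, try adding (?m) at the start of the regex so that ^ matches at the start of each line as well as at the start of the string.

Tags: regex google-bigquery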


【Solution 1】:

If each of your inputs is a single line, you can use this regex:

^>>.+hello.com.+\/(\d{7})

I tested this regex on regex101.com with your input, assuming each input is a single line.
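As a rough illustration of the single-line case, the query below applies the pattern to two hand-picked lines from the question. The UNNEST array and the `line` alias are assumptions made for this sketch, and the dot in hello.com is escaped here so it only matches a literal dot.

```sql
-- Sketch only: each array element stands in for one single-line input.
SELECT
  line,
  -- ^>> anchors the match to a line that begins with ">>";
  -- the capture group picks the 7 digits that follow a "/".
  REGEXP_EXTRACT(line, r'^>>.+hello\.com.+/(\d{7})') AS num
FROM
  UNNEST([
    'scrambled it to make a type 8888888 specimen book',
    '>> https://hello.com/pudding/answer/2670782?hl=en&ref_topic=7072943'
  ]) AS line;
```

The first line does not start with ">>", so REGEXP_EXTRACT has nothing to match and returns NULL for it. Only the URL line yields a number.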

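Update: you can replace ">>" with a newline and then extract the numbers with the regex below. Because . does not match a newline by default, each URL is then matched on its own line.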

hello.com.+\/(\d{7})

Example:

WITH
  sample AS (
  SELECT
    '''start here not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 8888888 software like Aldus PageMaker including versions of Lorem Ipsum. >> hello.com/pudding/answer/2670782?hl=en&ref_topic=7072943 >> hello.com/pudding/answer/2670788?hl=en&ref_topic=7072943
''' AS txt
  UNION ALL
  SELECT
    '''
is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type 8888888 specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 8888888 software like Aldus PageMaker including versions of Lorem Ipsum.

>> https://hello.com/pudding/answer/2670786?hl=en&ref_topic=7072943
>> https://hello.com/pudding/answer/2670785?hl=en&ref_topic=7072943
'''),
  sample_new_line AS (
  SELECT
    REGEXP_REPLACE(txt, '>>', '\n') AS txt
  FROM
    sample)
SELECT
  REGEXP_EXTRACT_ALL(txt, r"hello.com.+\/(\d{7})") AS num
FROM
  sample_new_line;
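A possible variation, not part of the original answer, is to skip the REGEXP_REPLACE step and match the ">>" marker directly, using \S so that a match cannot run past the end of one URL. This is a sketch under the assumption that each URL contains no whitespace. The shortened sample text and the `sample` CTE name are made up for the illustration.

```sql
-- Sketch only: one row with ">>" in the middle of a line,
-- one row with ">>" at the start of separate lines.
WITH
  sample AS (
  SELECT
    'desktop publishing 8888888 software. >> hello.com/pudding/answer/2670782?hl=en >> hello.com/pudding/answer/2670788?hl=en' AS txt
  UNION ALL
  SELECT
    '''make a type 8888888 specimen book.

>> https://hello.com/pudding/answer/2670786?hl=en&ref_topic=7072943
>> https://hello.com/pudding/answer/2670785?hl=en&ref_topic=7072943
''')
SELECT
  -- ">>", optional whitespace, then a run of non-space characters
  -- containing hello.com and ending in "/" plus 7 digits.
  REGEXP_EXTRACT_ALL(txt, r'>>\s*\S*hello\.com\S*/(\d{7})') AS num
FROM
  sample;
```

Since 8888888 is preceded by neither ">>" nor hello.com, it should be left out of both arrays.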

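【Comments】:

  • It doesn't work when I try a multi-line input here ........ start here not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 8888888 software like Aldus PageMaker including versions of Lorem Ipsum. >> hello.com/pudding/answer/2670782?hl=en&ref_topic=7072943 >> hello.com/pudding/answer/2670788?hl=en&ref_topic=7072943
  • Add (?m) to the start of the regex string. This lets the caret (^) match at the start of the string or at the start of any line.
  • @lipo I've updated my answer with your new test.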
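The multi-line flag suggested in the comments could be tried along these lines. This is a minimal sketch that assumes every URL sits on its own line beginning with ">>". The shortened sample text and the `sample` CTE name are made up for the illustration.

```sql
-- Sketch only: a multi-line value where each URL line starts with ">>".
WITH
  sample AS (
  SELECT
    '''make a type 8888888 specimen book.

>> https://hello.com/pudding/answer/2670786?hl=en&ref_topic=7072943
>> https://hello.com/pudding/answer/2670785?hl=en&ref_topic=7072943
''' AS txt)
SELECT
  -- (?m) switches on multi-line mode, so ^ also matches right after a newline.
  REGEXP_EXTRACT_ALL(txt, r'(?m)^>>.+hello\.com.+/(\d{7})') AS num
FROM
  sample;
```

This approach does not help when ">>" appears in the middle of a line, as in the single-line test from the first comment, because ^ then has no line start to anchor to.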