How to get all the words that start with a certain character in BigQuery: answers

【Question title】: How to get all the words that start with a certain character in BigQuery
【Posted】: 2020-03-18 07:06:08
【Question】:

I have a text column in a BigQuery table. Sample records for that column look like this:

with temp as 
(
select 1 as id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" as input
union all
select 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)

select *
from temp

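I want to extract all the words in this column that start with #. Later I also want to know the frequency of these terms. How can I do this in BigQuery?

My expected output looks like this: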

id, word
1, united
1, community
2, coronavirus

【Comments】:

Tags: regex google-bigquery

【Solution 1】:

The following is for BigQuery Standard SQL.

"I want to extract all the words in this column that start with #":

#standardSQL
WITH temp AS (
  SELECT 1 AS id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
  SELECT 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT id, word
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word

which produces the output:

Row id  word     
1   1   united   
2   1   community    
3   2   Coronavirus

"Later I want to know the frequency of these terms" (run against the same temp CTE as above):

#standardSQL
SELECT word, COUNT(1) frequency
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
GROUP BY word
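Note that the expected output in the question lists "coronavirus" in lowercase, while the extraction above returns "Coronavirus" as written in the text. If differently cased hashtags should be counted as the same term, one possible variant (a sketch only; the LOWER call, the hashtag alias, and the ORDER BY are additions not in the original answer) normalizes case before grouping:

```sql
#standardSQL
WITH temp AS (
  SELECT 1 AS id, "as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
  SELECT 2, "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT
  LOWER(word) AS hashtag,   -- assumption: '#Coronavirus' and '#coronavirus' count as one term
  COUNT(1) AS frequency
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) AS word
GROUP BY hashtag
ORDER BY frequency DESC, hashtag  -- most frequent hashtags first
```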

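【Comments】:

【Solution 2】:

You can do this without a regular expression by splitting the text into words and then keeping only the words that start with the character you want. For example: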

SELECT
  id,
  ARRAY(SELECT TRIM(x, "#") FROM UNNEST(SPLIT(input, ' ')) as x WHERE STARTS_WITH(x,'#')) str
FROM
  temp

If you want each hashtag on its own row, you can write it more compactly:

SELECT
  id,
  TRIM(x, "#") str
FROM
  temp,
  UNNEST(SPLIT(input, ' ')) x
WHERE
  STARTS_WITH(x,'#')
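The same split-based approach can also feed the frequency count asked for in the question. The following is a minimal sketch under a few assumptions that are not in the original answer: it uses LTRIM instead of TRIM so that only leading # characters are removed, lowercases the result, and introduces the hashtag and frequency aliases. Because SPLIT breaks only on a single space, a token with punctuation attached (for example "#united,") would keep that punctuation.

```sql
#standardSQL
WITH temp AS (
  SELECT 1 AS id, "as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
  SELECT 2, "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT
  LOWER(LTRIM(x, '#')) AS hashtag,  -- strip only the leading '#', then normalize case
  COUNT(*) AS frequency
FROM temp, UNNEST(SPLIT(input, ' ')) AS x
WHERE STARTS_WITH(x, '#')           -- keep only tokens that begin with '#'
GROUP BY hashtag
ORDER BY frequency DESC, hashtag
```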

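【Comments】:

Instead of x LIKE '#%', I would recommend STARTS_WITH(x, '#').
Done. I'm curious why this is preferred. Is it more efficient, or does it just look better?
It's just my personal recommendation; my feeling is that it is more efficient.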
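For reference, the two predicates discussed in these comments can be placed side by side. This sketch only shows that both conditions flag the same tokens on the sample data; it does not measure performance, and the column aliases are illustrative:

```sql
#standardSQL
WITH temp AS (
  SELECT 1 AS id, "as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
  SELECT 2, "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT
  id,
  x AS word_token,
  x LIKE '#%' AS matches_like,                 -- pattern-match form
  STARTS_WITH(x, '#') AS matches_starts_with   -- prefix-function form
FROM temp, UNNEST(SPLIT(input, ' ')) AS x
WHERE x LIKE '#%' OR STARTS_WITH(x, '#')
```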