【Question title】: How to get all the words that start with a certain character in BigQuery
【Posted】: 2020-03-18 07:06:08
【Question description】:

I have a text column in a BigQuery table. Sample records for this column look like this:

with temp as 
(
select 1 as id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" as input
union all
select 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)

select *
from temp

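I want to extract all the words in this column that start with #. Later I want to know the frequency of these terms. How can I do this in BigQuery?

I expect my output to look like this: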

id, word
1, united
1, community
2, coronavirus

【Question comments】:

    Tags: regex google-bigquery


    【Solution 1】:

    The following is for BigQuery Standard SQL.

    "I want to extract all the words in this column that start with #"

    #standardSQL
    WITH temp AS (
      SELECT 1 AS id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
      SELECT 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
    )
    SELECT id, word
    FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word   
    

    This produces the output:

    Row id  word     
    1   1   united   
    2   1   community    
    3   2   Coronavirus    
    

    "Later I want to know the frequency of these terms"

    #standardSQL
    SELECT word, COUNT(1) frequency
    FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
    GROUP BY word
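
The frequency query above counts `Coronavirus` and `coronavirus` as different terms, while the expected output in the question shows the hashtag in lowercase. A minimal sketch, assuming case-insensitive counts are what is wanted (the `hashtag` alias and the `ORDER BY` clause are additions for illustration, not part of the original answer):

```sql
#standardSQL
WITH temp AS (
  SELECT 1 AS id, "as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
  SELECT 2, "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT
  LOWER(word) AS hashtag,   -- assumption: fold case so #Coronavirus and #coronavirus count together
  COUNT(1) AS frequency
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) AS word
GROUP BY hashtag
ORDER BY frequency DESC, hashtag   -- most frequent first, ties in alphabetical order
```

The regular expression is unchanged from the answer; only the grouping key differs.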
    

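    【Comments】:

      【Solution 2】:

      You can do this without regular expressions by splitting the text into words and then keeping the ones that start the way you want. For example: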

      SELECT
        id,
        ARRAY(SELECT TRIM(x, "#") FROM UNNEST(SPLIT(input, ' ')) as x WHERE STARTS_WITH(x,'#')) str
      FROM
        temp
      

      If you want the hashtags as separate rows, you can make it a bit more compact:

      SELECT
        id,
        TRIM(x, "#") str
      FROM
        temp,
        UNNEST(SPLIT(input, ' ')) x
      WHERE
        STARTS_WITH(x,'#')
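
The row-per-hashtag query above can also feed the frequency count the question asks for. A rough sketch under the same sample data (the `hashtag` and `frequency` aliases are chosen here for illustration):

```sql
#standardSQL
WITH temp AS (
  SELECT 1 AS id, "as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
  SELECT 2, "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT
  TRIM(x, "#") AS hashtag,   -- strips '#' from both ends of the token
  COUNT(*) AS frequency
FROM temp, UNNEST(SPLIT(input, ' ')) AS x   -- splits on single spaces only
WHERE STARTS_WITH(x, '#')
GROUP BY hashtag
ORDER BY frequency DESC, hashtag
```

One difference from the regex version: `SPLIT(input, ' ')` breaks the text only on spaces, so a hashtag preceded by a tab or a newline would not be picked up, whereas `\s` in the regular expression matches any whitespace.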
      

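      【Comments】:

      • Instead of x LIKE '#%', I would recommend STARTS_WITH(x, '#')
      • Done. I'm curious why this is preferred. Is it more efficient, or does it just look nicer?
      • It is just my personal recommendation; I feel it is more efficient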