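Equivalent of string contains in Google BigQuery: answers

【Question title】: Equivalent of string contains in google bigquery
【Posted】: 2019-10-02 08:23:56
【Question description】:

I have a table as shown below

I would like to create two new binary columns indicating whether the subject had steroids and aspirin. I want to achieve this in PostgreSQL and Google BigQuery.

I tried the following, but it does not work: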

select subject_id
case when lower(drug) like ('%cortisol%','%cortisone%','%dexamethasone%') 
then 1 else 0 end as steroids,
case when lower(drug) like ('%peptide%','%paracetamol%') 
then 1 else 0 end as aspirin,
from db.Team01.Table_1


SELECT 
db.Team01.Table_1.drug
FROM `table_1`,
UNNEST(table_1.drug) drug
WHERE REGEXP_CONTAINS( db.Team01.Table_1.drug,r'%cortisol%','%cortisone%','%dexamethasone%')

I expect my output to look like the following

【Question comments】:

Tags: sql postgresql google-bigquery

【Solution 1】:

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  subject_id,
  SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'cortisol|cortisone|dexamethasone') THEN 1 ELSE 0 END) AS steroids,
  SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'peptide|paracetamol') THEN 1 ELSE 0 END) AS aspirin
FROM `db.Team01.Table_1`
GROUP BY subject_id

If applied to the sample data from your question, the result is

Row subject_id  steroids    aspirin  
1   1           3           1    
2   2           1           1

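Note: instead of plain LIKE, which ends up with lengthy and redundant text, I am using "LIKE on steroids", that is, REGEXP_CONTAINS. The alternatives are separated with | and need no % wildcards, because REGEXP_CONTAINS already matches anywhere in the string.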
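
To illustrate why the pattern is written with `|` rather than `%`, here is a minimal sketch in Python. It assumes that `re.search` is a fair stand-in for a partial regex match on these simple patterns; the sample drug names are hypothetical and not taken from the question's table.

```python
import re

# Hypothetical sample values, not the data from the question.
drugs = ["Cortisol 10mg", "hydrocortisone cream", "Paracetamol", "ibuprofen"]

# Alternation: matches if any of the substrings occurs anywhere in the value.
steroid_pattern = re.compile(r"cortisol|cortisone|dexamethasone")


def is_steroid(drug: str) -> int:
    """Return 1 if the lowercased drug name contains any steroid substring."""
    return 1 if steroid_pattern.search(drug.lower()) else 0


flags = [is_steroid(d) for d in drugs]

# '%' is a literal character in a regular expression, so a LIKE-style
# pattern only matches values that really contain percent signs.
like_style = re.compile(r"%cortisol%")
like_style_hits = [d for d in drugs if like_style.search(d.lower())]

print(flags)
print(like_style_hits)
```

The LIKE-style pattern finds nothing here, which is probably why the second attempt in the question returned no rows even apart from its syntax problems.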

【Comments】:

【Solution 2】:

In Postgres, I would recommend the filter clause:

select subject_id,
       count(*) filter (where lower(drug) ~ 'cortisol|cortisone|dexamethasone') as steroids,
       count(*) filter (where lower(drug) ~ 'peptide|paracetamol') as aspirin
from db.Team01.Table_1
group by subject_id;

In BigQuery, I would recommend countif():

select subject_id,
       countif(regexp_contains(drug, 'cortisol|cortisone|dexamethasone')) as steroids,
       countif(regexp_contains(drug, 'peptide|paracetamol')) as aspirin
from db.Team01.Table_1
group by subject_id;

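You can use sum(case when . . . end) as a more generic approach. However, each database has a more "native" way of expressing this logic. By the way, the FILTER clause is standard SQL, just not widely adopted.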
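
As a rough illustration of the generic sum(case when . . . end) form, the sketch below runs it against an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in chosen so the example is runnable; the table name `table_1` and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows (subject_id, drug); not the data from the question.
rows = [
    (1, "Cortisol"),
    (1, "cortisone acetate"),
    (1, "Paracetamol"),
    (2, "ibuprofen"),
    (2, "Dexamethasone"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (subject_id INTEGER, drug TEXT)")
conn.executemany("INSERT INTO table_1 VALUES (?, ?)", rows)

# Conditional aggregation: each row contributes 1 or 0 to the per-subject sum.
query = """
SELECT subject_id,
       SUM(CASE WHEN lower(drug) LIKE '%cortisol%'
                  OR lower(drug) LIKE '%cortisone%'
                  OR lower(drug) LIKE '%dexamethasone%'
                THEN 1 ELSE 0 END) AS steroids,
       SUM(CASE WHEN lower(drug) LIKE '%peptide%'
                  OR lower(drug) LIKE '%paracetamol%'
                THEN 1 ELSE 0 END) AS aspirin
FROM table_1
GROUP BY subject_id
ORDER BY subject_id
"""
counts = conn.execute(query).fetchall()
print(counts)
```

Note that this form counts matching rows per subject, so a value can exceed 1; it is a count rather than a strict binary flag.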

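【Comments】:

'countif()' is an aggregate function, which is more resource-intensive. It seems this code does not work: countif(drug ~ ' 'peptide|paracetamol')
@Gordon Linoff, could you help me use the % symbol in your statement? where regexp_contains(drug,'(?i)dexamethasone'|'(?i)cortisone'|'(?i)cortisol') -- This looks for string presence in the `drug` column. But what if I want to look for items that end with %sone or start with corti ALONG WITH THE CONTAINS CONDITION? Would it be something like WHERE LOWER(CAST(DRUG AS BYTES)) LIKE b'corti%' OR LOWER(CAST(DRUG AS BYTES)) LIKE b'%sone' OR regexp_contains(drug,'(?i)dexamethasone'|'(?i)cortisone'|'(?i)cortisol')
@SSMK . . . There is no need for % in either database. If you want to use like specifically for this, perhaps you should ask another question.

【Solution 3】:

Use conditional aggregation. Here is a solution that works on most (if not all) RDBMS: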

SELECT
    subject_id,
    MAX(CASE WHEN drug IN ('cortisol', 'cortisone', 'dexamethasone') THEN 1 END) steroids,
    MAX(CASE WHEN drug IN ('peptide', 'paracetamol') THEN 1 END) aspirin
FROM db.Team01.Table_1
GROUP BY subject_id

Note: it is unclear why you are using LIKE, since you seem to have exact matches; I turned the LIKE conditions into equalities.
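
One detail worth checking with MAX(CASE WHEN ... THEN 1 END) is what happens for a subject with no matching row: without an ELSE branch the CASE yields NULL, so the aggregate is NULL rather than 0. The sketch below shows both variants in SQLite via Python's `sqlite3`; the table name and rows are hypothetical, and SQLite is used only so the example can be run.

```python
import sqlite3

# Hypothetical sample rows; subject 2 has no aspirin-like drug.
rows = [
    (1, "cortisol"),
    (1, "cortisone"),
    (1, "paracetamol"),
    (2, "dexamethasone"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (subject_id INTEGER, drug TEXT)")
conn.executemany("INSERT INTO table_1 VALUES (?, ?)", rows)

query = """
SELECT subject_id,
       -- No ELSE branch: NULL when the subject has no matching row.
       MAX(CASE WHEN drug IN ('peptide', 'paracetamol') THEN 1 END) AS aspirin_nullable,
       -- ELSE 0: always 0 or 1, i.e. a true binary flag.
       MAX(CASE WHEN drug IN ('peptide', 'paracetamol') THEN 1 ELSE 0 END) AS aspirin
FROM table_1
GROUP BY subject_id
ORDER BY subject_id
"""
flags = conn.execute(query).fetchall()
print(flags)
```

If the output column must be strictly 0 or 1, adding ELSE 0 (or wrapping the expression in COALESCE) is one way to get there.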

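【Comments】:

But no, steroid is the column I have to create. I have to use values such as cortisol, cortisone, etc. If you look at my query in the post, you will get the idea
Do you know why the LIKE operator cannot be used? Is only the IN clause supported inside CASE WHEN?
I think IN should be supported. If not, you can switch to ORed equalities, like: `drug = 'peptide' OR drug = 'paracetamol'`.

【Solution 4】:

You missed the group by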

select subject_id,
    sum(case when lower(drug) in ('cortisol','cortisone','dexamethasone')
       then 1 else 0 end) as steroids,
    sum(case when lower(drug) in ('peptide','paracetamol') 
       then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id

Using the like keyword

select subject_id,
 sum(case when lower(drug) like '%cortisol%'
        or lower(drug) like '%cortisone%'
        or lower(drug) like '%dexamethasone%'   
    then 1 else 0 end) as steroids,
    sum(case when lower(drug) like '%peptide%'
        or lower(drug) like '%paracetamol%'
    then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id

【Comments】:

I used sum() because I thought that is what you want to achieve
I am getting this error: No matching signature for operator LIKE for argument types: STRING, STRUCT. Supported signatures: STRING LIKE STRING; BYTES LIKE BYTES at [2:19]
Well, is the LIKE operator not supported?
Added a version using the like operator
You can also try lower(drug) similar to '%(cortisol|cortisone|dexamethasone)%'

【Solution 5】:

Another, perhaps more intuitive, solution is to use BigQuery's CONTAINS_SUBSTR, which returns a boolean result.
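
The answer gives no example, so here is a rough Python emulation of the idea rather than actual BigQuery code. It assumes CONTAINS_SUBSTR behaves as a normalized, case-insensitive substring test; the helper names `contains_substr` and `steroid_flag` and the sample values are hypothetical.

```python
import unicodedata

# Substrings from the question that mark a drug as a steroid.
STEROIDS = ("cortisol", "cortisone", "dexamethasone")


def _normalize(text: str) -> str:
    """Normalize and case-fold so the comparison ignores case."""
    return unicodedata.normalize("NFKC", text).casefold()


def contains_substr(value: str, search: str) -> bool:
    """Hypothetical emulation: case-insensitive substring test."""
    return _normalize(search) in _normalize(value)


def steroid_flag(drug: str) -> int:
    """Return 1 if any steroid substring occurs in the drug name, else 0."""
    return int(any(contains_substr(drug, s) for s in STEROIDS))


# Hypothetical sample values, not the data from the question.
results = [steroid_flag(d) for d in ["Hydrocortisone", "DEXAMETHASONE 4mg", "aspirin"]]
print(results)
```

Because each call tests a single search value, several substrings have to be combined with OR (here, `any`), which is where the regex alternation from the other answers stays more compact.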

【Comments】:

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review