【Question title】:Remove duplicate rows according to an attribute in Google BigQuery SQL
【Posted】:2017-05-09 09:04:20
【Question】:

I have a table called result, and I am using BigQuery to select data from GA:

SELECT
  Date,
  totals.pageviews,
  h.transaction.transactionId,
  h.item.itemQuantity,
  h.transaction.transactionRevenue,
  totals.bounces,
  fullvisitorid,
  totals.timeOnSite,
  device.browser,
  device.deviceCategory,
  trafficSource.source,
  channelGrouping,
  h.page.pagePath,
  h.eventInfo.eventCategory,
  device.operatingSystem
FROM
  `atomic-life-148403.126959513.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
  AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
ORDER BY
  date DESC

Some records are duplicated. How can I remove the duplicate records from the table?

I would like to get the following result.

【Comments】:

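  • Do you actually want to find and delete the rows, or just hide them from the query result? If the latter, use DISTINCT. If the former, it gets a bit more complicated.
  • How can I select only the distinct rows? The itemQuantity and revenue values are separate.

Tags: sql google-bigquery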


【Solution 1】:

You can use ROW_NUMBER:

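WITH CTE AS (
  -- Number the rows within each transactionid; rn = 1 marks the row to keep.
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY transactionid) AS rn
  FROM [YourTable]
)
-- SQL Server style: deleting through the CTE removes only the extra copies.
-- BigQuery does not accept DELETE on a CTE.
DELETE FROM CTE
WHERE rn > 1;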
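
Since the question is about BigQuery, the same ROW_NUMBER idea can be sketched there by rewriting the table instead of deleting from it. This is only a sketch under assumptions: `mydataset.result` is a placeholder name, and the table is assumed to have a `transactionid` column as in the answer above.

```sql
#standardSQL
-- Hypothetical table name: replace mydataset.result with your own dataset.table.
CREATE OR REPLACE TABLE mydataset.result AS
SELECT * EXCEPT (rn)          -- drop the helper column from the output
FROM (
  SELECT
    *,
    -- Number the rows inside each transactionid group.
    ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY transactionid) AS rn
  FROM mydataset.result
)
WHERE rn = 1;                 -- keep one row per transactionid
```

Because the ORDER BY inside the window uses the partition column itself, which row survives in each group is arbitrary. Table options such as partitioning or clustering may need to be restated in the CREATE statement.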

【Comments】:

    【Solution 2】:

    You can use an analytic function such as ROW_NUMBER():

    select * from (
    select *,
    ROW_NUMBER() OVER(PARTITION BY transactionid ORDER BY transactionid) rownum
    from result ) xxx
    where rownum = 1;
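
To make the effect of the window's ORDER BY concrete, here is a small self-contained sketch. The sample rows are made up for illustration and do not come from the question; the column names follow the fields selected in it.

```sql
#standardSQL
-- Made-up sample rows standing in for the result table.
WITH result AS (
  SELECT 'T1' AS transactionId, 2 AS itemQuantity, 100 AS transactionRevenue UNION ALL
  SELECT 'T1', 2, 100 UNION ALL   -- exact copy of the first row
  SELECT 'T1', 3, 100 UNION ALL   -- same transaction, different quantity
  SELECT 'T2', 1, 50
)
SELECT * EXCEPT (rownum)
FROM (
  SELECT
    *,
    -- Ordering by itemQuantity DESC decides which row gets rownum = 1.
    ROW_NUMBER() OVER (PARTITION BY transactionId ORDER BY itemQuantity DESC) AS rownum
  FROM result
)
WHERE rownum = 1
ORDER BY transactionId;
```

With these rows the query keeps ('T1', 3, 100) and ('T2', 1, 50). Ordering by the partition column itself, as in the answer above, would still return one row per transaction, but which T1 row is kept would not be defined.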
    

    【Comments】:

      【Solution 3】:

      The following is for BigQuery Standard SQL:

      #standardSQL
      SELECT DISTINCT
        Date,
        totals.pageviews,
        h.transaction.transactionId,
        h.item.itemQuantity,
        h.transaction.transactionRevenue,
        totals.bounces,
        fullvisitorid,
        totals.timeOnSite,
        device.browser,
        device.deviceCategory,
        trafficSource.source,
        channelGrouping,
        h.page.pagePath,
        h.eventInfo.eventCategory,
        device.operatingSystem
      FROM
        `atomic-life-148403.126959513.ga_sessions_*`,
        UNNEST(hits) AS h
      WHERE
        _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
        AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
      ORDER BY
        date DESC
      

      As you can see, I just added DISTINCT to your SELECT. See more about SELECT and its modifiers in BigQuery Standard SQL.

      【Comments】:

        【Solution 4】:

        You can select the unique rows and delete the others:
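
A short sketch of what DISTINCT does and does not collapse may help here, since a comment on the question notes that itemQuantity and revenue differ between rows. The sample rows below are invented for illustration only.

```sql
#standardSQL
-- Made-up sample rows; none of these values come from the question.
WITH result AS (
  SELECT 'T1' AS transactionId, 2 AS itemQuantity, 100 AS transactionRevenue UNION ALL
  SELECT 'T1', 2, 100 UNION ALL   -- exact copy of the first row
  SELECT 'T1', 3, 100 UNION ALL   -- same transaction, different quantity
  SELECT 'T2', 1, 50
)
SELECT DISTINCT transactionId, itemQuantity, transactionRevenue
FROM result
ORDER BY transactionId, itemQuantity;
```

This returns three rows: ('T1', 2, 100), ('T1', 3, 100) and ('T2', 1, 50). DISTINCT removes only rows that match in every selected column, so two rows for the same transaction with different quantities both remain.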

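        -- BigQuery's DELETE has no JOIN form, so one option is to rewrite the
        -- table with only its distinct rows. MyTable stands for dataset.table.
        -- SELECT DISTINCT * does not work on tables with STRUCT or ARRAY columns.
        CREATE OR REPLACE TABLE MyTable AS
        SELECT DISTINCT * FROM MyTable;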
        

        【Comments】:

          【Solution 5】:

          Using GROUP BY on all selected columns should eliminate any truly duplicate rows from the result:

          SELECT
            Date,
            totals.pageviews,
            h.transaction.transactionId,
            h.item.itemQuantity,
            h.transaction.transactionRevenue,
            totals.bounces,
            fullvisitorid,
            totals.timeOnSite,
            device.browser,
            device.deviceCategory,
            trafficSource.source,
            channelGrouping,
            h.page.pagePath,
            h.eventInfo.eventCategory,
            device.operatingSystem
          FROM
            `atomic-life-148403.126959513.ga_sessions_*`,
            UNNEST(hits) AS h
          WHERE
            _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
            AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
          GROUP BY
            Date,
            totals.pageviews,
            h.transaction.transactionId,
            h.item.itemQuantity,
            h.transaction.transactionRevenue,
            totals.bounces,
            fullvisitorid,
            totals.timeOnSite,
            device.browser,
            device.deviceCategory,
            trafficSource.source,
            channelGrouping,
            h.page.pagePath,
            h.eventInfo.eventCategory,
            device.operatingSystem
          ORDER BY
            date  DESC;
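
The same GROUP BY can also be used to inspect which rows are duplicated before removing anything. The following is a minimal sketch on invented sample rows, not data from the question.

```sql
#standardSQL
-- Made-up sample rows standing in for the query result.
WITH result AS (
  SELECT 'T1' AS transactionId, 2 AS itemQuantity, 100 AS transactionRevenue UNION ALL
  SELECT 'T1', 2, 100 UNION ALL   -- exact copy of the first row
  SELECT 'T1', 3, 100 UNION ALL   -- same transaction, different quantity
  SELECT 'T2', 1, 50
)
SELECT
  transactionId,
  itemQuantity,
  transactionRevenue,
  COUNT(*) AS copies              -- how many identical rows fell into this group
FROM result
GROUP BY transactionId, itemQuantity, transactionRevenue
HAVING COUNT(*) > 1;
```

Only the group ('T1', 2, 100) appears, with copies = 2. Dropping the HAVING clause gives one row per distinct combination, which matches what the GROUP BY query above produces.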
          

          【Comments】:
