【Question title】: Postgres: Count Rows with a Left Join
【Posted】: 2018-10-01 20:19:21
【Question】:

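I'm trying to do some analysis with Postgres, and I have 2 tables called predictionstate and pageviews.

predictionstate table:

This table has columns holding the results of our algorithm, with the following structure: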

  • id ({company_identifier}:{user_identifier})
  • model (reference to a string value)
  • prediction (float between 0.0 and 1.0)

pageviews table:

This table contains user information, with the following structure:

  • company_identifier
  • user_identifier
  • pageview_current_url_type

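Question

I'm trying to pull data based on our best model in order to analyze its accuracy. Basically, I need to create segments and count how many members each one has. The code below does that: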

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users"
FROM
  ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

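The problem is that I don't know how to do the next step: for each (company, model, segment) I need the accuracy data, which means querying the pageviews table and identifying rows where pageview_current_url_type = 'BUYSUCCESS'.

I tried this, but it didn't work: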

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
  b.n as "converted_users"
FROM
  ranges r,
  (
    SELECT COUNT(DISTINCT(pvs.user_identifier)) as n
    FROM pageviews pvs
    INNER JOIN (
        SELECT
            SPLIT_PART(id, ':', 1) as company_identifier,
            SPLIT_PART(id, ':', 2) as user_identifier
        FROM predictionstate ps
        WHERE prediction BETWEEN r.r_min AND r.r_max ) users
        ON (
            pvs.user_identifier = users.user_identifier AND
            pvs.company_identifier= users.company_identifier) 
        WHERE pageview_current_url_type = 'BUYSUCCESS'

  ) b
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

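TL;DR: I need to compute the count from the JOIN based on the users in the main query.

Edit:

I added an SQL Fiddle: https://www.db-fiddle.com/f/5sQiZD6mHwdnwvVfvL9MAh/0

For those segment_users, I want to know how many have pageview_current_url_type = 'BUYSUCCESS', adding one more column to the result: segmented_really_bought

Edit 2: One more attempt that does not work (ERROR: column "p.id" must appear in the GROUP BY clause or be used in an aggregate function)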

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
  COUNT(b.*) as "converted_users"
FROM
  ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
INNER JOIN (
  SELECT users.company_identifier, COUNT(users.user_identifier) AS n
  FROM pageviews
  INNER JOIN (
    SELECT SPLIT_PART(ps.id, ':', 2) AS user_identifier,
           SPLIT_PART(ps.id, ':', 1) AS company_identifier
    FROM predictionstate ps
    WHERE provider_id=47 AND
          prediction > 0.7
   ) users ON (
      pageviews.user_identifier=users.user_identifier AND
      pageviews.company_identifier=users.company_identifier
    )
  WHERE pageview_current_url_type='BUYSUCCESS'
  GROUP BY users.company_identifier
) AS b
ON (
  b.company_identifier = company_identifier
)
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

Edit 3: Added the desired output

Generated with this code: https://gist.github.com/brunoalano/479265b934a67dc02092fb54a846fe1e

company, model, segment, segment_users, really_bought
company_a, model_a, 0.3-0.4, 1, 3
company_a, model_a, 0.5-0.6, 1, 1
company_a, model_b, 0.2-0.3, 1, 3
company_a, model_c, 0.2-0.3, 1, 1
company_a, model_c, 0.7-0.8, 1, 3
company_b, model_a, 0.3-0.4, 3, 2
company_b, model_b, 0.5-0.6, 2, 1
company_b, model_b, 0.6-0.7, 1, 1
company_b, model_c, 0.5-0.6, 1, 0
company_b, model_c, 0.8-0.9, 1, 1

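【Comments】:

  • 1. Why is your ID a concatenated string? It would be much easier in your code if you had the two columns as the primary key. 2. This looks quite complex. Could you add a sample table and the expected output?
  • @S-Man I created it here: db-fiddle.com/f/5sQiZD6mHwdnwvVfvL9MAh/0
  • What is the expected result for the sample you posted? Please add it to your question.
  • @KamilGosciminski I added the desired output and the code I used to generate it. Sorry about that.
  • My answer seems to be exactly what you're looking for, but I don't know why your output has fewer segments than the data produces.

Tags: sql postgresql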


【Solution 1】:

It's hard to tell what you need without sample output, but I think this is what you are looking for:

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  p.company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(p.user_identifier)) as "segment_users",
  COUNT(CASE WHEN pv.pageview_current_url_type = 'BUYSUCCESS' THEN 1 END) AS segmented_really_bought
FROM
  ranges r
INNER JOIN (
  SELECT
    SPLIT_PART(id, ':', 1) as company_identifier,
    SPLIT_PART(id, ':', 2) as user_identifier,
    model,
    prediction
  FROM
    predictionstate
  ) p ON p.prediction BETWEEN r.r_min AND r.r_max
LEFT JOIN pageviews pv ON 
  p.company_identifier = pv.company_identifier
  AND p.user_identifier = pv.user_identifier
GROUP BY p.company_identifier, p.model, r.segment
ORDER BY p.company_identifier, p.model, r.segment;

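Changes to the fiddle query:

  • predictionstate is replaced by a subquery that we join to, in which the split_part logic extracts the company and user identifiers as separate columns
  • those identifiers are used to LEFT JOIN pageviews
  • a segmented_really_bought column is added, using a COUNT over a CASE expression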
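
To see why the LEFT JOIN and the CASE inside COUNT matter, here is a minimal sketch on inline data. The CTE names p and pv and the user ids u1 and u9 are made up for illustration and stand in for the split predictionstate rows and the pageviews table. A user with no pageviews still produces a group, and COUNT skips the NULL that CASE returns when there is no match, so that group gets 0 instead of disappearing.

```sql
-- Hypothetical inline data: u1 bought once, u9 has no pageviews at all.
WITH p(company_identifier, user_identifier) AS (
  VALUES ('company_a', 'u1'), ('company_b', 'u9')
),
pv(company_identifier, user_identifier, pageview_current_url_type) AS (
  VALUES ('company_a', 'u1', 'BUYSUCCESS')
)
SELECT
  p.company_identifier,
  COUNT(DISTINCT p.user_identifier) AS segment_users,
  -- CASE yields NULL for non-matching rows; COUNT ignores NULLs.
  COUNT(CASE WHEN pv.pageview_current_url_type = 'BUYSUCCESS' THEN 1 END)
    AS segmented_really_bought
FROM p
LEFT JOIN pv
  ON p.company_identifier = pv.company_identifier
 AND p.user_identifier = pv.user_identifier
GROUP BY p.company_identifier
ORDER BY p.company_identifier;
-- company_a | 1 | 1
-- company_b | 1 | 0
```

With an INNER JOIN instead, the company_b row would be dropped, which would lose rows like the "company_b, model_c, 0.5-0.6, 1, 0" line in the desired output.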

【Comments】:

    【Solution 2】:

    demo: db<>fiddle

    WITH ranges AS (
      SELECT
        myrange::text || '-' || (myrange + 0.1)::text AS segment,
        myrange as r_min, myrange + 0.1 as r_max
      FROM generate_series(0.0, 0.9, 0.1) AS myrange
    ), pstate AS (                                         -- A
      SELECT 
        SPLIT_PART(ps.id, ':', 1) AS company_identifier,
        SPLIT_PART(ps.id, ':', 2) AS user_identifier,
        model,
        prediction
      FROM predictionstate ps
    )
    SELECT 
      company_identifier, model, segment,
      COUNT(DISTINCT user_identifier) as segment_users,    -- B
      -- C: 
      COUNT(user_identifier) FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') as really_bought
    FROM pstate ps
    LEFT JOIN ranges r 
    ON prediction BETWEEN r_min AND r_max
    LEFT JOIN pageviews pv 
    USING (company_identifier, user_identifier)
    GROUP BY company_identifier, model, segment
    ORDER BY company_identifier, model, segment
    

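    A: I really recommend splitting the id column into two columns for easier handling. That saves you a lot of string splitting (both when writing queries and when executing them) and is more readable. That is why I added the second CTE.

    B: COUNT(DISTINCT) counts the distinct users in the group.

    C: Counts all users (not distinct), but filters for the expected status before counting.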
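
The difference between B and C can be sketched on inline data. The CTE names and user ids below are hypothetical. Because the join produces one row per pageview, the non-distinct filtered count is a count of BUYSUCCESS pageviews; adding DISTINCT inside the aggregate counts buying users instead.

```sql
-- Hypothetical inline data: u1 has two BUYSUCCESS pageviews, u2 has none.
WITH pstate(company_identifier, user_identifier) AS (
  VALUES ('company_a', 'u1'), ('company_a', 'u2')
),
pv(company_identifier, user_identifier, pageview_current_url_type) AS (
  VALUES ('company_a', 'u1', 'BUYSUCCESS'),
         ('company_a', 'u1', 'BUYSUCCESS'),
         ('company_a', 'u1', 'HOME')
)
SELECT
  company_identifier,
  COUNT(DISTINCT user_identifier) AS segment_users,
  -- one per matching pageview row
  COUNT(user_identifier)
    FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') AS buy_pageviews,
  -- one per user with at least one matching pageview
  COUNT(DISTINCT user_identifier)
    FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') AS buying_users
FROM pstate
LEFT JOIN pv USING (company_identifier, user_identifier)
GROUP BY company_identifier;
-- company_a | 2 | 2 | 1
```

The desired output in the question contains rows where really_bought exceeds segment_users (for example 1 user and 3 purchases), which suggests the per-pageview count is the one being asked for.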


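    I wonder: what if a prediction falls exactly on a boundary, e.g. 0.3? With the BETWEEN clause, that row is joined to both the 0.2-0.3 and the 0.3-0.4 range (because BETWEEN is equivalent to r_min <= x <= r_max). It would be better to define the ranges as r_min <= x < r_max or r_min < x <= r_max. I joined the way you did in your example, but I would prefer to change it. I just don't know in which direction.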
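
A minimal sketch of the half-open variant, assuming the r_min <= x < r_max direction is the one wanted. The sample CTE and its values are made up for illustration; the ranges CTE is the one from the query above.

```sql
WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange AS r_min, myrange + 0.1 AS r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
),
sample(prediction) AS (            -- hypothetical inline data
  VALUES (0.25), (0.3), (0.95), (1.0)
)
SELECT s.prediction, r.segment
FROM sample s
LEFT JOIN ranges r
  -- half-open interval [r_min, r_max): each value matches at most one range
  ON s.prediction >= r.r_min AND s.prediction < r.r_max
ORDER BY s.prediction;
```

Here 0.3 lands only in 0.3-0.4. One consequence of this choice is that a prediction of exactly 1.0 matches no range at all (the LEFT JOIN returns a NULL segment for it), so the last range would have to be closed on the right if such values can occur.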

    【Comments】:
