Querying a Cassandra table to fetch the last N records

【Question title】: Querying a Cassandra table to fetch the last N records
【Posted】: 2021-12-24 15:01:47
【Question description】:

The table schema is as follows:

 image_id | activity_time | image_location
----------+---------------+--------------------
      243 |    2021-10-22 | remotelocation_243
      243 |    2021-10-25 | remotelocation_243
       88 |    2021-10-12 |  remotelocation_88
      215 |    2021-10-20 | remotelocation_215
      215 |    2021-10-21 | remotelocation_215
      215 |    2021-10-22 | remotelocation_215
      215 |    2021-10-25 | remotelocation_215
       80 |    2021-10-12 |  remotelocation_80
      248 |    2021-10-20 | remotelocation_248
      248 |    2021-10-21 | remotelocation_248
      248 |    2021-10-22 | remotelocation_248
      248 |    2021-10-25 | remotelocation_248
      234 |    2021-10-20 | remotelocation_234
      234 |    2021-10-21 | remotelocation_234
      234 |    2021-10-22 | remotelocation_234
      234 |    2021-10-25 | remotelocation_234
       11 |    2021-10-12 |  remotelocation_11
      501 |    2021-10-22 | remotelocation_501
        1 |    2021-10-12 |   remotelocation_1
      509 |    2021-10-22 | remotelocation_509
       78 |    2021-10-12 |  remotelocation_78
       96 |    2021-10-12 |  remotelocation_96
      539 |    2021-10-22 | remotelocation_539

I want to fetch the last N records based on activity_time. I have read the following:

Cassandra + Fetch the last records using in query

Error creating table in cassandra - Bad Request: Only clustering key columns can be defined in CLUSTERING ORDER directiv

Order latest records by timestamp in Cassandra

However, I found that ORDER BY needs some kind of WHERE clause before it returns results.

I just want to do something like this:

select * from table_name order by activity_time desc limit 20;

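Any help is much appreciated. Thanks in advance.

【Comments】:

What is the primary key definition?
Primary key - image_id.
That doesn't look right. Primary keys in Cassandra are unique, and I see multiple rows for 248, 215, 243, etc.

Tags: python cassandra cassandra-3.0

【Solution 1】: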

select * from table_name order by activity_time desc limit 20;

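This is what's known as an "unbound" query (a SELECT without a WHERE). It is a known Cassandra anti-pattern: without a WHERE clause filtering on the partition key, every node in the cluster gets contacted. In a large cluster, that can mean more than 100 nodes. In addition, one node (called the "coordinator") has to prepare and return the result, and coordinator nodes have crashed over queries like this.

With the current table structure, the best you can do is query the last N records for a single image_id.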
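As a rough sketch of that per-image query from Python: the key definition is never stated in the question, so this assumes `PRIMARY KEY (image_id, activity_time)`, which is what the duplicate image_id values suggest. The `latest_for_image` helper and the `LATEST_FOR_IMAGE_CQL` constant are hypothetical names; `session` is assumed to behave like a cassandra-driver `Session`, whose `execute(query, parameters)` accepts `%s` placeholders for non-prepared statements.

```python
# Assumption: the real key is PRIMARY KEY (image_id, activity_time).
# ORDER BY on a clustering column is only allowed once the partition key
# (image_id here) is restricted in the WHERE clause.
LATEST_FOR_IMAGE_CQL = (
    "SELECT * FROM table_name "
    "WHERE image_id = %s "
    "ORDER BY activity_time DESC "
    "LIMIT %s"
)


def latest_for_image(session, image_id, n):
    """Return up to n of the newest rows for a single image_id.

    `session` is any object with an execute(query, parameters) method,
    for example a cassandra-driver Session.
    """
    return list(session.execute(LATEST_FOR_IMAGE_CQL, (image_id, n)))
```

This only touches one partition, so it avoids the cluster-wide scan, but it cannot answer "latest across all images".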

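To answer your actual question, you first need to copy the data into a table designed to support that query. Since you mainly care about recent data, it probably makes sense to partition the data by a time "bucket".

The actual unit of time to use as the "bucket" will depend on your business requirements. Given the data set shown above, I'll use "month" in this example.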

CREATE TABLE stackoverflow.images_by_month (
    month int,
    activity_time date,
    image_id int,
    image_location text,
    PRIMARY KEY (month, activity_time, image_id)
) WITH CLUSTERING ORDER BY (activity_time DESC, image_id ASC);
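Copying rows into this table means deriving the `month` value for each one. A minimal sketch, assuming the bucket is encoded as a `YYYYMM` integer (as in `month=202110`); the helper names `month_bucket` and `to_bucketed_row` are made up for illustration:

```python
from datetime import date


def month_bucket(activity_time: date) -> int:
    """Encode a date as a YYYYMM integer bucket."""
    return activity_time.year * 100 + activity_time.month


def to_bucketed_row(image_id: int, activity_time: date, image_location: str):
    """Build a (month, activity_time, image_id, image_location) tuple,
    matching the column order of images_by_month."""
    return (month_bucket(activity_time), activity_time, image_id, image_location)


# One row from the original table, reshaped for images_by_month.
print(to_bucketed_row(215, date(2021, 10, 25), "remotelocation_215"))
```

Each tuple could then be bound to an `INSERT INTO images_by_month (...)` statement by whatever client you use.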

Now I can do this:

> SELECT * FROM images_by_month WHERE month=202110 LIMIT 10;

 month  | activity_time | image_id | image_location
--------+---------------+----------+--------------------
 202110 |    2021-10-25 |      215 | remotelocation_215
 202110 |    2021-10-25 |      234 | remotelocation_234
 202110 |    2021-10-25 |      243 | remotelocation_243
 202110 |    2021-10-25 |      248 | remotelocation_248
 202110 |    2021-10-22 |      215 | remotelocation_215
 202110 |    2021-10-22 |      234 | remotelocation_234
 202110 |    2021-10-22 |      243 | remotelocation_243
 202110 |    2021-10-22 |      248 | remotelocation_248
 202110 |    2021-10-21 |      215 | remotelocation_215
 202110 |    2021-10-21 |      234 | remotelocation_234

Note that I specified CLUSTERING ORDER on the table, so I don't need to deal with it in the query. The data simply comes off disk in that order.
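To make that ordering concrete, here is an in-memory illustration of what `CLUSTERING ORDER BY (activity_time DESC, image_id ASC)` means within one partition. It is only an analogy (Cassandra keeps rows in this order on disk rather than sorting at read time), and the four sample rows are picked from the question's data:

```python
from datetime import date

# (activity_time, image_id) pairs in arbitrary insertion order.
rows = [
    (date(2021, 10, 22), 243),
    (date(2021, 10, 25), 248),
    (date(2021, 10, 25), 215),
    (date(2021, 10, 21), 234),
]


def clustering_key(row):
    """Sort key equivalent to (activity_time DESC, image_id ASC)."""
    activity_time, image_id = row
    return (-activity_time.toordinal(), image_id)


ordered = sorted(rows, key=clustering_key)
for activity_time, image_id in ordered:
    print(activity_time.isoformat(), image_id)
```

Newest dates come first, and rows sharing a date are ordered by ascending image_id.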

If month puts too many rows in each partition, try a smaller time unit such as week or even day.
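One consequence of bucketing is that the newest bucket may hold fewer than N rows, in which case the remainder has to come from earlier buckets. A sketch of that loop, assuming `YYYYMM` integer buckets; `fetch_bucket` is a hypothetical callable that would run something like `SELECT * FROM images_by_month WHERE month=? LIMIT ?` and return a list of rows, and `max_buckets` is an arbitrary safety cap:

```python
def previous_month(bucket: int) -> int:
    """Step a YYYYMM bucket back by one month (202201 -> 202112)."""
    year, month = divmod(bucket, 100)
    if month == 1:
        return (year - 1) * 100 + 12
    return year * 100 + (month - 1)


def fetch_last_n(fetch_bucket, start_bucket: int, n: int, max_buckets: int = 12):
    """Collect up to n rows, newest bucket first.

    fetch_bucket(bucket, limit) is assumed to return at most `limit` rows
    from one partition, already in clustering order.
    """
    rows = []
    bucket = start_bucket
    for _ in range(max_buckets):
        if len(rows) >= n:
            break
        rows.extend(fetch_bucket(bucket, n - len(rows)))
        bucket = previous_month(bucket)
    return rows
```

Every call still hits exactly one partition, so this stays a series of cheap single-partition reads rather than one cluster-wide scan.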

【Comments】:

Thanks a lot. This helped me understand and solve the problem at hand! Much appreciated.