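【Question title】: How to optimize the following neo4j Cypher query
【Posted】: 2020-02-25 18:44:02
【Question description】:

I am new to Cypher and have the following query to find mismatches between 2 source types (as an example). I believe the query looks fine syntactically, but it takes 1 minute to run on a dataset of only 100,000 nodes. I am not using relationships yet. Can someone help optimize the query? Thanks.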

MATCH (VW_OXSS41:VW_OrderXStatusSummary4{SourceTypeID: "1"}) 
WHERE apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH VW_OXSS41.IdentifierValue as X
MATCH (VW_OXSS42:VW_OrderXStatusSummary4{SourceTypeID: "2"}) 
WHERE apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH apoc.coll.disjunction(COLLECT(X), COLLECT(VW_OXSS42.IdentifierValue)) as XX
UNWIND (XX) as YY

Updated query and error:

WITH apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
       MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"}) 
       WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
       WITH a, b, COLLECT(x.IdentifierValue) AS X
       MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"}) 
       WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
       WITH X, COLLECT(y.IdentifierValue) AS Y
       UNWIND apoc.coll.subtract(X,Y) AS XX
       MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"}) 
       WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
       RETURN XX AS MISMATCHES,MAX(z.TimeStamp);
Variable `a` not defined (line 10, column 7 (offset: 551))
"WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b"

I resolved the error above like this:

WITH apoc.date.parse("2020-02-21",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"}) 
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"}) 
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.subtract(X,Y) AS XX
WITH XX, apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"}) 
WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
AND XX = z.IdentifierValue
RETURN XX AS MISMATCHES,MAX(z.TimeStamp);

The correct expected output is:

+---------------------------------------------+
| MISMATCHES          | TIMESTAMP             |
+---------------------------------------------+
| "W2002201453550218" | "2020-02-21 12:00:16" |
| "W2002201453550222" | "2020-02-21 12:00:16" |
| "W2002201453550223" | "2020-02-21 09:30:36" |
| "W2002201453550224" | "2020-02-21 12:00:16" |
| "W2002201453550226" | "2020-02-21 12:00:16" |
| "W2002201453550227" | "2020-02-21 12:00:16" |
| "W2002201453550237" | "2020-02-21 12:00:16" |
| "3011WOS002978598"  | "2020-02-21 10:00:54" |
| "3011WOS002978595"  | "2020-02-21 13:00:57" |
| "0010000000006183"  | "2020-02-21 16:00:41" |
| "W2002181111547439" | "2020-02-21 04:00:34" |
| "11"                | "2020-02-21 16:00:41" |
| "10112787861P1458"  | "2020-02-21 10:00:54" |
+---------------------------------------------+

Is there a better way to do this?

【Question comments】:

    Tags: neo4j cypher graph-databases neo4j-apoc


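    【Solution 1】:
    1. You need to avoid creating a cartesian product between the results of the two MATCH clauses. Suppose the two MATCH clauses, when executed in their own queries, would normally return N and M nodes respectively. Because of the way your query combines the two MATCH clauses, your second MATCH clause is actually performing N*M matches (and producing N*M result rows).

    2. You need to make sure you have created an index on :VW_OrderXStatusSummary4(SourceTypeID). This will optimize the lookups performed by the MATCH clauses.

    3. You can simplify the Cypher code to avoid repeated function calls.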
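
    As a minimal sketch of point 2, the index could be created with one of the statements below. Which form applies depends on the Neo4j version, which the question does not state, so treat the choice as an assumption to check against your installation.

    ```cypher
    // Neo4j 3.x syntax (deprecated in 4.x, removed in 5.x):
    CREATE INDEX ON :VW_OrderXStatusSummary4(SourceTypeID);

    // Neo4j 4.x and later syntax; run this one instead of the statement above:
    // CREATE INDEX FOR (n:VW_OrderXStatusSummary4) ON (n.SourceTypeID);
    ```

    Prefixing the query with PROFILE or EXPLAIN should then show whether the plan uses an index seek (NodeIndexSeek) rather than a label scan.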

    After creating the above index, try this:

    WITH apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd')) AS b
    MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "1"}) 
    WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
    WITH a, b, COLLECT(x.IdentifierValue) AS X
    MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "2"}) 
    WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
    WITH X, COLLECT(y.IdentifierValue) AS Y
    UNWIND apoc.coll.disjunction(X, Y) AS YY
    ...
    

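    Performing COLLECT(x.IdentifierValue) in the WITH clause that follows the first MATCH collapses the N rows for the x nodes into a single result row holding one list (rather than N result rows). This lets the second MATCH avoid the cartesian product problem.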
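
    The same collect-first pattern might also cover the per-identifier latest timestamp asked about in the question's last edit, without a third MATCH. The following is only a sketch and has not been benchmarked. It assumes every TimeStamp is a string in 'yyyy-MM-dd HH:mm:ss' format (so that MAX on the strings picks the latest one) and it reuses the date range from the question's edit. The column aliases are taken from the expected output shown there.

    ```cypher
    WITH apoc.date.parse("2020-02-21", 's', 'yyyy-MM-dd') AS a,
         apoc.date.parse("2020-02-25", 's', 'yyyy-MM-dd') AS b
    // Collapse all SourceTypeID "1" identifiers in the range into one list (one row).
    MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
    WHERE a <= apoc.date.parse(y.TimeStamp, 's', 'yyyy-MM-dd HH:mm:ss') <= b
    WITH a, b, COLLECT(DISTINCT y.IdentifierValue) AS Y
    // Keep only SourceTypeID "2" nodes whose identifier is missing from that list.
    MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
    WHERE a <= apoc.date.parse(x.TimeStamp, 's', 'yyyy-MM-dd HH:mm:ss') <= b
      AND NOT x.IdentifierValue IN Y
    // MAX groups implicitly by the non-aggregated column, giving one row per identifier.
    RETURN x.IdentifierValue AS MISMATCHES, MAX(x.TimeStamp) AS TIMESTAMP;
    ```

    One caveat: if no SourceTypeID "1" node falls inside the date range, the first MATCH produces no rows and the whole query returns nothing, so that case would need separate handling.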

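    【Comments】:

    • Thank you very much, I see what you mean. I have one more question: suppose I want to show MAX(TimeStamp) for the YY dataset, which approach should I follow? I did the following but got an error: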
    • neo4j> WITH apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"}) WHERE a …
    • … b WITH X, COLLECT(y.IdentifierValue) AS Y UNWIND apoc.coll.subtract(X,Y) AS XX MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"}) WHERE a … Variable `a` not defined (line 10, column 7 (offset: 551)) "WHERE a …
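    • Your last WITH clause does not include a and b. More importantly, however, you have now reintroduced another cartesian product (which is completely unnecessary anyway). See the update to my answer.
    • Thanks for the edit @cybersam, but this max_ts is actually not correct. I do not want the max TS over the complete dataset of YY, but rather for each record of YY, because YY property values can repeat within the label :VW_OrderXStatusSummary4. Please see my last edit to my question.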