【Question title】: Are the parquet files generated by IBM Db2 Event Store readable by a standard reader?
【Posted】: 2025-12-30 12:30:10
【Question description】:

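I am reading the IBM Db2 Event Store documentation and came across the following statement: "The data in the shared area is stored in the standard Parquet format and can be queried through Db2 Event Store or by any other system capable of reading Parquet." Where are the files located? Can I read them with a standard parquet file reader?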

【Question discussion】:

    Tags: db2 parquet ibm-event-store


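    【Solution 1】:

    IBM Db2 Event Store supports both shared file systems (such as NFS) and Cloud Object Storage (COS) for storing the shared files. The directory structure is similar in both cases, so I will describe the COS version.

    Using the AWS S3 client, you can list the files in the COS bucket. Here I am querying a test bucket in SoftLayer's COS and showing only the handful of parquet files I have there. As the listing shows, the files all sit under a directory named after the database, EVENTDB (the default database name), followed by the database identifier (DB00000000), an internal identifier (TS00000003), then data, and finally the table identifier (t0000000000). Inside that directory are two subdirectories, preShared and shared, which hold the parquet files. The shared files are the final set of parquet files, produced by consolidating the preShared files. All files are immutable, but it is best to read from the shared stream, because preShared files are purged once they have been consolidated into it.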

    $ aws s3 --endpoint=https://s3.us-east.cloud-object-storage.appdomain.cloud  ls --recursive s3://mytestbucket/ | grep -i parquet
    2019-06-17 05:08:05       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000001.0.0.i000000000000.parquet
    2019-06-17 03:32:27       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000006.0.0.i000000000000.parquet
    2019-06-17 03:32:27       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000008.0.0.i000000000000.parquet
    2019-06-17 03:32:27       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000015.0.0.i000000000000.parquet
    2019-06-17 03:32:27       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000018.0.0.i000000000000.parquet
    2019-06-17 03:32:27       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000022.0.0.i000000000000.parquet
    2019-06-17 05:08:05       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000027.0.0.i000000000000.parquet
    2019-06-17 05:08:05       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000031.0.0.i000000000000.parquet
    2019-06-17 03:32:27       1004 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/preShared/s000035.0.0.i000000000000.parquet
    2019-06-17 05:15:49       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000001.0.0.t000000000000.parquet
    2019-06-17 05:15:50       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000006.0.0.t000000000000.parquet
    2019-06-17 05:15:51       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000008.0.0.t000000000000.parquet
    2019-06-17 05:15:50       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000015.0.0.t000000000000.parquet
    2019-06-17 05:15:50       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000018.0.0.t000000000000.parquet
    2019-06-17 05:15:49       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000022.0.0.t000000000000.parquet
    2019-06-17 05:15:50       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000027.0.0.t000000000000.parquet
    2019-06-17 05:15:49       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000031.0.0.t000000000000.parquet
    2019-06-17 05:15:51       1057 ibm/htap/db/db2inst1/SHARED_DATA/EVENTDB/DB00000000/TS00000003/data/t0000000000/shared/s000035.0.0.t000000000000.parquet
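
The directory layout described above can be picked apart programmatically, for example to keep only the files under shared. The following is a minimal sketch and not part of Db2 Event Store: the helper names `parse_key` and `shared_keys` and the field names are hypothetical, and the pattern is inferred solely from the listing above, so other deployments may differ.

```python
import re
from typing import List, NamedTuple, Optional

# Pattern inferred from the listing above; an assumption, not a documented format.
KEY_RE = re.compile(
    r"SHARED_DATA/(?P<database>[^/]+)/(?P<db_id>DB\d+)/(?P<internal_id>TS\d+)"
    r"/data/(?P<table_id>t\d+)/(?P<stage>preShared|shared)/(?P<file>[^/]+\.parquet)$"
)


class ParquetKey(NamedTuple):
    database: str     # e.g. EVENTDB
    db_id: str        # e.g. DB00000000
    internal_id: str  # e.g. TS00000003
    table_id: str     # e.g. t0000000000
    stage: str        # "preShared" or "shared"
    file: str         # parquet file name


def parse_key(key: str) -> Optional[ParquetKey]:
    """Split an object key into its path components, or return None if it does not match."""
    m = KEY_RE.search(key)
    return ParquetKey(**m.groupdict()) if m else None


def shared_keys(listing: str) -> List[str]:
    """Return the keys under shared/ from `aws s3 ls --recursive` style output."""
    keys = []
    for line in listing.splitlines():
        parts = line.strip().split(None, 3)  # date, time, size, key
        if len(parts) != 4:
            continue
        parsed = parse_key(parts[3])
        if parsed is not None and parsed.stage == "shared":
            keys.append(parts[3])
    return keys
```

Passing the text of the listing above to `shared_keys` would return only the keys in the shared subdirectory, which are the ones that remain after the preShared files are purged.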
    

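    All of these files are standard parquet files and can be read with a standard parquet reader such as Apache parquet-tools (https://github.com/apache/parquet-mr/tree/master/parquet-tools). Invoking parquet-tools is straightforward. For example, after downloading one of the files above from COS with the standard AWS S3 client, you can view its metadata: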

    $ java -jar /usr/local/parquet-tools/parquet-tools-1.8.2-SNAPSHOT.jar meta s000035.0.0.t000000000000.parquet

    The output will look similar to the following, which shows two internal columns (__beginTime and __prevRID) followed by the user-created column c1:

    file:        file:s000035.0.0.t000000000000.parquet
      creator:     Apache parquet-cpp
      extra:       LastShardLSN = 0:4764
      extra:       FirstBeginTime = 1541121073074
    
      file schema: T1
      --------------------------------------------------------------------------------
      __beginTime: REQUIRED INT64 O:INT_64 R:0 D:0
      __prevRID:   OPTIONAL INT64 O:INT_64 R:0 D:1
      c1:          REQUIRED INT64 R:0 D:0
    
      row group 1: RC:1000 TS:8149 OFFSET:4
      --------------------------------------------------------------------------------
      __beginTime:  INT64 SNAPPY DO:0 FPO:4 SZ:4062/8044/1.98 VC:1000 ENC:RLE,PLAIN
      __prevRID:    INT64 SNAPPY DO:0 FPO:4066 SZ:32/30/0.94 VC:1000 ENC:RLE,PLAIN
      c1:           INT64 SNAPPY DO:0 FPO:4098 SZ:4055/8044/1.98 VC:1000 ENC:RLE,PLAIN
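
As a quick sanity check before handing a downloaded file to a full reader, one can verify the basic Parquet framing: an unencrypted Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The sketch below checks only that framing and does not decode the schema or the data; `looks_like_parquet` is a hypothetical helper name.

```python
import os
import struct

MAGIC = b"PAR1"


def looks_like_parquet(path: str) -> bool:
    """Check the leading/trailing PAR1 magic and that the footer length is plausible."""
    size = os.path.getsize(path)
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        tail = f.read(4)
    # The footer must fit between the leading magic and the 8-byte trailer.
    return head == MAGIC and tail == MAGIC and footer_len <= size - 12
```

For instance, `looks_like_parquet("s000035.0.0.t000000000000.parquet")` could be run on the file downloaded above; a full reader such as parquet-tools is still needed to inspect the schema and rows.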
    

    【Discussion】: