Title: Cassandra - >500MB CSV file produces a ~50MB table?
Posted: 2015-12-16 23:17:11
Question:

I'm new to Cassandra and trying to understand how sizing works. I created a keyspace and a table, then wrote a Java program that generates one million rows into a CSV file, which I inserted into my database. The CSV file is about 545 MB. After loading it, I ran the nodetool cfstats command and got the output below. It reports total space used as 50555052 bytes (~50 MB). How can that be? With the overhead of indexes, columns, and so on, how can my total data end up smaller than the raw CSV data (not just smaller, but an order of magnitude smaller)? Maybe I'm misreading the output, but does this look right? I'm running Cassandra 2.2.1 on a single machine.

Table: users
        SSTable count: 1
        Space used (live): 50555052
        Space used (total): 50555052
        Space used by snapshots (total): 0
        Off heap memory used (total): 1481050
        SSTable Compression Ratio: 0.03029072054256705
        Number of keys (estimate): 984133
        Memtable cell count: 240336
        Memtable data size: 18385704
        Memtable off heap memory used: 0
        Memtable switch count: 19
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1000000
        Local write latency: 0.044 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1192632
        Bloom filter off heap memory used: 1192624
        Index summary off heap memory used: 203778
        Compression metadata off heap memory used: 84648
        Compacted partition minimum bytes: 643
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 770
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0
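A back-of-envelope check (not from the original post) is possible from the cfstats numbers alone, assuming SSTable Compression Ratio means compressed size divided by uncompressed size:

```java
// Rough size estimate from the cfstats output above.
// Assumption: SSTable Compression Ratio = compressed bytes / uncompressed bytes.
public class SizeEstimate {
    public static void main(String[] args) {
        long keys = 984_133L;        // "Number of keys (estimate)"
        long meanPartition = 770L;   // "Compacted partition mean bytes"
        double ratio = 0.0303;       // "SSTable Compression Ratio"

        long uncompressed = keys * meanPartition;         // ~758 MB of raw partition data
        long compressed = (long) (uncompressed * ratio);  // ~23 MB after compression

        System.out.printf("uncompressed ~ %d MB, compressed ~ %d MB%n",
                uncompressed / 1_000_000, compressed / 1_000_000);
    }
}
```

Roughly 23 MB of compressed data, plus the other SSTable components (partition index, bloom filter, compression metadata) and rows still sitting in the memtable, lands in the same ballpark as the reported ~50 MB.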

My Java code that generates the CSV file looks like this:

try {
    FileWriter writer = new FileWriter(sFileName);
    for (int i = 0; i < 1000000; i++) {
        writer.append("Username " + i);
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("myfakeemailaccnt@email.com");
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("tr");
        writer.append('\n');
    }
    writer.flush();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}

Comments:

  • Not sure what your data looks like, but if your CSV is full of commas and quotes you could see some savings
  • I'm new to Cassandra too, and I just copied in a ~14GB CSV of ~23M records with 50 fields each. Cassandra tells me it only takes about 158MB on disk. I'm waiting for it to replicate across my nodes, then I'll run some queries to make sure it's all there.

Tags: cassandra


Answer 1:

So I looked at the three largest pieces of data:

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ

and realized they were all identical, so maybe Cassandra was compressing them, even though the reported compression ratio of 0.03 means the data was squeezed down to about 3% of its original size. So I changed my Java code to generate distinct data instead.

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;

public class Main {

    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static void main(String[] args) {
        generateCassandraCSVData("users.csv");
    }

    public static String randomAlphaNumeric(int count) {
        StringBuilder builder = new StringBuilder();
        while (count-- != 0) {
            int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(character));
        }
        return builder.toString();
    }

    public static void generateCassandraCSVData(String sFileName) {
        java.util.Date date = new java.util.Date();

        try {
            FileWriter writer = new FileWriter(sFileName);
            for (int i = 0; i < 1000000; i++) {
                writer.append("Username " + i);
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append("myfakeemailaccnt@email.com");
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append("tr");
                writer.append('\n');
            }
            writer.flush();
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

So now those three big columns hold random strings instead of identical values. Here's what it produces now:

Table: users
        SSTable count: 4
        Space used (live): 554671040
        Space used (total): 554671040
        Space used by snapshots (total): 0
        Off heap memory used (total): 1886175
        SSTable Compression Ratio: 0.6615549506522498
        Number of keys (estimate): 1019477
        Memtable cell count: 270024
        Memtable data size: 20758095
        Memtable off heap memory used: 0
        Memtable switch count: 25
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1323546
        Local write latency: 0.048 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1533512
        Bloom filter off heap memory used: 1533480
        Index summary off heap memory used: 257175
        Compression metadata off heap memory used: 95520
        Compacted partition minimum bytes: 311
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 686
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

So now the CSV file is ~550MB again, and my table is also ~550MB. Does that mean Cassandra compresses non-key column data this efficiently when it's identical (low cardinality)? If so, this is a really important concept to keep in mind when modeling a database (and one I'd never read about before), because you can save an enormous amount of storage.
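This effect is easy to reproduce outside Cassandra. Below is a minimal sketch using the JDK's java.util.zip.Deflater (Cassandra 2.2 defaults to LZ4 compression, but the principle is the same): many copies of one token compress to almost nothing, while the same volume of random alphanumeric text barely compresses. The token string and sizes here are illustrative, not taken from the original post.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class CompressDemo {
    // Returns the DEFLATE-compressed size of the input, in bytes.
    static int compressedSize(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);   // we only need the byte count, not the output
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        // Case 1: one token repeated 10,000 times, like the first CSV's identical columns.
        String token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkw";
        byte[] repetitive = token.repeat(10_000).getBytes();

        // Case 2: the same volume of random alphanumerics, like the second CSV.
        String alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        Random rnd = new Random();
        StringBuilder sb = new StringBuilder(repetitive.length);
        for (int i = 0; i < repetitive.length; i++) {
            sb.append(alpha.charAt(rnd.nextInt(alpha.length())));
        }
        byte[] random = sb.toString().getBytes();

        System.out.printf("repetitive ratio: %.4f, random ratio: %.4f%n",
                (double) compressedSize(repetitive) / repetitive.length,
                (double) compressedSize(random) / random.length);
    }
}
```

The repetitive input compresses to a tiny fraction of its size, while the random input stays around two-thirds of its size, in the same ballpark as the 0.03 and 0.66 SSTable compression ratios cfstats reported for the two data sets.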

Comments:
