Title: Cassandra - >500MB CSV file produces a ~50MB table?
Posted: 2015-12-16 23:17:11
Question:

I'm new to Cassandra and trying to understand how sizing works. I created a keyspace and a table, then wrote a Java program that generates one million rows into a CSV file, which I inserted into my database. The CSV file is about 545 MB. After loading it, I ran the nodetool cfstats command and got the output below. It reports total space used as 50555052 bytes (~50 MB). How can that be? With the overhead of indexes, columns, and so on, how can my total data end up smaller than the raw CSV data (not just smaller, but an order of magnitude smaller)? Maybe I'm misreading the output, but does this look right? I'm running Cassandra 2.2.1 on a single machine.

Table: users
        SSTable count: 1
        Space used (live): 50555052
        Space used (total): 50555052
        Space used by snapshots (total): 0
        Off heap memory used (total): 1481050
        SSTable Compression Ratio: 0.03029072054256705
        Number of keys (estimate): 984133
        Memtable cell count: 240336
        Memtable data size: 18385704
        Memtable off heap memory used: 0
        Memtable switch count: 19
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1000000
        Local write latency: 0.044 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1192632
        Bloom filter off heap memory used: 1192624
        Index summary off heap memory used: 203778
        Compression metadata off heap memory used: 84648
        Compacted partition minimum bytes: 643
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 770
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0
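A back-of-envelope check (not from the original post) is possible from the cfstats numbers alone, assuming SSTable Compression Ratio means compressed size divided by uncompressed size:

```java
// Rough size estimate from the cfstats output above.
// Assumption: SSTable Compression Ratio = compressed bytes / uncompressed bytes.
public class SizeEstimate {
    public static void main(String[] args) {
        long keys = 984_133L;        // "Number of keys (estimate)"
        long meanPartition = 770L;   // "Compacted partition mean bytes"
        double ratio = 0.0303;       // "SSTable Compression Ratio"

        long uncompressed = keys * meanPartition;         // ~758 MB of raw partition data
        long compressed = (long) (uncompressed * ratio);  // ~23 MB after compression

        System.out.printf("uncompressed ~ %d MB, compressed ~ %d MB%n",
                uncompressed / 1_000_000, compressed / 1_000_000);
    }
}
```

Roughly 23 MB of compressed data, plus the other SSTable components (partition index, bloom filter, compression metadata) and rows still sitting in the memtable, lands in the same ballpark as the reported ~50 MB.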

My Java code that generates the CSV file looks like this:

try {
    FileWriter writer = new FileWriter(sFileName);
    for (int i = 0; i < 1000000; i++) {
        writer.append("Username " + i);
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("myfakeemailaccnt@email.com");
        writer.append(',');
        writer.append(new Timestamp(date.getTime()).toString());
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ");
        writer.append(',');
        writer.append("tr");
        writer.append('\n');
    }
    writer.flush();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}

Comments:

  • Not sure what your data looks like, but if your CSV is full of commas and quotes you could see some savings
  • I'm new to Cassandra too, and I just copied in a ~14GB CSV of ~23M records with 50 fields each. Cassandra tells me it only takes about 158MB on disk. I'm waiting for it to replicate across my nodes, then I'll run some queries to make sure it's all there.

Tags: cassandra


Answer 1:

So I looked at the three largest pieces of data:

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWV9.TJVA95OrM7E2cBab30RMHrHDcEfxjoYZgeFONFh7HgQ

and realized they were all identical, so maybe Cassandra was compressing them, even though the reported compression ratio of 0.03 means the data was squeezed down to about 3% of its original size. So I changed my Java code to generate distinct data instead.

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;

public class Main {

    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static void main(String[] args) {
        generateCassandraCSVData("users.csv");
    }

    public static String randomAlphaNumeric(int count) {
        StringBuilder builder = new StringBuilder();
        while (count-- != 0) {
            int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(character));
        }
        return builder.toString();
    }

    public static void generateCassandraCSVData(String sFileName) {
        java.util.Date date = new java.util.Date();

        try {
            FileWriter writer = new FileWriter(sFileName);
            for (int i = 0; i < 1000000; i++) {
                writer.append("Username " + i);
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append("myfakeemailaccnt@email.com");
                writer.append(',');
                writer.append(new Timestamp(date.getTime()).toString());
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append(randomAlphaNumeric(150));
                writer.append(',');
                writer.append("tr");
                writer.append('\n');
            }
            writer.flush();
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

So now those three big columns hold random strings instead of identical values. Here's what it produces now:

Table: users
        SSTable count: 4
        Space used (live): 554671040
        Space used (total): 554671040
        Space used by snapshots (total): 0
        Off heap memory used (total): 1886175
        SSTable Compression Ratio: 0.6615549506522498
        Number of keys (estimate): 1019477
        Memtable cell count: 270024
        Memtable data size: 20758095
        Memtable off heap memory used: 0
        Memtable switch count: 25
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 1323546
        Local write latency: 0.048 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1533512
        Bloom filter off heap memory used: 1533480
        Index summary off heap memory used: 257175
        Compression metadata off heap memory used: 95520
        Compacted partition minimum bytes: 311
        Compacted partition maximum bytes: 770
        Compacted partition mean bytes: 686
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

So now the CSV file is ~550MB again, and my table is also ~550MB. Does that mean Cassandra compresses non-key column data this efficiently when it's identical (low cardinality)? If so, this is a really important concept to keep in mind when modeling a database (and one I'd never read about before), because you can save an enormous amount of storage.
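This effect is easy to reproduce outside Cassandra. Below is a minimal sketch using the JDK's java.util.zip.Deflater (Cassandra 2.2 defaults to LZ4 compression, but the principle is the same): many copies of one token compress to almost nothing, while the same volume of random alphanumeric text barely compresses. The token string and sizes here are illustrative, not taken from the original post.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class CompressDemo {
    // Returns the DEFLATE-compressed size of the input, in bytes.
    static int compressedSize(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);   // we only need the byte count, not the output
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        // Case 1: one token repeated 10,000 times, like the first CSV's identical columns.
        String token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkw";
        byte[] repetitive = token.repeat(10_000).getBytes();

        // Case 2: the same volume of random alphanumerics, like the second CSV.
        String alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        Random rnd = new Random();
        StringBuilder sb = new StringBuilder(repetitive.length);
        for (int i = 0; i < repetitive.length; i++) {
            sb.append(alpha.charAt(rnd.nextInt(alpha.length())));
        }
        byte[] random = sb.toString().getBytes();

        System.out.printf("repetitive ratio: %.4f, random ratio: %.4f%n",
                (double) compressedSize(repetitive) / repetitive.length,
                (double) compressedSize(random) / random.length);
    }
}
```

The repetitive input compresses to a tiny fraction of its size, while the random input stays around two-thirds of its size, in the same ballpark as the 0.03 and 0.66 SSTable compression ratios cfstats reported for the two data sets.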

Comments:
