Need suggestion regarding clustering in Mahout

【Question title】: Need suggestion regarding clustering in Mahout
【Posted】: 2013-12-23 18:19:29
【Question】:

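I followed the Reuters data set clustering example from the book "Mahout in Action" and ran it successfully. To learn more about clustering, I tried the same sequence of commands on some tweet data.

The sequence of commands I used is: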

mahout seqdirectory -c UTF-8 -i hdfs://-----:8020/user/hdfs/tweet/tweet.txt -o hdfs://-----:8020/user/hdfs/tweet/seqfiles

mahout seq2sparse -i hdfs://-----:8020/user/hdfs/tweet/seqfiles -o hdfs://----:8020/user/hdfs/tweet/vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv

mahout kmeans -i hdfs://---:8020/user/hdfs/tweet/vectors/tfidf-vectors/ -c kmeans-centroids -cl -o hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters -k 3 -ow -x 3 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

mahout clusterdump -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusters-3-final -d hdfs://----:8020/user/hdfs/tweet/vectors/dictionary.file-0 -dt sequencefile -b 100 -n 10 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints -o tweet_outdump.txt

The tweet_outdump.txt file contains the following data:

CL-0{n=1 c=[] r=[]}
Top Terms: 
Weight : [props - optional]: Point:
1.0: /tweet.txt =]
Inter-Cluster Density: NaN
Intra-Cluster Density: 0.0
CDbw Inter-Cluster Density: 0.0
CDbw Intra-Cluster Density: NaN
CDbw Separation: 0.0

I also tried this command:

mahout seqdumper -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints/part-m-00000

Key: 0: Value: 1.0: /tweet.txt =]
Count: 1

I would really appreciate some feedback here. Thanks in advance.

【Comments】:

Tags: cluster-analysis data-mining mahout

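【Solution 1】:

You created a data set that consists of only a single document.

Obviously, the clustering result is meaningless. There is no "inter-cluster distance" (because there is only one cluster). And the intra-cluster distance is 0, because there is only one object, and its distance to itself is 0.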
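A small sketch of the arithmetic behind this, in plain Java. It is not Mahout's evaluator code, and the class and method names are made up for illustration. It only shows two general facts: the cosine distance of a vector to itself is 0, and an average taken over zero pairs is 0.0/0, which is NaN in Java floating point. That is one plausible way a mix of 0.0 and NaN values can arise when there is a single point.

```java
import java.util.Collections;
import java.util.List;

/** Illustrative only; hypothetical names, not part of Mahout. */
class SinglePointClusterDemo {

    /** Cosine distance: 1 - (a . b) / (|a| * |b|). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Mean distance over all unordered pairs; with fewer than two items this is 0.0/0 = NaN. */
    static double meanPairwiseDistance(List<double[]> items) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                sum += cosineDistance(items.get(i), items.get(j));
                pairs++;
            }
        }
        return sum / pairs;
    }

    public static void main(String[] args) {
        double[] onlyDocument = {3.0, 4.0};
        // Distance of the single document to itself
        System.out.println(cosineDistance(onlyDocument, onlyDocument)); // prints 0.0
        // No pairs exist, so the mean is undefined
        System.out.println(meanPairwiseDistance(Collections.singletonList(onlyDocument))); // prints NaN
    }
}
```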

So you already failed at the seqdirectory command: you passed it a single file, not a directory containing one file per document.

【Comments】:
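One way to get the layout described here is to split the file locally before uploading it. The following is a minimal sketch, assuming tweet.txt holds one tweet per line (the question does not confirm this) and that the file is small enough to read into memory. The class name, the `split` helper, and the `tweet-NNNNNN.txt` naming scheme are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: turns a one-tweet-per-line file into one file per tweet. */
class SplitTweets {

    /** Writes each non-blank line of input to outDir/tweet-NNNNNN.txt and returns the file count. */
    static int split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        // Reads the whole file at once; fine for a sketch, not for very large inputs
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        int written = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines so no empty documents are created
            }
            String name = String.format(Locale.ROOT, "tweet-%06d.txt", written);
            Files.write(outDir.resolve(name), line.getBytes(StandardCharsets.UTF_8));
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: local path of tweet.txt, args[1]: local output directory
        int count = split(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + count + " documents");
    }
}
```

The resulting directory would then be copied to HDFS and passed to seqdirectory via -i in place of the single file. Keep in mind that HDFS handles huge numbers of tiny files poorly, so this only makes sense for modest data sets.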

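【Solution 2】:

In your case, the data set appears to consist of one large file in which each line represents a document. The seqdirectory command will therefore produce a sequence file containing only a single record, which, as I understand from your post, is not what you want. So you should first write a simple MapReduce job that takes your data set and assigns an id to each line. Here you can use the line's offset in the file as the id (the key), with the line itself as the value. You must also set the output format to SequenceFile. In addition, the output key must be a Text, and the value is the UTF-8 encoded string wrapped in a Text object. Here is a simple MapReduce program: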

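import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class TexToHadoopSeq {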

    // Mapper: emits (offset of the line in the file, line text) as a (Text, Text) pair
    public static class mapper extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {

        Text cle = new Text();
        Text valeur = new Text();

        @Override
        public void map(LongWritable key, Text values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {

            String record = values.toString();

            byte[] b = record.getBytes("UTF-8");

            valeur.set(b);
            cle.set(key.toString());
            output.collect(cle, valeur);

        }
    }

    // Identity reducer: passes every (key, value) pair through unchanged
    public static class Reduce1 extends MapReduceBase implements
            Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {

                output.collect(key, values.next());

            }

        }

    }

    public static void main(String[] args) throws IOException {

        // args[0]: path of the input text file (one document per line)
        String inputdata = args[0];

        JobConf conf1 = new JobConf(TexToHadoopSeq.class);

        FileInputFormat.setInputPaths(conf1, new Path(inputdata)); // input data set
        FileOutputFormat.setOutputPath(conf1, new Path("output")); // job output directory

        conf1.setJarByClass(TexToHadoopSeq.class);
        conf1.setMapperClass(mapper.class);
        conf1.setReducerClass(Reduce1.class);

        conf1.setNumReduceTasks(1);

        conf1.setMapOutputKeyClass(Text.class);
        conf1.setMapOutputValueClass(Text.class);

        conf1.setOutputKeyClass(Text.class);
        conf1.setOutputValueClass(Text.class);

        // Read plain text line by line and write a Hadoop SequenceFile
        conf1.setInputFormat(TextInputFormat.class);
        conf1.setOutputFormat(SequenceFileOutputFormat.class);

        // JobClient.runJob submits the job and blocks until it has finished
        JobClient.runJob(conf1);

        System.out.println("*****Conversion is Done*****");

    }

}
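To make the keys concrete: TextInputFormat hands the mapper the byte offset at which each line starts, and the mapper above turns that number into the Text key. The plain-Java sketch below mimics this without Hadoop, assuming "\n" line endings; the class and method names are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: mimics the (offset, line) records the mapper would emit. */
class LineOffsetDemo {

    /** Maps each line's starting byte offset (as a string) to the line text. */
    static Map<String, String> keyedLines(String content) {
        Map<String, String> records = new LinkedHashMap<>();
        if (content.isEmpty()) {
            return records;
        }
        long offset = 0;
        // split("\n") drops the empty string after a trailing newline
        for (String line : content.split("\n")) {
            records.put(Long.toString(offset), line);
            // advance by the UTF-8 length of the line plus one byte for '\n'
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // "\u00e9" takes two bytes in UTF-8, so the second key is "7", not "6"
        Map<String, String> records = keyedLines("h\u00e9llo\nworld\n");
        for (Map.Entry<String, String> e : records.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Offsets are unique only within a single file. If the job is ever given several input files, two lines from different files can share a key, so a file name prefix on the key would be a reasonable addition in that case.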

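Now the next step is to create vectors from your sequence file (the one produced by the code above), so use: ./mahout seq2sparse -i "Directory of your sequential file in HDFS" -o "output" --maxDFPercent 85 --namedVector

You will then get the TFIDF directory. From there, go on to run k-means or any other Mahout clustering algorithm. That's it.

【Comments】: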