【Question title】: How to train data for SVM?
【Posted】: 2014-11-03 18:20:12
【Question】:

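I want to train on my data with java-ml to classify some documents. Here is what I am doing:

I have two categories, each with 11000 documents. In total I have 92199 features, obtained with information gain, chi-square, mutual information and Gini, and I use 20000 of them to train.

So I have 22000 documents and 20000 features for training. For every document I compute its intersection with the feature set, which gives me:

Intersection: the features that also appear in the document

Difference: the features that do not appear in the document

For each document I then send the intersection terms with their tf_idf values, plus the difference terms with tf_idf = 0, to training.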
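The intersection/difference step described above can be sketched with plain Java collections. This is only a minimal illustration, not java-ml code: the class and the names `toSparseVector`, `docTfIdf` and `featureIndex` are hypothetical. It assumes the sparse representation treats any index that is not stored as 0, in which case the "difference" terms never need to be written out explicitly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class SparseVectorSketch {

    /**
     * Builds a sparse vector (feature index -> tf-idf weight) for one document.
     * Only terms present in both the document and the selected feature set
     * are stored; every other feature index is implicitly 0.
     */
    public static Map<Integer, Double> toSparseVector(Map<String, Double> docTfIdf,
                                                      Map<String, Integer> featureIndex) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Double> e : docTfIdf.entrySet()) {
            Integer idx = featureIndex.get(e.getKey());
            if (idx != null) {          // the term is one of the selected features
                vector.put(idx, e.getValue());
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical feature set with three selected terms.
        Map<String, Integer> featureIndex = new HashMap<>();
        featureIndex.put("svm", 0);
        featureIndex.put("kernel", 1);
        featureIndex.put("margin", 2);

        // Hypothetical tf-idf weights for one document.
        Map<String, Double> doc = new HashMap<>();
        doc.put("svm", 0.8);
        doc.put("margin", 0.3);
        doc.put("banana", 0.5); // not a selected feature, so it is dropped

        System.out.println(toSparseVector(doc, featureIndex)); // {0=0.8, 2=0.3}
    }
}
```

With 20000 features and documents that typically contain only a small fraction of them, storing just the non-zero entries keeps each instance small.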

Here is how I do it:

public void buildDataset() {
    DBDocMeta dbDocMeta; // the table that contains documents
    dataset = new DefaultDataset();
    neighbors.add(new Neighbor<Integer>("cat1")); // each neighbor contains a Document List 
    neighbors.add(new Neighbor<Integer>("cat2"));// neighbor integer: document{index,tf_idf} neighbor string : {word,tf_idf}
    try {
        dbDocMeta = new DBDocMeta();
        Map<Long, String> docInfo = dbDocMeta.getDocInfo();
        int count = 1;
        id:
        for (Long id : docInfo.keySet()) {
            count++;
            String cat = docInfo.get(id);
            System.out.println("***********************************************");
            System.out.println("Available processors (cores): " + Runtime.getRuntime().availableProcessors());
            Long freeMemory = Runtime.getRuntime().freeMemory();
            System.out.println("Free memory (bytes): " + freeMemory);
            if (freeMemory <= 500000000) {
                System.out.println("memory problem occurred !!!");
                net.sf.javaml.tools.data.FileHandler.exportDataset(dataset, new File("dataset.data"));
                break id;
            }
            long maxMemory = Runtime.getRuntime().maxMemory();
            System.out.println("Maximum memory (bytes): " + (maxMemory == Long.MAX_VALUE ? "no limit" : maxMemory));
            System.out.println("Total memory available to JVM (bytes): " + Runtime.getRuntime().totalMemory());
            System.out.println("category : " + cat);
            System.out.println("***********************************************");
            Document<String> doc1 = dbWeight.getNeighbors(id);

            Instance instance = new SparseInstance();
            instance.setClassValue(cat);
            if (!doc1.getAttributes().isEmpty()) {

                neighbors:
                for (Neighbor<Integer> neighbor : neighbors) {
                    if (!neighbor.getCategory().equalsIgnoreCase(cat)) {

                        continue neighbors;
                    }

                    Set<String> intersectionWords = intersection(features, doc1.getAttributes().keySet());
                    if (intersectionWords.isEmpty()) {
                        continue id;
                    }
                    HashSet<String> different = new HashSet<String>(features);
                    for (String word : intersectionWords) {
                        instance.put(dbWeight.getIndex(word), doc1.getAttributes().get(word));
                        different.remove(word);
                    }
                    for (String word : different) {
                        instance.put(dbWeight.getIndex(word), 0.0);
                    }
                    dataset.add(instance);

                    break neighbors;
                }
            }
        }
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    try {
        net.sf.javaml.tools.data.FileHandler.exportDataset(dataset, new File("save.data"));
        System.out.println("dataset has exported successfully");
    } catch (Exception e) {
        System.out.println("failed to export dataset");
        e.printStackTrace();
    }

}



private static <A> Set<A> intersection(final Set<A> xs, final Set<A> ys) {
    // make sure that xs is the smaller set
    if (ys.size() < xs.size()) {
        return intersection(ys, xs);
    }

    final HashSet<A> result = new HashSet<A>();
    for (A x : xs) {
        if (ys.contains(x)) {
            result.add(x);
        }
    }

    return result;
}

Is this the right way to build the dataset?

【Comments】:

    Tags: java classification svm libsvm


    【Solution 1】:

    My attempt

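    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;

    // Wrapper class added so the snippet compiles; the name is arbitrary.
    public class SentimentDataPreparer {

        // Vocabulary: a word's position in this list is its feature index.
        static List<String> bagOfWords;

        // Input sentences; fill these in before running (not shown in the original answer).
        static List<String> dataSet = new ArrayList<String>();
        static List<String> negData = new ArrayList<String>();
        static List<String> posData = new ArrayList<String>();

        public static void main(String... arg) {
            bagOfWords = prepareBOW(dataSet); // Provide dataset
            prepareSentimentalSentencesList(negData, "-1 ");
            prepareSentimentalSentencesList(posData, "+1 ");
        }

        // Strips commas, periods and spaces, then lower-cases the word.
        static String normalize(String word) {
            return word.replaceAll("[,. ]", "").toLowerCase();
        }

        public static List<String> prepareBOW(List<String> dataSet) {
            List<String> words = new ArrayList<String>();
            // Placeholder entry kept from the original; it occupies index 0,
            // so real words get indices starting at 1.
            words.add("*&^(0");
            // Add each word of each sentence to the list, skipping empty tokens.
            for (String s : dataSet) {
                for (String word : s.split(" ")) {
                    word = normalize(word);
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
            // Remove duplicates while keeping first-seen order.
            return new ArrayList<String>(new LinkedHashSet<String>(words));
        }

        // Assumed helper (not shown in the original answer): index of the word
        // in the vocabulary, or -1 if the word is unknown.
        static int getIndex(String word) {
            return bagOfWords.indexOf(word);
        }

        public static void prepareSentimentalSentencesList(List<String> dataSet, String label) {
            List<String> list = new ArrayList<String>();
            for (String data : dataSet) {
                // Each line starts with the label, followed by "index:1" pairs.
                StringBuilder wordsIndex = new StringBuilder(label);
                for (String word : data.split(" ")) {
                    int index = getIndex(normalize(word));
                    if (index != -1) {
                        wordsIndex.append(index).append(":1 ");
                    }
                }
                list.add(wordsIndex.toString());
            }
            for (String s : list) {
                System.out.println(s);
            }
        }
    }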
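The code above prints lines of the form `label index:1 index:1 ...`, which resembles LIBSVM's sparse text format. LIBSVM expects the feature indices on each line to be in ascending order, whereas the code above emits them in the order the words occur and may repeat an index. The following is a hedged sketch of one way to produce ordered, de-duplicated lines; the class and method names are hypothetical and are not part of the answer or of any library.

```java
import java.util.Map;
import java.util.TreeMap;

public class LibsvmLineSketch {

    /** Formats one example as "<label> <index>:<value> ..." with ascending indices. */
    public static String toLibsvmLine(String label, Map<Integer, Double> features) {
        StringBuilder sb = new StringBuilder(label);
        // Copying into a TreeMap sorts the keys in ascending order;
        // a map also guarantees that each index appears only once.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> features = new TreeMap<>();
        features.put(7, 1.0);
        features.put(2, 1.0);
        System.out.println(toLibsvmLine("+1", features)); // +1 2:1.0 7:1.0
    }
}
```

Collecting the indices into a map before printing also makes it easy to store a real weight (for example tf_idf) instead of the constant 1.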
    

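    【Comments】:

    • Why don't you use instance.put?
    • It's for research purposes. But can you tell me where I should use instance.put instead of the ArrayList?
    • In prepareSentimentalSentencesList you can use instance.put(key, value), where key is the index and value is the weight.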