[Posted]: 2014-11-03 18:20:12
[Problem description]:
I want to use java-ml to train on my data and classify some documents. Here is what I am doing:
I have two categories, each with 11,000 documents. In total I have 92,199 features, scored with information gain / chi-square / mutual information / Gini, and I use the top 20,000 of them for training.
So I have 22,000 documents and 20,000 features to train on. For each document I compute its intersection with the feature set, so I have:
the intersection of each document with the features
the difference: features that appear in the feature set but not in the document
So for each training document I store the intersection words with their tf_idf, and tf_idf = 0 for the difference.
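The weighting scheme above (each selected feature gets the document's tf_idf if present, and 0 otherwise) can be sketched in plain Java. This is a minimal illustration, not java-ml API: `toVector`, the class name, and the sample weights are hypothetical.

```java
import java.util.*;

public class TfIdfVectorSketch {

    // Map every selected feature to a weight: the document's tf-idf if the
    // feature occurs in the document (intersection), otherwise 0.0 (difference).
    static Map<String, Double> toVector(Set<String> features,
                                        Map<String, Double> docTfIdf) {
        Map<String, Double> vector = new HashMap<>();
        for (String feature : features) {
            vector.put(feature, docTfIdf.getOrDefault(feature, 0.0));
        }
        return vector;
    }

    public static void main(String[] args) {
        Set<String> features = new HashSet<>(Arrays.asList("cat", "dog"));
        Map<String, Double> doc = new HashMap<>();
        doc.put("cat", 0.5);
        doc.put("bird", 0.9); // not a selected feature, so it is ignored
        // TreeMap only to get a deterministic print order
        System.out.println(new TreeMap<>(toVector(features, doc))); // prints {cat=0.5, dog=0.0}
    }
}
```

Note that every document vector has exactly `features.size()` entries, which is why memory grows quickly with 20,000 features per instance.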
Here is how I do it:
public void buildDataset() {
    DBDocMeta dbDocMeta; // the table that contains documents
    dataset = new DefaultDataset();
    neighbors.add(new Neighbor<Integer>("cat1")); // each neighbor contains a document list
    neighbors.add(new Neighbor<Integer>("cat2")); // Neighbor<Integer>: document{index, tf_idf}; Neighbor<String>: {word, tf_idf}
    try {
        dbDocMeta = new DBDocMeta();
        Map<Long, String> docInfo = dbDocMeta.getDocInfo();
        int count = 1;
        id:
        for (Long id : docInfo.keySet()) {
            count++;
            String cat = docInfo.get(id);
            System.out.println("***********************************************");
            System.out.println("Available processors (cores): " + Runtime.getRuntime().availableProcessors());
            Long freeMemory = Runtime.getRuntime().freeMemory();
            System.out.println("Free memory (bytes): " + freeMemory);
            if (freeMemory <= 500000000) {
                System.out.println("memory problem occurred !!!");
                net.sf.javaml.tools.data.FileHandler.exportDataset(dataset, new File("dataset.data"));
                break id;
            }
            long maxMemory = Runtime.getRuntime().maxMemory();
            System.out.println("Maximum memory (bytes): " + (maxMemory == Long.MAX_VALUE ? "no limit" : maxMemory));
            System.out.println("Total memory available to JVM (bytes): " + Runtime.getRuntime().totalMemory());
            System.out.println("category : " + cat);
            System.out.println("***********************************************");
            Document<String> doc1 = dbWeight.getNeighbors(id);
            Instance instance = new SparseInstance();
            instance.setClassValue(cat);
            if (!doc1.getAttributes().isEmpty()) {
                neighbors:
                for (Neighbor<Integer> neighbor : neighbors) {
                    if (!neighbor.getCategory().equalsIgnoreCase(cat)) {
                        continue neighbors;
                    }
                    Set<String> intersectionWords = intersection(features, doc1.getAttributes().keySet());
                    if (intersectionWords.isEmpty()) {
                        continue id;
                    }
                    HashSet<String> different = new HashSet<String>(features);
                    for (String word : intersectionWords) {
                        instance.put(dbWeight.getIndex(word), doc1.getAttributes().get(word));
                        different.remove(word);
                    }
                    for (String word : different) {
                        instance.put(dbWeight.getIndex(word), 0.0);
                    }
                    dataset.add(instance);
                    break neighbors;
                }
            }
        }
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    try {
        net.sf.javaml.tools.data.FileHandler.exportDataset(dataset, new File("save.data"));
        System.out.println("dataset has exported successfully");
    } catch (Exception e) {
        System.out.println("failed to export dataset");
        e.printStackTrace();
    }
}

private static <A> Set<A> intersection(final Set<A> xs, final Set<A> ys) {
    // make sure that xs is the smaller set
    if (ys.size() < xs.size()) {
        return intersection(ys, xs);
    }
    final HashSet<A> result = new HashSet<A>();
    for (A x : xs) {
        if (ys.contains(x)) {
            result.add(x);
        }
    }
    return result;
}
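As an aside, the `intersection` helper can also be written with the JDK's `Set.retainAll`, which is behaviorally equivalent for the sets used here. This is just a sketch; `SetIntersectionSketch` is an illustrative name, not part of the code above.

```java
import java.util.*;

public class SetIntersectionSketch {

    // Intersection via retainAll: copy the smaller set, then keep only the
    // elements that also occur in the larger one. Inputs are not modified.
    static <A> Set<A> intersection(Set<A> xs, Set<A> ys) {
        Set<A> smaller = xs.size() <= ys.size() ? xs : ys;
        Set<A> larger = (smaller == xs) ? ys : xs;
        Set<A> result = new HashSet<A>(smaller);
        result.retainAll(larger);
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("x", "y", "z"));
        Set<String> b = new HashSet<>(Arrays.asList("y", "z", "w"));
        // TreeSet only to get a deterministic print order
        System.out.println(new TreeSet<>(intersection(a, b))); // prints [y, z]
    }
}
```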
Is this the right way to build the dataset?
[Discussion]:
Tags: java classification svm libsvm