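【Posted】: 2018-10-01 10:11:21
【Question】:
I am working with HBase and Parquet. I wrote code that fetches values from HBase and maps them to an object class, but I cannot reproduce the same thing for Parquet using a Dataset.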
HBase:
JavaPairRDD<ImmutableBytesWritable, Result> data = sc.newAPIHadoopRDD(getHbaseConf(),
        TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
JavaRDD<List<Tuple3<Long, Integer, Double>>> tempData = data
        .values()
        // Uses HBaseResultToSimple... class to parse the data.
        .map(value -> {
            SimpleObject object = oParser.call(value);
            // Get the samples property, trim leading and trailing spaces, and split it
            // by comma to get each sample individually.
            List<Tuple2<String, Integer>> samples = zipWithIndex(object.getSamples().trim().split(","));
            // Gets the unique identifier for that sp.
            Long sp = object.getPos();
            // Calculates the hamming distance for this sp for each sample,
            // i.e. 0|0 => 0, 0|1 => 1, 1|0 => 1, 1|1 => 2
            return samples.stream().map(t -> {
                String alleles = t._1();
                Integer patient = t._2();
                List<String> values = Arrays.asList(alleles.split("\\|"));
                Double firstA = Double.parseDouble(values.get(0));
                Double second = Double.parseDouble(values.get(1));
                // Returns the sp id, the patient id, and the distance as a Tuple3.
                return new Tuple3<>(sp, patient, firstA + second);
            }).collect(Collectors.toList());
        });
I read the data from Parquet into a Dataset, but I simply cannot reproduce the approach above.
Dataset<Row> url = session.read().parquet(fileName);
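I just need to know how to map the rows of a Dataset<Row> to an object class, the way I did with .map(value -> {... in the approach above.
Any help would be greatly appreciated.
【Discussion】:
Tags: java apache-spark java-8 functional-programming parquet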
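To make the goal concrete, the per-row logic can be separated from the data source so that the same function could be called from either the HBase RDD or a Parquet-backed Dataset. The following is only a minimal sketch in plain Java with no Spark types. The names GenotypeDistances, Distance, alleleSum and distances are hypothetical and not part of the code above, and the 0-based sample index is an assumption about what zipWithIndex produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original code: holds the per-row logic
// in a form that does not depend on HBase or Spark classes.
final class GenotypeDistances {

    // One output record: position id, sample index, and allele sum.
    static final class Distance {
        final long pos;
        final int sample;
        final double value;

        Distance(long pos, int sample, double value) {
            this.pos = pos;
            this.sample = sample;
            this.value = value;
        }
    }

    // Parses one genotype such as "0|1" and returns the sum of its two alleles.
    static double alleleSum(String genotype) {
        String[] alleles = genotype.trim().split("\\|");
        return Double.parseDouble(alleles[0]) + Double.parseDouble(alleles[1]);
    }

    // Splits a comma-separated samples string and computes one Distance per sample.
    // The sample index is assumed to be 0-based.
    static List<Distance> distances(long pos, String samples) {
        String[] genotypes = samples.trim().split(",");
        List<Distance> out = new ArrayList<>(genotypes.length);
        for (int i = 0; i < genotypes.length; i++) {
            out.add(new Distance(pos, i, alleleSum(genotypes[i])));
        }
        return out;
    }
}
```

On the Spark side, Dataset<Row> has toJavaRDD(), which returns a JavaRDD<Row> that accepts the same .map(row -> ...) style as above, and Row.getAs(fieldName) reads a column by name. Dataset.map(MapFunction, Encoder) with something like Encoders.bean(SimpleObject.class) is another option that stays in the Dataset API. Which columns the Parquet file actually contains is not shown here, so any column names passed to getAs would be assumptions.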