【Question Title】: spark converting CSV to libsvm format
【Posted】: 2017-04-14 22:20:33
【Question】:

I have a CSV file that contains state, age, gender, salary, etc. as the independent variables.

The dependent variable is churn.

In Spark, we need to convert the DataFrame to libsvm format. Can you tell me how to do this?

The libsvm format is: 0 128:51

Here the feature entry means that column 128 contains the value 51.
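
For reference, Spark MLlib already has a writer for exactly this format; below is a minimal sketch of how that example record could be represented and saved (the vector size of 200 and the output path are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Label 0, and feature 128 (1-based in the file, so 0-based index 127 here) holds 51.
val points = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.sparse(200, Seq((127, 51.0))))
))

// Writes lines like "0.0 128:51.0" in libsvm format.
MLUtils.saveAsLibSVMFile(points, "/tmp/libsvm-out")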

【Comments】:

  • Describe your problem in more detail. I don't agree with your approach.

Tags: apache-spark


【Solution 1】:
/*
/Users/mac/matrix.txt
1 0.5 2.4 3.0
1 99 34 6454
2 0.8 3.0 4.5
*/
// Turn one split row (label followed by feature values) into a libsvm line:
// "label 1:v1 2:v2 ...". Feature indices are 1-based, as libsvm expects.
def concat(a: Array[String]): String = {
  val sb = new StringBuilder(a(0))          // the first field is the label
  for (i <- 1 until a.length)
    sb.append(" ").append(i).append(":").append(a(i))
  sb.toString
}

val rfile = sc.textFile("file:///Users/mac/matrix.txt")
val f = rfile.map(line => line.split(' ')).map(concat)

I believe I have a simpler solution.
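
To write the result out, a plausible follow-up (the output path here is made up):

f.saveAsTextFile("file:///Users/mac/matrix-libsvm")
// Each output line now looks like: 1 1:0.5 2:2.4 3:3.0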

【Discussion】:

【Solution 2】:

I do the same thing with Hadoop, but the logic should be the same. I have created a sample for your use case. First I create the DataFrame, then drop all rows that have null or blank values. After that I create an RDD and convert each Row into libsvm format. "repartition(1)" means everything will be written into a single output file. There will be one label column, e.g. in the case of CTR prediction it will be only 1 or 0.

Sample input file:

    "zip","city","state","latitude","longitude","timezone","dst"
    "00210","Portsmouth","NH","43.005895","-71.013202","-5","1"
    "00211","Portsmouth","NH","43.005895","-71.013202","-5","1"
    "00212","Portsmouth","NH","43.005895","-71.013202","-5","1"
    "00213","Portsmouth","NH","43.005895","-71.013202","-5","1"
    "00214","Portsmouth","NH","43.005895","-71.013202","-5","1"
    "00215","Portsmouth","NH","43.005895","-71.013202","-5","1"
    "00501","Holtsville","NY","40.922326","-72.637078","-5","1"
    "00544","Holtsville","NY","40.922326","-72.637078","-5","1"
    
    import java.nio.charset.StandardCharsets;
    
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.DataTypes;
    
    import com.google.common.hash.Hashing;
    
    public class LibSvmConvertJob {
    
        private static final String SPACE = " ";
        private static final String COLON = ":";
    
        public static void main(String[] args) {
    
            SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("Libsvm Convertor");
    
            JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
    
            SQLContext sqlContext = new SQLContext(javaSparkContext);
    
            DataFrame inputDF = sqlContext.read().format("com.databricks.spark.csv").option("header", "true")
                    .load("/home/raghunandangupta/inputfiles/zipcode.csv");
    
            inputDF.printSchema();
    
            // Register a UDF that turns blank/whitespace-only strings into nulls so na().drop() can remove those rows.
            sqlContext.udf().register("convertToNull", (String v1) -> (v1.trim().length() > 0 ? v1.trim() : null), DataTypes.StringType);
    
            inputDF = inputDF.selectExpr("convertToNull(zip)","convertToNull(city)","convertToNull(state)","convertToNull(latitude)","convertToNull(longitude)","convertToNull(timezone)","convertToNull(dst)").na().drop();
    
            // Convert every Row into a libsvm-style line: <hashed zip as label>\t1:v1 2:v2 ..., hashing each string value to a non-negative int.
            inputDF.javaRDD().map(new Function<Row, String>() {
                private static final long serialVersionUID = 1L;
                @Override
                public String call(Row v1) throws Exception {
                    StringBuilder sb = new StringBuilder();
                    sb.append(hashCode(v1.getString(0))).append("\t")   //Resultant column
                    .append("1"+COLON+hashCode(v1.getString(1))).append(SPACE)
                    .append("2"+COLON+hashCode(v1.getString(2))).append(SPACE)
                    .append("3"+COLON+hashCode(v1.getString(3))).append(SPACE)
                    .append("4"+COLON+hashCode(v1.getString(4))).append(SPACE)
                    .append("5"+COLON+hashCode(v1.getString(5))).append(SPACE)
                    .append("6"+COLON+hashCode(v1.getString(6)));
                    return sb.toString();
                }
                private String hashCode(String value) {
                    return Math.abs(Hashing.murmur3_32().hashString(value, StandardCharsets.UTF_8).hashCode()) + "";
                }
            }).repartition(1).saveAsTextFile("/home/raghunandangupta/inputfiles/zipcode");
    
        }
    }
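
As a side note, in Spark 2.x the same CSV-to-libsvm conversion can be sketched with the built-in "libsvm" writer; the column names below (age, salary, gender, churn) come from the question and the file paths are hypothetical:

    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.sql.functions.col
    
    // Read the CSV with a header row and let Spark infer numeric types.
    val df = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("/path/to/churn.csv")
    
    // Turn the categorical column into a numeric index.
    val indexed = new StringIndexer()
      .setInputCol("gender").setOutputCol("gender_idx")
      .fit(df).transform(df)
    
    // Assemble the numeric columns into a single feature vector.
    val assembled = new VectorAssembler()
      .setInputCols(Array("age", "salary", "gender_idx"))
      .setOutputCol("features")
      .transform(indexed)
    
    // The "libsvm" data source expects a double "label" column and a vector "features" column.
    assembled.select(col("churn").cast("double").as("label"), col("features"))
      .write.format("libsvm").save("/path/to/churn-libsvm")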
    

【Discussion】:
