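Since the Databricks CSV module doesn't seem to provide an option to skip lines, here are a few options I can think of:
Option one: Add a "#" character in front of the first line. The line is then treated as a comment and ignored by the Databricks CSV module.
Option two: Create a custom schema and set the mode option to DROPMALFORMED, which drops the first line because it contains fewer tokens than customSchema expects: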
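A minimal sketch of how option one could be automated, assuming the file is UTF-8 encoded and small enough to read into memory. The helper names `commentOutFirstLine` and `commentOutFirstLineInFile` are made up for illustration and are not part of Spark or spark-csv:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper: prepend "#" to the first line, leave the rest untouched.
def commentOutFirstLine(lines: Seq[String]): Seq[String] = lines match {
  case first +: rest => ("#" + first) +: rest
  case _             => lines // empty input: nothing to comment out
}

// Hypothetical helper: rewrite the file in place (assumes UTF-8 text).
def commentOutFirstLineInFile(fileName: String): Unit = {
  val path = Paths.get(fileName)
  val original = Files.readAllLines(path, StandardCharsets.UTF_8).asScala.toList
  Files.write(path, commentOutFirstLine(original).asJava, StandardCharsets.UTF_8)
}
```

Calling `commentOutFirstLineInFile` once on the input file before loading it with the CSV module should be enough. Running it a second time would add another "#", so it is not idempotent.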
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val customSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("age", IntegerType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("mode", "DROPMALFORMED").
schema(customSchema).load("test.txt")
df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
|  1| abc| 12|
|  2| bcd| 33|
+---+----+---+
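Note the warning message above, which reports the malformed line that was dropped.
Option three: Write your own parser to drop any line that does not have exactly three fields: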
import sqlContext.implicits._ // provides toDF; already in scope in spark-shell
val file = sc.textFile("pathToYourCsvFile")
val df = file.map(line => line.split(",")).
filter(lines => lines.length == 3 && lines(0)!= "id").
map(row => (row(0), row(1), row(2))).
toDF("id", "name", "age")
df.show
+---+----+---+
| id|name|age|
+---+----+---+
|  1| abc| 12|
|  2| bcd| 33|
+---+----+---+
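Note that option three as written yields three string columns and recognizes the header only by its literal "id" value. If typed columns are wanted, the per-line logic could be factored into a small function, sketched below. The name `parseRow` is hypothetical, and trimming whitespace around fields is an assumption that the snippet above does not make:

```scala
import scala.util.Try

// Hypothetical helper: parse one CSV line into (id, name, age).
// Returns None for any line that does not fit that shape, which covers
// both the line to skip and the header row.
def parseRow(line: String): Option[(Int, String, Int)] =
  line.split(",", -1).map(_.trim) match {
    case Array(id, name, age) =>
      // Try(...).toOption turns a NumberFormatException into None
      for {
        i <- Try(id.toInt).toOption
        a <- Try(age.toInt).toOption
      } yield (i, name, a)
    case _ => None // wrong number of fields
  }
```

With this in place, the filter and map steps could be replaced by something like `file.flatMap(line => parseRow(line)).toDF("id", "name", "age")`, which should produce integer id and age columns. Like the original, this naive split does not handle quoted fields that contain commas.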