【Posted】: 2019-03-25 23:39:38
【Question】:
I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"', ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I try this with sample csv data from another source and run display(df), it shows a neatly rendered header row followed by the data rows.
When I try it on my main data, which has 40 columns and millions of rows, it shows only the first 20 column headers and no data rows.
Is this normal behaviour, or is the file being read in wrong?
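(What eventually mattered here is the delimiter: when a CSV reader is given the wrong separator, every row parses as a single wide column, so the display shows one giant "header" and nothing that looks like data. A minimal stdlib sketch of that effect, using made-up pipe-delimited sample data rather than the actual file:)

```python
import csv
import io

# Made-up sample mimicking a pipe-delimited file like the one in question
raw = '"periodID"|"DAXDate"|"Country Name"\n"1"|"2019-01-31"|"Germany"\n'

# Parsed with the default comma delimiter: there are no commas,
# so the whole line becomes one field
comma_rows = list(csv.reader(io.StringIO(raw)))

# Parsed with the correct pipe delimiter: three real columns
pipe_rows = list(csv.reader(io.StringIO(raw), delimiter="|"))

print(len(comma_rows[0]))  # 1
print(len(pipe_rows[0]))   # 3
```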
Update: I'm marking the question as answered, since the hints below were useful. However, running:
df.show(5, truncate=False)
currently shows (the border of dashes is abridged):
+------------------------------------------------...------------------------------------------------+
|��"periodID","DAXDate","Country Name","Year ","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Sales Unit Of Measure","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment ","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"|
+------------------------------------------------...------------------------------------------------+
i.e. the entire header row has landed in a single column.
I'll have to go back to basics and preview the csv in a text editor to work out the correct format of this file and pinpoint the problem. Note that I had to update my code to the following to handle the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
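(The �� at the start of the printed header also suggests a byte-order mark that was never stripped. Before reaching for a text editor, one quick way to confirm both the BOM and the delimiter is to inspect the raw bytes outside Spark. A hedged sketch using Python's stdlib, writing a small mock file to a temp path; the sample content is invented, not the real data:)

```python
import csv
import os
import tempfile

# Create a mock pipe-delimited file with a UTF-8 BOM to mimic the symptoms
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf" + b'"periodID"|"DAXDate"|"Country Name"\n'
            b'"1"|"2019-01-31"|"Germany"\n')

# 1) A UTF-8 BOM is the byte sequence EF BB BF; it renders as junk
#    characters when it is not stripped before display
with open(path, "rb") as f:
    has_utf8_bom = f.read(3) == b"\xef\xbb\xbf"

# 2) Let csv.Sniffer guess the separator from a small text sample
#    ("utf-8-sig" decodes the file and drops the BOM)
with open(path, encoding="utf-8-sig", newline="") as f:
    dialect = csv.Sniffer().sniff(f.read(1024), delimiters=",|;\t")

print(has_utf8_bom, dialect.delimiter)  # True |
os.remove(path)
```

In Spark itself the corresponding fix is the delimiter='|' option shown above; verifying the raw bytes first just avoids guessing at the format.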
【Discussion】:
Tags: apache-spark pyspark databricks