【Question title】: Save pandas data frame as csv on to gcloud storage bucket
【Posted】: 2017-08-03 21:51:51
【Question】:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import gc
import pandas as pd
import datetime
import numpy as np
import sys



APP_NAME = "DataFrameToCSV"

spark = SparkSession\
    .builder\
    .appName(APP_NAME)\
    .config("spark.sql.crossJoin.enabled","true")\
    .getOrCreate()

group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]

dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05","2016-04-20","2016-04-20","2016-04-29"]

#event = [0,1,0,0,0,0,1,1,0,0,0,0,1,0]
event = [0,1,1,0,1,0,1,0,0,1,0,0,0,0]

dataFrameArr = np.column_stack((group_ids,dates,event))

df = pd.DataFrame(dataFrameArr,columns = ["group_ids","dates","event"])

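The Python code above will run on a Spark cluster on gcloud Dataproc. I want to save the pandas DataFrame as a CSV file in the gcloud storage bucket at gs://mybucket/csv_data/

How can I do this?

【Question discussion】:

    Tags: python gcloud google-cloud-dataproc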


    【Solution 1】:

    You can also use Dask for this. Convert your pandas DataFrame to a Dask DataFrame, which can then be written as CSV to Cloud Storage.

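    import dask.dataframe as dd
    import pandas
    df  # your pandas DataFrame
    ddf = dd.from_pandas(df, npartitions=1, sort=True)
    # `gcs` is not defined in this snippet; it is presumably an authenticated
    # gcsfs.GCSFileSystem instance created elsewhere.
    # The * in the file name is replaced by the partition number.
    ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
               storage_options={'token': gcs.session.credentials})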
    

    The storage_options argument is optional.
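
For a DataFrame that already fits in memory, a related route skips Dask and passes the URL straight to pandas. This is a different technique from the Dask one above, sketched under some assumptions: pandas 1.2 or newer (for the `storage_options` argument) and the `gcsfs` package installed for `gs://` targets. The helper name `save_frame_as_csv` and the object name `data.csv` are hypothetical. The dry run below writes to a local temp file so it can be checked without a bucket.

```python
import os
import tempfile

import pandas as pd


def save_frame_as_csv(df, target, storage_options=None):
    """Write df as CSV to a local path or an fsspec URL such as gs://bucket/key.csv.

    Assumption: gs:// targets need the gcsfs package installed and
    pandas 1.2 or newer (for the storage_options argument).
    """
    df.to_csv(target, index=False, storage_options=storage_options)


# Small frame built from a dict so the integer columns keep their dtype.
example_df = pd.DataFrame(
    {
        "group_ids": [1, 1, 2],
        "dates": ["2016-04-01", "2016-04-20", "2016-04-05"],
        "event": [0, 1, 0],
    }
)

# Dry run against a local temp file; on Dataproc the target would instead be
# a bucket URL, e.g. "gs://mybucket/csv_data/data.csv" (hypothetical object name).
local_path = os.path.join(tempfile.mkdtemp(), "data.csv")
save_frame_as_csv(example_df, local_path)
round_trip = pd.read_csv(local_path)
```

Unlike the Dask call, this produces exactly one object at the given key rather than one file per partition.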

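    【Discussion】:

    • You have a typo in the last line. It should be ddf.to_csv
    【Solution 2】:

    So, I figured out how to do this. Continuing from the code above, here is the solution: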

    sc = SparkContext.getOrCreate()  
    
    from pyspark.sql import SQLContext
    sqlCtx = SQLContext(sc)
    sparkDf = sqlCtx.createDataFrame(df)    
    sparkDf.coalesce(1).write.option("header","true").csv('gs://mybucket/csv_data')
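
The same steps can probably be wrapped in a small helper that reuses the `spark` session already created in the question, instead of building a separate `SQLContext`. This is only a sketch: the function name `write_single_csv` is hypothetical, and the `mode("overwrite")` call is an addition that is not part of the original answer.

```python
def write_single_csv(spark, pandas_df, target_dir):
    """Convert a pandas DataFrame to a Spark DataFrame and write it as CSV.

    spark      -- an existing SparkSession (such as the one built in the question)
    target_dir -- output directory, e.g. "gs://mybucket/csv_data"
    """
    spark_df = spark.createDataFrame(pandas_df)
    # coalesce(1) forces a single partition, so Spark emits one part-*.csv file
    # inside target_dir rather than one file per partition.
    writer = spark_df.coalesce(1).write.option("header", "true")
    # mode("overwrite") is an addition here: it replaces any existing output.
    writer.mode("overwrite").csv(target_dir)
```

It would be called as `write_single_csv(spark, df, 'gs://mybucket/csv_data')`. Note that Spark treats the path as a directory: the result is a `csv_data` folder containing a `part-*.csv` file, not a single file named `csv_data`.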
    

    【Discussion】:
