【Posted】: 2017-08-03 21:51:51
【Question】:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import gc
import pandas as pd
import datetime
import numpy as np
import sys
APP_NAME = "DataFrameToCSV"
spark = SparkSession\
    .builder\
    .appName(APP_NAME)\
    .config("spark.sql.crossJoin.enabled","true")\
    .getOrCreate()
group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]
dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05","2016-04-20","2016-04-20","2016-04-29"]
#event = [0,1,0,0,0,0,1,1,0,0,0,0,1,0]
event = [0,1,1,0,1,0,1,0,0,1,0,0,0,0]
dataFrameArr = np.column_stack((group_ids,dates,event))
df = pd.DataFrame(dataFrameArr,columns = ["group_ids","dates","event"])
The Python code above will run on a Spark cluster on gcloud Dataproc. I want to save the pandas DataFrame as a CSV file in the Google Cloud Storage bucket at gs://mybucket/csv_data/.
How can I do this?
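One possible approach, offered as a minimal sketch rather than a definitive answer: convert the pandas DataFrame to a Spark DataFrame and let Spark write it to the bucket. This assumes the cluster can resolve gs:// paths through the Cloud Storage connector that Dataproc clusters normally ship with. The helper name save_pandas_df_to_gcs_csv is hypothetical and is not part of any library.

```python
def save_pandas_df_to_gcs_csv(spark, pdf, gcs_dir):
    """Write a pandas DataFrame as CSV under a gs:// directory using Spark.

    Hypothetical helper for illustration only.
    spark   -- an existing SparkSession (e.g. the `spark` object built above)
    pdf     -- the pandas DataFrame to save (e.g. `df` above)
    gcs_dir -- target directory, e.g. "gs://mybucket/csv_data/"

    Assumption: the cluster can read and write gs:// paths.
    """
    # Turn the pandas DataFrame into a distributed Spark DataFrame.
    sdf = spark.createDataFrame(pdf)

    # coalesce(1) collapses the data into one partition so that Spark emits
    # a single part file; "overwrite" replaces any existing output directory.
    (sdf.coalesce(1)
        .write
        .mode("overwrite")
        .csv(gcs_dir, header=True))
```

With the objects from the question, the call would be save_pandas_df_to_gcs_csv(spark, df, "gs://mybucket/csv_data/"). Note that Spark treats the path as a directory and writes a part-*.csv file inside it, so the result is not a single file with a name of your choosing.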
【Comments】:
Tags: python gcloud google-cloud-dataproc