【Posted】: 2019-03-02 16:22:29
【Question】:
First, I load the data as follows:
import urllib.request
f = urllib.request.urlretrieve("https://www.dropbox.com/s/qz62t2oyllkl32s/kddcup.data_10_percent.gz?dl=1", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)
Then I build a list of the data I need as follows:
import numpy as np
import pandas as pd
def parse_interaction(line):
    line_split = line.split(",")
    # keep just numeric and logical values
    symbolic_indexes = [1,2,3,41] # in the above sample would be: tcp,http,SF,normal
    clean_line_split = [item for i,item in enumerate(line_split) if i not in symbolic_indexes]
    return np.array([x for x in clean_line_split], dtype=float)
vector_data = raw_data.map(parse_interaction)
Now I can inspect the data with vector_data.take(2):
[array([0.00e+00, 1.81e+02, 5.45e+03, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 8.00e+00, 8.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 9.00e+00, 9.00e+00,
1.00e+00, 0.00e+00, 1.10e-01, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00]),
array([0.00e+00, 2.39e+02, 4.86e+02, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 8.00e+00, 8.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 1.90e+01, 1.90e+01,
1.00e+00, 0.00e+00, 5.00e-02, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00])]
I want to convert it to a DataFrame with vector_data = pd.DataFrame(vector_data), but the command fails with the following error:
ValueError Traceback (most recent call last)
<ipython-input-112-6a2dcc5bdb85> in <module>()
10
11 vector_data = raw_data.map(parse_interaction)
---> 12 vector_data = pd.DataFrame(vector_data)
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
420 dtype=values.dtype, copy=False)
421 else:
--> 422 raise ValueError('DataFrame constructor not properly called!')
423
424 NDFrame.__init__(self, mgr, fastpath=True)
ValueError: DataFrame constructor not properly called!
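I understand that the input vector is in a special format and that I need to add something to the DataFrame call for it to work. Please show me how to build a DataFrame from it.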
【Comments】:
- What is raw_data.map?
- @roganjosh, I just added all the code.
- pd.DataFrame({'vector1': vector_data[0], 'vector2': vector_data[1]})
- Which one do you want, a DataFrame with 2 columns or with 38 columns?
- @bakka With 38 columns.
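For the 38-column case, here is a minimal sketch of one way this could work. It assumes the error comes from vector_data being a Spark RDD (the result of raw_data.map(...)) rather than a local list, which the pandas constructor cannot interpret. The list rows below stands in for what vector_data.take(2) or vector_data.collect() would return, and the column names f0 to f37 are hypothetical placeholders, not the real KDD Cup feature names.

```python
import numpy as np
import pandas as pd

# Stand-in for the local result of vector_data.take(2): a plain Python list
# of 38-element float arrays. In the original session this would be e.g.
#   rows = vector_data.take(2)      # or vector_data.collect()
rows = [np.zeros(38), np.zeros(38)]
rows[0][1], rows[0][2] = 181.0, 5450.0   # a few values copied from the sample output
rows[1][1], rows[1][2] = 239.0, 486.0

# Stack the 1-D arrays into a 2-D array: one row per record, 38 columns.
matrix = np.vstack(rows)

# Hypothetical column names, used only for illustration.
columns = ["f{}".format(i) for i in range(matrix.shape[1])]

df = pd.DataFrame(matrix, columns=columns)
print(df.shape)
```

Note that collect() pulls the whole dataset into driver memory, so take(n) on a sample may be the safer choice if the file is large.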
Tags: python pandas pyspark jupyter-notebook