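【Posted】: 2016-12-10 00:12:24
【Question】:
I have more than 50 large CSV files, each over 10 MB. Each file has more than 25 columns and more than 50K rows.
All of them share the same header, and I am trying to merge them into a single CSV in which the header appears only once.
Option: One. Code: works for small CSVs (more than 25 columns, but file sizes in the KB range).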
import pandas as pd
import glob
# Collect every CSV in the current directory
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    # Each file is loaded fully into memory before concatenation
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
However, the code above does not work for the larger files and raises an error.
Error:
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
all_files = glob.glob("*.csv", encoding='utf8', engine='python')
TypeError: glob() got an unexpected keyword argument 'encoding'
lakshmi@lakshmi-HP-15-Notebook-PC:~/Desktop/Twitter_Lat_lon/nasik_rain/rain_2$ python merge_large.py
Traceback (most recent call last):
File "merge_large.py", line 10, in <module>
df = pd.read_csv(file_,index_col=None, header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
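The "possible malformed input file" message can indicate that some rows do not have the same number of fields as the header. As a minimal sketch for checking that, assuming Python 3 and UTF-8 input, the hypothetical helper `find_ragged_rows` below (not part of pandas) scans one file with the standard `csv` module and reports rows whose width differs from the header:

```python
import csv


def find_ragged_rows(path, encoding="utf-8"):
    """Return (line_number, field_count) for rows whose width differs from the header.

    Hypothetical diagnostic helper; the encoding is an assumption.
    """
    bad = []
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header is None:
            return bad  # empty file, nothing to check
        expected = len(header)
        for row in reader:
            # Blank lines come back as empty lists; skip them
            if row and len(row) != expected:
                # line_num is the physical line on which the row ended
                bad.append((reader.line_num, len(row)))
    return bad
```

Running it over each input file, for example `for f in sorted(glob.glob("*.csv")): print(f, find_ragged_rows(f))`, would show which files and lines are worth inspecting by hand.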
Code: 25+ columns, but file sizes over 10 MB
Option: Four
import pandas as pd
import glob
# Collect every CSV in the current directory
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    # Each file is loaded fully into memory before concatenation
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
Error:
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
allFiles = glob.glob("*.csv", sep=None)
TypeError: glob() got an unexpected keyword argument 'sep'
I have searched extensively but could not find a solution for concatenating large CSV files with the same header into a single file.
Edit:
Code:
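One way to approach the stated goal without loading whole files into memory is to stream rows from each input to the output and write the header only once. The following is a minimal sketch under stated assumptions, not a definitive solution: it assumes Python 3, UTF-8 input, and identical headers, and the function name `merge_csvs` is hypothetical.

```python
import csv
import glob
import os


def merge_csvs(pattern, out_path, encoding="utf-8"):
    """Stream all CSVs matching `pattern` into `out_path`, writing the header once.

    Returns the number of data rows written. Hypothetical helper.
    """
    out_abs = os.path.abspath(out_path)
    # Exclude the output file in case it matches the same pattern
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    header = None
    rows_written = 0
    with open(out_path, "w", newline="", encoding=encoding) as out_fh:
        writer = csv.writer(out_fh)
        for path in paths:
            with open(path, newline="", encoding=encoding) as in_fh:
                reader = csv.reader(in_fh)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("header mismatch in %s" % path)
                for row in reader:
                    if row:  # ignore blank lines
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

A call such as `merge_csvs("*.csv", "output.csv")` would then hold only one row in memory at a time. Because nothing is parsed into numeric types, values are copied through as text, although quoting may be normalized by `csv.writer`.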
import dask.dataframe as dd
ddf = dd.read_csv('*.csv')
ddf.to_csv('master.csv',index=False)
Error:
Traceback (most recent call last):
File "merge_csv_dask.py", line 5, in <module>
ddf.to_csv('master.csv',index=False)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 792, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io.py", line 762, in to_csv
compute(*values)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 58, in get
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 481, in get_async
raise(remote_exception(res, tb))
dask.async.ValueError: could not convert string to float: {u'type': u'Point', u'coordinates': [4.34279, 50.8443]}
Traceback
---------
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 49, in bytes_read_csv
coerce_dtypes(df, dtypes)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 73, in coerce_dtypes
df[c] = df[c].astype(dtypes[c])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2950, in astype
raise_on_error=raise_on_error, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2938, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2890, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 434, in astype
values=values, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 477, in _astype
values = com._astype_nansafe(values.ravel(), dtype, copy=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/common.py", line 1920, in _astype_nansafe
return arr.astype(dtype)
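The `could not convert string to float` message suggests that a column inferred as numeric also holds text such as the GeoJSON-like value shown in the error. One possible workaround, sketched here with pandas rather than dask, is to read every column as a string and append to the output chunk by chunk. The function name `merge_csvs_as_text` and the chunk size are assumptions made for illustration.

```python
import glob
import os

import pandas as pd


def merge_csvs_as_text(pattern, out_path, chunksize=50000):
    """Append all CSVs matching `pattern` to `out_path`, treating every column as text.

    Returns the number of data rows written. Hypothetical helper.
    """
    out_abs = os.path.abspath(out_path)
    # Exclude the output file in case it matches the same pattern
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    first = True
    total = 0
    for path in paths:
        # dtype=str avoids numeric coercion; keep_default_na=False keeps empty cells as ""
        for chunk in pd.read_csv(path, dtype=str, keep_default_na=False, chunksize=chunksize):
            # Write the header only with the very first chunk, then append
            chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
            first = False
            total += len(chunk)
    return total
```

Reading as text trades type inference for robustness: columns can still be converted afterwards, once the mixed values have been cleaned.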
【Discussion】:
Tags: python csv pandas concatenation