【Posted】: 2019-05-14 09:11:30
【Question】:
I want to load the contents of a large tab-delimited text file, which I retrieve over FTP, directly into a pandas DataFrame.
import pandas as pd
import urllib.request as ur
# retrieve only the header row & set dtypes to save some memory
refseq_summary = "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt"
req = ur.Request(refseq_summary)
z_f = ur.urlopen(req)
col_names = pd.read_csv(z_f, sep='\t', nrows=0, skiprows=1)
for col in list(col_names.columns[:]):
    col_names[col] = col_names[col].astype("object")
col_names["taxid"] = col_names["taxid"].astype("Int64")
col_names.rename(columns={'# assembly_accession':'assembly_accession'}, inplace=True)
col_dtypes = col_names.dtypes.to_dict()
col_names_list = list(col_names.columns.values)
# read the whole file, and set the dtype & column names
df = pd.read_csv(z_f, sep='\t', dtype=col_dtypes, names=col_names_list, comment="#")
But for some reason the first ~850 rows are missing from df, and the first row is completely garbled.
【Discussion】:
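A likely cause is that both `pd.read_csv` calls share the same `z_f` handle. The first call only needs the header, but the parser still pulls a whole chunk of bytes from the stream, and an FTP response cannot be rewound. The second call would then start wherever the first one stopped, usually in the middle of a line, which would explain both the missing rows and the garbled first row. Below is a minimal sketch of one way around it, assuming the file fits in memory: read the payload once and parse it from a fresh in-memory buffer on each pass. The helper name `read_summary` is hypothetical, and `skiprows=2` assumes the file starts with exactly one comment line followed by the header line, as the `skiprows=1` in the question implies.

```python
import io

import pandas as pd


def read_summary(stream):
    """Parse a tab-separated summary from a binary file-like object.

    Hypothetical helper: assumes one leading comment line, then a header
    line whose first column name starts with '# '.
    """
    # Read the payload once; a network response cannot be rewound.
    data = stream.read()

    # Pass 1: header only. skiprows=1 skips the leading comment line.
    header = pd.read_csv(io.BytesIO(data), sep="\t", nrows=0, skiprows=1)
    names = [col.lstrip("#").strip() for col in header.columns]

    # Keep every column as object, except taxid as a nullable integer.
    dtypes = {name: "object" for name in names}
    if "taxid" in dtypes:
        dtypes["taxid"] = "Int64"

    # Pass 2: a fresh buffer, so parsing starts at byte 0 again.
    # skiprows=2 drops the comment line and the header line.
    return pd.read_csv(
        io.BytesIO(data), sep="\t", skiprows=2, names=names, dtype=dtypes
    )
```

With the code from the question this would be called as `df = read_summary(ur.urlopen(refseq_summary))`. If holding the raw bytes in memory is too expensive, downloading to a temporary file and passing its path to both `read_csv` calls should work the same way.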
Tags: python pandas dataframe ftp