【Title】: How to read multiple CSV files faster using Python pandas
【Posted】: 2020-01-23 07:16:38
【Description】:

My program needs to read about 400,000 csv files, and it takes a very long time. The code I am using is:

        for file in self.files:
            size = 2048
            # Skip the first size/2 rows and the trailing rows, keeping
            # only the 10 rows after row size/2.
            csvData = pd.read_csv(file, sep='\t', names=['acol', 'bcol'],
                                  header=None, skiprows=range(0, int(size / 2)),
                                  skipfooter=(int(size / 2) - 10))

            s = 0.0  # reset the running sum for each file
            for index in range(0, 10):
                s = s + float(csvData['bcol'][index])
            s = s / 10
            averages.append(s)
            time = file.rpartition('\\')[2]
            time = int(re.search(r'\d+', time).group())
            times.append(time)

Is there any way to improve the speed?

【Discussion】:

Tags: python pandas bigdata


【Solution 1】:

You can use threads. I took the code below from here and modified it for your use case:

import re
import pandas as pd
from threading import Thread, Lock

averages = []
times = []
lock = Lock()

def my_func(file):
    size = 2048
    csvData = pd.read_csv(file, sep='\t', names=['acol', 'bcol'],
                          header=None, skiprows=range(0, int(size / 2)),
                          skipfooter=(int(size / 2) - 10))

    s = 0.0
    for index in range(0, 10):
        s = s + float(csvData['bcol'][index])
    s = s / 10
    time = file.rpartition('\\')[2]
    time = int(re.search(r'\d+', time).group())
    # Append both results under one lock so that averages[i] and times[i]
    # stay paired even when threads finish out of order.
    with lock:
        averages.append(s)
        times.append(time)

threads = []
# In this case 'self.files' is the list of files to be read.
for file in self.files:
    # We start one thread per file present.
    process = Thread(target=my_func, args=[file])
    process.start()
    threads.append(process)
# We now pause execution on the main thread by 'joining' all of our started
# threads. This ensures that each has finished processing its file.
for process in threads:
    process.join()

【Discussion】:
