【Question Title】: Why is the memory usage going up, even though the usage of the single variables is going down?
【Posted】: 2020-01-18 08:29:03
【Question Description】:

I am training a random forest classifier with scikit-learn. My data consists of roughly 8,000 data points with over 60,000 features each. After training, I access the features via clf.feature_importances_, sort them by value, and delete every feature whose importance is 0. I also delete the single least informative remaining feature. Then I write all remaining features and their respective values to a new file. This is where my recursion starts: I read in the features I want to use, minus all the useless ones from the previous runs. I do not reload my dataset; I only use this subset of features with pandas. Everything actually works: the number of variables shrinks and the filtering behaves as expected, but memory usage rises with every recursive step, so after only 10 iterations my usage is at about 13%, up from 4.5% at the start.

Before starting a new iteration step I already tried the garbage collector with gc.collect(). I also tried deleting some variables with del and resetting variables to empty lists or plain zeros, to rule out variables piling up (which they were not).

I use this function to determine the size of my variables, and their sizes are indeed going down.

import sys
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

# Report the ten largest local variables in human-readable form.
for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

The recursion is mainly this; it does not include the reading-in of my data:

def rek(rekfile,run):
    import gc
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    ##tried to set all variables to empty lists or zeros which did not work
    cladd = {}
    sorted_cladd ={}
    x_pre = []
    y = 0
    X = 0
    clf = 0
    with open(rekfile) as inf:  ## tab-separated feature/value file
        for line in inf:
            spl = line.split('\t')
            ensg = spl[0]
            x_pre.append(ensg)
    y = data_as_pd['label']  ## data_as_pd is loaded outside this function
    X = data_as_pd[x_pre]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    #Create a rf Classifier
    clf=RandomForestClassifier(n_estimators=500) 
    #Train the model using the training sets
    clf.fit(X_train, y_train)
    #Predict the response for test dataset
    y_pred = clf.predict(X_test)
    accu = metrics.accuracy_score(y_test, y_pred)
    # Model Accuracy: how often is the classifier correct?
    print("Accuracy:",accu)
    ##only proceeding with the recursion if the accuracy threshold is satisfied
    if accu > 0.925:

        featout = "/home/andre/tf/forest/rekursion/rf_feature_values/feature_values_rf_fullmodel_wo_normal_n500_50perc-split_rekursion_run_"+str(run)+".txt" ##where the features for the next runs are saved

        orf = open(featout,'w')
        index = 0
        ##sorting the features for their importance and deleting zeros
        for classi in clf.feature_importances_:
            if classi !=float(0):
                cladd[x_pre[index]] = classi
            else:
                dump.write(x_pre[index]+'\n')  ## 'dump' is a file handle opened elsewhere (not shown here)
            index = index + 1
        sorted_cladd = sorted(cladd.items(),key=lambda x: x[1], reverse=True)
        ##deleting the feature with the least information
        sorted_cladd.pop()  ## pop() already removes the last (least important) entry; the stray [-1] did nothing
        for a in range(len(sorted_cladd)):
            orf.write(sorted_cladd[a][0]+'\t'+str(sorted_cladd[a][1])+'\n')
        orf.close()
        ##set run variable
        newrun = run+1
        del clf ##tried to reduce size deleting clf (not working)
        gc.collect() ##tried garbage collector (not working)
        rek(featout,newrun) ##new iteration

I expected the memory usage (RAM) to go down over the iterations, since I reduce the input data at every step, but it actually goes up until the process dies with a "Memory Error" message.

I hope someone can help me, because I really cannot see what I am missing here. Any help is greatly appreciated.

Regards,

Andre

EDIT: Using a while loop works perfectly for me and reduced the memory usage as expected!

【Question Comments】:

    Tags: python scikit-learn random-forest


    【Solution 1】:

    Python stack frames are quite heavy. I don't know the details of your implementation, but recursion is generally not efficient in terms of memory usage. Note also that every local variable of a pending call (your X, X_train, X_test, and so on) stays reachable until that call returns, so a del clf at the bottom frees only a small part. Recursion buys readability at a significant memory cost.
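
    One way to see this effect (a minimal sketch, not from the original answer) is to watch peak memory while a recursive function keeps a large local alive at every level:

    import tracemalloc

    def recurse(n):
        data = [0.0] * 1_000_000  # large local; stays alive until this frame returns
        if n > 0:
            recurse(n - 1)        # every pending frame above still holds its own 'data'

    tracemalloc.start()
    recurse(5)
    current, peak = tracemalloc.get_traced_memory()
    print("peak: %.1f MiB" % (peak / 2**20))  # roughly six copies of 'data' alive at once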

    Some useful insights on recursion in Python can be found here.

    I suggest rewriting the code with an iterative approach, which should reduce the memory usage.
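
    For instance, the skeleton could look like this (a rough sketch of the shape only; run_one_pass is a hypothetical helper wrapping the body of your rek function, returning the next feature file or None once the accuracy drops below the threshold):

    def run_iteratively(rekfile):
        run = 0
        # A plain loop instead of recursion: each pass's locals become
        # unreachable once the pass ends, so memory can be reclaimed.
        while rekfile is not None:
            rekfile = run_one_pass(rekfile, run)  # hypothetical helper: returns featout, or None to stop
            run += 1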

    【Comments】:

    • So you're actually telling me to drop the recursion for better memory handling and use a while loop instead?
    • @Pyretu Exactly :)
    • Thank you very much for the quick reply, I'll give it a try and hope it works out!
    【Solution 2】:

    I think the first avenue for improvement is to refactor the code so that it is easier to track when variables go out of scope and can be garbage-collected. That said, as @Nikaidoh mentioned, recursion is not great in Python and should be avoided whenever possible. In your case it is actually fairly easy to do the same thing with a for loop, which may even improve readability.

    Here is a rewrite of your code, including the refactoring and an alternative to the recursive implementation. I would advise using the latter (and thus removing the export_features_list and rek functions), and possibly modifying fit_clf to return clf.feature_importances_ rather than clf itself (thus removing the del clf from main; you would have to test whether the explicit gc.collect() call is still relevant), as indicated in the comments.

    import gc
    
    import numpy as np
    import pandas as pd
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    from sklearn.ensemble import RandomForestClassifier
    
    
    def get_predictors_list(rekfile):
        with open(rekfile) as inf:  # tab-separated feature/value file
            x_pre = [
                line.split('\t', 1)[0]
                for line in inf
            ]
        return x_pre
    
    
    def fit_clf(x_pre):
        y = data_as_pd['label']
        X = data_as_pd[x_pre]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
        # Create a rf Classifier
        clf=RandomForestClassifier(n_estimators=500) 
        # Train the model using the training sets
        clf.fit(X_train, y_train)
        # Predict the response for test dataset
        y_pred = clf.predict(X_test)
        accu = metrics.accuracy_score(y_test, y_pred)
        # Model Accuracy: how often is the classifier correct?
        print("Accuracy:", accu)
        # Return the trained model and its accuracy.
        return clf, accu
    
    
    def export_features_list(run, clf, x_pre):
        featout = "/home/andre/tf/forest/rekursion/rf_feature_values/feature_values_rf_fullmodel_wo_normal_n500_50perc-split_rekursion_run_"+str(run)+".txt" ##where the features for the next runs are saved
        orf = open(featout, 'w')
        index = 0
        ## sorting the features for their importance and deleting zeros
        cladd = {}
        for classi in clf.feature_importances_:
            if classi != float(0):
                cladd[x_pre[index]] = classi
            else:
                dump.write(x_pre[index]+'\n')  # 'dump' is a file handle opened elsewhere (not shown here)
            index = index + 1
        sorted_cladd = sorted(cladd.items(), key=lambda x: x[1], reverse=True)
        ## deleting the feature with the least information
        sorted_cladd.pop()  # pop() already removes the least important entry; the stray [-1] did nothing
        for a in range(len(sorted_cladd)):
            orf.write(sorted_cladd[a][0]+'\t'+str(sorted_cladd[a][1])+'\n')
        orf.close()
        ## return the path to the kept features' file
        return featout
    
    
    # Recursive way - preferably avoid this, as recursion is not that great in Python.
    def rek(rekfile, run):
        x_pre = get_predictors_list(rekfile)
        clf, accu = fit_clf(x_pre)
        ## only proceeding with the rekursion if accuracy limit is satisfied
        if accu > 0.925:
            featout = export_features_list(run, clf, x_pre)
            gc.collect()  # hopefully useless, but who knows?
            rek(featout, run + 1)
    
    
    # Alternative way, with a for loop.
    def main(features_file, accu_thresh=0.925, max_runs=20):
        """Train RandomForest classifiers, iteratively removing features.
    
        Stop removing features when it would result in an accuracy below
        `accu_thresh`, or when `max_runs` features have been removed.
    
        Return the list of kept features, as well as the list of dropped
        ones, in removal order.
        """
        features = get_predictors_list(features_file)
        dropped = []
        for run in range(max_runs):
            print('Run %i' % run)
            # Fit a model and evaluate its accuracy.
            # FIXME: we could also return features importance only!
            clf, accuracy = fit_clf(features)
            # If the accuracy is too low, stop the process.
            if accuracy <= accu_thresh:
                break
            # Otherwise, drop the least important feature.
            dropped.append(features.pop(np.argmin(clf.feature_importances_)))
            print('Dropped feature %s.' % dropped[run])
            # Explicitly delete the model and garbage collect, for safety.
            del clf
            gc.collect()
        # Return selected features, and the list of dropped ones.
        return features, dropped
    
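    A hypothetical invocation (the file name here is made up; data_as_pd is assumed to be loaded at module level, as in your original script):

    # Hypothetical usage: 'features_run_0.txt' stands in for your initial feature file.
    kept, dropped = main('features_run_0.txt', accu_thresh=0.925, max_runs=20)
    print('Kept %i features; dropped %i.' % (len(kept), len(dropped)))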

    I hope this helps. Best regards, Paul

    【Comments】:

    • Hi Paul, thank you very much for your very detailed reply. I will give it a try after I've tried the while-loop suggestion. Regards,