【Question Title】: How to calculate correlation between all columns and remove highly correlated ones using pandas?
【Posted】: 2015-05-31 10:40:21
【Question Description】:

I have a huge dataset, and prior to machine learning modeling it is always suggested that first you should remove highly correlated descriptors (columns). How can I calculate the pairwise correlation between the columns and remove the columns above a threshold value, say remove all the columns or descriptors having a correlation > 0.8? It should also retain the headers in the reduced data.

Sample dataset

 GA      PN       PC     MBP      GR     AP   
0.033   6.652   6.681   0.194   0.874   3.177    
0.034   9.039   6.224   0.194   1.137   3.4      
0.035   10.936  10.304  1.015   0.911   4.9      
0.022   10.11   9.603   1.374   0.848   4.566    
0.035   2.963   17.156  0.599   0.823   9.406    
0.033   10.872  10.244  1.015   0.574   4.871     
0.035   21.694  22.389  1.015   0.859   9.259     
0.035   10.936  10.304  1.015   0.911   4.5       
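For reference, a minimal sketch (my own, not part of the original question) of loading the sample above into pandas and computing the pairwise correlation matrix that the answers below start from:

```python
import pandas as pd

# The sample dataset from the question
df = pd.DataFrame({
    "GA":  [0.033, 0.034, 0.035, 0.022, 0.035, 0.033, 0.035, 0.035],
    "PN":  [6.652, 9.039, 10.936, 10.11, 2.963, 10.872, 21.694, 10.936],
    "PC":  [6.681, 6.224, 10.304, 9.603, 17.156, 10.244, 22.389, 10.304],
    "MBP": [0.194, 0.194, 1.015, 1.374, 0.599, 1.015, 1.015, 1.015],
    "GR":  [0.874, 1.137, 0.911, 0.848, 0.823, 0.574, 0.859, 0.911],
    "AP":  [3.177, 3.4, 4.9, 4.566, 9.406, 4.871, 9.259, 4.5],
})

# Pairwise absolute Pearson correlations between all columns
corr = df.corr().abs()
print(corr.round(3))
```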

Please help...

【Question Comments】:

  • Feature-engine has a built-in DropCorrelatedFeatures() transformer that does the heavy lifting for you and is sklearn-compatible. The features_to_drop_ attribute shows which features it will drop.

Tags: python pandas correlation


【Solution 1】:

The method here worked well for me, using only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features 
df.drop(to_drop, axis=1, inplace=True)

【Discussion】:

  • Isn't this flawed? The first column is never considered for dropping, even though it may be highly correlated with other columns, because when the upper triangle is selected none of the first column's values remain.
  • I got an error when dropping the selected features; the following code worked for me: df.drop(to_drop, axis=1, inplace=True)
  • @ikbelbenabdessamad Yes, your code is better. I just updated that old version of the code, thank you!
  • As of the date of writing this comment, this seems to work fine. I cross-checked it against other methods provided in answers for different thresholds, and the results were identical. Thanks!
  • This drops all columns with corr > 0.95; we want to drop all except one of each correlated group.
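On the last comment's concern, a quick toy check (my own example, not from the thread) shows the upper-triangle trick does keep one column of each correlated pair: with k=1 the earlier member of a pair never appears in `upper`, so only the later duplicate is flagged:

```python
import numpy as np
import pandas as pd

# Toy frame: "a" and "b" are perfectly correlated, "c" is not
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})

corr_matrix = df.corr().abs()
# k=1 keeps only the entries strictly above the diagonal
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # → ['b']: "a" survives, only its duplicate "b" is flagged
```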
【Solution 2】:

Here is the approach I use:

def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset

    print(dataset)

Hope this helps!

【Discussion】:

  • I feel this solution fails in the following general case: say you have columns c1, c2 and c3. c1 and c2 are correlated above the threshold, and the same goes for c2 and c3. With this solution both c2 and c3 are dropped, even though c3 may not be correlated with c1 above that threshold. I suggest changing: if corr_matrix.iloc[i, j] >= threshold: to: if corr_matrix.iloc[i, j] >= threshold and (corr_matrix.columns[j] not in col_corr):
  • @vcovo If c1 and c2 are correlated, and c2 and c3 are correlated, then there is a high chance that c1 and c3 are also correlated. Although, if that is not true, then I believe your suggestion to change the code is correct.
  • They are most likely correlated, but not necessarily above the same threshold. This led to a significant difference in the removed columns for my use case. I ended up with 218 instead of 180 columns when adding the additional condition mentioned in the first comment.
  • Makes sense. I have updated the code as per your suggestion.
  • Shouldn't you use the absolute value of the correlation matrix?
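On the transitivity point in the discussion above: correlation is indeed not transitive, which a contrived example (my own, with a 0.7 threshold in mind) makes easy to verify:

```python
import pandas as pd

# c1 and c3 are zero-mean and orthogonal (uncorrelated); c2 = c1 + c3,
# so c2 is correlated with both while c1 and c3 are not correlated at all
df = pd.DataFrame({
    "c1": [-2, -1, 0, 1, 2],
    "c3": [1, -2, 0, 2, -1],
})
df["c2"] = df["c1"] + df["c3"]

corr = df.corr().abs()
print(corr.round(3))
# corr(c1, c2) == corr(c3, c2) ≈ 0.707, yet corr(c1, c3) == 0
```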
【Solution 3】:

Here is an Auto ML class I created to eliminate multicollinearity between features.

What makes my code unique is that out of two features that have high correlation, I eliminate the feature that is least correlated with the target! I got the idea from this seminar by Vishal Patel - https://www.youtube.com/watch?v=ioXKxulmwVQ&feature=youtu.be

#Feature selection class to eliminate multicollinearity
class MultiCollinearityEliminator():
    
    #Class Constructor
    def __init__(self, df, target, threshold):
        self.df = df
        self.target = target
        self.threshold = threshold

    #Method to create and return the feature correlation matrix dataframe
    def createCorrMatrix(self, include_target = False):
        #Checking whether we should include the target in the correlation matrix
        if (include_target == False):
            df_temp = self.df.drop([self.target], axis =1)
            
            #Setting method to Pearson to prevent issues in case the default method for df.corr() gets changed
            #Setting min_period to 30 for the sample size to be statistically significant (normal) according to 
            #central limit theorem
            corrMatrix = df_temp.corr(method='pearson', min_periods=30).abs()
        #Target is included for creating the series of feature to target correlation - Please refer the notes under the 
        #print statement to understand why we create the series of feature to target correlation
        elif (include_target == True):
            corrMatrix = self.df.corr(method='pearson', min_periods=30).abs()
        return corrMatrix

    #Method to create and return the feature to target correlation matrix dataframe
    def createCorrMatrixWithTarget(self):
        #After obtaining the list of correlated features, this method will help to view which variables 
        #(in the list of correlated features) are least correlated with the target
        #This way, out of the list of correlated features, we can ensure to eliminate the feature that is 
        #least correlated with the target
        #This not only helps to sustain the predictive power of the model but also helps in reducing model complexity
        
        #Obtaining the correlation matrix of the dataframe (along with the target)
        corrMatrix = self.createCorrMatrix(include_target = True)                           
        #Creating the required dataframe, then dropping the target row 
        #and sorting by the value of correlation with target (in ascending order)
        corrWithTarget = pd.DataFrame(corrMatrix.loc[:,self.target]).drop([self.target], axis = 0).sort_values(by = self.target)                    
        print(corrWithTarget, '\n')
        return corrWithTarget

    #Method to create and return the list of correlated features
    def createCorrelatedFeaturesList(self):
        #Obtaining the correlation matrix of the dataframe (without the target)
        corrMatrix = self.createCorrMatrix(include_target = False)                          
        colCorr = []
        #Iterating through the columns of the correlation matrix dataframe
        for column in corrMatrix.columns:
            #Iterating through the values (row wise) of the correlation matrix dataframe
            for idx, row in corrMatrix.iterrows():                                            
                if(row[column]>self.threshold) and (row[column]<1):
                    #Adding the features that are not already in the list of correlated features
                    if (idx not in colCorr):
                        colCorr.append(idx)
                    if (column not in colCorr):
                        colCorr.append(column)
        print(colCorr, '\n')
        return colCorr

    #Method to eliminate the least important features from the list of correlated features
    def deleteFeatures(self, colCorr):
        #Obtaining the feature to target correlation matrix dataframe
        corrWithTarget = self.createCorrMatrixWithTarget()                                  
        for idx, row in corrWithTarget.iterrows():
            print(idx, '\n')
            if (idx in colCorr):
                self.df = self.df.drop(idx, axis =1)
                break
        return self.df

    #Method to run automatically eliminate multicollinearity
    def autoEliminateMulticollinearity(self):
        #Obtaining the list of correlated features
        colCorr = self.createCorrelatedFeaturesList()                                       
        while colCorr != []:
            #Obtaining the dataframe after deleting the feature (from the list of correlated features) 
            #that is least correlated with the target
            self.df = self.deleteFeatures(colCorr)
            #Obtaining the list of correlated features
            colCorr = self.createCorrelatedFeaturesList()                                     
        return self.df

【Discussion】:

    【Solution 4】:

    Could you test this code below?

    # Load the libraries
    import pandas as pd
    import numpy as np
    # Create feature matrix with two highly correlated features
    
    X = np.array([[1, 1, 1],
              [2, 2, 0],
              [3, 3, 1],
              [4, 4, 0],
              [5, 5, 1],
              [6, 6, 0],
              [7, 7, 1],
              [8, 7, 0],
              [9, 7, 1]])
    
    # Convert feature matrix into DataFrame
    df = pd.DataFrame(X)
    
    # View the data frame
    df
    
    # Create correlation matrix
    corr_matrix = df.corr().abs()
    
    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    # Find index of feature columns with correlation greater than 0.95
    to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
    # Drop features 
    df = df.drop(to_drop, axis=1)
    

    【Discussion】:

    • While this code may provide a solution to the question, it is better to add context as to why/how it works. This can help future users learn, and eventually apply that knowledge to their own code. You are also likely to get positive feedback from users, in the form of upvotes, when the code is explained.
    【Solution 5】:

    For a given data frame df, you can use the following:

    corr_matrix = df.corr().abs()
    high_corr_var=np.where(corr_matrix>0.8)
    high_corr_var=[(corr_matrix.columns[x],corr_matrix.columns[y]) for x,y in zip(*high_corr_var) if x!=y and x<y]
    

    【Discussion】:

    • This did not work for me. Please consider rewriting your solution as a method. Error: "ValueError: too many values to unpack (expected 2)".
    • It should be high_corr_var=[(corr_matrix.index[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_var) if x != y and x < y]
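A runnable version of the fix suggested in the comment above, on a toy frame of my own (indexing the rows via .index and the columns via .columns, and keeping x < y so each pair is reported once and the diagonal is skipped):

```python
import numpy as np
import pandas as pd

# Toy frame with one highly correlated pair ("x" and "y")
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, 0, 1, 0]})

corr_matrix = df.corr().abs()
high_corr_var = np.where(corr_matrix > 0.8)
# zip(*...) pairs up the row/column index arrays returned by np.where
high_corr_var = [(corr_matrix.index[x], corr_matrix.columns[y])
                 for x, y in zip(*high_corr_var) if x != y and x < y]
print(high_corr_var)  # → [('x', 'y')]
```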
    【Solution 6】:

    I found the answer provided by TomDobbs quite useful, however it didn't work as intended. It has two problems:

    • It misses the last pair of variables in each of the correlation matrix's rows/columns.
    • It fails to remove one of each pair of collinear variables from the returned dataframe.

    My revised version below corrects these issues:

    def remove_collinear_features(x, threshold):
        '''
        Objective:
            Remove collinear features in a dataframe with a correlation coefficient
            greater than the threshold. Removing collinear features can help a model 
            to generalize and improves the interpretability of the model.
    
        Inputs: 
            x: features dataframe
            threshold: features with correlations greater than this value are removed
    
        Output: 
            dataframe that contains only the non-highly-collinear features
        '''
    
        # Calculate the correlation matrix
        corr_matrix = x.corr()
        iters = range(len(corr_matrix.columns) - 1)
        drop_cols = []
    
        # Iterate through the correlation matrix and compare correlations
        for i in iters:
            for j in range(i+1):
                item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
                col = item.columns
                row = item.index
                val = abs(item.values)
    
                # If correlation exceeds the threshold
                if val >= threshold:
                    # Print the correlated features and the correlation value
                    print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                    drop_cols.append(col.values[0])
    
        # Drop one of each pair of correlated columns
        drops = set(drop_cols)
        x = x.drop(columns=drops)
    
        return x
    

    【Discussion】:

    • I really like it! I've used it for a model I'm building, and it is very easy to understand - thanks a lot.
    【Solution 7】:

    First, I'd suggest using something like PCA as a dimensionality reduction method, but if you have to roll your own then your question is insufficiently constrained. Where two columns are correlated, which one do you want to remove? What if column A is correlated with column B, while column B is correlated with column C, but not column A?

    You can get a pairwise matrix of correlations by calling DataFrame.corr() (docs), which may help you develop your algorithm, but eventually you need to convert that into a list of columns to keep.
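    One possible resolution of the ambiguity this answer raises is a greedy scan: keep the first column encountered and drop any later column that correlates with something already kept. A sketch of my own (the helper `greedy_keep` and its threshold are illustrative, not from the thread):

```python
import pandas as pd

def greedy_keep(df, threshold=0.8):
    """Keep a column only if its absolute correlation with every
    already-kept column stays below the threshold."""
    corr = df.corr().abs()
    keep = []
    for col in corr.columns:
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})
print(greedy_keep(df, threshold=0.95))  # → ['a', 'c']
```

    Note that the result depends on column order: scanning right-to-left would keep a different set, which is exactly the under-constraint the answer points out.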

    【Discussion】:

    • While I wholly agree with your reasoning, this does not really answer the question. PCA is a more advanced concept for dimension reduction. But note that using correlations does work, and the question is reasonable (but definitely lacking research effort IMO).
    • @Jamie Bull Thank you for your kind reply. Before going for advanced techniques like dimensionality reduction (e.g. PCA) or feature selection methods (e.g. tree-based or SVM-based feature elimination), it is always suggested to remove useless features with the help of basic techniques (like variance or correlation calculation), which I learned with the help of various published works. And as per the second part of your comment, "correlations by calling DataFrame.corr()" would be helpful for my case.
    • @jax, "it is always suggested to remove useless features with the help of basic techniques". This is not true. There are various methods which do not require such a preprocessing step.
    • @cel Ok, actually I was following some published work, and they suggested these preprocessing steps. Can you please suggest any one such method which does not bother about preprocessing steps? Thanks.
    • @JamieBull Thanks for your reply. I had already been there (the web link you suggested) before posting this. But if you go through the questions carefully, that post covers only half of the answer to this question. I have read a lot already, and hopefully soon I will post an answer of my own. Thanks a lot for your support and interest.
    【Solution 8】:

    I took the liberty to modify TomDobbs' answer. The bug reported in the comments is removed now. Also, the new function filters out negative correlations, too.

    def corr_df(x, corr_val):
        '''
        Obj: Drops features that are strongly correlated to other features.
              This lowers model complexity, and aids in generalizing the model.
        Inputs:
              df: features df (x)
              corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
        Output: df that only includes uncorrelated features
        '''
    
        # Creates Correlation Matrix and Instantiates
        corr_matrix = x.corr()
        iters = range(len(corr_matrix.columns) - 1)
        drop_cols = []
    
        # Iterates through Correlation Matrix Table to find correlated columns
        for i in iters:
            for j in range(i):
                item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
                col = item.columns
                row = item.index
                val = item.values
                if abs(val) >= corr_val:
                    # Prints the correlated feature set and the corr val
                    print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                    drop_cols.append(i)
    
        drops = sorted(set(drop_cols))[::-1]
    
        # Drops the correlated columns
        for i in drops:
            col = x.iloc[:, (i+1):(i+2)].columns.values
            x = x.drop(col, axis=1)
        return x
    

    【Discussion】:

    • The loops here skip the first two columns of the corr_matrix, so the correlation between col1 and col2 is not considered; after that it looks ok
    • @Ryan How did you fix that?
    • @poPYtheSailor Please see my posted solution
    【Solution 9】:

    Plug your features dataframe into this function and just set your correlation threshold. It will auto-drop columns, but it will also give you a diagnostic of the columns it drops if you want to do it manually.

    def corr_df(x, corr_val):
        '''
        Obj: Drops features that are strongly correlated to other features.
              This lowers model complexity, and aids in generalizing the model.
        Inputs:
              df: features df (x)
              corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
        Output: df that only includes uncorrelated features
        '''
    
        # Creates Correlation Matrix and Instantiates
        corr_matrix = x.corr()
        iters = range(len(corr_matrix.columns) - 1)
        drop_cols = []
    
        # Iterates through Correlation Matrix Table to find correlated columns
        for i in iters:
            for j in range(i):
                item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
                col = item.columns
                row = item.index
                val = item.values
                if val >= corr_val:
                    # Prints the correlated feature set and the corr val
                    print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                    drop_cols.append(i)
    
        drops = sorted(set(drop_cols))[::-1]
    
        # Drops the correlated columns
        for i in drops:
            col = x.iloc[:, (i+1):(i+2)].columns.values
            df = x.drop(col, axis=1)
    
        return df
    

    【Discussion】:

    • This doesn't seem to work for me. The correlations are found, and the pairs that match the threshold (i.e. have a higher correlation) are printed. But the resulting dataframe is only missing one (the first) variable with high correlation.
    【Solution 10】:

    First of all, thanks to TomDobbs and Synergix for their code. Below I am sharing my modified version, with some additions:

    1. Between two correlated variables, this function drops the one with the least correlation with the target variable
    2. Added some useful logs (set verbose to True for log printing)

    def remove_collinear_features(df_model, target_var, threshold, verbose):
        '''
        Objective:
            Remove collinear features in a dataframe with a correlation coefficient
            greater than the threshold and which have the least correlation with the target (dependent) variable. Removing collinear features can help a model 
            to generalize and improves the interpretability of the model.
    
        Inputs: 
            df_model: features dataframe
            target_var: target (dependent) variable
            threshold: features with correlations greater than this value are removed
            verbose: set to "True" for the log printing
    
        Output: 
            dataframe that contains only the non-highly-collinear features
        '''
    
        # Calculate the correlation matrix
        corr_matrix = df_model.drop(target_var, axis=1).corr()
        iters = range(len(corr_matrix.columns) - 1)
        drop_cols = []
        dropped_feature = ""
    
        # Iterate through the correlation matrix and compare correlations
        for i in iters:
            for j in range(i+1): 
                item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
                col = item.columns
                row = item.index
                val = abs(item.values)
    
                # If correlation exceeds the threshold
                if val >= threshold:
                    # Print the correlated features and the correlation value
                    if verbose:
                        print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                    col_value_corr = df_model[col.values[0]].corr(df_model[target_var])
                    row_value_corr = df_model[row.values[0]].corr(df_model[target_var])
                    if verbose:
                        print("{}: {}".format(col.values[0], np.round(col_value_corr, 3)))
                        print("{}: {}".format(row.values[0], np.round(row_value_corr, 3)))
                    if col_value_corr < row_value_corr:
                        drop_cols.append(col.values[0])
                        dropped_feature = "dropped: " + col.values[0]
                    else:
                        drop_cols.append(row.values[0])
                        dropped_feature = "dropped: " + row.values[0]
                    if verbose:
                        print(dropped_feature)
                        print("-----------------------------------------------------------------------------")
    
        # Drop one of each pair of correlated columns
        drops = set(drop_cols)
        df_model = df_model.drop(columns=drops)
    
        print("dropped columns: ")
        print(list(drops))
        print("-----------------------------------------------------------------------------")
        print("used columns: ")
        print(df_model.columns.tolist())
    
        return df_model
    

    【Discussion】:

    • If we add an abs() function while calculating the correlation value between target and feature, we will not see negative correlation values. This is important because, with negative correlations, the code drops the one with the smaller signed value even though its (negative) correlation with the target may actually be the stronger one. /// col_corr = abs(df_model[col.values[0]].corr(df_model[target_var]))
    【Solution 11】:

    If you run out of memory due to pandas' .corr(), you may find the following solution useful:

        import numpy as np 
        from numba import jit
        
        @jit(nopython=True)
        def corr_filter(X, threshold):
            n = X.shape[1]
            columns = np.ones((n,))
            for i in range(n-1):
                for j in range(i+1, n):
                    if columns[j] == 1:
                        correlation = np.abs(np.corrcoef(X[:,i], X[:,j])[0,1])
                        if correlation >= threshold:
                            columns[j] = 0
            return columns
        
        columns = corr_filter(df.values, 0.7).astype(bool) 
        selected_columns = df.columns[columns]
    

    【Discussion】:

    【Solution 12】:

    A small revision to the solution posted by user3025698 that resolves an issue where the correlation between the first two columns was not captured, plus some data type checking.

    def filter_df_corr(inp_data, corr_val):
        '''
        Returns an array or dataframe (based on type(inp_data) adjusted to drop \
            columns with high correlation to one another. Takes second arg corr_val
            that defines the cutoff
    
        ----------
        inp_data : np.array, pd.DataFrame
            Values to consider
        corr_val : float
            Value [0, 1] on which to base the correlation cutoff
        '''
        # Creates Correlation Matrix
        if isinstance(inp_data, np.ndarray):
            inp_data = pd.DataFrame(data=inp_data)
            array_flag = True
        else:
            array_flag = False
        corr_matrix = inp_data.corr()
    
        # Iterates through Correlation Matrix Table to find correlated columns
        drop_cols = []
        n_cols = len(corr_matrix.columns)
    
        for i in range(n_cols):
            for k in range(i+1, n_cols):
                val = corr_matrix.iloc[k, i]
                col = corr_matrix.columns[i]
                row = corr_matrix.index[k]
                if abs(val) >= corr_val:
                    # Prints the correlated feature set and the corr val
                    print(col, "|", row, "|", round(val, 2))
                    drop_cols.append(col)
    
        # Drops the correlated columns
        drop_cols = set(drop_cols)
        inp_data = inp_data.drop(columns=drop_cols)
        # Return same type as inp
        if array_flag:
            return inp_data.values
        else:
            return inp_data
    

    【Discussion】:

      【Solution 13】:

      The question here refers to a HUGE dataset. However, all of the answers I see are dealing with dataframes. I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.

      I also try to minimize calculations using the following strategy:

      • Process each column
      • Start at the current column + 1 and calculate correlations moving to the right.
      • For any abs(correlation) >= threshold, mark the current column for removal and calculate no further correlations.
      • Perform these steps for each column in the dataset except the last.

      This might be further sped up by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in python to implement this tonight.

      Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.

      Check each column with this function:

      from scipy.stats import pearsonr

      def get_corr_row(idx_num, sp_mat, thresh):
          # slice the column at idx_num
          cols = sp_mat.shape[1]
          x = sp_mat[:,idx_num].toarray().ravel()
          start = idx_num + 1
          
          # Now slice each column to the right of idx_num   
          for i in range(start, cols):
              y = sp_mat[:,i].toarray().ravel()
              # Check the pearson correlation
              corr, pVal = pearsonr(x,y)
              # Pearson ranges from -1 to 1.
              # We check both positive and negative correlations >= thresh using abs(corr)
              if abs(corr) >= thresh:
                  # stop checking after finding the 1st correlation >= thresh;
                  # returning False marks the column at idx_num for removal in the mask
                  return False
          return True
          
      

      Run the column-level correlation checks in parallel:

      from joblib import Parallel, delayed  
      import multiprocessing
      
      
      def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
          
          # we must make sure the matrix is in csc format 
          # before we start doing all these column slices!  
          sp_mat = sp_mat.tocsc()
          cols = sp_mat.shape[1]
          
          if n_jobs == -1:
              # Process the work on all available CPU cores
              num_cores = multiprocessing.cpu_count()
          else:
              # Process the work on the specified number of CPU cores
              num_cores = n_jobs
      
          # Return a mask of all columns to keep by calling get_corr_row() 
          # once for each column in the matrix     
          return Parallel(n_jobs=num_cores, verbose=5)(delayed(get_corr_row)(i, sp_mat, thresh)for i in range(cols))
      

      General usage:

      #Get the mask using your sparse matrix and threshold.
      corr_mask = Get_Corr_Mask(X_t_fpr, 0.95) 
      
      # Remove features that are >= 95% correlated
      X_t_fpr_corr = X_t_fpr[:,corr_mask]
      

      【Discussion】:

        【Solution 14】:

        I know there are already a lot of answers, but one way I found very simple and short is the following:

        
        # Get correlation matrix 
        corr = X.corr()
        
        # Create a mask for values above 90%,
        # but also below 100%, since every variable is fully correlated with itself
        mask = (X.corr() > 0.9) & (X.corr() < 1.0)
        high_corr = corr[mask]
        
        # Create a new column mask using any() and ~
        col_to_filter_out = ~high_corr[mask].any()
        
        # Apply new mask
        X_clean = X[high_corr.columns[col_to_filter_out]]
        
        # Visualize cleaned dataset
        X_clean
        

        【Discussion】:

          【Solution 15】:

          This is the approach I used at my job last month. Perhaps it is not the best or the quickest way, but it works fine. Here, df is my original Pandas dataframe:

          dropvars = []
          threshold = 0.95
          df_corr = df.corr().stack().reset_index().rename(columns={'level_0': 'Var 1', 'level_1': 'Var 2', 0: 'Corr'})
          df_corr = df_corr[(df_corr['Corr'].abs() >= threshold) & (df_corr['Var 1'] != df_corr['Var 2'])]
          while len(df_corr) > 0:
              var = df_corr['Var 1'].iloc[0]
              df_corr = df_corr[((df_corr['Var 1'] != var) & (df_corr['Var 2'] != var))]
              dropvars.append(var)
          df.drop(columns=dropvars, inplace=True)
          

          My idea is as follows: first, I create a dataframe with the columns Var 1, Var 2 and Corr, where I keep only those pairs of variables whose correlation is higher than or equal to my threshold (in absolute value). Then, I iteratively choose the first variable (the Var 1 value) in this correlations dataframe, add it to the dropvars list, and remove all rows of the correlations dataframe where it appears, until my correlations dataframe is empty. In the end, I drop the columns in my dropvars list from my original dataframe.

          【Discussion】:

            【Solution 16】:

            I had a similar problem today and came across this post. This is what I ended up with.

            def uncorrelated_features(df, threshold=0.7):
                """
                Returns a subset of df columns with Pearson correlations
                below threshold.
                """
            
                corr = df.corr().abs()
                keep = []
                for i in range(len(corr.iloc[:,0])):
                    above = corr.iloc[:i,i]
                    if len(keep) > 0: above = above[keep]
                    if len(above[above < threshold]) == len(above):
                        keep.append(corr.columns.values[i])
            
                return df[keep]
            

            【Discussion】:

              【Solution 17】:

              I wrote my own way, without any for loop, to delete high-correlation columns from a pandas dataframe:

              # get the correlation matrix of the data
              coVar = df.corr() # or df.corr().abs()
              threshold = 0.5
              """
              1. .where(coVar != 1.0): set NaN on the diagonal, where column and index match
              2. .where(coVar >= threshold): set NaN wherever the value is below the threshold
              3. .fillna(0): fill the NaNs with 0
              4. .sum(): reduce the frame to a Series, summing only the values above the threshold
              5. > 0: convert the Series to Boolean
              """
              
              coVarCols = coVar.where(coVar != 1.0).where(coVar >=threshold).fillna(0).sum() > 0
              
              # Invert the mask, because we want to delete the columns whose correlation exceeds the threshold
              coVarCols = ~coVarCols
              
              # keep only the remaining columns
              df[coVarCols[coVarCols].index]
              

              I hope this helps — using pandas' own functions instead of for loops can help improve speed on big datasets.

              【Discussion】:

                【Solution 18】:
                correlatedColumns = []
                corr = df.corr()
                indices = corr.index
                columns = corr.columns
                posthreshold = 0.7
                negthreshold = -0.7
                
                for c in columns:
                    for r in indices:
                        if c != r and (corr[c][r] > posthreshold or corr[c][r] < negthreshold):
                            correlatedColumns.append({"column" : c , "row" : r , "val" :corr[c][r] })
                            
                
                print(correlatedColumns)
                

                【Discussion】:

                  【Solution 19】:

                  In my code I needed to remove columns with low correlation with the dependent variable, and I got this code:

                  to_drop = pd.DataFrame(to_drop).fillna(True)
                  to_drop = list(to_drop[to_drop['SalePrice'] <.4 ].index)
                  df_h1.drop(to_drop,axis=1)
                  

                  df_h1 is my dataframe and SalePrice is the dependent variable... I think changing the value may suit all other problems.

                  【Discussion】:

                    【Solution 20】:

                    The snippet below recursively drops the most correlated features.

                    def get_corr_feature(df):
                        corr_matrix = df.corr().abs()
                        # Select upper triangle of correlation matrix
                        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool_))
                        upper['score']= upper.max(axis=1)
                        upper.sort_values(by=['score'],ascending=False)
                        #Find the most correlated feature and send return it for drop
                        column_name=upper.sort_values(by=['score'],ascending=False).index[0]
                        max_score=upper.loc[column_name,'score']
                        return column_name, max_score
                    
                    max_score=1
                    while max_score>0.5:
                        column_name, max_score=get_corr_feature(df)
                        df.drop(column_name,axis=1,inplace=True)
                    

                    【Discussion】:

                      【Solution 21】:

                      I wrote a notebook that uses partial correlations

                      https://gist.github.com/thistleknot/ce1fc38ea9fcb1a8dafcfe6e0d8af475

                      The gist of it (pun intended)

                      for train_index, test_index in kfold.split(all_data):
                          #print(iteration)
                          max_pvalue = 1
                          
                          subset = all_data.iloc[train_index].loc[:, ~all_data.columns.isin([exclude])]
                          
                          #skip y and states
                          set_ = subset.loc[:, ~subset.columns.isin([target])].columns.tolist()
                          
                          n=len(subset)
                          
                          while(max_pvalue>=.05):
                      
                              dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
                              p_values = pd.DataFrame(2*dist.cdf(-abs(subset.pcorr()[target]))).T
                              p_values.columns = list(subset.columns)
                              
                              max_pname = p_values.idxmax(axis=1)[0]
                              max_pvalue = p_values[max_pname].values[0]
                              
                              if (max_pvalue > .05):
                      
                                  set_.remove(max_pname)
                                  temp = [target]
                                  temp.extend(set_)
                                  subset = subset[temp]
                          
                          winners = p_values.loc[:, ~p_values.columns.isin([target])].columns.tolist()
                          sig_table = (sig_table + np.where(all_data.columns.isin(winners),1,0)).copy()
                          
                          signs_table[all_data.columns.get_indexer(winners)]+=np.where(subset.pcorr()[target][winners]<0,-1,1)
                      
                      
                      significance = pd.DataFrame(sig_table).T
                      significance.columns = list(all_data.columns)
                      display(significance)
                      
                      sign = pd.DataFrame(signs_table).T
                      sign.columns = list(all_data.columns)
                      display(sign)
                      
                      purity = abs((sign/num_folds)*(sign/significance)).T.replace([np.inf, -np.inf, np.NaN], 0)
                      display(purity.T)
                      

                      【Discussion】:

                        【Solution 22】:

                        I believe this has to be done in an iterative way:

                        uncorrelated_features = features.copy()
                        
                        # Loop until there's nothing to drop
                        while True:
                            # Calculating the correlation matrix for the remaining list of features
                            cor = uncorrelated_features.corr().abs()
                        
                            # Generating a square matrix with all 1s except for the main axis
                            zero_main = np.triu(np.ones(cor.shape), k=1) + \
                                np.tril(np.ones(cor.shape), k=-1)
                        
                            # Using the zero_main matrix to filter out the main axis of the correlation matrix
                            except_main = cor.where(zero_main.astype(bool))
                        
                            # Calculating some metrics for each column, including the max correlation,
                            # mean correlation and the name of the column
                            metrics = [(except_main[column].max(), except_main[column].mean(), column) for column in except_main.columns]
                        
                            # Sort the list to find the most suitable candidate to drop at index 0
                            metrics.sort(key=lambda x: (x[0], x[1]), reverse=True)
                        
                            # Check and see if there's anything to drop from the list of features
                            if metrics[0][0] > 0.5:
                                uncorrelated_features.drop(metrics[0][2], axis=1, inplace=True)
                            else:
                                break
                        

                        It's worth mentioning that you may want to customize the way I sorted the metrics list and/or how I detect whether a column should be dropped.

                        【Discussion】:

                          【Solution 23】:

                          If you want a breakdown of the correlated columns, you can use this function to look at them, see what you are dropping, and adjust your threshold:

                          def corr_cols(df,thresh):
                              # Create correlation matrix
                              corr_matrix = df.corr().abs()
                              # Select upper triangle of correlation matrix
                              upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool_))
                          
                              dic = {'Feature_1':[],'Featur_2':[],'val':[]}
                              for col in upper.columns:
                                  corl = list(filter(lambda x: x >= thresh, upper[col] ))
                                  #print(corl)
                                  if len(corl) > 0:
                                      inds = [round(x,4) for x in corl]
                                      for ind in inds:
                                          #print(col)
                                          #print(ind)
                                          col2 = upper[col].index[list(upper[col].apply(lambda x: round(x,4))).index(ind)]
                                          #print(col2)
                                          dic['Feature_1'].append(col)
                                          dic['Featur_2'].append(col2)
                                          dic['val'].append(ind) 
                              return pd.DataFrame(dic).sort_values(by="val", ascending=False)
                          

                          Then remove them by calling the df:

                              corr = corr_cols(star,0.5)
                              df.drop(columns = corr.iloc[:,0].unique())
                          

                          【Discussion】:
