Q: Converting the R function caret::findCorrelation to Python 3 with pandas using vectorisation
Stack Overflow user

Asked 2017-01-20 02:27:11

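The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to the variables which, if removed, would reduce pair-wise correlations among the remaining variables. Here is the R code for this function: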

function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)
{
  if (names & is.null(colnames(x))) 
    stop("'x' must have column names when `names = TRUE`")
  out <- if (exact) 
    findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
  else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
  out
  if (names) 
    out <- colnames(x)[out]
  out
}

and the function findCorrelation_fast, which is the one I am interested in (optional arguments removed):

findCorrelation_fast <- function(x, cutoff = .90)
{
 if(any(!complete.cases(x)))
 stop("The correlation matrix has some missing values.")
 averageCorr <- colMeans(abs(x))
 averageCorr <- as.numeric(as.factor(averageCorr))
 x[lower.tri(x, diag = TRUE)] <- NA
 combsAboveCutoff <- which(abs(x) > cutoff)

 colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
 rowsToCheck <- combsAboveCutoff %% nrow(x)

 colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
 rowsToDiscard <- !colsToDiscard

 deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
 deletecol <- unique(deletecol)
 deletecol
}
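The index arithmetic in findCorrelation_fast relies on R's `which()` returning 1-based linear indices in column-major order. The following is a minimal NumPy sketch of that step, using a made-up 4x4 matrix (the values are hypothetical and only for illustration). It reproduces the `ceiling` and `%%` arithmetic and compares it with asking NumPy for the (row, col) pairs directly. Because only the strict upper triangle survives the masking, the row index is always smaller than `nrow(x)`, so the modulo never wraps to 0.

```python
import numpy as np

# Hypothetical 4x4 correlation matrix, used only for illustration.
x = np.array([
    [1.00, 0.95, 0.10, 0.20],
    [0.95, 1.00, 0.30, 0.92],
    [0.10, 0.30, 1.00, 0.40],
    [0.20, 0.92, 0.40, 1.00],
])
n = x.shape[0]
cutoff = 0.9

# Keep only the strict upper triangle, mirroring x[lower.tri(x, diag = TRUE)] <- NA.
upper = np.triu(np.abs(x), k=1)

# Emulate R's which(): 1-based linear indices in column-major order.
linear = np.flatnonzero(upper.flatten(order="F") > cutoff) + 1

# The same arithmetic as the R code, still on 1-based indices.
cols_r = np.ceil(linear / n).astype(int)
rows_r = linear % n

# NumPy can return the 0-based (row, col) pairs directly.
rows_np, cols_np = np.nonzero(upper > cutoff)

print(list(zip(rows_r.tolist(), cols_r.tolist())))    # 1-based pairs
print(list(zip(rows_np.tolist(), cols_np.tolist())))  # 0-based pairs
```

In a Python port, `np.nonzero` on the masked matrix makes the linear-index arithmetic unnecessary.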

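With the help of pandas, I am writing a function that mimics the intent of this function in Python 3. My implementation contains a nested for loop, which, as far as I know, is not the most efficient way to achieve the desired result. The original R function accomplishes the task without any loops.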

My two questions are:

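1. Based on the implementation below, is there a Pythonic way to replace the nested for loop with a vectorised implementation?
2. Related to (1), the R function findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construct seems very foreign to me, and it appears to be key to the R implementation. Can anyone explain what this line is doing? My intuition is that it is very clever and exploits some behaviour unique to R.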
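On the second question, one reading, based on how R factors generally behave rather than on anything stated in the caret source, is that `as.factor` builds its levels from the sorted unique values, and `as.numeric` then returns each element's level code. For a numeric vector that amounts to a dense rank: the smallest value maps to 1, ties share a rank, and there are no gaps. A rough pandas equivalent, with hypothetical input values, might look like this:

```python
import pandas as pd

# Hypothetical mean absolute correlations, one per column.
average_corr = pd.Series([0.42, 0.17, 0.42, 0.88, 0.05])

# Dense rank: smallest value -> 1, tied values share a rank, no gaps.
ranks = average_corr.rank(method="dense").astype(int)
print(ranks.tolist())  # [3, 2, 3, 4, 1]
```

Since the ranks are only compared with `>`, comparing the raw means directly should order the pairs the same way, apart from values so close that R's factor conversion treats them as the same level.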

My Python implementation and an example of its use:

import numpy as np
import pandas as pd

# calculate pair-wise correlations

def findCorrelated(corrmat, cutoff = 0.8):
    """Search a correlation matrix and identify variables that, if removed,
    would reduce pair-wise correlations.

    args:
        corrmat: a correlation matrix (square pandas DataFrame)
        cutoff: pairwise absolute correlation cutoff
    returns:
        list of positions of the variables to be removed
    """
    if len(corrmat) != len(corrmat.columns):
        raise ValueError('Correlation matrix is not square')
    averageCorr = corrmat.abs().mean(axis = 1)

    # set lower triangle and diagonal of correlation matrix to NaN
    # (k=1 keeps only the strict upper triangle)
    corrmat = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))

    # where a pairwise absolute correlation exceeds the cutoff, drop whichever of the pair has the larger mean abs. corr
    to_delete = list()
    for col in range(0, len(corrmat.columns)):
        for row in range(0, len(corrmat)):
            if abs(corrmat.iloc[row, col]) > cutoff:
                if averageCorr.iloc[row] > averageCorr.iloc[col]:
                    to_delete.append(row)
                else:
                    to_delete.append(col)

    to_delete = list(set(to_delete))

    return to_delete

# generate some data
df = pd.DataFrame(np.random.randn(50,25))

# demonstrate usage of function    
removeCols = findCorrelated(df.corr(), cutoff = 0.01) #set v.low cutoff as data is uncorrelated
print('Columns to be removed:')
print(removeCols)
uncorrelated = df.drop(columns = df.columns[removeCols])
print('Uncorrelated variables:')
print(uncorrelated)
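For comparison, here is a sketch of how the nested loop might be replaced with NumPy operations. It is not caret's implementation. The name `find_correlated_vectorised` is hypothetical, the function compares raw mean absolute correlations instead of their ranks, and it returns sorted column positions, whereas the R code returns them in order of first appearance. Ties are resolved by discarding the row member, as in findCorrelation_fast.

```python
import numpy as np
import pandas as pd


def find_correlated_vectorised(corrmat, cutoff=0.8):
    """Loop-free sketch of findCorrelation_fast; returns column positions."""
    if corrmat.shape[0] != corrmat.shape[1]:
        raise ValueError("Correlation matrix is not square")
    abs_corr = np.abs(corrmat.to_numpy(dtype=float))
    if np.isnan(abs_corr).any():
        raise ValueError("The correlation matrix has some missing values.")

    # Mean absolute correlation per column (colMeans(abs(x)) in the R code).
    average_corr = abs_corr.mean(axis=0)

    # Look only at the strict upper triangle so each pair is examined once.
    upper = np.triu(np.ones_like(abs_corr, dtype=bool), k=1)
    rows, cols = np.nonzero(upper & (abs_corr > cutoff))

    # For each offending pair, discard the member with the larger mean
    # absolute correlation; on a tie the row member is discarded.
    drop_col = average_corr[cols] > average_corr[rows]
    to_delete = np.where(drop_col, cols, rows)
    return np.unique(to_delete).tolist()


# Usage with synthetic data: column 5 is built as a near-duplicate of column 0.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 5)))
df[5] = df[0] * 0.99 + rng.standard_normal(50) * 0.01
remove_cols = find_correlated_vectorised(df.corr(), cutoff=0.9)
uncorrelated = df.drop(columns=df.columns[remove_cols])
print(remove_cols)
print(uncorrelated.shape)
```

The returned positions are passed through `df.columns[...]` before dropping, so the sketch does not depend on the column labels happening to be integers.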