Computing the variance inflation factor of explanatory variables: definition of the variance inflation factor

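The variance inflation factor (VIF) is a number that expresses the degree of multicollinearity among the observed values of the independent variables. In linear regression analysis, the variance of the estimator of the regression coefficient $\beta_j$ is $\sigma^2 C_{jj}$, where $C_{jj} = (1 - R_j^2)^{-1}$ is called the variance inflation factor of $\beta_j$ and $R_j^2$ is the square of the multiple correlation coefficient of $x_j$ with respect to the remaining $p-1$ independent variables. $C_{jj} \ge 1$ reflects the degree of multicollinearity present: the larger $C_{jj}$, the more severe the multicollinearity. [The above is quoted from Baidu Baike.] There is plenty of further material on multicollinearity for readers who want to look it up.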
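To make the definition concrete, here is a minimal NumPy sketch that computes $C_{jj} = (1 - R_j^2)^{-1}$ by regressing each column on the others plus an intercept. The function name `vif_from_definition` and the synthetic data are made up for illustration and are not part of the original article. As a cross-check, the sketch also takes the diagonal of the inverse correlation matrix, which should give the same numbers.

```python
import numpy as np

def vif_from_definition(X):
    """Return 1 / (1 - R_j^2) for every column j of X.

    R_j^2 is taken from an ordinary least-squares fit of column j
    on all remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other regressors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic data (hypothetical): x2 is almost a copy of x0, x1 is unrelated.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.1 * rng.normal(size=500)
X_demo = np.column_stack([x0, x1, x2])

vifs = vif_from_definition(X_demo)
# Cross-check: diagonal of the inverse of the sample correlation matrix.
vifs_corr = np.diag(np.linalg.inv(np.corrcoef(X_demo, rowvar=False)))
print(vifs)
print(vifs_corr)
```

In this sketch the two nearly collinear columns should come out with large values, while the unrelated column stays close to 1.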

This article focuses on a pitfall encountered when calling the variance_inflation_factor function in Python to compute VIF.

Calling variance_inflation_factor directly, as in the function defined below, produces incorrect VIF values:

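import numpy as np
import pandas as pd

def checkVIF(df):
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    name = df.columns
    x = np.asarray(df, dtype=float)  # plain 2-D array of the feature values
    # one VIF per column, computed on the raw columns only
    VIF_list = [variance_inflation_factor(x, i) for i in range(x.shape[1])]
    VIF = pd.DataFrame({'feature': name, 'VIF': VIF_list})
    max_VIF = max(VIF_list)
    print(max_VIF)
    return VIF

The dataset used above had already been preprocessed and WOE-transformed.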

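As can be seen, the VIF values computed this way are all abnormally large. (A VIF above 10 is taken to indicate severe multicollinearity, and a VIF above 5 to indicate multicollinearity.)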

Next, define a new function that computes VIF directly from its definition, as follows.

def vif(df, col_i):
    from statsmodels.formula.api import ols
    cols = list(df.columns)
    cols.remove(col_i)  # regress col_i on every other column
    cols_noti = cols
    formula = col_i + '~' + '+'.join(cols_noti)
    r2 = ols(formula, df).fit().rsquared
    return 1. / (1. - r2)

for i in X_rep.columns:
    print(i, '\t', vif(df=X_rep, col_i=i))

Output (the data used is the same as above):

These output values are the correct VIF values for the variables in this dataset.

So why is the result from variance_inflation_factor wrong? Here is the source code of variance_inflation_factor:

import numpy as np
from statsmodels.regression.linear_model import OLS

def variance_inflation_factor(exog, exog_idx):
    """variance inflation factor, VIF, for one exogenous variable

    The variance inflation factor is a measure for the increase of the
    variance of the parameter estimates if an additional variable, given by
    exog_idx is added to the linear regression. It is a measure for
    multicollinearity of the design matrix, exog.

    One recommendation is that if VIF is greater than 5, then the explanatory
    variable given by exog_idx is highly collinear with the other explanatory
    variables, and the parameter estimates will have large standard errors
    because of this.

    Parameters
    ----------
    exog : ndarray, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression
    exog_idx : int
        index of the exogenous variable in the columns of exog

    Returns
    -------
    vif : float
        variance inflation factor

    Notes
    -----
    This function does not save the auxiliary regression.

    See Also
    --------
    xxx : class for regression diagnostics  TODO: doesn't exist yet

    References
    ----------
    http://en.wikipedia...
    """
    k_vars = exog.shape[1]
    x_i = exog[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    vif = 1. / (1. - r_squared_i)
    return vif

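The reason is that the two OLS interfaces in Python behave differently: the OLS class that variance_inflation_factor uses does not add an intercept by default, whereas the formula-based ols adds one automatically.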
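The effect of the missing intercept can be reproduced with plain NumPy. The sketch below is an illustration, not the statsmodels code: the helper `aux_r2` and the synthetic columns are hypothetical. It assumes, as I understand statsmodels to behave, that a fit without a constant reports an uncentered $R^2$ (total sum of squares taken around zero rather than around the mean). Two independent columns with means far from zero then appear highly "collinear".

```python
import numpy as np

def aux_r2(y, A, centered):
    """R^2 of the least-squares fit of y on the columns of A.

    centered=True  measures total variation around the mean of y;
    centered=False measures it around zero (the no-intercept convention).
    """
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    tss = np.sum((y - y.mean()) ** 2) if centered else np.sum(y ** 2)
    return 1.0 - (resid @ resid) / tss

# Two independent columns whose means are far from zero (made-up data).
rng = np.random.default_rng(1)
n = 1000
a = 5.0 + rng.normal(size=n)
b = 5.0 + rng.normal(size=n)

# Auxiliary regression without an intercept: a ~ b
r2_no_const = aux_r2(a, b[:, None], centered=False)
vif_no_const = 1.0 / (1.0 - r2_no_const)

# Auxiliary regression with an intercept: a ~ 1 + b
A_const = np.column_stack([np.ones(n), b])
r2_const = aux_r2(a, A_const, centered=True)
vif_const = 1.0 / (1.0 - r2_const)

print("VIF without intercept:", vif_no_const)
print("VIF with intercept:   ", vif_const)
```

Without the intercept, the shared nonzero mean is picked up as if it were correlation, so the VIF is inflated even though the columns are independent; with the intercept the value falls back to roughly 1.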

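The fix is to add one more column to the data frame and fill it with a constant (the value 1). This column acts as the intercept term of the regression. Once this is done, the computed values are correct, as shown below.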

def checkVIF_new(df):
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    df['c'] = 1