Regression with OLS

# Fit a multiple linear regression with OLS
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# b_data is assumed to be a pandas DataFrame with the response in the last column
X = b_data.values.copy()
X_train, X_valid, y_train, y_valid = train_test_split(
    X[:, :-1], X[:, -1], train_size=0.80)
# OLS does not add an intercept on its own, so add the constant column explicitly
result = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(result.summary())

https://www.datarobot.com/blog/ordinary-least-squares-in-python/

Interpreting the summary output

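The summary output provides a lot of information about the fit. The meaning of each item is described below.

The left part of the first table gives basic information about the fit: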

| Element | Description |
| --- | --- |
| Dep. Variable | The response (dependent) variable y in the model |
| Model | The model used for the fit |
| Method | How the parameters of the model were calculated |
| No. Observations | The number of observations (samples) |
| DF Residuals | Degrees of freedom of the residuals: number of observations minus number of parameters. With a constant term, DF Residuals = No. Observations - DF Model - 1 |
| DF Model | Number of parameters in the model (not including the constant term, if present) |

  
The right part of the first table shows how good the fit is:

| Element | Description |
| --- | --- |
| R-squared | The coefficient of determination: a statistical measure of how well the regression line approximates the real data points. For a model with a constant term it lies in [0, 1]; the closer to 1, the better the fit |
| Adj. R-squared | R-squared adjusted for the number of observations and the degrees of freedom of the residuals; it is usually smaller than R-squared. Formula: Adj. R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is No. Observations and p is DF Model |
| F-statistic | A measure of how significant the fit is as a whole: the mean squared error of the model divided by the mean squared error of the residuals. Formula: F = (ESS / p) / (RSS / (n - p - 1)), where ESS is the explained sum of squares and RSS is the residual sum of squares |
| Prob (F-statistic) | The probability of getting an F-statistic at least this large under the null hypothesis that the response is unrelated to the predictors |
| Log-likelihood | The log of the likelihood function |
| AIC | The Akaike Information Criterion, a widely used criterion for model selection. It adjusts the log-likelihood based on the number of observations and the complexity of the model. The smaller the AIC, the better the model |
| BIC | The Bayesian Information Criterion. Similar to the AIC, but with a higher penalty for models with more parameters. It complements the AIC by addressing the AIC's inability to give a consistent estimate of the model order |

The second table reports information about the fitted coefficients:

| Element | Description |
| --- | --- |
| coef | The estimated value of the coefficient |
| std err | The basic standard error of the estimate of the coefficient. More sophisticated errors are also available |
| t | The t-statistic (coef divided by std err). This is a measure of how statistically significant the coefficient is |
| P>\|t\| | The p-value for the null hypothesis that the coefficient = 0. If it is less than the significance level, often 0.05, it indicates a statistically significant relationship between the term and the response |
| [95.0% Conf. Interval] | The lower and upper bounds of the 95% confidence interval |

The last table reports statistical tests on the distribution of the residuals:

| Element | Description |
| --- | --- |
| Skewness | A measure of the symmetry of the data about the mean. Normally distributed errors should be symmetrically distributed about the mean (equal amounts above and below the line) |
| Kurtosis | A measure of the shape of the distribution. Compares the amount of data close to the mean with the amount far away from the mean (in the tails) |
| Omnibus | D'Agostino's test. It provides a combined statistical test for the presence of skewness and kurtosis |
| Prob(Omnibus) | The above statistic turned into a probability |
| Jarque-Bera | A different test of the skewness and kurtosis |
| Prob (JB) | The above statistic turned into a probability |
| Durbin-Watson | A test for the presence of autocorrelation (that is, that the errors are not independent). Often important in time-series analysis |
| Cond. No | A diagnostic for multicollinearity (whether, in a fit with multiple parameters, the parameters are related to each other) |