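Link: https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Rank Features By Importance
Feature importance can be estimated by building a model. Some methods, such as decision trees, have a built-in mechanism for reporting feature importance. For other algorithms, importance can be estimated with a ROC curve analysis.
The example below uses the Pima Indians Diabetes dataset and a Learning Vector Quantization (LVQ) model, so that the features can be ranked by their estimated importance. The code is as follows: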
# ensure results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
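The ROC-based ranking mentioned earlier can also be computed without training a model. The following is a minimal sketch, assuming caret's `filterVarImp` is applied directly to the eight predictor columns. The variable names `x`, `y`, `roc_imp` and `ranked` are illustrative and not part of the original post.

```r
# Sketch only: model-free, ROC-based importance for each predictor.
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# Columns 1-8 are the predictors, column 9 is the diabetes outcome
x <- PimaIndiansDiabetes[, 1:8]
y <- PimaIndiansDiabetes[, 9]

# filterVarImp scores each predictor on its own; for a two-class outcome
# the score is based on the area under that predictor's ROC curve
roc_imp <- filterVarImp(x = x, y = y)

# Sort predictors from highest to lowest score (using the first class column)
ranked <- roc_imp[order(roc_imp[, 1], decreasing = TRUE), , drop = FALSE]
print(ranked)
```

The row names of `ranked` give the features in ranked order, which is handy when only the top few are needed.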
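Going a step further, Recursive Feature Elimination (RFE) can be used to select a subset of features. The resulting selection still has to be checked for plausibility against statistical evidence and the practical needs of the problem.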
# ensure the results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g", "o"))
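One caveat: for the correlation analysis section (Remove Redundant Features), I am not sure how the features should be handled after the analysis. It neither ranks the features nor produces a feature subset, so how to use this information is still unclear to me.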
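One possible way to act on the correlation analysis is to drop the columns that caret's `findCorrelation` flags as highly correlated and continue with the remaining ones. The sketch below assumes a cutoff of 0.5, chosen only for illustration. The names `predictors_df` and `reduced` are also illustrative, and this is not necessarily the procedure the linked article intends.

```r
# Sketch only: remove predictors flagged as highly correlated.
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# Correlation matrix of the eight predictor columns
predictors_df <- PimaIndiansDiabetes[, 1:8]
correlationMatrix <- cor(predictors_df)

# Column indices that findCorrelation suggests removing
# (cutoff = 0.5 is an assumption made for this example)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)

# Drop the flagged columns; keep everything if nothing was flagged
if (length(highlyCorrelated) > 0) {
  reduced <- predictors_df[, -highlyCorrelated, drop = FALSE]
} else {
  reduced <- predictors_df
}

# Flagged features, then the features that remain
print(names(predictors_df)[highlyCorrelated])
print(names(reduced))
```

The reduced data frame can then be passed to the importance ranking or RFE steps above in place of the full predictor set.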