
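The scikit-learn library provides tools for building machine learning pipelines. We only need to specify the processing steps, and it builds a composite object that passes the data through the whole pipeline. A pipeline can include preprocessing, feature selection, supervised learning, unsupervised learning, and so on. In this recipe, we will build a pipeline that takes an input feature vector, selects the top k features, and then classifies the result with a random forest classifier.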

How to do it...

Create a new Python file and import the following packages:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

Generate some sample data:

# Generate sample data.
# n_features=20 produces 20-dimensional feature vectors; 20 is also the
# default, so change n_features if you need a different dimensionality.
X, y = make_classification(
    n_informative=4,
    n_features=20,
    n_redundant=0,
    random_state=5
)
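To confirm what the generator returns, it can help to check the array shapes. The following is a minimal sketch that assumes a recent scikit-learn release, where make_classification is imported directly from sklearn.datasets. The 100 samples and the two classes come from the library defaults, not from the recipe, and the names data_shape and class_labels are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as in the recipe; n_samples and n_classes are left at
# their library defaults.
X, y = make_classification(
    n_informative=4,
    n_features=20,
    n_redundant=0,
    random_state=5
)

data_shape = X.shape         # (n_samples, n_features)
class_labels = np.unique(y)  # distinct target labels

print("X shape:", data_shape)
print("Class labels:", class_labels)
```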

Our first step is to select the k best features before the data points are used any further. In this case, we set k to 10:

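# Feature selector: keep the 10 highest-scoring features.
# f_regression is the score function used in this recipe; f_classif is
# the score function scikit-learn provides for classification targets.
selector_k_best = SelectKBest(f_regression, k=10)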

The next step is to classify the data with a random forest classifier:

# Random forest classifier
classifier = RandomForestClassifier(n_estimators=50, max_depth=4)

We are now ready to build the pipeline. The Pipeline constructor lets us build it from the objects we have already defined:

# Build the machine learning pipeline
pipeline_classifier = Pipeline(
    [
        ('selector', selector_k_best),
        ('rf', classifier)
    ]
)

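We can also assign names to the blocks in the pipeline. In the code above, we assigned the name selector to the feature selector and rf to the random forest classifier. You are free to use any other names.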
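The assigned names can be used to look up the individual steps later. The sketch below rebuilds the same two-step pipeline on its own and reads the names back through the steps and named_steps attributes; the variable names step_names and selector are only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

pipeline_classifier = Pipeline(
    [
        ('selector', SelectKBest(f_regression, k=10)),
        ('rf', RandomForestClassifier(n_estimators=50, max_depth=4))
    ]
)

# 'steps' is the list of (name, estimator) pairs passed to Pipeline.
step_names = [name for name, _ in pipeline_classifier.steps]

# 'named_steps' gives access to a step by the name assigned to it.
selector = pipeline_classifier.named_steps['selector']

print("Step names:", step_names)
print("k in the selector:", selector.k)
```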

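We can also update the parameters at any time, using the names assigned in the previous step. For example, to set k to 6 in the feature selector and n_estimators to 25 in the random forest classifier, we can use the following code. Note that the prefixes are the step names given in the previous step, joined to the parameter name with a double underscore: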

# We can set the parameters using the names we assigned
# earlier. For example, if we want to set 'k' to 6 in the
# feature selector and set 'n_estimators' in the Random
# Forest Classifier to 25, we can do it as shown below
pipeline_classifier.set_params(
    selector__k=6,
    rf__n_estimators=25
)
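One way to check that the update took effect is to read the values back with get_params, which uses the same step__parameter naming. This is a small self-contained sketch; the variable name params is only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

pipeline_classifier = Pipeline(
    [
        ('selector', SelectKBest(f_regression, k=10)),
        ('rf', RandomForestClassifier(n_estimators=50, max_depth=4))
    ]
)

# Update both steps through the pipeline, as in the recipe.
pipeline_classifier.set_params(
    selector__k=6,
    rf__n_estimators=25
)

# Read the values back, both through the pipeline and from the steps.
params = pipeline_classifier.get_params()
print("selector__k:", params['selector__k'])
print("rf__n_estimators:", params['rf__n_estimators'])
print("k on the step itself:", pipeline_classifier.named_steps['selector'].k)
```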

Let's go ahead and train the classifier:

# Training the classifier
pipeline_classifier.fit(X, y)

We predict the output for the training data:

# Predict the output
prediction = pipeline_classifier.predict(X)
print("Predictions:", prediction)

We estimate the performance of this classifier:

# Print score
print("Score:", pipeline_classifier.score(X, y))
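The score above is computed on the same data the pipeline was trained on, so it tends to be optimistic. A minimal sketch of scoring on held-out data follows. The 75/25 split, the random_state values, and the variable names are assumptions added for illustration and are not part of the recipe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(
    n_informative=4,
    n_features=20,
    n_redundant=0,
    random_state=5
)

# Hypothetical split: hold out 25% of the points for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5
)

pipeline = Pipeline(
    [
        ('selector', SelectKBest(f_regression, k=6)),
        # random_state is fixed only to make the run repeatable.
        ('rf', RandomForestClassifier(
            n_estimators=25, max_depth=4, random_state=5))
    ]
)

# Both the selector and the classifier are fitted on the training part only.
pipeline.fit(X_train, y_train)

train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print("Training score:", train_score)
print("Held-out score:", test_score)
```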

We can also see which features were selected. Let's print them:

# Print the selected features chosen by the selector
features_status = pipeline_classifier.named_steps['selector'].get_support()
selected_features = [
    index for index, selected in enumerate(features_status) if selected
]

print(
    "Selected features (0-indexed):",
    ', '.join(str(x) for x in selected_features)
)
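As an alternative to looping over the boolean mask, get_support accepts indices=True and returns the selected column indices directly. The sketch below fits a standalone selector so that it runs on its own; the variable names mask and indices are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_classification(
    n_informative=4,
    n_features=20,
    n_redundant=0,
    random_state=5
)

selector = SelectKBest(f_regression, k=6).fit(X, y)

# Boolean mask with one entry per input feature.
mask = selector.get_support()

# Integer indices of the selected features.
indices = selector.get_support(indices=True)

print("Selected features (0-indexed):", ', '.join(str(i) for i in indices))
```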

Running the code prints the predictions, the score, and the indices of the selected features.

How it works...

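The advantage of selecting the k best features is that we can work with lower-dimensional data, which helps reduce computational complexity. The k best features are chosen by univariate feature selection: a univariate statistical test is run on each feature, and the highest-scoring features are extracted from the feature vector. A univariate statistical test is an analysis technique that involves a single variable.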
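The per-feature scores behind this selection can be inspected on a fitted selector through its scores_ attribute. The following sketch, which fits a standalone SelectKBest with the same score function as the recipe, checks that the k highest scores correspond to the selected features. The names scores and top_k are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_classification(
    n_informative=4,
    n_features=20,
    n_redundant=0,
    random_state=5
)

k = 6
selector = SelectKBest(f_regression, k=k).fit(X, y)

# One univariate test score per input feature.
scores = selector.scores_

# Indices of the k largest scores, sorted for easy comparison.
top_k = np.sort(np.argsort(scores)[-k:])

print("Top-k features by score:", top_k)
print("Features kept by the selector:", selector.get_support(indices=True))
```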

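Once these tests have been performed, each feature in the feature vector is assigned a score. Based on these scores, we select the top k features. We do this as a preprocessing step in the classifier pipeline. Once we have extracted the top k features, they form a k-dimensional feature vector, which we use as the input training data for the random forest classifier.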
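The two stages described above can also be run by hand, which makes the k-dimensional intermediate data visible. This is a sketch under stated assumptions: k is set to 6, and random_state is fixed on both classifiers (an addition, not part of the recipe) so that the manual run and the pipeline run can be compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

X, y = make_classification(
    n_informative=4,
    n_features=20,
    n_redundant=0,
    random_state=5
)

k = 6

# Step 1: score every feature and keep the k best columns.
selector = SelectKBest(f_regression, k=k)
X_reduced = selector.fit_transform(X, y)

# Step 2: train the classifier on the k-dimensional vectors.
classifier = RandomForestClassifier(
    n_estimators=25, max_depth=4, random_state=0)
classifier.fit(X_reduced, y)
manual_prediction = classifier.predict(X_reduced)

# The pipeline performs the same two steps inside one object.
pipeline = Pipeline(
    [
        ('selector', SelectKBest(f_regression, k=k)),
        ('rf', RandomForestClassifier(
            n_estimators=25, max_depth=4, random_state=0))
    ]
)
pipeline_prediction = pipeline.fit(X, y).predict(X)

print("Reduced shape:", X_reduced.shape)
print("Same predictions:",
      np.array_equal(manual_prediction, pipeline_prediction))
```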
