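Let's apply all of this knowledge to a real-world problem. We will build an SVM that uses counts of people going into and out of a building to predict whether an event is taking place there. The dataset is available at https://archive.ics.uci.edu/ml/datasets/CalIt2+Building+People+Counts. We will use a slightly modified version of this dataset so that it is easier to analyze. The modified data is in the building_event_binary.txt and building_event_multiclass.txt files that have already been provided to you.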
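Getting ready
Let's understand the data format before we start building the model. Each line in building_event_binary.txt consists of six comma-separated strings. The ordering of these six strings is as follows: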
- Day
- Date
- Time
- The number of people going out of the building
- The number of people coming into the building
- The output indicating whether or not it's an event
The first five strings form the input data, and our task is to predict whether or not an event is taking place in the building.
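To make the row layout concrete, here is a minimal sketch that splits one row into its five input fields and its output label. The sample row, the helper name parse_line, and the dictionary keys are assumptions made for illustration; they are not taken from the dataset or from event.py.

```python
# Hypothetical sample row; the values are made up for illustration only.
sample_line = "Tuesday,07/26/05,12:30:00,21,23,noevent\n"


def parse_line(line):
    """Split one comma-separated row into input features and output label."""
    fields = line.strip().split(',')
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    day, date, time, n_out, n_in, label = fields
    features = {
        'day': day,
        'date': date,
        'time': time,
        'people_out': int(n_out),  # people going out of the building
        'people_in': int(n_in),    # people coming into the building
    }
    return features, label


features, label = parse_line(sample_line)
print(features, label)
```

The length check is there so that a malformed row fails loudly instead of silently shifting fields.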
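Each line in building_event_multiclass.txt also consists of six comma-separated strings. This file is more granular than the previous one because the output is the exact type of event taking place in the building. The ordering of these six strings is as follows: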
- Day
- Date
- Time
- The number of people going out of the building
- The number of people coming into the building
- The output indicating the type of event
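The first five strings form the input data, and our task is to predict what type of event is taking place in the building.
How to do it...
- We will use event.py, which has already been provided to you, for reference. Create a new Python file and add the following lines: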
import numpy as np
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

input_file = 'building_event_binary.txt'
# input_file = 'building_event_multiclass.txt'

# Read the data, keeping the day, the time, both counts, and the label
# (the date field at index 1 is skipped)
X = []
with open(input_file, 'r') as f:
    for line in f:
        data = line.strip().split(',')
        X.append([data[0]] + data[2:])

X = np.array(X)
# All the data, including the label column, is now in X
- Let's convert the data into numerical form:
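# Convert string data to numerical data
label_encoder = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    if item.isdigit():
        # Already numeric: copy the column as is
        X_encoded[:, i] = X[:, i]
    else:
        # Categorical: fit one LabelEncoder per column and keep it for later
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])

# The last column is the label; everything before it is a feature
X = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)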
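The encoding step can be illustrated without scikit-learn. The following is a minimal pure-Python sketch of the kind of mapping LabelEncoder builds: the distinct values are sorted and numbered from 0. The helper name fit_label_map and the sample day values are hypothetical.

```python
def fit_label_map(values):
    """Map each distinct string to an integer, in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}


# Hypothetical column of day names
days = ['Tuesday', 'Sunday', 'Tuesday', 'Monday']

day_to_int = fit_label_map(days)
encoded = [day_to_int[d] for d in days]

# Invert the mapping to recover the original strings
int_to_day = {index: label for label, index in day_to_int.items()}
decoded = [int_to_day[i] for i in encoded]

print(day_to_int)
print(encoded)
print(decoded)
```

Note that the integer codes follow alphabetical order, not calendar order, so the numeric distance between two encoded days carries no real meaning.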
- Let's train an SVM using the radial basis function kernel, Platt scaling, and class balancing:
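# 'balanced' replaces the old 'auto' value, which current scikit-learn releases no longer accept
params = {'kernel': 'rbf', 'probability': True, 'class_weight': 'balanced'}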
classifier = SVC(**params)
classifier.fit(X, y)
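Class balancing matters here because events are much rarer than non-events. As a rough sketch, the 'balanced' heuristic documented by scikit-learn weights each class by n_samples / (n_classes * class_count). The helper name balanced_class_weights and the toy labels below are assumptions for illustration, not part of event.py.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1
y_example = [0] * 8 + [1] * 2
weights = balanced_class_weights(y_example)
print(weights)
```

The rare class receives the larger weight, so misclassifying one of its samples costs more during training.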
- Let's perform cross-validation:
accuracy = cross_val_score(classifier, X, y, scoring='accuracy', cv=3)
print("Accuracy of the classifier: " + str(round(100 * accuracy.mean(), 2)) + "%")
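To show what cv=3 does conceptually, here is a simplified sketch that cuts the sample indices into three contiguous folds and averages per-fold accuracies. It is only an approximation: for a classifier with an integer cv, scikit-learn uses stratified folds that preserve class proportions, which this sketch does not do. The helper names and the accuracy values are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def mean_accuracy(fold_accuracies):
    """Average the accuracy measured on each held-out fold."""
    return sum(fold_accuracies) / len(fold_accuracies)


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)

# Made-up per-fold accuracies, used only to show the averaging step
print(round(100 * mean_accuracy([0.5, 0.75, 1.0]), 2))
```

Each sample is held out exactly once, so the averaged score reflects performance on data the model did not see during that fold's training.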
- Let's test the classifier on a new datapoint:
input_data = ['Tuesday', '12:30:00', '21', '23']
input_data_encoded = [-1] * len(input_data)
count = 0
for i, item in enumerate(input_data):
    if item.isdigit():
        input_data_encoded[i] = int(item)
    else:
        # Reuse the encoder fitted on the matching training column
        input_data_encoded[i] = int(label_encoder[count].transform([item])[0])
        count = count + 1

# predict() expects a 2D array of shape (n_samples, n_features)
input_data_encoded = np.array(input_data_encoded).reshape(1, -1)

# Predict and print output for a particular datapoint
output_class = classifier.predict(input_data_encoded)
print("Output class:", label_encoder[-1].inverse_transform(output_class)[0])
- The output is as follows:
Accuracy of the classifier: 93.95%
Output class: noevent
- If you use building_event_multiclass.txt as the input file instead of building_event_binary.txt, the output is as follows:
Accuracy of the classifier: 65.33%
Output class: eventA