[5] Dataset Transformation -- 3 -- Preprocessing Data: One Hot Encoding in Scikit-Learn
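One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time.
In practical machine learning tasks, features are not always continuous values. Some are categorical, such as gender, which can be "male" or "female". Features like this usually have to be converted to numbers first, as in the following example.
Suppose we have these three features: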
- Gender: ["male", "female"]
- Region: ["Europe", "US", "Asia"]
- Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]
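For a sample such as ["male", "US", "Internet Explorer"], we need to turn these categorical values into numbers. The most direct approach is to replace each value with its index in the category list: [0, 1, 3]. However, features encoded this way should not be fed directly into most machine learning algorithms, because the integers imply an order and a distance between categories that do not actually exist.
In the example above, gender has two possible values, region has three, and browser has four. With one-hot encoding, "male" maps to [1, 0], "US" maps to [0, 1, 0], and "Internet Explorer" maps to [0, 0, 0, 1]. The fully encoded sample is therefore [1, 0, 0, 1, 0, 0, 0, 0, 1]. One consequence is that the data becomes very sparse.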
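The two encodings described above can be sketched in plain Python. This is only an illustration: the category lists come from the example, while the names `FEATURES`, `integer_encode`, and `one_hot` are made up for this sketch.

```python
# Category lists from the example above; all names here are illustrative.
FEATURES = {
    "gender": ["male", "female"],
    "region": ["Europe", "US", "Asia"],
    "browser": ["Firefox", "Chrome", "Safari", "Internet Explorer"],
}

def integer_encode(sample):
    """Map each value to its index within its own category list."""
    return [cats.index(v) for cats, v in zip(FEATURES.values(), sample)]

def one_hot(sample):
    """Build one indicator vector per feature and concatenate them."""
    encoded = []
    for cats, value in zip(FEATURES.values(), sample):
        vec = [0] * len(cats)
        vec[cats.index(value)] = 1
        encoded.extend(vec)
    return encoded

sample = ["male", "US", "Internet Explorer"]
print(integer_encode(sample))  # [0, 1, 3]
print(one_hot(sample))         # [1, 0, 0, 1, 0, 0, 0, 0, 1]
```

The length of the one-hot vector is the sum of the category counts (2 + 3 + 4 = 9), and exactly one entry per feature is set to 1.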
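Advantages:
- It solves the problem that classifiers handle discrete (categorical) data poorly.
- To some extent it also expands the feature set (the sample above grows from 3 features to 9).
Disadvantages (when used to represent text):
- It is a bag-of-words model and ignores word order, even though word order often carries important information.
- It assumes words are independent of one another, whereas in most cases they influence each other.
- The resulting features are discrete and sparse.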
1. Implementing one-hot encoding in code
from numpy import argmax
# define input string
data = 'hello world'
print(data)
# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz '
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# integer encode input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
#[7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)
# prints 11 vectors of length 27, one per character, each containing a single 1
# invert encoding: the position of the 1 is the integer code of the character
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)
# h
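The loop above can also be written in vectorized form with NumPy. The following is a minimal sketch of one common alternative, not part of the original code: it indexes the rows of an identity matrix and decodes the whole string rather than just the first character.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_int = {c: i for i, c in enumerate(alphabet)}

data = "hello world"
indices = np.array([char_to_int[c] for c in data])

# Row i of the identity matrix is the one-hot vector for integer code i.
onehot = np.eye(len(alphabet), dtype=int)[indices]

# argmax along each row recovers the integer codes, and from them the text.
decoded = "".join(alphabet[i] for i in onehot.argmax(axis=1))

print(onehot.shape)  # (11, 27)
print(decoded)       # hello world
```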
2. One-hot encoding with sklearn
About LabelEncoder
sklearn.preprocessing.LabelEncoder
Encodes a categorical variable as integers from 0 to n_classes - 1.
Example:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])
Handling non-numeric columns:
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
Methods:
- fit(y) Fit label encoder
- fit_transform(y) Fit label encoder and return encoded labels
- get_params([deep]) Get parameters for this estimator.
- inverse_transform(y) Transform labels back to original encoding.
- set_params(**params) Set the parameters of this estimator.
- transform(y) Transform labels to normalized encoding.
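One behavior worth knowing when using `fit` and `transform` separately: a fitted `LabelEncoder` only knows the labels it saw during `fit`. The short sketch below, which uses "berlin" as a made-up unseen value, shows that `transform` rejects such a label with a `ValueError`.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "tokyo", "amsterdam"])

try:
    # "berlin" was not present during fit, so it has no integer code.
    le.transform(["tokyo", "berlin"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True

print(list(le.classes_))
print(unseen_rejected)
```

In practice this means the encoder should be fitted on data that covers every label expected at prediction time.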
OneHotEncoder
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
Methods:
- fit(X[, y]) Fit OneHotEncoder to X.
- fit_transform(X[, y]) Fit OneHotEncoder to X, then transform X.
- get_params([deep]) Get parameters for this estimator.
- set_params(**params) Set the parameters of this estimator.
- transform(X) Transform X using one-hot encoding.
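The session above reflects an older scikit-learn release. In the releases I am aware of (0.20 and later), `OneHotEncoder` accepts string categories directly and exposes the learned values through `categories_`, while `n_values_` and `feature_indices_` are no longer available. The following is a hedged sketch under that assumption; the four training rows are invented from the category lists used earlier.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy training rows built from the gender / region / browser example.
X = [
    ["male", "US", "Internet Explorer"],
    ["female", "Europe", "Firefox"],
    ["female", "Asia", "Chrome"],
    ["male", "US", "Safari"],
]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X)

# One sorted array of categories per input column.
for cats in enc.categories_:
    print(list(cats))

# transform returns a sparse matrix by default; toarray() makes it dense.
row = enc.transform([["male", "US", "Internet Explorer"]]).toarray()
print(row)
```

Because the categories of each column are sorted, the position of the 1 inside each block can differ from the hand-worked example, where the categories were listed in an arbitrary order.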
3. Worked example
3.1 Loading the data
#import
import numpy as np
import pandas as pd
# load dataset
X = pd.read_csv('titanic_data.csv')  # by default (header='infer') the first row is used as the column names
X.head(3)
# limit to categorical data using df.select_dtypes()
X = X.select_dtypes(include=[object])
X.head(3)
Note: DataFrame.select_dtypes(include=None, exclude=None) returns the subset of columns whose dtypes match; include and exclude are list-like collections of the dtypes to keep or to drop.
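As a small illustration of `select_dtypes`, the sketch below splits a toy frame into numeric and non-numeric columns. The frame and its values are invented for this example. It uses `exclude=["number"]` instead of `include=[object]` as an assumption-free alternative, since how string columns are stored can vary between pandas versions.

```python
import pandas as pd

# Hypothetical toy frame, not the real Titanic file.
df = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 3],
})

numeric = df.select_dtypes(include=["number"])
non_numeric = df.select_dtypes(exclude=["number"])

print(list(numeric.columns))
print(list(non_numeric.columns))
```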
- | Name | Sex | Ticket | Cabin | Embarked |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | male | A/5 21171 | NaN | S |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | PC 17599 | C85 | C |
2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | NaN | S |
# check original shape
X.shape
(891, 5)
3.2 Converting categorical variables to integers
# import preprocessing from sklearn
from sklearn import preprocessing
# view columns using df.columns
X.columns
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
# TODO: create a LabelEncoder object and fit it to each feature in X
# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()
# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X.apply(le.fit_transform)
X_2.head()
- | Name | Sex | Ticket | Cabin | Embarked |
---|---|---|---|---|---|
0 | 108 | 1 | 523 | 0 | 3 |
1 | 190 | 0 | 596 | 82 | 1 |
2 | 353 | 0 | 669 | 0 | 3 |
3 | 272 | 0 | 49 | 56 | 3 |
4 | 15 | 1 | 472 | 0 | 3 |
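Two details of the step above are easy to miss. First, a single `LabelEncoder` is refitted for every column, so after `apply` it only remembers the classes of the last column. Second, columns such as Cabin contain missing values, and how those are handled may depend on the Python and scikit-learn versions. The sketch below is one possible variation, not the original author's code: it fills missing values with an explicit placeholder (the string "missing" is an arbitrary choice) and keeps one encoder per column so that each can be inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with a missing-value column.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Cabin": [None, "C85", None],
})

encoders = {}                          # one fitted encoder per column
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    # Replace missing values with an explicit placeholder category.
    filled = df[col].fillna("missing").astype(str)
    le = LabelEncoder()
    encoded[col] = le.fit_transform(filled)
    encoders[col] = le

print(encoded)
print(list(encoders["Cabin"].inverse_transform(encoded["Cabin"])))
```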
OneHotEncoder:
- Encode categorical integer features using a one-hot aka one-of-K scheme.
- The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
- The output will be a sparse matrix where each column corresponds to one possible value of one feature.
- It is assumed that input features take on values in the range [0, n_values).
- This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
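Question: is each column processed separately? Yes. OneHotEncoder encodes each column independently and concatenates the resulting blocks, which is why n_values_ in the earlier example reports one count per column.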
3.3 Converting integers to one-hot vectors
# TODO: create a OneHotEncoder object, and fit it to all of X
# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()
# 2. FIT
enc.fit(X_2)
# 3. Transform
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape
(891, 1726)
# the number of rows is unchanged (891), but there are now many more columns,
# because every distinct value of every feature gets its own column
onehotlabels
array([[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 0., ..., 1., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 1.],
...,
[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 0., ..., 1., 0., 0.],
[ 0., 0., 0., ..., 0., 1., 0.]])
type(onehotlabels)
numpy.ndarray
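The width of the encoded matrix can be predicted before encoding: it is the sum of the number of distinct values in each column. The sketch below checks this on a small invented frame of integer codes. It also keeps the result sparse instead of calling `toarray()`, which is usually preferable when the matrix is wide, since only one entry per feature and row is non-zero.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded frame, in the spirit of X_2 above.
X_small = pd.DataFrame({"Sex": [1, 0, 0, 1], "Embarked": [3, 1, 3, 2]})

enc = OneHotEncoder()
onehot = enc.fit_transform(X_small)    # SciPy sparse matrix by default

# Each distinct value of each column becomes one output column.
expected_width = int(X_small.nunique().sum())

print(onehot.shape)      # (4, 5)
print(expected_width)    # 5
print(onehot.nnz)        # 8 non-zero entries: 4 rows x 2 features
```

A column in which nearly every value is unique, such as a name column, contributes almost one output column per row, which is how five features can expand into well over a thousand columns.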
References:
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.fit_transform
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- http://www.ritchieng.com/machinelearning-one-hot-encoding/
- https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
- https://blog.csdn.net/google19890102/article/details/44039761