【5】数据集转化--3--预处理数据-One Hot Encoding in Scikit-Learn

May 13, 2018 sklearn 阅读量：次

One-Hot编码，又称为一位有效编码，主要是采用位状态寄存器来对个状态进行编码，每个状态都由他独立的寄存器位，并且在任意时候只有一位有效。

在实际的机器学习的应用任务中，特征有时候并不总是连续值，有可能是一些分类值，如性别可分为“male”和“female”。在机器学习任务中，对于这样的特征，通常我们需要对其进行特征数字化，如下面的例子：

有如下三个特征属性：

性别：[“male”，“female”]
地区：[“Europe”，“US”，“Asia”]
浏览器：[“Firefox”，“Chrome”，“Safari”，“Internet Explorer”]

对于某一个样本，如[“male”，“US”，“Internet Explorer”]，我们需要将这个分类值的特征数字化，最直接的方法，我们可以采用序列化的方式：[0,1,3]。但是这样的特征处理并不能直接放入机器学习算法中。

对于上述的问题，性别的属性是二维的，同理，地区是三维的，浏览器则是思维的，这样，我们可以采用One-Hot编码的方式对上述的样本“[“male”，“US”，“Internet Explorer”]”编码，“male”则对应着[1，0]，同理“US”对应着[0，1，0]，“Internet Explorer”对应着[0,0,0,1]。则完整的特征数字化的结果为：[1,0,0,1,0,0,0,0,1]。这样导致的一个结果就是数据会变得非常的稀疏。

优点：

解决了分类器不好处理离散数据的问题，
在一定程度上也起到了扩充特征的作用（上面样本特征数从3扩展到了9）

缺点：

它是一个词袋模型，不考虑词与词之间的顺序（文本中词的顺序信息也是很重要的）；
它假设词与词相互独立（在大多数情况下，词与词是相互影响的）
它得到的特征是离散稀疏的。

一、one-hot encoding的代码实现

from numpy import argmax
# define input string
data = 'hello world'
print(data)

# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz '

# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# integer encode input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
#[7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]

# one hot encode
onehot_encoded = list()
for value in integer_encoded:
	letter = [0 for _ in range(len(alphabet))]
	letter[value] = 1
	onehot_encoded.append(letter)
print onehot_encoded

#, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


# invert encoding
inverted = int_to_char[argmax(onehot_encoded[0])]
print inverted

二、sklearn实现one-hot encoding

LabelEncoder说明

sklearn.preprocessing.LabelEncoder

将分类变量编码为0到（分类变量个数-1）的数值

例子：

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) 
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

处理非数值的列

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) 
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

说明：

fit(y) Fit label encoder
fit_transform(y) Fit label encoder and return encoded labels
get_params([deep]) Get parameters for this estimator.
inverse_transform(y) Transform labels back to original encoding.
set_params(**params) Set the parameters of this estimator.
transform(y) Transform labels to normalized encoding.

OneHotEncoder

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

方法

fit(X[, y])	Fit OneHotEncoder to X.
fit_transform(X[, y])	Fit OneHotEncoder to X, then transform X.
get_params([deep])	Get parameters for this estimator.
set_params(**params)	Set the parameters of this estimator.
transform(X)	Transform X using one-hot encoding.

三、案例详解

3.1 读取数据

#import 
import numpy as np
import pandas as pd


# load dataset
X = pd.read_csv('titanic_data.csv')  # 第一行是否需要加#,还是默认的就把第一行作为列名？
X.head(3)

# limit to categorical data using df.select_dtypes()
X = X.select_dtypes(include=[object])
X.head(3)

说明：DataFrame.select_dtypes(include=None, exclude=None)，include, exclude : list-like(传入想要查找的类型)

-	Name	Sex	Ticket	Cabin	Embarked
0	Braund, Mr. Owen Harris	male	A/5 21171	NaN	S
1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	PC 17599	C85	C
2	Heikkinen, Miss. Laina	female	STON/O2. 3101282	NaN	S

# check original shape
X.shape

(891, 5)

3.2 分类变量转换成数值变量

# import preprocessing from sklearn
from sklearn import preprocessing

# view columns using df.columns
X.columns

Index([u’Name', u’Sex', u’Ticket', u’Cabin', u’Embarked'], dtype=‘object’)

# TODO: create a LabelEncoder object and fit it to each feature in X

# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X.apply(le.fit_transform)
X_2.head()

-	Name	Sex	Ticket	Cabin	Embarked
0	108	1	523	0	3
1	190	0	596	82	1
2	353	0	669	0	3
3	272	0	49	56	3
4	15	1	472	0	3

OneHotEncoder：

Encode categorical integer features using a one-hot aka one-of-K scheme.
The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
The output will be a sparse matrix where each column corresponds to one possible value of one feature.
It is assumed that input features take on values in the range [0, n_values).
This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

问题：每一列是单独处理么？？

3.3 数值转化成向量

# TODO: create a OneHotEncoder object, and fit it to all of X

# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()

# 2. FIT
enc.fit(X_2)

# 3. Transform
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape

# as you can see, you've the same number of rows 891
# but now you've so many more columns due to how we changed all the categorical data into numerical data

(891, 1726)

onehotlabels

array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])

type(onehotlabels)

numpy.ndarray

参考资料：

这里是一个广告位，，感兴趣的都可以发邮件聊聊：tiehan@sina.cn

个人公众号，比较懒，很少更新，可以在上面提问题，如果回复不及时，可发邮件给我： tiehan@sina.cn