# [5] Dataset Transformation -- 3 -- Data Preprocessing: One-Hot Encoding in Scikit-Learn

One-hot encoding, also known as 1-of-K encoding, borrows the idea of a one-hot state register: N states are encoded with N bits, each state gets its own independent bit, and at any given time exactly one bit is active. As an example, consider three categorical features:

• Gender: ["male", "female"]
• Region: ["Europe", "US", "Asia"]
• Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]

One-hot encoding has two advantages:

1. it sidesteps the difficulty many classifiers have with raw categorical (discrete) values, and
2. to some extent it also expands the feature space (the 3 features above become 9 binary features).

Its drawbacks, especially for text data:

1. it is a bag-of-words style representation and ignores word order (which carries important information in text);
2. it assumes tokens are mutually independent (in most cases they influence one another);
3. the features it produces are discrete and sparse.
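The 3-to-9 feature expansion mentioned above can be sketched in a few lines of plain Python (the category orders here are assumed for illustration):

```python
def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`."""
    return [1 if c == value else 0 for c in categories]

genders = ["male", "female"]
regions = ["Europe", "US", "Asia"]
browsers = ["Firefox", "Chrome", "Safari", "Internet Explorer"]

# a sample such as ("male", "US", "Safari") concatenates three one-hot
# vectors of length 2, 3 and 4 into a single 9-dimensional vector
sample = one_hot("male", genders) + one_hot("US", regions) + one_hot("Safari", browsers)
print(sample)  # [1, 0, 0, 1, 0, 0, 0, 1, 0]
```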

## 1. Implementing one-hot encoding from scratch

```python
from numpy import argmax

# define input string
data = 'hello world'
print(data)

# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz '

# define a mapping of chars to integers and back
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# integer encode input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
# [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]

# one hot encode: one row of 27 bits per character
onehot_encoded = list()
for value in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)
# [[0, 0, 0, 0, 0, 0, 0, 1, 0, ..., 0],   <- 'h' (index 7)
#  [0, 0, 0, 0, 1, 0, ..., 0],            <- 'e' (index 4)
#  ...
#  [0, 0, 0, 1, 0, ..., 0]]               <- 'd' (index 3)

# invert encoding: argmax recovers the integer, the dict recovers the char
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)
# h
```
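The character loop above can also be vectorized with NumPy: each row of an identity matrix is already a one-hot vector, so indexing its rows by the integer codes builds the whole matrix in one step (a sketch using the same alphabet):

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}

data = 'hello world'
integer_encoded = np.array([char_to_int[ch] for ch in data])

# fancy indexing selects the matching identity-matrix row per character
onehot = np.eye(len(alphabet), dtype=int)[integer_encoded]
print(onehot.shape)  # (11, 27)
```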


## 2. One-hot encoding with scikit-learn

### LabelEncoder

`sklearn.preprocessing.LabelEncoder` encodes labels with integer values between 0 and n_classes - 1:

```python
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])
```

It works the same way on string labels:

```python
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
```


• `fit(y)`: Fit label encoder.
• `fit_transform(y)`: Fit label encoder and return encoded labels.
• `get_params([deep])`: Get parameters for this estimator.
• `inverse_transform(y)`: Transform labels back to original encoding.
• `set_params(**params)`: Set the parameters of this estimator.
• `transform(y)`: Transform labels to normalized encoding.
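One caveat the list above does not mention: `transform` only accepts labels that were seen during `fit`; anything else raises a `ValueError` rather than being silently encoded. A small sketch:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

# a label unseen at fit time is rejected with a ValueError
try:
    le.transform(["berlin"])
    rejected = False
except ValueError:
    rejected = True
print("unseen label rejected:", rejected)  # unseen label rejected: True
```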

### OneHotEncoder

```python
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
              handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
```


• `fit(X[, y])`: Fit OneHotEncoder to X.
• `fit_transform(X[, y])`: Fit OneHotEncoder to X, then transform X.
• `get_params([deep])`: Get parameters for this estimator.
• `set_params(**params)`: Set the parameters of this estimator.
• `transform(X)`: Transform X using one-hot encoding.
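Note that the doctest above reflects an older scikit-learn API: `n_values_`, `feature_indices_`, and the `categorical_features`/`n_values` constructor arguments were deprecated in 0.20 and later removed. A version-tolerant sketch of the same example on a current release uses `categories_` instead:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# the learned categories per column replace the old n_values_ attribute
print([c.tolist() for c in enc.categories_])  # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]

dense = enc.transform([[0, 1, 1]]).toarray()
print(dense)  # [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]
```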


## 3. A worked example

### 3.1 Reading the data

```python
# imports
import numpy as np
import pandas as pd

# load dataset; pd.read_csv treats the first row as column names by default (header=0)
X = pd.read_csv('titanic_data.csv')
X.head(3)
```


```python
# limit to categorical data using df.select_dtypes()
X = X.select_dtypes(include=[object])
X.head(3)
```


|   | Name | Sex | Ticket | Cabin | Embarked |
|---|------|-----|--------|-------|----------|
| 0 | Braund, Mr. Owen Harris | male | A/5 21171 | NaN | S |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | PC 17599 | C85 | C |
| 2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | NaN | S |

```python
# check original shape
X.shape
```

(891, 5)

### 3.2 Converting categorical variables to integers

```python
# import preprocessing from sklearn
from sklearn import preprocessing

# view columns using df.columns
X.columns
```

Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')

```python
# TODO: create a LabelEncoder object and fit it to each feature in X

# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()

# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X.apply(le.fit_transform)
X_2.head()
```

|   | Name | Sex | Ticket | Cabin | Embarked |
|---|------|-----|--------|-------|----------|
| 0 | 108 | 1 | 523 | 0 | 3 |
| 1 | 190 | 0 | 596 | 82 | 1 |
| 2 | 353 | 0 | 669 | 0 | 3 |
| 3 | 272 | 0 | 49 | 56 | 3 |
| 4 | 15 | 1 | 472 | 0 | 3 |

OneHotEncoder:

• Encode categorical integer features using a one-hot aka one-of-K scheme.
• The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
• The output will be a sparse matrix where each column corresponds to one possible value of one feature.
• It is assumed that input features take on values in the range [0, n_values).
• This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

### 3.3 Converting the integers to one-hot vectors

```python
# TODO: create a OneHotEncoder object, and fit it to all of X

# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()

# 2. FIT
enc.fit(X_2)

# 3. TRANSFORM
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape

# as you can see, we still have the same 891 rows, but many more
# columns, because every distinct categorical value now gets its
# own binary column
```


(891, 1726)
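The width 1726 is not arbitrary: `OneHotEncoder` allocates one output column per distinct value in each input column, so the output width equals the total number of distinct values across the five encoded columns. A toy sketch of this invariant (with made-up data, since the Titanic CSV is not bundled here):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# two integer-encoded categorical columns, standing in for X_2
df = pd.DataFrame({"Sex": [1, 0, 0], "Embarked": [3, 1, 3]})

enc = OneHotEncoder()
out = enc.fit_transform(df).toarray()

# one output column per distinct value in each input column: 2 + 2 = 4
print(out.shape[1], int(df.nunique().sum()))  # 4 4
```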

```python
onehotlabels
```

```
array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])
```


```python
type(onehotlabels)
```

numpy.ndarray
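As an aside, pandas can do the whole categorical-to-one-hot conversion in one call with `get_dummies`, without the intermediate `LabelEncoder` step (sketched on a small made-up frame, again because the Titanic CSV is not bundled here):

```python
import pandas as pd

# a made-up stand-in for two of the categorical Titanic columns
df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["S", "C", "S"]})

# one binary column per (column, value) pair, named column_value
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
# ['Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S']
```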
