[5] Dataset Transformations -- 3 -- Preprocessing Data: One Hot Encoding in Scikit-Learn




  • Gender: ["male", "female"]
  • Region: ["Europe", "US", "Asia"]
  • Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]

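For a given sample, such as ["male", "US", "Internet Explorer"], we need to turn these categorical feature values into numbers. The most direct approach is integer encoding, which gives [0, 1, 3]. However, features encoded this way should not be fed directly into most machine learning algorithms, because the integers imply an order and a distance between categories (for example, that "Internet Explorer" = 3 is somehow larger than "Chrome" = 1) that does not actually exist.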
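The integer encoding described above can be sketched in plain Python. The dictionary name `categories` and the helper `integer_encode` are illustrative names, not part of any library; the code simply uses each value's position in its feature's list.

```python
# Category vocabularies from the example above; list order defines the integer code.
categories = {
    "gender": ["male", "female"],
    "region": ["Europe", "US", "Asia"],
    "browser": ["Firefox", "Chrome", "Safari", "Internet Explorer"],
}

def integer_encode(sample):
    """Map each categorical value to its index in the feature's value list."""
    return [categories[name].index(value)
            for name, value in zip(categories, sample)]

codes = integer_encode(["male", "US", "Internet Explorer"])
print(codes)  # [0, 1, 3]
```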

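In the example above, gender has two possible values, region has three, and browser has four. With one-hot encoding, each feature is represented by a vector with one slot per possible value, and only the slot of the observed value is set to 1. For the sample ["male", "US", "Internet Explorer"], "male" maps to [1, 0], "US" maps to [0, 1, 0], and "Internet Explorer" maps to [0, 0, 0, 1]. The complete encoded sample is therefore [1, 0, 0, 1, 0, 0, 0, 0, 1]. One consequence is that the data becomes very sparse.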
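As a minimal sketch of the encoding just described, the following plain-Python snippet builds one 0/1 block per feature and concatenates the blocks. The helper name `one_hot_encode` is hypothetical and chosen only for this illustration.

```python
# Possible values of each feature, in the order used by the example.
categories = [
    ["male", "female"],
    ["Europe", "US", "Asia"],
    ["Firefox", "Chrome", "Safari", "Internet Explorer"],
]

def one_hot_encode(sample):
    """Concatenate one 0/1 block per feature, with a single 1 per block."""
    encoded = []
    for values, value in zip(categories, sample):
        block = [0] * len(values)
        block[values.index(value)] = 1
        encoded.extend(block)
    return encoded

vector = one_hot_encode(["male", "US", "Internet Explorer"])
print(vector)  # [1, 0, 0, 1, 0, 0, 0, 0, 1]
```

The three original features become 2 + 3 + 4 = 9 binary features, of which exactly three are 1.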


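Advantages of one-hot encoding:

  1. It solves the problem that classifiers do not handle discrete (categorical) data well.
  2. To some extent it also expands the feature set (in the sample above, the number of features grows from 3 to 9).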


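Disadvantages, when it is used to represent text:

  1. It is a bag-of-words model and ignores the order of words (word order in a text carries important information).
  2. It assumes that words are independent of each other (in most cases, words influence each other).
  3. The features it produces are discrete and sparse.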
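The first limitation can be illustrated with a small sketch: two sentences that use the same words in a different order end up with the same bag-of-words vector. The function `bag_of_words`, the vocabulary, and the sentences are made up for this illustration.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]
a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)

print(a, b)    # [1, 1, 1] [1, 1, 1]
print(a == b)  # True: the two sentences are indistinguishable
```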

I. Implementing one-hot encoding in code

from numpy import argmax

# define input string
data = 'hello world'

# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz '

# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# integer encode input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
# [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]

# one hot encode: build one 27-element vector per character
onehot_encoded = list()
for value in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)  # store the vector for this character
print(onehot_encoded)
# prints 11 lists of 27 values, each with a single 1 at the character's index

# invert encoding of the first character
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)
# h
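The listing above inverts only the first character. As a hedged variation, the whole string can be encoded and decoded at once with NumPy, by selecting rows of an identity matrix and then taking `argmax` along each row. This is an alternative sketch, not the original author's code.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
data = 'hello world'
char_to_int = {c: i for i, c in enumerate(alphabet)}

# Row i of the identity matrix is the one-hot vector for index i.
integer_encoded = [char_to_int[ch] for ch in data]
onehot = np.eye(len(alphabet), dtype=int)[integer_encoded]

# argmax along each row recovers the index, and the index recovers the character.
decoded = ''.join(alphabet[i] for i in onehot.argmax(axis=1))

print(onehot.shape)  # (11, 27)
print(decoded)       # hello world
```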

II. One-hot encoding with sklearn





>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])


>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
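One behaviour worth knowing about the sessions above: a fitted LabelEncoder only knows the labels it saw during fit. The sketch below shows that transforming an unseen label raises a ValueError; the label "berlin" and the flag name `unseen_rejected` are made up for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

try:
    # "berlin" was not present when the encoder was fitted
    le.transform(["tokyo", "berlin"])
    unseen_rejected = False
except ValueError as exc:
    unseen_rejected = True
    print("transform failed:", exc)
```

In practice this means the encoder should be fitted on every label that can occur, or unseen labels need to be handled before calling transform.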


  • fit(y) Fit label encoder
  • fit_transform(y) Fit label encoder and return encoded labels
  • get_params([deep]) Get parameters for this estimator.
  • inverse_transform(y) Transform labels back to original encoding.
  • set_params(**params) Set the parameters of this estimator.
  • transform(y) Transform labels to normalized encoding.


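Note: the session below was recorded with an older scikit-learn release. The n_values_ and feature_indices_ attributes and the n_values and categorical_features parameters were deprecated in version 0.20 and later removed; newer releases expose the learned values through categories_.

>>> from sklearn.preprocessing import OneHotEncoder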
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
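For readers on a recent scikit-learn release, the same example can be sketched with the `categories_` attribute. The variables `n_values` and `feature_indices` below are ordinary Python names computed by hand to mirror the old attributes; they are not attributes of the encoder.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# categories_ holds the distinct values learned for each input column.
n_values = [len(cats) for cats in enc.categories_]

# Cumulative offsets: where each feature's block starts in the output.
feature_indices = np.cumsum([0] + n_values)

encoded = enc.transform([[0, 1, 1]]).toarray()

print(n_values)         # [2, 3, 4]
print(feature_indices)  # [0 2 5 9]
print(encoded)          # [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]
```

The offsets [0, 2, 5, 9] say that output columns 0-1 belong to the first feature, 2-4 to the second, and 5-8 to the third.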


  • fit(X[, y]) Fit OneHotEncoder to X.
  • fit_transform(X[, y]) Fit OneHotEncoder to X, then transform X.
  • get_params([deep]) Get parameters for this estimator.
  • set_params(**params) Set the parameters of this estimator.
  • transform(X) Transform X using one-hot encoding.
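The `handle_unknown` parameter that appears in the encoder's repr controls what transform does with a category that was not seen during fit. The following sketch assumes scikit-learn 0.20 or later (which accepts string categories directly) and uses a made-up region "Africa" as the unseen value.

```python
from sklearn.preprocessing import OneHotEncoder

# With handle_unknown="ignore", an unseen category becomes an all-zero block
# instead of raising an error (the default is handle_unknown="error").
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["male", "US"], ["female", "Europe"], ["female", "Asia"]])

# Learned (sorted) categories: ['female', 'male'] and ['Asia', 'Europe', 'US'].
known = enc.transform([["male", "Asia"]]).toarray()
unknown = enc.transform([["male", "Africa"]]).toarray()

print(known)    # [[0. 1. 1. 0. 0.]]
print(unknown)  # [[0. 1. 0. 0. 0.]]
```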


3.1 Loading the data

import numpy as np
import pandas as pd

# load dataset
X = pd.read_csv('titanic_data.csv')  # by default (header='infer') the first row of the file is used as the column names

# limit to categorical data using df.select_dtypes()
X = X.select_dtypes(include=[object])

Note: DataFrame.select_dtypes(include=None, exclude=None) returns the subset of columns whose dtypes match. Both include and exclude are list-like and hold the dtypes to select or to leave out.
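A small sketch of `select_dtypes` on a toy DataFrame. The Age and Pclass values are made up for the illustration, and the string columns are created with an explicit object dtype so that the selection behaves the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": pd.Series(["Braund", "Cumings"], dtype=object),
    "Sex": pd.Series(["male", "female"], dtype=object),
    "Age": [22.0, 38.0],   # illustrative numeric column
    "Pclass": [3, 1],      # illustrative integer column
})

# Keep only the object (string-like) columns, or everything except them.
categorical = df.select_dtypes(include=[object])
numeric = df.select_dtypes(exclude=[object])

print(list(categorical.columns))  # ['Name', 'Sex']
print(list(numeric.columns))      # ['Age', 'Pclass']
```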

X.head(3)

   Name                                              Sex     Ticket            Cabin  Embarked
0  Braund, Mr. Owen Harris                           male    A/5 21171         NaN    S
1  Cumings, Mrs. John Bradley (Florence Briggs Th…   female  PC 17599          C85    C
2  Heikkinen, Miss. Laina                            female  STON/O2. 3101282  NaN    S

# check original shape
X.shape

(891, 5)

3.2 Converting categorical variables to numeric variables

# import preprocessing from sklearn
from sklearn import preprocessing

# view columns using df.columns
X.columns

Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')

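# TODO: create a LabelEncoder object and fit it to each feature in X

# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()

# use df.apply() to apply le.fit_transform to all columns
# (the table below comes from the original run, where the NaN entries in Cabin appear as code 0;
#  other Python / scikit-learn versions may encode NaN differently or reject it)
X_2 = X.apply(le.fit_transform)
X_2.head()

   Name  Sex  Ticket  Cabin  Embarked
0   108    1     523      0         3
1   190    0     596     82         1
2   353    0     669      0         3
3   272    0      49     56         3
4    15    1     472      0         3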
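Because the step above reuses a single LabelEncoder for every column, the encoder only remembers the classes of the last column it was fitted on. If the codes need to be mapped back later, one option is to keep one encoder per column. This is a sketch on a made-up two-column frame; the `encoders` dictionary is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the categorical Titanic columns (values are illustrative).
X = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Fit and store a separate encoder for each column.
encoders = {}
X_2 = pd.DataFrame(index=X.index)
for column in X.columns:
    encoders[column] = LabelEncoder()
    X_2[column] = encoders[column].fit_transform(X[column])

# Each stored encoder can invert its own column.
restored = encoders["Sex"].inverse_transform(X_2["Sex"])

print(X_2.values.tolist())  # [[1, 1], [0, 0], [0, 1]]
print(list(restored))       # ['male', 'female', 'female']
```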


  • Encode categorical integer features using a one-hot aka one-of-K scheme.
  • The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
  • The output will be a sparse matrix where each column corresponds to one possible value of one feature.
  • It is assumed that input features take on values in the range [0, n_values).
  • This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.


3.3 Converting numeric values to vectors

# TODO: create a OneHotEncoder object, and fit it to all of X

# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()

# 2. FIT
enc.fit(X_2)

# 3. Transform
onehotlabels = enc.transform(X_2).toarray()

# as you can see, the number of rows is still 891,
# but there are now many more columns, because every distinct category value became its own 0/1 column
onehotlabels.shape

(891, 1726)

onehotlabels

array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])
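To see where the column count comes from, the sketch below uses a tiny made-up integer matrix in place of X_2. It checks that the encoded width equals the total number of distinct values across columns, and that every row contains exactly one 1 per original feature.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for the label-encoded matrix X_2 (values are illustrative).
X_2 = np.array([[0, 1, 2],
                [1, 0, 0],
                [2, 1, 1],
                [3, 0, 2]])

enc = OneHotEncoder()
onehot = enc.fit_transform(X_2)  # a SciPy sparse matrix by default

# One output column per distinct value of each input column: 4 + 2 + 3.
expected_width = sum(len(np.unique(X_2[:, j])) for j in range(X_2.shape[1]))
ones_per_row = onehot.toarray().sum(axis=1)

print(onehot.shape)    # (4, 9)
print(expected_width)  # 9
print(ones_per_row)    # [3. 3. 3. 3.]
```

Since almost all entries are zero, it is often preferable to keep the sparse result and call toarray() only when a dense array is really needed.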




