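【2.4.1】Histograms (matplotlib-hist)
A histogram shows the frequency distribution of a given variable. Some of the examples below also group the frequency bars by a categorical variable, which gives a better view of the continuous variable and the categorical variable in tandem.
1. Concepts
Histograms versus bar charts:
- A bar chart uses the length of each bar to show the count for a category; the bar width (which only marks the category) is fixed and carries no meaning.
- A histogram uses area to show the count of each group. The width of a rectangle is the class interval (bin width) and the height is the count or frequency of the group (for unequal bin widths, the frequency per unit width), so both height and width are meaningful.
- Because grouped data are continuous, the rectangles of a histogram are usually drawn adjacent to each other, whereas the bars of a bar chart are drawn apart.
- Bar charts are mainly used for categorical data, histograms for numerical data.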
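The contrast above can be sketched side by side. This is a minimal illustration, not taken from the original post: the category names, the per-category counts, and the unequal bin edges are all made up for the example. With density=True, height times width of each rectangle is the share of observations in that bin, so the areas add up to 1.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)           # fixed seed so the sketch is repeatable
values = rng.normal(size=1000)           # numerical data for the histogram
labels = ["a", "b", "c"]                 # hypothetical categories for the bar chart
label_counts = [5, 3, 8]                 # hypothetical per-category counts

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: separated bars, only the height carries meaning.
ax_bar.bar(labels, label_counts, width=0.6)
ax_bar.set_title("Bar chart (categorical)")

# Histogram with unequal bin widths (edges chosen arbitrarily for illustration).
edges = [-4, -1, 0, 0.5, 1, 4]
heights, edges, _ = ax_hist.hist(values, bins=edges, density=True, edgecolor="black")
ax_hist.set_title("Histogram (numerical)")

areas = heights * np.diff(edges)         # area of each rectangle
print(areas.sum())                       # total area of all rectangles
plt.show()
```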
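More parameters
- color: a single color or a list of colors, one per dataset (not one per bin); to color individual bins, style the patches that hist() returns
- range: the (lower, upper) range of the bins; values outside this range are ignored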
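A short sketch of both parameters, assuming synthetic normal data and an arbitrary range of (-2, 2). Highlighting the tallest bin is only an illustration of styling one of the returned patches.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(size=1000)

# range=(-2, 2): the 8 bins span [-2, 2]; values outside are not counted.
counts, edges, patches = plt.hist(data, bins=8, range=(-2, 2), color="steelblue")

# color= applies to the whole dataset. To color a single bin, style the
# corresponding patch; here the tallest bin is highlighted.
patches[int(np.argmax(counts))].set_facecolor("orange")

# Number of values that actually fall inside the chosen range.
inside = np.count_nonzero((data >= -2) & (data <= 2))
print(counts.sum(), inside)
plt.show()
```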
2. Common examples
2.1 Basic example
from numpy.random import normal
gaussian_numbers = normal(size=1000)
import matplotlib.pyplot as plt
plt.hist(gaussian_numbers)
plt.title("Gaussian Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
hist() creates 10 bins by default. When that is too coarse, increase the number of bins manually:
plt.hist(gaussian_numbers, bins=20)
plt.show()
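The counts on the y axis can also be converted to a probability density (the old normed argument was removed in matplotlib 3.1; density replaces it):
plt.hist(gaussian_numbers, bins=20, density=True)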
plt.show()
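Note that density=True yields a density (bar areas sum to 1), not relative frequencies (bar heights sum to 1). If relative frequencies are wanted, one common approach is to pass weights. The sketch below uses synthetic data and compares the two.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = rng.normal(size=1000)

fig, (ax_density, ax_freq) = plt.subplots(1, 2, figsize=(10, 4))

# density=True: height = count / (total * bin width), so the areas sum to 1.
dens, edges, _ = ax_density.hist(data, bins=20, density=True)
ax_density.set_ylabel("Probability density")

# weights of 1/N per observation: height = count / total, so the heights sum to 1.
weights = np.full(data.shape, 1.0 / data.size)
freq, _, _ = ax_freq.hist(data, bins=20, weights=weights)
ax_freq.set_ylabel("Relative frequency")

plt.show()
```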
2.2 Cumulative distribution plot
plt.hist(gaussian_numbers, bins=20, density=True, cumulative=True)
plt.show()
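As a sanity check, the cumulative values returned by hist() can be compared with a running sum computed by hand. This is a sketch on synthetic data; with density=True and cumulative=True the last bin should reach 1.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.normal(size=1000)

# Cumulative, normalized histogram drawn as an unfilled step line.
cum, edges, _ = plt.hist(data, bins=20, density=True, cumulative=True,
                         histtype="step", label="cumulative histogram")

# The same quantity by hand: running sum of the counts divided by the total.
counts, _ = np.histogram(data, bins=edges)
manual = np.cumsum(counts) / counts.sum()

plt.xlabel("Value")
plt.ylabel("Cumulative probability")
plt.legend(loc="upper left")
plt.show()
```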
2.3 Specifying the x ranges (edges) of the bins
plt.hist(gaussian_numbers, bins=(-10,-1,1,10))
plt.show()
2.4 Unfilled bars
plt.hist(gaussian_numbers, bins=20, histtype='step')
plt.show()
2.5 Histogram for a continuous variable
# Import libraries and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Prepare data
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [group_df[x_var].values.tolist() for _, group_df in df_agg]  # one list of values per class
# Draw
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])
# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b,1) for b in bins[::3]])
plt.show()
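The stacking behaviour can also be tried without downloading the CSV. The sketch below replaces the mpg data with three synthetic groups (the group names and distributions are invented) and passes label= so that plt.legend() needs no arguments. With stacked=True, each row of the returned counts is the running total up to that group.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
# Hypothetical groups standing in for the vehicle classes of the mpg dataset.
groups = {
    "group A": rng.normal(2.0, 0.4, 60),
    "group B": rng.normal(3.5, 0.6, 80),
    "group C": rng.normal(5.0, 0.8, 40),
}
colors = [plt.cm.Spectral(i / (len(groups) - 1)) for i in range(len(groups))]

n, bins, patches = plt.hist(list(groups.values()), bins=15, stacked=True,
                            color=colors, label=list(groups.keys()))
n = np.asarray(n)                      # shape: (number of groups, number of bins)
total = sum(len(v) for v in groups.values())

plt.legend()
plt.xlabel("value")
plt.ylabel("Frequency")
plt.show()
```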
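2.7 Histogram for a categorical variable
A histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can relate the distribution to a second categorical variable, which the colors represent.
# Import libraries and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")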
# Prepare data
x_var = 'manufacturer'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [group_df[x_var].values.tolist() for _, group_df in df_agg]  # one list of values per class
# Draw
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, df[x_var].nunique(), stacked=True, density=False, color=colors[:len(vals)])
# Decoration
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
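# bins holds one more edge than there are bars, so drop the last edge to match the number of labels
plt.xticks(ticks=bins[:-1], labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')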
plt.show()
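2.8 Density curves with a histogram
A density curve drawn over a histogram combines the information of the two plots, so you can show it in one figure instead of two.
# Import libraries and data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
# Draw Plot
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True, stat="density") is the suggested replacement.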
plt.figure(figsize=(13,10), dpi= 80)
sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
plt.ylim(0, 0.35)
# Decoration
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()
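If seaborn is not available, a similar picture can be sketched with matplotlib and NumPy alone. The helper gaussian_kde_1d below is a hypothetical hand-written kernel density estimate, the data are synthetic, and the rule-of-thumb bandwidth is an assumption; seaborn's own density curve may use different settings.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kde_1d(sample, grid, bandwidth):
    """Average of Gaussian kernels centered on each sample point."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(5)
sample = rng.normal(loc=20, scale=3, size=300)   # synthetic stand-in for a mileage column

# Rule-of-thumb bandwidth (an assumption, not seaborn's exact default).
bandwidth = 1.06 * sample.std(ddof=1) * sample.size ** (-1 / 5)
grid = np.linspace(sample.min() - 4 * bandwidth, sample.max() + 4 * bandwidth, 400)
kde = gaussian_kde_1d(sample, grid, bandwidth)

# Trapezoid-rule area under the curve; it should be close to 1.
kde_area = np.sum((kde[:-1] + kde[1:]) / 2 * np.diff(grid))

# density=True puts the histogram on the same scale as the density curve.
plt.hist(sample, bins=20, density=True, alpha=0.7, color="dodgerblue", label="histogram")
plt.plot(grid, kde, linewidth=3, color="navy", label="density curve")
plt.legend()
plt.show()
```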
The histtype parameter (from the matplotlib documentation):
- histtype: {'bar', 'barstacked', 'step', 'stepfilled'}, optional. The type of histogram to draw. Default is 'bar'.
- 'bar' is a traditional bar-type histogram. If multiple data are given the bars are arranged side by side.
- 'barstacked' is a bar-type histogram where multiple data are stacked on top of each other.
- 'step' generates a lineplot that is by default unfilled.
- 'stepfilled' generates a lineplot that is by default filled.
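The four values can be compared on the same pair of datasets. This is a small sketch with synthetic data; the 2x2 layout and the sample sizes are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.normal(0, 1, 500)
y = rng.normal(1, 1.5, 500)

histtypes = ["bar", "barstacked", "step", "stepfilled"]
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)

# One panel per histtype, each drawing the same two datasets.
for ax, histtype in zip(axes.flat, histtypes):
    ax.hist([x, y], bins=20, histtype=histtype, alpha=0.7, label=["x", "y"])
    ax.set_title(histtype)
    ax.legend()

plt.show()
```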
3. More advanced usage
3.1 Overlaying two histograms
Example 1
import matplotlib.pyplot as plt
from numpy.random import normal, uniform
gaussian_numbers = normal(size=1000)
uniform_numbers = uniform(low=-3, high=3, size=1000)
plt.hist(gaussian_numbers, bins=20, histtype='stepfilled', density=True, color='b', label='Gaussian')
plt.hist(uniform_numbers, bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Uniform')
plt.title("Gaussian/Uniform Histogram")
plt.xlabel("Value")
plt.ylabel("Probability")
plt.legend()
plt.show()
Example 2:
import random
import numpy
from matplotlib import pyplot
x = [random.gauss(3,1) for _ in range(400)]
y = [random.gauss(4,2) for _ in range(400)]
bins = numpy.linspace(-10, 10, 100)
pyplot.hist(x, bins, alpha=0.5, label='x')
pyplot.hist(y, bins, alpha=0.5, label='y')
pyplot.legend(loc='upper right')
pyplot.show()
Example 3
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-deep')  # this style was called 'seaborn-deep' before matplotlib 3.6
x = np.random.normal(1, 2, 5000)
y = np.random.normal(-1, 3, 2000)
bins = np.linspace(-10, 10, 30)
plt.hist([x, y], bins, label=['x', 'y'])
plt.legend(loc='upper right')
plt.show()
3.2 Swapping the x and y axes
import numpy as np
import matplotlib.pyplot as plt
data = np.random.exponential(1, 100)
# Showing the first plot.
plt.hist(data, bins=10)
plt.show()
# Clearing the figure (useful if you want to draw new shapes without closing the figure,
# but not really needed in this particular example; it is here as an illustration).
plt.gcf().clear()
# Showing the plot with horizontal orientation
plt.hist(data, bins=10, orientation='horizontal')
plt.show()
# Cleaning the plot.
plt.gcf().clear()
# Showing the third plot with horizontal orientation and an inverted y axis.
plt.hist(data, bins=10, orientation='horizontal')
plt.gca().invert_yaxis()
plt.show()
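Figure 1: default (vertical) histogram.
Figure 2: histogram with horizontal orientation.
Figure 3: histogram with horizontal orientation and an inverted y axis.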
3.3 Log-scaled y axis
plt.hist(data, bins=10, log=True)  # log=True puts the count axis on a log scale
plt.show()
3.4 Adding an overlaid curve
# -*- coding:utf-8 -*-
import pandas as pd
import numpy as np
# Fill a 1000x1000 array with random integers from 1 to 20 inclusive.
# np.random.randint excludes the upper bound; the vectorized call replaces a slow element-by-element loop.
data = np.random.randint(1, 21, size=(1000, 1000))
# Build the data first. A one-dimensional array maps to a pandas Series, a two-dimensional array to a DataFrame.
data_m = pd.DataFrame(data)
data_m = data_m[1].value_counts()  # value_counts tallies the values of a single Series (column 1 here)
data_m = data_m.sort_index()  # sort the tallied counts by value
print(data_m)
# Then draw the histograms
import matplotlib.pyplot as plt
plt.hist(data[0])
plt.show()
plt.hist(data[0],bins=20)
plt.show()
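The listing above draws the histograms only. One way to add the overlaid curve named in the heading is sketched below, under the assumption that a fitted normal curve is what is wanted: it uses synthetic normal data, estimates the mean and standard deviation from the sample, and draws the corresponding density over a density=True histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
data = rng.normal(loc=10, scale=2, size=2000)    # synthetic data for illustration

# Parameters of the curve, estimated from the sample.
mu, sigma = data.mean(), data.std()

# density=True puts the histogram on the same scale as the probability density.
counts, edges, _ = plt.hist(data, bins=30, density=True, alpha=0.6, label="histogram")

# Normal probability density evaluated across the histogram's x range.
x = np.linspace(edges[0], edges[-1], 200)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
plt.plot(x, pdf, linewidth=2, color="red", label="normal curve")

plt.legend()
plt.show()
```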
4. Discussion
4.1 Bins are not centered on the ticks
The example in question:
import matplotlib.pyplot as plt
l = [3,3,3,2,1,4,4,5,5,5,5,5,5,5,5,5]
plt.hist(l, density=True)
plt.show()
The fix:
import matplotlib.pyplot as plt
aa = [3,3,3,2,1,4,4,5,5,5,5,5,5,5,5,5]
plt.hist(aa, density=False, bins=range(0,7), rwidth=0.5, align='left')
plt.xticks(range(7),range(7))
# plt.yticks(range(6),range(6))
plt.show()
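- bins accepts two kinds of argument. An integer is the number of bins; a sequence gives the bin edges. For example, bins=range(1,7) creates the intervals [1,2), [2,3), [3,4), [4,5), [5,6]. Every interval is half-open except the last, and each interval is one bin.
- align: where the bars sit relative to the bin edges; the default is 'mid', and here it needs to be 'left'
- rwidth: the relative width of the bars, as a fraction of the bin width
- With density=True, avoid changing rwidth casually: the bars get narrower but keep their height, so the drawn areas no longer match the proportions.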
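An alternative to align='left' for integer data is to place the bin edges at half-integers, so that each integer sits in the middle of its own bin. This is a sketch of that idea and not the method used in the post above; it reuses the same list of values.

```python
import numpy as np
import matplotlib.pyplot as plt

values = [3, 3, 3, 2, 1, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5]

# Edges at 0.5, 1.5, ..., 5.5: each integer from 1 to 5 is the center of one bin.
edges = np.arange(min(values) - 0.5, max(values) + 1.5, 1.0)
counts, edges, _ = plt.hist(values, bins=edges, rwidth=0.8)

# Bin centers, used here only to confirm that they land on the integers.
centers = (edges[:-1] + edges[1:]) / 2

plt.xticks(range(min(values), max(values) + 1))
plt.show()
```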
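5. Errors
5.1 Error 1
ValueError: range parameter must be finite.
Fix:
When hist() or histogram() raises the error above, the data most likely contain NaN values. Convert the NaN values to another numeric value with fillna() or similar, or drop them with dropna(), before plotting.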
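For plain NumPy arrays, the same cleanup can be sketched with np.isfinite, which filters out infinities as well as NaN. The input array here is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data containing a NaN and an infinity.
data = np.array([1.0, 2.5, np.nan, 3.0, np.inf, 2.0])

# Keep only the finite entries before handing the data to hist().
clean = data[np.isfinite(data)]

counts, edges, _ = plt.hist(clean, bins=3)
plt.show()
```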