
Getting Started with Pandas

Posted on By Big heiniu

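pandas is a scientific computing package for Python. Modeled on an existing blog post, this article introduces it from three angles: creating objects, viewing data, and selecting data.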

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline

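1. Creating objects

pandas has two main objects, Series and DataFrame. A Series holds one-dimensional data, while a DataFrame holds two-dimensional tabular data. Unlike a numpy array, different columns of a DataFrame can store different types, but the values within a single column are coerced to one common type.

For details, see the Data Structure Intro section.

First, build a Series. A Series can be created from a list, and pandas automatically generates an integer index for it.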
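
The difference between per-array and per-column typing can be checked directly. The following is a minimal sketch; the names arr, mixed and coerced are made up for this illustration and do not appear elsewhere in the article.

```python
import numpy as np
import pandas as pd

# A numpy array has a single dtype: mixing an int and a string
# turns every element into a string.
arr = np.array([1, "a"])

# A DataFrame keeps a separate dtype for each column.
mixed = pd.DataFrame({"num": [1, 2], "text": ["a", "b"]})

# Inside one column, values are coerced to a common type:
# an int next to a float becomes float64.
coerced = pd.Series([1, 2.5])

print(arr.dtype.kind)   # 'U' (unicode string)
print(mixed.dtypes)     # one dtype per column
print(coerced.dtype)    # float64
```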

s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
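
As a small sketch of the indexing behavior described above: the default index counts from 0, and an explicit index can be passed instead. The labeled Series and its labels are hypothetical, chosen only for the example.

```python
import numpy as np
import pandas as pd

# Default: pandas generates the integer index 0..n-1.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(list(s.index))   # [0, 1, 2, 3, 4, 5]
print(s.dtype)         # float64, because np.nan is a float

# Optional: supply your own labels through the index argument.
labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])
print(labeled["b"])    # 3
```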

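To build a DataFrame by hand, you can pass in a numpy array. Column names and the index are generated automatically, but you can also specify them at construction time.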

data = pd.DataFrame(np.random.randn(6,4))
data
          0         1         2         3
0  0.713297  0.653896 -0.830697 -0.862059
1 -0.112909 -1.308807  0.915732  0.250069
2 -0.191729  0.364527 -0.952037  0.814861
3  0.077660 -0.613967  0.103296 -1.031920
4 -0.838760 -1.064579  0.220003 -0.577011
5  1.077944  1.430014  0.962501  0.685414

df = pd.DataFrame(np.random.randn(6,4),index=list('ABCDEF'),columns=list('!@#$'))
df
          !         @         #         $
A  0.219259  0.033841  0.084150  0.794347
B -0.736359  1.438696 -0.020479  0.060730
C -2.215969 -0.153340  0.515900  0.098534
D -0.822583  0.325459 -0.134615  0.419446
E  0.126939  0.302446  0.905639  0.165117
F -0.523610  0.702762  1.023131  1.052215

You can also use a dict, which is straightforward: the dict keys become the DataFrame's column names, and the index is generated automatically.

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
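
To see how a dict-built DataFrame keeps a separate type per column, one can inspect dtypes. This is a sketch on a smaller, hypothetical frame (named frame here) that reuses some of the column definitions above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": 1.0,  # a scalar is repeated for every row
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

print(frame.shape)    # (4, 4)
print(frame.dtypes)   # float64, float32, int32, category
```

The row count comes from the list-like columns; the scalar in "A" is simply filled in to match it.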

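2. Viewing data

This section covers viewing the head and tail of a DataFrame, its column names, its index, and its summary statistics.

For details, see the Basics section.

View the top and bottom rows of the DataFrame.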

df.head()
          !         @         #         $
A  0.219259  0.033841  0.084150  0.794347
B -0.736359  1.438696 -0.020479  0.060730
C -2.215969 -0.153340  0.515900  0.098534
D -0.822583  0.325459 -0.134615  0.419446
E  0.126939  0.302446  0.905639  0.165117

df.tail(3)
          !         @         #         $
D -0.822583  0.325459 -0.134615  0.419446
E  0.126939  0.302446  0.905639  0.165117
F -0.523610  0.702762  1.023131  1.052215
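
The row counts returned by head and tail can be sketched on a deterministic frame. The frame demo and its column names are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(24).reshape(6, 4),
                    index=list("ABCDEF"), columns=list("wxyz"))

# With no argument, head() and tail() return 5 rows.
print(demo.head().shape)           # (5, 4)

# An explicit n returns that many rows from the chosen end.
print(list(demo.tail(3).index))    # ['D', 'E', 'F']

# A negative n means "everything except the last |n| rows".
print(list(demo.head(-2).index))   # ['A', 'B', 'C', 'D']
```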

View the index, the column names, and the underlying numpy data.

df.index
Index([u'A', u'B', u'C', u'D', u'E', u'F'], dtype='object')
df.columns
Index([u'!', u'@', u'#', u'$'], dtype='object')
df.values
array([[ 0.21925947,  0.03384101,  0.08414998,  0.79434669],
       [-0.73635939,  1.43869561, -0.02047871,  0.06073024],
       [-2.2159687 , -0.15333967,  0.51589987,  0.0985342 ],
       [-0.82258259,  0.32545932, -0.13461521,  0.41944583],
       [ 0.12693888,  0.3024455 ,  0.90563948,  0.16511706],
       [-0.52360974,  0.702762  ,  1.02313089,  1.05221477]])

As values shows, the underlying data is a two-dimensional numpy array, with one inner array per row of the DataFrame.

describe gives a quick statistical summary of the DataFrame.
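
A short sketch of what values returns, using a small hypothetical frame named demo. The to_numpy method shown alongside it is the accessor that newer pandas versions recommend for the same purpose.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                    index=list("ABC"), columns=["x", "y"])

raw = demo.values                 # plain numpy array, labels are dropped
print(type(raw).__name__)         # ndarray
print(raw.shape)                  # (3, 2): one inner array per row
print(raw[0])                     # [0. 1.]

same = demo.to_numpy()            # equivalent accessor in newer pandas
print(np.array_equal(raw, same))  # True
```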

df.describe()
              !         @         #         $
count  6.000000  6.000000  6.000000  6.000000
mean  -0.658720  0.441644  0.395621  0.431731
std    0.879121  0.568415  0.493894  0.408995
min   -2.215969 -0.153340 -0.134615  0.060730
25%   -0.801027  0.100992  0.005678  0.115180
50%   -0.629985  0.313952  0.300025  0.292281
75%   -0.035698  0.608436  0.808205  0.700621
max    0.219259  1.438696  1.023131  1.052215
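
Each row of the describe output corresponds to an ordinary aggregation method. The sketch below checks this on a hypothetical four-value Series named col.

```python
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0, 4.0])
summary = col.describe()

print(summary["count"])   # 4.0
print(summary["mean"])    # 2.5
print(summary["25%"])     # 1.75 (linear interpolation between 1 and 2)
print(summary["50%"])     # 2.5, same as col.median()

# "std" is the sample standard deviation (ddof=1), same as col.std().
print(summary["std"] == col.std())   # True
```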

Transpose the DataFrame.

df.T
          A         B         C         D         E         F
!  0.219259 -0.736359 -2.215969 -0.822583  0.126939 -0.523610
@  0.033841  1.438696 -0.153340  0.325459  0.302446  0.702762
#  0.084150 -0.020479  0.515900 -0.134615  0.905639  1.023131
$  0.794347  0.060730  0.098534  0.419446  0.165117  1.052215

Use sort_index with the axis parameter to sort the DataFrame by its row or column labels.

df.sort_index(axis=1, ascending=False)
          @         $         #         !
A  0.033841  0.794347  0.084150  0.219259
B  1.438696  0.060730 -0.020479 -0.736359
C -0.153340  0.098534  0.515900 -2.215969
D  0.325459  0.419446 -0.134615 -0.822583
E  0.302446  0.165117  0.905639  0.126939
F  0.702762  1.052215  1.023131 -0.523610

df.sort_index(axis=0, ascending=False)
          !         @         #         $
F -0.523610  0.702762  1.023131  1.052215
E  0.126939  0.302446  0.905639  0.165117
D -0.822583  0.325459 -0.134615  0.419446
C -2.215969 -0.153340  0.515900  0.098534
B -0.736359  1.438696 -0.020479  0.060730
A  0.219259  0.033841  0.084150  0.794347

For the axis parameter, 0 refers to the index (row labels) and 1 refers to the columns.

You can also sort by the values of a specific column with sort_values.

df.sort_values(by='!')
          !         @         #         $
C -2.215969 -0.153340  0.515900  0.098534
D -0.822583  0.325459 -0.134615  0.419446
B -0.736359  1.438696 -0.020479  0.060730
F -0.523610  0.702762  1.023131  1.052215
E  0.126939  0.302446  0.905639  0.165117
A  0.219259  0.033841  0.084150  0.794347
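
The difference between sorting by labels and sorting by values is easier to verify on a tiny frame with known contents. In this sketch the frame demo and its labels are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"b": [2, 1, 3], "a": [7, 9, 8]},
                    index=["y", "x", "z"])

by_row_label = demo.sort_index(axis=0)   # rows ordered by label: x, y, z
by_col_label = demo.sort_index(axis=1)   # columns ordered by label: a, b
by_value = demo.sort_values(by="a")      # rows ordered by column a: 7, 8, 9

print(list(by_row_label.index))     # ['x', 'y', 'z']
print(list(by_col_label.columns))   # ['a', 'b']
print(list(by_value.index))         # ['y', 'z', 'x']

# Both methods return a new object; demo itself keeps its order.
print(list(demo.index))             # ['y', 'x', 'z']
```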

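3. Selecting data

  • Using "[]" directly

    Select specific columns and rows. Rows and columns are selected differently: columns must be listed explicitly, [[xx,xx]], while rows can be sliced, [yy:yy].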

df['!'] # attribute access (df.name) is not possible here, because '!' is not a valid Python identifier
A    0.219259
B   -0.736359
C   -2.215969
D   -0.822583
E    0.126939
F   -0.523610
Name: !, dtype: float64

df[['!','@']]
          !         @
A  0.219259  0.033841
B -0.736359  1.438696
C -2.215969 -0.153340
D -0.822583  0.325459
E  0.126939  0.302446
F -0.523610  0.702762

df['A':'B']
          !         @         #         $
A  0.219259  0.033841  0.084150  0.794347
B -0.736359  1.438696 -0.020479  0.060730

df[0:3]
          !         @         #         $
A  0.219259  0.033841  0.084150  0.794347
B -0.736359  1.438696 -0.020479  0.060730
C -2.215969 -0.153340  0.515900  0.098534
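
The sketch below summarizes the "[]" rules on a hypothetical frame named demo. One detail worth noticing is that a label slice includes its end label, while a positional slice excludes its end position.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    index=list("ABCD"), columns=["x", "y", "z"])

one_col = demo["x"]           # single name -> Series
two_cols = demo[["x", "z"]]   # list of names -> DataFrame
label_rows = demo["A":"C"]    # label slice: end label "C" is included
pos_rows = demo[0:2]          # positional slice: position 2 is excluded

print(type(one_col).__name__)   # Series
print(list(two_cols.columns))   # ['x', 'z']
print(list(label_rows.index))   # ['A', 'B', 'C']
print(list(pos_rows.index))     # ['A', 'B']
```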
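
  • loc, iloc, ix, iat

    loc selects by index and column labels; iloc selects by integer position and supports negative positions such as "-1" and "-2" to count from the end; iat is like iloc but can only address a single element; ix accepts both labels and integer positions, but is slower than iloc. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the ix examples here only run on older versions; current code should use loc or iloc instead.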

df.loc['A',:]
!    0.219259
@    0.033841
#    0.084150
$    0.794347
Name: A, dtype: float64
df.iloc[0,:]
!    0.219259
@    0.033841
#    0.084150
$    0.794347
Name: A, dtype: float64
df.iat[0,1]
0.033841006173188068
df.ix[0,:]
!    0.219259
@    0.033841
#    0.084150
$    0.794347
Name: A, dtype: float64
df.ix['A',1]
0.033841006173188068
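
For pandas versions without ix, the same selections can be written with loc, iloc, iat and at. This is a sketch on a hypothetical frame named demo; the last line shows one possible way to express a mixed label/position lookup, by translating the position into a label first.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                    index=list("ABCD"), columns=["x", "y", "z"])

row_by_label = demo.loc["A", :]    # label-based row selection
row_by_pos = demo.iloc[0, :]       # position-based row selection
last_row = demo.iloc[-1, :]        # negative positions count from the end

cell_by_pos = demo.iat[0, 1]       # single element by position
cell_by_label = demo.at["A", "y"]  # single element by label

# Row by label, column by position (what df.ix['A', 1] used to do).
mixed = demo.loc["A", demo.columns[1]]

print(row_by_label.equals(row_by_pos))    # True
print(last_row.name)                      # D
print(cell_by_pos, cell_by_label, mixed)  # 1.0 1.0 1.0
```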

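Data selected this way can be modified directly, and the change is written back to the original DataFrame. Note that the example assigns the string '-1' rather than a number, so the columns are converted to object dtype and the output loses its uniform float formatting.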

df.ix['A',:] = '-1'
df
          !         @          #          $
A        -1        -1         -1         -1
B -0.736359    1.4387 -0.0204787  0.0607302
C  -2.21597  -0.15334     0.5159  0.0985342
D -0.822583  0.325459  -0.134615   0.419446
E  0.126939  0.302446   0.905639   0.165117
F  -0.52361  0.702762    1.02313    1.05221
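
As a sketch of in-place assignment that keeps the columns numeric, the example below assigns the number -1 through loc on a hypothetical frame named demo. This is one possible variant, not the exact call used above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                    index=list("ABC"), columns=["x", "y"])

demo.loc["A", :] = -1     # overwrite a whole row with a number
demo.iat[1, 0] = 100.0    # overwrite a single cell by position

print(demo.loc["A"].tolist())   # [-1.0, -1.0]
print(demo.iat[1, 0])           # 100.0
print(demo["x"].dtype)          # float64: the columns stay numeric
```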
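
  • Indexing with a boolean sequence

    The outputs below were produced from the original df, before row A was overwritten above.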

df[df['!'] > 0]
          !         @         #         $
A  0.219259  0.033841  0.084150  0.794347
E  0.126939  0.302446  0.905639  0.165117

df[df > 0]
          !         @         #         $
A  0.219259  0.033841  0.084150  0.794347
B       NaN  1.438696       NaN  0.060730
C       NaN       NaN  0.515900  0.098534
D       NaN  0.325459       NaN  0.419446
E  0.126939  0.302446  0.905639  0.165117
F       NaN  0.702762  1.023131  1.052215
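
The two kinds of boolean indexing behave differently: a boolean Series filters rows, while a boolean DataFrame keeps the shape and masks individual cells. The sketch below illustrates this on a hypothetical integer frame named demo, and also shows how two conditions might be combined.

```python
import pandas as pd

demo = pd.DataFrame({"x": [1, -2, 3], "y": [-4, 5, 6]},
                    index=list("ABC"))

# Boolean Series: keep only the rows where the condition is True.
rows = demo[demo["x"] > 0]

# Boolean DataFrame: same shape, cells that fail the test become NaN.
masked = demo[demo > 0]

# Combine conditions with & and parentheses (not the "and" keyword).
both = demo[(demo["x"] > 0) & (demo["y"] > 0)]

print(list(rows.index))              # ['A', 'C']
print(int(masked.isna().sum().sum()))  # 2
print(list(both.index))              # ['C']
```

Because NaN is a float, the integer columns in masked are converted to float64.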