Pandas For Beginners¶

#coding:utf8
%matplotlib inline

获取更多资源，请关注公众号：【鲸析】。

from IPython.display import IFrame

通常，我们首先要导入以下几个库：

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

创建对象¶

通过传递一个list来创建Series，pandas会默认创建整型索引：

s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

通过传递一个numpy array，日期索引以及列标签来创建一个DataFrame：

dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

通过传递一个能够被转换为类似series的dict对象来创建一个DataFrame:

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3]*4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

可以看到各列的数据类型为：

df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

查看数据¶

查看frame中头部和尾部的几行：

df.head()

df.tail(3)

显示索引、列名以及底层的numpy数据

df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

df.values

array([[-0.90052384, -0.30251543, -0.54176245,  1.56291588],
       [-0.8841172 , -0.65074073,  0.21734508,  0.26891483],
       [ 0.22082238,  0.79052719,  0.69217223,  0.72344092],
       [ 1.2602764 ,  1.0002968 ,  0.80980141, -0.38971272],
       [ 1.67938067,  1.46860938,  0.36064787, -0.24084994],
       [ 0.56786654,  0.23535244,  1.117395  , -0.60432593]])

describe()能对数据做一个快速统计汇总

df.describe()

对数据做转置：

df.T

按轴进行排序：

df.sort_index(axis=1, ascending=False)

按值进行排序 :

df.sort_values(by='B')

数据选择¶

注意：虽然标准的Python/Numpy的表达式能完成选择与赋值等功能，但我们仍推荐使用优化过的pandas数据访问方法：.at，.iat，.loc，.iloc和.ix

选取¶

选择某一列数据，它会返回一个Series，等同于df.A：

df['A']

2013-01-01   -0.900524
2013-01-02   -0.884117
2013-01-03    0.220822
2013-01-04    1.260276
2013-01-05    1.679381
2013-01-06    0.567867
Freq: D, Name: A, dtype: float64

通过使用[ ]进行切片选取：

df[0:3]

df['20130102':'20130104']

通过标签选取¶

通过标签进行交叉选取：

df.loc[dates[0]]

A   -0.900524
B   -0.302515
C   -0.541762
D    1.562916
Name: 2013-01-01 00:00:00, dtype: float64

使用标签对多个轴进行选取

df.loc[:,['A','B']]

df.loc[:,['A','B']][:3]

进行标签切片，包含两个端点

df.loc['20130102':'20130104',['A','B']]

对于返回的对象进行降维处理

df.loc['20130102',['A','B']]

A   -0.884117
B   -0.650741
Name: 2013-01-02 00:00:00, dtype: float64

获取一个标量

df.loc[dates[0],'A']

-0.9005238449408509

快速获取标量（与上面的方法等价）

df.at[dates[0],'A']

-0.9005238449408509

通过位置选取¶

通过传递整型的位置进行选取

df.iloc[3]

A    1.260276
B    1.000297
C    0.809801
D   -0.389713
Name: 2013-01-04 00:00:00, dtype: float64

通过整型的位置切片进行选取，与python/numpy形式相同

df.iloc[3:5,0:2]

只对行进行切片

df.iloc[1:3,:]

只对列进行切片

df.iloc[:,1:3]

只获取某个值

df.iloc[1,1]

-0.6507407272837356

快速获取某个值（与上面的方法等价）

df.iat[1,1]

-0.6507407272837356

布尔索引¶

用某列的值来选取数据

df[df.A > 0]

用where操作来选取数据

df[df > 0]

用isin()方法来过滤数据

df2 = df.copy()

df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

df2[df2['E'].isin(['two', 'four'])]

赋值¶

赋值一个新的列，通过索引来自动对齐数据

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

df['F'] = s1
df

通过标签赋值

df.at[dates[0], 'A'] = 0
df

通过位置赋值

df.iat[0,1] = 0
df

通过传递numpy array赋值

df.loc[:,'D'] = np.array([5] * len(df))
df

通过where操作来赋值

df2 = df.copy()
df2[df2 > 0] = -df2
df2

缺失值处理¶

在pandas中，用np.nan来代表缺失值，这些值默认不会参与运算。

reindex()允许你修改、增加、删除指定轴上的索引，并返回一个数据副本。

df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

剔除所有包含缺失值的行数据

df1.dropna(how='any')

填充缺失值

df1.fillna(value=5)

获取值是否为nan的布尔标记

pd.isnull(df1)

运算¶

统计¶

运算过程中，通常不包含缺失值。

进行描述性统计

df.mean()

A    0.474038
B    0.474008
C    0.442600
D    5.000000
F    3.000000
dtype: float64

对其他轴进行同样的运算

df.mean(1)

2013-01-01    1.114559
2013-01-02    0.936497
2013-01-03    1.740704
2013-01-04    2.214075
2013-01-05    2.501728
2013-01-06    2.384123
Freq: D, dtype: float64

对于拥有不同维度的对象进行运算时需要对齐。除此之外，pandas会自动沿着指定维度计算。

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

df.sub(s, axis='index')

Apply 函数作用¶

通过apply()对函数作用

df.apply(np.cumsum)

df.apply(lambda x:x.max()-x.min())

A    2.563498
B    2.119350
C    1.659157
D    0.000000
F    4.000000
dtype: float64

频数统计¶

s = pd.Series(np.random.randint(0, 7, size=10))
s

0    2
1    0
2    5
3    1
4    6
5    1
6    1
7    1
8    5
9    4
dtype: int32

s.value_counts()

1    4
5    2
6    1
4    1
2    1
0    1
dtype: int64

字符串方法¶

对于Series对象，在其str属性中有着一系列的字符串处理方法。就如同下段代码一样，能很方便的对array中各个元素进行运算。值得注意的是，在str属性中的模式匹配默认使用正则表达式。

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

合并¶

Concat 连接¶

pandas中提供了大量的方法能够轻松对Series，DataFrame和Panel对象进行不同满足逻辑关系的合并操作

通过concat()来连接pandas对象

df = pd.DataFrame(np.random.randn(10,4))
df

#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

[          0         1         2         3
 0  1.560690  2.253479  1.728586  1.224112
 1 -1.237557 -1.571768 -1.687004 -0.226474
 2 -0.591146 -0.054644  0.600806  0.076132,
           0         1         2         3
 3 -0.567678  0.426496 -0.972487  0.200211
 4 -2.073311 -1.566767 -0.533602  1.366468
 5  2.244767  1.612232  1.934717 -0.403805
 6 -2.640917  0.640549  1.257238  0.043773,
           0         1         2         3
 7  1.545405  1.771884 -0.273687  2.441483
 8 -0.440476  0.567536  2.379072  1.152354
 9 -0.047853 -0.440427 -1.382389  0.647217]

pd.concat(pieces)

Join 合并¶

类似于SQL中的合并(merge)

left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1,2]})
left

right = pd.DataFrame({'key':['foo', 'foo'], 'lval':[4,5]})
right

pd.merge(left, right, on='key')

Append 添加¶

将若干行添加到dataFrame后面

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

s = df.iloc[3]
s

A    0.163904
B    1.324567
C   -0.768324
D   -0.205520
Name: 3, dtype: float64

df.append(s, ignore_index=True)

分组¶

对于“group by”操作，我们通常是指以下一个或几个步骤：

划分按照某些标准将数据分为不同的组
应用对每组数据分别执行一个函数
组合将结果组合到一个数据结构

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three', 
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

分组并对每个分组应用sum函数

df.groupby('A').sum()

按多个列分组形成层级索引，然后应用函数

df.groupby(['A','B']).sum()

变形¶

堆叠¶

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]
df2

stack()方法对DataFrame的列“压缩”一个层级

stacked = df2.stack()
stacked

first  second   
bar    one     A   -0.551894
               B    0.358046
       two     A   -0.773283
               B   -0.489393
baz    one     A   -0.461588
               B   -1.561085
       two     A    2.304456
               B   -0.241811
dtype: float64

对于一个“堆叠过的”DataFrame或者Series（拥有MultiIndex作为索引），stack()的逆操作是unstack()，默认反堆叠到上一个层级

stacked.unstack()

stacked.unstack(1)

stacked.unstack(0)

数据透视表¶

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

我们可以轻松地从这个数据得到透视表

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

时间序列¶

pandas在对频率转换进行重新采样时拥有着简单，强大而且高效的功能（例如把按秒采样的数据转换为按5分钟采样的数据）。这在金融领域很常见，但又不限于此。

rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng

DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04', '2012-01-01 00:00:05',
               '2012-01-01 00:00:06', '2012-01-01 00:00:07',
               '2012-01-01 00:00:08', '2012-01-01 00:00:09',
               '2012-01-01 00:00:10', '2012-01-01 00:00:11',
               '2012-01-01 00:00:12', '2012-01-01 00:00:13',
               '2012-01-01 00:00:14', '2012-01-01 00:00:15',
               '2012-01-01 00:00:16', '2012-01-01 00:00:17',
               '2012-01-01 00:00:18', '2012-01-01 00:00:19',
               '2012-01-01 00:00:20', '2012-01-01 00:00:21',
               '2012-01-01 00:00:22', '2012-01-01 00:00:23',
               '2012-01-01 00:00:24', '2012-01-01 00:00:25',
               '2012-01-01 00:00:26', '2012-01-01 00:00:27',
               '2012-01-01 00:00:28', '2012-01-01 00:00:29',
               '2012-01-01 00:00:30', '2012-01-01 00:00:31',
               '2012-01-01 00:00:32', '2012-01-01 00:00:33',
               '2012-01-01 00:00:34', '2012-01-01 00:00:35',
               '2012-01-01 00:00:36', '2012-01-01 00:00:37',
               '2012-01-01 00:00:38', '2012-01-01 00:00:39',
               '2012-01-01 00:00:40', '2012-01-01 00:00:41',
               '2012-01-01 00:00:42', '2012-01-01 00:00:43',
               '2012-01-01 00:00:44', '2012-01-01 00:00:45',
               '2012-01-01 00:00:46', '2012-01-01 00:00:47',
               '2012-01-01 00:00:48', '2012-01-01 00:00:49',
               '2012-01-01 00:00:50', '2012-01-01 00:00:51',
               '2012-01-01 00:00:52', '2012-01-01 00:00:53',
               '2012-01-01 00:00:54', '2012-01-01 00:00:55',
               '2012-01-01 00:00:56', '2012-01-01 00:00:57',
               '2012-01-01 00:00:58', '2012-01-01 00:00:59',
               '2012-01-01 00:01:00', '2012-01-01 00:01:01',
               '2012-01-01 00:01:02', '2012-01-01 00:01:03',
               '2012-01-01 00:01:04', '2012-01-01 00:01:05',
               '2012-01-01 00:01:06', '2012-01-01 00:01:07',
               '2012-01-01 00:01:08', '2012-01-01 00:01:09',
               '2012-01-01 00:01:10', '2012-01-01 00:01:11',
               '2012-01-01 00:01:12', '2012-01-01 00:01:13',
               '2012-01-01 00:01:14', '2012-01-01 00:01:15',
               '2012-01-01 00:01:16', '2012-01-01 00:01:17',
               '2012-01-01 00:01:18', '2012-01-01 00:01:19',
               '2012-01-01 00:01:20', '2012-01-01 00:01:21',
               '2012-01-01 00:01:22', '2012-01-01 00:01:23',
               '2012-01-01 00:01:24', '2012-01-01 00:01:25',
               '2012-01-01 00:01:26', '2012-01-01 00:01:27',
               '2012-01-01 00:01:28', '2012-01-01 00:01:29',
               '2012-01-01 00:01:30', '2012-01-01 00:01:31',
               '2012-01-01 00:01:32', '2012-01-01 00:01:33',
               '2012-01-01 00:01:34', '2012-01-01 00:01:35',
               '2012-01-01 00:01:36', '2012-01-01 00:01:37',
               '2012-01-01 00:01:38', '2012-01-01 00:01:39'],
              dtype='datetime64[ns]', freq='S')

ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts

2012-01-01 00:00:00    100
2012-01-01 00:00:01     47
2012-01-01 00:00:02    473
2012-01-01 00:00:03    136
2012-01-01 00:00:04    192
2012-01-01 00:00:05    372
2012-01-01 00:00:06    459
2012-01-01 00:00:07    450
2012-01-01 00:00:08    244
2012-01-01 00:00:09    383
2012-01-01 00:00:10    250
2012-01-01 00:00:11     53
2012-01-01 00:00:12    364
2012-01-01 00:00:13    139
2012-01-01 00:00:14    239
2012-01-01 00:00:15    494
2012-01-01 00:00:16    370
2012-01-01 00:00:17    329
2012-01-01 00:00:18     48
2012-01-01 00:00:19    278
2012-01-01 00:00:20    100
2012-01-01 00:00:21    443
2012-01-01 00:00:22    348
2012-01-01 00:00:23     25
2012-01-01 00:00:24    257
2012-01-01 00:00:25    245
2012-01-01 00:00:26     60
2012-01-01 00:00:27     12
2012-01-01 00:00:28    312
2012-01-01 00:00:29    154
                      ... 
2012-01-01 00:01:10     31
2012-01-01 00:01:11    225
2012-01-01 00:01:12    364
2012-01-01 00:01:13    171
2012-01-01 00:01:14    351
2012-01-01 00:01:15    459
2012-01-01 00:01:16    363
2012-01-01 00:01:17    369
2012-01-01 00:01:18    421
2012-01-01 00:01:19    389
2012-01-01 00:01:20    408
2012-01-01 00:01:21     63
2012-01-01 00:01:22     55
2012-01-01 00:01:23    368
2012-01-01 00:01:24    439
2012-01-01 00:01:25     14
2012-01-01 00:01:26    220
2012-01-01 00:01:27    250
2012-01-01 00:01:28    495
2012-01-01 00:01:29    391
2012-01-01 00:01:30     79
2012-01-01 00:01:31    408
2012-01-01 00:01:32    232
2012-01-01 00:01:33    413
2012-01-01 00:01:34    373
2012-01-01 00:01:35    355
2012-01-01 00:01:36    308
2012-01-01 00:01:37     51
2012-01-01 00:01:38    375
2012-01-01 00:01:39    185
Freq: S, Length: 100, dtype: int32

ts.resample('5Min', how='sum')

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).sum()
  """Entry point for launching an IPython kernel.

2012-01-01    26203
Freq: 5T, dtype: int32

时区表示

rng = pd.date_range('3/6/2012', periods=5, freq='D')
rng

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-03-06    0.523781
2012-03-07   -0.670822
2012-03-08    0.934826
2012-03-09    0.002239
2012-03-10   -0.091952
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00    0.523781
2012-03-07 00:00:00+00:00   -0.670822
2012-03-08 00:00:00+00:00    0.934826
2012-03-09 00:00:00+00:00    0.002239
2012-03-10 00:00:00+00:00   -0.091952
Freq: D, dtype: float64

时区转换

ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00    0.523781
2012-03-06 19:00:00-05:00   -0.670822
2012-03-07 19:00:00-05:00    0.934826
2012-03-08 19:00:00-05:00    0.002239
2012-03-09 19:00:00-05:00   -0.091952
Freq: D, dtype: float64

时间跨度转换

rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng

DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')

ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31    1.296132
2012-02-29    1.023936
2012-03-31   -0.249774
2012-04-30    1.007810
2012-05-31   -0.051413
Freq: M, dtype: float64

ps = ts.to_period()
ps

2012-01    1.296132
2012-02    1.023936
2012-03   -0.249774
2012-04    1.007810
2012-05   -0.051413
Freq: M, dtype: float64

ps.to_timestamp()

2012-01-01    1.296132
2012-02-01    1.023936
2012-03-01   -0.249774
2012-04-01    1.007810
2012-05-01   -0.051413
Freq: MS, dtype: float64

日期与时间戳之间的转换使得可以使用一些方便的算术函数。例如，我们把以11月为年底的季度数据转换为当前季度末月底为始的数据

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng

PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]', freq='Q-NOV')

ts = pd.Series(np.random.randn(len(prng)), index = prng)
ts

1990Q1    0.467059
1990Q2   -0.745741
1990Q3    0.106234
1990Q4   -0.306761
1991Q1   -2.012034
1991Q2   -2.033417
1991Q3   -1.116709
1991Q4    0.031495
1992Q1    0.407074
1992Q2    0.988878
1992Q3   -0.195428
1992Q4   -0.144770
1993Q1    0.733104
1993Q2   -0.742351
1993Q3   -0.464713
1993Q4   -1.247317
1994Q1   -0.108473
1994Q2    0.188728
1994Q3    0.052446
1994Q4   -0.004046
1995Q1    1.075490
1995Q2   -0.436009
1995Q3    0.747165
1995Q4    2.433366
1996Q1   -0.471832
1996Q2   -1.337675
1996Q3    0.931348
1996Q4   -0.461312
1997Q1    0.041943
1997Q2   -1.003802
1997Q3   -0.572660
1997Q4   -0.959966
1998Q1   -1.864559
1998Q2   -2.196461
1998Q3   -1.317999
1998Q4   -0.796643
1999Q1    1.183292
1999Q2   -0.046480
1999Q3   -0.727018
1999Q4   -0.972557
2000Q1    0.541060
2000Q2    1.221401
2000Q3   -1.407534
2000Q4   -0.773754
Freq: Q-NOV, dtype: float64

ts.index = (prng.asfreq('M', 'end') ) .asfreq('H', 'start') +9
ts

1990-02-01 09:00    0.467059
1990-05-01 09:00   -0.745741
1990-08-01 09:00    0.106234
1990-11-01 09:00   -0.306761
1991-02-01 09:00   -2.012034
1991-05-01 09:00   -2.033417
1991-08-01 09:00   -1.116709
1991-11-01 09:00    0.031495
1992-02-01 09:00    0.407074
1992-05-01 09:00    0.988878
1992-08-01 09:00   -0.195428
1992-11-01 09:00   -0.144770
1993-02-01 09:00    0.733104
1993-05-01 09:00   -0.742351
1993-08-01 09:00   -0.464713
1993-11-01 09:00   -1.247317
1994-02-01 09:00   -0.108473
1994-05-01 09:00    0.188728
1994-08-01 09:00    0.052446
1994-11-01 09:00   -0.004046
1995-02-01 09:00    1.075490
1995-05-01 09:00   -0.436009
1995-08-01 09:00    0.747165
1995-11-01 09:00    2.433366
1996-02-01 09:00   -0.471832
1996-05-01 09:00   -1.337675
1996-08-01 09:00    0.931348
1996-11-01 09:00   -0.461312
1997-02-01 09:00    0.041943
1997-05-01 09:00   -1.003802
1997-08-01 09:00   -0.572660
1997-11-01 09:00   -0.959966
1998-02-01 09:00   -1.864559
1998-05-01 09:00   -2.196461
1998-08-01 09:00   -1.317999
1998-11-01 09:00   -0.796643
1999-02-01 09:00    1.183292
1999-05-01 09:00   -0.046480
1999-08-01 09:00   -0.727018
1999-11-01 09:00   -0.972557
2000-02-01 09:00    0.541060
2000-05-01 09:00    1.221401
2000-08-01 09:00   -1.407534
2000-11-01 09:00   -0.773754
Freq: H, dtype: float64

分类¶

从版本0.15开始，pandas在DataFrame中开始包括分类数据。

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'e', 'e']})
df

把raw_grade转换为分类类型

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

重命名类别名为更有意义的名称

df["grade"].cat.categories = ["very good", "good", "very bad"]

对分类重新排序，并添加缺失的分类

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

排序是按照分类的顺序进行的，而不是字典序

df.sort_values(by="grade")

按分类分组时，也会显示空的分类

df.groupby("grade").size()

grade
very bad     2
bad          0
medium       0
good         2
very good    2
dtype: int64

绘图¶

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2472e962d68>

对于DataFrame类型，plot()能很方便地画出所有列及其标签

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

<matplotlib.legend.Legend at 0x247309914e0>

<matplotlib.figure.Figure at 0x24730a068d0>

获取数据的I/O¶

CSV¶

写入一个csv文件

df.to_csv('data/foo.csv')

从一个csv文件读入

pd.read_csv('data/foo.csv')

HDF5¶

HDFStores的读写

写入一个HDF5 Store

df.to_hdf('data/foo.h5', 'df')

从一个HDF5 Store读入

pd.read_hdf('data/foo.h5', 'df')

Excel¶

MS Excel的读写

写入一个Excel文件

df.to_excel('data/foo.xlsx', sheet_name='Sheet1')

从一个excel文件读入

pd.read_excel('data/foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.323951	0.423588	0.442600	0.220064
std	1.071708	0.809463	0.579465	0.815219
min	-0.900524	-0.650741	-0.541762	-0.604326
25%	-0.607882	-0.168048	0.253171	-0.352497
50%	0.394344	0.512940	0.526410	0.014032
75%	1.087174	0.947854	0.780394	0.609809
max	1.679381	1.468609	1.117395	1.562916

	A	B	C	D	F
2013-01-01	0.000000	0.000000	-0.541762	-5	NaN
2013-01-02	-0.884117	-0.650741	-0.217345	-5	-1.0
2013-01-03	-0.220822	-0.790527	-0.692172	-5	-2.0
2013-01-04	-1.260276	-1.000297	-0.809801	-5	-3.0
2013-01-05	-1.679381	-1.468609	-0.360648	-5	-4.0
2013-01-06	-0.567867	-0.235352	-1.117395	-5	-5.0

	A	B	C	D
0	0.415810	-1.106857	-0.687920	2.422911
1	0.696149	-1.235975	0.201409	1.424596
2	-0.540622	0.121096	-0.861667	-0.171690
3	0.163904	1.324567	-0.768324	-0.205520
4	-1.581152	-0.079061	0.251810	-0.195755
5	1.254246	1.604556	0.766464	-1.090743
6	0.608609	1.000765	-0.407980	0.034970
7	-3.111914	2.163344	0.619885	-0.705518

	A	B	C	D
0	0.415810	-1.106857	-0.687920	2.422911
1	0.696149	-1.235975	0.201409	1.424596
2	-0.540622	0.121096	-0.861667	-0.171690
3	0.163904	1.324567	-0.768324	-0.205520
4	-1.581152	-0.079061	0.251810	-0.195755
5	1.254246	1.604556	0.766464	-1.090743
6	0.608609	1.000765	-0.407980	0.034970
7	-3.111914	2.163344	0.619885	-0.705518
8	0.163904	1.324567	-0.768324	-0.205520

	C	D
A
bar	-0.565344	1.886637
foo	2.226542	2.122855

	A	B	C	D
2013-01-01	-0.900524	-0.302515	-0.541762	1.562916
2013-01-02	-0.884117	-0.650741	0.217345	0.268915
2013-01-03	0.220822	0.790527	0.692172	0.723441
2013-01-04	1.260276	1.000297	0.809801	-0.389713
2013-01-05	1.679381	1.468609	0.360648	-0.240850
2013-01-06	0.567867	0.235352	1.117395	-0.604326

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

	A	B	C	D	F	E
2013-01-01	False	False	False	False	True	False
2013-01-02	False	False	False	False	False	False
2013-01-03	False	False	False	False	False	True
2013-01-04	False	False	False	False	False	True

	0	1	2	3
0	1.560690	2.253479	1.728586	1.224112
1	-1.237557	-1.571768	-1.687004	-0.226474
2	-0.591146	-0.054644	0.600806	0.076132
3	-0.567678	0.426496	-0.972487	0.200211
4	-2.073311	-1.566767	-0.533602	1.366468
5	2.244767	1.612232	1.934717	-0.403805
6	-2.640917	0.640549	1.257238	0.043773
7	1.545405	1.771884	-0.273687	2.441483
8	-0.440476	0.567536	2.379072	1.152354
9	-0.047853	-0.440427	-1.382389	0.647217

	A	B	C	D
0	foo	one	0.190663	0.589384
1	bar	one	1.056331	0.035044
2	foo	two	0.723645	0.372672
3	bar	three	-1.306869	0.435296
4	foo	two	0.673661	-1.292242
5	bar	two	-0.823728	0.837556
6	foo	one	0.638573	2.453041
7	bar	three	0.508922	0.578740

		A	B
first	second
bar	one	-0.551894	0.358046
bar	two	-0.773283	-0.489393
baz	one	-0.461588	-1.561085
baz	two	2.304456	-0.241811

	A	B	C	D	E
0	one	A	foo	-0.169373	-1.070179
1	one	B	foo	-0.484603	-1.627113
2	two	C	foo	-1.289207	0.635274
3	three	A	bar	0.034032	-0.553259
4	one	B	bar	0.536531	-0.088784
5	one	C	bar	-0.977931	-0.211193
6	two	A	foo	1.486196	-0.377266
7	three	B	foo	-1.108725	0.050407
8	one	C	foo	0.562988	-0.195356
9	one	A	bar	0.510381	-2.050069
10	two	B	bar	0.005697	0.708523
11	three	C	bar	0.942276	0.769442

	id	raw_grade	grade
4	5	e	very bad
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good

	Unnamed: 0	A	B	C	D
0	2000-01-01	-0.547915	0.758504	-0.791553	-0.014276
1	2000-01-02	-1.478652	-1.824863	0.001416	-1.067864
2	2000-01-03	-3.070727	-1.655967	-0.551411	-2.188238
3	2000-01-04	-3.317105	-1.810778	0.354000	-1.995760
4	2000-01-05	-2.822308	-2.757077	1.210057	-1.250592
5	2000-01-06	-2.223865	-3.256931	3.658672	-1.232510
6	2000-01-07	-0.559411	-2.842820	4.860993	-0.418853
7	2000-01-08	-0.524812	-2.748762	6.280814	0.170616
8	2000-01-09	0.204528	-0.865022	5.560456	0.875871
9	2000-01-10	0.688895	-1.704024	5.226062	0.444483
10	2000-01-11	1.272138	-2.478377	2.763524	-1.229797
11	2000-01-12	1.255657	-2.994073	2.869688	0.745504
12	2000-01-13	-0.001803	-2.295823	2.738087	2.640307
13	2000-01-14	0.144209	-0.926810	4.087308	0.164138
14	2000-01-15	1.304811	0.625530	3.542700	-1.499446
15	2000-01-16	0.682026	0.552441	3.201389	-0.155674
16	2000-01-17	-1.065321	-0.339092	3.792634	-1.754946
17	2000-01-18	-1.268418	-3.512773	3.656896	-0.132124
18	2000-01-19	-1.936390	-1.069188	2.554887	0.413930
19	2000-01-20	-1.138678	-2.993870	3.094198	-0.956574
20	2000-01-21	-0.096721	-4.141108	3.341248	-0.436886
21	2000-01-22	-0.678763	-5.012663	0.823576	1.303377
22	2000-01-23	-2.498525	-6.075938	0.302908	-0.294204
23	2000-01-24	-0.937785	-6.768487	-0.078521	-0.165799
24	2000-01-25	-0.252338	-6.882860	0.545684	-1.108641
25	2000-01-26	-0.772653	-7.080129	0.764647	-3.041450
26	2000-01-27	-0.348831	-6.035375	0.507673	-2.522836
27	2000-01-28	1.751721	-5.756738	-0.469014	-2.825118
28	2000-01-29	0.799938	-4.559172	-0.086763	-3.243101
29	2000-01-30	1.149650	-4.433331	1.476689	-2.479956
...	...	...	...	...	...
970	2002-08-28	0.466756	38.780786	-6.743120	-15.206051
971	2002-08-29	1.903199	36.687258	-5.720742	-14.335689
972	2002-08-30	1.432801	37.186065	-6.274877	-14.879621
973	2002-08-31	2.218021	36.200395	-5.638178	-15.625389
974	2002-09-01	3.405318	36.987365	-4.441912	-14.162031
975	2002-09-02	3.854252	38.283545	-4.845949	-15.449302
976	2002-09-03	4.328033	37.123804	-6.006344	-14.158929
977	2002-09-04	5.219103	37.203486	-6.035182	-14.461389
978	2002-09-05	4.884547	38.192836	-6.447570	-13.541204
979	2002-09-06	5.817435	37.669998	-5.983732	-12.238303
980	2002-09-07	7.953334	37.503576	-4.783659	-12.092902
981	2002-09-08	8.575181	37.377622	-3.154240	-12.895208
982	2002-09-09	8.242240	38.155500	-3.285132	-13.483775
983	2002-09-10	9.699693	37.421024	-1.763105	-13.397276
984	2002-09-11	7.228027	38.070563	-1.250151	-14.455654
985	2002-09-12	7.276336	38.503057	-1.110047	-15.362900
986	2002-09-13	6.955453	37.038613	-0.571942	-14.527590
987	2002-09-14	6.232800	37.254276	-1.893050	-14.385401
988	2002-09-15	5.331802	36.671504	-1.782282	-13.645305
989	2002-09-16	5.825638	36.920733	-2.718137	-13.875984
990	2002-09-17	5.972137	36.785577	-3.923763	-13.281418
991	2002-09-18	5.154685	34.789858	-4.426825	-13.805227
992	2002-09-19	4.228035	34.400350	-4.431980	-14.520423
993	2002-09-20	3.047640	36.244469	-5.093584	-15.106412
994	2002-09-21	2.984361	38.095349	-5.560222	-16.584216
995	2002-09-22	3.187098	37.169742	-6.132250	-14.910771
996	2002-09-23	1.772470	36.399488	-6.257600	-13.668261
997	2002-09-24	1.886272	34.543307	-6.644915	-15.213739
998	2002-09-25	2.918438	33.722114	-5.736130	-14.919711
999	2002-09-26	2.608268	33.473087	-5.111581	-15.969800

	A	B	C	D	F
2013-01-01	0.000000	0.000000	-0.541762	5	NaN
2013-01-02	-0.884117	-0.650741	-0.324417	10	1.0
2013-01-03	-0.663295	0.139786	0.367755	15	3.0
2013-01-04	0.596982	1.140083	1.177556	20	6.0
2013-01-05	2.276362	2.608693	1.538204	25	10.0
2013-01-06	2.844229	2.844045	2.655599	30	15.0

	key	lval
0	foo	1
1	foo	2

	key	lval
0	foo	4
1	foo	5