# 10 Minutes to pandas

Pandas was originally developed as a tool for financial data analysis and is widely used in the finance industry. It bundles a large number of libraries and several standard data models, and provides the tools, functions, and methods needed to work with large datasets efficiently.

This article is a quick-start guide for pandas beginners. You can learn and practice it step by step in the platform's strategy editor (AI量化平台 - 编写策略).

# Import the libraries
import pandas as pd
import numpy as np

# Pandas data structures

The two main data structures are Series and DataFrame. A Series is a one-dimensional, array-like object made up of a sequence of values plus an associated array of data labels, called its index. A DataFrame is a tabular data structure that can simply be thought of as several Series sharing the same index. The DataFrame is the structure used most often in financial data analysis, and the one you will encounter most on this platform.
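
To make that last point concrete, here is a minimal, self-contained sketch (synthetic numbers, not platform data) that builds a DataFrame from two Series sharing one DatetimeIndex; each column of the resulting DataFrame is itself a Series:

import pandas as pd
import numpy as np

idx = pd.date_range('2017-02-06', periods=5, freq='D')
close = pd.Series([10.0, 10.2, 10.1, 9.9, 10.3], index=idx)   # hypothetical prices
volume = pd.Series([1000, 1500, 1200, 900, 1100], index=idx)  # hypothetical volumes

demo = pd.DataFrame({'close': close, 'volume': volume})  # two Series, one shared index
type(demo['close'])  # <class 'pandas.core.series.Series'> -- a column is a Series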

# Create a Series with dates as the index
dates = pd.date_range('2/6/2017', periods=5, freq='D')
s = pd.Series([1, 3, 5, np.nan, 6], index=dates)
s
2017-02-06    1.0
2017-02-07    3.0
2017-02-08    5.0
2017-02-09    NaN
2017-02-10    6.0
Freq: D, dtype: float64
# Create a DataFrame with dates as the index
df = pd.DataFrame(np.random.randn(5,4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2017-02-06 -1.738581 -1.750260 -0.532282 -0.409738
2017-02-07 -0.871625  0.188268 -0.689427  1.128257
2017-02-08  1.504632 -0.229489 -2.775763  1.867307
2017-02-09 -1.753426  1.252952  0.053105  0.870436
2017-02-10 -1.162485 -0.376172  1.548791 -0.928139
# Fetch data from the platform
df = D.history_data(['000005.SZA'],start_date='2017-01-06',end_date='2017-01-16',
                    fields=['open','high','low','close'], groupped_by_instrument=False)
# Set the date column as the index
df = df.set_index('date')
df
            instrument       open       high        low      close
date
2017-01-06  000005.SZA  64.502495  66.077988  64.224464  65.151230
2017-01-09  000005.SZA  64.502495  65.892639  64.409821  65.521935
2017-01-10  000005.SZA  65.521935  65.614609  64.965874  64.965874
2017-01-11  000005.SZA  64.965874  64.965874  63.575737  63.575737
2017-01-12  000005.SZA  63.205032  63.761089  62.741653  62.927006
2017-01-13  000005.SZA  62.927006  63.205032  61.907570  61.907570
2017-01-16  000005.SZA  62.185596  62.278271  56.810387  59.683346

# Viewing data

# Show the first rows of the DataFrame (5 by default)
df.head()
            instrument       open       high        low      close
date
2017-01-06  000005.SZA  64.502495  66.077988  64.224464  65.151230
2017-01-09  000005.SZA  64.502495  65.892639  64.409821  65.521935
2017-01-10  000005.SZA  65.521935  65.614609  64.965874  64.965874
2017-01-11  000005.SZA  64.965874  64.965874  63.575737  63.575737
2017-01-12  000005.SZA  63.205032  63.761089  62.741653  62.927006
# Show the last rows of the DataFrame (here the last 2)
df.tail(2)
            instrument       open       high        low      close
date
2017-01-13  000005.SZA  62.927006  63.205032  61.907570  61.907570
2017-01-16  000005.SZA  62.185596  62.278271  56.810387  59.683346

# Show the index, the column names, and the underlying NumPy data

# The index
df.index
DatetimeIndex(['2017-01-06', '2017-01-09', '2017-01-10', '2017-01-11',
               '2017-01-12', '2017-01-13', '2017-01-16'],
              dtype='datetime64[ns]', name='date', freq=None)
# The column names
df.columns
Index(['instrument', 'open', 'high', 'low', 'close'], dtype='object')
# The underlying NumPy data
df.values
array([['000005.SZA', 64.50249481201172, 66.07798767089844,
        64.2244644165039, 65.15122985839844],
       ['000005.SZA', 64.50249481201172, 65.89263916015625,
        64.40982055664062, 65.52193450927734],
       ['000005.SZA', 65.52193450927734, 65.61460876464844,
        64.96587371826172, 64.96587371826172],
       ['000005.SZA', 64.96587371826172, 64.96587371826172,
        63.57573699951172, 63.57573699951172],
       ['000005.SZA', 63.20503234863281, 63.76108932495117,
        62.74165344238281, 62.927005767822266],
       ['000005.SZA', 62.927005767822266, 63.20503234863281,
        61.907569885253906, 61.907569885253906],
       ['000005.SZA', 62.18559646606445, 62.27827072143555,
        56.810386657714844, 59.683345794677734]], dtype=object)
# The data types
df.dtypes
instrument     object
open          float32
high          float32
low           float32
close         float32
dtype: object

# Transpose, sort, shift

# Transpose
df.T
date        2017-01-06  2017-01-09  2017-01-10  2017-01-11  2017-01-12  2017-01-13  2017-01-16
instrument  000005.SZA  000005.SZA  000005.SZA  000005.SZA  000005.SZA  000005.SZA  000005.SZA
open           64.5025     64.5025     65.5219     64.9659      63.205      62.927     62.1856
high            66.078     65.8926     65.6146     64.9659     63.7611      63.205     62.2783
low            64.2245     64.4098     64.9659     63.5757     62.7417     61.9076     56.8104
close          65.1512     65.5219     64.9659     63.5757      62.927     61.9076     59.6833
# Sort along an axis
df.sort_index(axis=1, ascending=False)
                 open        low  instrument       high      close
date
2017-01-06  64.502495  64.224464  000005.SZA  66.077988  65.151230
2017-01-09  64.502495  64.409821  000005.SZA  65.892639  65.521935
2017-01-10  65.521935  64.965874  000005.SZA  65.614609  64.965874
2017-01-11  64.965874  63.575737  000005.SZA  64.965874  63.575737
2017-01-12  63.205032  62.741653  000005.SZA  63.761089  62.927006
2017-01-13  62.927006  61.907570  000005.SZA  63.205032  61.907570
2017-01-16  62.185596  56.810387  000005.SZA  62.278271  59.683346
# Sort by values
df.sort_values(by='close', ascending=False)  # ascending defaults to True; ascending=False sorts in descending order
            instrument       open       high        low      close
date
2017-01-09  000005.SZA  64.502495  65.892639  64.409821  65.521935
2017-01-06  000005.SZA  64.502495  66.077988  64.224464  65.151230
2017-01-10  000005.SZA  65.521935  65.614609  64.965874  64.965874
2017-01-11  000005.SZA  64.965874  64.965874  63.575737  63.575737
2017-01-12  000005.SZA  63.205032  63.761089  62.741653  62.927006
2017-01-13  000005.SZA  62.927006  63.205032  61.907570  61.907570
2017-01-16  000005.SZA  62.185596  62.278271  56.810387  59.683346

# Shifting data with shift (think of it as looking back in time)

# Compute daily returns with pct_change
df = D.history_data(['000001.SZA'],'2010-01-01','2016-12-31',fields=['close'],groupped_by_instrument=False).set_index('date')['close']
rets = df.pct_change()
rets.head()
date
2010-01-04         NaN
2010-01-05   -0.017292
2010-01-06   -0.017167
2010-01-07   -0.010917
2010-01-08   -0.002208
Name: close, dtype: float32
# Compute daily returns, equivalent to pct_change
daily_returns = df / df.shift(1) - 1  # a positive argument shifts values forward, a negative one shifts them backward
daily_returns.head()
date
2010-01-04         NaN
2010-01-05   -0.017292
2010-01-06   -0.017167
2010-01-07   -0.010917
2010-01-08   -0.002208
Name: close, dtype: float32
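
As a quick sanity check that the two approaches agree, here is a small self-contained sketch on synthetic prices (not platform data):

import pandas as pd
import numpy as np

prices = pd.Series([10.0, 10.5, 10.2, 10.8],
                   index=pd.date_range('2017-01-02', periods=4, freq='D'))  # hypothetical prices
r1 = prices.pct_change()           # built-in daily returns
r2 = prices / prices.shift(1) - 1  # shift-based formula
np.allclose(r1.dropna(), r2.dropna())  # True -- both give the same returns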

# Data selection

# Selection

df = D.history_data(['000005.SZA'],start_date='2017-01-06',end_date='2017-02-10',
                    fields=['open','high','low','close'], groupped_by_instrument=False)
df = df.set_index('date')

# Select a single column; this returns a Series and is equivalent to df.close:
df['close']
date
2017-01-06    65.151230
2017-01-09    65.521935
2017-01-10    64.965874
2017-01-11    63.575737
2017-01-12    62.927006
2017-01-13    61.907570
2017-01-16    59.683346
2017-01-17    60.054047
2017-01-18    60.702782
2017-01-19    59.683346
2017-01-20    60.239399
2017-01-23    60.517429
2017-01-24    60.332077
2017-01-25    60.424751
2017-01-26    60.795456
2017-02-03    60.517429
2017-02-06    60.795456
2017-02-07    61.166161
2017-02-08    61.073483
2017-02-09    61.444187
2017-02-10    61.536865
Name: close, dtype: float32
# Get all data from 2017-02-06 to 2017-02-09
df['20170206':'20170209']
            instrument       open       high        low      close
date
2017-02-06  000005.SZA  60.517429  60.795456  60.146725  60.795456
2017-02-07  000005.SZA  60.795456  61.629539  60.424751  61.166161
2017-02-08  000005.SZA  60.980808  61.166161  60.424751  61.073483
2017-02-09  000005.SZA  61.073483  61.536865  60.888130  61.444187
# Get all data for 2017-02-06
df.ix['2017-02-06':'2017-02-06', :]
            instrument       open       high        low      close
date
2017-02-06  000005.SZA  60.517429  60.795456  60.146725  60.795456
# Get the open and close prices for 2017-02-06
df.ix['2017-02-06':'2017-02-06',['open','close']]
                 open      close
date
2017-02-06  60.517429  60.795456
# Get the open and close prices between 2017-02-06 and 2017-02-09
df.ix['2017-02-06':'2017-02-09',['open','close']]
                 open      close
date
2017-02-06  60.517429  60.795456
2017-02-07  60.795456  61.166161
2017-02-08  60.980808  61.073483
2017-02-09  61.073483  61.444187
# Get the open price for 2017-02-06, returned as a DataFrame
df.ix['2017-02-06':'2017-02-06',['open']]
                 open
date
2017-02-06  60.517429
# Get the open price for 2017-02-06, returned as a scalar
df.ix['2017-02-06','open']
60.517429
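
Note that .ix was deprecated in pandas 0.20 and removed in later releases. If you run this outside the platform on a newer pandas, the label-based .loc and position-based .iloc accessors cover the same use cases; a sketch, assuming the same df with its DatetimeIndex:

df.loc['2017-02-06':'2017-02-06']                      # all columns for one day
df.loc['2017-02-06':'2017-02-09', ['open', 'close']]   # open/close over a date range
df.loc['2017-02-06', 'open']                           # a single scalar value
df.iloc[0, 1]                                          # first row, second column ('open'), by integer position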

# Boolean indexing

# Get the trading days whose high price is above 64
df[df.high > 64]
            instrument       open       high        low      close
date
2017-01-06  000005.SZA  64.502495  66.077988  64.224464  65.151230
2017-01-09  000005.SZA  64.502495  65.892639  64.409821  65.521935
2017-01-10  000005.SZA  65.521935  65.614609  64.965874  64.965874
2017-01-11  000005.SZA  64.965874  64.965874  63.575737  63.575737
2017-01-18  000005.SZA  60.888130  64.687851  60.517429  60.702782
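
Several conditions can be combined with & (and) and | (or); each condition needs its own parentheses. A small sketch using the same df (the threshold values are only illustrative):

df[(df.high > 64) & (df.close < 65)]   # high above 64 but close below 65
df[(df.close > 65) | (df.low < 57)]    # either condition holds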

# where indexing

# Keep only the elements greater than 60 (everything else becomes NaN)
df[df > 60]
            instrument       open       high        low      close
date
2017-01-06  000005.SZA  64.502495  66.077988  64.224464  65.151230
2017-01-09  000005.SZA  64.502495  65.892639  64.409821  65.521935
2017-01-10  000005.SZA  65.521935  65.614609  64.965874  64.965874
2017-01-11  000005.SZA  64.965874  64.965874  63.575737  63.575737
2017-01-12  000005.SZA  63.205032  63.761089  62.741653  62.927006
2017-01-13  000005.SZA  62.927006  63.205032  61.907570  61.907570
2017-01-16  000005.SZA  62.185596  62.278271        NaN        NaN
2017-01-17  000005.SZA        NaN  60.424751        NaN  60.054047
2017-01-18  000005.SZA  60.888130  64.687851  60.517429  60.702782
2017-01-19  000005.SZA        NaN  60.980808        NaN        NaN
2017-01-20  000005.SZA  60.239399  60.610104        NaN  60.239399
2017-01-23  000005.SZA  60.332077  60.888130  60.146725  60.517429
2017-01-24  000005.SZA  60.610104  60.610104  60.239399  60.332077
2017-01-25  000005.SZA  60.424751  60.517429        NaN  60.424751
2017-01-26  000005.SZA  60.517429  60.795456  60.424751  60.795456
2017-02-03  000005.SZA  60.888130  60.888130  60.239399  60.517429
2017-02-06  000005.SZA  60.517429  60.795456  60.146725  60.795456
2017-02-07  000005.SZA  60.795456  61.629539  60.424751  61.166161
2017-02-08  000005.SZA  60.980808  61.166161  60.424751  61.073483
2017-02-09  000005.SZA  61.073483  61.536865  60.888130  61.444187
2017-02-10  000005.SZA  61.629539  61.629539  61.258835  61.536865

# Dropping data

# Drop a row
df.drop([pd.Timestamp('2017-01-06')])
            instrument       open       high        low      close
date
2017-01-09  000005.SZA  64.502495  65.892639  64.409821  65.521935
2017-01-10  000005.SZA  65.521935  65.614609  64.965874  64.965874
2017-01-11  000005.SZA  64.965874  64.965874  63.575737  63.575737
2017-01-12  000005.SZA  63.205032  63.761089  62.741653  62.927006
2017-01-13  000005.SZA  62.927006  63.205032  61.907570  61.907570
2017-01-16  000005.SZA  62.185596  62.278271  56.810387  59.683346
2017-01-17  000005.SZA  59.497993  60.424751  58.385880  60.054047
2017-01-18  000005.SZA  60.888130  64.687851  60.517429  60.702782
2017-01-19  000005.SZA  59.776020  60.980808  59.497993  59.683346
2017-01-20  000005.SZA  60.239399  60.610104  59.776020  60.239399
2017-01-23  000005.SZA  60.332077  60.888130  60.146725  60.517429
2017-01-24  000005.SZA  60.610104  60.610104  60.239399  60.332077
2017-01-25  000005.SZA  60.424751  60.517429  59.961372  60.424751
2017-01-26  000005.SZA  60.517429  60.795456  60.424751  60.795456
2017-02-03  000005.SZA  60.888130  60.888130  60.239399  60.517429
2017-02-06  000005.SZA  60.517429  60.795456  60.146725  60.795456
2017-02-07  000005.SZA  60.795456  61.629539  60.424751  61.166161
2017-02-08  000005.SZA  60.980808  61.166161  60.424751  61.073483
2017-02-09  000005.SZA  61.073483  61.536865  60.888130  61.444187
2017-02-10  000005.SZA  61.629539  61.629539  61.258835  61.536865
# Drop a column (del df['open'] would remove it in place)
df.drop('open', axis=1)  # returns a new DataFrame; df itself is unchanged
            instrument       high        low      close
date
2017-01-06  000005.SZA  66.077988  64.224464  65.151230
2017-01-09  000005.SZA  65.892639  64.409821  65.521935
2017-01-10  000005.SZA  65.614609  64.965874  64.965874
2017-01-11  000005.SZA  64.965874  63.575737  63.575737
2017-01-12  000005.SZA  63.761089  62.741653  62.927006
2017-01-13  000005.SZA  63.205032  61.907570  61.907570
2017-01-16  000005.SZA  62.278271  56.810387  59.683346
2017-01-17  000005.SZA  60.424751  58.385880  60.054047
2017-01-18  000005.SZA  64.687851  60.517429  60.702782
2017-01-19  000005.SZA  60.980808  59.497993  59.683346
2017-01-20  000005.SZA  60.610104  59.776020  60.239399
2017-01-23  000005.SZA  60.888130  60.146725  60.517429
2017-01-24  000005.SZA  60.610104  60.239399  60.332077
2017-01-25  000005.SZA  60.517429  59.961372  60.424751
2017-01-26  000005.SZA  60.795456  60.424751  60.795456
2017-02-03  000005.SZA  60.888130  60.239399  60.517429
2017-02-06  000005.SZA  60.795456  60.146725  60.795456
2017-02-07  000005.SZA  61.629539  60.424751  61.166161
2017-02-08  000005.SZA  61.166161  60.424751  61.073483
2017-02-09  000005.SZA  61.536865  60.888130  61.444187
2017-02-10  000005.SZA  61.629539  61.258835  61.536865
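
In newer pandas versions the same column drop can also be written with the columns= keyword (a hedged sketch, not needed on the platform's pandas version):

df.drop(columns=['open'])   # equivalent to df.drop('open', axis=1); pass inplace=True to modify df in place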

# Alignment, merging, assignment

# Data in two DataFrames is automatically aligned on their indexes
df1 = D.history_data(['000005.SZA'],start_date='2017-02-01',end_date='2017-02-10',
                    fields=['open','high','low','close'], groupped_by_instrument=False).set_index('date')

df2 = D.history_data(['000005.SZA'],start_date='2017-01-01',end_date='2017-02-10',
                    fields=['market_cap','pe_ttm'], groupped_by_instrument=False).set_index('date')
# Merge on the intersection of the two indexes
df1.merge(df2,left_index=True,right_index=True,how='inner')
           instrument_x       open       high        low      close instrument_y    market_cap     pe_ttm
date
2017-02-03   000005.SZA  60.888130  60.888130  60.239399  60.517429   000005.SZA  6.912246e+09  63.786068
2017-02-06   000005.SZA  60.517429  60.795456  60.146725  60.795456   000005.SZA  6.944002e+09  64.079109
2017-02-07   000005.SZA  60.795456  61.629539  60.424751  61.166161   000005.SZA  6.986343e+09  64.469841
2017-02-08   000005.SZA  60.980808  61.166161  60.424751  61.073483   000005.SZA  6.975758e+09  64.372154
2017-02-09   000005.SZA  61.073483  61.536865  60.888130  61.444187   000005.SZA  7.018099e+09  64.762886
2017-02-10   000005.SZA  61.629539  61.629539  61.258835  61.536865   000005.SZA  7.028685e+09  64.860565
# Merge on the union of the two indexes
df1.merge(df2,left_index=True,right_index=True,how='outer')
           instrument_x       open       high        low      close instrument_y    market_cap     pe_ttm
date
2017-01-03          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.229807e+09  66.716515
2017-01-04          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.420343e+09  68.474785
2017-01-05          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.378002e+09  68.084053
2017-01-06          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.441514e+09  68.670143
2017-01-09          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.483855e+09  69.060875
2017-01-10          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.420343e+09  68.474785
2017-01-11          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.261563e+09  67.009560
2017-01-12          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.187465e+09  66.325790
2017-01-13          NaN        NaN        NaN        NaN        NaN   000005.SZA  7.071026e+09  65.251289
2017-01-16          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.816977e+09  62.906933
2017-01-17          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.859319e+09  63.297661
2017-01-18          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.933416e+09  63.981430
2017-01-19          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.816977e+09  62.906933
2017-01-20          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.880489e+09  63.493023
2017-01-23          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.912246e+09  63.786068
2017-01-24          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.891075e+09  63.590702
2017-01-25          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.901660e+09  63.688385
2017-01-26          NaN        NaN        NaN        NaN        NaN   000005.SZA  6.944002e+09  64.079109
2017-02-03   000005.SZA  60.888130  60.888130  60.239399  60.517429   000005.SZA  6.912246e+09  63.786068
2017-02-06   000005.SZA  60.517429  60.795456  60.146725  60.795456   000005.SZA  6.944002e+09  64.079109
2017-02-07   000005.SZA  60.795456  61.629539  60.424751  61.166161   000005.SZA  6.986343e+09  64.469841
2017-02-08   000005.SZA  60.980808  61.166161  60.424751  61.073483   000005.SZA  6.975758e+09  64.372154
2017-02-09   000005.SZA  61.073483  61.536865  60.888130  61.444187   000005.SZA  7.018099e+09  64.762886
2017-02-10   000005.SZA  61.629539  61.629539  61.258835  61.536865   000005.SZA  7.028685e+09  64.860565
# Assign a value by label
df1.ix['2017-02-03','market_cap'] = 0
df1
            instrument       open       high        low      close  market_cap
date
2017-02-03  000005.SZA  60.888130  60.888130  60.239399  60.517429         0.0
2017-02-06  000005.SZA  60.517429  60.795456  60.146725  60.795456         NaN
2017-02-07  000005.SZA  60.795456  61.629539  60.424751  61.166161         NaN
2017-02-08  000005.SZA  60.980808  61.166161  60.424751  61.073483         NaN
2017-02-09  000005.SZA  61.073483  61.536865  60.888130  61.444187         NaN
2017-02-10  000005.SZA  61.629539  61.629539  61.258835  61.536865         NaN
# Assign a value by position
df1.iat[0, 0] = '600519.SHA'
df1
            instrument       open       high        low      close  market_cap
date
2017-02-03  600519.SHA  60.888130  60.888130  60.239399  60.517429         0.0
2017-02-06  000005.SZA  60.517429  60.795456  60.146725  60.795456         NaN
2017-02-07  000005.SZA  60.795456  61.629539  60.424751  61.166161         NaN
2017-02-08  000005.SZA  60.980808  61.166161  60.424751  61.073483         NaN
2017-02-09  000005.SZA  61.073483  61.536865  60.888130  61.444187         NaN
2017-02-10  000005.SZA  61.629539  61.629539  61.258835  61.536865         NaN
# Assign values with a where-style condition
df1[df1['market_cap'] == 0] = 111
df1.head(4)
            instrument        open        high         low       close  market_cap
date
2017-02-03         111  111.000000  111.000000  111.000000  111.000000       111.0
2017-02-06  000005.SZA   60.517429   60.795456   60.146725   60.795456         NaN
2017-02-07  000005.SZA   60.795456   61.629539   60.424751   61.166161         NaN
2017-02-08  000005.SZA   60.980808   61.166161   60.424751   61.073483         NaN

# Handling missing data

In pandas, np.nan represents missing data. By default, missing values are excluded from computations.
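
A tiny self-contained illustration of that behaviour (synthetic values, relying on the pd/np imports at the top of this article): reductions such as sum() and mean() skip NaN, while element-wise arithmetic propagates it.

s = pd.Series([1.0, np.nan, 3.0])
s.sum()             # 4.0 -- NaN is skipped
s.mean()            # 2.0 -- the mean of the two non-missing values
(s + 1).tolist()    # [2.0, nan, 4.0] -- NaN propagates through arithmetic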

# Use set_index to make a column the index
df2 = D.history_data(['000005.SZA'],'2016-11-15','2016-12-01',fields=['open','high','low','close'],
                     groupped_by_instrument=False).set_index('date')
df2
            instrument       open       high        low      close
date
2016-11-15  000005.SZA  69.785027  70.619110  69.507004  70.526436
2016-11-16  000005.SZA  71.360519  71.545876  69.785027  70.155731
2016-11-17  000005.SZA  69.970383  70.155731  69.136299  69.970383
2016-11-18  000005.SZA  69.970383  70.433762  69.321648  70.063057
2016-11-21  000005.SZA  69.970383  70.711784  69.692352  69.877701
2016-11-22  000005.SZA  69.970383  70.619110  69.692352  70.341087
2016-11-23  000005.SZA  70.433762  70.897141  69.785027  70.526436
2016-11-24  000005.SZA  70.155731  70.619110  69.507004  69.970383
2016-11-25  000005.SZA  69.970383  70.433762  68.672913  70.063057
2016-11-28  000005.SZA  70.433762  71.638550  69.785027  70.711784
2016-11-29  000005.SZA  70.433762  71.082489  69.228973  69.321648
2016-11-30  000005.SZA  69.043617  69.599678  68.580238  68.765594
2016-12-01  000005.SZA  68.950943  69.321648  68.672913  69.043617
# Use reset_index to turn the index back into a column
df2.reset_index()
          date  instrument       open       high        low      close
0   2016-11-15  000005.SZA  69.785027  70.619110  69.507004  70.526436
1   2016-11-16  000005.SZA  71.360519  71.545876  69.785027  70.155731
2   2016-11-17  000005.SZA  69.970383  70.155731  69.136299  69.970383
3   2016-11-18  000005.SZA  69.970383  70.433762  69.321648  70.063057
4   2016-11-21  000005.SZA  69.970383  70.711784  69.692352  69.877701
5   2016-11-22  000005.SZA  69.970383  70.619110  69.692352  70.341087
6   2016-11-23  000005.SZA  70.433762  70.897141  69.785027  70.526436
7   2016-11-24  000005.SZA  70.155731  70.619110  69.507004  69.970383
8   2016-11-25  000005.SZA  69.970383  70.433762  68.672913  70.063057
9   2016-11-28  000005.SZA  70.433762  71.638550  69.785027  70.711784
10  2016-11-29  000005.SZA  70.433762  71.082489  69.228973  69.321648
11  2016-11-30  000005.SZA  69.043617  69.599678  68.580238  68.765594
12  2016-12-01  000005.SZA  68.950943  69.321648  68.672913  69.043617
# Replace all data from 2016-11-22 onward with missing values
df2.ix['2016-11-22':,:] = np.nan
df2
            instrument       open       high        low      close
date
2016-11-15  000005.SZA  69.785027  70.619110  69.507004  70.526436
2016-11-16  000005.SZA  71.360519  71.545876  69.785027  70.155731
2016-11-17  000005.SZA  69.970383  70.155731  69.136299  69.970383
2016-11-18  000005.SZA  69.970383  70.433762  69.321648  70.063057
2016-11-21  000005.SZA  69.970383  70.711784  69.692352  69.877701
2016-11-22         NaN        NaN        NaN        NaN        NaN
2016-11-23         NaN        NaN        NaN        NaN        NaN
2016-11-24         NaN        NaN        NaN        NaN        NaN
2016-11-25         NaN        NaN        NaN        NaN        NaN
2016-11-28         NaN        NaN        NaN        NaN        NaN
2016-11-29         NaN        NaN        NaN        NaN        NaN
2016-11-30         NaN        NaN        NaN        NaN        NaN
2016-12-01         NaN        NaN        NaN        NaN        NaN
# Drop every row that contains a missing value
df2.dropna(how='any')
            instrument       open       high        low      close
date
2016-11-15  000005.SZA  69.785027  70.619110  69.507004  70.526436
2016-11-16  000005.SZA  71.360519  71.545876  69.785027  70.155731
2016-11-17  000005.SZA  69.970383  70.155731  69.136299  69.970383
2016-11-18  000005.SZA  69.970383  70.433762  69.321648  70.063057
2016-11-21  000005.SZA  69.970383  70.711784  69.692352  69.877701
# Fill missing values
df2.fillna(value=154)
            instrument        open        high         low       close
date
2016-11-15  000005.SZA   69.785027   70.619110   69.507004   70.526436
2016-11-16  000005.SZA   71.360519   71.545876   69.785027   70.155731
2016-11-17  000005.SZA   69.970383   70.155731   69.136299   69.970383
2016-11-18  000005.SZA   69.970383   70.433762   69.321648   70.063057
2016-11-21  000005.SZA   69.970383   70.711784   69.692352   69.877701
2016-11-22         154  154.000000  154.000000  154.000000  154.000000
2016-11-23         154  154.000000  154.000000  154.000000  154.000000
2016-11-24         154  154.000000  154.000000  154.000000  154.000000
2016-11-25         154  154.000000  154.000000  154.000000  154.000000
2016-11-28         154  154.000000  154.000000  154.000000  154.000000
2016-11-29         154  154.000000  154.000000  154.000000  154.000000
2016-11-30         154  154.000000  154.000000  154.000000  154.000000
2016-12-01         154  154.000000  154.000000  154.000000  154.000000
# Get a boolean mask marking which values are NaN
pd.isnull(df2)
            instrument   open   high    low  close
date
2016-11-15       False  False  False  False  False
2016-11-16       False  False  False  False  False
2016-11-17       False  False  False  False  False
2016-11-18       False  False  False  False  False
2016-11-21       False  False  False  False  False
2016-11-22        True   True   True   True   True
2016-11-23        True   True   True   True   True
2016-11-24        True   True   True   True   True
2016-11-25        True   True   True   True   True
2016-11-28        True   True   True   True   True
2016-11-29        True   True   True   True   True
2016-11-30        True   True   True   True   True
2016-12-01        True   True   True   True   True

# Operations

# describe() gives a quick statistical summary of the data
df2.describe()
             open       high        low      close
count    5.000000   5.000000   5.000000   5.000000
mean    70.211342  70.693253  69.488464  70.118668
std      0.647406   0.521793   0.265382   0.250397
min     69.785027  70.155731  69.136299  69.877701
25%     69.970383  70.433762  69.321648  69.970383
50%     69.970383  70.619110  69.507004  70.063057
75%     69.970383  70.711784  69.692352  70.155731
max     71.360519  71.545876  69.785027  70.526436

Missing values are normally excluded from computations.

# Compute the mean of each column
df2.mean()
open     70.211342
high     70.693253
low      69.488464
close    70.118668
dtype: float32
# Compute the mean of each row
df2.mean(1)
date
2016-11-15    70.109398
2016-11-16    70.711792
2016-11-17    69.808197
2016-11-18    69.947212
2016-11-21    70.063049
2016-11-22          NaN
2016-11-23          NaN
2016-11-24          NaN
2016-11-25          NaN
2016-11-28          NaN
2016-11-29          NaN
2016-11-30          NaN
2016-12-01          NaN
dtype: float32
# pandas aligns the data to the specified index
dates = ['2016-11-15','2016-11-16','2016-11-17','2016-11-18','2016-11-23']
dates = [pd.Timestamp(d) for d in dates]
pd.DataFrame(df2,index=dates)
            instrument       open       high        low      close
2016-11-15  000005.SZA  69.785027  70.619110  69.507004  70.526436
2016-11-16  000005.SZA  71.360519  71.545876  69.785027  70.155731
2016-11-17  000005.SZA  69.970383  70.155731  69.136299  69.970383
2016-11-18  000005.SZA  69.970383  70.433762  69.321648  70.063057
2016-11-23         NaN        NaN        NaN        NaN        NaN

# Apply

# Use apply to compute the average of each trading day's open/high/low/close prices
df3 = df2[['open','high','low','close']]
df3.apply(np.mean,axis=1)
date
2016-11-15    70.109398
2016-11-16    70.711792
2016-11-17    69.808197
2016-11-18    69.947212
2016-11-21    70.063049
2016-11-22          NaN
2016-11-23          NaN
2016-11-24          NaN
2016-11-25          NaN
2016-11-28          NaN
2016-11-29          NaN
2016-11-30          NaN
2016-12-01          NaN
dtype: float64
# Compute the range (max - min) of each price column over the whole period
df3.apply(lambda x: x.max() - x.min())
open     1.575493
high     1.390144
low      0.648727
close    0.648735
dtype: float64
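
If what you actually want is the per-day amplitude (high minus low for each trading day), apply the function row-wise with axis=1, or simply subtract the two columns (a sketch using the same df3; rows that were set to NaN above stay NaN):

df3.apply(lambda x: x['high'] - x['low'], axis=1)   # row-wise apply
df3['high'] - df3['low']                            # equivalent, and faster
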
# Use apply to call a string method on every element, e.g. lower-casing the exchange suffix of the instrument code
temp = df2.ix['2016-11-15':'2016-11-21']
temp['instrument'].apply(lambda x: x.lower())
date
2016-11-15    000005.sza
2016-11-16    000005.sza
2016-11-17    000005.sza
2016-11-18    000005.sza
2016-11-21    000005.sza
Name: instrument, dtype: object

# Using rolling_apply

# Fetch data
data = D.history_data(['000001.SZA'], '2016-01-01', '2016-10-01',
           fields=['close'],
           groupped_by_instrument=False).set_index('date')

# Compute the 5-day and 20-day moving averages
data['MA_5'] = pd.rolling_apply(data['close'], 5, np.mean)
data['MA_20'] = pd.rolling_apply(data['close'], 20, np.mean)
data.head(12)
            instrument       close        MA_5  MA_20
date
2016-01-04  000001.SZA  974.312988         NaN    NaN
2016-01-05  000001.SZA  980.332520         NaN    NaN
2016-01-06  000001.SZA  991.511780         NaN    NaN
2016-01-07  000001.SZA  940.775269         NaN    NaN
2016-01-08  000001.SZA  956.254211  968.637354    NaN
2016-01-11  000001.SZA  925.296326  958.834021    NaN
2016-01-12  000001.SZA  929.596008  948.686719    NaN
2016-01-13  000001.SZA  920.996643  934.583691    NaN
2016-01-14  000001.SZA  926.156250  931.659888    NaN
2016-01-15  000001.SZA  899.498108  920.308667    NaN
2016-01-18  000001.SZA  895.198425  914.289087    NaN
2016-01-19  000001.SZA  920.996643  912.569214    NaN
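
pd.rolling_apply belongs to an older pandas API and was removed in later releases. On a newer pandas, the rolling() method of a Series gives the same moving averages (a sketch, assuming the same data DataFrame):

data['MA_5'] = data['close'].rolling(5).mean()            # built-in rolling mean
data['MA_20'] = data['close'].rolling(20).apply(np.mean)  # generic rolling apply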

# Combining data

# Concat

# Concatenate pandas objects with concat()
a = df2.ix['2016-11-15':'2016-11-16']
b = df2.ix['2016-11-20':'2016-11-26']
pd.concat([a,b])
            instrument       open       high        low      close
date
2016-11-15  000005.SZA  69.785027  70.619110  69.507004  70.526436
2016-11-16  000005.SZA  71.360519  71.545876  69.785027  70.155731
2016-11-21  000005.SZA  69.970383  70.711784  69.692352  69.877701
2016-11-22         NaN        NaN        NaN        NaN        NaN
2016-11-23         NaN        NaN        NaN        NaN        NaN
2016-11-24         NaN        NaN        NaN        NaN        NaN
2016-11-25         NaN        NaN        NaN        NaN        NaN

# Join

# Join, which works much like merge
df5 = D.history_data(['000005.SZA'],start_date='2017-02-01',end_date='2017-02-10',
                    fields=['open','high','low','close'], groupped_by_instrument=False).set_index('date')

df6 = D.history_data(['000005.SZA'],start_date='2017-01-01',end_date='2017-02-10',
                    fields=['market_cap','pe_ttm'], groupped_by_instrument=False).set_index('date')

# Delete the instrument column
del df5['instrument']
del df6['instrument']

# Join the data on the date index
df5.join(df6)
                 open       high        low      close    market_cap     pe_ttm
date
2017-02-03  60.888130  60.888130  60.239399  60.517429  6.912246e+09  63.786068
2017-02-06  60.517429  60.795456  60.146725  60.795456  6.944002e+09  64.079109
2017-02-07  60.795456  61.629539  60.424751  61.166161  6.986343e+09  64.469841
2017-02-08  60.980808  61.166161  60.424751  61.073483  6.975758e+09  64.372154
2017-02-09  61.073483  61.536865  60.888130  61.444187  7.018099e+09  64.762886
2017-02-10  61.629539  61.629539  61.258835  61.536865  7.028685e+09  64.860565

# Append

# Use append to add rows to the end of a DataFrame
c = df1.ix['2017-02-03':'2017-02-06']
d = df1.ix['2017-02-08':'2017-02-10']
c.append(d)
            instrument        open        high         low       close  market_cap
date
2017-02-03         111  111.000000  111.000000  111.000000  111.000000       111.0
2017-02-06  000005.SZA   60.517429   60.795456   60.146725   60.795456         NaN
2017-02-08  000005.SZA   60.980808   61.166161   60.424751   61.073483         NaN
2017-02-09  000005.SZA   61.073483   61.536865   60.888130   61.444187         NaN
2017-02-10  000005.SZA   61.629539   61.629539   61.258835   61.536865         NaN
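
DataFrame.append was deprecated and later removed from pandas (in 2.0). On a newer pandas, pd.concat gives the same result:

pd.concat([c, d])   # same rows as c.append(d)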

# Pivot tables

Most people have probably used pivot tables in Excel. Pandas offers a similar feature, called pivot_table.

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo -2.216199  0.288395
1     one  B  foo  1.572496  0.159725
2     two  C  foo -0.559229  0.204685
3   three  A  bar  2.044115 -0.032094
4     one  B  bar  1.022778  1.650399
5     one  C  bar -0.558943  1.074547
6     two  A  foo -0.147329  0.150598
7   three  B  foo  0.659760  1.779248
8     one  C  foo  0.506593 -0.343586
9     one  A  bar  0.941787 -0.277070
10    two  B  bar  1.662566 -0.558867
11  three  C  bar -1.671826 -0.939832
# From this we can easily produce a pivot table
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C               bar       foo
A     B
one   A    0.941787 -2.216199
      B    1.022778  1.572496
      C   -0.558943  0.506593
three A    2.044115       NaN
      B         NaN  0.659760
      C   -1.671826       NaN
two   A         NaN -0.147329
      B    1.662566       NaN
      C         NaN -0.559229
# View the closing prices of several stocks per trading day (one column per stock)
df = D.history_data(['000001.SZA','000002.SZA'],'2016-12-15','2017-01-01',fields=['close'],groupped_by_instrument=False)
pd.pivot_table(df,values='close',index=['date'],columns=['instrument'])
instrument  000001.SZA   000002.SZA
date
2016-12-15  969.013855  2965.032471
2016-12-16  969.013855  2982.294189
2016-12-19  963.775940  2801.710205
2016-12-20  954.347656  2699.467529
2016-12-21  959.585571  2719.385010
2016-12-22  957.490417  2736.646729
2016-12-23  951.204956  2695.484131
2016-12-26  955.395264  2741.958008
2016-12-27  951.204956  2844.200439
2016-12-28  949.109802  2814.988281
2016-12-29  951.204956  2767.186768
2016-12-30  953.300110  2728.679688
# View the closing prices of each stock across trading days (one column per trading day)
pd.pivot_table(df,values = 'close',index=['instrument'],columns=['date'])
date         2016-12-15   2016-12-16   2016-12-19   2016-12-20   2016-12-21   2016-12-22   2016-12-23   2016-12-26   2016-12-27   2016-12-28   2016-12-29   2016-12-30
instrument
000001.SZA   969.013855   969.013855   963.775940   954.347656   959.585571   957.490417   951.204956   955.395264   951.204956   949.109802   951.204956   953.300110
000002.SZA  2965.032471  2982.294189  2801.710205  2699.467529  2719.385010  2736.646729  2695.484131  2741.958008  2844.200439  2814.988281  2767.186768  2728.679688

# Grouping

By "group by" we usually mean a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
     A      B         C         D
0  foo    one -0.160349 -0.086323
1  bar    one -0.994189 -1.444457
2  foo    two -0.095365  0.227690
3  bar  three  1.502585 -0.286937
4  foo    two -0.258329 -0.123057
5  bar    two  0.944706 -0.211243
6  foo    one  1.542726 -0.164639
7  bar  three  0.874764 -0.075149
# Group the data and apply sum to each group
df.groupby('A').sum()
            C         D
A
bar  2.327866 -2.017786
foo  1.028683 -0.146328
# Group by several columns to form a hierarchical index, then apply a function
df.groupby(['A','B']).sum()
                  C         D
A   B
bar one   -0.994189 -1.444457
    three  2.377349 -0.362086
    two    0.944706 -0.211243
foo one    1.382377 -0.250962
    two   -0.353694  0.104634
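
To see the "split" step explicitly, you can iterate over the groupby object, and with agg you can apply a different function to each column in one call (a sketch using the same df; the choice of functions is only illustrative):

for name, group in df.groupby('A'):
    print(name)          # the group key: 'bar' or 'foo'
    print(group.shape)   # the sub-DataFrame holding that group's rows

df.groupby('A').agg({'C': 'mean', 'D': 'sum'})  # mean of C, sum of D, per group
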
# Use groupby to compute each stock's correlation with the market index, year by year

# Fetch data
stocks = ['000005.SZA','000010.SZA','603998.SHA','603898.SHA','000300.SHA']
df = D.history_data(stocks,'2010-01-01','2016-12-31',fields=['close'],groupped_by_instrument=False)

# Pivot so that each column holds one stock's daily closing prices
df = pd.pivot_table(df,values='close',index=['date'],columns=['instrument'])

# Compute daily returns
rets = df.pct_change().dropna()

# Define a function that computes the correlation with the index
corr_func = lambda x: x.corrwith(x['000300.SHA'])

# Correlation of each stock with the market index, grouped by year
by_year = rets.groupby(lambda x: x.year)
by_year.apply(corr_func)
instrument  000005.SZA  000010.SZA  000300.SHA  603898.SHA  603998.SHA
2015          0.442371    0.464152         1.0    0.189308    0.540383
2016          0.714335    0.649836         1.0    0.634643    0.623916

# Time series

pandas has simple, powerful, and efficient functionality for resampling data to a different frequency (for example, converting data sampled every second into 5-minute bars). This is very common in finance, but by no means limited to it.

# Generate a time index of 5000 timestamps at one-second frequency, starting at 2017-01-01
rng = pd.date_range('1/1/2017', periods=5000, freq='S')
rng

DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 00:00:01',
               '2017-01-01 00:00:02', '2017-01-01 00:00:03',
               '2017-01-01 00:00:04', '2017-01-01 00:00:05',
               '2017-01-01 00:00:06', '2017-01-01 00:00:07',
               '2017-01-01 00:00:08', '2017-01-01 00:00:09',
               ...
               '2017-01-01 01:23:10', '2017-01-01 01:23:11',
               '2017-01-01 01:23:12', '2017-01-01 01:23:13',
               '2017-01-01 01:23:14', '2017-01-01 01:23:15',
               '2017-01-01 01:23:16', '2017-01-01 01:23:17',
               '2017-01-01 01:23:18', '2017-01-01 01:23:19'],
              dtype='datetime64[ns]', length=5000, freq='S')
# Resample and compute the sum over each 5-minute window
ts = pd.Series(np.random.randint(0,5000,len(rng)), index=rng)
ts.resample('5Min', how='sum')
2017-01-01 00:00:00    765477
2017-01-01 00:05:00    755782
2017-01-01 00:10:00    759474
2017-01-01 00:15:00    738611
2017-01-01 00:20:00    782165
2017-01-01 00:25:00    741331
2017-01-01 00:30:00    771220
2017-01-01 00:35:00    790782
2017-01-01 00:40:00    739392
2017-01-01 00:45:00    800815
2017-01-01 00:50:00    725954
2017-01-01 00:55:00    725840
2017-01-01 01:00:00    739548
2017-01-01 01:05:00    756409
2017-01-01 01:10:00    727530
2017-01-01 01:15:00    776613
2017-01-01 01:20:00    522169
Freq: 5T, dtype: int64
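
The how= keyword used above (and in the weekly example below) belongs to an older pandas API and was removed in later releases. On a newer pandas, resample() returns a resampler whose aggregation methods are called directly; a sketch using the same ts:

ts.resample('5Min').sum()   # same 5-minute sums as resample('5Min', how='sum')
# likewise, the weekly example below can be written as df.resample('W').ohlc()
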
# Frequency conversion: turn daily bars into weekly bars
df = D.history_data(['000001.SZA'],'2016-11-01','2016-12-31',fields=['close'],groupped_by_instrument=False).set_index('date')[['close']]
df.resample('W',how='ohlc')
                  close
                   open         high          low        close
date
2016-11-06   957.490417   957.490417   950.157349   954.347656
2016-11-13   955.395264   961.680786   950.157349   961.680786
2016-11-20   965.871094   966.918701   961.680786   961.680786
2016-11-27   967.966248  1007.774414   967.966248  1007.774414
2016-12-04  1008.821960  1008.821960  1000.441345  1000.441345
2016-12-11   991.013062  1010.917114   991.013062  1010.917114
2016-12-18   995.203430   995.203430   969.013855   969.013855
2016-12-25   963.775940   963.775940   951.204956   951.204956
2017-01-01   955.395264   955.395264   949.109802   953.300110

# Plotting

Data visualization: interactive charts (see the platform's 数据可视化 documentation).

# Plot the CSI 300 (沪深300) index
history_hs300 = D.history_data(['000300.SHA'], start_date='2016-01-01', end_date='2017-01-01', fields=['close'])
T.plot(history_hs300[['date', 'close']].set_index('date'), chart_type='spline')

# Plot the price series of several stocks
data = D.history_data(['000005.SZA','000010.SZA','603998.SHA','603898.SHA'], start_date='2016-01-01', end_date='2017-01-01', fields=['close'],groupped_by_instrument=False)
data = pd.pivot_table(data,values='close',index=['date'],columns=['instrument'])
T.plot(data, chart_type='spline')
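
T.plot is the platform's charting helper. Outside the platform, the same pivoted DataFrame can be charted with pandas' built-in plotting, which uses matplotlib (an assumption here: matplotlib must be installed); a minimal sketch:

import matplotlib.pyplot as plt

data.plot()   # one line per column, i.e. per instrument
plt.show()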