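Pandas is a newer package built on top of NumPy that provides an efficient DataFrame data structure. A DataFrame is essentially a multidimensional array with attached row and column labels, often holding heterogeneous types and missing values.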

3.1 Installing and Using pandas

import pandas

pandas.__version__

'1.0.5'

pandas is usually imported under the alias pd:

import pandas as pd

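3.2 Introducing Pandas Objects

Viewed from a low level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified by labels rather than by simple integer indices.

The three fundamental Pandas data structures are Series, DataFrame, and Index.

Start with the imports: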

import numpy as np
import pandas as pd

3.2.1 The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from an array as follows:

data = pd.Series(np.linspace(0.25,1,4))

data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As the output shows, a Series binds a sequence of values to a sequence of indices, which are available through the values and index attributes.

data.values

array([0.25, 0.5 , 0.75, 1.  ])

data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index using Python's square-bracket notation:

data[1]

0.5

data[1:3]

1    0.50
2    0.75
dtype: float64

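1. Series as a generalized NumPy array

A Series may look interchangeable with a one-dimensional NumPy array, but the essential difference between the two is the index:

A NumPy array has an implicitly defined integer index used to access the values, whereas a Pandas Series has an explicitly defined index associated with the values.

Note the distinction: one is implicit, the other explicit.

Because the index is explicit, it does not have to consist of integers; it can hold values of any desired type.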

data = pd.Series([0.25,0.5,0.75,1.0],index=list('abcd'))

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Values are still accessed through the index:

data['b']

0.5

The index can also be non-contiguous or non-sequential:

data = pd.Series([0.25,0.5,0.75,1.0],index=[2,5,3,7])

data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

data[5]

0.5
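With an integer index like the one above, it is easy to confuse a label with a position. The following is a minimal sketch, reusing the same Series, that uses the .loc (label-based) and .iloc (position-based) accessors to make the difference explicit. The variable names by_label and by_position are only illustrative.

```python
import pandas as pd

# Same Series as above: the index labels are integers, but not 0..n-1
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]      # the element whose label is 5 (second element)
by_position = data.iloc[0]  # the element at position 0 (its label is 2)

print(by_label, by_position)
```

Using the accessors avoids relying on how plain square brackets resolve an integer key.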

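2. Series as a specialized dictionary

A Series can also be thought of as a specialized Python dictionary. A dictionary maps arbitrary keys to a set of arbitrary values, whereas a Series maps typed keys to a set of typed values. A Series can be constructed directly from a Python dictionary.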

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

Output

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

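When a Series is created from a dictionary, its index follows the order of the dictionary's keys, as the output above shows (older pandas releases sorted the keys instead). Values are still accessed through the index: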

population['California']

38332521

Unlike a dictionary, a Series also supports slicing:

population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
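One detail of the slice above is worth spelling out: a slice by label includes its end label, while a slice by position excludes its stop position. A small sketch, assuming the same population data, with by_label and by_position as illustrative names:

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: both 'Texas' and 'Florida' are included
by_label = population['Texas':'Florida']

# Position-based slice: the stop position (3) is excluded
by_position = population.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```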

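Creating Series objects

The constructor used so far has the following form:

pd.Series(data, index=index)

index is an optional argument, and data can be one of several kinds of object.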

pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

The example above shows the default integer index. data can also be a scalar, which is repeated to fill the specified index:

pd.Series(5, index=[100,200,300])

100    5
200    5
300    5
dtype: int64

data can also be a dictionary, in which case index defaults to the dictionary keys in the order they appear:

pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In each case, the index can be set explicitly to control which entries appear in the result:

pd.Series({2:'a',1:'b',3:'c'},index=[3,2])

3    c
2    a
dtype: object

The result follows the order given in index rather than being sorted, and the Series keeps only the keys that were explicitly listed.
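The reverse case is also worth knowing: a label listed in index that has no matching key in the dictionary is kept and filled with NaN. A minimal sketch, where the extra label 4 and the names source and s are assumptions added for illustration:

```python
import pandas as pd

source = {2: 'a', 1: 'b', 3: 'c'}

# Keys 3 and 2 exist in the dict; label 4 has no matching key
s = pd.Series(source, index=[3, 2, 4])

print(s)
print(s.isna())  # only the entry for label 4 is missing
```

Key 1 is dropped because it is not listed, and label 4 appears with a missing value.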

3.2.2 The Pandas DataFrame Object

Like the Series, a DataFrame can be thought of either as a generalized NumPy array or as a specialized Python dictionary.

First, create another Series:

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

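Just as the population Series was built from a dictionary, a dictionary can be used to build a two-dimensional object that holds both Series.

states = pd.DataFrame({'population': population,
                       'area': area})
states

Output

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995

Like a Series, a DataFrame has an index attribute:

states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

A DataFrame also has a columns attribute, an Index object that holds the column labels:

states.columns

Index(['population', 'area'], dtype='object')

A DataFrame can therefore be seen as a generalized two-dimensional NumPy array in which both the rows and the columns have an explicit index for accessing the data.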

states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

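Indexing by a column name returns that column's data, which is a Series.

3. Creating DataFrame objects

(1) From a single Series object. A DataFrame is a collection of Series objects, and a single-column DataFrame can be built from a single Series: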

pd.DataFrame(population,columns=['population'])

Output

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135

(2) From a list of dictionaries. Any list of dictionaries can be converted to a DataFrame; the dictionary keys become the DataFrame's column names.

data =[{'a':i,'b':2*i} 
      for i in range(3)]
pd.DataFrame(data)

Output

   a  b
0  0  0
1  1  2
2  2  4

Even if some keys are missing from a dictionary, Pandas fills them in with the missing value NaN:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

Output

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
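The output above also shows a side effect of filling with NaN: columns a and c are displayed as floats while b stays an integer. A short sketch that checks this on the same input (df is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns that received a NaN are upcast to float64; 'b' keeps its integer type
print(df.dtypes)

# Count the missing values per column
print(df.isna().sum())
```

NaN is a floating-point value, so an integer column that contains one is stored as float64 by default.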

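(3) From a dictionary of Series objects, as in the earlier example:

pd.DataFrame({'population': population,
              'area': area})

Output

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995

(4) From a two-dimensional NumPy array. If the row and column labels are not specified, both default to integer indices.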

pd.DataFrame(np.random.rand(3,2),
            columns=['foo','bar'],
            index=['a','b','c'])

Output

        foo       bar
a  0.926270  0.753726
b  0.537491  0.967508
c  0.817875  0.590719

(5) From a NumPy structured array, which, as the book describes, can be converted directly:

A=np.zeros(3,dtype=({'names':('A','B'),'formats':('i8','f8')}))

A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0

3.2.3 The Pandas Index Object

A Pandas Index is an immutable array and, at the same time, an ordered set.

First, create an Index object:

ind = pd.Index([2,3,5,7,11])

ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

1. Index as an immutable array

Given that description, every array operation except assignment should work:

ind[1]

3

ind[::2]

Int64Index([2, 5, 11], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64

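Slicing, indexing by position, and attributes such as size and shape all work.

Assigning through an index, however, raises an error.

The immutability of Index objects makes it safer to share indices between multiple DataFrames and arrays, because it avoids side effects from careless index modification.

2. Index as an ordered set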
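The failing assignment is not shown above, so here is a minimal sketch of it. The flag mutated is an illustrative name, and the exact error message may differ between pandas versions.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0       # item assignment is not supported on an Index
    mutated = True
except TypeError as err:
    mutated = False
    print(err)       # pandas explains that Index is immutable

print(list(ind))     # the Index is unchanged
```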

indA = pd.Index([1, 3, 5, 7, 9])

indB = pd.Index([2, 3, 5, 7, 11])

indA & indB  # intersection

Int64Index([3, 5, 7], dtype='int64')

indA | indB  # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

indA ^ indB  # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')

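As with Python's built-in set operations, these compute the intersection, union, and symmetric difference, respectively.

3.3.1 Data Selection in Series

A Series acts in many ways like both a NumPy array and a Python dictionary. Keeping this analogy in mind helps in understanding the patterns of data indexing and selection in a Series.

1. Series as a dictionary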
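The same three operations are also available as named methods. This matters because, to my knowledge, the operator forms &, | and ^ were deprecated as set operations in later pandas 1.x releases and act as element-wise logical operations in pandas 2.x, so the output shown above is specific to older versions such as 1.0.5. A sketch using the method forms:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)              # labels present in both
either = indA.union(indB)                     # labels present in at least one
only_one = indA.symmetric_difference(indB)    # labels present in exactly one

print(list(common))
print(list(either))
print(list(only_one))
```

The method forms state the intent explicitly and behave the same way across the versions mentioned here.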

 

import pandas as pd
import numpy as np
data = pd.Series(np.linspace(0.25,1,4),
                index=list('abcd'))
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

data['b']

0.5

Dictionary-style Python expressions and methods also work:

 
'a' in data

True

data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

 

list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

A Series can be extended by assigning to a new index value, just as a dictionary is extended by assigning to a new key:

 
data['e'] = 1.25

data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

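2. Series as a one-dimensional array

A Series builds on this dictionary-like interface and also provides array-style item selection through the same mechanisms as NumPy arrays: slices, masking, and fancy indexing.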
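The following sketch rebuilds the Series from this section so that it does not depend on earlier session state, then applies each of the selection styles just listed. The result names (label_slice, pos_slice, masked, fancy) are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the Series used in this section, including the added 'e' entry
data = pd.Series(np.linspace(0.25, 1, 4), index=list('abcd'))
data['e'] = 1.25

label_slice = data['a':'c']                  # explicit index: 'c' is included
pos_slice = data.iloc[0:2]                   # positions 0 and 1: stop is excluded
masked = data[(data > 0.3) & (data < 0.8)]   # boolean mask
fancy = data[['a', 'e']]                     # fancy indexing with a list of labels

print(label_slice, pos_slice, masked, fancy, sep='\n\n')
```

The positional slice goes through .iloc here so that it cannot be mistaken for a label-based slice.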

data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

 
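# Slicing by explicit index.
# Note: the traceback recorded below passes through pandas/core/frame.py,
# so in that session data had been rebound to a DataFrame whose index did
# not contain the labels 'a' and 'c'. On the Series defined at the start of
# this section, the same slice returns the entries a through c.
data['a':'c']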

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in _searchsorted_monotonic(self, label, side)
   4805 
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807 

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-110-eccbdf2cf843> in <module>
      1 # 将显式索引作为切片
----> 2 data['a':'c']

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2777 
   2778         # Do we have a slicer (on rows)?
-> 2779         indexer = convert_to_index_sliceable(self, key)
   2780         if indexer is not None:
   2781             # either we have a slice or we have a string that can be converted

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py in convert_to_index_sliceable(obj, key)
   2265     idx = obj.index
   2266     if isinstance(key, slice):
-> 2267         return idx._convert_slice_indexer(key, kind="getitem")
   2268 
   2269     elif isinstance(key, str):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in _convert_slice_indexer(self, key, kind)
   2961             indexer = key
   2962         else:
-> 2963             indexer = self.slice_indexer(start, stop, step, kind=kind)
   2964 
   2965         return indexer

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714 
   4715         # return a slice

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847                 # raise the original KeyError
-> 4848                 raise err
   4849 
   4850         if isinstance(slc, np.ndarray):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   4840         # we need to look up the label
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer