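Preface

This article is reposted from [Jack Stark @ Zhihu (知乎)](https://zhuanlan.zhihu.com/p/104119343).

Python data analysis relies mainly on libraries such as numpy and pandas. They are simple to use, but reviewing them from time to time cuts down on the time spent searching for answers while working.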

Attributes and methods of np.ndarray, pd.Series, and pd.DataFrame

For the attributes and methods of np.ndarray, see

https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html

For the attributes and methods of pd.Series, see

pandas.Series - pandas 0.25.3 documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

For the attributes and methods of pd.DataFrame, see

pandas.DataFrame - pandas 0.25.3 documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

DataFrame statistics

Sorting a DataFrame by specific columns

>>> df = pd.DataFrame({'col_1':['a','b','a','b'], 'col_2':['c','c','d','d'], 'col_3':[1, 2, 3, 4]})
>>> df
  col_1 col_2  col_3
0     a     c      1
1     b     c      2
2     a     d      3
3     b     d      4
>>> df.sort_values(by=['col_1','col_3'])
  col_1 col_2  col_3
0     a     c      1
2     a     d      3
1     b     c      2
3     b     d      4

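Note that df.sort_values() leaves the index unchanged. Slicing with iloc follows the sorted position, while slicing with loc follows the index label, so the two can return different rows after sorting.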

>>> df.sort_values(by=['col_1','col_3'], ignore_index=False).iloc[2,:]
col_1    b
col_2    c
col_3    2
Name: 1, dtype: object
>>> df.sort_values(by=['col_1','col_3'], ignore_index=False).loc[2,:]
col_1    a
col_2    d
col_3    3
Name: 2, dtype: object

Passing ignore_index=True to sort_values (available since pandas 1.0) returns a DataFrame whose index has been reset. Alternatively, reset the index afterwards with df = df.reset_index(drop=True).
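The two ways of resetting the index described above can be sketched as follows. This is a minimal illustration that reuses the same example data; the variable names sorted_a and sorted_b are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'b'],
                   'col_2': ['c', 'c', 'd', 'd'],
                   'col_3': [1, 2, 3, 4]})

# Option 1: let sort_values renumber the rows (requires pandas >= 1.0)
sorted_a = df.sort_values(by=['col_1', 'col_3'], ignore_index=True)

# Option 2: sort first, then discard the old index
sorted_b = df.sort_values(by=['col_1', 'col_3']).reset_index(drop=True)

# Once the index is reset, label 2 and position 2 point at the same row
print(sorted_a.loc[2, 'col_3'], sorted_a.iloc[2, 2])
```

With the index reset, loc and iloc agree again, which avoids the mismatch shown in the previous example.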

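The DataFrame groupby method

The pandas groupby method splits a pandas object by one or more keys and computes a value for each group, such as a count, mean, or standard deviation. It also works with apply and custom functions.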

>>> df = pd.DataFrame({'col_1':['a','b','a','b'], 'col_2':['c','c','d','d'], 'col_3':[1, 2, 3, 4]})
>>> df
  col_1 col_2  col_3
0     a     c      1
1     b     c      2
2     a     d      3
3     b     d      4
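>>> df.groupby('col_1').mean(numeric_only=True) # mean per distinct value of col_1; numeric_only skips the string column col_2
       col_3
col_1
a        2.0
b        3.0
>>> df.groupby('col_1')['col_3'].apply(np.mean) # pass np.mean without parentheses; a custom function also works
col_1
a    2.0
b    3.0
Name: col_3, dtype: float64
>>> df.groupby(['col_1','col_2']).count() # count excludes NaN values, while size includes them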
             col_3
col_1 col_2
a     c          1
      d          1
b     c          1
      d          1
>>> df.groupby('col_1').size()
col_1
a    2
b    2
dtype: int64
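To make the custom-function and count-versus-size points concrete, here is a small sketch. The data is changed to include one NaN, and value_range is a hypothetical helper written only for this illustration.

```python
import numpy as np
import pandas as pd

# Same grouping key as above, but with one missing value in col_3
df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'b'],
                   'col_3': [1.0, np.nan, 3.0, 4.0]})

def value_range(s):
    """Hypothetical helper: largest minus smallest value in a group."""
    return s.max() - s.min()

grouped = df.groupby('col_1')['col_3']

ranges = grouped.apply(value_range)  # custom function applied per group
counts = grouped.count()             # NaN values are not counted
sizes = grouped.size()               # every row is counted, NaN included

print(ranges, counts, sizes, sep='\n\n')
```

Group b holds one NaN, so its count is 1 while its size is 2.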

Indexing and slicing a DataFrame

loc: Access a group of rows and columns by label(s) or a boolean array.
iloc: Purely integer-location based indexing for selection by position.
at: Access a single value for a row/column label pair.
iat: Access a single value for a row/column pair by integer position.
ix: A primarily label-location based indexer, with integer position fallback. (Removed from pandas; use loc and iloc instead.)

>>> df
  col_1 col_2  col_3
0     a     c      1
1     b     c      2
2     a     d      3
3     b     d      4

>>> df.loc[1]
col_1    b
col_2    c
col_3    2
Name: 1, dtype: object

>>> df.loc[1, 'col_1']
'b'

>>> df.loc[[3,1,0]]
  col_1 col_2  col_3
3     b     d      4
1     b     c      2
0     a     c      1

>>> df.iloc[1,1]
'c'

>>> df.at[1, 'col_1']
'b'

>>> df.iat[1,1]
'c'

# Select a column directly by its label
>>> df['col_1']
0    a
1    b
2    a
3    b
Name: col_1, dtype: object

# Select rows by condition
>>> df[df['col_3'] == 3]
  col_1 col_2  col_3
2     a     d      3

Note that with iloc the end of a slice (the value to the right of the colon) is excluded, while with loc the end label is included.

>>> df = pd.DataFrame({'col_1':['a','b','a','b'], 'col_2':['c','c','d','d'], 'col_3':[1, 2, 3, 4]})
>>> df
  col_1 col_2  col_3
0     a     c      1
1     b     c      2
2     a     d      3
3     b     d      4
>>> df.iloc[:1]
  col_1 col_2  col_3
0     a     c      1
>>> df.loc[:1]
  col_1 col_2  col_3
0     a     c      1
1     b     c      2
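The difference is easier to see with a non-integer index. The sketch below assumes a made-up string index ('w' to 'z') purely for illustration.

```python
import pandas as pd

# Hypothetical string index so that labels and positions cannot be confused
df = pd.DataFrame({'col_3': [1, 2, 3, 4]}, index=['w', 'x', 'y', 'z'])

by_position = df.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = df.loc['w':'y']   # labels 'w' through 'y'; the end label is included

print(by_position, by_label, sep='\n\n')
```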

Checking whether two DataFrames are equal

>>> df = pd.DataFrame({'col_1':['a','b','a','b'], 'col_2':['c','c','d','d'], 'col_3':[1, 2, 3, 4]})
>>> df2 = pd.DataFrame({'col_1':['a','b','a','b'], 'col_2':['c','c','d','d'], 'col_3':[1, 2, 3, 4]})
>>> df.equals(df2)
True
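As a hedged aside, equals differs from the elementwise == operator in two ways worth knowing: it treats NaN values in the same position as equal, and it requires matching column dtypes. The frames below are invented for this sketch.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1.0, np.nan]})
right = pd.DataFrame({'x': [1.0, np.nan]})

elementwise = left == right   # NaN == NaN evaluates to False elementwise
same = left.equals(right)     # NaN in the same position counts as equal

as_int = pd.DataFrame({'x': [1, 2]})
as_float = pd.DataFrame({'x': [1.0, 2.0]})
same_values_other_dtype = as_int.equals(as_float)  # dtypes differ (int vs float)

print(elementwise, same, same_values_other_dtype, sep='\n')
```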

Converting between list, dict, np.array and pd.Series, pd.DataFrame

list, dict, np.array -> pd.Series, pd.DataFrame

list -> other formats

import numpy as np
import pandas as pd

lis = [[1, 'a', 1.0], [2, 'b', '2.0']]

# list -> array
arr = np.array(lis)
print(arr,'\n')

# list -> series
seri = pd.Series(lis)
print(seri, '\n', seri.shape, '\n')

# list -> DataFrame
df = pd.DataFrame(lis)
print(df,'\n', df.shape, '\n')

Output:

[['1' 'a' '1.0']
 ['2' 'b' '2.0']] 

0    [1, a, 1.0]
1    [2, b, 2.0]
dtype: object 
 (2,) 

   0  1    2
0  1  a    1
1  2  b  2.0 
 (2, 3)

dict -> other formats

d = {'a':[0,1], 'b':[2,3], 'c':[4,5]}
# dict -> series
seri = pd.Series(d)
print(seri, '\n', seri.shape, '\n')

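seri = pd.Series(d, index=['row_1', 'row_2', 'row_3']) # the dict keys already serve as the index; specifying a different one yields NaN
print(seri, '\n', seri.shape, '\n')

# dict -> DataFrame
# the dict keys become the columns; column names that match no key also yield NaN
df = pd.DataFrame(d, index=['row_1', 'row_2'], columns=['col_1', 'col_2', 'col_3'])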
print(df,'\n', df.shape, '\n')

Output:

a    [0, 1]
b    [2, 3]
c    [4, 5]
dtype: object 
 (3,) 

row_1    NaN
row_2    NaN
row_3    NaN
dtype: object 
 (3,) 

      col_1 col_2 col_3
row_1   NaN   NaN   NaN
row_2   NaN   NaN   NaN 
 (2, 3)
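The NaN results above come from labels that do not match the dict keys. A minimal sketch of how to avoid them, assuming the goal is to keep the data and only relabel it, might look like this (the name renamed is invented for illustration):

```python
import pandas as pd

d = {'a': [0, 1], 'b': [2, 3], 'c': [4, 5]}

# Series: index labels are matched against the dict keys, so use existing keys
seri = pd.Series(d, index=['a', 'c'])

# DataFrame: columns are matched against the dict keys; index only labels the rows
df = pd.DataFrame(d, index=['row_1', 'row_2'], columns=['a', 'c'])

# To get new column names, build the frame first and rename afterwards
renamed = pd.DataFrame(d, index=['row_1', 'row_2']).rename(
    columns={'a': 'col_1', 'b': 'col_2', 'c': 'col_3'})

print(seri, df, renamed, sep='\n\n')
```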

array -> other formats

arr1 = np.array([1,2,3])
arr2 = np.array([[1, 'a', 1.0], [2, 'b', '2.0']]) # all elements of an array share one dtype, so the mixed values are coerced to strings
print(arr1,'\n', arr2, '\n')

# array -> Series
seri = pd.Series(arr1)
print(seri, '\n')

seri = pd.Series(arr1, index=['row_1', 'row_2', 'row_3']) # data without an index can be given one
print(seri, '\n')

# array -> DataFrame
df = pd.DataFrame(arr2, index=['row_1', 'row_2'], columns=['col_1', 'col_2', 'col_3'])
print(df,'\n', df.shape, '\n')

Output:

[1 2 3] 
 [['1' 'a' '1.0']
 ['2' 'b' '2.0']] 

0    1
1    2
2    3
dtype: int64 

row_1    1
row_2    2
row_3    3
dtype: int64 

      col_1 col_2 col_3
row_1     1     a   1.0
row_2     2     b   2.0 
 (2, 3)
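Because the mixed array was coerced to strings, every column of the resulting DataFrame holds strings too. The sketch below shows one way to restore numeric types with astype; it is an illustration under that assumption, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 'a', 1.0], [2, 'b', '2.0']])  # coerced to a string dtype
df = pd.DataFrame(arr, columns=['col_1', 'col_2', 'col_3'])

# Convert the columns that are really numeric back to numeric dtypes
df['col_1'] = df['col_1'].astype(int)
df['col_3'] = df['col_3'].astype(float)

print(arr.dtype)
print(df.dtypes)
```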

pd.Series, pd.DataFrame -> list, dict, np.array

Series -> array

# series -> array
# as_matrix() was removed in pandas 1.0; use to_numpy() or .values instead
seri = pd.Series([1, 'a', 1.0])
arr = seri.to_numpy()
print(arr, '\n')
arr = seri.values
print(arr, '\n')

Output:

[1 'a' 1.0] 

[1 'a' 1.0]

DataFrame -> array

# DataFrame -> array
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
df = pd.DataFrame([[1, 'a', 1.0], [2, 'b', '2.0']])
arr1 = df.to_numpy()
print(arr1, '\n')
arr2 = np.asarray(df)
print(arr2, '\n')
arr3 = df.values
print(arr3, '\n')
arr4 = np.array(df)
print(arr4, '\n')
arr5 = df[[1]].to_numpy()  # select column 1 first, then convert
print(arr5, '\n')

Output:

[[1 'a' 1.0]
 [2 'b' '2.0']] 

[[1 'a' 1.0]
 [2 'b' '2.0']] 

[[1 'a' 1.0]
 [2 'b' '2.0']] 

[[1 'a' 1.0]
 [2 'b' '2.0']] 

[['a']
 ['b']]

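DataFrame -> dict

Converting a DataFrame to a dict serves varied needs, so pay attention to the value of the orient argument of to_dict:

The default orient is 'dict', which returns a dict of dicts;
'list' returns a dict of lists;
'series' returns a dict of Series;
'records' returns a list of dicts.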

# DataFrame -> dict
df = pd.DataFrame([[1, 'a', 1.0], [2, 'b', 2.0]],index=['row_1', 'row_2'], columns=['int', 'string', 'decimal'])
print(df, '\n')

dic = df.to_dict()
print(dic, '\n')

dic = df.to_dict(orient='dict')  # spell out 'dict'; abbreviations are rejected by recent pandas versions
print(dic, '\n')

dic = df.to_dict(orient='list')
print(dic, '\n')

dic = df.to_dict(orient='series')
print(dic, '\n')

dic = df.to_dict(orient='records')
print(dic, '\n')

Output:

       int string  decimal
row_1    1      a      1.0
row_2    2      b      2.0 

{'int': {'row_1': 1, 'row_2': 2}, 'string': {'row_1': 'a', 'row_2': 'b'}, 'decimal': {'row_1': 1.0, 'row_2': 2.0}} 

{'int': {'row_1': 1, 'row_2': 2}, 'string': {'row_1': 'a', 'row_2': 'b'}, 'decimal': {'row_1': 1.0, 'row_2': 2.0}} 

{'int': [1, 2], 'string': ['a', 'b'], 'decimal': [1.0, 2.0]} 

{'int': row_1    1
row_2    2
Name: int, dtype: int64, 'string': row_1    a
row_2    b
Name: string, dtype: object, 'decimal': row_1    1.0
row_2    2.0
Name: decimal, dtype: float64} 

[{'int': 1, 'string': 'a', 'decimal': 1.0}, {'int': 2, 'string': 'b', 'decimal': 2.0}]
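Beyond the four orient values covered above, to_dict accepts others as well. The sketch below shows two of them, 'index' and 'split', on the same example frame; the variable names by_row and parts are invented here.

```python
import pandas as pd

df = pd.DataFrame([[1, 'a', 1.0], [2, 'b', 2.0]],
                  index=['row_1', 'row_2'],
                  columns=['int', 'string', 'decimal'])

# 'index': dict keyed by row label, each value a dict of column -> value
by_row = df.to_dict(orient='index')

# 'split': separate 'index', 'columns', and 'data' entries
parts = df.to_dict(orient='split')

print(by_row, '\n')
print(parts, '\n')
```

The 'index' form is handy for looking up a whole row by its label, while 'split' keeps labels and data apart, which can be convenient for serialization.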