Pandas Study Notes | YRen's Blog

Reading and writing
read_csv, to_csv, read_pickle, to_pickle
When a column holds list values, save it with to_pickle; to_csv writes each list out as a str.
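A quick round trip shows the difference. This is a minimal sketch: the column names and the temporary file names are made up for illustration.

```python
import ast
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")  # hypothetical file names
    pkl_path = os.path.join(tmp, "demo.pkl")

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

csv_value = from_csv.loc[0, "tags"]  # a str that looks like a list
pkl_value = from_pkl.loc[0, "tags"]  # a real list

# If only the CSV exists, the lists can be parsed back from their text form
from_csv["tags"] = from_csv["tags"].apply(ast.literal_eval)
```

ast.literal_eval only works here because the stored text is a valid Python literal; pickle avoids the parsing step altogether.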
Filling missing values
- data = data.fillna('-1')
- data = data.fillna(data.mean(numeric_only=True))
- Fill with []: that said, avoid storing lists in a DataFrame where possible
  # assumes data is a single column (a Series) of object dtype
  data = data.apply(lambda v: v if isinstance(v, list) else [])
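The three fills can be tried side by side on a small frame. A minimal sketch, assuming hypothetical columns score, label, and tags; the list column is rebuilt with apply because fillna does not accept a list as the fill value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "label": ["x", None, "z"],
    "tags": [["a"], np.nan, ["b", "c"]],
})

# Constant fill for a text column
df["label"] = df["label"].fillna("-1")

# Mean fill for a numeric column
df["score"] = df["score"].fillna(df["score"].mean())

# Empty-list fill: keep existing lists, replace everything else with []
df["tags"] = df["tags"].apply(lambda v: v if isinstance(v, list) else [])
```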
Count the missing values in each row:
Nan_num = data.shape[1] - data.count(axis=1)
Drop rows with too many missing values:
data.drop(Nan_num[Nan_num > 256].index.tolist(), inplace=True)
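The same two steps on a toy frame, as a sketch. The threshold max_missing is a hypothetical stand-in for the 256 used in these notes, and dropna(thresh=...) is shown as an alternative that should give the same rows.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 5.0],
    "c": [3.0, 4.0, 6.0],
})

# Missing values per row; same result as data.shape[1] - data.count(axis=1)
nan_num = data.isnull().sum(axis=1)

max_missing = 1  # hypothetical threshold
kept = data.drop(nan_num[nan_num > max_missing].index)

# Alternative: keep rows that have at least this many non-missing values
alt = data.dropna(thresh=data.shape[1] - max_missing)
```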
drop removes rows by default; pass axis=1 to drop columns
df.drop('column_name', axis=1, inplace=True)
items_one = items.drop_duplicates(subset='ITEMID', keep='first')
df.info() shows the dtype of each column

Data conversion
astype() casts to the given dtype; it returns a converted copy and does not modify the data in place.
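Because astype() returns a copy, the result has to be assigned back. A minimal sketch with a made-up column n:

```python
import pandas as pd

df = pd.DataFrame({"n": ["1", "2", "3"]})

df["n"].astype(int)            # result discarded, df is unchanged
still_text = df["n"].iloc[0]   # still the string "1"

df["n"] = df["n"].astype(int)  # assign back to keep the conversion
```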

Custom conversion:

def convert_currency(val):
    """
    Convert the string number value to a float
     - Remove $
     - Remove commas
     - Convert to float type
    """
    new_val = val.replace(',','').replace('$', '')
    return float(new_val)

df['2016'].apply(convert_currency)

Converting with a lambda
df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
Converting with to_numeric
pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
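To see what these conversions produce, here is a sketch on made-up values for the '2016' and 'Jan Units' columns. The .str accessor is swapped in as a vectorized alternative to apply; errors='coerce' turns unparseable entries into NaN, which fillna(0) then replaces.

```python
import pandas as pd

df = pd.DataFrame({
    "2016": ["$125,000.00", "$920.50"],  # made-up sample values
    "Jan Units": ["500", "Closed"],
})

# Strip the currency formatting, then cast to float
revenue = (
    df["2016"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# "Closed" cannot be parsed, so it becomes NaN and then 0
units = pd.to_numeric(df["Jan Units"], errors="coerce").fillna(0)
```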

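Selecting data
- loc: select by label
- iloc: select by integer position
- ix: mixed label/position (deprecated in pandas 0.20 and removed in 1.0; use loc or iloc instead)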
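The label/position distinction is easiest to see when the index labels are not 0, 1, 2. A small sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c"], "score": [10, 20, 30]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "name"]  # row whose index label is 20
by_position = df.iloc[0, 0]    # first row, first column

label_slice = df.loc[10:20]    # label slices include both endpoints
position_slice = df.iloc[0:1]  # position slices exclude the end
```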
Merging data
Counting
- df.groupby('id')['id'].agg(cnt='count').reset_index() (renaming through a dict in agg stopped working in pandas 1.0)
- df['id'].value_counts()
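Both ways of counting, sketched on a made-up id column. The groupby version uses named aggregation to get a DataFrame with a cnt column; value_counts returns a Series sorted by count.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b", "a", "c", "a"]})

# DataFrame with columns id and cnt
counts_df = df.groupby("id")["id"].agg(cnt="count").reset_index()

# Series indexed by id, most frequent first
counts_s = df["id"].value_counts()
```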
datetime
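- pd.to_datetime(df): convert str to datetime (in this section df stands for a single column, i.e. a Series)
- df.dt.year: get the year from datetime data
- df.map(lambda x: x.strftime('%Y')): format each datetime as a year string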
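The three calls chained on a small Series, as a sketch with made-up dates. dt.strftime is included as a vectorized equivalent of the map version.

```python
import pandas as pd

s = pd.Series(["2016-01-15", "2017-06-30"])
dates = pd.to_datetime(s)

years = dates.dt.year                                  # integers
year_text = dates.dt.strftime("%Y")                    # strings, vectorized
year_text_map = dates.map(lambda x: x.strftime("%Y"))  # same result via map
```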
Row and column operations
Drop rows/columns that contain NaN
dropna()
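A sketch of the most common dropna() variants on a made-up frame; axis chooses rows or columns, and how='all' only drops entries that are entirely NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_dropped = df.dropna()           # default axis=0: drop rows with any NaN
cols_dropped = df.dropna(axis=1)     # drop columns with any NaN
all_nan_only = df.dropna(how="all")  # drop a row only if every value is NaN
```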
Drop duplicate rows
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
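- subset: the columns to compare when looking for duplicates; defaults to all columns
- keep: {'first', 'last', False}, default 'first': which occurrence of a duplicate to keep; 'first' keeps the first occurrence, 'last' keeps the last, and False drops every duplicated row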
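The effect of each keep option, sketched on a made-up items frame (ITEMID comes from these notes, LABEL is hypothetical):

```python
import pandas as pd

items = pd.DataFrame({"ITEMID": [1, 1, 2], "LABEL": ["x", "y", "z"]})

first = items.drop_duplicates(subset="ITEMID", keep="first")  # keeps x, z
last = items.drop_duplicates(subset="ITEMID", keep="last")    # keeps y, z
no_dupes = items.drop_duplicates(subset="ITEMID", keep=False) # keeps z only
```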
Removing a specific value
data2 = data[~data.isin(['\\N'])]
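How this behaves depends on whether data is a Series or a DataFrame. A sketch with made-up values: on a Series the matching rows disappear, while on a DataFrame the shape is kept and matches become NaN, so dropping whole rows needs an extra any(axis=1).

```python
import pandas as pd

s = pd.Series(["a", "\\N", "b"])
kept = s[~s.isin(["\\N"])]  # Series: matching rows are removed

df = pd.DataFrame({"x": ["a", "\\N", "c"], "y": ["\\N", "b", "d"]})
masked = df[~df.isin(["\\N"])]                 # DataFrame: matches become NaN
rows_kept = df[~df.isin(["\\N"]).any(axis=1)]  # drop rows containing \N anywhere
```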