https://www.kesci.com/apps/home/project/5a8afe517f2d695222327e14
Exercise 1 - Getting to know your data
Step 6: How many columns are in the dataset: chipo.shape[1]
Step 9: Which item was ordered the most: chipo.item_name.value_counts().head(1) (value_counts() sorts from largest to smallest by default)
Step 10: How many different items were ordered in the item_name column: chipo.item_name.nunique() (nunique() returns the number of distinct values in the column)
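A minimal sketch on a toy Series (not the real chipo data) contrasting value_counts() with nunique():
import pandas as pd
s = pd.Series(['Chicken Bowl', 'Chips', 'Chicken Bowl', 'Chips', 'Chips'])
print(s.value_counts())   # per-value counts, sorted descending: Chips 3, Chicken Bowl 2
print(s.nunique())        # number of distinct values: 2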
Step 13: Convert item_price to float: dollarizer = lambda x: float(x[1:-1]) (the prices are strings such as '$2.39 ', so x[1:-1] strips the leading '$' and the trailing space before casting)
chipo.item_price = chipo.item_price.apply(dollarizer)
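An equivalent vectorized sketch instead of the lambda above, assuming the item_price strings carry a leading '$' as described:
chipo.item_price = chipo.item_price.str.replace('$', '', regex=False).astype(float)  # drop the '$', then cast to float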
Exercise 2 - Filtering and sorting data
Step 5: How many teams took part in Euro 2012: euro12.shape[0] (unlike Exercise 1 Step 6, shape[0] counts rows, i.e. teams, while shape[1] counts columns)
Step 6: How many columns are in the dataset: euro12.info() (another way to answer Exercise 1 Step 6; info() also lists each column's dtype and non-null count)
Step 8: Sort the discipline DataFrame by Red Cards first, then Yellow Cards: discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
Step 9: Compute the mean number of yellow cards per team: round(discipline['Yellow Cards'].mean())
Step 11: Select the teams whose names start with the letter G: euro12[euro12.Team.str.startswith('G')]
Step 14: Find the Shooting Accuracy of England, Italy and Russia: euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]
Exercise 3 - Grouping data
Step 8: Print the mean, min and max spirit consumption for each continent: drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
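A minimal sketch on toy data showing what groupby plus agg with a list of functions returns (one row per group, one column per function):
import pandas as pd
toy = pd.DataFrame({'continent': ['EU', 'EU', 'AS', 'AS'], 'spirit_servings': [100, 200, 50, 70]})
print(toy.groupby('continent').spirit_servings.agg(['mean', 'min', 'max']))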
Exercise 4 - The apply function
Step 4: What is the data type of each column: crime.info()
Step 5: Convert the data type of Year to datetime64: crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
Step 6: Set the column Year as the index of the DataFrame: crime = crime.set_index('Year', drop=True)
Step 7: Delete the column named Total: del crime['Total']
Step 8: Group the DataFrame by decade and sum. Note the Population column: summing it directly would be wrong.
crimes = crime.resample('10AS').sum()  # first sum the columns that can be summed, one bucket per decade
population = crime['Population'].resample('10AS').max()  # for population, take the maximum within each decade instead
crimes['Population'] = population  # replace the summed Population with the per-decade maximum
crimes
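A one-pass alternative sketch of the same idea, under the assumption that Population is the only column that must not be summed:
agg_map = {col: 'sum' for col in crime.columns}  # sum everything by default
agg_map['Population'] = 'max'                    # except Population, which takes the decade maximum
crimes = crime.resample('10AS').agg(agg_map)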
Step 9: What was the most dangerous decade to live in the US: crime.idxmax(0) (idxmax() returns the index label at which each column reaches its maximum)
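A minimal sketch on made-up toy numbers of what idxmax(0) returns:
import pandas as pd
toy = pd.DataFrame({'Murder': [10, 30, 20], 'Robbery': [5, 7, 9]}, index=[1960, 1970, 1980])
print(toy.idxmax(0))  # Murder -> 1970, Robbery -> 1980: the index label where each column peaks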
Exercise 5 - Merging
Step 3: Name the DataFrames above data1, data2, data3:
data1 = pd.DataFrame(raw_data_1, columns=['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns=['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns=['subject_id', 'test_id'])
Step 4: Concatenate data1 and data2 along the row axis and call the result all_data: all_data = pd.concat([data1, data2])
Step 9: Find all matching records between data1 and data2: pd.merge(data1, data2, on='subject_id', how='outer') (how='outer' is a full outer join: rows from both sides are kept and non-matching cells become NaN)
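A minimal sketch on toy data contrasting how='inner' with how='outer':
import pandas as pd
left = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
right = pd.DataFrame({'subject_id': ['2', '3'], 'test_id': [51, 15]})
print(pd.merge(left, right, on='subject_id', how='inner'))  # only subject_id 2, present on both sides
print(pd.merge(left, right, on='subject_id', how='outer'))  # subject_ids 1, 2, 3; missing cells become NaN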
Exercise 6 - Statistics
Step 3: Store the data and turn the first three columns into a proper index: data = pd.read_table(path6, sep="\s+", parse_dates=[[0, 1, 2]]) (parse_dates=[[0, 1, 2]] tells pandas to combine the first three columns, year/month/day, into a single datetime column)
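A minimal sketch with two made-up rows laid out like the wind dataset, showing what parse_dates=[[0, 1, 2]] does; note that newer pandas versions may flag the nested-list form as deprecated:
import io
import pandas as pd
sample = io.StringIO('Yr Mo Dy RPT\n61 1 1 15.04\n61 1 2 14.71\n')
df = pd.read_table(sample, sep=r'\s+', parse_dates=[[0, 1, 2]])
print(df.dtypes)  # the Yr, Mo, Dy columns are merged into one datetime64 column named Yr_Mo_Dy
print(df.head())  # the two-digit year 61 comes back as 2061, which is exactly the bug Step 4 fixes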
Step 4: The year 2061? Do we really have data from that year? Create a function and use it to fix the bug:
import datetime
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
Step 5: Set the date as the index; watch the data type, it should be datetime64[ns]:
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])  # transform Yr_Mo_Dy back to datetime64
data = data.set_index('Yr_Mo_Dy')  # set 'Yr_Mo_Dy' as the index
Step 6: For each location, how many values are missing: data.isnull().sum() (isnull() marks missing cells as True; summing per column counts them)
Step 7: For each location, how many values are complete: data.shape[0] - data.isnull().sum() (shape[0] is the total number of rows, so subtracting the missing count per column gives the complete count; data.notnull().sum() is equivalent)
Step 9: Create a DataFrame named loc_stats to compute and store the min, max, mean and standard deviation of the wind speed at each location:
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min()    # min per location (column)
loc_stats['max'] = data.max()    # max per location
loc_stats['mean'] = data.mean()  # mean per location
loc_stats['std'] = data.std()    # standard deviation per location
Step 10: Create a DataFrame named day_stats to compute and store the min, max, mean and standard deviation of the wind speed across all locations, for each day:
day_stats = pd.DataFrame()              # create the DataFrame
day_stats['min'] = data.min(axis=1)     # min across locations, per day (row)
day_stats['max'] = data.max(axis=1)     # max per day
day_stats['mean'] = data.mean(axis=1)   # mean per day
day_stats['std'] = data.std(axis=1)     # standard deviation per day
Step 11: For each location, compute the mean wind speed in January:
data['date'] = data.index  # create a new column 'date' from the index
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
january_winds = data.query('month == 1')
january_winds.loc[:, 'RPT':'MAL'].mean()
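A shorter sketch of the same query that reads the month straight off the DatetimeIndex, assuming Yr_Mo_Dy is still the index as set in Step 5:
january_winds = data[data.index.month == 1]          # boolean mask on the index, no helper columns needed
print(january_winds.loc[:, 'RPT':'MAL'].mean())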
Step 12: Downsample the records to a yearly frequency: data.query('month == 1 and day == 1') (keeps one row per year, the January 1st reading)
Step 13: Downsample the records to a monthly frequency: data.query('day == 1') (keeps one row per month, the reading on the 1st)
Exercise 7 - Visualization
Step 5: Draw a pie chart showing the proportion of male and female passengers:
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# create the pie chart
plt.pie(
    proportions,                   # using proportions
    labels=['Males', 'Females'],   # with the labels Males and Females
    shadow=False,                  # with no shadows
    colors=['blue', 'red'],        # with colors
    explode=(0.15, 0),             # with one slice exploded out
    startangle=90,                 # with the start angle at 90 degrees
    autopct='%1.1f%%'              # with the percentage printed on each slice
)
plt.axis('equal')  # keep the pie circular
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()
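An equivalent sketch using pandas' built-in plotting, which skips the manual counting (assuming matplotlib.pyplot is imported as plt, as above):
titanic['Sex'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90)  # counts per sex, drawn directly as a pie
plt.axis('equal')
plt.show()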
Step 6: Draw a scatter plot of the Fare paid against passenger age, colored by sex:
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)  # create the plot with seaborn's lmplot, scatter only
lm.set(title='Fare x Age')  # set the title
axes = lm.axes  # get the axes object and tweak the limits
axes[0, 0].set_ylim(-5,)
axes[0, 0].set_xlim(-5, 85)
Step 8: Draw a histogram of the ticket prices:
df = titanic.Fare.sort_values(ascending=False)
binsVal = np.arange(0, 600, 10)  # create the bin edges with numpy
plt.hist(df, bins=binsVal)  # create the plot
plt.xlabel('Fare')  # set the title and labels
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
plt.show()  # show the plot
Exercise 8 - Creating DataFrames
Step 4: The columns are in alphabetical order; reorder them as name, type, hp, evolution, pokedex:
pokemon = pokemon[['name', 'type', 'hp', 'evolution', 'pokedex']]
Step 6: Check the data type of each column: pokemon.dtypes
Exercise 9 - Time series
Step 5: Convert the Date column to datetime: apple.Date = pd.to_datetime(apple.Date)
Step 7: Are there any duplicate dates: apple.index.is_unique (is_unique is True only if no index value, i.e. no date, appears more than once)
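A minimal sketch of is_unique on a small DatetimeIndex:
import pandas as pd
idx = pd.to_datetime(['2014-07-08', '2014-07-07', '2014-07-07'])
print(idx.is_unique)  # False: 2014-07-07 appears twice
print(pd.to_datetime(['2014-07-08', '2014-07-07']).is_unique)  # True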
Step 8: Sort the index in ascending order: apple.sort_index(ascending=True).head()
Step 9: Find the last business day of each month: apple_month = apple.resample('BM').mean() ('BM' is the business-month-end frequency)
Step 10: How many days apart are the earliest and the latest dates in the dataset: (apple.index.max() - apple.index.min()).days
Step 12: Plot the Adj Close values in chronological order:
appl_open = apple['Adj Close'].plot(title="Apple Stock")  # make the plot and assign it to a variable
fig = appl_open.get_figure()  # grab the underlying figure
fig.set_size_inches(13.5, 9)  # change the size of the graph
Exercise 10 - Deleting data
Step 4: Create the column names of the DataFrame: iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
Step 5: Are there missing values in the DataFrame: pd.isnull(iris).sum()
Step 6: Set rows 10 to 19 of the column petal_length to missing: iris.iloc[10:20, 2:3] = np.nan
Step 7: Replace all missing values with 1.0: iris.petal_length.fillna(1, inplace=True)
Step 8: Delete the column class: del iris['class']
Step 10: Drop the rows that have missing values: iris = iris.dropna(how='any') (how='any' drops a row if it contains at least one missing value; how='all' would drop only rows where every value is missing)
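A minimal sketch on toy data contrasting how='any' with how='all':
import numpy as np
import pandas as pd
toy = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': [2.0, 3.0, np.nan]})
print(toy.dropna(how='any'))  # keeps only the first row (no missing values)
print(toy.dropna(how='all'))  # drops only the last row (all values missing)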