https://www.kesci.com/apps/home/project/5a8afe517f2d695222327e14
Exercise 1 - Getting to know your data
Step 6: How many columns are in the dataset: chipo.shape[1]
Step 9: Which item was ordered the most: chipo.item_name.value_counts().head(1) (value_counts() sorts from largest to smallest by default)
Step 10: How many different items were ordered in the item_name column: chipo.item_name.nunique() (nunique() returns the number of distinct values in the column)
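A minimal sketch on a toy Series (not the real chipo data) contrasting value_counts() with nunique():
import pandas as pd
s = pd.Series(['Chicken Bowl', 'Chips', 'Chicken Bowl', 'Chips', 'Chips'])
print(s.value_counts())   # per-value counts, sorted descending: Chips 3, Chicken Bowl 2
print(s.nunique())        # number of distinct values: 2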
Step 13: Convert item_price to float: dollarizer = lambda x: float(x[1:-1]) (the prices are strings such as '$2.39 ', so x[1:-1] strips the leading '$' and the trailing space before casting)
chipo.item_price = chipo.item_price.apply(dollarizer)
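An equivalent vectorized sketch instead of the lambda above, assuming the item_price strings carry a leading '$' as described:
chipo.item_price = chipo.item_price.str.replace('$', '', regex=False).astype(float)  # drop the '$', then cast to float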
Exercise 2 - Filtering and sorting data
Step 5: How many teams took part in Euro 2012: euro12.shape[0] (unlike Exercise 1 Step 6, shape[0] counts rows, i.e. teams, while shape[1] counts columns)
Step 6: How many columns are in the dataset: euro12.info() (another way to answer Exercise 1 Step 6; info() also lists each column's dtype and non-null count)
Step 8: Sort the discipline DataFrame by Red Cards first, then Yellow Cards: discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
Step 9: Compute the mean number of yellow cards per team: round(discipline['Yellow Cards'].mean())
Step 11: Select the teams whose names start with the letter G: euro12[euro12.Team.str.startswith('G')]
Step 14: Find the Shooting Accuracy of England, Italy and Russia: euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]
Exercise 3 - Grouping data
Step 8: Print the mean, min and max spirit consumption for each continent: drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
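A minimal sketch on toy data showing what groupby plus agg with a list of functions returns (one row per group, one column per function):
import pandas as pd
toy = pd.DataFrame({'continent': ['EU', 'EU', 'AS', 'AS'], 'spirit_servings': [100, 200, 50, 70]})
print(toy.groupby('continent').spirit_servings.agg(['mean', 'min', 'max']))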
Exercise 4 - The apply function
Step 4: What is the data type of each column: crime.info()
Step 5: Convert the data type of Year to datetime64: crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
Step 6: Set the column Year as the index of the DataFrame: crime = crime.set_index('Year', drop=True)
Step 7: Delete the column named Total: del crime['Total']
Step 8: Group the DataFrame by decade and sum. Note the Population column: summing it directly would be wrong.
crimes = crime.resample('10AS').sum()  # first sum the columns that can be summed, one bucket per decade
population = crime['Population'].resample('10AS').max()  # for population, take the maximum within each decade instead
crimes['Population'] = population  # replace the summed Population with the per-decade maximum
crimes
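A one-pass alternative sketch of the same idea, under the assumption that Population is the only column that must not be summed:
agg_map = {col: 'sum' for col in crime.columns}  # sum everything by default
agg_map['Population'] = 'max'                    # except Population, which takes the decade maximum
crimes = crime.resample('10AS').agg(agg_map)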
Step 9: What was the most dangerous decade to live in the US: crime.idxmax(0) (idxmax() returns the index label at which each column reaches its maximum)
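A minimal sketch on made-up toy numbers of what idxmax(0) returns:
import pandas as pd
toy = pd.DataFrame({'Murder': [10, 30, 20], 'Robbery': [5, 7, 9]}, index=[1960, 1970, 1980])
print(toy.idxmax(0))  # Murder -> 1970, Robbery -> 1980: the index label where each column peaks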
Exercise 5 - Merging
Step 3: Name the DataFrames above data1, data2, data3:
data1 = pd.DataFrame(raw_data_1, columns=['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns=['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns=['subject_id', 'test_id'])
Step 4: Concatenate data1 and data2 along the row axis and call the result all_data: all_data = pd.concat([data1, data2])
Step 9: Find all matching records between data1 and data2: pd.merge(data1, data2, on='subject_id', how='outer') (how='outer' is a full outer join: rows from both sides are kept and non-matching cells become NaN)
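A minimal sketch on toy data contrasting how='inner' with how='outer':
import pandas as pd
left = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
right = pd.DataFrame({'subject_id': ['2', '3'], 'test_id': [51, 15]})
print(pd.merge(left, right, on='subject_id', how='inner'))  # only subject_id 2, present on both sides
print(pd.merge(left, right, on='subject_id', how='outer'))  # subject_ids 1, 2, 3; missing cells become NaN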
Exercise 6 - Statistics
Step 3: Store the data and turn the first three columns into a proper index: data = pd.read_table(path6, sep="\s+", parse_dates=[[0, 1, 2]]) (parse_dates=[[0, 1, 2]] tells pandas to combine the first three columns, year/month/day, into a single datetime column)
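A minimal sketch with two made-up rows laid out like the wind dataset, showing what parse_dates=[[0, 1, 2]] does; note that newer pandas versions may flag the nested-list form as deprecated:
import io
import pandas as pd
sample = io.StringIO('Yr Mo Dy RPT\n61 1 1 15.04\n61 1 2 14.71\n')
df = pd.read_table(sample, sep=r'\s+', parse_dates=[[0, 1, 2]])
print(df.dtypes)  # the Yr, Mo, Dy columns are merged into one datetime64 column named Yr_Mo_Dy
print(df.head())  # the two-digit year 61 comes back as 2061, which is exactly the bug Step 4 fixes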
Step 4: The year 2061? Do we really have data from that year? Create a function and use it to fix the bug:
import datetime
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
Step 5: Set the date as the index; watch the data type, it should be datetime64[ns]:
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])  # transform Yr_Mo_Dy back to datetime64
data = data.set_index('Yr_Mo_Dy')  # set 'Yr_Mo_Dy' as the index
Step 6: For each location, how many values are missing: data.isnull().sum() (isnull() marks missing cells as True; summing per column counts them)
Step 7: For each location, how many values are complete: data.shape[0] - data.isnull().sum() (shape[0] is the total number of rows, so subtracting the missing count per column gives the complete count; data.notnull().sum() is equivalent)
Step 9: Create a DataFrame named loc_stats to compute and store the min, max, mean and standard deviation of the wind speed at each location:
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min()    # min per location (column)
loc_stats['max'] = data.max()    # max per location
loc_stats['mean'] = data.mean()  # mean per location
loc_stats['std'] = data.std()    # standard deviation per location
Step 10: Create a DataFrame named day_stats to compute and store the min, max, mean and standard deviation of the wind speed across all locations, for each day:
day_stats = pd.DataFrame()              # create the DataFrame
day_stats['min'] = data.min(axis=1)     # min across locations, per day (row)
day_stats['max'] = data.max(axis=1)     # max per day
day_stats['mean'] = data.mean(axis=1)   # mean per day
day_stats['std'] = data.std(axis=1)     # standard deviation per day
Step 11: For each location, compute the mean wind speed in January:
data['date'] = data.index  # create a new column 'date' from the index
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
january_winds = data.query('month == 1')
january_winds.loc[:, 'RPT':'MAL'].mean()
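A shorter sketch of the same query that reads the month straight off the DatetimeIndex, assuming Yr_Mo_Dy is still the index as set in Step 5:
january_winds = data[data.index.month == 1]          # boolean mask on the index, no helper columns needed
print(january_winds.loc[:, 'RPT':'MAL'].mean())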
Step 12: Downsample the records to a yearly frequency: data.query('month == 1 and day == 1') (keeps one row per year, the January 1st reading)
Step 13: Downsample the records to a monthly frequency: data.query('day == 1') (keeps one row per month, the reading on the 1st)
Exercise 7 - Visualization
Step 5: Draw a pie chart showing the proportion of male and female passengers:
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# create the pie chart
plt.pie(
    proportions,                   # using proportions
    labels=['Males', 'Females'],   # with the labels Males and Females
    shadow=False,                  # with no shadows
    colors=['blue', 'red'],        # with colors
    explode=(0.15, 0),             # with one slice exploded out
    startangle=90,                 # with the start angle at 90 degrees
    autopct='%1.1f%%'              # with the percentage printed on each slice
)
plt.axis('equal')  # keep the pie circular
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()
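An equivalent sketch using pandas' built-in plotting, which skips the manual counting (assuming matplotlib.pyplot is imported as plt, as above):
titanic['Sex'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90)  # counts per sex, drawn directly as a pie
plt.axis('equal')
plt.show()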
Step 6: Draw a scatter plot of the Fare paid against passenger age, colored by sex:
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)  # create the plot with seaborn's lmplot, scatter only
lm.set(title='Fare x Age')  # set the title
axes = lm.axes  # get the axes object and tweak the limits
axes[0, 0].set_ylim(-5,)
axes[0, 0].set_xlim(-5, 85)
Step 8: Draw a histogram of the ticket prices:
df = titanic.Fare.sort_values(ascending=False)
binsVal = np.arange(0, 600, 10)  # create the bin edges with numpy
plt.hist(df, bins=binsVal)  # create the plot
plt.xlabel('Fare')  # set the title and labels
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
plt.show()  # show the plot
Exercise 8 - Creating DataFrames
Step 4: The columns are in alphabetical order; reorder them as name, type, hp, evolution, pokedex:
pokemon = pokemon[['name', 'type', 'hp', 'evolution', 'pokedex']]
Step 6: Check the data type of each column: pokemon.dtypes
Exercise 9 - Time series
Step 5: Convert the Date column to datetime: apple.Date = pd.to_datetime(apple.Date)
Step 7: Are there any duplicate dates: apple.index.is_unique (is_unique is True only if no index value, i.e. no date, appears more than once)
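A minimal sketch of is_unique on a small DatetimeIndex:
import pandas as pd
idx = pd.to_datetime(['2014-07-08', '2014-07-07', '2014-07-07'])
print(idx.is_unique)  # False: 2014-07-07 appears twice
print(pd.to_datetime(['2014-07-08', '2014-07-07']).is_unique)  # True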
Step 8: Sort the index in ascending order: apple.sort_index(ascending=True).head()
Step 9: Find the last business day of each month: apple_month = apple.resample('BM').mean() ('BM' is the business-month-end frequency)
Step 10: How many days apart are the earliest and the latest dates in the dataset: (apple.index.max() - apple.index.min()).days
Step 12: Plot the Adj Close values in chronological order:
appl_open = apple['Adj Close'].plot(title="Apple Stock")  # make the plot and assign it to a variable
fig = appl_open.get_figure()  # grab the underlying figure
fig.set_size_inches(13.5, 9)  # change the size of the graph
Exercise 10 - Deleting data
Step 4: Create the column names of the DataFrame: iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
Step 5: Are there missing values in the DataFrame: pd.isnull(iris).sum()
Step 6: Set rows 10 to 19 of the column petal_length to missing: iris.iloc[10:20, 2:3] = np.nan
Step 7: Replace all missing values with 1.0: iris.petal_length.fillna(1, inplace=True)
Step 8: Delete the column class: del iris['class']
Step 10: Drop the rows that have missing values: iris = iris.dropna(how='any') (how='any' drops a row if it contains at least one missing value; how='all' would drop only rows where every value is missing)
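A minimal sketch on toy data contrasting how='any' with how='all':
import numpy as np
import pandas as pd
toy = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': [2.0, 3.0, np.nan]})
print(toy.dropna(how='any'))  # keeps only the first row (no missing values)
print(toy.dropna(how='all'))  # drops only the last row (all values missing)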