WeRateDogs: Analyzing Twitter Data

Data Gathering

Import the required libraries

In [60]:

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import requests
import json
import os

Load and assess twitter-archive-enhanced

In [61]:twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

In [62]:twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2356 non-null   int64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object
 4   source                      2356 non-null   object
 5   text                        2356 non-null   object
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object
 9   expanded_urls               2297 non-null   object
 10  rating_numerator            2356 non-null   int64
 11  rating_denominator          2356 non-null   int64
 12  name                        2356 non-null   object
 13  doggo                       2356 non-null   object
 14  floofer                     2356 non-null   object
 15  pupper                      2356 non-null   object
 16  puppo                       2356 non-null   object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

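The info() output above shows that tweet_id and timestamp have the wrong dtype; in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values; and expanded_urls contains nulls, which correspond to tweets without photos. Per the project requirements, those rows need to be removed later.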

In [63]:twitter_archive_enhanced.retweeted_status_id.notnull().value_counts()

Out[63]:

False    2175
True      181
Name: retweeted_status_id, dtype: int64

Rows where retweeted_status_id is not NaN are retweets. There are 181 of them, and per the project requirements they need to be removed later.

In [64]:twitter_archive_enhanced.name.value_counts()

Out[64]:

None        745
a            55
Charlie      12
Oliver       11
Lucy         11
           ...
Karll         1
Tiger         1
old           1
Meatball      1
Stormy        1
Name: name, Length: 957, dtype: int64

In [65]:twitter_archive_enhanced.text[twitter_archive_enhanced.name=='a'].iloc[1]

Out[65]:

'Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq'

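* 55 dogs are named 'a'. Inspecting one of these tweets shows that 'a' is clearly not a dog's name, so this is a quality issue.
* The text column contains links.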

In [66]:twitter_archive_enhanced.rating_denominator.value_counts()

Out[66]:

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

So rating_denominator is not always 10.

In [67]:twitter_archive_enhanced.source.iloc[0]

Out[67]:

'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

source contains HTML markup.

This dataset also has a tidiness issue: dog stage is a single variable, so doggo, floofer, pupper and puppo should be one column.

Gather and assess image-predictions

In [68]:

folder_name = 'pred-image'

if not os.path.exists(folder_name):
    os.makedirs(folder_name)

url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'

response = requests.get(url)

response

Out[68]:

<Response [200]>

In [69]:

with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [70]:os.listdir(folder_name)

Out[70]:

['image-predictions.tsv']
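The download cells above write the response to disk without checking the HTTP status. A minimal sketch of a more defensive variant follows. `download_to_folder` is a hypothetical helper name, the `fetch` parameter is an assumption added so the function can be tried without network access, and the `example.com` URL is a placeholder.

```python
import os
import tempfile


def download_to_folder(url, folder_name, fetch=None):
    """Save the resource at `url` into `folder_name` and return the file path.

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns bytes. When it is None, `requests` is used.
    """
    if fetch is None:
        import requests  # imported lazily so the sketch also runs offline

        def fetch(u):
            response = requests.get(u, timeout=30)
            response.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
            return response.content

    os.makedirs(folder_name, exist_ok=True)  # replaces the explicit os.path.exists check
    path = os.path.join(folder_name, url.split('/')[-1])
    with open(path, mode='wb') as file:
        file.write(fetch(url))
    return path


# Offline demonstration with a stand-in fetcher and a temporary folder.
demo_folder = os.path.join(tempfile.mkdtemp(), 'pred-image')
demo_path = download_to_folder(
    'https://example.com/data/image-predictions.tsv',
    demo_folder,
    fetch=lambda u: b'tweet_id\tjpg_url\n',
)
print(os.listdir(demo_folder))
```

In the notebook itself the call would simply be `download_to_folder(url, folder_name)`, which falls back to `requests.get`.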

In [71]:image_predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')

In [72]:image_predictions.head()

Out[72]:

  tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

In [73]:image_predictions.jpg_url.duplicated().value_counts()

Out[73]:

False    2009
True       66
Name: jpg_url, dtype: int64

There are 66 duplicated image links.

tweet_id has the wrong dtype.

Load and assess tweet_json

In [74]:tweet_json = pd.DataFrame()

In [75]:

with open('tweet_json.txt', 'r') as file:
    for line in file:
        dic = json.loads(line)
        tweet_id = dic['id']
        retweet_count = dic['retweet_count']
        favorite_count = dic['favorite_count']
        # one-row frame per tweet; index=[0] is why every row below shows index 0
        tem_df = pd.DataFrame({'tweet_id': tweet_id,
                               'retweet_count': retweet_count,
                               'favorite_count': favorite_count}, index=[0])
        tweet_json = pd.concat([tweet_json, tem_df])

In [76]:

tweet_json

Out[76]:

  tweet_id retweet_count favorite_count
0 892420643555336193 8842 39492
0 892177421306343426 6480 33786
0 891815181378084864 4301 25445
0 891689557279858688 8925 42863
0 891327558926688256 9721 41016
0 666049248165822465 41 111
0 666044226329800704 147 309
0 666033412701032449 47 128
0 666029285002620928 48 132
0 666020888022790149 530 2528

2352 rows × 3 columns
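Concatenating a one-row DataFrame for every line copies the growing frame on each iteration. A common alternative is to collect plain dicts in a list and build the frame once. The sketch below assumes the same three keys as the loop above; the in-memory `sample_file` stands in for tweet_json.txt, and its values are copied from the output shown above.

```python
import io
import json

import pandas as pd

# Two sample lines standing in for tweet_json.txt (an assumption for the demo).
sample_file = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 8842, "favorite_count": 39492}\n'
    '{"id": 892177421306343426, "retweet_count": 6480, "favorite_count": 33786}\n'
)

rows = []
for line in sample_file:
    dic = json.loads(line)
    rows.append({'tweet_id': dic['id'],
                 'retweet_count': dic['retweet_count'],
                 'favorite_count': dic['favorite_count']})

# Build the frame once; the index runs 0..n-1 instead of repeating 0.
tweet_json_demo = pd.DataFrame(rows, columns=['tweet_id', 'retweet_count', 'favorite_count'])
print(tweet_json_demo)
```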

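tweet_id has the wrong dtype.

In summary:

#* Quality issues in the datasets:

  1. tweet_id and timestamp have the wrong dtype
  2. jpg_url has 66 duplicated links
  3. source contains HTML markup
  4. rating_denominator is not always 10, and one row has a denominator of 0
  5. 55 dogs are named 'a'; inspecting one of these tweets shows that 'a' is clearly not a dog's name
  6. text contains links
  7. rows where retweeted_status_id is not NaN are retweets; there are 181 of them, and per the project requirements they need to be removed later
  8. in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values
  9. tweets without photos need to be removed later, per the project requirements

#* Tidiness issues:

  1. dog stage is a single variable, so doggo, floofer, pupper and puppo should be one column
  2. the three datasets share one observational unit, keyed by tweet_id, and can be merged into a single dataset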

Data Cleaning

In [77]:

twitter_archive_enhanced_clean = twitter_archive_enhanced.copy()

image_predictions_clean = image_predictions.copy()

tweet_json_clean = tweet_json.copy()

issue: tweet_id has the wrong dtype

define: convert tweet_id to str

code:

In [78]:twitter_archive_enhanced_clean['tweet_id'] = twitter_archive_enhanced_clean['tweet_id'].astype('str')

In [79]:image_predictions_clean['tweet_id'] = image_predictions_clean['tweet_id'].astype('str')

In [80]:tweet_json_clean['tweet_id'] = tweet_json_clean['tweet_id'].astype('str')

Test

In [81]:twitter_archive_enhanced_clean['tweet_id']

Out[81]:

0       892420643555336193
1       892177421306343426
2       891815181378084864
3       891689557279858688
4       891327558926688256
               ...
2351    666049248165822465
2352    666044226329800704
2353    666033412701032449
2354    666029285002620928
2355    666020888022790149
Name: tweet_id, Length: 2356, dtype: object

In [82]:image_predictions_clean['tweet_id']

Out[82]:

0       666020888022790149
1       666029285002620928
2       666033412701032449
3       666044226329800704
4       666049248165822465
               ...
2070    891327558926688256
2071    891689557279858688
2072    891815181378084864
2073    892177421306343426
2074    892420643555336193
Name: tweet_id, Length: 2075, dtype: object

In [83]:tweet_json_clean['tweet_id']

Out[83]:

0    892420643555336193
0    892177421306343426
0    891815181378084864
0    891689557279858688
0    891327558926688256
            ...
0    666049248165822465
0    666044226329800704
0    666033412701032449
0    666029285002620928
0    666020888022790149
Name: tweet_id, Length: 2352, dtype: object
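Converting with `astype('str')` after loading works here. Another option is to ask `pd.read_csv` for text at load time, so the identifier is never parsed as a number. A minimal sketch, assuming a small in-memory CSV in place of twitter-archive-enhanced.csv:

```python
import io

import pandas as pd

# A tiny CSV standing in for twitter-archive-enhanced.csv (assumption for the demo).
csv_text = 'tweet_id,rating_numerator\n892420643555336193,13\n892177421306343426,13\n'

# dtype={'tweet_id': str} keeps the identifier as text from the start.
demo = pd.read_csv(io.StringIO(csv_text), dtype={'tweet_id': str})
print(demo.dtypes)
```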

issue: timestamp has the wrong dtype

define: convert it to datetime

code:

In [84]:twitter_archive_enhanced_clean['timestamp'] = pd.to_datetime(twitter_archive_enhanced_clean['timestamp'])

Test

In [85]:twitter_archive_enhanced_clean['timestamp']

Out[85]:

0      2017-08-01 16:23:56+00:00
1      2017-08-01 00:17:27+00:00
2      2017-07-31 00:18:03+00:00
3      2017-07-30 15:58:51+00:00
4      2017-07-29 16:00:24+00:00
                  ...
2351   2015-11-16 00:24:50+00:00
2352   2015-11-16 00:04:52+00:00
2353   2015-11-15 23:21:54+00:00
2354   2015-11-15 23:05:30+00:00
2355   2015-11-15 22:32:08+00:00
Name: timestamp, Length: 2356, dtype: datetime64[ns, UTC]

issue: 55 dogs are named 'a'; inspecting one of these tweets shows that 'a' is clearly not a dog's name

define: replace 'a' with NaN

code:

In [86]:twitter_archive_enhanced_clean['name']= twitter_archive_enhanced_clean['name'].replace('a',np.nan)

Test

In [88]:twitter_archive_enhanced_clean['name'].value_counts()

Out[88]:

None        745
Charlie      12
Lucy         11
Oliver       11
Cooper       11
           ...
Karll         1
Tiger         1
old           1
Meatball      1
Stormy        1
Name: name, Length: 956, dtype: int64
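The counts above still list `old`, which suggests that `a` is not the only non-name in the column. A sketch of a broader rule follows, under the assumption (not made in the original notebook) that a real dog name starts with an uppercase letter. The sample values are taken from the output above.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Charlie', 'a', 'old', 'None', 'Lucy'], name='name')

# Assumption: a real dog name starts with an uppercase letter, so any entry
# that starts with a lowercase letter is treated as missing.
starts_lower = names.str.match(r'^[a-z]')
cleaned = names.mask(starts_lower, np.nan)
print(cleaned.tolist())
```

The placeholder string 'None' is left untouched here; whether to convert it to NaN as well is a separate decision.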

Issue: the denominator is not always 10

define: Drop the row whose rating_denominator is 0, create a new column rating = rating_numerator / rating_denominator, then drop rating_numerator and rating_denominator.

Code:

In [90]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator!= 0]

In [91]:twitter_archive_enhanced_clean['rating']=twitter_archive_enhanced_clean.rating_numerator/twitter_archive_enhanced_clean.rating_denominator

In [92]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean.drop(['rating_numerator','rating_denominator'],axis=1)

Test:

In [93]:twitter_archive_enhanced_clean

Out[93]:

  tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls name doggo floofer pupper puppo rating
0 892420643555336193 NaN NaN 2017-08-01 16:23:56+00:00 <a href="http://twitter.com/download/iphone" r… This is Phineas. He's a mystical boy. Only eve… NaN NaN NaN https://twitter.com/dog_rates/status/892420643… Phineas None None None None 1.3
1 892177421306343426 NaN NaN 2017-08-01 00:17:27+00:00 <a href="http://twitter.com/download/iphone" r… This is Tilly. She's just checking pup on you…. NaN NaN NaN https://twitter.com/dog_rates/status/892177421… Tilly None None None None 1.3
2 891815181378084864 NaN NaN 2017-07-31 00:18:03+00:00 <a href="http://twitter.com/download/iphone" r… This is Archie. He is a rare Norwegian Pouncin… NaN NaN NaN https://twitter.com/dog_rates/status/891815181… Archie None None None None 1.2
3 891689557279858688 NaN NaN 2017-07-30 15:58:51+00:00 <a href="http://twitter.com/download/iphone" r… This is Darla. She commenced a snooze mid meal… NaN NaN NaN https://twitter.com/dog_rates/status/891689557… Darla None None None None 1.3
4 891327558926688256 NaN NaN 2017-07-29 16:00:24+00:00 <a href="http://twitter.com/download/iphone" r… This is Franklin. He would like you to stop ca… NaN NaN NaN https://twitter.com/dog_rates/status/891327558… Franklin None None None None 1.2
2351 666049248165822465 NaN NaN 2015-11-16 00:24:50+00:00 <a href="http://twitter.com/download/iphone" r… Here we have a 1949 1st generation vulpix. Enj… NaN NaN NaN https://twitter.com/dog_rates/status/666049248… None None None None None 0.5
2352 666044226329800704 NaN NaN 2015-11-16 00:04:52+00:00 <a href="http://twitter.com/download/iphone" r… This is a purebred Piers Morgan. Loves to Netf… NaN NaN NaN https://twitter.com/dog_rates/status/666044226… NaN None None None None 0.6
2353 666033412701032449 NaN NaN 2015-11-15 23:21:54+00:00 <a href="http://twitter.com/download/iphone" r… Here is a very happy pup. Big fan of well-main… NaN NaN NaN https://twitter.com/dog_rates/status/666033412… NaN None None None None 0.9
2354 666029285002620928 NaN NaN 2015-11-15 23:05:30+00:00 <a href="http://twitter.com/download/iphone" r… This is a western brown Mitsubishi terrier. Up… NaN NaN NaN https://twitter.com/dog_rates/status/666029285… NaN None None None None 0.7
2355 666020888022790149 NaN NaN 2015-11-15 22:32:08+00:00 <a href="http://twitter.com/download/iphone" r… Here we have a Japanese Irish Setter. Lost eye… NaN NaN NaN https://twitter.com/dog_rates/status/666020888… None None None None None 0.8

2355 rows × 16 columns
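Assigning a new column to a filtered frame can trigger pandas' SettingWithCopyWarning. A minimal sketch of the same three steps on toy data, with an explicit `.copy()` after the filter; the numerator and denominator values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'rating_numerator': [13, 5, 7, 84],
    'rating_denominator': [10, 10, 0, 70],
})

# .copy() makes the filtered frame independent, so the column assignment
# below does not raise SettingWithCopyWarning.
toy_clean = toy[toy['rating_denominator'] != 0].copy()
toy_clean['rating'] = toy_clean['rating_numerator'] / toy_clean['rating_denominator']
toy_clean = toy_clean.drop(columns=['rating_numerator', 'rating_denominator'])
print(toy_clean)
```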

Issue: duplicated jpg_url values

define: drop the rows with a duplicated jpg_url

code:

In [94]:image_predictions_clean=image_predictions_clean[~image_predictions_clean.jpg_url.duplicated()]

Test:

In [95]:sum(image_predictions_clean.jpg_url.duplicated())

Out[95]:

0
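The boolean mask `~duplicated()` keeps the first row for each URL. `drop_duplicates` with `subset` does the same thing by name, as the toy sketch below suggests; the URLs are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'jpg_url': ['https://example.com/a.jpg',
                'https://example.com/b.jpg',
                'https://example.com/a.jpg'],
})

# keep='first' retains the first row for each URL, matching ~duplicated().
deduped = toy.drop_duplicates(subset='jpg_url', keep='first')
print(deduped['jpg_url'].duplicated().sum())
```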

Issue: in_reply_to_status_id and in_reply_to_user_id are almost entirely null (78 non-null values in the raw data)

Define: drop both columns

Code:

In [96]:twitter_archive_enhanced_clean.drop(columns=['in_reply_to_status_id','in_reply_to_user_id'],inplace=True)

Test:

In [97]:twitter_archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2355 non-null   object
 1   timestamp                   2355 non-null   datetime64[ns, UTC]
 2   source                      2355 non-null   object
 3   text                        2355 non-null   object
 4   retweeted_status_id         181 non-null    float64
 5   retweeted_status_user_id    181 non-null    float64
 6   retweeted_status_timestamp  181 non-null    object
 7   expanded_urls               2297 non-null   object
 8   name                        2300 non-null   object
 9   doggo                       2355 non-null   object
 10  floofer                     2355 non-null   object
 11  pupper                      2355 non-null   object
 12  puppo                       2355 non-null   object
 13  rating                      2355 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(3), object(10)
memory usage: 276.0+ KB

Issue: source contains HTML markup

define: keep only the link text inside the <a> tag

Code:

In [98]:twitter_archive_enhanced_clean['source'] = twitter_archive_enhanced_clean['source'].str.extract('>(.+)<', expand=False)

Test

In [99]:twitter_archive_enhanced_clean['source'].value_counts()

Out[99]:

Twitter for iPhone     2220
Vine - Make a Scene      91
Twitter Web Client       33
TweetDeck                11
Name: source, dtype: int64
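The pattern `>(.+)<` is greedy and works here because each source value holds exactly one anchor tag. A slightly stricter pattern is sketched below with the standard `re` module on the sample value shown earlier; `[^<]+` is a suggested variant, not the pattern the notebook uses.

```python
import re

source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# [^<]+ stops at the first '<', so the match cannot run past the link text.
match = re.search(r'>([^<]+)<', source)
label = match.group(1) if match else None
print(label)
```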

Issue: the text column contains URLs

define: remove the URLs

code:

In [100]:twitter_archive_enhanced_clean['text'] = twitter_archive_enhanced_clean['text'].replace(r'https.*','',regex=True)

test

In [101]:twitter_archive_enhanced_clean.text

Out[101]:

0       This is Phineas. He's a mystical boy. Only eve...
1       This is Tilly. She's just checking pup on you....
2       This is Archie. He is a rare Norwegian Pouncin...
3       This is Darla. She commenced a snooze mid meal...
4       This is Franklin. He would like you to stop ca...
                              ...
2351    Here we have a 1949 1st generation vulpix. Enj...
2352    This is a purebred Piers Morgan. Loves to Netf...
2353    Here is a very happy pup. Big fan of well-main...
2354    This is a western brown Mitsubishi terrier. Up...
2355    Here we have a Japanese Irish Setter. Lost eye...
Name: text, Length: 2355, dtype: object
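Note that `https.*` removes everything from the first link to the end of the text, including any words that follow the link. If only the URL itself should go, a narrower pattern can be used. A sketch with the standard `re` module on the tweet text shown earlier; `https?://\S+` is a suggested alternative, not the pattern the notebook uses.

```python
import re

text = ('Here is a perfect example of someone who has their priorities in order. '
        '13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq')

# https?://\S+ matches one URL up to the next whitespace; rstrip removes the
# space that preceded a trailing link.
cleaned_text = re.sub(r'https?://\S+', '', text).rstrip()
print(cleaned_text)
```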

issue: the data contains retweets

define: remove the retweet rows

code:

In [102]:

twitter_archive_enhanced_clean=twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.retweeted_status_id.isnull()]

twitter_archive_enhanced_clean=twitter_archive_enhanced_clean.drop(['retweeted_status_id'],axis=1)

Test

In [103]:twitter_archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 0 to 2355
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2174 non-null   object
 1   timestamp                   2174 non-null   datetime64[ns, UTC]
 2   source                      2174 non-null   object
 3   text                        2174 non-null   object
 4   retweeted_status_user_id    0 non-null      float64
 5   retweeted_status_timestamp  0 non-null      object
 6   expanded_urls               2117 non-null   object
 7   name                        2119 non-null   object
 8   doggo                       2174 non-null   object
 9   floofer                     2174 non-null   object
 10  pupper                      2174 non-null   object
 11  puppo                       2174 non-null   object
 12  rating                      2174 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(2), object(10)
memory usage: 237.8+ KB

issue: dog stage is a single variable and should be one column

define: extract the stage words from text into a single stage column, then drop the four original columns

code

In [104]:

twitter_archive_enhanced_clean['stage']= twitter_archive_enhanced_clean.text.str.findall('(doggo|pupper|puppo|floofer)')

twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean['stage'].apply(lambda x: ','.join(set(x)))

In [105]:

twitter_archive_enhanced_clean['stage']=twitter_archive_enhanced_clean['stage'].replace('',np.nan)

In [106]:

twitter_archive_enhanced_clean.drop(columns=['doggo','puppo','pupper','floofer'],inplace=True)

Test

In [107]:

twitter_archive_enhanced_clean.stage.value_counts()

Out[107]:

pupper          242
doggo            78
puppo            30
pupper,doggo      8
floofer           4
puppo,doggo       2
Name: stage, dtype: int64
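The combined labels such as `pupper,doggo` come from joining a `set`, whose iteration order is not guaranteed, so the same pair could in principle appear as `doggo,pupper` in another run. Sorting before joining gives one stable label per combination. `join_stages` below is a hypothetical helper, and the sentences are made up to exercise it.

```python
import re

STAGE_PATTERN = re.compile(r'doggo|pupper|puppo|floofer')


def join_stages(text):
    """Return the distinct stage words found in `text`, sorted and comma-joined."""
    found = STAGE_PATTERN.findall(text)
    return ','.join(sorted(set(found)))


# Made-up sentences used only to exercise the helper.
print(join_stages('This pupper met a doggo and another pupper'))
print(repr(join_stages('No stage word here')))
```

On the real data this could be applied with `twitter_archive_enhanced_clean.text.apply(join_stages)`, followed by the same replacement of empty strings with NaN.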

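ISSUE: the three datasets share one observational unit and can be merged into a single dataset. Tweets without photos can be removed at the same time.

define: merge the three datasets; the inner merge with image_predictions removes the tweets without photos

code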

In [108]:

df1_clean = twitter_archive_enhanced_clean.merge(image_predictions_clean,how='inner',on='tweet_id')

In [109]:

df_clean = df1_clean.merge(tweet_json_clean,how='left',on='tweet_id')

test

In [110]:

df_clean

Out[110]:

  tweet_id timestamp source text retweeted_status_user_id retweeted_status_timestamp expanded_urls name rating stage p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog retweet_count favorite_count
0 892420643555336193 2017-08-01 16:23:56+00:00 Twitter for iPhone This is Phineas. He's a mystical boy. Only eve… NaN NaN https://twitter.com/dog_rates/status/892420643… Phineas 1.3 NaN 0.097049 False bagel 0.085851 False banana 0.076110 False 8842 39492
1 892177421306343426 2017-08-01 00:17:27+00:00 Twitter for iPhone This is Tilly. She's just checking pup on you…. NaN NaN https://twitter.com/dog_rates/status/892177421… Tilly 1.3 NaN 0.323581 True Pekinese 0.090647 True papillon 0.068957 True 6480 33786
2 891815181378084864 2017-07-31 00:18:03+00:00 Twitter for iPhone This is Archie. He is a rare Norwegian Pouncin… NaN NaN https://twitter.com/dog_rates/status/891815181… Archie 1.2 NaN 0.716012 True malamute 0.078253 True kelpie 0.031379 True 4301 25445
3 891689557279858688 2017-07-30 15:58:51+00:00 Twitter for iPhone This is Darla. She commenced a snooze mid meal… NaN NaN https://twitter.com/dog_rates/status/891689557… Darla 1.3 NaN 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False 8925 42863
4 891327558926688256 2017-07-29 16:00:24+00:00 Twitter for iPhone This is Franklin. He would like you to stop ca… NaN NaN https://twitter.com/dog_rates/status/891327558… Franklin 1.2 NaN 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True 9721 41016
1989 666049248165822465 2015-11-16 00:24:50+00:00 Twitter for iPhone Here we have a 1949 1st generation vulpix. Enj… NaN NaN https://twitter.com/dog_rates/status/666049248… None 0.5 NaN 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True 41 111
1990 666044226329800704 2015-11-16 00:04:52+00:00 Twitter for iPhone This is a purebred Piers Morgan. Loves to Netf… NaN NaN https://twitter.com/dog_rates/status/666044226… NaN 0.6 NaN 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True 147 309
1991 666033412701032449 2015-11-15 23:21:54+00:00 Twitter for iPhone Here is a very happy pup. Big fan of well-main… NaN NaN https://twitter.com/dog_rates/status/666033412… NaN 0.9 NaN 0.596461 True malinois 0.138584 True bloodhound 0.116197 True 47 128
1992 666029285002620928 2015-11-15 23:05:30+00:00 Twitter for iPhone This is a western brown Mitsubishi terrier. Up… NaN NaN https://twitter.com/dog_rates/status/666029285… NaN 0.7 NaN 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True 48 132
1993 666020888022790149 2015-11-15 22:32:08+00:00 Twitter for iPhone Here we have a Japanese Irish Setter. Lost eye… NaN NaN https://twitter.com/dog_rates/status/666020888… None 0.8 NaN 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True 530 2528

1994 rows × 23 columns
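The two merges behave differently: the inner merge drops tweets with no image prediction, while the left merge keeps every remaining tweet even when no counts exist for it. A toy sketch of that behavior follows; the frames and values are made up, and `validate='one_to_one'` is an optional safeguard that the notebook does not use.

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating': [1.3, 1.2, 0.5]})
images = pd.DataFrame({'tweet_id': ['1', '3'], 'p1': ['bagel', 'pug']})
counts = pd.DataFrame({'tweet_id': ['1'], 'retweet_count': [8842]})

# Inner join: tweets without an image prediction (tweet_id '2') are dropped.
with_images = archive.merge(images, how='inner', on='tweet_id', validate='one_to_one')

# Left join: every remaining tweet is kept; missing counts become NaN.
merged = with_images.merge(counts, how='left', on='tweet_id', validate='one_to_one')
print(merged)
```

With `validate='one_to_one'`, pandas raises an error if tweet_id is duplicated on either side, which would otherwise silently multiply rows.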

Save the dataset

In [112]:

#save the file

save_file_name = 'twitter_archive_master.csv'

df_clean.to_csv(save_file_name, encoding='utf-8',index=False)

Analysis and Visualization

In [114]:

#data analysis

data = pd.read_csv('twitter_archive_master.csv', encoding='utf-8')

In [115]:

data.head(10)

Out[115]:

  tweet_id timestamp source text retweeted_status_user_id retweeted_status_timestamp expanded_urls name rating stage p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog retweet_count favorite_count
0 892420643555336193 2017-08-01 16:23:56+00:00 Twitter for iPhone This is Phineas. He's a mystical boy. Only eve… NaN NaN https://twitter.com/dog_rates/status/892420643… Phineas 1.3 NaN 0.097049 False bagel 0.085851 False banana 0.076110 False 8842 39492
1 892177421306343426 2017-08-01 00:17:27+00:00 Twitter for iPhone This is Tilly. She's just checking pup on you…. NaN NaN https://twitter.com/dog_rates/status/892177421… Tilly 1.3 NaN 0.323581 True Pekinese 0.090647 True papillon 0.068957 True 6480 33786
2 891815181378084864 2017-07-31 00:18:03+00:00 Twitter for iPhone This is Archie. He is a rare Norwegian Pouncin… NaN NaN https://twitter.com/dog_rates/status/891815181… Archie 1.2 NaN 0.716012 True malamute 0.078253 True kelpie 0.031379 True 4301 25445
3 891689557279858688 2017-07-30 15:58:51+00:00 Twitter for iPhone This is Darla. She commenced a snooze mid meal… NaN NaN https://twitter.com/dog_rates/status/891689557… Darla 1.3 NaN 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False 8925 42863
4 891327558926688256 2017-07-29 16:00:24+00:00 Twitter for iPhone This is Franklin. He would like you to stop ca… NaN NaN https://twitter.com/dog_rates/status/891327558… Franklin 1.2 NaN 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True 9721 41016
5 891087950875897856 2017-07-29 00:08:17+00:00 Twitter for iPhone Here we have a majestic great white breaching … NaN NaN https://twitter.com/dog_rates/status/891087950… None 1.3 NaN 0.425595 True Irish_terrier 0.116317 True Indian_elephant 0.076902 False 3240 20548
6 890971913173991426 2017-07-28 16:27:12+00:00 Twitter for iPhone Meet Jax. He enjoys ice cream so much he gets … NaN NaN https://gofundme.com/ydvmve-surgery-for-jax,ht… Jax 1.3 NaN 0.341703 True Border_collie 0.199287 True ice_lolly 0.193548 False 2142 12053
7 890729181411237888 2017-07-28 00:22:40+00:00 Twitter for iPhone When you watch your owner call another dog a g… NaN NaN https://twitter.com/dog_rates/status/890729181… None 1.3 NaN 0.566142 True Eskimo_dog 0.178406 True Pembroke 0.076507 True 19548 66596
8 890609185150312448 2017-07-27 16:25:51+00:00 Twitter for iPhone This is Zoey. She doesn't want to be one of th… NaN NaN https://twitter.com/dog_rates/status/890609185… Zoey 1.3 NaN 0.487574 True Irish_setter 0.193054 True Chesapeake_Bay_retriever 0.118184 True 4403 28187
9 890240255349198849 2017-07-26 15:59:51+00:00 Twitter for iPhone This is Cassie. She is a college pup. Studying… NaN NaN https://twitter.com/dog_rates/status/890240255… Cassie 1.4 doggo 0.511319 True Cardigan 0.451038 True Chihuahua 0.029248 True 7684 32467

10 rows × 23 columns

In [116]:data.favorite_count.describe()

Out[116]:

count      1994.000000
mean       8923.133400
std       12400.238808
min          81.000000
25%        1972.250000
50%        4117.000000
75%       11275.500000
max      132318.000000
Name: favorite_count, dtype: float64

In [117]:data.retweet_count.describe()

Out[117]:

count     1994.000000
mean      2770.021063
std       4715.961325
min         15.000000
25%        622.250000
50%       1348.500000
75%       3202.750000
max      79116.000000
Name: retweet_count, dtype: float64

In [118]:

import matplotlib.pyplot as plt

%matplotlib inline

In [119]:

plt.bar(x=['favorite_count','retweet_count'], height = [data.favorite_count.sum(),data.retweet_count.sum()])

plt.title('Number of Favorite count VS Retweet Count')

Out[119]:

Text(0.5, 1.0, 'Number of Favorite count VS Retweet Count')

* So the first conclusion is: the total favorite count is higher than the total retweet count.

In [120]:data[data.p1_conf > 0.5].p1.value_counts()

Out[120]:

golden_retriever       116
Pembroke                70
Labrador_retriever      65
Chihuahua               47
pug                     43
                      ...
scorpion                 1
Appenzeller              1
flamingo                 1
axolotl                  1
Irish_water_spaniel      1
Name: p1, Length: 245, dtype: int64

the second conclusion: among first predictions (p1) with confidence above 0.5, the most frequent breed is golden_retriever
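The counts above also include non-dog predictions such as scorpion and flamingo, because the filter uses only p1_conf. A sketch on toy data that additionally requires p1_dog to be True; the rows are made up, and the extra condition is a suggestion rather than part of the original analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    'p1': ['golden_retriever', 'golden_retriever', 'scorpion', 'pug', 'flamingo'],
    'p1_conf': [0.9, 0.7, 0.8, 0.6, 0.4],
    'p1_dog': [True, True, False, True, False],
})

# Keep confident predictions that the classifier also flags as a dog breed.
confident_dogs = toy[(toy['p1_conf'] > 0.5) & toy['p1_dog']]
breed_counts = confident_dogs['p1'].value_counts()
print(breed_counts)
```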

In [121]:data['rating'].value_counts()

Out[121]:

1.200000      454
1.000000      421
1.100000      402
1.300000      261
0.900000      151
0.800000       95
0.700000       51
1.400000       35
0.500000       34
0.600000       32
0.300000       19
0.400000       15
0.200000       10
0.100000        4
0.000000        2
177.600000      1
2.600000        1
3.428571        1
0.636364        1
0.818182        1
42.000000       1
7.500000        1
2.700000        1
Name: rating, dtype: int64

#the third conclusion: most ratings are above 1.0, that is, the numerator is usually larger than the denominator
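The share of ratings above 1.0 can be computed directly from the value counts. The sketch below re-enters the counts from the output above by hand; with the real data the same result would come from `(data['rating'] > 1.0).mean()`.

```python
import pandas as pd

# Counts copied from the value_counts() output above (rating -> number of tweets).
rating_counts = pd.Series({
    1.2: 454, 1.0: 421, 1.1: 402, 1.3: 261, 0.9: 151, 0.8: 95, 0.7: 51,
    1.4: 35, 0.5: 34, 0.6: 32, 0.3: 19, 0.4: 15, 0.2: 10, 0.1: 4, 0.0: 2,
    177.6: 1, 2.6: 1, 3.428571: 1, 0.636364: 1, 0.818182: 1, 42.0: 1,
    7.5: 1, 2.7: 1,
})

total = rating_counts.sum()
above_one = rating_counts[rating_counts.index > 1.0].sum()
print(f'{above_one} of {total} tweets ({above_one / total:.1%}) have a rating above 1.0')
```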

Published by

风君子
