Data Gathering
Import the required libraries
In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import json
import os
Open and assess twitter-archive-enhanced
In [61]:twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')
In [62]:twitter_archive_enhanced.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2356 non-null   int64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object
 4   source                      2356 non-null   object
 5   text                        2356 non-null   object
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object
 9   expanded_urls               2297 non-null   object
 10  rating_numerator            2356 non-null   int64
 11  rating_denominator          2356 non-null   int64
 12  name                        2356 non-null   object
 13  doggo                       2356 non-null   object
 14  floofer                     2356 non-null   object
 15  pupper                      2356 non-null   object
 16  puppo                       2356 non-null   object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
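The info output above shows that tweet_id and timestamp have the wrong types, and that in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values. expanded_urls contains nulls, which are tweets without photos; per the project requirements, these rows must be removed later.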
In [63]:twitter_archive_enhanced.retweeted_status_id.notnull().value_counts()
Out[63]:
False 2175
True 181
Name: retweeted_status_id, dtype: int64
Rows with a non-null retweeted_status_id are retweets. There are 181 of them, and per the project requirements they must be removed later.
In [64]:twitter_archive_enhanced.name.value_counts()
Out[64]:
None 745
a 55
Charlie 12
Oliver 11
Lucy 11
...
Karll 1
Tiger 1
old 1
Meatball 1
Stormy 1
Name: name, Length: 957, dtype: int64
In [65]:twitter_archive_enhanced.text[twitter_archive_enhanced.name=='a'].iloc[1]
Out[65]:
'Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq'
* 55 dogs are named "a"; inspecting one such row shows that "a" is clearly not a dog's name, so this is a quality issue
* text contains links
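The value_counts() output also lists "old" as a name, which suggests "a" may not be the only non-name. A minimal sketch for surfacing such values, assuming real names are capitalised and all-lowercase entries are suspect; the sample Series is a toy stand-in for the name column:

```python
import pandas as pd

# Toy stand-in for the `name` column; the values are illustrative only.
names = pd.Series(['Charlie', 'a', 'None', 'old', 'Lucy', 'a'])

# Real dog names are capitalised, so all-lowercase entries are suspect.
suspect = names[names.str.islower()]
suspect_counts = suspect.value_counts()
print(suspect_counts)
```

On the real column the same filter would be applied to twitter_archive_enhanced.name; whether every lowercase value is truly a non-name still needs a manual check.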
In [66]:twitter_archive_enhanced.rating_denominator.value_counts()
Out[66]:
10 2333
11 3
50 3
80 2
20 2
2 1
16 1
40 1
70 1
15 1
90 1
110 1
120 1
130 1
150 1
170 1
7 1
0 1
Name: rating_denominator, dtype: int64
So rating_denominator is not always 10.
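Before deciding how to clean these rows, it can help to read the tweets behind them. A minimal sketch of that check; the three rows and their texts are invented for illustration, and the possible causes they hint at (ratings for several dogs at once, or an unrelated fraction in the text) are assumptions to verify, not findings from this notebook:

```python
import pandas as pd

# Toy rows; the texts are invented to illustrate possible causes.
df = pd.DataFrame({
    'text': ['Good boy. 13/10',
             'Seven pups in one basket. 84/70',
             'Awake 24/7. 11/10 anyway'],
    'rating_numerator': [13, 84, 24],
    'rating_denominator': [10, 70, 7],
})

# Keep only the rows whose denominator differs from the usual 10.
unusual = df[df['rating_denominator'] != 10]
for _, row in unusual.iterrows():
    print(row['rating_numerator'], row['rating_denominator'], row['text'])
```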
In [67]:twitter_archive_enhanced.source.iloc[0]
Out[67]:
'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
source contains HTML markup
The dataset also has a tidiness issue: dog stage is a single variable, so doggo, floofer, pupper and puppo should be one column
Gather and assess image-predictions
In [68]:
folder_name = 'pred-image'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'
response = requests.get(url)
response
Out[68]:
<Response [200]>
In [69]:
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
In [70]:os.listdir(folder_name)
Out[70]:
['image-predictions.tsv']
In [71]:image_predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')
In [72]:image_predictions.head()
Out[72]:
| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
In [73]:image_predictions.jpg_url.duplicated().value_counts()
Out[73]:
False 2009
True 66
Name: jpg_url, dtype: int64
There are 66 duplicated image links
tweet_id has the wrong type
Open and assess tweet_json
In [74]:tweet_json = pd.DataFrame()
In [75]:
with open('tweet_json.txt', 'r') as file:
    for line in file:
        # Each line holds one tweet as a JSON object.
        dic = json.loads(line)
        tweet_id = dic['id']
        retweet_count = dic['retweet_count']
        favorite_count = dic['favorite_count']
        tem_df = pd.DataFrame({'tweet_id': tweet_id,
                               'retweet_count': retweet_count,
                               'favorite_count': favorite_count}, index=[0])
        tweet_json = pd.concat([tweet_json, tem_df])
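Calling pd.concat once per line copies the growing frame on every iteration and leaves every row with index 0. A minimal sketch of an alternative that builds the frame once, assuming each line is a JSON object with id, retweet_count and favorite_count keys, as the loop above implies. The in-memory sample stands in for tweet_json.txt and reuses two rows from this notebook's output; tweet_json_alt is a hypothetical name:

```python
import io
import json

import pandas as pd

# Two-line sample standing in for tweet_json.txt.
sample = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 8842, "favorite_count": 39492}\n'
    '{"id": 892177421306343426, "retweet_count": 6480, "favorite_count": 33786}\n'
)

# Collect plain dicts first, then build the DataFrame in one step.
records = []
for line in sample:
    tweet = json.loads(line)
    records.append({
        'tweet_id': tweet['id'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

tweet_json_alt = pd.DataFrame(records)
print(tweet_json_alt)
```

Unlike the loop above, this version gives each row a distinct index (0, 1, ...).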
In [76]:
tweet_json
Out[76]:
| | tweet_id | retweet_count | favorite_count |
---|---|---|---|
0 | 892420643555336193 | 8842 | 39492 |
0 | 892177421306343426 | 6480 | 33786 |
0 | 891815181378084864 | 4301 | 25445 |
0 | 891689557279858688 | 8925 | 42863 |
0 | 891327558926688256 | 9721 | 41016 |
… | … | … | … |
0 | 666049248165822465 | 41 | 111 |
0 | 666044226329800704 | 147 | 309 |
0 | 666033412701032449 | 47 | 128 |
0 | 666029285002620928 | 48 | 132 |
0 | 666020888022790149 | 530 | 2528 |
2352 rows × 3 columns
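tweet_id has the wrong type
In summary,
Quality issues in the datasets:
- tweet_id and timestamp have the wrong types
- jpg_url has 66 duplicated links
- source contains HTML markup
- rating_denominator is not always 10, and one row has a denominator of 0
- 55 dogs are named "a"; inspecting one such row shows that "a" is clearly not a dog's name
- text contains links
- Rows with a non-null retweeted_status_id are retweets; the 181 retweets must be removed later, per the project requirements
- in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values
- Tweets without photos must be removed later, per the project requirements
Tidiness issues:
- Dog stage is a single variable, so doggo, floofer, pupper and puppo should be one column
- The three datasets describe the same observational unit, keyed by tweet_id, and can be merged into one
Data Cleaning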
In [77]:
twitter_archive_enhanced_clean = twitter_archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
tweet_json_clean = tweet_json.copy()
issue: tweet_id has the wrong type
define: convert tweet_id to str
code:
In [78]:twitter_archive_enhanced_clean['tweet_id'] = twitter_archive_enhanced_clean['tweet_id'].astype('str')
In [79]:image_predictions_clean['tweet_id'] = image_predictions_clean['tweet_id'].astype('str')
In [80]:tweet_json_clean['tweet_id'] = tweet_json_clean['tweet_id'].astype('str')
Test
In [81]:twitter_archive_enhanced_clean['tweet_id']
Out[81]:
0 892420643555336193
1 892177421306343426
2 891815181378084864
3 891689557279858688
4 891327558926688256
...
2351 666049248165822465
2352 666044226329800704
2353 666033412701032449
2354 666029285002620928
2355 666020888022790149
Name: tweet_id, Length: 2356, dtype: object
In [82]:image_predictions_clean['tweet_id']
Out[82]:
0 666020888022790149
1 666029285002620928
2 666033412701032449
3 666044226329800704
4 666049248165822465
...
2070 891327558926688256
2071 891689557279858688
2072 891815181378084864
2073 892177421306343426
2074 892420643555336193
Name: tweet_id, Length: 2075, dtype: object
In [83]:tweet_json_clean['tweet_id']
Out[83]:
0 892420643555336193
0 892177421306343426
0 891815181378084864
0 891689557279858688
0 891327558926688256
...
0 666049248165822465
0 666044226329800704
0 666033412701032449
0 666029285002620928
0 666020888022790149
Name: tweet_id, Length: 2352, dtype: object
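One practical reason to keep IDs as strings: tweet IDs are larger than 2**53, so a float64 cannot hold them exactly, and pandas upcasts an integer column to float64 as soon as it contains NaN. A small sketch of the precision loss, using an ID from the output above:

```python
import numpy as np
import pandas as pd

tweet_id = 892420643555336193  # taken from the output above

# float64 has a 53-bit significand, so this ID cannot be stored exactly.
as_float = np.float64(tweet_id)
round_trip = int(as_float)
print(tweet_id, round_trip, tweet_id == round_trip)

# Strings keep every digit and cannot be summed or averaged by accident.
ids = pd.Series([tweet_id]).astype(str)
print(ids.iloc[0])
```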
issue: timestamp has the wrong type
define: convert it to datetime
code:
In [84]:twitter_archive_enhanced_clean['timestamp'] = pd.to_datetime(twitter_archive_enhanced_clean['timestamp'])
Test
In [85]:twitter_archive_enhanced_clean['timestamp']
Out[85]:
0 2017-08-01 16:23:56+00:00
1 2017-08-01 00:17:27+00:00
2 2017-07-31 00:18:03+00:00
3 2017-07-30 15:58:51+00:00
4 2017-07-29 16:00:24+00:00
...
2351 2015-11-16 00:24:50+00:00
2352 2015-11-16 00:04:52+00:00
2353 2015-11-15 23:21:54+00:00
2354 2015-11-15 23:05:30+00:00
2355 2015-11-15 22:32:08+00:00
Name: timestamp, Length: 2356, dtype: datetime64[ns, UTC]
issue: 55 dogs are named "a"; inspecting one such row shows that "a" is clearly not a dog's name
define: replace "a" with NaN
code:
In [86]:twitter_archive_enhanced_clean['name']= twitter_archive_enhanced_clean['name'].replace('a',np.nan)
Test
In [88]:twitter_archive_enhanced_clean['name'].value_counts()
Out[88]:
None 745
Charlie 12
Lucy 11
Oliver 11
Cooper 11
...
Karll 1
Tiger 1
old 1
Meatball 1
Stormy 1
Name: name, Length: 956, dtype: int64
Issue: rating_denominator is not always 10
define: Drop the row whose rating_denominator is 0, create a new column rating = rating_numerator / rating_denominator, then drop rating_numerator and rating_denominator.
Code:
In [90]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator!= 0]
In [91]:twitter_archive_enhanced_clean['rating']=twitter_archive_enhanced_clean.rating_numerator/twitter_archive_enhanced_clean.rating_denominator
In [92]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean.drop(['rating_numerator','rating_denominator'],axis=1)
Test:
In [93]:twitter_archive_enhanced_clean
Out[93]:
| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | name | doggo | floofer | pupper | puppo | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56+00:00 | <a href="http://twitter.com/download/iphone" r… | This is Phineas. He's a mystical boy. Only eve… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643… | Phineas | None | None | None | None | 1.3 |
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27+00:00 | <a href="http://twitter.com/download/iphone" r… | This is Tilly. She's just checking pup on you…. | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421… | Tilly | None | None | None | None | 1.3 |
2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03+00:00 | <a href="http://twitter.com/download/iphone" r… | This is Archie. He is a rare Norwegian Pouncin… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181… | Archie | None | None | None | None | 1.2 |
3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51+00:00 | <a href="http://twitter.com/download/iphone" r… | This is Darla. She commenced a snooze mid meal… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557… | Darla | None | None | None | None | 1.3 |
4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24+00:00 | <a href="http://twitter.com/download/iphone" r… | This is Franklin. He would like you to stop ca… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558… | Franklin | None | None | None | None | 1.2 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50+00:00 | <a href="http://twitter.com/download/iphone" r… | Here we have a 1949 1st generation vulpix. Enj… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248… | None | None | None | None | None | 0.5 |
2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52+00:00 | <a href="http://twitter.com/download/iphone" r… | This is a purebred Piers Morgan. Loves to Netf… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226… | NaN | None | None | None | None | 0.6 |
2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54+00:00 | <a href="http://twitter.com/download/iphone" r… | Here is a very happy pup. Big fan of well-main… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412… | NaN | None | None | None | None | 0.9 |
2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30+00:00 | <a href="http://twitter.com/download/iphone" r… | This is a western brown Mitsubishi terrier. Up… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285… | NaN | None | None | None | None | 0.7 |
2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08+00:00 | <a href="http://twitter.com/download/iphone" r… | Here we have a Japanese Irish Setter. Lost eye… | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888… | None | None | None | None | None | 0.8 |
2355 rows × 16 columns
Issue: jpg_url contains duplicated values
define: delete the duplicates, keeping the first occurrence
code:
In [94]:image_predictions_clean=image_predictions_clean[~image_predictions_clean.jpg_url.duplicated()]
Test:
In [95]:sum(image_predictions_clean.jpg_url.duplicated())
Out[95]:
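The boolean mask above keeps the first occurrence of each URL. pandas offers the same operation as drop_duplicates; a minimal sketch on a toy frame (the URLs are placeholders, not rows from image-predictions.tsv):

```python
import pandas as pd

# Toy frame; the URLs are placeholders.
preds = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'jpg_url': ['https://example.com/a.jpg',
                'https://example.com/b.jpg',
                'https://example.com/a.jpg'],
})

# Boolean-mask form used in the notebook.
by_mask = preds[~preds['jpg_url'].duplicated()]

# Equivalent built-in; keep='first' is the default, spelled out for clarity.
by_method = preds.drop_duplicates(subset='jpg_url', keep='first')

print(by_mask.equals(by_method))
```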
Issue: in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values
Define: drop them directly
Code:
In [96]:twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.drop(columns=['in_reply_to_status_id','in_reply_to_user_id'])
Test:
In [97]:twitter_archive_enhanced_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2355 non-null   object
 1   timestamp                   2355 non-null   datetime64[ns, UTC]
 2   source                      2355 non-null   object
 3   text                        2355 non-null   object
 4   retweeted_status_id         181 non-null    float64
 5   retweeted_status_user_id    181 non-null    float64
 6   retweeted_status_timestamp  181 non-null    object
 7   expanded_urls               2297 non-null   object
 8   name                        2300 non-null   object
 9   doggo                       2355 non-null   object
 10  floofer                     2355 non-null   object
 11  pupper                      2355 non-null   object
 12  puppo                       2355 non-null   object
 13  rating                      2355 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(3), object(10)
memory usage: 276.0+ KB
Issue: html content in source
define: delete html
Code:
In [98]:twitter_archive_enhanced_clean['source'] = twitter_archive_enhanced_clean['source'].str.extract('>(.+)<', expand=False)
Test
In [99]:twitter_archive_enhanced_clean['source'].value_counts()
Out[99]:
Twitter for iPhone 2220
Vine - Make a Scene 91
Twitter Web Client 33
TweetDeck 11
Name: source, dtype: int64
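To see what the pattern captures, here is the extraction applied to the single source value printed earlier in this notebook. This is a sketch on one string, not a check of every source value:

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
])

# expand=False returns a Series when the pattern has a single capture group.
label = source.str.extract(r'>(.+)<', expand=False)
print(label.iloc[0])
```

The greedy `.+` runs from the first `>` to the last `<`, which here is exactly the link text.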
Issue: the text column contains URLs
define: delete the URLs
code:
In [100]:twitter_archive_enhanced_clean['text'] = twitter_archive_enhanced_clean['text'].replace(r'https.*', '', regex=True)
test
In [101]:twitter_archive_enhanced_clean.text
Out[101]:
0 This is Phineas. He's a mystical boy. Only eve...
1 This is Tilly. She's just checking pup on you....
2 This is Archie. He is a rare Norwegian Pouncin...
3 This is Darla. She commenced a snooze mid meal...
4 This is Franklin. He would like you to stop ca...
...
2351 Here we have a 1949 1st generation vulpix. Enj...
2352 This is a purebred Piers Morgan. Loves to Netf...
2353 Here is a very happy pup. Big fan of well-main...
2354 This is a western brown Mitsubishi terrier. Up...
2355 Here we have a Japanese Irish Setter. Lost eye...
Name: text, Length: 2355, dtype: object
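The truncated output cannot show whether the links are gone, so here is the same pattern applied to the full tweet text printed earlier in this notebook. The trailing .str.strip() is an addition for illustration and is not part of the cleaning cell:

```python
import pandas as pd

text = pd.Series([
    'Here is a perfect example of someone who has their priorities in order. '
    '13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq',
])

# Same pattern as the cleaning cell: drop everything from the first "https" onward.
cleaned = text.str.replace(r'https.*', '', regex=True).str.strip()
print(cleaned.iloc[0])
```

Note that `https.*` also removes any words that follow the link on the same line; a narrower pattern such as `https\S+` would keep them.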
issue: the data contains retweets
define: delete the retweets
code:
In [102]:twitter_archive_enhanced_clean=twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.retweeted_status_id.isnull()]
twitter_archive_enhanced_clean=twitter_archive_enhanced_clean.drop(['retweeted_status_id'],axis=1)
Test
In [103]:twitter_archive_enhanced_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 0 to 2355
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2174 non-null   object
 1   timestamp                   2174 non-null   datetime64[ns, UTC]
 2   source                      2174 non-null   object
 3   text                        2174 non-null   object
 4   retweeted_status_user_id    0 non-null      float64
 5   retweeted_status_timestamp  0 non-null      object
 6   expanded_urls               2117 non-null   object
 7   name                        2119 non-null   object
 8   doggo                       2174 non-null   object
 9   floofer                     2174 non-null   object
 10  pupper                      2174 non-null   object
 11  puppo                       2174 non-null   object
 12  rating                      2174 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(2), object(10)
memory usage: 237.8+ KB
issue: dog stage is a single variable and should be one column
define: combine the stages into a single column
code
In [104]:
twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean.text.str.findall('(doggo|pupper|puppo|floofer)')
twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean['stage'].apply(lambda x: ','.join(set(x)))
In [105]:
twitter_archive_enhanced_clean['stage']=twitter_archive_enhanced_clean['stage'].replace('',np.nan)
In [106]:
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.drop(columns=['doggo','puppo','pupper','floofer'])
Test
In [107]:
twitter_archive_enhanced_clean.stage.value_counts()
Out[107]:
pupper 242
doggo 78
puppo 30
pupper,doggo 8
floofer 4
puppo,doggo 2
Name: stage, dtype: int64
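The cells above rebuild the stage from the tweet text and join a set, whose iteration order is not guaranteed. A hypothetical alternative is to derive the stage from the four original columns in a fixed order, before they are dropped. This sketch assumes each of those columns holds either its own name or the string 'None', as the earlier table output suggests; the rows are toy data:

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows in the archive's layout: each column holds its own name or 'None'.
df = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def combine(row):
    """Join the stages present in one row, in a fixed column order."""
    found = [value for value in row if value != 'None']
    return ','.join(found) if found else np.nan

df['stage'] = df[stage_cols].apply(combine, axis=1)
print(df['stage'].tolist())
```

The two approaches can disagree, since the text may mention a stage that the original columns did not record, so the counts above should not be expected to match exactly.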
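ISSUE: the three datasets describe the same observational unit and can be merged into one. Tweets without photos can be removed at the same time.
define: merge the three datasets and remove the tweets without photos
code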
In [108]:
df1_clean = twitter_archive_enhanced_clean.merge(image_predictions_clean,how='inner',on='tweet_id')
In [109]:
df_clean = df1_clean.merge(tweet_json_clean,how='left',on='tweet_id')
test
In [110]:
df_clean
Out[110]:
| | tweet_id | timestamp | source | text | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | name | rating | stage | … | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | retweet_count | favorite_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | 2017-08-01 16:23:56+00:00 | Twitter for iPhone | This is Phineas. He's a mystical boy. Only eve… | NaN | NaN | https://twitter.com/dog_rates/status/892420643… | Phineas | 1.3 | NaN | … | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False | 8842 | 39492 |
1 | 892177421306343426 | 2017-08-01 00:17:27+00:00 | Twitter for iPhone | This is Tilly. She's just checking pup on you…. | NaN | NaN | https://twitter.com/dog_rates/status/892177421… | Tilly | 1.3 | NaN | … | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True | 6480 | 33786 |
2 | 891815181378084864 | 2017-07-31 00:18:03+00:00 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin… | NaN | NaN | https://twitter.com/dog_rates/status/891815181… | Archie | 1.2 | NaN | … | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True | 4301 | 25445 |
3 | 891689557279858688 | 2017-07-30 15:58:51+00:00 | Twitter for iPhone | This is Darla. She commenced a snooze mid meal… | NaN | NaN | https://twitter.com/dog_rates/status/891689557… | Darla | 1.3 | NaN | … | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False | 8925 | 42863 |
4 | 891327558926688256 | 2017-07-29 16:00:24+00:00 | Twitter for iPhone | This is Franklin. He would like you to stop ca… | NaN | NaN | https://twitter.com/dog_rates/status/891327558… | Franklin | 1.2 | NaN | … | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True | 9721 | 41016 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
1989 | 666049248165822465 | 2015-11-16 00:24:50+00:00 | Twitter for iPhone | Here we have a 1949 1st generation vulpix. Enj… | NaN | NaN | https://twitter.com/dog_rates/status/666049248… | None | 0.5 | NaN | … | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True | 41 | 111 |
1990 | 666044226329800704 | 2015-11-16 00:04:52+00:00 | Twitter for iPhone | This is a purebred Piers Morgan. Loves to Netf… | NaN | NaN | https://twitter.com/dog_rates/status/666044226… | NaN | 0.6 | NaN | … | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True | 147 | 309 |
1991 | 666033412701032449 | 2015-11-15 23:21:54+00:00 | Twitter for iPhone | Here is a very happy pup. Big fan of well-main… | NaN | NaN | https://twitter.com/dog_rates/status/666033412… | NaN | 0.9 | NaN | … | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True | 47 | 128 |
1992 | 666029285002620928 | 2015-11-15 23:05:30+00:00 | Twitter for iPhone | This is a western brown Mitsubishi terrier. Up… | NaN | NaN | https://twitter.com/dog_rates/status/666029285… | NaN | 0.7 | NaN | … | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True | 48 | 132 |
1993 | 666020888022790149 | 2015-11-15 22:32:08+00:00 | Twitter for iPhone | Here we have a Japanese Irish Setter. Lost eye… | NaN | NaN | https://twitter.com/dog_rates/status/666020888… | None | 0.8 | NaN | … | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True | 530 | 2528 |
1994 rows × 23 columns
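The inner merge with image_predictions_clean is what removes tweets without photos, while the left merge keeps tweets that have no counts. When joining on tweet_id it can be worth asserting that the key is unique on both sides and recording which rows found a match. A minimal sketch with toy frames (archive and counts are hypothetical names and values):

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating': [1.3, 1.2, 1.1]})
counts = pd.DataFrame({'tweet_id': ['1', '2'], 'retweet_count': [10, 20]})

merged = archive.merge(
    counts,
    how='left',
    on='tweet_id',
    validate='one_to_one',  # raises MergeError if tweet_id repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
print(merged['_merge'].value_counts())
```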
Save the dataset
In [112]:
#save the file
save_file_name = 'twitter_archive_master.csv'
df_clean.to_csv(save_file_name, encoding='utf-8',index=False)
Analysis and Visualization
In [114]:
#data analysis
data = pd.read_csv('twitter_archive_master.csv', encoding='utf-8')
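A CSV file stores no dtype information, so a plain read_csv infers tweet_id as an integer again and leaves timestamp as text. A sketch of restoring both on load, using an in-memory two-column stand-in for twitter_archive_master.csv built from a row shown in this notebook:

```python
import io

import pandas as pd

# Two-column stand-in for twitter_archive_master.csv.
csv_text = (
    'tweet_id,timestamp\n'
    '892420643555336193,2017-08-01 16:23:56+00:00\n'
)

# Default read: dtypes are inferred from the text.
default = pd.read_csv(io.StringIO(csv_text))

# Explicit read: keep IDs as strings and parse the timestamps.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'tweet_id': str},
    parse_dates=['timestamp'],
)
print(default.dtypes)
print(typed.dtypes)
```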
In [115]:
data.head(10)
Out[115]:
| | tweet_id | timestamp | source | text | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | name | rating | stage | … | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | retweet_count | favorite_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | 2017-08-01 16:23:56+00:00 | Twitter for iPhone | This is Phineas. He's a mystical boy. Only eve… | NaN | NaN | https://twitter.com/dog_rates/status/892420643… | Phineas | 1.3 | NaN | … | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False | 8842 | 39492 |
1 | 892177421306343426 | 2017-08-01 00:17:27+00:00 | Twitter for iPhone | This is Tilly. She's just checking pup on you…. | NaN | NaN | https://twitter.com/dog_rates/status/892177421… | Tilly | 1.3 | NaN | … | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True | 6480 | 33786 |
2 | 891815181378084864 | 2017-07-31 00:18:03+00:00 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin… | NaN | NaN | https://twitter.com/dog_rates/status/891815181… | Archie | 1.2 | NaN | … | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True | 4301 | 25445 |
3 | 891689557279858688 | 2017-07-30 15:58:51+00:00 | Twitter for iPhone | This is Darla. She commenced a snooze mid meal… | NaN | NaN | https://twitter.com/dog_rates/status/891689557… | Darla | 1.3 | NaN | … | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False | 8925 | 42863 |
4 | 891327558926688256 | 2017-07-29 16:00:24+00:00 | Twitter for iPhone | This is Franklin. He would like you to stop ca… | NaN | NaN | https://twitter.com/dog_rates/status/891327558… | Franklin | 1.2 | NaN | … | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True | 9721 | 41016 |
5 | 891087950875897856 | 2017-07-29 00:08:17+00:00 | Twitter for iPhone | Here we have a majestic great white breaching … | NaN | NaN | https://twitter.com/dog_rates/status/891087950… | None | 1.3 | NaN | … | 0.425595 | True | Irish_terrier | 0.116317 | True | Indian_elephant | 0.076902 | False | 3240 | 20548 |
6 | 890971913173991426 | 2017-07-28 16:27:12+00:00 | Twitter for iPhone | Meet Jax. He enjoys ice cream so much he gets … | NaN | NaN | https://gofundme.com/ydvmve-surgery-for-jax,ht… | Jax | 1.3 | NaN | … | 0.341703 | True | Border_collie | 0.199287 | True | ice_lolly | 0.193548 | False | 2142 | 12053 |
7 | 890729181411237888 | 2017-07-28 00:22:40+00:00 | Twitter for iPhone | When you watch your owner call another dog a g… | NaN | NaN | https://twitter.com/dog_rates/status/890729181… | None | 1.3 | NaN | … | 0.566142 | True | Eskimo_dog | 0.178406 | True | Pembroke | 0.076507 | True | 19548 | 66596 |
8 | 890609185150312448 | 2017-07-27 16:25:51+00:00 | Twitter for iPhone | This is Zoey. She doesn't want to be one of th… | NaN | NaN | https://twitter.com/dog_rates/status/890609185… | Zoey | 1.3 | NaN | … | 0.487574 | True | Irish_setter | 0.193054 | True | Chesapeake_Bay_retriever | 0.118184 | True | 4403 | 28187 |
9 | 890240255349198849 | 2017-07-26 15:59:51+00:00 | Twitter for iPhone | This is Cassie. She is a college pup. Studying… | NaN | NaN | https://twitter.com/dog_rates/status/890240255… | Cassie | 1.4 | doggo | … | 0.511319 | True | Cardigan | 0.451038 | True | Chihuahua | 0.029248 | True | 7684 | 32467 |
10 rows × 23 columns
In [116]:data.favorite_count.describe()
Out[116]:
count 1994.000000
mean 8923.133400
std 12400.238808
min 81.000000
25% 1972.250000
50% 4117.000000
75% 11275.500000
max 132318.000000
Name: favorite_count, dtype: float64
In [117]:data.retweet_count.describe()
Out[117]:
count 1994.000000
mean 2770.021063
std 4715.961325
min 15.000000
25% 622.250000
50% 1348.500000
75% 3202.750000
max 79116.000000
Name: retweet_count, dtype: float64
In [118]:
import matplotlib.pyplot as plt
%matplotlib inline
In [119]:
plt.bar(x=['favorite_count','retweet_count'], height = [data.favorite_count.sum(),data.retweet_count.sum()])
plt.title('Number of Favorite count VS Retweet Count')
Out[119]:
Text(0.5, 1.0, 'Number of Favorite count VS Retweet Count')
* First conclusion: tweets collect more favorites than retweets (mean 8,923 versus 2,770 per tweet).
In [120]:data[data.p1_conf > 0.5].p1.value_counts()
Out[120]:
golden_retriever 116
Pembroke 70
Labrador_retriever 65
Chihuahua 47
pug 43
...
scorpion 1
Appenzeller 1
flamingo 1
axolotl 1
Irish_water_spaniel 1
Name: p1, Length: 245, dtype: int64
Second conclusion: among predictions with p1_conf > 0.5, golden_retriever is the most common breed (116 tweets).
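The counts above still include non-dog labels such as scorpion and flamingo, because only the confidence is filtered. A minimal sketch that also requires p1_dog to be True; the frame is toy data with made-up counts, and only the column names come from the dataset:

```python
import pandas as pd

# Toy predictions; labels mirror the kind of values in p1, counts are made up.
data = pd.DataFrame({
    'p1': ['golden_retriever', 'golden_retriever', 'scorpion', 'pug', 'flamingo'],
    'p1_conf': [0.9, 0.7, 0.8, 0.6, 0.95],
    'p1_dog': [True, True, False, True, False],
})

# Keep confident predictions that the classifier also flags as a dog breed.
dogs = data[(data['p1_conf'] > 0.5) & data['p1_dog']]
breed_counts = dogs['p1'].value_counts()
print(breed_counts)
```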
In [121]:data['rating'].value_counts()
Out[121]:
1.200000 454
1.000000 421
1.100000 402
1.300000 261
0.900000 151
0.800000 95
0.700000 51
1.400000 35
0.500000 34
0.600000 32
0.300000 19
0.400000 15
0.200000 10
0.100000 4
0.000000 2
177.600000 1
2.600000 1
3.428571 1
0.636364 1
0.818182 1
42.000000 1
7.500000 1
2.700000 1
Name: rating, dtype: int64
#Third conclusion: most ratings exceed 1.0, i.e. the numerator is greater than the denominator (1,158 of 1,994 tweets, about 58%)
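The share of ratings above 1.0 can be checked directly from the value_counts() output. The sketch below copies those counts into a Series by hand; on the real frame the equivalent would be (data['rating'] > 1).mean():

```python
import pandas as pd

# Counts copied from the value_counts() output above (rating -> tweets).
rating_counts = pd.Series({
    1.2: 454, 1.0: 421, 1.1: 402, 1.3: 261, 0.9: 151, 0.8: 95, 0.7: 51,
    1.4: 35, 0.5: 34, 0.6: 32, 0.3: 19, 0.4: 15, 0.2: 10, 0.1: 4, 0.0: 2,
    177.6: 1, 2.6: 1, 3.428571: 1, 0.636364: 1, 0.818182: 1, 42.0: 1,
    7.5: 1, 2.7: 1,
})

total = rating_counts.sum()
above_one = rating_counts[rating_counts.index > 1.0].sum()
share = above_one / total
print(total, above_one, round(share, 3))
```

Ratings of exactly 1.0 (421 tweets) are not counted as above 1.0 here.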