The 1.usa.gov data from bit.ly
In 2011, the URL-shortening service bit.ly partnered with the US government site USA.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil. The data is provided as hourly snapshots, and each line of a snapshot file is in JSON format. For example, if we read just one line of a file, we see the following:
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
open(path).readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
Using the json module, each such line can be converted into a Python dict with json.loads:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
records[0]
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 u'al': u'en-US,en;q=0.8',
 u'c': u'US',
 u'cy': u'Danvers',
 u'g': u'A6qOVH',
 u'gr': u'MA',
 u'h': u'wfLQtf',
 u'hc': 1331822918,
 u'hh': u'1.usa.gov',
 u'l': u'orofrog',
 u'll': [42.576698, -70.954903],
 u'nk': 1,
 u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 u't': 1331923247,
 u'tz': u'America/New_York',
 u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
records[0]['tz']
u'America/New_York'
print records[0]['tz']
America/New_York
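As an aside, the bare open(path) calls above never close the file handle. In a standalone script you would normally wrap the line-by-line loading in a small helper using a with block. A minimal sketch (load_records is a hypothetical helper name, not from the original text):

```python
import json

def load_records(path):
    # One JSON document per line; 'with' guarantees the file is closed
    # even if a line fails to parse.
    with open(path) as f:
        return [json.loads(line) for line in f]
```

The list comprehension inside is the same as in the text; only the resource handling differs.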
Counting time zones. Suppose we want to find the time zones that occur most often in the dataset (the tz field). First, let's extract a list of time zones with a list comprehension:
time_zones = [rec['tz'] for rec in records]
KeyError                                  Traceback (most recent call last)
/home/wesm/book_scripts/whetting/<ipython> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]
KeyError: 'tz'
Oops! It turns out that not all of the records have a time zone field. We can handle this by adding the check if 'tz' in rec at the end of the list comprehension:
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]
[u'America/New_York',
 u'America/Denver',
 u'America/New_York',
 u'America/Sao_Paulo',
 u'America/New_York',
 u'America/New_York',
 u'Europe/Warsaw',
 u'',
 u'',
 u'']
Just looking at the first 10 time zones, we see that some of them are empty (unknown). They could be filtered out, but we'll leave them in for now to illustrate how to handle them. Next are two ways to count the time zones: one using only the Python standard library, the other using pandas. Starting with the standard library, one way is to accumulate the counts in a dict while iterating over the time zones:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
There is also a more concise way to write this, using collections.defaultdict:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts
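As a quick sanity check (not from the original text), both helpers agree on a small hand-made list:

```python
from collections import defaultdict

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x] += 1
    return counts

sample = ['a', 'b', 'a', 'c', 'a', 'b']
print(get_counts(sample))  # {'a': 3, 'b': 2, 'c': 1}
```

The defaultdict version avoids the membership test at the cost of importing a class; both return the same mapping.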
To count the time zones, just pass the list to either function:
counts = get_counts(time_zones)
counts['America/New_York']
1251
len(time_zones)
3440
If we wanted the top 10 time zones and their counts, we have to do a little extra work:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]  # after the ascending sort, the last n pairs are the largest
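A quick check of this idea on a small made-up dict: sorting the (count, key) pairs ascending and then slicing the last n entries yields the n largest counts, which is why the slice below is value_key_pairs[-n:]:

```python
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()       # ascending by count
    return value_key_pairs[-n:]  # the last n pairs are the largest

demo = {'America/New_York': 1251, 'Europe/London': 74, 'Asia/Tokyo': 37}
print(top_counts(demo, n=2))  # [(74, 'Europe/London'), (1251, 'America/New_York')]
```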
We then have:
top_counts(counts)
[(33, u'America/Sao_Paulo'),
(35, u'Europe/Madrid'),
(36, u'Pacific/Honolulu'),
(37, u'Asia/Tokyo'),
(74, u'Europe/London'),
(191, u'America/Denver'),
(382, u'America/Los_Angeles'),
(400, u'America/Chicago'),
(521, u''),
(1251, u'America/New_York')]
Another approach is the Counter class from the standard library's collections module, which makes this task even simpler:
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
[(u'America/New_York', 1251),
(u'', 521),
(u'America/Chicago', 400),
(u'America/Los_Angeles', 382),
(u'America/Denver', 191),
(u'Europe/London', 74),
(u'Asia/Tokyo', 37),
(u'Pacific/Honolulu', 36),
(u'Europe/Madrid', 35),
(u'America/Sao_Paulo', 33)]
Counting time zones with pandas
The DataFrame is the most important data structure in pandas; it represents a table. Creating a DataFrame from the original set of records is simple:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
frame = DataFrame(records)
frame
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3560 entries, 0 to 3559
Data columns:
_heartbeat_ 120 non-null values
a 3440 non-null values
al 3094 non-null values
c 2919 non-null values
cy 2919 non-null values
g 3440 non-null values
gr 2919 non-null values
h 3440 non-null values
hc 3440 non-null values
hh 3440 non-null values
kw 93 non-null values
l 3440 non-null values
ll 2919 non-null values
nk 3440 non-null values
r 3440 non-null values
t 3440 non-null values
tz 3440 non-null values
u 3440 non-null values
dtypes: float64(4), object(14)
The output shown for frame is a summary view, which pandas uses for large DataFrame objects. The Series returned by frame['tz'] has a value_counts method that gives us exactly what we want:
tz_counts = frame['tz'].value_counts()
tz_counts[:10]
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
Name: tz
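To see what value_counts does in isolation, here is a tiny hand-made Series (values invented for illustration):

```python
import pandas as pd

s = pd.Series(['America/New_York', 'Europe/London', 'America/New_York', '', ''])
counts = s.value_counts()  # count of each distinct value, sorted descending
assert counts['America/New_York'] == 2
assert counts[''] == 2
assert counts['Europe/London'] == 1
```

Note that the empty string is a perfectly ordinary value to value_counts, which is why it shows up in the real output above with a count of 521.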
We can visualize this data using matplotlib. First, let's fill in a substitute value for the records with a missing or unknown (empty string) time zone:
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10].plot(kind='barh', rot=0)  # this produces a horizontal bar plot
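The fillna / empty-string replacement step can be checked on a tiny Series (values invented for illustration):

```python
import numpy as np
import pandas as pd

tz = pd.Series(['America/New_York', '', np.nan, 'Europe/London'])
clean = tz.fillna('Missing')    # NaN -> 'Missing'
clean[clean == ''] = 'Unknown'  # empty string -> 'Unknown'
print(list(clean))  # ['America/New_York', 'Unknown', 'Missing', 'Europe/London']
```

fillna only touches true missing values (NaN); the empty strings are real values and need the separate boolean-indexing assignment.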
We can do other interesting things with this data too. For example, the a field contains information about the browser, device, or application used to perform the URL shortening:
frame['a'][1]
u'GoogleMaps/RochesterNY'
frame['a'][50]
u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
frame['a'][51]
u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'
Parsing all of the interesting information in these agent strings by hand would be tedious, but with Python's built-in string functions and regular expressions it's manageable. For example, we can split off the first token in the string (corresponding roughly to the browser capability) and make another summary of user behavior:
results = Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object
results.value_counts()[:8]
Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
dtype: int64
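The split()[0] trick on its own, applied to two invented agent strings:

```python
import pandas as pd

agents = pd.Series([
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11',
    'GoogleMaps/RochesterNY',
])
# Same pattern as the text: first whitespace-delimited token of each agent
first_tokens = pd.Series([x.split()[0] for x in agents.dropna()])
assert list(first_tokens) == ['Mozilla/5.0', 'GoogleMaps/RochesterNY']
```

Splitting on whitespace is crude but adequate here, since the browser capability is always the first token.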
Now, suppose we want to decompose the top time zones into Windows and non-Windows users. As a simplification, we'll say that a user is on Windows if the string "Windows" appears in the agent string:
cframe = frame[frame.a.notnull()]  # first, exclude the records with missing agent strings
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
operating_system[:5]
array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'], dtype='|S11')
by_tz_os = cframe.groupby(['tz', operating_system])  # group the data by time zone and the new list of operating systems
agg_counts = by_tz_os.size().unstack().fillna(0)  # count the groups with size, then reshape the result with unstack
agg_counts[:10]
                                Not Windows  Windows
tz
                                      245.0    276.0
Africa/Cairo                            0.0      3.0
Africa/Casablanca                       0.0      1.0
Africa/Ceuta                            0.0      2.0
Africa/Johannesburg                     0.0      1.0
Africa/Lusaka                           0.0      1.0
America/Anchorage                       4.0      1.0
America/Argentina/Buenos_Aires          1.0      0.0
America/Argentina/Cordoba               0.0      1.0
America/Argentina/Mendoza               0.0      1.0
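The groupby / size / unstack chain can be verified on a small invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tz': ['NY', 'NY', 'London', 'NY'],
    'a': ['Windows NT 6.1', 'Mac OS X', 'Windows NT 5.1', 'Windows NT 6.1'],
})
# Mixing a column name with an external array of labels is allowed in groupby
os_labels = np.where(df['a'].str.contains('Windows'), 'Windows', 'Not Windows')
table = df.groupby(['tz', os_labels]).size().unstack().fillna(0)
assert table.loc['NY', 'Windows'] == 2
assert table.loc['NY', 'Not Windows'] == 1
assert table.loc['London', 'Not Windows'] == 0  # filled in by fillna(0)
```

size counts rows per (tz, os) pair; unstack pivots the os level into columns, and fillna(0) replaces the NaN holes for combinations that never occurred.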
indexer = agg_counts.sum(1).argsort()  # use the row totals to construct an indirect sort index
indexer[:10]  # ascending order by default
tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64
count_subset = agg_counts.take(indexer)[-10:]  # take rows in sorted order, then slice off the last 10 (largest totals)
count_subset
                     Not Windows  Windows
tz
America/Sao_Paulo           13.0     20.0
Europe/Madrid               16.0     19.0
Pacific/Honolulu             0.0     36.0
Asia/Tokyo                   2.0     35.0
Europe/London               43.0     31.0
America/Denver             132.0     59.0
America/Los_Angeles        130.0    252.0
America/Chicago            115.0    285.0
                           245.0    276.0
America/New_York           339.0    912.0
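The argsort-plus-take idiom, on a tiny invented Series of totals:

```python
import pandas as pd

totals = pd.Series([5, 1, 3], index=['b', 'c', 'a'])
indexer = totals.argsort()        # positions that would sort the values ascending
top2 = totals.take(indexer)[-2:]  # the two rows with the largest totals
assert list(top2.index) == ['a', 'b']
assert list(top2) == [3, 5]
```

argsort returns integer positions rather than labels, which is why it pairs with take (positional selection) rather than loc.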
import matplotlib.pyplot as plt  # needed for the figure call below

plt.figure()
count_subset.plot(kind='barh', stacked=True)  # make a stacked horizontal bar plot
The relative proportions in the smaller groups are hard to see in this plot, so we can normalize each row to sum to 1 and plot again:
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
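The row normalization with div(..., axis=0) can be checked on an invented two-row table; after dividing each row by its own total, every row sums to 1:

```python
import pandas as pd

counts = pd.DataFrame(
    {'Not Windows': [13.0, 43.0], 'Windows': [20.0, 31.0]},
    index=['America/Sao_Paulo', 'Europe/London'])
normed = counts.div(counts.sum(axis=1), axis=0)  # divide each row by its row total
assert abs(normed.sum(axis=1) - 1.0).max() < 1e-9
```

axis=0 tells div to align the divisor Series with the row index, i.e. divide row-wise.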