MovieLens 1M Dataset

GroupLens Research collected a set of movie ratings submitted by MovieLens users between the late 1990s and the early 2000s. This section analyzes that data.

import pandas as pd
import os
encoding = 'latin1'

upath = os.path.expanduser('ch02/movielens/users.dat')
rpath = os.path.expanduser('ch02/movielens/ratings.dat')
mpath = os.path.expanduser('ch02/movielens/movies.dat')

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

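users = pd.read_csv(upath, sep='::', header=None, names=unames, encoding=encoding, engine='python')
ratings = pd.read_csv(rpath, sep='::', header=None, names=rnames, encoding=encoding, engine='python')
movies = pd.read_csv(mpath, sep='::', header=None, names=mnames, encoding=encoding, engine='python')
# The lines above load the three tables; engine='python' is requested because '::' is a multi-character separator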
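
To check the parsing options without the full files, the sketch below reads two sample lines from an in-memory string. The sample text and the name users_sample are illustrative assumptions, not part of the dataset files.

```python
import io
import pandas as pd

# Illustrative sample in the same "::"-delimited layout as users.dat
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is interpreted as a regular
# expression, which only the Python parsing engine supports.
users_sample = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python')
print(users_sample)
```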

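Suppose we want to compute a movie's average rating by gender and age. The analysis is much simpler once all the data is merged into a single table. We first use pandas' merge function to combine ratings with users, and then merge movies into the result.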

data = pd.merge(pd.merge(ratings, users), movies)    # merge ratings with users first, then merge in movies
data.iloc[0]    # first row of the merged table (.ix has been removed from pandas)
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
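
The merge above does not name any join keys, because pandas joins on the columns the two tables have in common. A minimal sketch on made-up frames (ratings_s, users_s and movies_s are hypothetical names) suggests that the implicit form matches spelling the keys out with on=:

```python
import pandas as pd

# Tiny made-up tables that mirror the column layout used in this section
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['Movie A', 'Movie B']})

# Without on=, merge joins on the overlapping column names
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the keys written explicitly
explicit = ratings_s.merge(users_s, on='user_id').merge(movies_s, on='movie_id')

print(implicit.equals(explicit))
```

Writing the keys explicitly is usually the safer choice when two tables might share a column name by accident.
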
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')    # use pivot_table to compute each movie's mean rating by gender
mean_ratings[:5]
gender     F     M
title         
$1,000,000 Duck (1971)     3.375000     2.761905
'Night Mother (1986)     3.388889     3.352941
'Til There Was You (1997)     2.675676     2.733333
'burbs, The (1989)     2.793478     2.962085
...And Justice for All (1979)     3.828571     3.689024

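The operation above produces a new DataFrame of mean ratings, with movie titles as the row labels and gender as the column labels.

Next, we filter out movies that have fewer than 200 ratings.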
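
As a rough illustration of what the pivot computes, the sketch below builds the same kind of table on a made-up frame (toy is a hypothetical name) and compares it with a groupby followed by unstack:

```python
import pandas as pd

# Made-up ratings for two titles
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, cells hold the mean rating
pivoted = toy.pivot_table(values='rating', index='title',
                          columns='gender', aggfunc='mean')

# The same numbers via groupby: group on both keys, then move gender to columns
grouped = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(pivoted)
```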

ratings_by_title = data.groupby('title').size()    # group by title, then count the ratings in each group
active_titles = ratings_by_title.index[ratings_by_title >= 200]    # keep the titles with at least 200 ratings
active_titles[:10]
>>>Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)'],
      dtype='object', name=u'title')
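
The count-then-filter step can be tried on a small made-up frame. In this sketch the threshold min_ratings is a hypothetical stand-in for the 200 used above:

```python
import pandas as pd

toy = pd.DataFrame({'title': ['A', 'A', 'A', 'B', 'C', 'C'],
                    'rating': [5, 4, 3, 2, 4, 4]})

min_ratings = 2  # hypothetical threshold; the text uses 200

# Number of ratings per title, indexed by title
counts = toy.groupby('title').size()

# Boolean mask on the counts selects the index labels that pass the threshold
active = counts.index[counts >= min_ratings]

print(list(active))
```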

mean_ratings = mean_ratings.loc[active_titles]    # select the rows of mean_ratings listed in active_titles
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)    # to see the movies female viewers rate highest, sort the F column in descending order
top_female_ratings[:10]
gender     F     M
title
Close Shave, A (1995)     4.644444     4.473795
Wrong Trousers, The (1993)     4.588235     4.478261
General, The (1927)     4.575758     4.329480
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)     4.572650     4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)     4.563107     4.385075
Schindler's List (1993)     4.562602     4.491415
Shawshank Redemption, The (1994)     4.539075     4.560625
Grand Day Out, A (1992)     4.537879     4.293255
To Kill a Mockingbird (1962)     4.536667     4.372611
Creature Comforts (1990)     4.513889     4.272277

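Measuring rating disagreement

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']    # add a diff column holding the male mean minus the female mean
sorted_by_diff = mean_ratings.sort_values(by='diff')    # sort by diff in ascending order
sorted_by_diff[:15]    # first 15 rows: the largest gaps among movies that female viewers rate higher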

gender     F     M     diff
title
Dirty Dancing (1987)     3.790378     2.959596     -0.830782
To Wong Foo, Thanks for Everything! Julie Newmar (1995)     3.486842     2.795276     -0.691567
Jumpin' Jack Flash (1986)     3.254717     2.578358     -0.676359
Grease (1978)     3.975265     3.367041     -0.608224
Relic, The (1997)     3.309524     2.723077     -0.586447
Angels in the Outfield (1994)     3.162500     2.580838     -0.581662
Little Women (1994)     3.870588     3.321739     -0.548849
Son in Law (1993)     3.075472     2.535948     -0.539524
Steel Magnolias (1989)     3.901734     3.365957     -0.535777
Anastasia (1997)     3.800000     3.281609     -0.518391
Rocky Horror Picture Show, The (1975)     3.673016     3.160131     -0.512885
Santa Claus: The Movie (1985)     3.054545     2.541667     -0.512879
Color Purple, The (1985)     4.158192     3.659341     -0.498851
Nell (1994)     3.511111     3.016000     -0.495111
Age of Innocence, The (1993)     3.827068     3.339506     -0.487561

sorted_by_diff[::-1][:15]    # reverse the order and take 15 rows: the movies that male viewers rate higher

gender     F     M     diff
title             
Good, The Bad and The Ugly, The (1966)     3.494949     4.221300     0.726351
Kentucky Fried Movie, The (1977)     2.878788     3.555147     0.676359
Up in Smoke (1978)     2.944444     3.585227     0.640783
Dumb & Dumber (1994)     2.697987     3.336595     0.638608
Longest Day, The (1962)     3.411765     4.031447     0.619682
Cable Guy, The (1996)     2.250000     2.863787     0.613787
Evil Dead II (Dead By Dawn) (1987)     3.297297     3.909283     0.611985
Hidden, The (1987)     3.137931     3.745098     0.607167
Rocky III (1982)     2.361702     2.943503     0.581801
Transformers: The Movie, The (1986)     2.857143     3.433333     0.576190
Nutty Professor II: The Klumps (2000)     2.188679     2.764103     0.575423
Caddyshack (1980)     3.396135     3.969737     0.573602
For a Few Dollars More (1965)     3.409091     3.953795     0.544704
Porky's (1981)     2.296875     2.836364     0.539489
Animal House (1978)     3.628906     4.167192     0.538286
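
The sign convention is easy to mix up: diff is the male mean minus the female mean, so the most negative values sit at the top of an ascending sort. A minimal sketch on made-up means (toy_means is a hypothetical name):

```python
import pandas as pd

# Made-up mean ratings per gender for three titles
toy_means = pd.DataFrame(
    {'F': [3.8, 3.5, 4.0], 'M': [3.0, 4.2, 4.0]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'))

# Positive diff: rated higher by men; negative diff: rated higher by women
toy_means['diff'] = toy_means['M'] - toy_means['F']
by_diff = toy_means.sort_values(by='diff')

female_pref = by_diff.index[0]        # most negative diff comes first
male_pref = by_diff[::-1].index[0]    # reversing puts the largest diff first

print(female_pref, male_pref)
```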

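rating_std_by_title = data.groupby('title')['rating'].std()    # to find the most divisive movies regardless of gender, compute the standard deviation of each movie's ratings
rating_std_by_title = rating_std_by_title.loc[active_titles]    # keep only the titles in active_titles
rating_std_by_title.sort_values(ascending=False)[:10]    # sort the Series by value in descending order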

title
Plan 9 from Outer Space (1958)         1.455998
Texas Chainsaw Massacre, The (1974)    1.332448
Dumb & Dumber (1994)                   1.321333
Blair Witch Project, The (1999)        1.316368
Natural Born Killers (1994)            1.307198
Idle Hands (1999)                      1.298439
Transformers: The Movie, The (1986)    1.292917
Very Bad Things (1998)                 1.280074
Tank Girl (1995)                       1.277695
Hellraiser: Bloodline (1996)           1.271939
Name: rating, dtype: float64
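
To see why the standard deviation captures disagreement, the sketch below compares two made-up titles (the data and names are illustrative). Note that pandas' std() uses the sample formula (ddof=1) by default.

```python
import pandas as pd

# 'Polarizing' gets only 1s and 5s; 'Consensus' gets nearly identical scores
toy = pd.DataFrame({
    'title': ['Polarizing'] * 4 + ['Consensus'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation of the ratings for each title
std_by_title = toy.groupby('title')['rating'].std()

# The title with the widest spread of ratings comes first
most_divisive = std_by_title.sort_values(ascending=False).index[0]

print(std_by_title)
```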
