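Today's topic: movie rating data
Preparation
The movie rating data analyzed here is the MovieLens 1M Dataset, which can be downloaded from grouplens.org/datasets/movielens.
Read the README that ships with the dataset as well: for convenience some variables are stored as codes, and their actual meanings are documented there.
Reading the data
The column names below are the ones used in the original dataset (see the README for what each one means). Adjust the file paths to wherever the files are stored on your machine. The snippets assume pandas has already been imported with import pandas as pd.
User data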
In [7]: user_names = ['user_id','gender','age','occupation','zip']
In [8]: users = pd.read_table('F:/Anaconda个人文件/Jupyter/MaiZi_data/ml-1m/users.dat',sep='::',header=None,names=user_names,engine='python')
Rating data
In [9]: rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
In [10]: ratings = pd.read_table('F:/Anaconda个人文件/Jupyter/MaiZi_data/ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')
Movie data
In [11]: movie_names = ['movie_id', 'title', 'genres']
In [12]: movies = pd.read_table('F:/Anaconda个人文件/Jupyter/MaiZi_data/ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python')
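The three read calls above follow the same pattern, which can be tried without downloading anything. The sketch below is only an illustration: it parses two rows in the users.dat layout (copied from the users.head(2) output in this article) from an in-memory buffer, and the name users_sample is made up for the example.

```python
import io
import pandas as pd

# Two sample rows in the users.dat layout, taken from the users.head(2) output.
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is handled by the Python parsing
# engine, so engine='python' is passed explicitly to avoid a parser warning.
users_sample = pd.read_csv(sample, sep='::', header=None,
                           names=user_names, engine='python')
print(users_sample)
```

If reading movies.dat raises a UnicodeDecodeError on your setup, passing an explicit encoding such as encoding='latin-1' may help; treat that as an assumption to check against your copy of the file.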
A quick look at the row count and first rows of each of the three tables
In [15]: print(len(users))
6040
In [16]: users.head(2)
Out[16]:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
In [17]: print(len(ratings))
1000209
In [18]: ratings.head(2)
Out[18]:
user_id movie_id rating timestamp
In [19]: print(len(movies))
3883
In [20]: movies.head(2)
Out[20]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
Merging the data
In [13]: data = pd.merge(pd.merge(users,ratings),movies)
In [14]: len(data)
Out[14]: 1000209
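The merged table has exactly as many rows as the ratings table. pd.merge joins on the columns the two tables share (user_id for users and ratings, then movie_id for the result and movies), and every rating matches exactly one user and one movie, so each rating yields one merged row. In other words, 6,040 users produced 1,000,209 ratings of at most 3,883 movies, since each user can watch and rate many movies.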
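A small sketch of the same two-step join with the keys written out may make the row count easier to see. The tables below are hypothetical miniatures, not rows from MovieLens, and the validate argument is an optional check that was not part of the original session.

```python
import pandas as pd

# Hypothetical miniature tables, only to illustrate the join behaviour.
users_toy = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_toy = pd.DataFrame({'movie_id': [10, 20, 30],
                           'title': ['A (1990)', 'B (1991)', 'C (1992)']})
ratings_toy = pd.DataFrame({'user_id': [1, 1, 2],
                            'movie_id': [10, 20, 10],
                            'rating': [5, 3, 4]})

# Same shape of join as pd.merge(pd.merge(users, ratings), movies),
# with the keys spelled out. validate= raises MergeError if the assumed
# key relationship (unique users, unique movies) does not hold.
merged_toy = (users_toy
              .merge(ratings_toy, on='user_id', validate='one_to_many')
              .merge(movies_toy, on='movie_id', validate='many_to_one'))

print(len(merged_toy))                # one row per rating
print(merged_toy['title'].nunique())  # movie 30 has no rating, so it drops out
```

Because the default join is an inner join, a movie that nobody rated does not appear in the merged table at all.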
Mean rating of each movie, by gender
In [23]: mean_ratings_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
In [24]: mean_ratings_gender.head(5)
Out[24]:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
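For readers more used to groupby, the pivot table above can also be written as a grouped mean that is then unstacked. This is a sketch on hypothetical toy data with the same column names as the merged table; by_pivot and by_groupby are names chosen for the example.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the merged table.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'rating': [4, 2, 4, 5, 3, 1],
})

# The pivot_table form used in the article.
by_pivot = toy.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# Equivalent form: mean per (title, gender) pair, then gender moved to columns.
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(by_groupby)
```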
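Difference between women's and men's ratings of the same movie, which may indirectly reflect where the two groups disagree about what a film portrays
In [25]: mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M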
In [26]: mean_ratings_gender.head(2)
Out[26]:
gender F M diff
title
$1,000,000 Duck (1971) 3.375000 2.761905 0.613095
'Night Mother (1986) 3.388889 3.352941 0.035948
The ten movies that men rated furthest above women (most negative diff, since diff = F - M and the sort is ascending)
In [27]: mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)
Out[27]:
gender F M diff
title
Tigrero: A Film That Was Never Made (1994) 1.0 4.333333 -3.333333
Neon Bible, The (1995) 1.0 4.000000 -3.000000
Enfer, L' (1994) 1.0 3.750000 -2.750000
Stalingrad (1993) 1.0 3.593750 -2.593750
Killer: A Journal of Murder (1995) 1.0 3.428571 -2.428571
Dangerous Ground (1997) 1.0 3.333333 -2.333333
In God's Hands (1998) 1.0 3.333333 -2.333333
Rosie (1998) 1.0 3.333333 -2.333333
Flying Saucer, The (1950) 1.0 3.300000 -2.300000
Jamaica Inn (1939) 1.0 3.142857 -2.142857
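The list above only covers one direction of disagreement, and a female mean of exactly 1.0 suggests that some of these titles have very few ratings from women. A possible variation, sketched here on hypothetical toy data, ranks by the absolute gap and first drops rarely rated titles. The threshold min_count is an illustrative assumption, not a value from the article.

```python
import pandas as pd

# Hypothetical toy ratings; min_count is an illustrative threshold.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'gender': ['F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'M'],
    'rating': [5, 4, 2, 1, 1, 5, 3, 3, 4, 4],
})
min_count = 4

# Keep only titles with enough ratings to make a mean meaningful.
counts = toy.groupby('title').size()
popular = counts[counts >= min_count].index

means = toy.pivot_table(values='rating', index='title',
                        columns='gender', aggfunc='mean')
means = means.loc[popular].copy()
means['diff'] = means['F'] - means['M']

# Order by the size of the gap regardless of its sign.
order = means['diff'].abs().sort_values(ascending=False).index
ranked = means.reindex(order)
print(ranked)
```

On the real data the same steps would use data in place of toy and a larger threshold.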
Grouping by movie title to count the ratings each movie received
In [28]: ratings_by_movie_title = data.groupby('title').size()
In [29]: ratings_by_movie_title.head(2)
Out[29]:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
dtype: int64
The ten most-rated movies among those with more than 1,000 ratings
In [30]: top_ratings = ratings_by_movie_title[ratings_by_movie_title > 1000]
In [31]: top_10_ratings = top_ratings.sort_values(ascending=False).head(10)
In [32]: top_10_ratings
Out[32]:
title
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
dtype: int64
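The filter-then-sort step above can also be expressed with Series.nlargest. The sketch below uses a hypothetical toy column of titles and a threshold of 1 instead of 1000 so that the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: A is rated three times, B twice, C once.
toy = pd.DataFrame({'title': ['A', 'B', 'A', 'C', 'A', 'B']})

# Number of ratings per title, as in data.groupby('title').size().
counts = toy.groupby('title').size()

# Keep titles above the threshold, then take the n largest counts.
top_2 = counts[counts > 1].nlargest(2)
print(top_2)
```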
The twenty movies with the highest mean rating
In [33]: mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')
In [34]: top_20_mean_ratings = mean_ratings.sort_values(by='rating',ascending=False).head(20)
In [35]: top_20_mean_ratings
Out[35]:
rating
title
Ulysses (Ulisse) (1954) 5.000000
Lured (1947) 5.000000
Follow the Bitch (1998) 5.000000
Bittersweet Motel (2000) 5.000000
Song of Freedom (1936) 5.000000
One Little Indian (1973) 5.000000
Smashing Time (1967) 5.000000
Schlafes Bruder (Brother of Sleep) (1995) 5.000000
Gate of Heavenly Peace, The (1995) 5.000000
Baby, The (1973) 5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 4.800000
Lamerica (1994) 4.750000
Apple, The (Sib) (1998) 4.666667
Sanjuro (1962) 4.608696
Seven Samurai (The Magnificent Seven) (Shichini... 4.560510
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Close Shave, A (1995) 4.520548
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
Mean rating of the ten most-rated movies (each with more than 1,000 ratings)
In [36]: mean_ratings.loc[top_10_ratings.index]
Out[36]:
rating
title
American Beauty (1999) 4.317386
Star Wars: Episode IV - A New Hope (1977) 4.453694
Star Wars: Episode V - The Empire Strikes Back ... 4.292977
Star Wars: Episode VI - Return of the Jedi (1983) 4.022893
Jurassic Park (1993) 3.763847
Saving Private Ryan (1998) 4.337354
Terminator 2: Judgment Day (1991) 4.058513
Matrix, The (1999) 4.315830
Back to the Future (1985) 3.990321
Silence of the Lambs, The (1991) 4.351823
Number of ratings for the twenty highest-rated movies (a very high mean can be an artifact of a movie having only a handful of ratings)
In [37]: ratings_by_movie_title.loc[top_20_mean_ratings.index]
Out[37]:
title
Ulysses (Ulisse) (1954) 1
Lured (1947) 1
Follow the Bitch (1998) 1
Bittersweet Motel (2000) 1
Song of Freedom (1936) 1
One Little Indian (1973) 1
Smashing Time (1967) 2
Schlafes Bruder (Brother of Sleep) (1995) 1
Gate of Heavenly Peace, The (1995) 3
Baby, The (1973) 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 5
Lamerica (1994) 8
Apple, The (Sib) (1998) 9
Sanjuro (1962) 69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 628
Shawshank Redemption, The (1994) 2227
Godfather, The (1972) 2223
Close Shave, A (1995) 657
Usual Suspects, The (1995) 1783
Schindler's List (1993) 2304
dtype: int64
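The counts above show that most of the perfect scores come from one to three ratings. One way to keep the mean and the count side by side, so that such titles can be filtered out in a single step, is a grouped aggregation. This is a sketch on hypothetical toy data; the column names mean_rating and n_ratings and the threshold min_count are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy ratings: A has a perfect mean but only one rating.
toy = pd.DataFrame({
    'title':  ['A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'rating': [5, 4, 5, 4, 3, 4, 2],
})
min_count = 3

# Mean rating and number of ratings per title in one table.
stats = toy.groupby('title')['rating'].agg(mean_rating='mean',
                                           n_ratings='count')

# Drop titles with too few ratings before ranking by mean.
reliable = stats[stats['n_ratings'] >= min_count]
ranked = reliable.sort_values('mean_rating', ascending=False)
print(ranked)
```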
The ten highest-rated movies among those with more than 1,000 ratings
In [40]: top_10_movies = mean_ratings.loc[top_ratings.index].sort_values(by='rating',ascending=False).head(10)
In [41]: top_10_movies
Out[41]:
rating
title
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
Raiders of the Lost Ark (1981) 4.477725
Rear Window (1954) 4.476190
Star Wars: Episode IV - A New Hope (1977) 4.453694
Dr. Strangelove or: How I Learned to Stop Worry... 4.449890
Casablanca (1942) 4.412822
Sixth Sense, The (1999) 4.406263
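Combined view of the ten highest-rated movies with more than 1,000 ratings: mean rating plus popularity (hot, the number of ratings)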
In [42]: df_top_10_movies = pd.DataFrame(top_10_movies)
In [43]: df_top_10_movies['hot'] = top_ratings[top_10_movies.index]
In [44]: df_top_10_movies
Out[44]:
rating hot
title
Shawshank Redemption, The (1994) 4.554558 2227
Godfather, The (1972) 4.524966 2223
Usual Suspects, The (1995) 4.517106 1783
Schindler's List (1993) 4.510417 2304
Raiders of the Lost Ark (1981) 4.477725 2514
Rear Window (1954) 4.476190 1050
Star Wars: Episode IV - A New Hope (1977) 4.453694 2991
Dr. Strangelove or: How I Learned to Stop Worry... 4.449890 1367
Casablanca (1942) 4.412822 1669
Sixth Sense, The (1999) 4.406263 2459