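The Euclidean distance score is a useful metric, but it has some shortcomings; for example, it is sensitive to how generously each user tends to rate. For this reason, the Pearson correlation score is frequently used in recommendation engines. Let's see how to compute it.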

How to do it...

  • Create a new file and import the required packages:
import json
import numpy as np
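  • We will define a function that computes the Pearson correlation score between two users in the dataset. The first step is to confirm that both users are present in the dataset; the function then collects the movies both users have rated and computes the score from their ratings: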
# Returns the Pearson correlation score between user1 and user2


def pearson_score(dataset, user1, user2):
    # A missing user is a lookup failure, so raise KeyError rather than TypeError
    if user1 not in dataset:
        raise KeyError('User ' + str(user1) + ' not present in the dataset')

    if user2 not in dataset:
        raise KeyError('User ' + str(user2) + ' not present in the dataset')

    # Movies rated by both user1 and user2
    rated_by_both = {}

    for item in dataset[user1]:
        if item in dataset[user2]:
            rated_by_both[item] = 1

    num_ratings = len(rated_by_both)

    # If there are no movies in common, the score is 0
    if num_ratings == 0:
        return 0

    # Sum of each user's ratings over the common movies
    user1_sum = np.sum([dataset[user1][item] for item in rated_by_both])
    user2_sum = np.sum([dataset[user2][item] for item in rated_by_both])

    # Sum of the squared ratings over the common movies
    user1_squared_sum = np.sum(
        [np.square(dataset[user1][item]) for item in rated_by_both])
    user2_squared_sum = np.sum(
        [np.square(dataset[user2][item]) for item in rated_by_both])

    # Sum of the products of the two users' ratings
    product_sum = np.sum(
        [
            dataset[user1][item] * dataset[user2][item]
            for item in rated_by_both
        ]
    )

    # Finally, compute the terms of the Pearson correlation coefficient
    Sxy = product_sum - (user1_sum * user2_sum / num_ratings)
    Sxx = user1_squared_sum - np.square(user1_sum) / num_ratings
    Syy = user2_squared_sum - np.square(user2_sum) / num_ratings

    # Guard against a zero denominator
    if Sxx * Syy == 0:
        return 0

    return Sxy / np.sqrt(Sxx * Syy)
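
The quantities Sxy, Sxx and Syy in the function match the usual computational form of the Pearson correlation coefficient. As a reference sketch, write x_i and y_i for the ratings the two users gave to the i-th movie they both rated, and n for num_ratings (this notation is introduced here for illustration only and does not appear in the code):

```latex
\begin{aligned}
S_{xy} &= \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{i=1}^{n} y_i\Big) \\
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2 \\
S_{yy} &= \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} y_i\Big)^2 \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
\end{aligned}
```

The score r lies between -1 and 1. Returning 0 when Sxx * Syy is 0 (for instance, when one user gave the same rating to every common movie) is a convention of this function, not part of the definition, which is undefined in that case.
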
  • Let's define the main block and compute the Pearson correlation score between two users:
if __name__ == '__main__':
    data_file = 'movie_ratings.json'

    with open(data_file, 'r') as f:
        data = json.load(f)

    user1 = 'John Carson'
    user2 = 'Michelle Peterson'

    print("Pearson score:")
    print(pearson_score(data, user1, user2))
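
To sanity-check the sum-based computation without needing movie_ratings.json, the sketch below applies the same formula to a small in-memory dataset and compares the result with NumPy's np.corrcoef. The users 'Alice' and 'Bob', the movie titles, the ratings, and the helper name pearson_from_sums are all hypothetical and made up for this illustration; they are not part of the recipe's data or code.

```python
import numpy as np

# Hypothetical toy ratings, NOT taken from movie_ratings.json
ratings = {
    'Alice': {'Movie A': 5.0, 'Movie B': 3.0, 'Movie C': 4.0, 'Movie D': 1.0},
    'Bob': {'Movie A': 4.0, 'Movie B': 2.0, 'Movie C': 5.0, 'Movie E': 3.0},
}


def pearson_from_sums(x, y):
    """Pearson correlation using the same sums as pearson_score (illustrative helper)."""
    n = len(x)
    s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    s_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
    s_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n
    # Same convention as the recipe: return 0 when the denominator is 0
    if s_xx * s_yy == 0:
        return 0.0
    return s_xy / np.sqrt(s_xx * s_yy)


# Only movies rated by both users take part in the score
common = sorted(set(ratings['Alice']) & set(ratings['Bob']))
x = np.array([ratings['Alice'][movie] for movie in common])
y = np.array([ratings['Bob'][movie] for movie in common])

score = pearson_from_sums(x, y)
reference = np.corrcoef(x, y)[0, 1]  # NumPy's own Pearson correlation

print("Common movies:", common)
print("Sum-based score:", score)
print("np.corrcoef score:", reference)
```

The two printed scores should agree up to floating-point rounding. Note that np.corrcoef has no zero-denominator guard like the recipe's, so it yields nan when one user's common ratings are all identical.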
