baserec.base.evaluation package

Submodules

baserec.base.evaluation.evaluator module

@author: Maurizio Ferrari Dacrema & Ceshine Lee

class baserec.base.evaluation.evaluator.Evaluator(URM_test_list, cutoff_list, min_ratings_per_user=1, exclude_seen=True, diversity_object=None, ignore_items=None, ignore_users=None, verbose=True)

Bases: object

Abstract Evaluator

EVALUATOR_NAME = 'Evaluator_Base_Class'
evaluateRecommender(recommender_object)
Parameters
  • recommender_object – the trained recommender object, a BaseRecommender subclass

  • URM_test_list – list of URMs to test the recommender against, or a single URM object

  • cutoff_list – list of cutoffs used to report the scores, or a single cutoff

get_user_relevant_items(user_id)
get_user_test_ratings(user_id)
class baserec.base.evaluation.evaluator.EvaluatorHoldout(URM_test_list, cutoff_list, min_ratings_per_user=1, exclude_seen=True, diversity_object=None, ignore_items=None, ignore_users=None, verbose=True)

Bases: baserec.base.evaluation.evaluator.Evaluator

EVALUATOR_NAME = 'EvaluatorHoldout'
class baserec.base.evaluation.evaluator.EvaluatorMetrics(value)

Bases: enum.Enum

An enumeration.

ARHR = 'ARHR'
AVERAGE_POPULARITY = 'AVERAGE_POPULARITY'
COVERAGE_ITEM = 'COVERAGE_ITEM'
COVERAGE_USER = 'COVERAGE_USER'
DIVERSITY_GINI = 'DIVERSITY_GINI'
DIVERSITY_HERFINDAHL = 'DIVERSITY_HERFINDAHL'
DIVERSITY_MEAN_INTER_LIST = 'DIVERSITY_MEAN_INTER_LIST'
DIVERSITY_SIMILARITY = 'DIVERSITY_SIMILARITY'
F1 = 'F1'
HIT_RATE = 'HIT_RATE'
MAP = 'MAP'
MRR = 'MRR'
NDCG = 'NDCG'
NOVELTY = 'NOVELTY'
PRECISION = 'PRECISION'
PRECISION_RECALL_MIN_DEN = 'PRECISION_RECALL_MIN_DEN'
RECALL = 'RECALL'
ROC_AUC = 'ROC_AUC'
SHANNON_ENTROPY = 'SHANNON_ENTROPY'
class baserec.base.evaluation.evaluator.EvaluatorNegativeItemSample(URM_test_list, URM_test_negative, cutoff_list, min_ratings_per_user=1, exclude_seen=True, diversity_object=None, ignore_items=None, ignore_users=None)

Bases: baserec.base.evaluation.evaluator.Evaluator

Evaluator with Negative Item Sampling

The EvaluatorNegativeItemSample computes the recommendations by sorting the test items as well as the test_negative items.

It ensures that each item appears only once, even if it is listed in both matrices.

Parameters
  • URM_test_list – Positive samples

  • URM_test_negative – Items to rank together with the test items

  • cutoff_list – List of cutoffs to use

  • min_ratings_per_user (int, optional) – [TODO: description], by default 1

  • exclude_seen (bool, optional) – Don’t evaluate on seen entries, by default True

  • diversity_object ([TODO: type], optional) – [TODO: description], by default None

  • ignore_items (Sequence, optional) – [TODO:description], by default None

  • ignore_users (Sequence, optional) – [TODO:description], by default None

EVALUATOR_NAME = 'EvaluatorNegativeItemSample'
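
The ranking step described for EvaluatorNegativeItemSample can be sketched as follows. This is a minimal illustration and not the library's implementation: rank_candidates and its arguments are hypothetical names, and the item scores are assumed to be available as a dense NumPy array indexed by item id.

```python
import numpy as np

def rank_candidates(item_scores, test_items, negative_items, cutoff):
    """Rank the union of test and negative items by score (hypothetical helper)."""
    # np.union1d returns sorted unique ids, so an item listed in both
    # inputs is ranked only once.
    candidates = np.union1d(test_items, negative_items)
    # Sort candidates by descending score; a stable sort keeps ties in id order.
    order = np.argsort(-item_scores[candidates], kind="stable")
    return candidates[order][:cutoff]

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
# Item 3 appears in both lists but is ranked once; item 2 is never a candidate.
ranked = rank_candidates(scores, test_items=[1, 3], negative_items=[3, 4, 0], cutoff=3)
```

Only the candidate items compete for a position in the list, which is why the cutoff is applied after restricting the scores to the union.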
baserec.base.evaluation.evaluator.get_result_string(results_run, n_decimals=7)

baserec.base.evaluation.metrics module

@author: Maurizio Ferrari Dacrema & Ceshine Lee

class baserec.base.evaluation.metrics.AveragePopularity(URM_train)

Bases: baserec.base.evaluation.metrics._Metrics_Object

Average popularity of the recommended items in the train data. Popularity is normalized so that the most popular item in the train data has a value of 1.

add_recommendations(recommended_items_ids)
get_metric_value()
merge_with_other(other_metric_object)
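
The normalization described for AveragePopularity can be illustrated with a short sketch. It assumes a small dense array in place of a sparse URM and averages within each recommendation list before averaging across lists; both choices, and the name average_popularity, are assumptions made for illustration rather than the library's behavior.

```python
import numpy as np

def average_popularity(urm_train, recommendation_lists):
    """Mean normalized train popularity of recommended items (illustrative sketch)."""
    # Popularity of an item = number of users who interacted with it in training.
    item_popularity = np.count_nonzero(urm_train, axis=0).astype(float)
    # Normalize so the most popular item has popularity 1.
    item_popularity /= item_popularity.max()
    # Average within each list, then across lists (assumed aggregation).
    per_list = [item_popularity[np.asarray(items)].mean() for items in recommendation_lists]
    return float(np.mean(per_list))

urm_train = np.array([[1, 1, 0],
                      [1, 0, 0],
                      [1, 1, 1],
                      [1, 0, 0]])
# Normalized popularities are [1.0, 0.5, 0.25].
value = average_popularity(urm_train, [[0, 1], [2, 0]])
```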
class baserec.base.evaluation.metrics.Coverage_Item(n_items, ignore_items)

Bases: baserec.base.evaluation.metrics._Global_Item_Distribution_Counter

Item coverage is the percentage of all items that were recommended at least once. https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff

get_metric_value()
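
A minimal sketch of the item coverage computation follows. The function name and the treatment of ignore_items (excluded from both the numerator and the denominator) are assumptions for illustration; the library may handle ignored items differently.

```python
import numpy as np

def item_coverage(recommendation_lists, n_items, ignore_items=()):
    """Fraction of non-ignored items recommended at least once (illustrative sketch)."""
    counts = np.zeros(n_items, dtype=int)
    for items in recommendation_lists:
        # Items within one list are assumed to be distinct.
        counts[np.asarray(items, dtype=int)] += 1
    # Mask out ignored items so they affect neither numerator nor denominator.
    mask = np.ones(n_items, dtype=bool)
    mask[np.asarray(list(ignore_items), dtype=int)] = False
    return np.count_nonzero(counts[mask]) / mask.sum()

lists = [[0, 1], [1, 2]]
coverage_all = item_coverage(lists, n_items=5)
coverage_ignoring = item_coverage(lists, n_items=5, ignore_items=[4])
```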
class baserec.base.evaluation.metrics.Coverage_Test_Correct(n_items, ignore_items)

Bases: baserec.base.evaluation.metrics._Global_Item_Distribution_Counter

Item coverage is the percentage of all test items that were correctly recommended. https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff

add_recommendations(recommended_items_ids, is_relevant)
get_metric_value()
class baserec.base.evaluation.metrics.Coverage_User(n_users, ignore_users)

Bases: baserec.base.evaluation.metrics._Metrics_Object

User coverage is the percentage of all users for whom we can make recommendations. A user is considered covered if there is at least one recommendation. https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff

add_recommendations(recommended_items_ids, user_id)
get_metric_value()
merge_with_other(other_metric_object)
class baserec.base.evaluation.metrics.Coverage_User_Correct(n_users, ignore_users)

Bases: baserec.base.evaluation.metrics._Metrics_Object

User coverage is the percentage of all users for whom we can make at least one correct recommendation. A user is considered covered if there is at least one correct recommendation. https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff

add_recommendations(is_relevant, user_id)
get_metric_value()
merge_with_other(other_metric_object)
class baserec.base.evaluation.metrics.DiversityHerfindahl(n_items, ignore_items)

Bases: baserec.base.evaluation.metrics._Global_Item_Distribution_Counter

The Herfindahl index, also known as the concentration index, is used in economics to determine whether market shares are excessively concentrated. Here it is used as a diversity index: a high value means high diversity.

In recommender systems it is known to have a small value range, between 0.9 and 1.0.

The Herfindahl index is a function of the square of the probability that an item has been recommended to any user; hence it is equivalent to MeanInterList diversity, as they measure the same quantity.

# http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.459.8174&rep=rep1&type=pdf

get_metric_value()
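
As a rough sketch, a diversity-oriented Herfindahl index can be computed from the global recommendation counts. The form used here, one minus the sum of squared recommendation probabilities, is an assumption consistent with the description (high value means high diversity); herfindahl_diversity is a hypothetical name.

```python
import numpy as np

def herfindahl_diversity(item_counts):
    """1 - sum of squared recommendation probabilities (assumed form)."""
    counts = np.asarray(item_counts, dtype=float)
    # Probability that a given recommendation slot is filled by each item.
    probabilities = counts / counts.sum()
    return 1.0 - float(np.sum(probabilities ** 2))

uniform = herfindahl_diversity([5, 5, 5, 5])        # evenly spread recommendations
concentrated = herfindahl_diversity([20, 0, 0, 0])  # a single item gets everything
```

With many items and reasonably spread recommendations the sum of squares becomes very small, which is consistent with the narrow value range near 1.0 noted above.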
class baserec.base.evaluation.metrics.DiversityMeanInterList(n_items, cutoff)

Bases: baserec.base.evaluation.metrics._Metrics_Object

MeanInterList diversity measures the uniqueness of different users’ recommendation lists.

It can be used to measure how diversified the recommendations received by different users are.

While the original proposal called this metric “Personalization”, we do not use this name since the highest MeanInterList diversity is exhibited by a non personalized Random recommender.

It can be shown that this metric does not require computing the items that every possible pair of users has in common; it depends only on the total number of times each item has been recommended. See Appendix B of my PhD thesis, “An assessment of reproducibility and methodological issues in neural recommender systems research”, for details.

MeanInterList diversity is a function of the square of the probability an item has been recommended to any user, hence MeanInterList diversity is equivalent to the Herfindahl index as they measure the same quantity.

A TopPopular recommender that does not remove seen items will have 0.0 MeanInterList diversity.

p. 3, http://www.pnas.org/content/pnas/107/10/4511.full.pdf

@article{zhou2010solving,
  title={Solving the apparent diversity-accuracy dilemma of recommender systems},
  author={Zhou, Tao and Kuscsik, Zolt{\'a}n and Liu, Jian-Guo and Medo, Mat{\'u}{\v{s}} and Wakeling, Joseph Rushton and Zhang, Yi-Cheng},
  journal={Proceedings of the National Academy of Sciences},
  volume={107},
  number={10},
  pages={4511--4515},
  year={2010},
  publisher={National Acad Sciences}
}

# The formula is diversity_cumulative += 1 - common_recommendations(user1, user2)/cutoff
# for each couple of users, except the diagonal. It is VERY computationally expensive
# We can move the 1 and cutoff outside of the summation. Remember to exclude the diagonal
# co_counts = URM_predicted.dot(URM_predicted.T)
# co_counts[np.arange(0, n_user, dtype=np.int):np.arange(0, n_user, dtype=np.int)] = 0
# diversity = (n_user**2 - n_user) - co_counts.sum()/self.cutoff

# If we represent the summation of co_counts separating it for each item, we will have:
# co_counts.sum() = co_counts_item1.sum() + co_counts_item2.sum() …
# If we know how many times an item has been recommended, co_counts_item1.sum() can be computed as how many couples of
# users have item1 in common. If item1 has been recommended n times, the number of couples is n*(n-1)
# Therefore we can compute co_counts.sum() value as:
# np.sum(np.multiply(item-occurrence, item-occurrence-1))

# The naive implementation URM_predicted.dot(URM_predicted.T) might require an hour of computation
# The last implementation has a negligible computational time even for very big datasets
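
The equivalence described in these notes can be checked with a small sketch that computes the metric both ways: once over all ordered pairs of users, and once from per-item recommendation counts. The function names are hypothetical, and dividing by the number of ordered pairs to obtain a value in [0, 1] is an assumption; the notes above only give the unnormalized sum.

```python
import numpy as np
from itertools import permutations

def mean_inter_list_naive(recommendation_lists, cutoff):
    """Average of 1 - common/cutoff over all ordered user pairs (quadratic cost)."""
    n_users = len(recommendation_lists)
    sets = [set(items) for items in recommendation_lists]
    total = 0.0
    for u, v in permutations(range(n_users), 2):  # excludes the diagonal u == v
        total += 1.0 - len(sets[u] & sets[v]) / cutoff
    return total / (n_users * (n_users - 1))

def mean_inter_list_from_counts(recommendation_lists, cutoff, n_items):
    """Same quantity from item occurrence counts only (linear cost)."""
    n_users = len(recommendation_lists)
    item_counts = np.bincount(np.concatenate(recommendation_lists), minlength=n_items)
    n_pairs = n_users ** 2 - n_users
    # An item recommended n times is shared by n*(n-1) ordered user pairs.
    co_counts_sum = np.sum(item_counts * (item_counts - 1))
    return (n_pairs - co_counts_sum / cutoff) / n_pairs

lists = [[0, 1], [1, 2], [2, 3]]
naive = mean_inter_list_naive(lists, cutoff=2)
fast = mean_inter_list_from_counts(lists, cutoff=2, n_items=4)
# Identical lists for every user, as a TopPopular recommender would produce.
identical = mean_inter_list_from_counts([[0, 1]] * 3, cutoff=2, n_items=4)
```

The count-based version never materializes the user-by-user matrix, which is what makes it cheap on large datasets.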

add_recommendations(recommended_items_ids)
get_metric_value()
get_theoretical_max()
merge_with_other(other_metric_object)
class baserec.base.evaluation.metrics.Diversity_similarity(item_diversity_matrix)

Bases: baserec.base.evaluation.metrics._Metrics_Object

Intra list diversity computes the diversity of items appearing in the recommendations received by each single user, by using an item_diversity_matrix.

It can be used, for example, to compute the diversity in terms of features for a collaborative recommender.

A content-based recommender will have low IntraList diversity if that is computed on the same features the recommender uses. A TopPopular recommender may exhibit high IntraList diversity.

add_recommendations(recommended_items_ids)
get_metric_value()
merge_with_other(other_metric_object)
class baserec.base.evaluation.metrics.GiniDiversity(n_items, ignore_items)

Bases: baserec.base.evaluation.metrics._Global_Item_Distribution_Counter

Gini diversity index, computed from the Gini Index but with inverted range, such that high values mean higher diversity. This implementation ignores zero-occurrence items.

# From https://github.com/oliviaguest/gini
# based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif
# from: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
#
# http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.459.8174&rep=rep1&type=pdf

get_metric_value()
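
The idea behind GiniDiversity can be sketched as below. The Gini index is computed with the standard sorted-values formula, and the range is inverted as one minus the Gini index; that particular inversion and the name gini_diversity are assumptions for illustration and may differ from the library's exact formula.

```python
import numpy as np

def gini_diversity(item_counts):
    """1 - Gini index of the recommendation counts, ignoring zero-occurrence items."""
    counts = np.asarray(item_counts, dtype=float)
    counts = np.sort(counts[counts > 0])  # drop zero-occurrence items, sort ascending
    n = counts.shape[0]
    index = np.arange(1, n + 1)
    # Standard Gini index for sorted non-negative values.
    gini_index = np.sum((2 * index - n - 1) * counts) / (n * counts.sum())
    return 1.0 - float(gini_index)

even = gini_diversity([3, 3, 3])
skewed = gini_diversity([0, 1, 3])  # the zero-count item is ignored
```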
class baserec.base.evaluation.metrics.MAP

Bases: baserec.base.evaluation.metrics._Metrics_Object

Mean Average Precision, defined as the mean of the AveragePrecision over all users

add_recommendations(is_relevant, pos_items)
get_metric_value()
merge_with_other(other_metric_object)
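
A minimal sketch of Average Precision and its mean over users is given below. The argument names mirror average_precision(is_relevant, pos_items), but the functions are hypothetical, and normalizing by the smaller of the number of relevant items and the list length is an assumption; other definitions divide by the number of relevant items only.

```python
import numpy as np

def average_precision_sketch(is_relevant, pos_items):
    """Average Precision of one ranked list (illustrative, assumed normalization)."""
    is_relevant = np.asarray(is_relevant, dtype=float)
    if is_relevant.size == 0 or len(pos_items) == 0:
        return 0.0
    # Precision at each position k = hits so far / k.
    precision_at_k = np.cumsum(is_relevant) / np.arange(1, is_relevant.size + 1)
    denominator = min(len(pos_items), is_relevant.size)
    # Only positions holding a relevant item contribute.
    return float(np.sum(precision_at_k * is_relevant) / denominator)

def mean_average_precision(per_user_inputs):
    """Mean of the per-user Average Precision values."""
    return float(np.mean([average_precision_sketch(r, p) for r, p in per_user_inputs]))

ap_user_1 = average_precision_sketch([1, 0, 1], pos_items=[10, 20])
map_value = mean_average_precision([([1, 0, 1], [10, 20]), ([0, 0], [5])])
```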
class baserec.base.evaluation.metrics.MRR

Bases: baserec.base.evaluation.metrics._Metrics_Object

Mean Reciprocal Rank, defined as the mean of the Reciprocal Rank over all users

add_recommendations(is_relevant)
get_metric_value()
merge_with_other(other_metric_object)
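
The Reciprocal Rank of a single list is the inverse of the position of the first relevant item, and MRR averages it over users. The sketch below uses hypothetical function names and assumes that a list without relevant items scores 0.

```python
import numpy as np

def reciprocal_rank(is_relevant):
    """1 / (1-based position of the first relevant item), or 0.0 if there is none."""
    hits = np.flatnonzero(np.asarray(is_relevant))
    return 1.0 / (hits[0] + 1) if hits.size > 0 else 0.0

def mean_reciprocal_rank(per_user_is_relevant):
    """Mean of the per-user Reciprocal Rank values."""
    return float(np.mean([reciprocal_rank(r) for r in per_user_is_relevant]))

rr_value = reciprocal_rank([0, 0, 1, 1])
mrr_value = mean_reciprocal_rank([[1, 0], [0, 1], [0, 0]])
```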
class baserec.base.evaluation.metrics.Novelty(URM_train)

Bases: baserec.base.evaluation.metrics._Metrics_Object

Novelty measures how “novel” a recommendation is in terms of how popular the item was in the train set.

Because of this definition, the novelty of a cold item (i.e. one with no interactions in the train set) is undefined. In this implementation cold items are ignored and their contribution to the novelty is 0.

A recommender with high novelty is able to recommend long-tail (i.e. unpopular) items as well.

Mean self-information (Zhou 2010)

add_recommendations(recommended_items_ids)
get_metric_value()
merge_with_other(other_metric_object)
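
The treatment of cold items can be illustrated with a sketch of mean self-information. Here the self-information of an item is taken as the negative base-2 logarithm of its share of train interactions, and a dense array stands in for the sparse URM; both are assumptions for illustration, and the library's normalization may differ.

```python
import numpy as np

def novelty(urm_train, recommended_items):
    """Mean self-information of the recommended items (illustrative sketch)."""
    item_popularity = np.count_nonzero(urm_train, axis=0).astype(float)
    probability = item_popularity / item_popularity.sum()
    # Cold items (zero train interactions) keep a contribution of 0.
    self_information = np.zeros_like(probability)
    warm = probability > 0
    self_information[warm] = -np.log2(probability[warm])
    return float(self_information[np.asarray(recommended_items)].mean())

urm_train = np.array([[1, 1, 0, 0],
                      [1, 0, 1, 0]])
# Interaction shares are [0.5, 0.25, 0.25, 0.0]; item 3 is cold.
popular_and_rare = novelty(urm_train, [0, 1])
with_cold_item = novelty(urm_train, [1, 3])
```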
class baserec.base.evaluation.metrics.RMSE(URM_all)

Bases: baserec.base.evaluation.metrics._Metrics_Object

Root Mean Squared Error

add_recommendations(all_items_predicted_ratings, relevant_items, relevant_items_rating)
get_metric_value()
merge_with_other(other_metric_object)
class baserec.base.evaluation.metrics.ShannonEntropy(n_items, ignore_items)

Bases: baserec.base.evaluation.metrics._Global_Item_Distribution_Counter

Shannon Entropy is a well-known metric for measuring the amount of information in a certain string of data. Here it is applied to the global number of times an item has been recommended.

It has a lower bound and can reach values over 12.0 for random recommenders. A high entropy means that the distribution is random uniform across all users.

Note that while a random uniform distribution (hence all items with SIMILAR number of occurrences) will be highly diverse and have high entropy, a perfectly uniform distribution (hence all items with EXACTLY IDENTICAL number of occurrences) will have 0.0 entropy while being the most diverse possible.

get_metric_value()
class baserec.base.evaluation.metrics.TestAUC(methodName='runTest')

Bases: unittest.case.TestCase

runTest()
class baserec.base.evaluation.metrics.TestNDCG(methodName='runTest')

Bases: unittest.case.TestCase

runTest()
class baserec.base.evaluation.metrics.TestPrecision(methodName='runTest')

Bases: unittest.case.TestCase

runTest()
class baserec.base.evaluation.metrics.TestRR(methodName='runTest')

Bases: unittest.case.TestCase

runTest()
class baserec.base.evaluation.metrics.TestRecall(methodName='runTest')

Bases: unittest.case.TestCase

runTest()
baserec.base.evaluation.metrics.arhr(is_relevant)
baserec.base.evaluation.metrics.average_precision(is_relevant, pos_items)
baserec.base.evaluation.metrics.dcg(scores)
baserec.base.evaluation.metrics.ndcg(ranked_list, pos_items, relevance=None, at=None)
baserec.base.evaluation.metrics.pp_metrics(metric_names, metric_values, metric_at)

Pretty-prints metric values.

baserec.base.evaluation.metrics.precision(is_relevant)
baserec.base.evaluation.metrics.precision_recall_min_denominator(is_relevant, n_test_items)
baserec.base.evaluation.metrics.recall(is_relevant, pos_items)
baserec.base.evaluation.metrics.roc_auc(is_relevant)
baserec.base.evaluation.metrics.rr(is_relevant)

baserec.base.evaluation.metrics_test module

@author: Maurizio Ferrari Dacrema & Ceshine Lee

class baserec.base.evaluation.metrics_test.MyTestCase(methodName='runTest')

Bases: unittest.case.TestCase

test_AUC()
test_Diversity_list()
test_Diversity_list_all_equals()
test_Gini_Index()
test_NDCG()
test_Precision()
test_RR()
test_Recall()
test_Shannon_Entropy()

Module contents