baserec.base.evaluation package¶
Submodules¶
baserec.base.evaluation.evaluator module¶
@author: Maurizio Ferrari Dacrema & Ceshine Lee
-
class
baserec.base.evaluation.evaluator.Evaluator(URM_test_list, cutoff_list, min_ratings_per_user=1, exclude_seen=True, diversity_object=None, ignore_items=None, ignore_users=None, verbose=True)¶ Bases:
object
Abstract Evaluator
-
EVALUATOR_NAME= 'Evaluator_Base_Class'¶
-
evaluateRecommender(recommender_object)¶
- Parameters
recommender_object – the trained recommender object, a BaseRecommender subclass
URM_test_list – list of URMs to test the recommender against, or a single URM object
cutoff_list – list of cutoffs to use when reporting the scores, or a single cutoff
-
get_user_relevant_items(user_id)¶
-
get_user_test_ratings(user_id)¶
-
-
class
baserec.base.evaluation.evaluator.EvaluatorHoldout(URM_test_list, cutoff_list, min_ratings_per_user=1, exclude_seen=True, diversity_object=None, ignore_items=None, ignore_users=None, verbose=True)¶ Bases:
baserec.base.evaluation.evaluator.Evaluator
-
EVALUATOR_NAME= 'EvaluatorHoldout'¶
-
-
class
baserec.base.evaluation.evaluator.EvaluatorMetrics(value)¶ Bases:
enum.Enum
An enumeration.
-
ARHR= 'ARHR'¶
-
AVERAGE_POPULARITY= 'AVERAGE_POPULARITY'¶
-
COVERAGE_ITEM= 'COVERAGE_ITEM'¶
-
COVERAGE_USER= 'COVERAGE_USER'¶
-
DIVERSITY_GINI= 'DIVERSITY_GINI'¶
-
DIVERSITY_HERFINDAHL= 'DIVERSITY_HERFINDAHL'¶
-
DIVERSITY_MEAN_INTER_LIST= 'DIVERSITY_MEAN_INTER_LIST'¶
-
DIVERSITY_SIMILARITY= 'DIVERSITY_SIMILARITY'¶
-
F1= 'F1'¶
-
HIT_RATE= 'HIT_RATE'¶
-
MAP= 'MAP'¶
-
MRR= 'MRR'¶
-
NDCG= 'NDCG'¶
-
NOVELTY= 'NOVELTY'¶
-
PRECISION= 'PRECISION'¶
-
PRECISION_RECALL_MIN_DEN= 'PRECISION_RECALL_MIN_DEN'¶
-
RECALL= 'RECALL'¶
-
ROC_AUC= 'ROC_AUC'¶
-
SHANNON_ENTROPY= 'SHANNON_ENTROPY'¶
-
-
class
baserec.base.evaluation.evaluator.EvaluatorNegativeItemSample(URM_test_list, URM_test_negative, cutoff_list, min_ratings_per_user=1, exclude_seen=True, diversity_object=None, ignore_items=None, ignore_users=None)¶ Bases:
baserec.base.evaluation.evaluator.Evaluator
Evaluator with Negative Item Sampling
The EvaluatorNegativeItemSample computes the recommendations by ranking the test items together with the test_negative items.
It ensures that each item appears only once, even if it is listed in both matrices.
- Parameters
URM_test_list – Positive samples
URM_test_negative – Items to rank together with the test items
cutoff_list – List of cutoffs to use
min_ratings_per_user (int, optional) – [TODO: description], by default 1
exclude_seen (bool, optional) – Don’t evaluate on seen entries, by default True
diversity_object ([TODO: type], optional) – [TODO: description], by default None
ignore_items (Sequence, optional) – [TODO: description], by default None
ignore_users (Sequence, optional) – [TODO: description], by default None
-
EVALUATOR_NAME= 'EvaluatorNegativeItemSample'¶
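The ranking step described for EvaluatorNegativeItemSample can be illustrated with a minimal sketch. It is not the library's implementation: the function name rank_candidates, the score container and the tie-breaking rule are all assumptions made for illustration.

```python
def rank_candidates(scores, test_items, negative_items, cutoff):
    """Rank the union of test and negative items by score, highest first.

    Hypothetical helper: `scores` maps an item id to the predicted score
    for a single user.
    """
    # A set union keeps each item once, even if it appears in both lists.
    candidates = sorted(set(test_items) | set(negative_items))
    # Highest score first; Python's sort is stable, so ties keep id order.
    candidates.sort(key=lambda item: scores[item], reverse=True)
    return candidates[:cutoff]


scores = {0: 0.1, 1: 0.9, 2: 0.3, 3: 0.8, 4: 0.5}  # made-up scores
ranked = rank_candidates(scores, test_items=[1, 2], negative_items=[2, 3, 4], cutoff=3)
print(ranked)  # [1, 3, 4]
```

Item 2 is listed as both a test item and a negative item, yet it enters the candidate pool only once.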
-
baserec.base.evaluation.evaluator.get_result_string(results_run, n_decimals=7)¶
baserec.base.evaluation.metrics module¶
@author: Maurizio Ferrari Dacrema & Ceshine Lee
-
class
baserec.base.evaluation.metrics.AveragePopularity(URM_train)¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
Average popularity of the recommended items in the train data. The popularity is normalized so that the most popular item in the train data has popularity 1.
-
add_recommendations(recommended_items_ids)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
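As a rough illustration of the normalization described for AveragePopularity, the sketch below works on plain Python lists rather than a URM. The helper names and the choice to average over all recommended items at once are assumptions; the library may aggregate per user instead.

```python
from collections import Counter


def normalized_popularity(train_interactions):
    """train_interactions: iterable of (user_id, item_id) pairs."""
    counts = Counter(item for _, item in train_interactions)
    max_count = max(counts.values())
    # The most popular item gets popularity 1.0.
    return {item: count / max_count for item, count in counts.items()}


def average_popularity(recommendation_lists, popularity):
    # Items never seen in training are given popularity 0 (an assumption).
    values = [popularity.get(item, 0.0)
              for rec_list in recommendation_lists
              for item in rec_list]
    return sum(values) / len(values)


train = [(0, "a"), (1, "a"), (2, "a"), (3, "a"), (0, "b"), (1, "b"), (2, "c")]
popularity = normalized_popularity(train)
avg_pop = average_popularity([["a", "b"], ["a", "c"]], popularity)
print(avg_pop)  # 0.6875
```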
-
-
class
baserec.base.evaluation.metrics.Coverage_Item(n_items, ignore_items)¶ Bases:
baserec.base.evaluation.metrics._Global_Item_Distribution_Counter
Item coverage represents the percentage of the overall items which were recommended.
https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff
-
get_metric_value()¶
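Item coverage can be sketched in a few lines. The function below is hypothetical, and the way ignored items are removed from both the numerator and the denominator is an assumption rather than documented behaviour.

```python
def item_coverage(recommendation_lists, n_items, ignore_items=()):
    ignored = set(ignore_items)
    # Distinct items recommended to at least one user.
    recommended = {item for rec in recommendation_lists for item in rec} - ignored
    return len(recommended) / (n_items - len(ignored))


coverage = item_coverage([[0, 1], [1, 2]], n_items=5)
print(coverage)  # 0.6
```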
-
-
class
baserec.base.evaluation.metrics.Coverage_Test_Correct(n_items, ignore_items)¶ Bases:
baserec.base.evaluation.metrics._Global_Item_Distribution_Counter
Item coverage represents the percentage of the overall test items which were correctly recommended.
https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff
-
add_recommendations(recommended_items_ids, is_relevant)¶
-
get_metric_value()¶
-
-
class
baserec.base.evaluation.metrics.Coverage_User(n_users, ignore_users)¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
User coverage represents the percentage of the overall users for which we can make recommendations. A user with at least one recommendation is considered covered.
https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff
-
add_recommendations(recommended_items_ids, user_id)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
-
-
class
baserec.base.evaluation.metrics.Coverage_User_Correct(n_users, ignore_users)¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
User coverage represents the percentage of the overall users for which we can make at least one correct recommendation. A user with at least one correct recommendation is considered covered.
https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff
-
add_recommendations(is_relevant, user_id)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
-
-
class
baserec.base.evaluation.metrics.DiversityHerfindahl(n_items, ignore_items)¶ Bases:
baserec.base.evaluation.metrics._Global_Item_Distribution_Counter
The Herfindahl index, also known as the concentration index, is used in economics to determine whether market shares are excessively concentrated. Here it is used as a diversity index: a high value means high diversity.
It is known to have a small value range in recommender systems, between 0.9 and 1.0.
The Herfindahl index is a function of the square of the probability that an item has been recommended to any user; hence it is equivalent to MeanInterList diversity, as they measure the same quantity.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.459.8174&rep=rep1&type=pdf
-
get_metric_value()¶
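One common way to turn the Herfindahl concentration index into a diversity score is to subtract the sum of squared recommendation probabilities from 1. The sketch below assumes that form; whether DiversityHerfindahl uses exactly this normalization is not stated here.

```python
def herfindahl_diversity(item_counts):
    """item_counts: number of times each item was recommended."""
    total = sum(item_counts)
    # Sum of squared probabilities is the concentration; invert it for diversity.
    return 1.0 - sum((count / total) ** 2 for count in item_counts)


print(herfindahl_diversity([5, 5, 5, 5]))   # evenly spread recommendations
print(herfindahl_diversity([10, 0, 0, 0]))  # everything on one item
```

With many items and reasonably spread recommendations the sum of squares becomes very small, which is consistent with the narrow 0.9 to 1.0 range mentioned above.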
-
-
class
baserec.base.evaluation.metrics.DiversityMeanInterList(n_items, cutoff)¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
MeanInterList diversity measures the uniqueness of different users’ recommendation lists.
It can be used to measure how “diversified” the recommendations received by different users are.
While the original proposal called this metric “Personalization”, we do not use this name since the highest MeanInterList diversity is exhibited by a non-personalized Random recommender.
It can be demonstrated that this metric does not require computing the items that every possible pair of users has in common; rather, it is only sensitive to the total number of times each item has been recommended. For references, please refer to Appendix B of the PhD thesis “An assessment of reproducibility and methodological issues in neural recommender systems research”.
MeanInterList diversity is a function of the square of the probability that an item has been recommended to any user; hence it is equivalent to the Herfindahl index, as they measure the same quantity.
A TopPopular recommender that does not remove seen items will have 0.0 MeanInterList diversity.
See p. 3 of http://www.pnas.org/content/pnas/107/10/4511.full.pdf
@article{zhou2010solving,
  title={Solving the apparent diversity-accuracy dilemma of recommender systems},
  author={Zhou, Tao and Kuscsik, Zolt{\'a}n and Liu, Jian-Guo and Medo, Mat{\'u}{\v{s}} and Wakeling, Joseph Rushton and Zhang, Yi-Cheng},
  journal={Proceedings of the National Academy of Sciences},
  volume={107},
  number={10},
  pages={4511--4515},
  year={2010},
  publisher={National Acad Sciences}
}
# The formula is diversity_cumulative += 1 - common_recommendations(user1, user2)/cutoff
# for each couple of users, except the diagonal. It is VERY computationally expensive
# We can move the 1 and cutoff outside of the summation. Remember to exclude the diagonal
# co_counts = URM_predicted.dot(URM_predicted.T)
# co_counts[np.arange(0, n_user, dtype=np.int):np.arange(0, n_user, dtype=np.int)] = 0
# diversity = (n_user**2 - n_user) - co_counts.sum()/self.cutoff
#
# If we represent the summation of co_counts separating it for each item, we will have:
# co_counts.sum() = co_counts_item1.sum() + co_counts_item2.sum() …
# If we know how many times an item has been recommended, co_counts_item1.sum() can be computed as how many couples of
# users have item1 in common. If item1 has been recommended n times, the number of couples is n*(n-1)
# Therefore we can compute co_counts.sum() value as:
# np.sum(np.multiply(item-occurrence, item-occurrence-1))
#
# The naive implementation URM_predicted.dot(URM_predicted.T) might require an hour of computation
# The last implementation has a negligible computational time even for very big datasets
-
add_recommendations(recommended_items_ids)¶
-
get_metric_value()¶
-
get_theoretical_max()¶
-
merge_with_other(other_metric_object)¶
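The equivalence between the pairwise formula and the count-based shortcut described for DiversityMeanInterList can be checked on a small example. This is a sketch under stated assumptions: both function names are hypothetical, every list is assumed to contain exactly `cutoff` distinct items, and the final division by the number of user pairs (to obtain a mean) is an assumption about how the cumulative value is normalized.

```python
from collections import Counter
from itertools import permutations


def mean_inter_list_naive(rec_lists, cutoff):
    # Average 1 - |common items| / cutoff over all ordered user pairs (u != v).
    pairs = list(permutations(rec_lists, 2))
    total = sum(1 - len(set(a) & set(b)) / cutoff for a, b in pairs)
    return total / len(pairs)


def mean_inter_list_fast(rec_lists, cutoff):
    n_users = len(rec_lists)
    counts = Counter(item for rec in rec_lists for item in rec)
    # An item recommended n times is shared by n*(n-1) ordered user pairs.
    co_counts = sum(n * (n - 1) for n in counts.values())
    n_pairs = n_users ** 2 - n_users
    return (n_pairs - co_counts / cutoff) / n_pairs


rec_lists = [[0, 1, 2], [0, 3, 4], [0, 1, 5]]
naive = mean_inter_list_naive(rec_lists, cutoff=3)
fast = mean_inter_list_fast(rec_lists, cutoff=3)
print(naive, fast)
```

The naive version compares every pair of users, while the fast version only needs one pass over the recommendations to count item occurrences. When every user receives the same list, both return 0.0, matching the TopPopular remark above.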
-
class
baserec.base.evaluation.metrics.Diversity_similarity(item_diversity_matrix)¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
Intra-list diversity measures the diversity of the items appearing in the recommendations received by each single user, using an item_diversity_matrix.
It can be used, for example, to compute the diversity in terms of features for a collaborative recommender.
A content-based recommender will have low IntraList diversity if it is computed on the same features the recommender uses. A TopPopular recommender may exhibit high IntraList diversity.
-
add_recommendations(recommended_items_ids)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
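A minimal sketch of intra-list diversity for a single user follows. It assumes a symmetric item_diversity matrix given as nested lists and averages over unordered item pairs; the function name and this particular normalization are illustrative assumptions.

```python
from itertools import combinations


def intra_list_diversity(recommended_items, item_diversity):
    # Mean diversity over all unordered pairs of items in one user's list.
    pairs = list(combinations(recommended_items, 2))
    if not pairs:
        return 0.0
    return sum(item_diversity[i][j] for i, j in pairs) / len(pairs)


# Made-up symmetric diversity values between three items.
item_diversity = [
    [0.0, 0.2, 1.0],
    [0.2, 0.0, 0.6],
    [1.0, 0.6, 0.0],
]
value = intra_list_diversity([0, 1, 2], item_diversity)
print(value)
```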
-
-
class
baserec.base.evaluation.metrics.GiniDiversity(n_items, ignore_items)¶ Bases:
baserec.base.evaluation.metrics._Global_Item_Distribution_Counter
Gini diversity index, computed from the Gini index but with inverted range, such that high values mean higher diversity. This implementation ignores zero-occurrence items.
# From https://github.com/oliviaguest/gini
# based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif
# from: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
#
# http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.459.8174&rep=rep1&type=pdf
-
get_metric_value()¶
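The inverted Gini index can be sketched with the textbook formula on sorted counts. This is an illustration only: the function name is hypothetical, and the library's exact normalization may differ from the standard form used here.

```python
def gini_diversity(item_counts):
    # Drop zero-occurrence items, as the description above states.
    counts = sorted(c for c in item_counts if c > 0)
    n = len(counts)
    total = sum(counts)
    # Standard Gini index on sorted values: sum_i (2i - n - 1) * x_i / (n * sum x).
    gini = sum((2 * i - n - 1) * x for i, x in enumerate(counts, start=1)) / (n * total)
    # Invert the range so that higher means more diverse.
    return 1.0 - gini


print(gini_diversity([5, 5, 5, 5]))    # evenly spread recommendations
print(gini_diversity([1, 1, 1, 97]))   # heavily concentrated recommendations
```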
-
-
class
baserec.base.evaluation.metrics.MAP¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
Mean Average Precision, defined as the mean of the AveragePrecision over all users
-
add_recommendations(is_relevant, pos_items)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
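To make the definition concrete, here is a hedged sketch of average precision for one user and its mean over users. The names are hypothetical, the second argument is a count rather than the pos_items array the library takes, and normalizing by the smaller of the number of positive items and the list length is an assumed convention.

```python
def average_precision_sketch(is_relevant, n_pos_items):
    """is_relevant: booleans for the ranked list, best rank first."""
    if not is_relevant or n_pos_items == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(is_relevant, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / min(n_pos_items, len(is_relevant))


def mean_average_precision_sketch(per_user):
    """per_user: list of (is_relevant, n_pos_items) tuples, one per user."""
    return sum(average_precision_sketch(r, n) for r, n in per_user) / len(per_user)


ap = average_precision_sketch([True, False, True], n_pos_items=2)
map_value = mean_average_precision_sketch([([True, False, True], 2), ([False, True], 1)])
print(ap, map_value)
```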
-
-
class
baserec.base.evaluation.metrics.MRR¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
Mean Reciprocal Rank, defined as the mean of the Reciprocal Rank over all users
-
add_recommendations(is_relevant)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
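The reciprocal rank of a list is 1 divided by the position of the first relevant item, or 0 when no item is relevant. A minimal sketch with hypothetical function names:

```python
def reciprocal_rank_sketch(is_relevant):
    for rank, relevant in enumerate(is_relevant, start=1):
        if relevant:
            return 1.0 / rank
    # No relevant item in the list.
    return 0.0


def mean_reciprocal_rank_sketch(lists):
    return sum(reciprocal_rank_sketch(l) for l in lists) / len(lists)


mrr = mean_reciprocal_rank_sketch([
    [False, False, True],  # first hit at rank 3
    [True, False],         # first hit at rank 1
    [False, False],        # no hit
])
print(mrr)
```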
-
-
class
baserec.base.evaluation.metrics.Novelty(URM_train)¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
Novelty measures how “novel” a recommendation is in terms of how popular the item was in the train set.
Under this definition, the novelty of a cold item (i.e. one with no interactions in the train set) is undefined. In this implementation cold items are ignored and their contribution to the novelty is 0.
A recommender with high novelty is able to recommend long-tail (i.e. unpopular) items as well.
Novelty is computed as the mean self-information (Zhou 2010).
-
add_recommendations(recommended_items_ids)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
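Mean self-information can be sketched as follows, with cold items contributing 0 as described above. The function name, the dictionary of item counts and the choice to average over the length of the recommendation list are assumptions; the library may normalize differently.

```python
import math


def novelty_sketch(recommended_items, item_popularity, n_interactions):
    """item_popularity: number of train interactions per item id."""
    total = 0.0
    for item in recommended_items:
        count = item_popularity.get(item, 0)
        if count > 0:
            # Self-information of the item: rarer items carry more bits.
            total += -math.log2(count / n_interactions)
        # Cold items (count == 0) are skipped, so they contribute 0.
    return total / len(recommended_items)


item_popularity = {"a": 4, "b": 1}  # made-up train counts
value = novelty_sketch(["a", "b", "cold"], item_popularity, n_interactions=8)
print(value)
```

Here item "a" contributes 1 bit, item "b" contributes 3 bits and the cold item contributes nothing.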
-
-
class
baserec.base.evaluation.metrics.RMSE(URM_all)¶ Bases:
baserec.base.evaluation.metrics._Metrics_Object
Root Mean Squared Error
-
add_recommendations(all_items_predicted_ratings, relevant_items, relevant_items_rating)¶
-
get_metric_value()¶
-
merge_with_other(other_metric_object)¶
-
-
class
baserec.base.evaluation.metrics.ShannonEntropy(n_items, ignore_items)¶ Bases:
baserec.base.evaluation.metrics._Global_Item_Distribution_Counter
Shannon entropy is a well-known metric that measures the amount of information in a string of data. Here it is applied to the global number of times each item has been recommended.
It has a lower bound and can reach values over 12.0 for random recommenders. A high entropy means that the distribution is random uniform across all users.
Note that while a random uniform distribution (hence all items with SIMILAR number of occurrences) will be highly diverse and have high entropy, a perfectly uniform distribution (hence all items with EXACTLY IDENTICAL number of occurrences) will have 0.0 entropy while being the most diverse possible.
-
get_metric_value()¶
-
-
class
baserec.base.evaluation.metrics.TestAUC(methodName='runTest')¶ Bases:
unittest.case.TestCase
-
runTest()¶
-
-
class
baserec.base.evaluation.metrics.TestNDCG(methodName='runTest')¶ Bases:
unittest.case.TestCase
-
runTest()¶
-
-
class
baserec.base.evaluation.metrics.TestPrecision(methodName='runTest')¶ Bases:
unittest.case.TestCase
-
runTest()¶
-
-
class
baserec.base.evaluation.metrics.TestRR(methodName='runTest')¶ Bases:
unittest.case.TestCase
-
runTest()¶
-
-
class
baserec.base.evaluation.metrics.TestRecall(methodName='runTest')¶ Bases:
unittest.case.TestCase
-
runTest()¶
-
-
baserec.base.evaluation.metrics.arhr(is_relevant)¶
-
baserec.base.evaluation.metrics.average_precision(is_relevant, pos_items)¶
-
baserec.base.evaluation.metrics.dcg(scores)¶
-
baserec.base.evaluation.metrics.ndcg(ranked_list, pos_items, relevance=None, at=None)¶
-
baserec.base.evaluation.metrics.pp_metrics(metric_names, metric_values, metric_at)¶
Pretty-prints metric values.
-
baserec.base.evaluation.metrics.precision(is_relevant)¶
-
baserec.base.evaluation.metrics.precision_recall_min_denominator(is_relevant, n_test_items)¶
-
baserec.base.evaluation.metrics.recall(is_relevant, pos_items)¶
-
baserec.base.evaluation.metrics.roc_auc(is_relevant)¶
-
baserec.base.evaluation.metrics.rr(is_relevant)¶
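The module-level accuracy functions above are listed without descriptions. As an illustration of what such functions typically compute, here is a sketch of precision, recall and a precision/recall variant with a minimum denominator. The names carry a _sketch suffix because they are hypothetical; the count arguments and the zero-division handling are assumptions and may not match the library.

```python
def precision_sketch(is_relevant):
    # Fraction of the recommended items that are relevant.
    return sum(is_relevant) / len(is_relevant) if is_relevant else 0.0


def recall_sketch(is_relevant, n_pos_items):
    # Fraction of the user's relevant items that were recommended.
    return sum(is_relevant) / n_pos_items if n_pos_items else 0.0


def precision_recall_min_den_sketch(is_relevant, n_test_items):
    # Divide the hits by the smaller of list length and number of test items.
    denominator = min(n_test_items, len(is_relevant))
    return sum(is_relevant) / denominator if denominator else 0.0


is_relevant = [True, False, True, False]
print(precision_sketch(is_relevant))
print(recall_sketch(is_relevant, n_pos_items=10))
print(precision_recall_min_den_sketch(is_relevant, n_test_items=3))
```

In this example the user has 10 relevant items but only 4 recommendations, so recall is capped well below 1 regardless of ranking quality, which is the situation a minimum-denominator variant is meant to account for.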
baserec.base.evaluation.metrics_test module¶
@author: Maurizio Ferrari Dacrema & Ceshine Lee