Best Practices on Recommendation Systems

What's New (February 4, 2021)

We have a new release, Recommenders 2021.2!

It comes with many bug fixes, optimizations, and three new algorithms: GeoIMC, Standard VAE, and Multinomial VAE. We also added tools to facilitate the use of the Microsoft News dataset (MIND). In addition, we published our KDD 2020 tutorial, in which we built a recommender of COVID papers using the Microsoft Academic Graph.

We also changed the default branch from master to main. Now when you download the repo, you will get the main branch.

See past announcements in NEWS.md.

Introduction

This repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:

  • Prepare Data: Preparing and loading data for each recommender algorithm
  • Model: Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (ALS) or eXtreme Deep Factorization Machines (xDeepFM)
  • Evaluate: Evaluating algorithms with offline metrics
  • Model Select and Optimize: Tuning and optimizing hyperparameters for recommender models
  • Operationalize: Operationalizing models in a production environment on Azure

Several utilities are provided in reco_utils to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are included for self-study and customization in your own applications. See the reco_utils documentation.
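
For example, here is a minimal sketch of loading and splitting MovieLens with these utilities (module paths and signatures as in the reco_utils documentation; verify them against your installed version):

# Minimal sketch: load MovieLens and create a train/test split with reco_utils.
from reco_utils.dataset.movielens import load_pandas_df
from reco_utils.dataset.python_splitters import python_random_split

# Load MovieLens 100k as a pandas DataFrame with the repo's default column names.
df = load_pandas_df(size="100k", header=["userID", "itemID", "rating", "timestamp"])

# Split into training and test sets with a 75/25 ratio.
train, test = python_random_split(df, ratio=0.75)
print(len(train), len(test))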

For a more detailed overview of the repository, please see the documents on the wiki page.

Getting Started

Please see the setup guide for more details on setting up your machine locally, on a data science virtual machine (DSVM) or on Azure Databricks.

To set up on your local machine:

  1. Install Anaconda with Python >= 3.6. Miniconda is a quick way to get started.

  2. Clone the repository

git clone https://github.com/Microsoft/Recommenders
  3. Run the generate conda file script to create a conda environment (this creates a basic Python environment; see SETUP.md for PySpark and GPU environment setup):
cd Recommenders
python tools/generate_conda_file.py
conda env create -f reco_base.yaml
  4. Activate the conda environment and register it with Jupyter:
conda activate reco_base
python -m ipykernel install --user --name reco_base --display-name "Python (reco)"
  5. Start the Jupyter notebook server:
jupyter notebook
  6. Run the SAR Python CPU MovieLens notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".

NOTE - The Alternating Least Squares (ALS) notebooks require a PySpark environment to run. Please follow the steps in the setup guide to run these notebooks in a PySpark environment. For the deep learning algorithms, it is recommended to use a GPU machine.
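
For reference, the ALS implementation used in those notebooks is Spark MLlib's; here is a self-contained sketch with toy data (column names are illustrative):

# Sketch: training Spark MLlib's ALS on a toy DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-example").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0)],
    ["userID", "itemID", "rating"],
)
als = ALS(
    rank=10,
    maxIter=15,
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
)
model = als.fit(ratings)
model.transform(ratings).show()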

Algorithms

The table below lists the recommender algorithms currently available in the repository. Notebooks are linked under the Environment column when different implementations are available.

Algorithm | Environment | Type | Description
Alternating Least Squares (ALS) | PySpark | Collaborative Filtering | Matrix factorization algorithm for explicit or implicit feedback in large datasets, optimized by Spark MLlib for scalability and distributed computing
Attentive Asynchronous Singular Value Decomposition (A2SVD)* | Python CPU / Python GPU | Collaborative Filtering | Sequential-based algorithm that aims to capture both long- and short-term user preferences using an attention mechanism
Cornac/Bayesian Personalized Ranking (BPR) | Python CPU | Collaborative Filtering | Matrix factorization algorithm for predicting item ranking with implicit feedback
Convolutional Sequence Embedding Recommendation (Caser) | Python CPU / Python GPU | Collaborative Filtering | Convolution-based algorithm that aims to capture both users' general preferences and sequential patterns
Deep Knowledge-Aware Network (DKN)* | Python CPU / Python GPU | Content-Based Filtering | Deep learning algorithm incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations
Extreme Deep Factorization Machine (xDeepFM)* | Python CPU / Python GPU | Hybrid | Deep learning based algorithm for implicit and explicit feedback with user/item features
FastAI Embedding Dot Bias (FAST) | Python CPU / Python GPU | Collaborative Filtering | General purpose algorithm with embeddings and biases for users and items
LightFM/Hybrid Matrix Factorization | Python CPU | Hybrid | Hybrid matrix factorization algorithm for both implicit and explicit feedback
LightGBM/Gradient Boosting Tree* | Python CPU / PySpark | Content-Based Filtering | Gradient boosting tree algorithm for fast training and low memory usage in content-based problems
LightGCN | Python CPU / Python GPU | Collaborative Filtering | Deep learning algorithm that simplifies the design of GCNs for predicting implicit feedback
GeoIMC | Python CPU | Hybrid | Matrix completion algorithm that takes user and item features into account, using Riemannian conjugate gradient optimization and a geometric approach
GRU4Rec | Python CPU / Python GPU | Collaborative Filtering | Sequential-based algorithm that aims to capture both long- and short-term user preferences using recurrent neural networks
Multinomial VAE | Python CPU / Python GPU | Collaborative Filtering | Generative model for predicting user/item interactions
Neural Recommendation with Long- and Short-term User Representations (LSTUR)* | Python CPU / Python GPU | Content-Based Filtering | Neural recommendation algorithm with long- and short-term user interest modeling
Neural Recommendation with Attentive Multi-View Learning (NAML)* | Python CPU / Python GPU | Content-Based Filtering | Neural recommendation algorithm with attentive multi-view learning
Neural Collaborative Filtering (NCF) | Python CPU / Python GPU | Collaborative Filtering | Deep learning algorithm with enhanced performance for implicit feedback
Neural Recommendation with Personalized Attention (NPA)* | Python CPU / Python GPU | Content-Based Filtering | Neural recommendation algorithm with a personalized attention network
Neural Recommendation with Multi-Head Self-Attention (NRMS)* | Python CPU / Python GPU | Content-Based Filtering | Neural recommendation algorithm with multi-head self-attention
Next Item Recommendation (NextItNet) | Python CPU / Python GPU | Collaborative Filtering | Algorithm based on dilated convolutions and a residual network that aims to capture sequential patterns
Restricted Boltzmann Machines (RBM) | Python CPU / Python GPU | Collaborative Filtering | Neural-network-based algorithm for learning the underlying probability distribution of explicit or implicit feedback
Riemannian Low-rank Matrix Completion (RLRMC)* | Python CPU | Collaborative Filtering | Matrix factorization algorithm using Riemannian conjugate gradient optimization with small memory consumption
Simple Algorithm for Recommendation (SAR)* | Python CPU | Collaborative Filtering | Similarity-based algorithm for implicit feedback datasets
Short-term and Long-term preference Integrated Recommender (SLi-Rec)* | Python CPU / Python GPU | Collaborative Filtering | Sequential-based algorithm that aims to capture both long- and short-term user preferences using an attention mechanism, a time-aware controller and a content-aware controller
Standard VAE | Python CPU / Python GPU | Collaborative Filtering | Generative model for predicting user/item interactions
Surprise/Singular Value Decomposition (SVD) | Python CPU | Collaborative Filtering | Matrix factorization algorithm for predicting explicit rating feedback in datasets that are not very large
Term Frequency - Inverse Document Frequency (TF-IDF) | Python CPU | Content-Based Filtering | Simple similarity-based algorithm for content-based recommendations with text datasets
Vowpal Wabbit (VW)* | Python CPU (online training) | Content-Based Filtering | Fast online learning algorithms, great for scenarios where user features / context are constantly changing
Wide and Deep | Python CPU / Python GPU | Hybrid | Deep learning algorithm that can memorize feature interactions and generalize user features
xLearn/Factorization Machine (FM) & Field-Aware FM (FFM) | Python CPU | Content-Based Filtering | Quick and memory-efficient algorithm to predict labels with user/item features

NOTE: * indicates algorithms invented/contributed by Microsoft.

Independent or incubating algorithms and utilities are candidates for the contrib folder. This folder houses contributions that may not fit easily into the core repository, or that need time to refactor and mature the code and add the necessary tests.

Algorithm | Environment | Type | Description
SARplus* | PySpark | Collaborative Filtering | Optimized implementation of SAR for Spark

Preliminary Comparison

We provide a benchmark notebook to illustrate how different algorithms can be evaluated and compared. In this notebook, the MovieLens dataset is split into training/test sets at a 75/25 ratio using a stratified split, and a recommendation model is trained using each of the collaborative filtering algorithms below. We utilize the empirical parameter values reported in the literature here. For ranking metrics we use k=10 (top 10 recommended items). We run the comparison on a Standard NC6s_v2 Azure DSVM (6 vCPUs, 112 GB memory and 1 P100 GPU). Spark ALS is run in local standalone mode. The table below shows the results on MovieLens 100k, running the algorithms for 15 epochs.
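
As an illustration, ranking metrics at k=10 can be computed with the evaluation utilities along these lines (a sketch; test and top_k are placeholder DataFrames of ground-truth ratings and model predictions, and signatures should be verified against your installed version):

# Sketch: ranking metrics at k=10 with reco_utils evaluation utilities.
from reco_utils.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k,
)

cols = dict(col_user="userID", col_item="itemID",
            col_rating="rating", col_prediction="prediction")
print("MAP@10:", map_at_k(test, top_k, k=10, **cols))
print("nDCG@10:", ndcg_at_k(test, top_k, k=10, **cols))
print("Precision@10:", precision_at_k(test, top_k, k=10, **cols))
print("Recall@10:", recall_at_k(test, top_k, k=10, **cols))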

Algo | MAP | nDCG@k | Precision@k | Recall@k | RMSE | MAE | R2 | Explained Variance
ALS | 0.004732 | 0.044239 | 0.048462 | 0.017796 | 0.965038 | 0.753001 | 0.255647 | 0.251648
BPR | 0.105365 | 0.389948 | 0.349841 | 0.181807 | N/A | N/A | N/A | N/A
FastAI | 0.025503 | 0.147866 | 0.130329 | 0.053824 | 0.943084 | 0.744337 | 0.285308 | 0.287671
LightGCN | 0.088526 | 0.419846 | 0.379626 | 0.144336 | N/A | N/A | N/A | N/A
NCF | 0.107720 | 0.396118 | 0.347296 | 0.180775 | N/A | N/A | N/A | N/A
SAR | 0.110591 | 0.382461 | 0.330753 | 0.176385 | 1.253805 | 1.048484 | -0.569363 | 0.030474
SVD | 0.012873 | 0.095930 | 0.091198 | 0.032783 | 0.938681 | 0.742690 | 0.291967 | 0.291971

Contributing

This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Build Status

These tests are the nightly builds, which run the smoke and integration tests. main is our principal branch and staging is our development branch. We use pytest for testing the Python utilities in reco_utils and papermill for the notebooks. For more information about the testing pipelines, please see the test documentation.
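
For example, papermill executes a notebook programmatically like this (the path and parameters below are illustrative):

# Sketch: executing a parameterized notebook with papermill, as the tests do.
import papermill as pm

pm.execute_notebook(
    "examples/00_quick_start/sar_movielens.ipynb",  # input notebook (illustrative)
    "output.ipynb",                                 # executed copy with outputs
    kernel_name="python3",
    parameters=dict(MOVIELENS_DATA_SIZE="100k", TOP_K=10),
)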

DSVM Build Status

The following tests run on a Windows and Linux DSVM daily. These machines run 24/7.

Build Type | Branch | Status | Branch | Status
Linux CPU | main | Build Status | staging | Build Status
Linux GPU | main | Build Status | staging | Build Status
Linux Spark | main | Build Status | staging | Build Status

Reference papers

  • A. Argyriou, M. González-Fierro, and L. Zhang, "Microsoft Recommenders: Best Practices for Production-Ready Recommendation Systems", WWW 2020: International World Wide Web Conference, Taipei, 2020. Available online: https://dl.acm.org/doi/abs/10.1145/3366424.3382692
  • L. Zhang, T. Wu, X. Xie, A. Argyriou, M. González-Fierro and J. Lian, "Building Production-Ready Recommendation System at Scale", ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2019 (KDD 2019), 2019.
  • S. Graham, J.K. Min, T. Wu, "Microsoft recommenders: tools to accelerate developing recommender systems", RecSys '19: Proceedings of the 13th ACM Conference on Recommender Systems, 2019. Available online: https://dl.acm.org/doi/10.1145/3298689.3346967
Comments
  • Wikidata

    Description

    The final objective is to use Wikidata as a new knowledge graph for recommendation algorithms, and to extract entity descriptions in order to use new datasets (like MovieLens) with DKN. This is the first step in that direction. I have implemented:

    New utility functions to run specific queries against Wikidata:

    • Query the list of related entities from a string representing the name of an entity. The goal is to be able to create a knowledge graph from the linked entities in Wikidata
    • Query an entity's description from a string representing the name of the entity

    To test the new functions I have added a new notebook. The first section consists of creating a knowledge graph from the linked entities in Wikidata and visualising the resulting KG. The second part tests enriching an entity name with its description and list of related entities; the goal is to use this enrichment for new datasets (like MovieLens) with DKN.
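
    For context, this is the kind of query involved; a generic sketch against Wikidata's public SPARQL endpoint, not the exact utilities added in this PR (the helper name and query shape are illustrative):

    # Generic sketch: fetch entities linked to a Wikidata entity ID via SPARQL.
    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    def query_linked_entities(entity_id, limit=20):  # illustrative helper, not the PR's API
        query = """
        SELECT ?prop ?linked ?linkedLabel WHERE {
          wd:%s ?prop ?linked .
          ?linked rdfs:label ?linkedLabel .
          FILTER(LANG(?linkedLabel) = "en")
        } LIMIT %d
        """ % (entity_id, limit)
        r = requests.get(SPARQL_ENDPOINT, params={"query": query, "format": "json"})
        r.raise_for_status()
        return [(b["prop"]["value"], b["linked"]["value"], b["linkedLabel"]["value"])
                for b in r.json()["results"]["bindings"]]

    print(query_linked_entities("Q11424"))  # Q11424 = "film"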

    Related Issues

    #525

    Checklist:

    • [x] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [x] I have added tests. -> I have added tests in the notebook, should I add more?
    • [x] I have updated the documentation accordingly.
    opened by almudenasanz 29
  • [BUG] Spark smoke test error with Criteo

    Description

    After upgrading LightGBM and the Spark version, we got this error in the nightly smoke tests; however, we had been running the same code for a long time without this error. It looks like a performance degradation.

    tests/smoke/examples/test_notebooks_pyspark.py .RRRRRF
    
    =================================== FAILURES ===================================
    _____________________ test_mmlspark_lightgbm_criteo_smoke ______________________
    
    notebooks = {'als_deep_dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'a..._dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb', ...}
    output_notebook = 'output.ipynb', kernel_name = 'python3'
    
        @pytest.mark.flaky(reruns=5, reruns_delay=2)
        @pytest.mark.smoke
        @pytest.mark.spark
        @pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
        def test_mmlspark_lightgbm_criteo_smoke(notebooks, output_notebook, kernel_name):
            notebook_path = notebooks["mmlspark_lightgbm_criteo"]
            pm.execute_notebook(
                notebook_path,
                output_notebook,
                kernel_name=kernel_name,
                parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
            )
    
            results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")[
                "data"
            ]
    >       assert results["auc"] == pytest.approx(0.68895, rel=TOL, abs=ABS_TOL)
    E       assert 0.6292474883613918 == 0.68895 ± 5.0e-02
    E        +  where 0.68895 ± 5.0e-02 = <function approx at 0x7f46b6e30840>(0.68895, rel=0.05, abs=0.05)
    E        +    where <function approx at 0x7f46b6e30840> = pytest.approx
    

    In which platform does it happen?

    How do we replicate the issue?

    see details: https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56132&view=logs&j=80b1c078-4399-5286-f869-6bc90f734ab9&t=5e8b8b4f-32ea-5957-d349-aae815b05487

    Expected behavior (i.e. solution)

    Other Comments

    This error is so weird. Did LightGBM from SynapseML change somehow? FYI @anargyri @simonzhaoms

    bug 
    opened by miguelgfierro 27
  • Docker Support

    Description

    This PR initializes Docker support for the PySpark environment. It is intended for discussion with the team, to brainstorm and optimize Docker support for the repo.

    NOTE

    • I did not use the conda yaml file in the repo to build the conda env, because the base image from https://github.com/jupyter/docker-stacks handles the Jupyter kernel separately and has already installed many of the packages that exist in our yaml file.
    • To keep the image lightweight, the conda/pip packages duplicated between the base image and our yaml file are removed.

    TODO

    • Create pre-built image and publish in Docker hub
    • Test running the Docker container - I see this as a good example. Maybe we want to adopt it?
    • Finish the remaining two of the three Docker images, i.e., CPU and GPU

    HOW-TO: A sample image has been created and published on my own Docker Hub account (we can and should create one for the team later on). In a Linux terminal or Windows PowerShell (assuming Docker is pre-installed on the machine):

    docker pull yueguoguo/reco_pyspark:latest
    docker run --rm -p 8888:8888 yueguoguo/reco_pyspark
    

    Open a browser and go to localhost:8888 with the token generated in the above run of the image.

    UPDATE 2019-04-08

    • The pre-built image refers to a branch in the repo that contains only the notebooks executable in that environment. For example, in the PySpark image the deep learning notebooks, which are supposed to run in a GPU environment, are removed, because the Python packages for those notebooks are not installed, to keep the image lightweight.

    2019-06-21

    • "one to bind all". The same Dockerfile can be used for building an image with different environment, i.e., CPU, GPU, and Spark. This can be done by using the specific build args, i.e., cpu, gpu, and pyspark, respectively
    • SETUP.md is updated accordingly
    • Master branch of the repo will be cloned

    Related Issues

    Discussed in #687

    Checklist:

    • [x] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [x] I have updated the documentation accordingly.
    opened by yueguoguo 27
  • Fix SAR normalization and add accuracy evaluation metrics

    Description

    • The normalization method in the SAR algorithm does not seem to be correct: it is currently implemented as a division of the computed scores by the item similarity matrix for each user for whom we have ratings (unary affinity). If we actually use this normalization technique when evaluating SAR, we get extremely bad relevance and ranking metrics. Furthermore, this method gets rid of outliers and skews the relevance and ordinality of the generated recommendations. This shouldn't be the case: normalizing the scores to the original rating scale should yield identical metrics.
    • The above fix allows us to correctly evaluate accuracy measures like RMSE, MAE, and log loss. This PR also adds that evaluation to the sar_movielens.ipynb notebook.

    With this PR we get the same rank/relevance metrics as the non-normalized version, and the following accuracy:

    Model:	
    Top K:	10
    MAP:	0.110591
    NDCG:	0.382461
    Precision@K:	0.330753
    Recall@K:	0.176385
    RMSE:	3.697559
    MAE:	3.513341
    R2:	-12.648769
    Exp var:	-0.442580
    Logloss:	3.268522
    

    To illustrate the problem with the current normalization, here are the metrics with the (incorrect) normalization technique:

    Model:	
    Top K:	10
    MAP:	0.000045
    NDCG:	0.000736
    Precision@K:	0.000742
    Recall@K:	0.000118
    

    Preview notebook link

    We always need to normalize the scores so that RMSE and MAE are computed on the correct scale.
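
    As a rough illustration of that point, here is a generic min-max rescaling sketch (not the exact code in this PR):

    # Generic sketch: rescale raw scores to the original rating scale (e.g. 1-5)
    # before computing RMSE/MAE.
    import numpy as np

    def rescale_to_rating_scale(scores, rating_min=1.0, rating_max=5.0):
        scores = np.asarray(scores, dtype=float)
        lo, hi = scores.min(), scores.max()
        if hi == lo:  # degenerate case: all scores identical
            return np.full_like(scores, (rating_min + rating_max) / 2)
        return rating_min + (scores - lo) * (rating_max - rating_min) / (hi - lo)

    print(rescale_to_rating_scale([0.1, 0.4, 0.9]))  # -> [1.  2.5 5. ]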

    Related Issues

    Closes https://github.com/microsoft/recommenders/issues/903

    Checklist:

    • [x] I have followed the contribution guidelines and code style for this project.
    • [x] I have added tests covering my contributions.
    • [x] I have updated the documentation accordingly.
    • [x] This PR is being made to staging and not master.
    opened by viktorku 26
  • [FEATURE] Create ADO pipeline for generating pypi package

    Description

    Andreas:

    • [x] Remove the sys path include
    • [x] Change the current pipeline so it installs the package locally, instead of using the path
    • [x] Update setup.py
    • [x] Fix issue with spark tests
    • [x] Fix the issue with GPU version to match TF 1.15
    • [x] Update documentation to reflect the new installation process
    • [x] Review docs evaluation and datasets
    • [x] test if the wheel package works on Databricks

    Miguel:

    • [x] Add a new yaml file that when there is a new tag, builds a package with bdist, installs it, executes all tests (unit, smoke and integration)
    • [x] Create the github release draft
    • [x] Upload artifacts to the github release draft: wheel and compressed code
    • [x] Publish the wheel to the package limbo only when we are executing a release
    • [x] Add fix to the spark tests in the pipeline
    • [x] BUG: remove the wildcard in the installation and use the full name programmatically see comment
    • [x] Check why we are installing all the deps in CPU, instead of only the CPU ones see run
    • [x] Make sure we are using the package when running the tests, see comment here
    • [x] Fix issue with xlearn, it was using gpu deps that were not needed
    • [x] See if we can simplify the code when forcing exit on error by removing exit -1 on each line and add other instructions. See comment
    • [x] if smoke tests fail, we don't continue with the integration tests (nice to have feature). Working now
    • [x] Review docs common, reco, tuning:
      • [x] common
      • [x] recommender
      • [x] tuning
    • [x] Do a dry run with all the tests (it will take >8h) run worked on 3/6/2021
    • [x] Fix deeprec unit tests
    • [x] Analyze flaky tests and see if backoff lib can help (nice to have feature)
    • [x] Create a tag and check that all tests pass (it will take >8h)
    • [x] Check why automatic commit gives an error and manual trigger does not. See comment here
    • [x] test if the wheel package works on Synapse -> if we upload the wheel to Synapse pool, it installs the core deps
    • [x] Check if we can install extra deps (like spark or GPU) if we do pip install in the pool runtime
    • [ ] Reduce the time of the GPU smoke tests see details
    • [ ] Automatically add the tag name to the draft release (nice to have feature)

    Yan

    • [x] Update the documentation to make sure it reflects the latest code changes, see issue https://github.com/microsoft/recommenders/issues/942
    • [x] Update docs/README.md
    • [x] automatically build the documentation on 3 environments: latest (main branch), staging and stable (latest tag) using https://readthedocs.org/projects/microsoft-recommenders/

    Expected behavior with the suggested feature

    Other Comments

    enhancement 
    opened by miguelgfierro 24
  • Add Cornac BPR deep dive notebook

    Description

    Add Cornac Bayesian Personalized Ranking (BPR) deep dive notebook

    Related Issues

    #931

    Checklist:

    • [x] I have followed the contribution guidelines and code style for this project.
    • [x] I have added tests covering my contributions.
    • [ ] I have updated the documentation accordingly.
    opened by tqtg 19
  • [ASK] In the NCF deep dive and ncf_movielens notebooks, I used my own dataset instead of MovieLens; it has userID, itemID and ratings (I used counts as the rating, like implicit data). The notebook throws the following error. Could someone help me out with this problem?

    Description

    Other Comments

    The dataset looks like this:

      | rating | userID | itemID
    0 | 12 | 3468 | 3644
    1 | 3 | 3816 | 3959
    2 | 1 | 2758 | 2650
    3 | 1 | 5056 | 1593
    4 | 30 | 3029 | 192

    When I run this cell in the notebook I get the following error:

    data = NCFDataset(train=train, test=test, seed=SEED)

    Error:


    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>
          1 SEED = 10
    ----> 2 data = NCFDataset(train=train, test=test, seed=SEED)

    ~/Recommenders/reco_utils/recommender/ncf/dataset.py in __init__(self, train, test, n_neg, n_neg_test, col_user, col_item, col_rating, binary, seed)
         59     # initialize negative sampling for training and test data
         60     self._init_train_data()
    ---> 61     self._init_test_data()
         62     # set random seed
         63     random.seed(seed)

    ~/Recommenders/reco_utils/recommender/ncf/dataset.py in _init_test_data(self)
        183     test_interact_status = pd.merge(test_interact_status, self.interact_status, on=self.col_user, how="left")
        184
    --> 185     test_interact_status[self.col_item + "_negative"] = test_interact_status.apply(lambda row: row[self.col_item + "_negative"] - row[self.col_item + "_interacted_test"], axis=1)
        186     test_ratings = pd.merge(self.test, test_interact_status[[self.col_user, self.col_item + "_negative"]], on=self.col_user, how="left")
        187

    ~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
       6485     args=args,
       6486     kwds=kwds)
    -> 6487     return op.get_result()

    ~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/apply.py in get_result(self)
        149     return self.apply_raw()
        150
    --> 151     return self.apply_standard()

    ~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/apply.py in apply_standard(self)
        255
        256     # compute the result using the series generator
    --> 257     self.apply_series_generator()

    ~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/apply.py in apply_series_generator(self)
        284     try:
        285         for i, v in enumerate(series_gen):
    --> 286             results[i] = self.f(v)
        287         keys.append(v.name)
        288     except Exception as e:

    ~/Recommenders/reco_utils/recommender/ncf/dataset.py in <lambda>(row)
    --> 185     test_interact_status[self.col_item + "_negative"] = test_interact_status.apply(lambda row: row[self.col_item + "_negative"] - row[self.col_item + "_interacted_test"], axis=1)

    TypeError: ("unsupported operand type(s) for -: 'float' and 'set'", 'occurred at index 854')

    ...

    help wanted 
    opened by karthikraja95 19
  • [DISCUSSION] General folder structure for reco, cv, and forecasting repos

    I'm setting this discussion public in case any of our users or customers want to provide feedback.

    Context

    We are building repos around computer vision and time series forecasting. We would like to homogenise the structure between them and the recommenders repo. The CV repo is still starting and the forecast repo has been running for some time internally and it is focused on benchmarks.

    The idea is to have a common structure (and user experience) between the 3 repos. Trying to have the best of each: nice examples and utilities from recommenders, nice benchmarks from forecasting repo and support for CV, as well as the other solutions in reco and forecast.

    Question

    What will be the optimal structure that will help our users and us to build better solutions in recommendations, CV and forecasting?

    Please provide answers in detail ways, example: e1) I would take the recommenders structure (notebooks, reco_utils, tests) and rename the folders to X, Y, Z... e2) I would take the recommenders structure (notebooks, reco_utils, tests) and add a folder for benchmarks... e3) ...

    needs discussion style improvement 
    opened by miguelgfierro 18
  • Staging to main (SARplus, SASrec, NCF, RBM etc.)

    Description

    Merge recent changes into main.

    Related Issues

    Checklist:

    • [x] I have followed the contribution guidelines and code style for this project.
    • [x] I have added tests covering my contributions.
    • [x] I have updated the documentation accordingly.
    • [ ] This PR is being made to staging branch and not to main branch.
    opened by anargyri 16
  • Unable to create an appropriately versioned cluster per instructions in reference architecture and als_movie_o16n

    What is affected by this bug?

    • Creating an appropriate cluster
    • Running a notebook on that cluster
    • Unit tests.

    In which platform does it happen?

    • Azure Databricks.

    How do we replicate the issue?

    1. Create a databricks workspace
    2. Navigate to Clusters
    3. Click [+Create Cluster] In the Databricks Runtime Version, there is no longer an option for DB 4.1, Spark 2.3.0. It was deprecated on 2019-01-17. See deprecation schedule here.

    Expected behavior (i.e. solution)

    Workarounds:

    • It is still possible to create a cluster by cloning a cluster of the recommended version.
    • It is still possible to create a cluster through the API. Happy to do PR with appropriate json for creating with databricks CLI or directly through the REST API.

    Other Comments

    Have we tested whether the cosmosdb connector jar works with more current versions of ADB and spark?

    documentation 
    opened by jreynolds01 16
  • [FEATURE] Set up test machine linux

    Description

    from https://github.com/microsoft/recommenders/tree/master/tests

    Make sure all tests pass:

    Unit:

    • [x] pytest tests/unit -m "not notebooks and not spark and not gpu" --durations 0
    • [x] pytest tests/unit -m "notebooks and not spark and not gpu"
    • [x] pytest tests/unit -m "not notebooks and not spark and gpu"
    • [x] pytest tests/unit -m "notebooks and not spark and gpu"
    • [x] pytest tests/unit -m "not notebooks and spark and not gpu"
    • [x] pytest tests/unit -m "notebooks and spark and not gpu"

    Smoke:

    • [x] pytest tests/smoke -m "smoke and not spark and not gpu" --durations 0
    • [x] pytest tests/smoke -m "smoke and not spark and gpu" --durations 0
    • [x] pytest tests/smoke -m "smoke and spark and not gpu" --durations 0

    Integration:

    • [x] pytest tests/integration -m "integration and not spark and not gpu" --durations 0
    • [x] pytest tests/integration -m "integration and not spark and gpu" --durations 0
    • [x] pytest tests/integration -m "integration and spark and not gpu" --durations 0

    Expected behavior with the suggested feature

    Other Comments

    enhancement 
    opened by miguelgfierro 15
  • [ASK]

    I built a machine learning recommendation model with Wide and Deep based on 00_quick_start/wide_deep_movielens.ipynb, and when I save the model I get three files [saved_model.pb, variables.data-00000-of-00001, variables.index]. I can then load this model with

    self.model = tf.saved_model.load(path_to_saved_model_and_variables, tags="serve")

    And I can make predictions with

    self.model.signatures["predict"]
    

    But is it also possible to train this saved model with new data?

    help wanted 
    opened by JeroenMBooij 0
  • [ASK] SAR with timedecay_formula= False won't work

    Description

    How should I run SAR if I don't want to apply the time decay formula in SAR?

    My model is constructed as:

    model = SAR(
        col_user="userID",
        col_item="itemID",
        col_rating="rating",
        col_timestamp="timestamp",
        similarity_type=similarity_type,
        timedecay_formula=False,
    )

    and when fitting the model, it shows:

        237 if df[select_columns].duplicated().any():
    --> 238     raise ValueError("There should not be duplicates in the dataframe")
        239
        240 # generate continuous indices if this hasn't been done

    ValueError: There should not be duplicates in the dataframe

    Please advise, thanks!
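
    For reference, without the time decay aggregation SAR expects at most one interaction per (user, item) pair, so a common workaround is to aggregate duplicates before fitting; a minimal sketch with the column names from the question (keeping the most recent rating per pair; summing is another option):

    # Minimal sketch: collapse duplicate (userID, itemID) rows before model.fit().
    import pandas as pd

    df = pd.DataFrame({
        "userID":    [1, 1, 2],
        "itemID":    [10, 10, 10],
        "rating":    [3.0, 5.0, 4.0],
        "timestamp": [100, 200, 150],
    })

    dedup = (
        df.sort_values("timestamp")
          .drop_duplicates(subset=["userID", "itemID"], keep="last")
          .reset_index(drop=True)
    )
    print(dedup)  # one row per (userID, itemID); model.fit(dedup) should now pass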

    Other Comments

    help wanted 
    opened by jamie613 1
  • AzureML tests: Durations, disable warnings and exit -1

    Description

    Add parameters to pytest: --durations and --disable-warnings. It also adds an exit -1 if there is a failure in the tests.

    Related Issues

    Fix https://github.com/microsoft/recommenders/issues/1857 and https://github.com/microsoft/recommenders/issues/1852

    References

    Checklist:

    • [ ] I have followed the contribution guidelines and code style for this project.
    • [ ] I have added tests covering my contributions.
    • [ ] I have updated the documentation accordingly.
    • [ ] This PR is being made to staging branch and not to main branch.
    opened by miguelgfierro 3
  • [ASK] Error in NCFDataset creation

    Description

    Hello all, I'm trying to use the NCF_deep_dive notebook with my own data, which has the following structure:

      | usr_id | code_id | amt_trx | bestelldatum
    0 | 0 | 35 | 1 | 2022-03-01
    1 | 0 | 2 | 1 | 2022-03-01
    2 | 0 | 18 | 1 | 2022-03-01
    3 | 0 | 9 | 1 | 2022-03-01
    4 | 0 | 0 | 1 | 2022-03-01

    When I try to create the dataset I get the following error:

    data = NCFDataset(
        train_file=train_file,
        test_file=leave_one_out_test_file,
        seed=SEED,
        overwrite_test_file_full=True,
        col_user='usr_id',
        col_item='code_id',
        col_rating='amt_trx',
        binary=False,
    )

    ---------------------------------------------------------------------------
    MissingUserException                      Traceback (most recent call last)
    Cell In [39], line 1
    ----> 1 data = NCFDataset(train_file=train_file,
          2                     test_file=leave_one_out_test_file,
          3                     seed=SEED,
          4                     overwrite_test_file_full=True,
          5                     col_user='usr_id',
          6                     col_item='code_id',
          7                     col_rating='amt_trx',
          8                     binary=False)
    
    File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/dataset.py:376, in Dataset.__init__(self, train_file, test_file, test_file_full, overwrite_test_file_full, n_neg, n_neg_test, col_user, col_item, col_rating, binary, seed, sample_with_replacement, print_warnings)
        374         self.test_file_full = os.path.splitext(self.test_file)[0] + "_full.csv"
        375     if self.overwrite_test_file_full or not os.path.isfile(self.test_file_full):
    --> 376         self._create_test_file()
        377     self.test_full_datafile = DataFile(
        378         filename=self.test_file_full,
        379         col_user=self.col_user,
       (...)
        383         binary=self.binary,
        384     )
        385 # set random seed
    
    File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/dataset.py:417, in Dataset._create_test_file(self)
        415 if user in train_datafile.users:
        416     user_test_data = test_datafile.load_data(user)
    --> 417     user_train_data = train_datafile.load_data(user)
        418     # for leave-one-out evaluation, exclude items seen in both training and test sets
        419     # when sampling negatives
        420     user_positive_item_pool = set(
        421         user_test_data[self.col_item].unique()
        422     ).union(user_train_data[self.col_item].unique())
    
    File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/dataset.py:194, in DataFile.load_data(self, key, by_user)
        192 while (self.line_num == 0) or (self.row[key_col] != key):
        193     if self.end_of_file:
    --> 194         raise MissingUserException("User {} not in file {}".format(key, self.filename))
        195     next(self)
        196 # collect user/test batch data
    
    MissingUserException: User 58422 not in file ./train_new.csv
    

    I made some checks:

    print(train.usr_id.nunique())  --> output: 81062
    print(test.usr_id.nunique())   --> output: 81062
    print(leave.usr_id.nunique())  --> output: 81062

    I also checked by hand and user 58422 is in all the files. The types are also the same: I'm using int64 for usr_id, code_id and amt_trx, like the MovieLens dataset.

    I can't understand the error, could you help me please?

    Update

    If I remove the parameter overwrite_test_file_full, it creates the dataset, but then I can't make predictions because the dataset object didn't create the user2id mapping.

    data = NCFDataset(train_file=train_file,
                        test_file=leave_one_out_test_file,
                        seed=SEED,
                        col_user='usr_id',
                        col_item='code_id',
                        col_rating='amt_trx',
                        print_warnings=True)
    
    model = NCF (
        n_users=data.n_users, 
        n_items=data.n_items,
        model_type="NeuMF",
        n_factors=4,
        layer_sizes=[16,8,4],
        n_epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        learning_rate=1e-3,
        verbose=99,
        seed=SEED
    )
    
    predictions = [[row.usr_id, row.code_id, model.predict(row.usr_id, row.code_id)]
                   for (_, row) in test.iterrows()]
    
    
    predictions = pd.DataFrame(predictions, columns=['usr_id', 'code_id', 'prediction'])
    predictions.head()
    
    AttributeError                            Traceback (most recent call last)
    Cell In [38], line 1
    ----> 1 predictions = [[row.usr_id, row.code_id, model.predict(row.usr_id, row.code_id)]
          2                for (_, row) in test.iterrows()]
          5 predictions = pd.DataFrame(predictions, columns=['usr_id', 'code_id', 'prediction'])
          6 predictions.head()
    
    Cell In [38], line 1, in <listcomp>(.0)
    ----> 1 predictions = [[row.usr_id, row.code_id, model.predict(row.usr_id, row.code_id)]
          2                for (_, row) in test.iterrows()]
          5 predictions = pd.DataFrame(predictions, columns=['usr_id', 'code_id', 'prediction'])
          6 predictions.head()
    
    File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/ncf_singlenode.py:434, in NCF.predict(self, user_input, item_input, is_list)
        431     return list(output.reshape(-1))
        433 else:
    --> 434     output = self._predict(np.array([user_input]), np.array([item_input]))
        435     return float(output.reshape(-1)[0])
    
    File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/ncf_singlenode.py:440, in NCF._predict(self, user_input, item_input)
        437 def _predict(self, user_input, item_input):
        438 
        439     # index converting
    --> 440     user_input = np.array([self.user2id[x] for x in user_input])
        441     item_input = np.array([self.item2id[x] for x in item_input])
        443     # get feed dict
    
    File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/ncf_singlenode.py:440, in <listcomp>(.0)
        437 def _predict(self, user_input, item_input):
        438 
        439     # index converting
    --> 440     user_input = np.array([self.user2id[x] for x in user_input])
        441     item_input = np.array([self.item2id[x] for x in item_input])
        443     # get feed dict
    
    AttributeError: 'NCF' object has no attribute 'user2id'
    
    help wanted 
    opened by mrcmoresi 0
  • [FEATURE] Add duration flag to AzureML tests

    Description

    Add --durations 0 to the tests

    Expected behavior with the suggested feature

    Other Comments

    related to https://github.com/microsoft/recommenders/issues/1843

    enhancement 
    opened by miguelgfierro 0
Releases (1.1.1)
  • 1.1.1(Jul 20, 2022)

    New algorithms or improvements

    • Reduce iterations of W&D to reduce the integration test time in https://github.com/microsoft/recommenders/pull/1698
    • Implementation of most frequent recommendation in https://github.com/microsoft/recommenders/pull/1666
    • Implement time_now for sarplus in #1719 #1721
    • Add a fast failure in SAR+ if the similarity metric is not within the options in https://github.com/microsoft/recommenders/pull/1743
    • SAR item similarity dtype correction in https://github.com/microsoft/recommenders/pull/1751
    • Simplify SAR test data loading functions in https://github.com/microsoft/recommenders/pull/1752
    • Reformat SAR+ SQL queries in https://github.com/microsoft/recommenders/pull/1772
    • Add new item similarity metrics for SAR in https://github.com/microsoft/recommenders/pull/1754

    New utilities or improvements

    • Rewrite get_top_k_items() to improve runtime in https://github.com/microsoft/recommenders/pull/1748
    • Optimized Spark recall_at_k time performance in https://github.com/microsoft/recommenders/pull/1796

    New notebooks or improvements

    • Fix missing import in FastAI notebook https://github.com/microsoft/recommenders/pull/1708
    • Review NCF notebook in #1703 #1712
    • Review LightFM notebook and add test in https://github.com/microsoft/recommenders/pull/1706
    • Review BPR notebook in https://github.com/microsoft/recommenders/pull/1704
    • Review LightGCN notebook in https://github.com/microsoft/recommenders/pull/1714
    • Review DKN notebook in https://github.com/microsoft/recommenders/pull/1722
    • Review SAR notebook #1738 #1768

    Other features

    • Enable distributed tests with AzureML #1696 #1717 #1729 #1733 #1739 #1747 #1732 #1755 #1763 #1771 #1773 #1775 #1787 #1788 #1794
    • Added tests for Python 3.8 and 3.9 in https://github.com/microsoft/recommenders/pull/1756
    • Image of contributors in https://github.com/microsoft/recommenders/pull/1692
    • Update README.md in #1709 #1711 #1767
    • Error in codeowners file in https://github.com/microsoft/recommenders/pull/1699
    • Add test to check if CuDNN is enabled in https://github.com/microsoft/recommenders/pull/1715
    • Update docker image reference to internal registry in https://github.com/microsoft/recommenders/pull/1727
    • Fixed a link error in data_transform.ipynb in https://github.com/microsoft/recommenders/pull/1736
    • Added tests for ranking function get_top_k_items() in https://github.com/microsoft/recommenders/pull/1757
    • Fix memory error in CPU nightly workflow in https://github.com/microsoft/recommenders/pull/1759
    • Update test infrastructure explanation #1776 #1777
    • Added time performance tests in https://github.com/microsoft/recommenders/pull/1765
    • Add path filter to avoid triggering unit tests when we change a markdown in https://github.com/microsoft/recommenders/pull/1791

    Full Changelog: https://github.com/microsoft/recommenders/compare/1.1.0...1.1.1

    recommenders-1.1.1-py3-none-any.whl(331.06 KB)
    recommenders-1.1.1.tar.gz(256.83 KB)
  • 1.1.0(Apr 1, 2022)

    New algorithms or improvements

    • SASRec and SSEPT in Tensorflow 2.x in https://github.com/microsoft/recommenders/pull/1530 #1621 #1678
    • RBM Code Cleanup, model save and other additions in #1599 #1618 #1622
    • Overwrite older test file in NCF deep dive to avoid bug in https://github.com/microsoft/recommenders/pull/1674
    • SAR+ improvement and bug fixes #1636 #1644 #1680 #1671
    • NCF improvement and bug fixes in #1612
    • Remove drop_duplicates() from SAR method fix #1464 in https://github.com/microsoft/recommenders/pull/1588
    • SAR literal fix in https://github.com/microsoft/recommenders/pull/1663

    New utilities or improvements

    • Update lightfm_utils.py in https://github.com/microsoft/recommenders/pull/1624
    • Change formats of user_ids and item_ids arg. in LightFM in https://github.com/microsoft/recommenders/pull/1651
    • Fix randomness issue in spark_stratified_split() in https://github.com/microsoft/recommenders/pull/1654
    • Clarification for jaccard and lift similarity measures in https://github.com/microsoft/recommenders/pull/1668
    • Use numpy divide in explained variance in https://github.com/microsoft/recommenders/pull/1691
    • Change MovieLens URL from HTTP to HTTPS in https://github.com/microsoft/recommenders/pull/1677
    • Remove casting of user and item IDs in Spark evaluation in https://github.com/microsoft/recommenders/pull/1686
    • Persist intermediate data to avoid non-determinism caused by Spark lazy random evaluation in https://github.com/microsoft/recommenders/pull/1676 #1652

    New notebooks or improvements

    • Fix notebook build failure on Spark 3.2 in https://github.com/microsoft/recommenders/pull/1608
    • Remove early stopping round from LightGBM example notebook in https://github.com/microsoft/recommenders/pull/1620

    Other features

    • Enable Python 3.8 and 3.9 in https://github.com/microsoft/recommenders/pull/1626 #1617
    • Upgrade Python from 3.6 to 3.7 in ADO tests pipeline in https://github.com/microsoft/recommenders/pull/1627
    • Increase time out for GPU nightly tests in https://github.com/microsoft/recommenders/pull/1623
    • Lower LightGBM test AUC base value in https://github.com/microsoft/recommenders/pull/1619
    • Change timeouts for tests #1625 #1661 #1684
    • Scenario gaming in https://github.com/microsoft/recommenders/pull/1637
    • Limiting tests: reducing the time of the news recommendation GPU notebooks in https://github.com/microsoft/recommenders/pull/1656
    • Remove pydocumentdb in install_requires in https://github.com/microsoft/recommenders/pull/1629
    • Change and improve dependencies #1630 #1653
    • Fix Spark tuning test in https://github.com/microsoft/recommenders/pull/1635
    • Typos in markdown files and other files #1639 #1589 #1646 #1647 #1688
    • Update Dockerfile in https://github.com/microsoft/recommenders/pull/1645
    • Improve documentation #1648 #1669 #1682 #1690 #1672
    • Codecov Fix in https://github.com/microsoft/recommenders/pull/1665
    • Set Spark env variables in nightly test in https://github.com/microsoft/recommenders/pull/1655 #1659

    Full Changelog: https://github.com/microsoft/recommenders/compare/1.0.0...1.1.0

    recommenders-1.1.0-py3-none-manylinux1_x86_64.whl(327.72 KB)
    recommenders-1.1.0.tar.gz(247.42 KB)
  • 1.0.0(Jan 13, 2022)

    Backwards incompatible changes

    • TensorFlow upgrade to 2.6.1 / 2.7 #1574, #1565, #1540

    New algorithms or improvements

    • Improve algos visibility #1542
    • LightGBM test improvement #1531
    • Fix Surprise and Python 3.7 #1540
    • TF-IDF runtime enhancement changes #1571
    • Add Spark 3.x support for SARplus #1566

    New utilities or improvements

    • Upgrade to Spark v3 #1555, #1549, #1543
    • Move scikit-surprise and pymanopt from setup.py #1602
    • Issue with pymanopt #1606

    New notebooks or improvements

    • Fix bugs in RBM notebooks #1581
    • Remove explicit mapping of ratings to integers from RBM notebooks #1585

    Other features

    • Fix nightly workflows #1576, #1548
    • Stabilize more flaky tests #1558
    • Miscellaneous Pipeline Fixes #1545
    • Optimize Notebook Unit Tests #1538
    • Development status change to production/stable #1579
    • Update dependencies #1569, #1570
    • Fix Databricks installation script #1531
    • Adding codespace deployment #1521
    • Improve GitHub tests #1518, #1578, #1590, #1592
    • Flake8 Fixes #1552, #1550
    • Improvement in documentation #1591, #1598, #1594, #1603
    • Update release pipeline #1596
    recommenders-1.0.0-py3-none-manylinux1_x86_64.whl(311.20 KB)
    recommenders-1.0.0.tar.gz(238.60 KB)
  • 0.7.0(Sep 23, 2021)

    Backwards incompatible changes

    • Renaming of folders #1485, #1478
    • Change of the PyPI package name to recommenders #1477

    New algorithms or improvements

    • Missing import in VAE #1508

    New utilities or improvements

    • retrying import #1487
    • Addition of diversity, novelty, coverage and serendipity metrics #1536, #1535, #1522, #1505, #1491, #1470, #1465

    New notebooks or improvements

    • New notebook showcasing diversity, novelty, coverage, and serendipity metrics in Spark #1488, #1470, #1465

    Other features

    • Enablement of LightGBM version 3 #1527
    • Enablement of all Python 3.7 micro versions #1474
    • Installation in virtualenv and venv #1520, #1476
    • Installation from PyPI in docker container #1509
    • Read the Docs builds #1529, #1528
    • Documentation improvements #1515, #1469, #1462
    • CI pipelines on GitHub workflows (WIP) #1517, #1503, #1499, #1494, #1490
    recommenders-0.7.0-py3-none-manylinux1_x86_64.whl(307.00 KB)
    recommenders-0.7.0.tar.gz(234.73 KB)
  • 0.6.0(Jun 18, 2021)

    New utilities or improvements

    • Fix URL in unit tests #1447
    • Improve documentation #1446 #1440 #1436 #1428 #1426 #1425 #1415
    • Add retry to maybe_downlad function #1427

    New notebooks or improvements

    • Notebook for diversity metrics #1416
    • Update evaluation notebook with new diversity metrics #1416
    • Fix xlearn notebook #1427

    Other features

    • Generate package for PyPi #1445 #1442 #1441 #1429
    • Improve installation process #1455 #1431
    • Fix tests #1452 #1427
    • Generate pipeline for release #1427
    recommenders-0.6.0-py3-none-manylinux1_x86_64.whl(228.49 KB)
    recommenders-0.6.0.tar.gz(175.56 KB)
  • 0.5.0(Apr 30, 2021)

    Repo structure

    • Default branch renamed from master to main #1284 #1278

    New dataset and competition support

    New algorithms or improvements

    • Optimize GPU usage of news recommendation algorithms #1235
    • Optimize surprise utilities #1224
    • GeoIMC algorithm #1204
    • Standard VAE algorithm #1194
    • Multinomial VAE algorithm #1194

    New utilities or improvements

    • Operationalization example for sequential models #1254
    • Fix bug with fastai #1288
    • Fix bug in affinity matrix #1243
    • Fix conflict with MMLSpark version #1230
    • Fix negative feedback sampler #1200

    New notebooks or improvements

    • Update AzureML Designer notebooks #1286 #1253
    • KDD2020 tutorial: paper recommendation with Microsoft Academic Graph #1208
    • Update o16n notebook for real time scoring #1176
    • Reduce verbosity on tensorflow notebooks #1276

    Other features

    • Upgrade papermill and scrapbook for testing #1271 #1270 #1282 #1289
    • Fix tests #1244 #1242 #1226 #1218
    • Fix issue with spark installation #1186
    • Update python version #1202
    • Notice for java dependency #1209
    • Reactivate CICD pipelines #1284
  • 0.4.0(Apr 30, 2021)

    New algorithms or improvements

    • DKN fix https://github.com/microsoft/recommenders/pull/1165
    • GeoIMC https://github.com/microsoft/recommenders/pull/1142
    • LSTUR #1137 #1080
    • NAML #1137 #1080
    • NPA #1137 #1080
    • NRMS #1137 #1080
    • LightGCN #1130 #1123
    • NextItNet #1130 #1126
    • Fix SAR #1128 #1023 #1018 #991
    • LightFM #1096
    • TFIDF recommender #1088
    • A2SVD #1010
    • GRU4Rec #1010
    • Caser #1010
    • SLi-Rec #1010
    • SARplus #955
    • BPR with cornac library #950 #944 #937

    New utilities or improvements

    • MIND dataset https://github.com/microsoft/recommenders/pull/1153
    • Fix Text iterator https://github.com/microsoft/recommenders/pull/1133
    • Fix NNI utils #1131
    • Azure Designer dependencies #1115 #1101 #1095 #1077 #1060
    • Fix tests #1057 #1004 #954 #935 #932

    New notebooks or improvements

    • DKN notebook with MIND dataset https://github.com/microsoft/recommenders/pull/1165 https://github.com/microsoft/recommenders/pull/1137
    • GeoIMC notebook https://github.com/microsoft/recommenders/pull/1142
    • LSTUR notebook #1137 #1080
    • NAML notebook #1137 #1080
    • NPA notebook #1137 #1080
    • NRMS notebook #1137 #1080
    • LightGCN notebook #1130 #1123
    • NextItNet notebook #1130 #1126
    • Implementation of Recommenders into Azure Designer #1115 #1101 #1095 #1060 #1036
    • NCF hyperparameter tuning notebook #1102 #1092
    • LightFM notebook #1096
    • TFIDF recommender notebook #1088
    • Add timer class into notebooks #1063
    • Fix xlearn notebook #1006 #974
    • o16n notebook fix #1003 #969
    • A2SVD notebook #1010
    • GRU4Rec notebook #1010
    • Caser notebook #1010
    • SLi-Rec notebook #1010
    • BPR with cornac notebook #950 #944 #937

    Other features

    • Fix installation on Databricks https://github.com/microsoft/recommenders/pull/1161 #965
    • Fix docker https://github.com/microsoft/recommenders/pull/1146 #1120 #1070 #1058 #1034
    • Fix Azure blob version #1119
    • Pin TensorFlow #1098
    • Code structure refactor #1086
    • Business scenarios and glossary #1086
    • ADO artifact #1069
    • Avoid pandas>1 #1052
    • CICD #1002 #998 #994 #980
  • 0.3.1(Apr 30, 2021)

    New algorithms or improvements

    • Improved SAR performance #914 #922
    • Utils for wikidata knowledge graph #881 #902

    New utilities or improvements

    • Fixed bug in python evaluator #863
    • Updated nni version and utils #856
    • Updated sum check #874
    • Changed url download util to use requests #813

    New notebooks or improvements

    • Optimized spark notebooks #864
    • New notebook on knowledge graph generation with wikidata #881 #902
    • Wide-deep hyperdrive notebook AzureML API update #847

    Other features

    • Added Docker support (Docker file) for all of the three (CPU/GPU/Spark) environment
    • Added setup.py for pip installation #851
    • Added sphinx documentation #859
    • Published documentation on readthedocs #912
    • Fixed spark testing issues #850
    • Added tests with AzureML compute target #848 #846 #839 #823
    • Development of Xamarin app for movies recommendation using Recommenders engine https://github.com/microsoft/recommenders_engine_example_layout
  • 0.3.0(Apr 30, 2021)

    New platform support

    • Windows support with tests #797 #726

    New algorithms or improvements

    • LightGBM #633 #735
    • RLRMC #729
    • Changed seed for GPU algos for reproducibility #785 #748
    • Added benchmark #715
    • Fixed bugs in SAR #697 #619

    New utilities or improvements

    • Python evaluation improvement by memoization #713
    • Improved tests #706
    • New algos for hyperparameter tuning with NNI #687
    • Criteo dataloader #642
    • Wrapper VW #592
    • Added more data formats #605
    • New metrics #580

    New notebooks or improvements

    • SAR remote execution through AzureML #728
    • SAR remote execution of notebook through AzureML #681
    • LightGBM with small criteo on CPU #633
    • LightGBM o16n on Databricks with MMLSpark #735 #714 #682 #680
    • Hyperparameter tuning with NNI on Surprise SVD #687
    • Hyperparameter tuning with Hyperdrive #546

    Other features

    • Fixed bugs in utilities, tests and notebooks
    • New unit, smoke and integration tests for the new algos
  • 0.2.0(Apr 30, 2021)

    New Algorithms or improvements

    • Vowpal Wabbit (VW) https://github.com/Microsoft/Recommenders/pull/452
    • xDeepFM https://github.com/Microsoft/Recommenders/pull/453
    • DKN https://github.com/Microsoft/Recommenders/pull/453
    • NCF https://github.com/Microsoft/Recommenders/pull/392
    • RBM https://github.com/Microsoft/Recommenders/pull/390
    • FastAI Embedding dot Bias https://github.com/Microsoft/Recommenders/pull/411
    • Optimization of SAR

    New utilities or improvements

    • Improved the performance of python splitters https://github.com/Microsoft/Recommenders/pull/517
    • Added GPU utilities
    • Added utilities for hyperparameter tuning

    New Notebooks or improvements

    • Improved o16n notebook with ALS, MovieLens and Databricks https://github.com/Microsoft/Recommenders/pull/475
    • Added a deep dive notebook on VW https://github.com/Microsoft/Recommenders/pull/452
    • Improved notebook for hyperparameter tuning on Spark https://github.com/Microsoft/Recommenders/pull/444
    • New notebook on the FastAI Embedding dot Bias algorithm https://github.com/Microsoft/Recommenders/pull/411
    • New deep dive notebook on NCF https://github.com/Microsoft/Recommenders/pull/392
    • New quick start notebook on RBM https://github.com/Microsoft/Recommenders/pull/390
    • New deep dive notebook on RBM https://github.com/Microsoft/Recommenders/pull/390
    • New quick start notebook on xDeepFM with synthetic data
    • New quick start notebook on DKN with synthetic data
    • New notebook on data transformation https://github.com/Microsoft/Recommenders/pull/384

    Other features

    • Fixed bugs in utilities, tests and notebooks
    • Added an installation script for Databricks https://github.com/Microsoft/Recommenders/pull/457
    • Changed the installer from a Bash script to a Python script https://github.com/Microsoft/Recommenders/pull/512
    • Added a parameter to control the PySpark version in the installer https://github.com/Microsoft/Recommenders/pull/461
    • Optimized tests to run faster https://github.com/Microsoft/Recommenders/pull/486
    • New unit, smoke and integration tests for the new algos
    • Added GPU test pipeline https://github.com/Microsoft/Recommenders/pull/408
    • Improved GitHub metrics tracker https://github.com/Microsoft/Recommenders/pull/400
  • 0.1.1 (Dec 12, 2018)

    New algorithms or improvements

    • Improved SAR single node for top-k recommendations; users can now choose whether the returned top-k items are sorted. The sketch below illustrates why skipping the sort can be faster.
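
    Why the sorted/unsorted choice matters: selecting the k best items without ordering them can use a partial sort, which is cheaper than fully sorting the candidates. A hedged numpy sketch of the distinction (illustrative only, not the SAR implementation):

    ```python
    import numpy as np

    scores = np.random.rand(1_000_000)  # e.g., item scores for one user
    k = 10

    # Unsorted top k: argpartition runs in O(n) and returns the k best
    # indices in arbitrary order -- enough if the caller only needs the set.
    top_k_unsorted = np.argpartition(scores, -k)[-k:]

    # Sorted top k: pay an extra O(k log k) to order those k candidates
    # from best to worst.
    top_k_sorted = top_k_unsorted[np.argsort(scores[top_k_unsorted])[::-1]]
    ```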

    New utilities or improvements

    • Added data-related utility functions, such as MovieLens data download, in Python and PySpark.
    • Added a new data split method (timestamp-based split); see the sketch below.
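
    A minimal sketch of what a timestamp-based split does: order interactions chronologically and cut at a ratio, so the model trains on the past and is tested on the future. Column names and the function signature here are illustrative assumptions, not the repo's splitter API.

    ```python
    import pandas as pd

    def split_by_timestamp(data: pd.DataFrame, ratio: float = 0.75,
                           col_timestamp: str = "timestamp"):
        # Hypothetical splitter: the earliest `ratio` of events form the
        # training set, the most recent events form the test set.
        data = data.sort_values(col_timestamp)
        cut = int(len(data) * ratio)
        return data.iloc[:cut], data.iloc[cut:]

    ratings = pd.DataFrame({
        "userID":    [1, 1, 2, 2, 3, 3],
        "itemID":    [10, 11, 10, 12, 11, 13],
        "rating":    [4, 5, 3, 2, 5, 4],
        "timestamp": [100, 200, 150, 300, 250, 400],
    })
    train, test = split_by_timestamp(ratings)  # train on the past, test on the future
    ```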

    New notebooks or improvements

    • Added an o16n notebook for a Spark ALS movie recommender on Azure production services such as Databricks, Cosmos DB, and Azure Kubernetes Service.
    • Added a SAR deep dive notebook demonstrating the single-node implementation.
    • Added Surprise SVD deep dive notebook.
    • Added Surprise SVD integration test.
    • Added Surprise SVD ranking metrics evaluation.
    • Made quick start notebooks consistent in running settings, i.e., experiment protocols (e.g., data split, evaluation metrics) and algorithm parameters (e.g., hyperparameters, removal of seen items).
    • Added a comparison notebook for easy benchmarking of different algorithms.

    Other features

    • Updated SETUP with Azure Databricks.
    • Added SETUP troubleshooting for Azure DSVM and Databricks.
    • Updated READMEs under each notebook directory to provide comprehensive guidelines.
    • Added smoke/integration tests on the large MovieLens datasets (10M and 20M).
    • Updated the Spark settings of the CI/CD machine to eliminate unexpected build failures such as "no space left" errors.
  • 0.1.0 (Nov 12, 2018)

    New algorithms or improvements

    Development of the SAR algorithm in three implementations.

    New utilities or improvements

    New notebooks or improvements

    Other features

    • Benchmark of the current algorithms.
    • Unit, smoke and integration tests for Python and PySpark environments.