PyTorch domain library for recommendation systems

Overview

TorchRec (Experimental Release)

TorchRec is a PyTorch domain library built to provide common sparsity & parallelism primitives needed for large-scale recommender systems (RecSys). It allows authors to train models with large embedding tables sharded across many GPUs.

TorchRec contains:

  • Parallelism primitives that enable easy authoring of large, performant multi-device/multi-node models using hybrid data-parallelism/model-parallelism.
  • The TorchRec sharder can shard embedding tables with different sharding strategies including data-parallel, table-wise, row-wise, table-wise-row-wise, and column-wise sharding.
  • The TorchRec planner can automatically generate optimized sharding plans for models.
  • Pipelined training overlaps dataloading device transfer (copy to GPU), inter-device communications (input_dist), and computation (forward, backward) for increased performance.
  • Optimized kernels for RecSys powered by FBGEMM.
  • Quantization support for reduced precision training and inference.
  • Common modules for RecSys.
  • Production-proven model architectures for RecSys.
  • RecSys datasets (Criteo click logs and MovieLens).
  • Examples of end-to-end training, such as the DLRM event prediction model trained on the Criteo click logs dataset.
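
To make the sharding strategies named above more concrete, here is a minimal pure-Python sketch of how table-wise, row-wise, and column-wise sharding could divide embedding tables across ranks. It only illustrates the partitioning idea: the function names, the round-robin placement, and the even-split policy are assumptions made for this example and are not TorchRec's sharder API.

```python
from typing import Dict, List, Tuple

# A shard is described as (row_start, row_end, col_start, col_end), end-exclusive.
Shard = Tuple[int, int, int, int]


def split_evenly(total: int, parts: int) -> List[Tuple[int, int]]:
    """Split range(total) into `parts` contiguous (start, end) chunks."""
    base, extra = divmod(total, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def table_wise(tables: Dict[str, Tuple[int, int]], world_size: int) -> Dict[int, List[str]]:
    """Place each whole table on one rank (round-robin here, for simplicity)."""
    placement: Dict[int, List[str]] = {rank: [] for rank in range(world_size)}
    for i, name in enumerate(tables):
        placement[i % world_size].append(name)
    return placement


def row_wise(rows: int, dim: int, world_size: int) -> List[Shard]:
    """Split one table by rows; every rank holds all columns of its rows."""
    return [(r0, r1, 0, dim) for r0, r1 in split_evenly(rows, world_size)]


def column_wise(rows: int, dim: int, world_size: int) -> List[Shard]:
    """Split one table by embedding dimension; every rank holds all rows."""
    return [(0, rows, c0, c1) for c0, c1 in split_evenly(dim, world_size)]


# Hypothetical tables, given as name -> (num_rows, embedding_dim).
tables = {"user_id": (1000, 64), "item_id": (500, 64), "category": (10, 16)}
tw_placement = table_wise(tables, world_size=2)
rw_shards = row_wise(1000, 64, world_size=4)
cw_shards = column_wise(1000, 64, world_size=4)
print(tw_placement)
print(rw_shards)
print(cw_shards)
```

Table-wise-row-wise sharding can be thought of as combining the first two: a table is assigned to one node and then split by rows across the devices of that node.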

Installation

We are currently iterating on the setup experience. For now, we provide manual instructions on how to build from source. The example below shows how to install with CUDA 11.1. This setup assumes you have conda installed.

  1. Save the following as environment.yml on your machine.
name: torchrec_py386_cuda111
channels:
  - pytorch-nightly
  - iopath
  - conda-forge
dependencies:
  - python=3.8.6
  - pytorch
  - cudatoolkit=11.1
  - iopath
  - numpy
  - pip:
    - "--editable=git+https://github.com/pytorch/torchx.g[email protected]#egg=torchx"
    - torchmetrics
    - pyre-extensions
  2. Create a new conda environment for torchrec.
mkdir torchrec_oss_install
cd torchrec_oss_install
conda env create -f environment.yml
conda activate torchrec_py386_cuda111
  3. Download the TorchRec repo.
cd src
git clone --recursive https://github.com/facebookresearch/torchrec
  4. Next, install FBGEMM_GPU from source (included in third_party folder of torchrec) by following the directions here. For CUDA 11.1 and SM80 (Ampere) architecture, the following instructions can be used:
conda install -c conda-forge scikit-build jinja2 ninja cmake
export TORCH_CUDA_ARCH_LIST=8.0
export CUB_DIR=/usr/local/cuda-11.1/include/cub
export CUDA_BIN_PATH=/usr/local/cuda-11.1/
export CUDACXX=/usr/local/cuda-11.1/bin/nvcc
python setup.py install -Dcuda_architectures="80" -DCUDNN_LIBRARY_PATH=/usr/local/cuda-11.1/lib64/libcudnn.so -DCUDNN_INCLUDE_PATH=/usr/local/cuda-11.1/include

The last line of the code block above (python setup.py install...), which manually installs fbgemm_gpu, can be skipped if you do not need to build fbgemm_gpu with custom build flags. In that case, skip to the next step.

  5. Then, install TorchRec from source.
# cd to the directory where torchrec's setup.py is located. Then run one of the below:
python setup.py build develop --skip_fbgemm  # If you manually installed fbgemm_gpu in the previous step.
python setup.py build develop                # Otherwise. This will run the fbgemm_gpu install step for you behind the scenes.
  6. Next, we'll need to go into the TorchRec source to update some imports related to fbgemm_gpu. Append the line import fbgemm_gpu to the imports in the files torchrec/sparse/jagged_tensor.py, torchrec/distributed/comm_ops.py, torchrec/distributed/dist_data.py, and torchrec/quant/embedding_modules.py. We installed in develop mode, so edits to the source take effect without rebuilding. We expect to remove the need for this step soon.

  7. Test the installation.

torchx run --scheduler local_cwd test_installation.py:test_installation
  8. If you want to run a more complex example, please take a look at the torchrec DLRM example.

That's it! In the near-to-mid future, we will simplify this process considerably. Stay tuned...
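
The manual import edit described in the installation steps could also be scripted. The following is a minimal sketch, assuming it is run from the directory that contains the `torchrec/` package; the helper name `ensure_import` is hypothetical, and the script simply inserts the line after the last top-level import statement of each file.

```python
import ast
from pathlib import Path

IMPORT_LINE = "import fbgemm_gpu"

# Files named in the installation step, relative to the TorchRec checkout root.
TARGET_FILES = [
    "torchrec/sparse/jagged_tensor.py",
    "torchrec/distributed/comm_ops.py",
    "torchrec/distributed/dist_data.py",
    "torchrec/quant/embedding_modules.py",
]


def ensure_import(path: Path, import_line: str = IMPORT_LINE) -> bool:
    """Insert `import_line` after the last top-level import of `path`.

    Returns True if the file was modified, False if the import was already there.
    """
    source = path.read_text()
    tree = ast.parse(source)
    module_name = import_line.split()[1]
    last_import_end = 0
    for node in tree.body:
        if isinstance(node, ast.Import):
            if any(alias.name == module_name for alias in node.names):
                return False  # already imported; keep the edit idempotent
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            last_import_end = node.end_lineno  # 1-based; needs Python 3.8+
    lines = source.splitlines(keepends=True)
    # Guard against a final import line that has no trailing newline.
    if last_import_end and not lines[last_import_end - 1].endswith("\n"):
        lines[last_import_end - 1] += "\n"
    lines.insert(last_import_end, import_line + "\n")
    path.write_text("".join(lines))
    return True


if __name__ == "__main__":
    for name in TARGET_FILES:
        target = Path(name)
        if target.exists():
            print(name, "modified" if ensure_import(target) else "unchanged")
        else:
            print(name, "not found (run from the TorchRec checkout root)")
```

Because the helper checks for an existing top-level `import fbgemm_gpu`, running it twice leaves the files unchanged the second time.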

License

TorchRec is BSD licensed, as found in the LICENSE file.

Comments
  • Automated submodule update: FBGEMM

    This is an automated pull request to update the first-party submodule for pytorch/FBGEMM.

    New submodule commit: https://github.com/pytorch/FBGEMM/commit/f99e1616630e3b78e3bccd0ceb9e50e8e82409f1

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 82
  • Automated submodule update: FBGEMM

    This is an automated pull request to update the first-party submodule for pytorch/FBGEMM.

    New submodule commit: https://github.com/pytorch/FBGEMM/commit/e8d19d8ed9920d8e3a53c3baadb6cb239798646d

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 31
  • Automated submodule update: FBGEMM

    This is an automated pull request to update the first-party submodule for pytorch/FBGEMM.

    New submodule commit: https://github.com/pytorch/FBGEMM/commit/61de99ff78ae0c5b5b97c5d0c091e85c23cba453

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 30
  • KJT a2a redesign with input dist fusion

    Summary: We expand the KJT interface to include these new calls:

    • sdd_labels -> names of tensors to transmit
    • sdd_splits -> shapes of the internal tensors, split by key
    • sdd_tensors -> the actual tensor data to transmit
    • sdd_init -> builds a new KJT from raw collective output

    Next we changed the KJT a2a to have awaitables. The first awaitable is to transmit the tensor splits so each rank will know the size of the tensors it will receive (this collective call is blocking). The second awaitable is asynchronous and it transmits the tensor data to the correct ranks.

    Finally we allowed input dist fusion by combining the first awaitable in the KJT a2a for all KJTs to execute one after another synchronously, and then calling the second awaitable to transmit the actual tensor data asynchronously.

    TODO: Support gloo

    Differential Revision: D39520093

    LaMa Project: L1138451

    CLA Signed fb-exported 
    opened by joshuadeng 26
  • discarding model_parallel-batched_dense sharding combinations

    Summary: This diff discards the following sharding combinations: (RW/CW/TW/TWRW/TWCW) with batched_dense.

    Also, pruning is being removed from proposers.py as it is no longer needed.

    Differential Revision: D36644029

    CLA Signed fb-exported 
    opened by LBneus 15
  • Automated submodule update: FBGEMM

    This is an automated pull request to update the first-party submodule for pytorch/FBGEMM.

    New submodule commit: https://github.com/pytorch/FBGEMM/commit/41d598cd50700fa9e85f596d6c230205632c025d

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 15
  • Automated submodule update: FBGEMM

    This is an automated pull request to update the first-party submodule for pytorch/FBGEMM.

    New submodule commit: https://github.com/pytorch/FBGEMM/commit/5c804646571b13afd860a74897c8c99ca7c4e1b9

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 15
  • fbgemm_gpu_py.so not found

    Hi team, I'm trying to use torchrec-nightly with torch 1.12 and CUDA 11.2. But when I import torchrec, I get the following:

    >>> import torchrec
    File fbgemm_gpu_py.so not found
    

    A similar issue was reported on the DLRM issue tracker https://github.com/facebookresearch/dlrm/issues/256. Any ideas?

    opened by lukepfister 14
  • TorchRec TorchArrow example

    From README

    Description

    This shows a prototype of integrating a TorchRec based training loop utilizing TorchArrow's on-the-fly preprocessing. The main motivation is to show the utilization of TorchArrow's specialized domain UDFs. Here we use bucketize, firstx, as well as sigrid_hash to do some last-mile preprocessing over the criteo dataset in parquet format.

    These three UDFs are extensively used in Meta's RecSys preprocessing stack. Notably, they make it easy to adjust the preprocessing script to any model changes. For example, if we wish to change the size of our embedding tables, then without sigrid_hash we would need to rerun a bulk offline preprocessing job to ensure that all indices are within bounds. bucketize lets us easily convert dense features into sparse features, with flexibility in where the bucket borders lie. firstx lets us easily prune sparse ids (note that this doesn't provide any functional features, but is in the preprocessing script as a demonstration).
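
As a rough illustration of what the three UDFs do, the sketch below re-implements their ideas in plain Python. It is not TorchArrow code: the border convention used by `bucketize` here (via `bisect_left`) is an assumption, and `hash_to_range` is a hypothetical stand-in that uses SHA-256 rather than the hash function sigrid_hash actually implements.

```python
import bisect
import hashlib
from typing import List


def bucketize(value: float, borders: List[float]) -> int:
    """Map a dense value to a bucket index given sorted bucket borders."""
    return bisect.bisect_left(borders, value)


def firstx(ids: List[int], x: int) -> List[int]:
    """Keep only the first x sparse ids of a feature."""
    return ids[:x]


def hash_to_range(raw_id: int, salt: int, max_value: int) -> int:
    """Stand-in for sigrid_hash: map any id into [0, max_value).

    SHA-256 is used purely for illustration; it is NOT the hash
    function that sigrid_hash implements.
    """
    digest = hashlib.sha256(f"{salt}:{raw_id}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % max_value


borders = [1.0, 10.0, 100.0]
buckets = [bucketize(v, borders) for v in (0.5, 1.0, 7.0, 250.0)]
pruned = firstx([11, 22, 33, 44], 2)
num_embeddings = 1000  # resizing the table only requires changing this value
hashed = [hash_to_range(i, salt=0, max_value=num_embeddings) for i in (5, 10**12)]
print(buckets, pruned, hashed)
```

Because the hashed ids always fall inside `[0, num_embeddings)`, changing the table size does not require re-running offline preprocessing, which is the motivation given above.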

    Installations and Usage

    Download the criteo tsv files (see the README in the main DLRM example). Use the nvtabular script (in torchrec/datasets/scripts/nvt/) to convert the TSV files to parquet.

    Install torcharrow from https://github.com/facebookresearch/torcharrow

    pip install torchdata
    

    Usage

    torchx run -s local_cwd dist.ddp -j 1x4 --script examples/torcharrow/run.py -- --parquet_directory /home/criteo_parquet
    

    The preprocessing logic is in dataloader.py

    Extensions/Future work

    • We will eventually integrate with the upcoming DataLoader2, which will allow us to use a prebuilt solution to collate our dataframe batches into dense tensors or TorchRec's KeyedJaggedTensors (rather than doing this by hand).
    • Building an easier, more performant way to convert parquet -> IterableDataPipe[DataFrame]
    • Some functional abilities are not yet available (such as make_named_row, etc).
    • More RecSys UDFs to come! Please let us know if you have any suggestions.
    CLA Signed 
    opened by YLGH 14
  • Automated submodule update: FBGEMM

    This is an automated pull request to update the first-party submodule for pytorch/FBGEMM.

    New submodule commit: https://github.com/pytorch/FBGEMM/commit/243010f7c3a6887e102add293cb5a530aa884240

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 12
  • Automated submodule update: FBGEMM

    This is an automated pull request to update the first-party submodule for pytorch/FBGEMM.

    New submodule commit: https://github.com/pytorch/FBGEMM/commit/6b629495e0ac2f5ab8fd7472fb40d19194077915

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 12
  • Unify apply_optimizer_in_backward API usage

    Summary: Currently the apply_optimizer_in_backward API lives in both torchrec and distributed folders. We want to migrate all usage of torchrec's apply_optimizer_in_backward to distributed's to unify the API.

    Reviewed By: colin2328, YLGH

    Differential Revision: D42319179

    CLA Signed fb-exported 
    opened by lequytra 1
  • metrics states vectorization

    Summary: The current implementation computes each metric state individually. This diff vectorizes these operations. Some key changes in this diff:

    • states vectorization in all supported metrics
    • The AUC metric now stores its states as a tensor instead of a list of one element. This keeps the states vectorization logic consistent across all metrics.

    Most parts of this diff are the same as D40419238 (https://github.com/pytorch/torchrec/commit/50c861a4debb6d0d8bd55ddb27452e89f2d19d51). The main difference is that this one has added the backward compatibility support for old model checkpoints.

    Differential Revision: D42161727

    CLA Signed fb-exported 
    opened by renganxu 1
  • Correct reserved_percentage in stats

    Summary: Currently in storage_reservations.py, we use percentage to only reserve hbm memory, no ddr memory (https://fburl.com/code/166evbil). This diff makes stats.py behave accordingly.

    Reviewed By: joshuadeng

    Differential Revision: D42329036

    CLA Signed fb-exported 
    opened by ge0405 1
  • Error occurs when running `nvt_preproc.sh`

    Hello, I am trying to preprocess the Criteo TB dataset using NVTabular. I ran nvt_preproc.sh nvt_preproc but got errors.

    This is the error output.

    2022-12-21 15:04:51,981 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
    2022-12-21 15:04:51,982 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
    /N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/merlin/io/dataset.py:251: UserWarning: Initializing an NVTabular Dataset in CPU mode.This is an experimental feature with extremely limited support!
      warnings.warn(
    2022-12-21 16:25:17,116 - distributed.worker - WARNING - Compute Failed
    Key:       ('read-csv-90ef43a055712c364b500b995ea7e3d5', 0)
    Function:  execute_task
    args:      ((subgraph_callable-12d6d57f-182a-4e3a-b65c-e4ee508dcef9, [(<function read_block_from_file at 0x7ff756010a60>, <OpenFile '/N/scratch/haofeng/TB/raw/day_0'>, 0, 128272716, b'\n'), None, True, False]))
    kwargs:    {}
    Exception: 'ValueError("Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.\\n\\n+--------+---------+----------+\\n| Column | Found   | Expected |\\n+--------+---------+----------+\\n| int_1  | float64 | int64    |\\n| int_11 | float64 | int64    |\\n| int_6  | float64 | int64    |\\n+--------+---------+----------+\\n\\nUsually this is due to dask\'s dtype inference failing, and\\n*may* be fixed by specifying dtypes manually by adding:\\n\\ndtype={\'int_1\': \'float64\',\\n       \'int_11\': \'float64\',\\n       \'int_6\': \'float64\'}\\n\\nto the call to `read_csv`/`read_table`.\\n\\nAlternatively, provide `assume_missing=True` to interpret\\nall unspecified integer columns as floats.")'
    
    finished splitting the last day, took 4719.522433280945
    handling the input paths: ['/N/scratch/haofeng/TB/raw/day_0', '/N/scratch/haofeng/TB/raw/day_1', '/N/scratch/haofeng/TB/raw/day_2', '/N/scratch/haofeng/TB/raw/day_3', '/N/scratch/haofeng/TB/raw/day_4', '/N/scratch/haofeng/TB/raw/day_5', '/N/scratch/haofeng/TB/raw/day_6', '/N/scratch/haofeng/TB/raw/day_7', '/N/scratch/haofeng/TB/raw/day_8', '/N/scratch/haofeng/TB/raw/day_9', '/N/scratch/haofeng/TB/raw/day_10', '/N/scratch/haofeng/TB/raw/day_11', '/N/scratch/haofeng/TB/raw/day_12', '/N/scratch/haofeng/TB/raw/day_13', '/N/scratch/haofeng/TB/raw/day_14', '/N/scratch/haofeng/TB/raw/day_15', '/N/scratch/haofeng/TB/raw/day_16', '/N/scratch/haofeng/TB/raw/day_17', '/N/scratch/haofeng/TB/raw/day_18', '/N/scratch/haofeng/TB/raw/day_19', '/N/scratch/haofeng/TB/raw/day_20', '/N/scratch/haofeng/TB/raw/day_21', '/N/scratch/haofeng/TB/raw/day_22', '/N/scratch/haofeng/TB/raw/day_23.part0', '/N/scratch/haofeng/TB/raw/day_23.part1']
    Traceback (most recent call last):
      File "/geode2/home/u030/haofeng/BigRed200/torchrec/torchrec/datasets/scripts/nvt/convert_tsv_to_parquet.py", line 107, in <module>
        convert_tsv_to_parquet(args.input_path, args.output_base_path)
      File "/geode2/home/u030/haofeng/BigRed200/torchrec/torchrec/datasets/scripts/nvt/convert_tsv_to_parquet.py", line 72, in convert_tsv_to_parquet
        tsv_dataset = nvt.Dataset(input_paths, **config)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/merlin/io/dataset.py", line 346, in __init__
        self.infer_schema()
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/merlin/io/dataset.py", line 1127, in infer_schema
        dtypes = self.sample_dtypes(n=n, annotate_lists=True)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/merlin/io/dataset.py", line 1147, in sample_dtypes
        _real_meta = self.engine.sample_data(n=n)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/merlin/io/dataset_engine.py", line 71, in sample_data
        _head = _ddf.partitions[partition_index].head(n)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/dataframe/core.py", line 1265, in head
        return self._head(n=n, npartitions=npartitions, compute=compute, safe=safe)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/dataframe/core.py", line 1299, in _head
        result = result.compute()
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/base.py", line 315, in compute
        (result,) = compute(self, traverse=False, **kwargs)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/base.py", line 600, in compute
        results = schedule(dsk, keys, **kwargs)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/distributed/client.py", line 3122, in get
        results = self.gather(packed, asynchronous=asynchronous, direct=direct)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/distributed/client.py", line 2291, in gather
        return self.sync(
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/distributed/utils.py", line 339, in sync
        return sync(
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/distributed/utils.py", line 406, in sync
        raise exc.with_traceback(tb)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/distributed/utils.py", line 379, in f
        result = yield future
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/distributed/client.py", line 2154, in _gather
        raise exception.with_traceback(traceback)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/optimization.py", line 990, in __call__
        return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/core.py", line 149, in get
        result = _execute_task(task, cache)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
        return func(*(_execute_task(a, cache) for a in args))
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 141, in __call__
        df = pandas_read_text(
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 196, in pandas_read_text
        coerce_dtypes(df, dtypes)
      File "/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 297, in coerce_dtypes
        raise ValueError(msg)
    ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
    
    +--------+---------+----------+
    | Column | Found   | Expected |
    +--------+---------+----------+
    | int_1  | float64 | int64    |
    | int_11 | float64 | int64    |
    | int_6  | float64 | int64    |
    +--------+---------+----------+
    
    Usually this is due to dask's dtype inference failing, and
    *may* be fixed by specifying dtypes manually by adding:
    
    dtype={'int_1': 'float64',
           'int_11': 'float64',
           'int_6': 'float64'}
    
    to the call to `read_csv`/`read_table`.
    
    Alternatively, provide `assume_missing=True` to interpret
    all unspecified integer columns as floats.
    *** Error in `/N/u/haofeng/BigRed200/anaconda3/envs/rapids-22.12/bin/python': corrupted size vs. prev_size: 0x00007ff06c000b00 ***
    

    I read the error output and, as far as I understand, it is caused by a data type mismatch when reading the file. But I did download the original files from this website (original data). I'm hoping someone can tell me why this happens and how I should fix it.

    I have two more questions. First, after I downloaded these files and decompressed them with gzip -dk $filename, I got files named day_0, day_1, ... instead of day_0.csv, day_1.csv. Will this cause problems? I think it is only a difference in the suffix, but I am not sure.
    Second, it seems that I am running the NVTabular Dataset in CPU mode. I have already installed cupy, rmm, and some other libraries; what should I do to run this program in CUDA mode?
    Thank you for your help!

    opened by allenfengjr 1
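
A dtype mismatch like the one in the issue above typically arises when an integer column contains empty fields: by default pandas represents missing values in such a column as float64 NaN, so a partition with gaps no longer matches an inferred int64 dtype. The following standard-library sketch counts empty integer fields in a Criteo-style file. It assumes the Criteo layout of one label, 13 integer features, and 26 categorical features per tab-separated row with no header; the column names `int_0` to `int_12` and the helper name `count_missing_ints` are assumptions for illustration.

```python
import csv
import io
from typing import Dict, Iterable

INT_COLUMNS = [f"int_{i}" for i in range(13)]  # assumed names for columns 1..13


def count_missing_ints(lines: Iterable[str], max_rows: int = 100_000) -> Dict[str, int]:
    """Count empty fields per integer column in the first `max_rows` rows."""
    missing = {name: 0 for name in INT_COLUMNS}
    reader = csv.reader(lines, delimiter="\t")
    for row_index, row in enumerate(reader):
        if row_index >= max_rows:
            break
        # Column 0 is the label; columns 1..13 are the integer features.
        for name, field in zip(INT_COLUMNS, row[1:14]):
            if field == "":
                missing[name] += 1
    return missing


# Tiny in-memory sample: row 2 is missing int_1, row 3 is missing int_6.
sample = io.StringIO(
    "\t".join(["0"] + ["1"] * 13 + ["a"] * 26) + "\n"
    + "\t".join(["1", "1", ""] + ["1"] * 11 + ["a"] * 26) + "\n"
    + "\t".join(["0"] + ["1"] * 6 + [""] + ["1"] * 6 + ["a"] * 26) + "\n"
)
missing = count_missing_ints(sample)
print({name: n for name, n in missing.items() if n})
```

To scan a real file, pass an open file object instead of the in-memory sample, for example `count_missing_ints(open("day_0", newline=""))`. Columns with a non-zero count are candidates for the `dtype=` or `assume_missing=True` options that the error message itself suggests.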
Releases(v0.3.2)
  • v0.3.2(Dec 15, 2022)

    KeyedJaggedTensor

    We observed a performance regression due to a bottleneck in sparse data distribution for models that have multiple, large KJTs to redistribute.

    To combat this, we altered the comms pattern so that the initial collective transports only the minimum data required to support the collective calls for the actual KJT tensor data. Sending this data, the ‘splits’, in the initial collective means more data is transmitted over the comms stream overall, but the CPU is blocked for significantly less time, leading to better overall QPS.

    Furthermore, we altered the TorchRec train pipeline to group the initial collective calls for the splits together before launching the more expensive KJT tensor collective calls. This pseudo ‘fusing’ minimizes the CPU blocked time as launching each subsequent input distribution is no longer dependent on the previous input distribution.
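
The two-phase pattern described above can be simulated in a single process. This is a minimal sketch under stated assumptions: ranks are modeled as list indices, `all_to_all` is a hypothetical stand-in for the real collective, and nothing here is asynchronous. It only shows why exchanging the small splits first lets every rank size its receive buffers before the larger payload moves.

```python
from typing import List

# send_data[src][dst] is the list of values rank `src` sends to rank `dst`.
send_data: List[List[List[int]]] = [
    [[1, 2], [3]],        # rank 0 sends [1, 2] to rank 0 and [3] to rank 1
    [[4], [5, 6, 7]],     # rank 1 sends [4] to rank 0 and [5, 6, 7] to rank 1
]
world_size = len(send_data)


def all_to_all(per_rank_inputs):
    """Simulated all-to-all: output[dst][src] = input[src][dst]."""
    return [
        [per_rank_inputs[src][dst] for src in range(world_size)]
        for dst in range(world_size)
    ]


# Phase 1 (small, blocking): exchange only the split sizes so that every
# rank learns how much data it will receive from each peer.
send_splits = [[len(chunk) for chunk in row] for row in send_data]
recv_splits = all_to_all(send_splits)

# Each rank can now size its receive buffer before any payload arrives.
recv_buffer_sizes = [sum(splits) for splits in recv_splits]

# Phase 2 (large; asynchronous in the real system): exchange the payload.
recv_chunks = all_to_all(send_data)
recv_data = [[v for chunk in chunks for v in chunk] for chunks in recv_chunks]

print(recv_splits)
print(recv_buffer_sizes)
print(recv_data)
```

With several KJTs, the grouping described above corresponds to issuing phase 1 for every KJT first and only then launching the phase 2 calls.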

    We no longer pass in variable batch size in the sharder.

    Planner

    On the planner side, we introduced a new feature, “early stopping”, to GreedyProposer. This brings a 4X speedup to the planner when there are many proposals (>1000) to evaluate. To use the feature, simply add “threshold=10” to GreedyProposer (10 is the suggested value, which means GreedyProposer will stop proposing after seeing 10 consecutive bad proposals). Secondly, we refactored the “deepcopy” logic in the planner code, which brings an 8X speedup to the overall planning time. See PR #665 for the details.
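
The early-stopping idea can be sketched independently of the planner. The function below is a hypothetical illustration, not GreedyProposer itself: it assumes a proposal is "bad" when its cost does not improve on the best cost seen so far, and it stops after `threshold` such proposals in a row.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

P = TypeVar("P")


def search_with_early_stopping(
    proposals: Iterable[P],
    cost: Callable[[P], float],
    threshold: int = 10,
) -> Tuple[Optional[P], int]:
    """Return the cheapest proposal seen and how many were evaluated.

    The search stops once `threshold` consecutive proposals fail to
    improve on the best cost found so far.
    """
    best, best_cost = None, float("inf")
    consecutive_bad = 0
    evaluated = 0
    for proposal in proposals:
        evaluated += 1
        current = cost(proposal)
        if current < best_cost:
            best, best_cost = proposal, current
            consecutive_bad = 0
        else:
            consecutive_bad += 1
            if consecutive_bad >= threshold:
                break
    return best, evaluated


# Toy cost curve: improves until proposal 20, then gets steadily worse.
best, evaluated = search_with_early_stopping(range(1000), lambda p: abs(p - 20))
print(best, evaluated)
```

In this toy run only 31 of the 1000 candidates are evaluated, which is the kind of saving the threshold is meant to provide when most late proposals are worse than an early one.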

    Pinning requirements

    We are also pinning requirements to give TorchRec users more stability.

  • v0.3.0(Oct 27, 2022)

    [Prototype] Simplified Optimizer Fusion APIs

    We’ve provided a simplified and more intuitive API for setting fused optimizer settings via apply_optimizer_in_backward. This new approach makes it possible to specify optimizer settings on a per-parameter basis, and sharded modules will configure FBGEMM’s TableBatchedEmbedding modules accordingly. Additionally, this now lets TorchRec’s planner account for optimizer memory usage. This should alleviate reports of sharding jobs running out of memory (OOM) when using Adam with a plan generated by the planner.

    [Prototype] Simplified Sharding APIs

    We’re introducing the shard API, which now allows you to shard only the embedding modules within a model, and provides an alternative to the current main entry point, DistributedModelParallel. This gives you finer-grained control over the rest of the model, which can be useful for customized parallelization logic and for inference use cases (which may not require any parallelization on the dense layers). We’re also introducing construct_module_sharding_plan, providing a simpler interface to the TorchRec sharder.

    [Beta] Integration with FBGEMM's Quantized Comms Library

    Applying quantization or mixed precision to tensors in a collective call during model parallel training greatly improves training efficiency, with little to no effect on model quality. TorchRec now integrates with the quantized comms library provided by FBGEMM GPU and provides an interface to construct encoders and decoders (codecs) that surround the all_to_all and reduce_scatter collective calls in the output_dist of a sharded module. We also allow you to construct your own codecs to apply to your sharded module. The codecs provided by FBGEMM allow FP16, BF16, FP8, and INT8 compression, and you may use different quantizations for the forward pass and the backward pass.
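
The encode, communicate, decode pattern can be sketched with the Python standard library alone. This is not the FBGEMM codec API: `encode_fp16`, `decode_fp16`, and `fake_collective` are hypothetical names, and the "collective" simply echoes its input. The sketch only shows that an FP16 payload is half the size of FP32 at the cost of a small rounding error.

```python
import struct
from typing import List


def encode_fp16(values: List[float]) -> bytes:
    """Quantize float values to IEEE half precision before sending."""
    return struct.pack(f"<{len(values)}e", *values)


def decode_fp16(payload: bytes) -> List[float]:
    """Dequantize a received half-precision payload back to Python floats."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


def fake_collective(payload: bytes) -> bytes:
    """Placeholder for a real collective call; here it just echoes the bytes."""
    return payload


embeddings = [0.1, -1.5, 3.14159, 0.0]
fp32_bytes = len(struct.pack(f"<{len(embeddings)}f", *embeddings))
wire = encode_fp16(embeddings)
received = decode_fp16(fake_collective(wire))
max_error = max(abs(a - b) for a, b in zip(embeddings, received))
print(fp32_bytes, len(wire), max_error)
```

Using a different codec for the backward pass would, in this picture, mean swapping in a different encode/decode pair around the gradient collective.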

    Planner

    • We removed several unnecessary copies inside the planner, which drastically decreases the runtime.
    • Cleaned up the Topology interface (no longer takes in unrelated information like batch size).
  • v0.2.0(Jun 28, 2022)

    Changelog

    PyPI Installation

    The recommended install location is now PyPI. Additionally, TorchRec's binary will no longer contain fbgemm_gpu. Instead, fbgemm_gpu will be installed as a dependency. See the README for details.

    Planner Improvements

    We added some additional features and fixed some bugs:

    • Variable batch size per feature to support request-only features
    • Better calculations for quant UVM caching
    • Bug fix for shard storage fitting on device

    Single process Batched + Fused Embeddings

    Previously TorchRec’s abstractions (EmbeddingBagCollection/EmbeddingCollection) over FBGEMM kernels, which provide benefits such as table batching, optimizer fusion, and UVM placement, could only be used in conjunction with DistributedModelParallel. We’ve decoupled these notions from sharding, and introduced the FusedEmbeddingBagCollection, which can be used as a standalone module, with all of the above features, and can also be sharded.

    Sharder

    We enabled embedding sharding support for variable batch sizes across GPUs.

    Benchmarking and Examples

    • We introduce a set of benchmarking tests showing the performance characteristics of TorchRec’s base modules and research models built out of TorchRec.
    • We provide an example demonstrating training a distributed TwoTower (i.e. User-Item) retrieval model that is sharded using TorchRec. The projected item embeddings are added to an IVFPQ FAISS index for candidate generation. The retrieval model and KNN lookup are bundled in a PyTorch model for efficient end-to-end retrieval.
    • We provide an inference example with Torch Deploy for both single and multi GPU.

    Integrations

    We demonstrate that TorchRec works out of the box with many components commonly used alongside PyTorch models in production-like systems, such as:

    • Training a TorchRec model on Ray Clusters utilizing the Torchx Ray scheduler
    • Preprocessing and DataLoading with NVTabular on DLRM
    • Training a TorchRec model with on-the-fly preprocessing with TorchArrow showcasing RecSys domain UDFs.

    Scriptable Unsharded Modules

    The unsharded embedding modules (EmbeddingBagCollection/EmbeddingCollection and variants) are now torch scriptable.

    EmbeddingCollection Column Wise Sharding

    We now support column wise sharding for EmbeddingCollection, enabling sequence embeddings to be sharded column wise.

    JaggedTensor

    Boost performance of to_padded_dense function by implementing with FBGEMM.
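
For readers unfamiliar with the operation, the sketch below shows what converting jagged data to a padded dense layout means. It is a plain-Python illustration, not the FBGEMM-backed implementation: the signature is modeled loosely on the function name above, and the choice to truncate rows longer than `desired_length` is an assumption of this sketch.

```python
from typing import List


def to_padded_dense(
    values: List[float],
    lengths: List[int],
    desired_length: int,
    padding_value: float = 0.0,
) -> List[List[float]]:
    """Convert jagged (values, lengths) data to a dense 2D list.

    Rows longer than `desired_length` are truncated; shorter rows are padded.
    """
    dense, offset = [], 0
    for length in lengths:
        row = values[offset : offset + length][:desired_length]
        row += [padding_value] * (desired_length - len(row))
        dense.append(row)
        offset += length
    return dense


# Three rows with 2, 0, and 4 values respectively, stored back to back.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
lengths = [2, 0, 4]
dense = to_padded_dense(values, lengths, desired_length=3)
print(dense)
```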

    Linting

    Add lintrunner to allow contributors to lint and format their changes quickly, matching our internal formatter.

  • v0.1.1(May 17, 2022)

    Changelog

    pytorch.org Install

    The recommended install location is now from download.pytorch.org. See README for details

    RecMetrics

    RecMetrics is a metrics library that collects common utilities and optimizations for Recommendation models.

    • A centralized metrics module that allows users to add new metrics
    • Commonly used metrics, including AUC, Calibration, CTR, MSE/RMSE, NE & Throughput
    • Optimization for metrics related operations to reduce the overhead of metric computation
    • Checkpointing
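
To give a sense of what some of the listed metrics measure, here is a minimal unweighted sketch of CTR, calibration, and normalized entropy (NE) using common textbook definitions. These definitions and function names are assumptions for illustration; RecMetrics' own implementations may differ, for example in how they handle example weights or windowing.

```python
import math
from typing import List


def ctr(labels: List[float]) -> float:
    """Observed click-through rate: mean of the binary labels."""
    return sum(labels) / len(labels)


def calibration(predictions: List[float], labels: List[float]) -> float:
    """Ratio of predicted positives to observed positives (1.0 is ideal)."""
    return sum(predictions) / sum(labels)


def cross_entropy(predictions: List[float], labels: List[float]) -> float:
    """Mean binary cross-entropy (log loss)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(labels)


def normalized_entropy(predictions: List[float], labels: List[float]) -> float:
    """Log loss divided by the log loss of always predicting the base CTR."""
    base = ctr(labels)
    baseline = cross_entropy([base] * len(labels), labels)
    return cross_entropy(predictions, labels) / baseline


labels = [1.0, 0.0, 0.0, 1.0]
predictions = [0.8, 0.2, 0.4, 0.6]
print(ctr(labels), calibration(predictions, labels), normalized_entropy(predictions, labels))
```

An NE below 1.0 means the model's log loss beats the baseline of always predicting the average CTR.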

    TorchRec inference

    Larger models need GPU support for inference. Also, there is a difference between features used in common training stacks and inference stacks. The goal of this library is to make use of some features seen in training to make inference more unified and easier to use.

    EmbeddingTower and EmbeddingTowerCollection

    A new shardable nn.Module called EmbeddingTower/EmbeddingTowerCollection. This module gives model authors the basic building block to establish a clear relationship between a set of embedding tables and post-lookup modules.

    Examples/tutorials

    Inference example

    Documentation (installation and example), an updated CMake build, and a gRPC server example.

    Bert4rec example

    Reproduction of bert4rec paper showcasing EmbeddingCollection module (non pooling)

    Sharding Tutorial

    Overview of sharding in torchrec and the five types of sharding https://pytorch.org/tutorials/advanced/sharding.html

    Improved Planner

    • Updated static estimates for perf
    • Models full model parallel path
    • Includes support for sequence embeddings, weighted features, and feature processors
    • Added grid search proposer
  • v0.1.0(Apr 14, 2022)

    We are excited to announce TorchRec, a PyTorch domain library for Recommendation Systems. This new library provides common sparsity and parallelism primitives, enabling researchers to build state-of-the-art personalization models and deploy them in production.

    • Modeling primitives, such as embedding bags and jagged tensors, that enable easy authoring of large, performant multi-device/multi-node models using hybrid data-parallelism and model-parallelism.
    • Optimized RecSys kernels powered by FBGEMM, including support for sparse and quantized operations.
    • A sharder which can partition embedding tables with a variety of different strategies including data-parallel, table-wise, row-wise, table-wise-row-wise, and column-wise sharding.
    • A planner which can automatically generate optimized sharding plans for models.
    • Pipelining to overlap dataloading device transfer (copy to GPU), inter-device communications (input_dist), and computation (forward, backward) for increased performance.
    • GPU inference support.
    • Common modules for RecSys, such as models and public datasets (Criteo & Movielens).

    See our announcement and docs

Owner
Meta Research

Jun Zhang 37 Nov 28, 2022
A recommendation system for suggesting new books given similar books.

Book Recommendation System A recommendation system for suggesting new books given similar books. Datasets Dataset Kaggle Dataset Notebooks goodreads-E

Sam Partee 2 Jan 06, 2022
NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs.

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in

420 Jan 04, 2023
ToR[e]cSys is a PyTorch Framework to implement recommendation system algorithms

ToR[e]cSys is a PyTorch Framework to implement recommendation system algorithms, including but not limited to click-through-rate (CTR) prediction, learning-to-ranking (LTR), and Matrix/Tensor Embeddi

LI, Wai Yin 90 Oct 08, 2022
Graph Neural Network based Social Recommendation Model. SIGIR2019.

Basic Information: This code is released for the papers: Le Wu, Peijie Sun, Yanjie Fu, Richang Hong, Xiting Wang and Meng Wang. A Neural Influence Dif

PeijieSun 144 Dec 29, 2022
A Library for Field-aware Factorization Machines

Table of Contents ================= - What is LIBFFM - Overfitting and Early Stopping - Installation - Data Format - Command Line Usage - Examples -

1.6k Dec 05, 2022
Elliot is a comprehensive recommendation framework that analyzes the recommendation problem from the researcher's perspective.

Comprehensive and Rigorous Framework for Reproducible Recommender Systems Evaluation

Information Systems Lab @ Polytechnic University of Bari 215 Nov 29, 2022
Recommender System Papers

Included Conferences: SIGIR 2020, SIGKDD 2020, RecSys 2020, CIKM 2020, AAAI 2021, WSDM 2021, WWW 2021

RUCAIBox 704 Jan 06, 2023
Mutual Fund Recommender System. Tailored for fund transactions.

Explainable Mutual Fund Recommendation Data Please see 'DATA_DESCRIPTION.md' for more detail. Recommender System Methods Baseline Collaborative Fiilte

JHJu 2 May 19, 2022