DeepRec is a recommendation engine based on TensorFlow.

DeepRec

Introduction

DeepRec is a recommendation engine based on TensorFlow 1.15, Intel-TensorFlow and NVIDIA-TensorFlow.

Background

A sparse model is a deep learning model in which discrete-feature computation makes up a relatively large share of the model structure. Discrete features are typically non-numeric values, such as IDs, tags, text, and phrases, that algorithms cannot process directly. Sparse models are widely used in high-value businesses such as search, advertising, and recommendation.

DeepRec has been under development since 2016 and supports core businesses such as Taobao search, recommendation, and advertising. It has accumulated a set of features on top of the base framework and performs very well in sparse model training. Given the wide variety of external needs and the trend of deep learning frameworks embracing open source, open-sourcing DeepRec helps establish standardized interfaces, cultivate user habits, greatly reduce the cost for external customers working on the cloud, and build brand value.

Key Features

DeepRec provides super-large-scale distributed training, supporting model training on trillions of samples and the processing of 100 billion embeddings. For sparse model scenarios, in-depth performance optimization has been carried out across CPU and GPU platforms. It offers three categories of features that improve usability and performance in super-scale scenarios.

Sparse Functions

  • Embedding Variable.
  • Dynamic Dimension Embedding Variable.
  • Adaptive Embedding Variable.
  • Multiple Hash Embedding Variable.
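To give a feel for what a hash-table-backed embedding offers over a fixed-size dense table, here is a minimal pure-Python sketch. It assumes that rows are created lazily the first time a feature ID is seen; the class name `ToyEmbeddingVariable` and its methods are hypothetical and are not DeepRec's API.

```python
import random


class ToyEmbeddingVariable:
    """Toy embedding table keyed by arbitrary feature IDs (hypothetical, not DeepRec's API)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rows = {}                   # feature ID -> embedding vector
        self._rng = random.Random(seed)   # seeded so initial values are reproducible

    def lookup(self, ids):
        """Return one vector per ID, creating a row the first time an ID is seen."""
        out = []
        for key in ids:
            if key not in self._rows:
                self._rows[key] = [self._rng.uniform(-0.05, 0.05) for _ in range(self.dim)]
            out.append(self._rows[key])
        return out

    def __len__(self):
        return len(self._rows)


ev = ToyEmbeddingVariable(dim=4)
vectors = ev.lookup(["user_42", "item_7", "user_42"])
```

Because the table grows with the IDs actually observed, no vocabulary size has to be fixed in advance and repeated IDs share the same row.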

Performance Optimization

  • Distributed Training Framework Optimization, such as grpc+seastar, FuseRecv, StarServer, HybridBackend etc.
  • Runtime Optimization, such as CPU memory allocator (PRMalloc), GPU memory allocator etc.
  • Operator level optimization, such as BF16 mixed precision optimization, sparse operator optimization and EmbeddingVariable on PMEM and GPU, new hardware feature enabling, etc.
  • Graph level optimization, such as AutoGraphFusion, SmartStage, AutoPipeline, StrutureFeature, MicroBatch etc.
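As background for the BF16 item above: a BF16 value has the same sign and 8-bit exponent as an FP32 value but keeps only the top 7 mantissa bits, so it corresponds to the upper 16 bits of the FP32 bit pattern. The sketch below illustrates this with plain truncation; hardware and libraries typically round to nearest instead, and the function names are hypothetical.

```python
import struct


def float32_to_bf16_bits(x):
    """Return the 16-bit BF16 pattern obtained by truncating the float32 encoding of x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_float32(b):
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def round_trip(x):
    """Convert x to BF16 and back, exposing the precision that is lost."""
    return bf16_bits_to_float32(float32_to_bf16_bits(x))
```

Values such as 1.0 survive the round trip exactly, while a value like 3.14159 keeps only two to three significant decimal digits.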

Deploy and Serving

  • Incremental model loading and exporting
  • Super-scale sparse model distributed serving
  • Multilevel hybrid storage with support for multiple backends
  • Online deep learning with low latency

Installation

Prepare for installation

CPU Platform

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-developer:1.15deeprec2106-cpu-py36-ubuntu18.04

GPU Platform

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-developer:1.15deeprec2106-gpu-py36-cu110-ubuntu18.04

How to Build

configure

$ ./configure

Compile for CPU and GPU (default)

$ bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Compile for CPU and GPU: ABI=0

$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Compile for CPU optimization: oneDNN + Unified Eigen Thread pool

$ bazel build  -c opt --config=opt  --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Compile for CPU optimization and ABI=0

$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package
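The four build variants above differ only in two independent choices: whether the ABI=0 flags are added and whether the CPU optimization configs are added. The following sketch assembles the corresponding command line; the helper name `bazel_build_command` and its parameters are hypothetical, and only flags that appear in the commands above are used.

```python
import shlex

TARGET = "//tensorflow/tools/pip_package:build_pip_package"


def bazel_build_command(abi0=False, cpu_optimized=False):
    """Assemble one of the bazel build command lines listed above as an argument list."""
    cmd = ["bazel", "build"]
    if abi0:
        flag = "-D_GLIBCXX_USE_CXX11_ABI=0"
        cmd += ["--cxxopt=" + flag, "--host_cxxopt=" + flag]
    cmd += ["-c", "opt", "--config=opt"]
    if cpu_optimized:
        # oneDNN + unified Eigen thread pool
        cmd += ["--config=mkl_threadpool", "--define", "build_with_mkl_dnn_v1_only=true"]
    cmd.append(TARGET)
    return cmd


# Print the "CPU optimization and ABI=0" variant as a shell-safe string.
print(" ".join(shlex.quote(arg) for arg in bazel_build_command(abi0=True, cpu_optimized=True)))
```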

Create whl package

$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install whl package

$ pip3 install /tmp/tensorflow_pkg/tensorflow-1.15.5+deeprec2106-cp36-cp36m-linux_x86_64.whl
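The wheel filename encodes which interpreter the package was built for: `cp36` in the name above means CPython 3.6. If pip rejects the wheel as unsupported, a quick check like the following sketch can confirm whether the tag matches the running interpreter. It assumes the common five-field wheel filename layout without the optional build tag; the helper names are hypothetical.

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel filename into name, version, python tag, ABI tag and platform tag."""
    stem = filename[:-len(".whl")]
    name, version, py_tag, abi_tag, platform_tag = stem.split("-")
    return {"name": name, "version": version, "python": py_tag,
            "abi": abi_tag, "platform": platform_tag}


def matches_interpreter(py_tag, version_info=sys.version_info):
    """True if a CPython tag such as 'cp36' matches the given interpreter version."""
    return py_tag == "cp{}{}".format(version_info[0], version_info[1])


info = parse_wheel_name("tensorflow-1.15.5+deeprec2106-cp36-cp36m-linux_x86_64.whl")
```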

Nightly Images

Image for GPU CUDA11.0

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:deeprec-nightly-gpu-py36-cu110-ubuntu18.04

Image for CPU

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:deeprec-nightly-cpu-py36-ubuntu18.04

Java Compilation

$ ./configure
$ bazel build --config opt //tensorflow/java:tensorflow   //tensorflow/java:libtensorflow_jni
$ javac -cp bazel-bin/tensorflow/java/libtensorflow.jar ...
$ java -cp bazel-bin/tensorflow/java/libtensorflow.jar  -Djava.library.path=bazel-bin/tensorflow/java  ...


License

Apache License 2.0

Comments
  • [Grappler] Add Concat+Cast fusion

    In BF16 graphs, a concat+cast pattern usually appears between the feature columns and the DNN part. This optimization fuses concat(FP32 -> FP32) + cast(FP32 -> BF16), and concat(BF16 -> BF16) + cast(BF16 -> FP32), into a single op.
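The kind of rewrite described here can be sketched on a toy graph representation. The code below is not Grappler's implementation: the node dictionaries, the `FusedConcatCast` op name, and the rule that the Concat must have the Cast as its only consumer are all assumptions made for illustration.

```python
def fuse_concat_cast(nodes):
    """Fuse each Concat whose only consumer is a Cast into a single node.

    `nodes` is a list of dicts with keys: name, op, inputs, and optionally dtype.
    """
    by_name = {n["name"]: n for n in nodes}
    consumers = {}
    for n in nodes:
        for src in n["inputs"]:
            consumers.setdefault(src, []).append(n["name"])

    absorbed = set()   # Concat nodes that were merged into a fused node
    result = []
    for n in nodes:
        src = by_name.get(n["inputs"][0]) if n["op"] == "Cast" and n["inputs"] else None
        if src is not None and src["op"] == "Concat" and consumers[src["name"]] == [n["name"]]:
            absorbed.add(src["name"])
            # Keep the Cast's name so downstream consumers need no rewiring.
            result.append({"name": n["name"], "op": "FusedConcatCast",
                           "inputs": list(src["inputs"]), "dtype": n.get("dtype")})
        else:
            result.append(n)
    return [n for n in result if n["name"] not in absorbed]


graph = [
    {"name": "a", "op": "Placeholder", "inputs": []},
    {"name": "b", "op": "Placeholder", "inputs": []},
    {"name": "concat", "op": "Concat", "inputs": ["a", "b"]},
    {"name": "cast", "op": "Cast", "inputs": ["concat"], "dtype": "bfloat16"},
    {"name": "dense", "op": "MatMul", "inputs": ["cast", "w"]},
]
fused = fuse_concat_cast(graph)
```

After the rewrite, the fused node reads `a` and `b` directly and the intermediate concat output is never materialized in the original dtype.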

    opened by aalbersk 7
  • Build from source and import error "cannot import name saver"

    System information

    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    • TensorFlow installed from (source or binary):
    • TensorFlow version:1.15
    • Python version:2.7
    • Installed using virtualenv? pip? conda?:
    • Bazel version (if compiling from source):
    • GCC/Compiler version (if compiling from source): g++ 7.5
    • CUDA/cuDNN version:
    • GPU model and memory:

    Describe the problem

    ERROR: /DeepRec/tensorflow/BUILD:893:1: Executing genrule //tensorflow:tf_python_api_gen_v1 failed (Exit 1)
    Traceback (most recent call last):
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 27, in <module>
        from tensorflow.python.tools.api.generator import doc_srcs
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 73, in <module>
        from tensorflow.python.ops.standard_ops import *
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/ops/standard_ops.py", line 25, in <module>
        from tensorflow.python import autograph
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/__init__.py", line 35, in <module>
        from tensorflow.python.autograph import operators
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/__init__.py", line 40, in <module>
        from tensorflow.python.autograph.operators.control_flow import for_stmt
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/control_flow.py", line 65, in <module>
        from tensorflow.python.autograph.operators import py_builtins
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/py_builtins.py", line 30, in <module>
        from tensorflow.python.data.ops import dataset_ops
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/__init__.py", line 25, in <module>
        from tensorflow.python.data import experimental
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/__init__.py", line 89, in <module>
        from tensorflow.python.data.experimental.ops.batching import dense_to_sparse_batch
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/batching.py", line 20, in <module>
        from tensorflow.python.data.ops import dataset_ops
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/dataset_ops.py", line 40, in <module>
        from tensorflow.python.data.ops import iterator_ops
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 35, in <module>
        from tensorflow.python.training.saver import BaseSaverBuilder
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saver.py", line 57, in <module>
        from tensorflow.python.training.saving import saveable_object_util
      File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saving/saveable_object_util.py", line 33, in <module>
        from tensorflow.python.training import saver
    ImportError: cannot import name saver
    Target //tensorflow/tools/pip_package:build_pip_package failed to build
    

    Provide the exact sequence of commands / steps that you executed before running into the problem

    bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" -c opt --config=v1 --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package
    

    opened by 3-shi 6
  • Error log when executing the launch script in a PMEM memkind environment

    When I use the latest commit to build a PMEM memkind environment and execute the launch script, the following error will appear.

    1. The commit code version I used image

    2. The build option I used

    bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --copt="-L/usr/local/lib" --copt="-lpmem" --copt="-lmemkind" --config=opt //tensorflow/tools/pip_package:build_pip_package

    3. The script I used

    numactl -N 1 ./launch.sh --batch_size=1280 --dim_size=512 --max_mock_id_amplify=1800 --num_steps=2000 --ev_storage=pmem_memkind

    4. Error logs

    INFO:tensorflow:Running local_init_op.
    INFO:tensorflow:Done running local_init_op.
    Traceback (most recent call last):
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
        return fn(*args)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
        target_list, run_metadata)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
        run_metadata)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0: MultiLevel EV's Cache size -1 should large than IDs in batch 1280
      [[{{node fm/embedding_lookup_36}}]]

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "./benchmark.py", line 228, in <module>
        tf.app.run()
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
        _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
      File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
        _run_main(main, args)
      File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
        sys.exit(main(argv))
      File "./benchmark.py", line 203, in main
        sess.run(train_op)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
        run_metadata=run_metadata)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
        run_metadata=run_metadata)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
        raise six.reraise(*original_exc_info)
      File "/home/pai/lib/python3.6/site-packages/six.py", line 719, in reraise
        raise value
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
        return self._sess.run(*args, **kwargs)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
        run_metadata=run_metadata)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
        return self._sess.run(*args, **kwargs)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
        run_metadata_ptr)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
        feed_dict_tensor, options, run_metadata)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
        run_metadata)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0: MultiLevel EV's Cache size -1 should large than IDs in batch 1280
      [[node fm/embedding_lookup_36 (defined at /home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

    Original stack trace for 'fm/embedding_lookup_36':
      File "./benchmark.py", line 228, in <module>
        tf.app.run()
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
        _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
      File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
        _run_main(main, args)
      File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
        sys.exit(main(argv))
      File "./benchmark.py", line 121, in main
        tf.nn.embedding_lookup(fm_w, batch['col{}'.format(sidx)]))
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 418, in embedding_lookup
        counts=counts)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 184, in _embedding_lookup_and_transform
        counts=counts),
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
        return target(*args, **kwargs)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 3958, in gather
        counts=counts)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/kv_variable_ops.py", line 749, in sparse_read
        name=name)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_kv_variable_ops.py", line 647, in kv_resource_gather
        validate_indices=validate_indices, name=name)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
        op_def=op_def)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
        return func(*args, **kwargs)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
        attrs, op_def, compute_device)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
        op_def=op_def)
      File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
        self._traceback = tf_stack.extract_stack()

    opened by jiefengshuo 4
  • Tensor slice example, tensor slice is much slower than TextLineDataset

    I tried replacing the TextLineDataset with a tensor-slice dataset (see train.py), but creating the MonitoredTrainingSession became much slower than before: it takes roughly 100-110 seconds, compared with only 7 seconds for TextLineDataset. Setting checkpoint_dir to None saves about 70 seconds.

    Do you have any advice on this? Can the checkpoint_dir handling be improved?

    opened by zhanglirong1999 3
  • [BUILD] gcc-8.3 build DeepRec fail.

    System information

    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Centos 7
    • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
    • TensorFlow installed from (source or binary): source
    • TensorFlow version: r1.15.5-deeprec2204u1
    • Python version:
    • Installed using virtualenv? pip? conda?:
    • Bazel version (if compiling from source):
    • GCC/Compiler version (if compiling from source): gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)
    • CUDA/cuDNN version: cuda11.4
    • GPU model and memory:

    Describe the problem

    Building DeepRec fails with gcc 8.3.1 because it triggers a compiler bug in gcc 8.3.1. The error is as follows:

    unique_ali_op_ut.h:498:77: internal compiler error: in is_normal_capture_proxy, at cp/lambda.c:292

    Provide the exact sequence of commands / steps that you executed before running into the problem

    Any other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

    image

    bug 
    opened by ProphetPeng 3
  • Unsupport GlobalStep in subclass of ValuePtrBase

    When we save a checkpoint, the following fatal error occurs: F ./tensorflow/core/framework/embedding/value_ptr.h:256] Unsupport GlobalStep in subclass of ValuePtrBase. I also found that the checkpoint is left as a temporary file: best_checkpoint/best.data-00000-of-00001.tempstate11898667549733680686.

    opened by Lihengwannafly 3
  • [Modelzoo] DIN and DIEN perf drop based on r1.15.5-deeprec2201 tag.

    Modelzoo performance test based on the commit "[Release] Update DeepRec release version to 1.15.5+deeprec2201. (#43)". Test machine: Alibaba Cloud ECS general purpose instance family with high clock speeds, ecs.hfg7.2xlarge.

    Test perf result:

    Gstep | WDL | WDL | DLRM | DLRM | DeepFM | DeepFM | DSSM | DSSM | DIEN | DIEN | DIN | DIN
    -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
    / | value | percent | value | percent | value | percent | value | percent | value | percent | value | percent
    Community TF | 31.92626 | baseline | 82.09168 | baseline | 37.20978 | baseline | 18.54726 | baseline | 14.62987 | baseline | 18.57746 | baseline
    DeepRec FP32 | 34.69318 | 108.67% | 105.4547 | 128.46% | 43.31713 | 116.41% | 21.64175 | 116.68% | 13.27125 | 90.71% | 17.6932 | 95.24%
    DeepRec BF16 | 49.38222 | 154.68% | 114.2221 | 139.14% | 47.34401 | 127.24% | 23.13698 | 124.75% | 13.0392 | 89.13% | 17.20525 | 92.61%
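The "percent" columns are each value divided by the community TensorFlow baseline for the same model. A minimal sketch of that arithmetic follows; the function name is hypothetical, and reading Gstep as a throughput where higher is better is an assumption based on the issue title.

```python
def percent_of_baseline(value, baseline):
    """Express value as a percentage of baseline, rounded to two decimals."""
    return round(100.0 * value / baseline, 2)


wdl_fp32 = percent_of_baseline(34.69318, 31.92626)    # WDL, DeepRec FP32 vs. community TF
dien_fp32 = percent_of_baseline(13.27125, 14.62987)   # DIEN, DeepRec FP32 vs. community TF
```

Under that reading, results above 100% are speedups over the baseline, while the DIEN and DIN results below 100% are the regressions this issue reports.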

    Test AUC result:

    AUC | WDL | WDL | DLRM | DLRM | DeepFM | DeepFM | DSSM | DSSM | DIEN | DIEN | DIN | DIN
    -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
    / | value | percent | value | percent | value | percent | value | percent | value | percent | value | percent
    Community TF | 0.775168 | baseline | 0.768852 | baseline | 0.744794 | baseline | 0.504404 | baseline | 0.8443 | baseline | 0.7887 | baseline
    DeepRec FP32 | 0.775515 | 100.04% | 0.771128 | 100.30% | 0.746055 | 100.17% | 0.503653 | 99.85% | 0.8472 | 100.34% | 0.7913 | 100.33%
    DeepRec BF16 | 0.77604 | 100.11% | 0.772185 | 100.43% | 0.741192 | 99.52% | 0.492327 | 97.61% | 0.8358 | 98.99% | 0.7883 | 99.95%

    PS: The DSSM dataset is small, so its ACC and AUC are limited.

    opened by changqi1 3
  • [SmartStage] SmartStage has low performance on GPU.

    Test environment: image

    Performance comparison: image

    [1] Invalid argument: Trying to access resource linear/linear_model/C1/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0

    [2] 2022-06-07 09:49:01.768708: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at resource_variable_ops.cc:400 : Invalid argument: Trying to access resource linear/linear_model/C12/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0

    opened by JackMoriarty 2
  • [Op] Parallelize UnsortedSegment op.

    Parallelize UnsortedSegmentSum on the CPU device.

    Under the same conditions, the parallel implementation is more efficient.

    Op | Row | Col | S_id | T_nums
    -- | -- | -- | -- | --
    UnsortedSegmentSum | 4096 | 1024 | 128 | 1
    UnsortedSegmentSum | 4096 | 1024 | 128 | 2
    UnsortedSegmentSum | 4096 | 1024 | 128 | 4
    UnsortedSegmentSum | 4096 | 1024 | 128 | 8
    UnsortedSegmentSum | 4096 | 1024 | 128 | 16
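For readers unfamiliar with the op, the sketch below shows what an unsorted segment sum computes and one way the work could be split across threads. The column-partitioning strategy is an assumption chosen for illustration, not necessarily what this PR implements, and pure-Python threads only demonstrate the partitioning rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def unsorted_segment_sum(data, segment_ids, num_segments):
    """Serial reference: out[s] is the sum of all rows i with segment_ids[i] == s."""
    cols = len(data[0]) if data else 0
    out = [[0.0] * cols for _ in range(num_segments)]
    for row, seg in zip(data, segment_ids):
        for c, v in enumerate(row):
            out[seg][c] += v
    return out


def unsorted_segment_sum_parallel(data, segment_ids, num_segments, num_threads=4):
    """Column-partitioned variant: each worker owns a disjoint column range, so no locks are needed."""
    cols = len(data[0]) if data else 0
    out = [[0.0] * cols for _ in range(num_segments)]
    step = max(1, -(-cols // num_threads))   # ceil(cols / num_threads)

    def work(start):
        stop = min(start + step, cols)
        for row, seg in zip(data, segment_ids):
            for c in range(start, stop):
                out[seg][c] += row[c]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(work, range(0, cols, step)))   # list() surfaces worker exceptions
    return out


data = [[1, 2], [3, 4], [5, 6]]
segment_ids = [0, 1, 0]
serial = unsorted_segment_sum(data, segment_ids, num_segments=2)
parallel = unsorted_segment_sum_parallel(data, segment_ids, num_segments=2, num_threads=2)
```

Partitioning by column avoids write conflicts because two rows with the same segment ID always update different output cells when handled by different workers.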

    image

    enhancement 
    opened by marvin-Yu 2
  • DeepRec utilize GPU with really low utilization on the special kind of CPU

    System information

    • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 in Docker
    • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    • TensorFlow installed from (source or binary): source
    • TensorFlow version (use command below): r1.15.5-deeprec2204-39-g0527d0b2ad8 1.15.5
    • Python version: Python 3.6.9
    • Bazel version (if compiling from source): Bazelisk version: v1.11.0 Build label: 0.24.1
    • GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
    • CUDA/cuDNN version: CUDA=11.4, V11.4.152, cuDNN 8
    • GPU model and memory: NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4, Tesla P100 * 4, 16280MiB

    Describe the current behavior

    On one type of GPU instance on Aliyun, I built DeepRec from source following these docs: https://github.com/alibaba/DeepRec#how-to-build. I confirmed that GPU support was enabled, but on this machine my code runs only on the CPU: GPU-Util is always zero and GPU memory usage is low. Here is a runtime capture: image

    On other machines, the same build and execution steps work normally.

    Here is the CPU info of a machine that works fine:

    # cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 85
    model name      : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
    stepping        : 4
    microcode       : 0x1
    cpu MHz         : 2499.998
    cache size      : 33792 KB
    physical id     : 0
    siblings        : 16
    core id         : 0
    cpu cores       : 8
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 22
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat
    bogomips        : 4999.99
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 46 bits physical, 48 bits virtual
    power management:
    

    Here is the CPU info of the machine with low GPU utilization:

    $ cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 79
    model name      : Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
    stepping        : 1
    microcode       : 0x1
    cpu MHz         : 2499.996
    cache size      : 40960 KB
    physical id     : 0
    siblings        : 32
    core id         : 0
    cpu cores       : 16
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 20
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat spec_ctrl intel_stibp
    bogomips        : 4999.99
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 46 bits physical, 48 bits virtual
    power management:
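The two dumps differ mainly in their flags lines. A small sketch like the following can list the flags present on one CPU but not on the other. The excerpts are abbreviated from the dumps above and the helper names are hypothetical; a difference in flags does not by itself establish the cause of the low GPU utilization.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def compare_flags(a_text, b_text):
    """Return (flags only in a, flags only in b), each as a sorted list."""
    a, b = cpu_flags(a_text), cpu_flags(b_text)
    return sorted(a - b), sorted(b - a)


# Abbreviated excerpts of the two dumps above (only a few flags kept).
platinum_8163 = "flags : fpu sse4_2 avx avx2 avx512f avx512bw clwb"
e5_2682_v4 = "flags : fpu sse4_2 avx avx2 ibrs ibpb"
only_first, only_second = compare_flags(platinum_8163, e5_2682_v4)
```

On a live machine the same functions can be applied to the contents of /proc/cpuinfo read from each host.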
    

    Describe the expected behavior

    Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem.

    Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

    opened by fuhailin 2
  • [OP] Change fused matmul layout type and number thread for small size inputs.

    This PR mainly changes the _MklFusedMatMul layout type. It removes unnecessary tensor format conversions and reduces framework overhead.

    • Before applying this PR. image
    • After applying this PR. image

    Performance change:

    |_MklFusedMatMul performance|Time(ms)|percent|
    |:--:|:--:|:--:|
    |DeepRec FP32 - Before|8.862|baseline|
    |DeepRec FP32 - After|8.689|101%|

    opened by changqi1 2
  • ParquetDataset returns an incorrect shape.

    error log:

    Traceback (most recent call last):
      File "train.py", line 832, in <module>
        main()
      File "train.py", line 544, in main
        iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
    AttributeError: 'PrefetchDataset' object has no attribute 'output_types'
    

    Modification made in /root/DeepRec/modelzoo/dlrm:

    diff --git a/modelzoo/dlrm/train.py b/modelzoo/dlrm/train.py
    index 1cd0e7915e..5fbc5ee4f2 100644
    --- a/modelzoo/dlrm/train.py
    +++ b/modelzoo/dlrm/train.py
    @@ -24,6 +24,7 @@ from tensorflow.python.client import timeline
     import json
    
     from tensorflow.python.ops import partitioned_variables
    +from tensorflow.python.data.experimental.ops import parquet_dataset_ops
    
     # Set to INFO for tracking training, default is WARN. ERROR for least messages
     tf.logging.set_verbosity(tf.logging.INFO)
    @@ -300,6 +301,22 @@ def build_model_input(filename, batch_size, num_epochs):
             features = all_columns
             return features, labels
    
    +    def parse_parquet(value):
    +        cont_defaults = [[0.0] for i in range(1, 14)]
    +        cate_defaults = [[' '] for i in range(1, 27)]
    +        label_defaults = [[0]]
    +        column_headers = TRAIN_DATA_COLUMNS
    +        record_defaults = label_defaults + cont_defaults + cate_defaults
    +        columns = value
    +        vs = []
    +        for k,v in columns.items():
    +            vs.append(v)
    +        all_columns = collections.OrderedDict(zip(column_headers, vs))
    +        labels = all_columns.pop(LABEL_COLUMN[0])
    +        features = all_columns
    +        return features, labels
    +
    +
         '''Work Queue Feature'''
         if args.workqueue and not args.tf:
             from tensorflow.python.ops.work_queue import WorkQueue
    @@ -311,12 +328,8 @@ def build_model_input(filename, batch_size, num_epochs):
    
    opened by zhaozheng09 0
  • ParquetDataset return dynamic shape Tensor when set drop_remainder True

    System information

    • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
    • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
    • TensorFlow installed from (source or binary): source
    • TensorFlow version (use command below): 1.15.5+deeprec2208
    • Python version: python3.6
    • Bazel version (if compiling from source): 0.26.1
    • GCC/Compiler version (if compiling from source): gcc version 7.5.0
    • CUDA/cuDNN version: None
    • GPU model and memory: None

    Describe the current behavior

    I generate Parquet files with Python. When I read them with ParquetDataset and set drop_remainder=True, it returns a Tensor with a dynamic shape.

    Describe the expected behavior

    TFRecordDataset with drop_remainder=True returns a Tensor with a static shape. ParquetDataset should likewise return a static-shape Tensor when drop_remainder=True.
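The reason a static shape is possible with drop_remainder=True can be seen in a framework-independent sketch of batching: once the short final batch is discarded, every batch has exactly batch_size rows. The `batch` helper below is hypothetical plain Python, not the TensorFlow or DeepRec API.

```python
def batch(rows, batch_size, drop_remainder=False):
    """Yield consecutive batches; with drop_remainder=True every batch has exactly batch_size rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if drop_remainder and len(chunk) < batch_size:
            return   # discard the short final batch
        yield chunk


rows = [1, 2, 3, 4, 5]   # five rows, as in the Parquet file generated in this report
sizes_kept = [len(b) for b in batch(rows, 2)]
sizes_dropped = [len(b) for b in batch(rows, 2, drop_remainder=True)]
```

With five rows and a batch size of 2, keeping the remainder yields a final batch of one row, so the batch dimension cannot be known ahead of time; dropping it leaves only full batches.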

    Code to reproduce the issue: generate the Parquet file:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
     
    schema = pa.schema([
        ('f1', pa.int64()),
        ('f2', pa.int64()),
        ('f3', pa.int64()),
        ('f4', pa.int64()),
        ('label', pa.float32())
    ])
     
    f1 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
    f2 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
    f3 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
    f4 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
    label = pa.array([0.1, 0.2, 0.3, 0.4, 0.5], pa.float32())
     
    batch = pa.RecordBatch.from_arrays(
        [f1, f2, f3, f4, label],
        schema = schema
    )
    table = pa.Table.from_batches([batch])
     
    pq.write_table(table, 'feature.parquet')
    

    read parquet:

    import os
    
    import tensorflow as tf
    from tensorflow.python.data.experimental.ops.dataframe import DataFrame
    from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
    from tensorflow.python.data.ops import dataset_ops
    
    
    def make_initializable_iterator(ds):
        r"""Wrapper of make_initializable_iterator."""
        if hasattr(dataset_ops, "make_initializable_iterator"):
            return dataset_ops.make_initializable_iterator(ds)
        return ds.make_initializable_iterator()
    
    
    def parquet_map(record):
        label = record.pop("label")
        return record, label
    
    
    filename = """feature.parquet"""
    
    ds = ParquetDataset(
        filename,
        batch_size=2,
        fields=[
            DataFrame.Field("f1", tf.int64),
            DataFrame.Field("f2", tf.int64),
            DataFrame.Field("f3", tf.int64),
            DataFrame.Field("f4", tf.int64),
            DataFrame.Field("label", tf.float32),
        ],
        num_parallel_reads=8,
        drop_remainder=True,
    ).map(parquet_map)
    ds = ds.prefetch(4)
    
    iterator = make_initializable_iterator(ds)
    features, labels = iterator.get_next()
    print("f1 type is:")
    print(type(features['f1']))
    print('f1 shape is:')
    print(features['f1'].shape)
    
    sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    
    with tf.Session(config=sess_config) as sess:
        sess.run(iterator.initializer)
        for i in range(1):
            feature, label = sess.run([features, labels])
            print(feature)
            print("Label: ")
            print(label)
    

    Other info / logs

    f1 type is:
    <class 'tensorflow.python.framework.ops.Tensor'>
    f1 shape is:
    (?,)
    2023-01-03 13:23:36.623328: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2600000000 Hz
    2023-01-03 13:23:36.629326: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4ea9280 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2023-01-03 13:23:36.629356: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    {'f1': array([1, 2]), 'f2': array([1, 2]), 'f3': array([1, 2]), 'f4': array([1, 2])}
    Label: 
    [0.1 0.2]
    

    For comparison, the TFRecord code. Write the TFRecord file:

    import tensorflow as tf
    tf.enable_eager_execution()
    
    # All raw values should be converted to a type compatible with tf.Example. Use
    # the following functions to do these conversions.
    def _bytes_feature(value):
        """Returns a bytes_list from a string / byte."""
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))
    
    
    def _float_feature(value):
        """Returns a float_list from a float / double."""
        return tf.train.Feature(float_list=tf.train.FloatList(value=value))
    
    
    def _int64_feature(value):
        """Returns an int64_list from a bool / enum / int / uint."""
        return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
    
    def write_record():
        f1 = [1, 2, 3, 4, 5]
        label = [1, 2, 3, 4, 5]
    
        # Write one serialized `Example` per row, so that each record holds a
        # single value per feature, as FixedLenFeature(shape=(1,)) expects.
        with tf.python_io.TFRecordWriter('feature.tfrecords') as writer:
            for f1_value, label_value in zip(f1, label):
                feature = {
                    'label': _int64_feature([label_value]),
                    'f1': _int64_feature([f1_value]),
                }
                tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
                writer.write(tf_example.SerializeToString())
    
    if __name__ == "__main__":
        write_record()
    

    read tfrecord:

    import tensorflow as tf
    from tensorflow.python.data.ops import dataset_ops
    
    
    def make_initializable_iterator(ds):
        r"""Wrapper of make_initializable_iterator."""
        if hasattr(dataset_ops, "make_initializable_iterator"):
            return dataset_ops.make_initializable_iterator(ds)
        return ds.make_initializable_iterator()
    
    
    def tfrecord_map(example_proto):
        features = {}
        features['f1'] = tf.FixedLenFeature(shape=(1,), dtype=tf.int64)
        features['label'] = tf.FixedLenFeature(shape=(1,), dtype=tf.int64)
        parsed_example = tf.parse_example(example_proto, features)
        f1 = parsed_example['f1']
        label = parsed_example['label']
        features = {'f1': f1}
        labels = {'label': label}
        return features, labels
    
    
    filename = """feature.tfrecords"""
    
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.batch(2, drop_remainder=True)
    dataset = dataset.map(tfrecord_map)
    dataset = dataset.prefetch(2)
    iterator = make_initializable_iterator(dataset)
    features, labels = iterator.get_next()
    print("f1 type is:")
    print(type(features['f1']))
    print('f1 shape is:')
    print(features['f1'].shape)
    

    result:

    f1 type is:
    <class 'tensorflow.python.framework.ops.Tensor'>
    f1 shape is:
    (2, 1)
    
    opened by welsonzhang 0
  • ParquetDataset met coredump when data contain DELTA_BINARY_PACKED encoding

    ParquetDataset met coredump when data contain DELTA_BINARY_PACKED encoding

    System information

    • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
    • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
    • TensorFlow installed from (source or binary): source
    • TensorFlow version (use command below): r1.15.5-deeprec2210-25-ga27850bf1de 1.15.5
    • Python version: Python 3.6.9
    • Bazel version (if compiling from source): 0.26.1
    • GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
    • CUDA/cuDNN version: cuda:11.7.0-cudnn8
    • GPU model and memory: NVIDIA TITAN V 12288MiB

    Describe the current behavior

    I use Apache Iceberg to generate Parquet files. When the files are compressed with zstd, ParquetDataset crashes while reading int64 data. DeepRec uses Arrow 5.0, but as far as I can tell Arrow only supports the DELTA_BINARY_PACKED encoding from version 7.0 onward, so I think the Arrow version needs to be upgraded. The upgrade should not affect compatibility for other users.

    Describe the expected behavior: ParquetDataset works as expected with the DELTA_BINARY_PACKED encoding.
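
For readers unfamiliar with the encoding named in this report, the core idea of delta encoding can be sketched in a few lines: store the first value, then the differences between neighbours, shifted by the smallest difference so that every stored number is non-negative and small. This is a simplified illustration only. The real Parquet DELTA_BINARY_PACKED layout additionally splits values into blocks and miniblocks and bit-packs the results, none of which is reproduced here, and the function names are assumptions made for this example.

```python
def delta_encode(values):
    """Return (first_value, min_delta, adjusted_deltas) for a list of ints."""
    if not values:
        raise ValueError("values must not be empty")
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas) if deltas else 0
    # Subtracting the minimum makes every stored delta non-negative, so it
    # could later be bit-packed with a small fixed width.
    adjusted = [d - min_delta for d in deltas]
    return values[0], min_delta, adjusted


def delta_decode(first_value, min_delta, adjusted):
    """Invert delta_encode by re-adding the minimum and accumulating."""
    out = [first_value]
    for d in adjusted:
        out.append(out[-1] + d + min_delta)
    return out


ids = [1000, 1003, 1004, 1010, 1011]
encoded = delta_encode(ids)
decoded = delta_decode(*encoded)
```

Sorted or slowly changing int64 columns compress well this way, which is why writers tend to choose this encoding for id-like columns. A reader that does not implement the decoding step cannot recover the values.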

    Code to reproduce the issue

    part.zstd.parquet: https://drive.google.com/file/d/1CoumvsuL47trnFi4Bn6haRIsgTy9frSE/view?usp=share_link
    part.gz.parquet: https://drive.google.com/file/d/1V_cOrjIVTVZ5y7Q4KbHa085ay6GeaZH-/view?usp=share_link

    import os
    
    import tensorflow as tf
    from tensorflow.python.data.experimental.ops.dataframe import DataFrame
    from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
    from tensorflow.python.data.ops import dataset_ops
    
    
    
    def make_initializable_iterator(ds):
        r"""Wrapper of make_initializable_iterator."""
        if hasattr(dataset_ops, "make_initializable_iterator"):
            return dataset_ops.make_initializable_iterator(ds)
        return ds.make_initializable_iterator()
    
    
    def parquet_map(record):
        label = record.pop("label")
        return record, label
    
    
    filename = """part.zstd.parquet"""
    # filename = 'part.gz.parquet'
    
    # Read from a parquet file.
    ds = ParquetDataset(
        filename,
        batch_size=4,
        fields=[
            DataFrame.Field("f_2672", tf.int64),
            DataFrame.Field("f_2671", tf.int64, ragged_rank=0),
            DataFrame.Field("f_2673", tf.int64, ragged_rank=0),
            DataFrame.Field("f_5196", tf.float32, ragged_rank=0),
            DataFrame.Field("f_8436", tf.float32, ragged_rank=0),
            DataFrame.Field("label", tf.int32),
        ],
        num_parallel_reads=8,
    ).map(parquet_map)
    ds = ds.prefetch(4)
    
    iterator = make_initializable_iterator(ds)
    features, labels = iterator.get_next()
    
    sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    
    with tf.Session(config=sess_config) as sess:
        sess.run(iterator.initializer)
        for i in range(1):
            feature, label = sess.run([features, labels])
            print(feature)
            print("Label: ")
            print(label)
    
    

    Other info / logs: (image attached in the original issue)

    opened by fuhailin 0
  • [Auto Micro Batch] Iterator has not been initialized when setting micro_batch_num

    [Auto Micro Batch] Iterator has not been initialized when setting micro_batch_num

    System information

    • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.2 LTS
    • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    • TensorFlow installed from (source or binary):
    • TensorFlow version (use command below): 1.15
    • Python version: 3.6
    • Bazel version (if compiling from source): 0.26.1
    • GCC/Compiler version (if compiling from source):
    • CUDA/cuDNN version: 11.4
    • GPU model and memory: T4

    Describe the current behavior: I set sess_config.graph_options.optimizer_options.micro_batch_num = 2, and the following error occurs:

      File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
        return fn(*args)
      File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
        return self._call_tf_sessionrun(options, feed_dict, fetch_list,
      File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
        return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
    tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
      (0) Failed precondition: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
             [[{{node cond/IteratorGetNext_1/dup0}}]]
             [[metrics_1/ROC_cvr2_cpd_second_stay_act_metric/assert_greater_equal/Assert/AssertGuard/Assert/data_1/_22249]]
      (1) Failed precondition: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
             [[{{node cond/IteratorGetNext_1/dup0}}]]
    

    If I disable micro_batch, the code runs normally.
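
For context, micro-batching is commonly understood as splitting one training step into `micro_batch_num` smaller passes and combining their gradients. The duplicated node name `cond/IteratorGetNext_1/dup0` in the traceback is consistent with the input op being replicated per micro-batch. The sketch below illustrates only the arithmetic, assuming equally sized micro-batches and a mean-squared-error loss; it is not DeepRec's implementation, and the function names are made up for the example.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def micro_batch_grad(w, xs, ys, micro_batch_num):
    """Average the gradients of equally sized micro-batches."""
    size = len(xs) // micro_batch_num
    total = 0.0
    for i in range(micro_batch_num):
        lo, hi = i * size, (i + 1) * size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi])
    return total / micro_batch_num


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
split = micro_batch_grad(0.5, xs, ys, micro_batch_num=2)
```

Both paths give the same gradient, so the setting should change memory use and scheduling but not the result. Each micro-batch does need its own input fetch, which is where an uninitialized iterator copy would surface.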

    Code to reproduce the issue

    _train_input_fn = tf.compat.v1.data.make_initializable_iterator(_train_input_fn)
    _eval_input_fn = tf.compat.v1.data.make_initializable_iterator(_eval_input_fn)
    features, labels = tf.cond(is_training, true_fn=lambda: _train_input_fn.get_next(),
                               false_fn=lambda: _eval_input_fn.get_next())
    initializer = [tf.compat.v1.global_variables_initializer(),
                   tf.compat.v1.local_variables_initializer(),
                   tf.compat.v1.tables_initializer(),
                   _train_input_fn.initializer,
                   _eval_input_fn.initializer]
    sess_config = tf.compat.v1.ConfigProto(allow_soft_placement=True, log_device_placement=log_device_placement)
    sess_config.gpu_options.allow_growth = True
    sess_config.graph_options.optimizer_options.micro_batch_num = 2
    
    sess_config.intra_op_parallelism_threads = intra_threads
    sess_config.inter_op_parallelism_threads = inter_threads
    session = tf.compat.v1.Session(config=sess_config)
    with session:
        session.run(initializer)
    


    opened by Lihengwannafly 0
  • ParquetDataset raise ValueError: No supported fields found in parquet file

    ParquetDataset raise ValueError: No supported fields found in parquet file

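    Problem: I used MapReduce to generate Parquet files, but reading them with DeepRec's ParquetDataset raised the following exception: No supported fields found in parquet file. The Parquet schema is: message example {required int32 id;required binary email;} The MR job is based on https://github.com/whale2/iow-hadoop-streaming. Cause: reading through the code, I found that the schema in the footer could not be scanned, which raised the exception. Why could the schema not be retrieved?

    Troubleshooting steps:
    (1) My own Python code can read the schema without problems.
    (2) After dropping MR and generating content with a similar structure in Python, ParquetDataset reads the file without problems.
    (3) Comparing the Parquet file generated by MR with the one generated locally in Python shows that one column is bytes and the other is string.
    (4) I therefore suspect that the MR job writes the column as bytes, which DeepRec cannot recognize.

    Changing the schema definition to the following solved the problem: message example {required int32 id;required binary email(UTF-8);}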

    opened by welsonzhang 0
Releases(r1.15.5-deeprec2210)
  • r1.15.5-deeprec2210(Nov 17, 2022)

    Major Features and Improvements

    Embedding

    • Support HBM-DRAM-SSD storage in EmbeddingVariable multi-tier storage.
    • Support frequency-based initialization of multi-tier EmbeddingVariable when restoring a model.
    • Support looking up the storage location of ids in EmbeddingVariable.
    • Support kv_initialized_op for GPU Embedding Variable.
    • Support restore compatibility of EmbeddingVariable using init_from_proto.
    • Improve performance of apply/gather ops for EmbeddingVariable.
    • Add Eviction Manager in EmbeddingVariable Multi-tier storage.
    • Add unified thread pool for cache of Multi-tier storage in EmbeddingVariable.
    • Save frequencies and versions of features in SSDHash and LevelDB storage of EmbeddingVariable.
    • Avoid invalid eviction when using HBM-DRAM storage of EmbeddingVariable.
    • Prevent access to uninitialized data when using EmbeddingVariable.
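
To make the multi-tier storage and eviction items above more concrete, here is a conceptual sketch of a two-tier key-value store in which a small fast tier is backed by a larger slow tier and least recently used entries are demoted. It is an illustration under assumptions, not DeepRec's code: the class name `TwoTierStore`, the LRU policy, and the use of Python dicts to stand in for HBM, DRAM, or SSD are all choices made for this example.

```python
from collections import OrderedDict


class TwoTierStore:
    """Small fast tier backed by a larger slow tier, with LRU demotion."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for the faster tier
        self.slow = {}             # stands in for the slower tier

    def put(self, key, value):
        # A fresh write always lands in the fast tier.
        self.slow.pop(key, None)
        self.fast[key] = value
        self.fast.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        # Miss in the fast tier: promote from the slow tier.
        # Raises KeyError if the key is in neither tier.
        value = self.slow.pop(key)
        self.put(key, value)
        return value

    def _evict(self):
        # Demote least recently used entries until the fast tier fits.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value


store = TwoTierStore(fast_capacity=2)
for feature_id in (10, 11, 12):
    store.put(feature_id, [0.0] * 4)
fast_before = list(store.fast)
store.get(10)
fast_after = list(store.fast)
```

Reading id 10 promotes it back into the fast tier and demotes id 11, the least recently used entry.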

    Graph & Grappler Optimization

    • Optimize Async EmbeddingLookup by placement optimization.
    • Place VarHandlerOp to Compute main graph for SmartStage.
    • Support independent thread pool for stage subgraph to avoid thread contention.
    • Implement device placement optimization.

    Runtime Optimization

    • Support CUDA Graph execution by adding CUDA Graph mode session.
    • Support CUDA Graph execution in JIT mode.
    • Support intra task cost estimate in CostModel in Executor.
    • Support tf.stream and tf.colocate python API for CUDA multi-stream.
    • Support embedding subgraphs partition policy when using CUDA multi-stream.
    • Optimize CUDA multi-stream by merging copy stream into compute stream.

    Ops & Hardware Acceleration

    • Add a list of Quantized* and _MklQuantized* ops.
    • Implement GPU version of SparseFillEmptyRows.
    • Implement a C version of spin_lock to support multiple architectures.
    • Upgrade the OneDNN version to v2.7.

    Distributed

    • Support distributed training using SOK based on EmbeddingVariable.
    • Add NETWORK_MAX_CONNECTION_TIMEOUT to make the connection timeout configurable in StarServer.
    • Upgrade the SOK version to v4.2.

    IO

    • Add TF_NEED_PARQUET_DATASET to enable ParquetDataset.

    Serving

    • Optimize embedding lookup performance by disabling the feature filter when serving.
    • Improve the error codes returned to the user when parsing a request or response fails.
    • Support an independent thread pool for model updates to avoid performance jitter.

    ModelZoo

    • Add MaskNet Model.
    • Add PLE Model.
    • Support variable type BF16 in DCN model.

    BugFix

    • Fix tf.nn.embedding_lookup interface bug and session hang bug when enabling async embedding.
    • Fix warmup failure when the user sets a warmup file path.
    • Fix build failure in ev_allocator.cc and hash.cc on ARM.
    • Fix build failure in arrow when building on ARM.
    • Fix redefined error in NEON header file for ARM.
    • Fix _mm_malloc build failure in sparsehash on ARM.
    • Fix warmup failure when using session_group.
    • Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.
    • Fix the colocation error when using EmbeddingVariable in distributed training.
    • Fix HostNameToIp failures by replacing gethostbyname with getaddrinfo in StarServer.

    More details of features: https://deeprec.readthedocs.io/zh/latest/

    Release Images

    CPU Image

    alideeprec/deeprec-release:deeprec2210-cpu-py36-ubuntu18.04

    GPU Image

    alideeprec/deeprec-release:deeprec2210-gpu-py36-cu116-ubuntu18.04

    Thanks to our Contributors

    Duyi-Wang, Locke, shijieliu, Honglin Zhu, chenxujun, GosTraight2020, LALBJ, Nanno

  • r1.15.5-deeprec2208u1(Nov 2, 2022)

    Major Features and Improvements

    BugFix

    • Fix the issue where a list of Quantized* and _MklQuantized* ops could not be found.
    • Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.
    • Fix warmup failure when the user sets a warmup file path.
    • Fix warmup failure when using session_group.

    Release Images

    CPU Image

    alideeprec/deeprec-release:deeprec2208u1-cpu-py36-ubuntu18.04

    GPU Image

    alideeprec/deeprec-release:deeprec2208u1-gpu-py36-cu116-ubuntu18.04

  • r1.15.5-deeprec2208(Sep 23, 2022)

    Major Features and Improvements

    Embedding

    • Multi-tier EmbeddingVariable supports HBM; add async compactor in SSDHashKV.
    • Support tf.feature_column.shard_embedding_columns, SequenceCategoricalColumn and WeightedCategoricalColumn API for EmbeddingVariable.
    • Support save and restore checkpoint of GPU EmbeddingVariable.
    • Support EmbeddingVariable OpKernel with REAL_NUMBER_TYPES.
    • Support user defined default_value for feature filter.
    • Support feature column API for MultiHash.

    Graph & Grappler Optimization

    • Add FP32 fused l2 normalize op and grad op and tf.nn.fused_layer_normalize API.
    • Add Concat+Cast fusion ops.
    • Optimize SmartStage performance on GPU.
    • Add a macro to control the mkl_layout_pass optimization.
    • Support asynchronous embedding lookup.

    Runtime Optimization

    • CPUAllocator: avoid multiple threads cleaning up at the same time.
    • Support an independent intra-op thread pool for each session, and support pinning the intra-op thread pool to a cpuset.
    • Support multi-stream with virtual device.

    Ops & Hardware Acceleration

    • Implement ApplyFtrl, ResourceApplyFtrl, ApplyFtrlV2 and ResourceApplyFtrlV2 GPU kernels.
    • Optimize BatchMatmul GPU kernel.
    • Integrate cuBLASlt into backend and use BlasLtMatmul in batch_matmul_op.
    • Support GPU fusion of matmul+bias+(activation).
    • Merge NV-TF r1.15.5+22.06.

    Optimizer

    • Support AdamW optimizer for EmbeddingVariable.

    Model Save/Restore

    • Support asynchronously restoring EmbeddingVariable from checkpoint.
    • Support EmbeddingVariable in init_from_checkpoint.

    Serving

    • Add go/java/python client SDK and demo.
    • Support GPU multi-streams in SessionGroup.
    • Support independent inter thread pool for each session in SessionGroup.
    • Support multi-tiered Embedding.
    • Support immutable EmbeddingVariable.

    Quantization

    • Add a low-precision optimization tool that supports BF16, FP16 and INT8 for SavedModel and checkpoint.
    • Add embedding variable quantization.
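
As background for the BF16 item above: bfloat16 keeps the sign bit and all 8 exponent bits of float32 but only 7 mantissa bits, so a float32 value can be converted by keeping the upper 16 bits of its encoding. The sketch below uses plain truncation for simplicity. Production converters typically round to nearest even, and how DeepRec's tool converts values is not stated in these notes, so treat this as an illustration with made-up function names.

```python
import struct


def float32_to_bf16_bits(x):
    """Keep the upper 16 bits of the IEEE 754 float32 encoding (truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_float32(bits16):
    """Expand 16 bfloat16 bits back to a float by zero-filling the low bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x


original = 3.14159
restored = bf16_bits_to_float32(float32_to_bf16_bits(original))
```

The round trip keeps the magnitude and roughly two to three significant decimal digits, which is the trade-off behind using BF16 for weights and activations.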

    ModelZoo

    • Optimize DIN's BF16 performance.
    • Add DCN & DCNv2 models and MLPerf recommendation benchmark.

    Profiler

    • Add detail information for RecvTensor in timeline.

    Dockerfile

    • Add ubuntu 22.04 dockerfile and images with gcc11.2 and python3.8.6.
    • Add cuda11.2, cuda11.4, cuda11.6, cuda11.7 docker images and use cuda 11.6 as default GPU image.

    Environment & Build

    • Update default TF_CUDA_COMPUTE_CAPABILITIES to 6.0,6.1,7.0,7.5,8.0.
    • Upgrade bazel version to 0.26.1.
    • Support for building DeepRec on ROCm2.10.0.

    BugFix

    • Fix build failures with gcc11 & gcc12.
    • StarServer: remove user packet splitting to avoid out-of-order delivery of multiple user packets.
    • Fix the 'NodeIsInGpu is not declare' issue.
    • Fix the placement bug of worker devices during distributed training in ModelZoo.
    • Fix out of range issue for BiasAddGrad op when AVX512 is enabled.
    • Avoid loading an invalid model during model updates in serving.

    More details of features: https://deeprec.readthedocs.io/zh/latest/

    Release Images

    CPU Image

    alideeprec/deeprec-release:deeprec2208-cpu-py36-ubuntu18.04

    GPU Image

    alideeprec/deeprec-release:deeprec2208-gpu-py36-cu116-ubuntu18.04

  • r1.15.5-deeprec2206(Jul 6, 2022)

    Major Features and Improvements

    Embedding

    • Multi-tier EmbeddingVariable: add SSD_HashKV, which performs better than LevelDB.
    • Support GPU EmbeddingVariable, whose gather/apply ops are placed on the GPU.
    • Add a user API to record frequency and version for EmbeddingVariable.

    Graph Optimization

    • Add Embedding Fusion ops for CPU/GPU.
    • Optimize SmartStage performance on GPU.

    Runtime Optimization

    • Executor: support cost-based scheduling and running critical-path ops first.
    • GPUAllocator: support the CUDA malloc async allocator (requires CUDA >= 11.2).
    • CPUAllocator: automatically generate the memory allocation policy.
    • PMEMAllocator: optimize the allocator and add statistics.

    Ops & Hardware Acceleration

    • Implement SparseReshape, SparseApplyAdam, SparseApplyAdagrad, SparseApplyFtrl, ApplyAdamAsync, SparseApplyAdamAsync, KvSparseApplyAdamAsync GPU kernels.
    • Optimize UnSortedSegment on CPU.
    • Upgrade OneDNN to v2.6.

    IO & Dataset

    • ParquetDataset: add a Parquet dataset, which can reduce storage and improve performance.

    Model Save/Restore

    • Asynchronous restore EmbeddingVariable from checkpoint.

    Serving

    • SessionGroup: greatly improve QPS and RT in inference.

    ModelZoo

    • Add models SimpleMultiTask, ESMM, DBMTL, MMoE, BST.

    Profiler

    • Support for mapping of operators and real thread ids in timeline.

    BugFix

    • Fix EmbeddingVariable core dump when EmbeddingVariable only has a primary embedding value.
    • Fix abnormal behavior in L2-norm calculation.
    • Fix checkpoint saving issue when using LevelDB in EmbeddingVariable.
    • Fix failure to delete old checkpoints when using incremental checkpoint.
    • Fix build failure with CUDA 11.6.

    More details of features: https://deeprec.readthedocs.io/zh/latest/

    Release Images

    CPU Image

    alideeprec/deeprec-release:deeprec2206-cpu-py36-ubuntu18.04

    GPU Image

    alideeprec/deeprec-release:deeprec2206-gpu-py36-cu110-ubuntu18.04

  • r1.15.5-deeprec2204u1(Apr 28, 2022)

    Major Features and Improvements

    BugFix

    • Fix checkpoint saving issue when using EmbeddingVariable. (https://github.com/alibaba/DeepRec/issues/167)
    • Fix the "inputs from different frames" issue when using auto graph fusion. (https://github.com/alibaba/DeepRec/issues/144)
    • Fix embedding_lookup_sparse graph issue.

    Release Images

    CPU Image

    alideeprec/deeprec-release:deeprec2204u1-cpu-py36-ubuntu18.04

    GPU Image

    alideeprec/deeprec-release:deeprec2204u1-gpu-py36-cu110-ubuntu18.04

  • r1.15.5-deeprec2204(Apr 7, 2022)

    Major Features and Improvements

    Embedding

    • Support hybrid storage of EmbeddingVariable (DRAM, PMEM, LevelDB).
    • Support memory-continuous storage of multi-slot EmbeddingVariable.
    • Optimize beta1_power and beta2_power slots of EmbeddingVariable.
    • Support restore frequency of features in EmbeddingVariable.

    Distributed Training

    • Integrate SOK in DeepRec.

    Graph Optimization

    • Auto Graph Fusion: support float32/int32/int64 types for select fusion.
    • SmartStage: fix a bug where the graph contains a cycle when SmartStage optimization is enabled.

    Runtime Optimization

    • GPUTensorPoolAllocator, which reduces GPU memory usage and improves performance.
    • PMEMAllocator: support allocation in persistent memory.

    Optimizer

    • Optimize AdamOptimizer performance.

    Op & Hardware Acceleration

    • Change the fused MatMul layout type and thread count for small inputs.

    IO & Dataset

    • KafkaGroupIODataset, support consumer rebalance.

    Model Save/Restore

    • Support dump incremental graph info.

    Serving

    • Add serving module (ODL processor), which supports Online Deep Learning (ODL).

    More details of features: https://deeprec.readthedocs.io/zh/latest/

    Release Images

    CPU Image

    registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2204-cpu-py36-ubuntu18.04

    GPU Image

    registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2204-gpu-py36-cu110-ubuntu18.04

    Known Issue

    Some users reported issues when using Embedding Variable, such as https://github.com/alibaba/DeepRec/issues/167. The bug is fixed in r1.15.5-deeprec2204u1.

  • r1.15.5-deeprec2201(Jan 11, 2022)

    This is the first release of DeepRec. DeepRec has super large-scale distributed training capability, supporting model training on trillions of samples and the processing of 100 billion embeddings. For sparse model scenarios, in-depth performance optimization has been conducted across CPU and GPU platforms.

    Major Features and Improvements

    Embedding

    • Embedding Variable (including feature eviction and feature filter)
    • Dynamic Dimension Embedding Variable
    • Adaptive Embedding
    • Multi-Hash Variable
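
The feature filter and feature eviction mentioned for Embedding Variable can be pictured with a toy table: a key gets its own embedding row only after it has been seen often enough, and rows that have not been touched for a while are removed. The sketch below is a conceptual illustration only. The class name, the parameter names `filter_freq` and `steps_to_live`, the default vector, and the initial value 0.1 are assumptions made for this example and do not describe DeepRec's API.

```python
class FilteredEmbedding:
    """Toy embedding table with a frequency filter and step-based eviction."""

    def __init__(self, dim, filter_freq, steps_to_live):
        self.dim = dim
        self.filter_freq = filter_freq      # hypothetical admission threshold
        self.steps_to_live = steps_to_live  # hypothetical eviction window
        self.counts = {}
        self.rows = {}
        self.last_seen = {}

    def lookup(self, key, step):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_seen[key] = step
        if self.counts[key] < self.filter_freq:
            # Not admitted yet: return a default vector without storing a row.
            return [0.0] * self.dim
        return self.rows.setdefault(key, [0.1] * self.dim)

    def evict(self, step):
        """Drop keys not seen within the last steps_to_live steps."""
        stale = [k for k, s in self.last_seen.items()
                 if step - s > self.steps_to_live]
        for k in stale:
            self.rows.pop(k, None)
            self.counts.pop(k, None)
            del self.last_seen[k]
        return stale


table = FilteredEmbedding(dim=2, filter_freq=2, steps_to_live=5)
first = table.lookup("user_42", step=0)
second = table.lookup("user_42", step=1)
evicted = table.evict(step=10)
```

The first lookup returns the default vector because the key has been seen only once. The second lookup admits the key and creates its row, and the later `evict` call removes it again because it has gone unseen for more than five steps.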

    Distributed Training

    • GRPC++
    • StarServer

    Graph Optimization

    • Auto Micro Batch
    • Auto Graph Fusion
    • Embedding Fusion
    • Smart Stage

    Runtime Optimization

    • CPU Memory Optimization
    • GPU Memory Optimization
    • GPU Virtual Memory

    Optimizer

    • AdamAsync Optimizer
    • AdagradDecay Optimizer

    Op & Hardware Acceleration

    • CPU/GPU optimization of tens of ops, including Unique, Gather, DynamicStitch, BiasAdd, Select, Transpose, SparseSegmentReduction, Where, DynamicPartition and SparseConcat.
    • Support oneDNN 2.3.2 and BF16.
    • Support TF32

    IO & Dataset

    • WorkQueue
    • KafkaDataset

    More details of features: https://deeprec.readthedocs.io/zh/latest/

    Release Images

    CPU Image

    registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2201-cpu-py36-ubuntu18.04

    GPU Image

    registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2201-gpu-py36-cu110-ubuntu18.04

Owner
Alibaba
Alibaba Open Source
Imaginaire - NVIDIA's Deep Imagination Team's PyTorch Library

Imaginaire Docs | License | Installation | Model Zoo Imaginaire is a pytorch library that contains optimized implementation of several image and video

NVIDIA Research Projects 3.6k Dec 29, 2022
Face Recognition plus identification simply and fast | Python

PyFaceDetection Face Recognition plus identification simply and fast Ubuntu Setup sudo pip3 install numpy sudo pip3 install cmake sudo pip3 install dl

Peyman Majidi Moein 16 Sep 22, 2022
The Incredible PyTorch: a curated list of tutorials, papers, projects, communities and more relating to PyTorch.

This is a curated list of tutorials, projects, libraries, videos, papers, books and anything related to the incredible PyTorch. Feel free to make a pu

Ritchie Ng 9.2k Jan 02, 2023
这是一个yolo3-tf2的源码,可以用于训练自己的模型。

YOLOV3:You Only Look Once目标检测模型在Tensorflow2当中的实现 目录 性能情况 Performance 所需环境 Environment 文件下载 Download 训练步骤 How2train 预测步骤 How2predict 评估步骤 How2eval 参考资料

Bubbliiiing 68 Dec 21, 2022
A Joint Video and Image Encoder for End-to-End Retrieval

Frozen️ in Time ❄️ ️️️️ ⏳ A Joint Video and Image Encoder for End-to-End Retrieval project page | arXiv | webvid-data Repository containing the code,

225 Dec 25, 2022
PyTorch implementation for "Sharpness-aware Quantization for Deep Neural Networks".

Sharpness-aware Quantization for Deep Neural Networks This is the official repository for our paper: Sharpness-aware Quantization for Deep Neural Netw

Zhuang AI Group 30 Dec 19, 2022
An Open Source Machine Learning Framework for Everyone

Documentation TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, a

170.1k Jan 05, 2023
A Python module for the generation and training of an entry-level feedforward neural network.

ff-neural-network A Python module for the generation and training of an entry-level feedforward neural network. This repository serves as a repurposin

Riadh 2 Jan 31, 2022
Official Keras Implementation for UNet++ in IEEE Transactions on Medical Imaging and DLMIA 2018

UNet++: A Nested U-Net Architecture for Medical Image Segmentation UNet++ is a new general purpose image segmentation architecture for more accurate i

Zongwei Zhou 1.8k Jan 07, 2023
This project is the PyTorch implementation of our CVPR 2022 paper:

Requirements and Dependency Install PyTorch with CUDA (for GPU). (Experiments are validated on python 3.8.11 and pytorch 1.7.0) (For visualization if

Lei Huang 23 Nov 29, 2022
UPSNet: A Unified Panoptic Segmentation Network

UPSNet: A Unified Panoptic Segmentation Network Introduction UPSNet is initially described in a CVPR 2019 oral paper. Disclaimer This repository is te

Uber Research 622 Dec 26, 2022
iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis

iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis Andreas Bl

CompVis Heidelberg 36 Dec 25, 2022
Dynamic Bottleneck for Robust Self-Supervised Exploration

Dynamic Bottleneck Introduction This is a TensorFlow based implementation for our paper on "Dynamic Bottleneck for Robust Self-Supervised Exploration"

Bai Chenjia 4 Nov 14, 2022
Hyperparameter Optimization for TensorFlow, Keras and PyTorch

Hyperparameter Optimization for Keras Talos • Key Features • Examples • Install • Support • Docs • Issues • License • Download Talos radically changes

Autonomio 1.6k Dec 15, 2022
Latte: Cross-framework Python Package for Evaluation of Latent-based Generative Models

Cross-framework Python Package for Evaluation of Latent-based Generative Models Latte Latte (for LATent Tensor Evaluation) is a cross-framework Python

Karn Watcharasupat 30 Sep 08, 2022
Pytorch implementation of the paper Progressive Growing of Points with Tree-structured Generators (BMVC 2021)

PGpoints Pytorch implementation of the paper Progressive Growing of Points with Tree-structured Generators (BMVC 2021) Hyeontae Son, Young Min Kim Pre

Hyeontae Son 9 Jun 06, 2022
Skipgram Negative Sampling in PyTorch

PyTorch SGNS Word2Vec's SkipGramNegativeSampling in Python. Yet another but quite general negative sampling loss implemented in PyTorch. It can be use

Jamie J. Seol 287 Dec 14, 2022
FluxTraining.jl gives you an endlessly extensible training loop for deep learning

A flexible neural net training library inspired by fast.ai

86 Dec 31, 2022
MPRNet-Cloud-removal: Progressive cloud removal

MPRNet-Cloud-removal Progressive cloud removal Requirements 1. PyTorch = 1.0 2. Python 3 3. NVIDIA GPU + CUDA 9.0 4. TensorBoard Installation 1. Clone the

Semi 95 Dec 18, 2022
Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework

This repo is the official implementation of "Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework". @inproceedings{zhou2021insta

34 Dec 31, 2022