Deploy optimized transformer based models on Nvidia Triton server

Last update: Jan 05, 2023

Related tags

Overview

🤗 Hugging Face Transformer submillisecond inference 🤯 and deployment on Nvidia Triton server

Yes, you can perfom inference with transformer based model in less than 1ms on the cheapest GPU available on Amazon (T4)!

The commands below have been tested on a AWS G4.dnn with Deep Learning Base AMI (Ubuntu 18.04) Version 44.0. They may require some small adaptations to be run on a another Linux distribution.

You can find explanations on how it works in Hugging Face Transformer inference UNDER 1 millisecond latency

Baseline set by Hugging Face Infinity demo

Hugging Face infinity demo video

AWS virtual machine: g4dn.xlarge (T4 GPU)
model: "philschmid/MiniLM-L6-H384-uncased-sst2" (Hugging Face hub URL)
experience 1 : batch size 1, seq len 16 tokens -> 1.7ms
experience 2 : batch size 1, seq len 128 tokens -> 2.5ms

Install dependencies

Those dependencies have to be installed on the remote machine directly (no container).

git clone [email protected]:ELS-RD/triton_transformers.git
pip3 install -r requirements.txt

Generate optimized models

We generate the models from a Docker image so we can also get measures for TensorRT + ONNX Runtime.

cd triton_transformers
DOCKER_BUILDKIT=1 docker build --tag onnxruntime-trt:latest -f Dockerfile .
docker run -it --rm --gpus all -v $PWD:/project onnxruntime-trt bash -c "cd /project && python convert_onnx.py"

⚠️ WARNING ⚠️ : if you run the conversion outside Docker container, you may have very different timings, and TensorRT won't work

It should produce something like that:

10/31/2021 11:35:08 INFO     inference done on Tesla T4
10/31/2021 11:35:08 INFO     timing [[TensorrtExecutionProvider] ./onnx_models/model-shape.onnx]: mean=0.61ms, sd=0.11ms, min=0.52ms, max=0.92ms, median=0.54ms, 95p=0.88ms, 99p=0.90ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model.onnx]: mean=1.10ms, sd=0.10ms, min=1.04ms, max=3.44ms, median=1.07ms, 95p=1.29ms, 99p=1.36ms
10/31/2021 11:35:08 INFO     timing [[CUDAExecutionProvider] ./onnx_models/model-optimized.onnx]: mean=0.63ms, sd=0.05ms, min=0.60ms, max=0.84ms, median=0.61ms, 95p=0.77ms, 99p=0.79ms
10/31/2021 11:35:08 INFO     timing [Pytorch_32]: mean=5.09ms, sd=0.16ms, min=4.88ms, max=6.11ms, median=5.07ms, 95p=5.28ms, 99p=5.35ms
10/31/2021 11:35:08 INFO     timing [Pytorch_FP16]: mean=6.04ms, sd=0.74ms, min=5.77ms, max=28.79ms, median=6.05ms, 95p=6.19ms, 99p=6.29ms

TensorRT and optimized ONNX Runtime provides very similar results on short sequences. In the following steps, we will continue with ONNX Runtime model because the dynamic axis are easier to work with compared to TensorRT.

Docker build will is very slow on a G4, be patient... the docker image is only required for TensorRT support inside ONNX Runtime (and measure a difference, if any, with ONNX Runtime).

FastAPI server

This is our baseline, easy to run, but not very performant.

# launch server, disable logging for best performances
python3 -m uvicorn --log-level warning server_onnx:app --port 8000 --host 0.0.0.0
# other variation, 1 worker per CPU for best latency (plus not a good idea to have several times the same model on a single GPU):
python3 -m gunicorn -w 1 -k uvicorn.workers.UvicornWorker --log-level warning server_onnx:app --bind 0.0.0.0:8000

# simple inference timing
time curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict
# slightly more serious measure
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo perf stat -r 50 -d curl -G --data-urlencode query="This live event is great. I will sign-up for Infinity." localhost:8000/predict -s > /dev/null

It should produce:

Performance counter stats for 'curl -G --data-urlencode query=This live event is great. I will sign-up for Infinity. localhost:8000/predict' (50 runs):

              6.14 msec task-clock                #    0.494 CPUs utilized            ( +-  0.59% )
                 3      context-switches          #    0.462 K/sec                    ( +-  1.84% )
                 0      cpu-migrations            #    0.000 K/sec                  
               577      page-faults               #    0.094 M/sec                    ( +-  0.06% )
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

         0.0124429 +- 0.0000547 seconds time elapsed  ( +-  0.44% )

Triton server

We want to copy the ONNX model we have generated in the first step in this folder. Then we launch the Triton image. As you can see we install Transformers and then launch the server itself. This is of course a bad practice, you should make your own 2 lines Dockerfile with Transformers inside.

# copy the generated model to triton model folder
cp ./onnx_models/model-optimized.onnx ./triton_models/sts/1/model.onnx
# install transformers (and its tokenizer) and launch server in a single line, ugly but good enough for our demo
# --shm-size 256m -> to have several Python backend at the same time
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

Triton server perf analysis

You need to edit the source code to load the 16 or 128 token sequence (the text is already included).

16 tokens:

[email protected]:~/triton_transformers$ python3 triton_transformers.py 
10/31/2021 12:09:34 INFO     timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
[[-3.4355469  3.2753906]]

128 tokens:

[email protected]:~/triton_transformers$ python3 triton_transformers.py 
10/31/2021 12:12:00 INFO     timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
[[-3.4589844  3.3027344]]

There is also a more serious performance analysis tool called perf_analyzer (it will take care to check that measures are stable, etc.). documentation The tool need to be run on Ubuntu >= 20.04 (and won't work on Ubuntu 18.04 used for the AWS official Ubuntu deep learning image): It also make measures on torchserve and tensorflow.

# perf_analyzer needs this dependency
sudo apt install libb64-dev
# add -a for async measures, and -i grpc to use that protocol instead of http 
~/.local/bin/perf_analyzer -m transformers --percentile=95 --input-data perf_data.json --shape TEXT:1 # -i grpc -a
# just test the model part (easier to get random input)
~/.local/bin/perf_analyzer --input-data zero -m sts --shape input_ids:1,16 --shape attention_mask:1,16 #-i grpc -a

Call Triton HTTP API directly

If you don't want to use the tritonclient API, you can call the Triton server those ways:

# if you like Python requests library
python3 triton_requests.py

# if you want generic HTTP template, the @ means no data conversion
curl -X POST  http://localhost:8000/v2/models/transformers/versions/1/infer \
  --data-binary "@query_body.bin" \
  --header "Inference-Header-Content-Length: 160"

Use TensorRT model in Triton server (instead of ONNX)

To use TensorRT model instead of ONNX Runtime one:

we need to convert the ONNX to TensorRT engine
update the configuration, TensorRT takes int32 as input instead of int64

# we use Docker container to guarantee the use of the right trtexec version (otherwise you will have a deserialization error)
# it's a bacic conversion, IRL you want to provide minimum, optimimum and maximum shape at least
# it may take a few minutes...
docker run -it --rm --gpus all -v $PWD/onnx_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
    /usr/src/tensorrt/bin/trtexec \
    --onnx=/models/model.onnx \
    --best \
    --minShapes=input_ids:1x16,attention_mask:1x16 \
    --optShapes=input_ids:1x16,attention_mask:1x16 \
    --maxShapes=input_ids:32x16,attention_mask:32x16  \
    --saveEngine="/models/model.plan" \
    --workspace=6000 \
    --useCudaGraph

# move to triton model folder
cp ./onnx_models/model.plan ./triton_models/sts/1/model.plan

You then need to update you config.pbtxt in STS and tokenizer folders, replace all TYPE_INT64 tensor type by TYPE_INT32. In STS configuraiton file, replace platform: "onnxruntime_onnx" by platform: "tensorrt_plan" Finally convert the numpy tensors to int32 in the tokenizer python code, like below (notice the astype()):

input_ids = pb_utils.Tensor("INPUT_IDS", tokens['input_ids'].astype(np.int32))
attention = pb_utils.Tensor("ATTENTION", tokens['attention_mask'].astype(np.int32))

And you are done!

Comments

Support for large models (external data format)
This PR closes #59.

Changelog:

Refactored dockerfile and fixed dependencies to cope with python3: /root/gpgpu/MachineLearning/myelin/src/compiler/optimizer/reshape_ppg.cpp:950: void myelin::ir::reshape_ppg_t::transform_op(myelin::ir::bb_t*, myelin::ir::operation_t*): Assertionop->outputs()[0]->dimensions().size() == 3' failed.`

Bumped patch version

Added --fast argument to skip the fp16 conversion (saving GPU memory)

Updated logging (increased default verbosity for better understandability)

Added external data path for tensorrt to cope with models > 2G

Moved ONNX export post Pytorch benchmark to do conversion on CPU only (for larger models)

bug documentation
opened by oborchers 16

Dynamic batching does not give better latency for Roberta running on TensorRT.

Hi, I used your build_engine API to convert the Roberta model. While building if I use the constant batch size for input_shapes, i.e. (min, optimal, max) -> (1,1,1) or (4, 4, 4,). The model yields good results (faster than ort and torch).

But when I convert it with dynamic batch size i.e. (min, optimal, max) -> (1, 4, 4), the model performs really slow compared to ort or torch.

code to understand the problem better:

# fast inference but constrained to use always 4 batches during inferencing
tensor_shapes = list(zip([4, 4, 4], [1, 128, 128]))

# slow inference
tensor_shapes = list(zip([1, 4, 4], [1, 128, 128]))

engine: ICudaEngine = build_engine(
    runtime=runtime,
    onnx_file_path=onnx_model_path,
    logger=trt_logger,
    min_shape=tensor_shapes[0],
    optimal_shape=tensor_shapes[1],
    max_shape=tensor_shapes[2],
    workspace_size=workspace_size * 1024**3,
    fp16=not quantization,
    int8=quantization,
    profiling=True,
)

save_engine(engine=engine, engine_file_path=tensorrt_path)

the complete build and inference logs for slow inference case (when converting with dynamic batch)

[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +312, GPU +0, now: CPU 3789, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3790, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 3790 MiB, GPU 2470 MiB
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 3924 MiB, GPU 2504 MiB
[06/02/2022-03:19:09] [TRT] [I] parsing TensorRT model
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1418322027
[06/02/2022-03:19:22] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +512, GPU +226, now: CPU 5802, GPU 2730 (MiB)
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +116, GPU +52, now: CPU 5918, GPU 2782 (MiB)
[06/02/2022-03:19:43] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[06/02/2022-03:19:43] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:19:43] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:19:43] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:25:32] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:25:32] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:25:32] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:30:10] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[06/02/2022-03:30:10] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:30:10] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:30:10] [TRT] [W]  (# 0 (SHAPE attention_mask))
[06/02/2022-03:30:32] [TRT] [I] Total Host Persistent Memory: 208
[06/02/2022-03:30:32] [TRT] [I] Total Device Persistent Memory: 0
[06/02/2022-03:30:32] [TRT] [I] Total Scratch Memory: 442827264
[06/02/2022-03:30:32] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 774 MiB, GPU 2058 MiB
[06/02/2022-03:30:32] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.038945ms to assign 4 blocks to 4 nodes requiring 443041280 bytes.
[06/02/2022-03:30:32] [TRT] [I] Total Activation Memory: 443041280
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5993, GPU 4298 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5993, GPU 4306 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +1353, now: CPU 0, GPU 1353 (MiB)
[06/02/2022-03:30:33] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7354, GPU 4282 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7355, GPU 4290 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 1352 (MiB)
[06/02/2022-03:30:38] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7366, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7367, GPU 5644 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 2704 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 6002, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6002, GPU 5644 (MiB)
[06/02/2022-03:30:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +423, now: CPU 0, GPU 3127 (MiB)

latencies in ms
--------------------------------------------------
Pytorch 
--------------------------------------------------
[93.5968, 94.0308, 94.8224, 93.6746, 94.5972, 94.0188, 92.3105, 93.6535, 92.4908, 91.4413]
--------------------------------------------------
Onnxruntime 
 --------------------------------------------------
[81.445, 81.3684, 80.2145, 81.5339, 82.9578, 83.6845, 83.6738, 82.6652, 81.5462, 82.8237]
--------------------------------------------------
TensorRT (FP16) 
 --------------------------------------------------
[426.353, 425.1992, 426.0317, 425.8226, 426.8828, 428.0485, 426.3119, 426.4556, 425.4863, 426.0393]
--------------------------------------------------

Is this the expected behavior?

I want to convert the model to use dynamic batches. When inferencing, the model should be able to handle a variable batch size and perform faster. How can I achieve that?

Any help would be greatly appreciated, thank you in advance.

bug

opened by Ki6an 12

Optimizations for T0

I'm trying to replicate the T5 ONNX optimization notebook (the latest version, on the feat/t5_3b branch) but for T0_3B (which in itself is a derivative of T5, but with a slightly different config and no tie_word_embeddings.

I installed ONNX runtime from source as described in the notebook.

The only changes I made to the notebook are replacing "t5-3b" with "bigscience/T0_3B", and commenting out out_dec["last_hidden_state"] = out_dec["last_hidden_state"] * (pytorch_model.model_dim**-0.5) in the ExportT5 class, as T0 does not use tie word embeddings.

However, the notebook fails on dec_if_ort_model = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3), with the error: Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from ./test-dec-if/model.onnx failed:This is an invalid model. Error: the graph is not acyclic.

Shouldn't T0 work because it is essentially T5? Your help would be greatly appreciated @pommedeterresautee. Thanks!

opened by michaelroyzen 10
WIP - Support Token Classification

I think this makes it so TD can handle TokenClassification.

However, the model I picked to test with seems to not convert well, both the ONNX and TensorRT conversion fail on the assert np.areclose, and I am not sure what this means...

I am testing it with

$ python src/transformer_deploy/convert.py -m dslim/bert-large-NER --backend onnx --seq-len 8 128 256 --batch-size 1 1 1 --task=TokenClassification --verbose

Also, sorry for all the commits, I can squash them on my fork and make it clean later, I just wanted to know if you had any idea why this was failing.

opened by sam-writer 9
got error in optimize onnx when ran gpt2 file from demo/generative-model

getting error when ran this code part logging.basicConfig() logging.getLogger().setLevel(logging.INFO) num_attention_heads, hidden_size = get_model_size(path=model_name) optimize_onnx( onnx_path="test-gpt2.onnx", onnx_optim_model_path="test-gpt2-opt.onnx", fp16=True, use_cuda=True, num_attention_heads=num_attention_heads, hidden_size=hidden_size, architecture='gpt2' )

INFO:fusion_base:Fused LayerNormalization count: 25 INFO:fusion_base:Fused FastGelu count: 12

failed in shape inference <class 'AssertionError'> failed in shape inference <class 'AssertionError'> failed in shape inference <class 'AssertionError'>

INFO:onnx_model:Graph pruned: 0 inputs, 0 outputs and 720 nodes are removed INFO:onnx_model_gpt2:postprocess: remove Reshape count:72 INFO:fusion_base:Fused FastGelu(add bias) count: 12 INFO:onnx_model_bert:opset verion: 13

AssertionError Traceback (most recent call last)

in () 9 num_attention_heads=num_attention_heads, 10 hidden_size=hidden_size, ---> 11 architecture='gpt2' 12 )

7 frames

/usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in _add_suggested_merge(self, symbols, apply) 209 210 def add_suggested_merge(self, symbols, apply=False): --> 211 assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols]) 212 symbols = set(symbols) 213 for k, v in self.suggested_merge.items():

AssertionError:
bug

opened by rohitmishra94 8
Support other tasks/architectures?

First off: thank you! This is a great project, I'm really grateful you released it publically.

From what I can tell, this supports encoder-only architectures, and the Sequence Classification task (ex). Am I correct? If so, are there plans to support, or interest in supporting, other architectures (encoder/decoder, decoder-only) and/or tasks (Token Classification and Masked token prediction for encoder-only architectures, or Seq2SeqLM for the other architectures)?

opened by sam-writer 7
GPT2 has slow inference

Hello,

your wrapper for gpt2 does not support 'past_key_values' as huggingface transformers initially do. I've seen your measurements in the gpt2 demo, and at least for pytorch they are not really correct, instead of just simply calling the model with always the same input, you should call the generate method..

I tried to run gpt2 in pytorch both on cpu and gpu (GPU: TESLA T4) with your sample text: "Here is some text to encode Hello World"

here are my results (vanilla pytorch): gpu no cache: 14s/sequence gpu cache: 3.6s/sequence

cpu no cache: 114s/sequence cpu cache: 9.8s/sequence

For every measurement, the result is average out of ten runs of the generate method, I used number of beams=5

when running greedysearch, the difference is not so big, but still.. cpu no cache: 29s cpu cache: 4.8s

CPU: Intel(R) Xeon(R) Platinum 8259CL CPU

opened by kobzaond 6

Calibration failure occurred with no scaling factors detected

Hey,

first of all, thanks a lot for your great work. This repo was already a great help to me.

With your quantization update for INT8, however, I ran into a problem. As soon as I activate --quantization, I get the following error:

[01/14/2022-11:18:37] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[01/14/2022-11:18:37] [TRT] [E] 4: [standardEngineBuilder.cpp::initCalibrationParams::1402] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
[01/14/2022-11:18:37] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )

Traceback (most recent call last):
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 326, in <module>
    entrypoint()
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 322, in entrypoint
    main(commands=args)
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 216, in main
    engine: ICudaEngine = build_engine(
  File "/data/repos/transformer-deploy/src/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7feb14128e30>, None

The problem in the traceback is then just that the trt_engine will be None. I don't get any other warnings or errors, so I'm a bit at a loss. I've tried with distilroberta-base and also with bert-base-uncased, but I get the same error each time. Did you, by any chance, run into the same problem at some point in time or do you see what the issue may be?

Thanks a lot in advance!

opened by v1nc3nt27 6

Failed to load private model

Hi,

I tried to convert a private model of sentence-transformer on the Hugging Face Hub:

docker run -it --rm --gpus all \
    -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
    bash -c "cd /project && \
    convert_model -m \"Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2\" \
    --backend tensorrt onnx \
    --task embedding \
    --seq-len 16 128 128 \
    --auth-token XXX"

However, the download of config.json file failed with the following message:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.01 (build 31237563)

Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 451/451 [00:00<00:00, 650kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.83M/4.83M [00:02<00:00, 2.48MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16.3M/16.3M [00:07<00:00, 2.41MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 280/280 [00:00<00:00, 425kB/s]
401 Client Error: Unauthorized for url: https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2/resolve/main/config.json
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 585, in _get_config_dict
    resolved_config_file = cached_path(
  File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1846, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 2050, in get_from_cache
    _raise_for_status(r)
  File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1977, in _raise_for_status
    request.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2/resolve/main/config.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 357, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 152, in main
    model_config: PretrainedConfig = AutoConfig.from_pretrained(pretrained_model_name_or_path=commands.model)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/configuration_auto.py", line 612, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 537, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 618, in _get_config_dict
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co/' to load this model and it looks like Matthieu/paraphrase-multilingual-MiniLM-L12-v2-pooling-GPLv2 is not the path to a directory conaining a config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Any advice?

Thanks!

opened by Matthieu-Tinycoaching 5

'assert num_heads > 0' error with DistilBert

I get the following error when I try to optimize distilbert:

AssertionError                            Traceback (most recent call last)
<timed eval> in <module>

/opt/conda/lib/python3.7/site-packages/transformer_deploy/convert.py in main(input_args)
    245             onnx_path=onnx_model_path,
    246             onnx_optim_fp16_path=onnx_optim_fp16_path,
--> 247             use_cuda=True,
    248         )
    249         onnx_model = create_model_for_provider(path=onnx_optim_fp16_path, provider_to_use="CUDAExecutionProvider")

/opt/conda/lib/python3.7/site-packages/transformer_deploy/backends/ort_utils.py in optimize_onnx(onnx_path, onnx_optim_fp16_path, use_cuda)
     72         num_heads=0,  # automatic detection don't work with opset 13
     73         hidden_size=0,  # automatic detection
---> 74         optimization_options=optimization_options,
     75     )
     76 

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/optimizer.py in optimize_model(input, model_type, num_heads, hidden_size, optimization_options, opt_level, use_gpu, only_onnxruntime)
    289 
    290     if not only_onnxruntime:
--> 291         optimizer.optimize(optimization_options)
    292 
    293     # Remove the temporary model.

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in optimize(self, options, add_dynamic_axes)
    317             if options is not None:
    318                 self.attention_mask.set_mask_format(options.attention_mask_format)
--> 319             self.fuse_attention()
    320 
    321         self.fuse_shape()

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in fuse_attention(self)
     52 
     53     def fuse_attention(self):
---> 54         self.attention_fusion.apply()
     55 
     56     def fuse_gelu(self):

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_base.py in apply(self)
     41                     raise Exception("Can not find node in any graphs")
     42                 self.this_graph_name = graph.name
---> 43                 self.fuse(node, input_name_to_nodes, output_name_to_node)
     44 
     45         op_list = [node.op_type for node in self.nodes_to_add]

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in fuse(self, normalize_node, input_name_to_nodes, output_name_to_node)
    444             new_node = self.create_attention_node(mask_index, matmul_q, matmul_k, matmul_v, add_q, add_k, add_v,
    445                                                   q_num_heads, self.hidden_size, root_input,
--> 446                                                   attention_last_node.output[0], add_qk_str)
    447             if new_node is None:
    448                 return

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in create_attention_node(self, mask_index, q_matmul, k_matmul, v_matmul, q_add, k_add, v_add, num_heads, hidden_size, input, output, add_qk_str)
    161             Union[NodeProto, None]: the node created or None if failed.
    162         """
--> 163         assert num_heads > 0
    164 
    165         if hidden_size > 0 and (hidden_size % num_heads) != 0:

AssertionError:

While trying to resolve the issue, I observed that it did not occur when optimizer from onnxruntime-tools was used with opt_level 99 (instead of the one in onnxruntime.transformers). But the code then threw Exceptions due to some skip layer normalization issues.

opened by vishalsrao 5

Unable to install transformer-deploy module

Any support would be appreciated:

When running demo/torchdynamo/benchmark.ipynb, specifically this cell (pasted code), I run into the error below.

from typing import Dict

import numpy as np
import torch
from onnxruntime import GraphOptimizationLevel

from transformers import AutoModel, PreTrainedModel
from transformer_deploy.backends.ort_utils import convert_fp16
from transformer_deploy.backends.onnx_utils import save_onnx

from dynamo_utils import (
    benchmark,
    check_output,
    get_dynamo_optimizer,
    get_onnx_inference,
    get_pytorch_inference,
    get_pytorch_input,
    plot_benchmarks,
    print_pytorch_profile,
    get_tensorrt_inference,
    seq_lengths,
)

import gc
import tensorrt as trt
from tensorrt.tensorrt import ICudaEngine, Logger, Runtime
import onnx
from transformer_deploy.backends.trt_utils import build_engine, save_engine

ModuleNotFoundError Traceback (most recent call last)Cell In [3], line 8 5 from onnxruntime import GraphOptimizationLevel 7 from transformers import AutoModel, PreTrainedModel----> 8 from transformer_deploy.backends.ort_utils import convert_fp16 9 from transformer_deploy.backends.onnx_utils import save_onnx 11 from dynamo_utils import ( 12 benchmark, 13 check_output, (...) 21 seq_lengths, 22 )ModuleNotFoundError: No module named 'transformer_deploy'

opened by elvinagam 4

Question-Answering example not working for batch_size > 1

I'm running demo/question-answering/triton_client.py from the examples directory. The script returns expected result with batch_size=1. However, if you make the batch_size > 1 in this line, it outputs only the result of the first element in the batch and other elements are ignored.

I saw #84 and #106 about the question-answering example and batch_size but I don't think they are related to this. The triton server does not yield in any errors.

Am I missing something here?

opened by lakshaykc 0
Support for constrained beam-search in T5

HF T5 model (actually seq2seq model in general) supports complex decoding schemes such as constrained beam search https://huggingface.co/blog/constrained-beam-search. In my use case, I just really need the simplest constrained beam search where decoded sequences have to belong to a pre-defined Trie. This can be done via https://huggingface.co/docs/transformers/internal/generation_utils#transformers.PrefixConstrainedLogitsProcessor

Is this possible for transformer-deploy ?

opened by junwang-wish 0

Attempting to run T5 ORT model in Triton inference server

Hi there,

Thanks again for this library!

We're trying to convert a fine-tuned T5 model to ONNX and run it in Triton. We've managed to convert the model to ONNX and use the T5 notebook guide to run the model just fine in python.

But trying to get it to run in Triton has been a challenge. In particular, we're not sure how to get past_key_values to be passed through in Triton. We have the decoder config as follows:

name: "t5-dec-if-node_onnx_model"
max_batch_size: 0
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"
input [
    {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ -1, -1 ]
    },
    {
        name: "encoder_hidden_states"
        data_type: TYPE_FP32
        dims: [ -1, -1, 2048 ]
    },
    {
        name: "enable_cache"
        data_type: TYPE_BOOL
        dims: [ 1 ]
    },
    
        {
            name: "past_key_values.0.decoder.key"
            data_type: TYPE_FP32
            dims: [-1, 32, -1, 64]
        },
        {
            name: "past_key_values.0.decoder.value"
            data_type: TYPE_FP32
            dims: [-1, 32, -1, 64]
        },
        {
            name: "past_key_values.0.encoder.key"
            data_type: TYPE_FP32
            dims: [-1, 32, -1, 64]
        },
        {
            name: "past_key_values.0.encoder.value"
            data_type: TYPE_FP32
            dims: [-1, 32, -1, 64]
        }
     ...
]
output [
    {
        name: "logits"
        data_type: TYPE_FP32
        dims: [ -1, -1, 32128 ]
    }
]
instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]

And when we do the following:

input_1 = tritonclient.http.InferInput(name="input_ids", shape=(1, 24), datatype="INT32")
input_2 = tritonclient.http.InferInput(name="encoder_hidden_states", shape=(1, 24, 2048), datatype="FP32")
input_3 = tritonclient.http.InferInput(name="enable_cache", shape=(1, ), datatype="BOOL")

input_1.set_data_from_numpy(input_ids)
input_2.set_data_from_numpy(encoder_hidden_states)
input_3.set_data_from_numpy(np.asarray([True]))

result = triton_client.infer(
    model_name='t5-dec-if-node_onnx_model', 
    inputs=[input_1, input_2, input_3], 
    outputs=[tritonclient.http.InferRequestedOutput(name="logits", binary_data=False)]
)

We get this error:

InferenceServerException: [request id: <id_unknown>] expected 99 inputs but got 3 inputs for model 't5-dec-if-node_onnx_model'

Any idea how we can fix this?

opened by samiur 1

Two GPU are slower than one
Hi, I run Triton web server on two GPUs NVIDIA RTX3090Ti with --shm-size 20g. When I do inference, I get time near 1.56s. But if I run web server with only one GPU set --gpus '"device=0"' after that I get the time near 860ms. Length of input sequence was 256 tokens. I've optimized GPT2-medium by your script.

convert_model -m gpt2-medium \ --backend tensorrt onnx \ --seq-len 32 512 512 \ --task text-generation --atol=2"
opened by OleksandrKorovii 0

Tensorrt engine

I tried running TRT based-off three methods:

python src/transformer-deploy/convert.py
exisiting docker image
build docker image from repo

In all three instances, I got back the same response while running TRT backend.

The command I have been trying to run (docker for example):

docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:latest bash -c "cd /project && \
  convert_model -m \"sentence-transformers/multi-qa-distilbert-cos-v1\" \
  --backend tensorrt onnx \
  --seq-len 128 128 256 \
  --batch-size 1 32 300"

When i pass only 'onnx' as backend param everything runs pretty smoothly. But face issues with 'tensorrt' backend.

[11/29/2022-10:58:17] [TRT] [E] 2: [optimizer.cpp::getFormatRequirements::2945] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. no supported formats)
[11/29/2022-10:58:17] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 417, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 308, in main
    engine: ICudaEngine = build_engine(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 206, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f88c85c12b0>, None
free(): invalid pointer

Would be great if I could have a workaround for this.

Versions: Python: 3.8.15 transformers-deploy: 0.5.3 TensorRT: 8.4.1.5 Onnxruntime (GPU): 1.12.0 transformers: 4.24.0

opened by imsiddhant07 1

Token type ids bug

Some models don't use token_type_ids in the forward pass. E.g. deberta has type_vocab_size=0 as a default value.

What happens is the model ignores token_type_ids (https://github.com/huggingface/transformers/blob/bac2d29a802803a7f2db8e8597a2ec81730afcc9/src/transformers/models/deberta/modeling_deberta.py#L810)

However, tokenizer doesn't know about this and token_type_ids is still in tokenizer.model_input_names.

This mismatch leads to

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.3 \
  bash -c "cd /project && \
    convert_model -m \"microsoft/deberta-base-mnli\" \
    --backend onnx \
    --seq-len 16 128 128"

docker run -itd --rm --gpus '"device=3"' -p8000:8000 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

And the triton inference server fails with

I1123 13:49:09.821427 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbf36000000' with size 268435456
I1123 13:49:09.821983 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1123 13:49:09.828017 1 model_repository_manager.cc:1206] loading: transformer_onnx_tokenize:1
I1123 13:49:09.828058 1 model_repository_manager.cc:1206] loading: transformer_onnx_model:1
I1123 13:49:09.830743 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
I1123 13:49:09.830786 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
I1123 13:49:09.830804 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
I1123 13:49:09.830814 1 onnxruntime.cc:2504] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1123 13:49:09.846110 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: transformer_onnx_model (version 1)
I1123 13:49:09.847111 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'transformer_onnx_model': inputs and outputs already specified
I1123 13:49:09.851839 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_model_0 (GPU device 0)
I1123 13:49:12.063610 1 onnxruntime.cc:2637] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I1123 13:49:12.063688 1 onnxruntime.cc:2583] TRITONBACKEND_ModelFinalize: delete model state
E1123 13:49:12.063708 1 model_repository_manager.cc:1355] failed to load 'transformer_onnx_model' version 1: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 inputs, model provides 2
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I1123 13:49:13.744756 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_tokenize_0 (GPU device 0)
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I1123 13:49:14.298233 1 model_repository_manager.cc:1352] successfully loaded 'transformer_onnx_tokenize' version 1
E1123 13:49:14.298380 1 model_repository_manager.cc:1559] Invalid argument: ensemble 'transformer_onnx_inference' depends on 'transformer_onnx_model' which has no loaded version
I1123 13:49:14.298438 1 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1123 13:49:14.298487 1 server.cc:586]
+-------------+----------------------------------------------------------------+----------------------------------------------------------------+
| Backend     | Path                                                           | Config                                                         |
+-------------+----------------------------------------------------------------+----------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.s | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
|             | o                                                              | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
|             |                                                                | s","default-max-batch-size":"4"}}                              |
|             |                                                                |                                                                |
| python      | /opt/tritonserver/backends/python/libtriton_python.so          | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
|             |                                                                | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
|             |                                                                | s","default-max-batch-size":"4"}}                              |
+-------------+----------------------------------------------------------------+----------------------------------------------------------------+

I1123 13:49:14.298549 1 server.cc:629]
+---------------------------+---------+---------------------------------------------------------------------------------------------------------+
| Model                     | Version | Status                                                                                                  |
+---------------------------+---------+---------------------------------------------------------------------------------------------------------+
| transformer_onnx_model    | 1       | UNAVAILABLE: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 i |
|                           |         | nputs, model provides 2                                                                                 |
| transformer_onnx_tokenize | 1       | READY                                                                                                   |
+---------------------------+---------+---------------------------------------------------------------------------------------------------------+

I1123 13:49:14.351997 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I1123 13:49:14.352405 1 tritonserver.cc:2176]
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                     |
| server_version                   | 2.24.0                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configu |
|                                  | ration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace                         |
| model_repository_path[0]         | /models                                                                                                    |
| model_control_mode               | MODE_NONE                                                                                                  |
| strict_model_config              | 0                                                                                                          |
| rate_limit                       | OFF                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                  |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                   |
| response_cache_byte_size         | 0                                                                                                          |
| min_supported_compute_capability | 6.0                                                                                                        |
| strict_readiness                 | 1                                                                                                          |
| exit_timeout                     | 30                                                                                                         |
+----------------------------------+------------------------------------------------------------------------------------------------------------+

I1123 13:49:14.352443 1 server.cc:260] Waiting for in-flight requests to complete.
I1123 13:49:14.352453 1 server.cc:276] Timeout 30: Found 0 model versions that have in-flight inferences
I1123 13:49:14.352460 1 model_repository_manager.cc:1230] unloading: transformer_onnx_tokenize:1
I1123 13:49:14.352525 1 server.cc:291] All models are stopped, unloading models
I1123 13:49:14.352534 1 server.cc:298] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I1123 13:49:15.352620 1 server.cc:298] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
I1123 13:49:15.444143 1 model_repository_manager.cc:1335] successfully unloaded 'transformer_onnx_tokenize' version 1
I1123 13:49:16.352790 1 server.cc:298] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

The proposed solution fixes this bug

opened by fursovia 2

Releases(v0.4.0)

v0.4.0(Feb 8, 2022)
add support for decoder based model (GPT-2) on both ONNX Runtime and TensorRT

refactor triton configuration generation (simplification)

add GPT-2 model documentation (notebook)

fix CPU quantization benchmark (was not using the quant model)

fix sentence transformers bug

Source code(tar.gz)
Source code(zip)
v0.3.0(Dec 28, 2021)
What's Changed

Update requirements_gpu.txt by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/22

refactoring by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/27

add CPU inference support by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/28

Add QAT support to more models by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/29

Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.2.0...v0.3.0
Source code(tar.gz)
Source code(zip)
v0.2.0(Dec 8, 2021)
support int-8 GPU quantization

add a tuto to perform quantization end to end

add QDQRoberta model

switch to ONNX opset 13

refactoring in the TensorRT engine creation

fix bugs

add auth token (for private HF repo)

What's Changed

Update triton by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/11

fix README.md by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/13

Fix install errors by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/20

Add auth token by @sam-writer in https://github.com/ELS-RD/transformer-deploy/pull/19

Support GPU INT-8 quantization by @pommedeterresautee in https://github.com/ELS-RD/transformer-deploy/pull/15

New Contributors

@sam-writer made their first contribution in https://github.com/ELS-RD/transformer-deploy/pull/20

Full Changelog: https://github.com/ELS-RD/transformer-deploy/compare/v0.1.1...v0.2.0
Source code(tar.gz)
Source code(zip)
v0.1.1(Nov 24, 2021)
update Docker image

update documentation

Source code(tar.gz)
Source code(zip)
v0.1.0(Nov 23, 2021)
switch from a proof of concept to a library

add support for TensorRT Python API (for best performances)

improve documentation (separate Hugging Face Infinity thing from the doc, add benchmark, etc.)

fix issues with mixed precision

add license

add tests, Github actions, Makefile

change the way the Docker image is built

Source code(tar.gz)
Source code(zip)
v0.0.1(Nov 8, 2021)

all the scripts to reproduce https://medium.com/p/e1be0057a51c
Source code(tar.gz)
Source code(zip)

Deploy optimized transformer based models on Nvidia Triton server

Related tags

Overview

🤗 Hugging Face Transformer submillisecond inference 🤯 and deployment on Nvidia Triton server

Baseline set by Hugging Face Infinity demo

Install dependencies

Generate optimized models

FastAPI server

Triton server

Triton server perf analysis

Call Triton HTTP API directly

Use TensorRT model in Triton server (instead of ONNX)

Comments

Releases(v0.4.0)

v0.4.0(Feb 8, 2022)

v0.3.0(Dec 28, 2021)

What's Changed

v0.2.0(Dec 8, 2021)

What's Changed

New Contributors

v0.1.1(Nov 24, 2021)

v0.1.0(Nov 23, 2021)

v0.0.1(Nov 8, 2021)

Owner

Lefebvre Sarrut Services

Deep Structured Instance Graph for Distilling Object Detectors (ICCV 2021)

EdiBERT is a generative model based on a bi-directional transformer, suited for image manipulation

We present a regularized self-labeling approach to improve the generalization and robustness properties of fine-tuning.

Explainability of the Implications of Supervised and Unsupervised Face Image Quality Estimations Through Activation Map Variation Analyses in Face Recognition Models

A simple python module to generate anchor (aka default/prior) boxes for object detection tasks.

Pytorch domain adaptation package

Computing Shapley values using VAEAC

Music Source Separation; Train & Eval & Inference piplines and pretrained models we used for 2021 ISMIR MDX Challenge.

Implementation of Shape Generation and Completion Through Point-Voxel Diffusion

Official PyTorch Implementation of Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition, ICCV 2021

Face Detection and Alignment using Multi-task Cascaded Convolutional Networks (MTCNN)

an implementation of 3D Ken Burns Effect from a Single Image using PyTorch

Code for "R-GCN: The R Could Stand for Random"

Building blocks for uncertainty-aware cycle consistency presented at NeurIPS'21.

PyTorch implementation of Progressive Growing of GANs for Improved Quality, Stability, and Variation.

The official implementation of Equalization Loss for Long-Tailed Object Recognition (CVPR 2020) based on Detectron2

Dilated Convolution with Learnable Spacings PyTorch

Keras implementation of Deeplab v3+ with pretrained weights

Semi-Supervised Semantic Segmentation with Pixel-Level Contrastive Learning from a Class-wise Memory Bank

Deep metric learning methods implemented in Chainer