Mapping a variable-length sentence to a fixed-length vector using BERT model

Overview

bert-as-service

Using BERT model as a sentence encoding service, i.e. mapping a variable-length sentence to a fixed-length vector.

Made by Han Xiao • 🌐 https://hanxiao.github.io

What is it

BERT is an NLP model developed by Google for pre-training language representations. It leverages an enormous amount of plain text data publicly available on the web and is trained in an unsupervised manner. Pre-training a BERT model is a fairly expensive yet one-time procedure for each language. Fortunately, Google has released several pre-trained models that you can download (see the list under Getting Started).

Sentence encoding/embedding is an upstream task required in many NLP applications, e.g. sentiment analysis and text classification. The goal is to represent a variable-length sentence as a fixed-length vector, e.g. hello world as [0.1, 0.3, 0.9]. Each element of the vector should "encode" some semantics of the original sentence.

Finally, bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.

Highlights

  • 🔭 State-of-the-art: built on the pretrained 12/24-layer BERT models released by Google AI, which are widely considered a milestone in the NLP community.
  • 🐣 Easy-to-use: requires only two lines of code to get sentence/token-level encodes.
  • Fast: 900 sentences/s on a single Tesla M40 24GB. Low latency, optimized for speed. See benchmark.
  • 🐙 Scalable: scales smoothly across multiple GPUs and multiple clients without worrying about concurrency. See benchmark.
  • 💎 Reliable: tested on multi-billion sentences; runs for days without a break, OOM or any nasty exceptions.

More features: XLA & FP16 support; mixed GPU-CPU workloads; optimized graph; tf.data friendly; customized tokenizer; flexible pooling strategy; built-in HTTP server and dashboard; async encoding; multicasting; etc.

Install

Install the server and client via pip. They can be installed separately or even on different machines:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Note that the server MUST be running on Python >= 3.5 with Tensorflow >= 1.10 (one-point-ten). Again, the server does not support Python 2!

☝️ The client, by contrast, can run on both Python 2 and 3.

Getting Started

1. Download a Pre-trained BERT Model

Download a model listed below, then uncompress the zip file into some folder, say /tmp/english_L-12_H-768_A-12/

List of released pretrained BERT models:

  • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Cased (Old): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Optional: fine-tuning the model on your downstream task. Why is it optional?

2. Start the BERT service

After installing the server, you should be able to use bert-serving-start CLI as follows:

bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4 

This will start a service with four workers, meaning that it can handle up to four concurrent requests. More concurrent requests will be queued in a load balancer. Details can be found in our FAQ and the benchmark on number of clients.

Below shows what the server looks like when starting correctly:

Alternatively, one can start the BERT service in a Docker container:

docker build -t bert-as-service -f ./docker/Dockerfile .
NUM_WORKER=1
PATH_MODEL=/PATH_TO/_YOUR_MODEL/
docker run --runtime nvidia -dit -p 5555:5555 -p 5556:5556 -v $PATH_MODEL:/model -t bert-as-service $NUM_WORKER

3. Use Client to Get Sentence Encodes

Now you can encode sentences simply as follows:

from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])

It will return an ndarray (or List[List[float]] if you wish), in which each row is a fixed-length vector representing a sentence. Have thousands of sentences? Just encode! Don't even bother to batch; the server will take care of it.

As a feature of BERT, you may get encodes of a pair of sentences by concatenating them with ||| (with whitespace before and after), e.g.

bc.encode(['First do it ||| then do it right'])

Below shows what the server looks like while encoding:

Use BERT Service Remotely

One may also start the service on one (GPU) machine and call it from another (CPU) machine as follows:

# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx')  # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])

Note that you only need pip install -U bert-serving-client in this case; the server side is not required. You may also call the service via HTTP requests.

💡 Want to learn more? Check out the tutorials below.

Server and Client API

The best way to learn the latest bert-as-service API is to read the documentation.

Server API

Please always refer to the latest server-side API in the documentation. You can also get the latest usage via:

bert-serving-start --help
bert-serving-terminate --help
bert-serving-benchmark --help

| Argument | Type | Default | Description |
|---|---|---|---|
| model_dir | str | Required | folder path of the pre-trained BERT model. |
| tuned_model_dir | str | (Optional) | folder path of a fine-tuned BERT model. |
| ckpt_name | str | bert_model.ckpt | filename of the checkpoint file. |
| config_name | str | bert_config.json | filename of the JSON config file for the BERT model. |
| graph_tmp_dir | str | None | path to the graph temp file |
| max_seq_len | int | 25 | maximum length of a sequence; longer sequences will be trimmed on the right side. Set it to NONE to dynamically use the longest sequence in a (mini)batch. |
| cased_tokenization | bool | False | whether the tokenizer should skip the default lowercasing and accent removal. Should be used for e.g. the multilingual cased pretrained BERT model. |
| mask_cls_sep | bool | False | mask the embeddings on [CLS] and [SEP] with zero. |
| num_worker | int | 1 | number of (GPU/CPU) workers running the BERT model; each works in a separate process. |
| max_batch_size | int | 256 | maximum number of sequences handled by each worker; a larger batch will be partitioned into smaller batches. |
| priority_batch_size | int | 16 | a batch smaller than this size will be labeled as high priority and jumps forward in the job queue to get its result faster |
| port | int | 5555 | port for pushing data from client to server |
| port_out | int | 5556 | port for publishing results from server to client |
| http_port | int | None | server port for receiving HTTP requests |
| cors | str | * | setting "Access-Control-Allow-Origin" for HTTP requests |
| pooling_strategy | str | REDUCE_MEAN | the pooling strategy for generating encoding vectors; valid values are NONE, REDUCE_MEAN, REDUCE_MAX, REDUCE_MEAN_MAX, CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN. These strategies are explained in the FAQ. To get an encoding for each token in the sequence, set this to NONE. |
| pooling_layer | list | [-2] | the encoding layer that pooling operates on, where -1 means the last layer, -2 means the second-to-last, [-1, -2] means concatenating the result of the last two layers, etc. |
| gpu_memory_fraction | float | 0.5 | the fraction of the overall amount of memory that each GPU should be allocated per worker |
| cpu | bool | False | run on CPU instead of GPU |
| xla | bool | False | enable XLA compiler for graph optimization (experimental!) |
| fp16 | bool | False | use float16 precision (experimental) |
| device_map | list | [] | specify the list of GPU device ids that will be used (id starts from 0) |
| show_tokens_to_client | bool | False | send tokenization results to the client |

Client API

Please always refer to the latest client-side API in the documentation. The client side provides a Python class called BertClient, which accepts the following arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| ip | str | localhost | IP address of the server |
| port | int | 5555 | port for pushing data from client to server, must be consistent with the server-side config |
| port_out | int | 5556 | port for publishing results from server to client, must be consistent with the server-side config |
| output_fmt | str | ndarray | the output format of the sentence encodes, either a numpy array or a Python List[List[float]] (ndarray/list) |
| show_server_config | bool | False | whether to show server configs when first connected |
| check_version | bool | True | whether to force client and server to have the same version |
| identity | str | None | a UUID that identifies the client, useful in multi-casting |
| timeout | int | -1 | set the timeout (milliseconds) for the receive operation on the client |

A BertClient implements the following methods and properties:

| Method | Description |
|---|---|
| .encode() | Encode a list of strings to a list of vectors |
| .encode_async() | Asynchronous encode batches from a generator |
| .fetch() | Fetch all encoded vectors from server and return them in a generator, use it with .encode_async() or .encode(blocking=False). Sending order is NOT preserved. |
| .fetch_all() | Fetch all encoded vectors from server and return them in a list, use it with .encode_async() or .encode(blocking=False). Sending order is preserved. |
| .close() | Gracefully close the connection between the client and the server |
| .status | Get the client status in JSON format |
| .server_status | Get the server status in JSON format |

📖 Tutorial

The full list of examples can be found in example/. You can run each via python example/example-k.py. Most of the examples require you to start a BertServer first; please follow the instructions in Getting Started. Note that although BertClient works on both Python 2.x and 3.x, the examples are only tested on Python 3.6.

Building a QA semantic search engine in 3 minutes

The complete example can be found in example8.py.

As the first example, we will implement a simple QA search engine using bert-as-service in just three minutes. No kidding! The goal is to find similar questions to user's input and return the corresponding answer. To start, we need a list of question-answer pairs. Fortunately, this README file already contains a list of FAQ, so I will just use that to make this example perfectly self-contained. Let's first load all questions and show some statistics.

import numpy as np

prefix_q = '##### **Q:** '
with open('README.md') as fp:
    questions = [v.replace(prefix_q, '').strip() for v in fp if v.strip() and v.startswith(prefix_q)]
    print('%d questions loaded, avg. len of %d' % (len(questions), np.mean([len(d.split()) for d in questions])))

This gives 33 questions loaded, avg. len of 9. So looks like we have enough questions. Now start a BertServer with uncased_L-12_H-768_A-12 pretrained BERT model:

bert-serving-start -num_worker=1 -model_dir=/data/cips/data/lab/data/model/uncased_L-12_H-768_A-12

Next, we need to encode our questions into vectors:

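from bert_serving.client import BertClient

# the server above listens on the default ports (5555/5556), so no port arguments are needed
bc = BertClient()
doc_vecs = bc.encode(questions)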

Finally, we are ready to receive new query and perform a simple "fuzzy" search against the existing questions. To do that, every time a new query is coming, we encode it as a vector and compute its dot product with doc_vecs; sort the result descendingly; and return the top-k similar questions as follows:

topk = 5  # number of top matches to print; adjust as needed
while True:
    query = input('your question: ')
    query_vec = bc.encode([query])[0]
    # compute normalized dot product as score
    score = np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)
    topk_idx = np.argsort(score)[::-1][:topk]
    for idx in topk_idx:
        print('> %s\t%s' % (score[idx], questions[idx]))

That's it! Now run the code and type your query, see how this search engine handles fuzzy match:
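The scoring step above divides only by the document norms, which is enough for ranking because the query norm is the same for every candidate. As a minimal, self-contained sketch (random vectors stand in for real BERT encodings, and the helper name top_k_cosine is hypothetical, not part of bert-as-service), full cosine similarity could be computed like this:

```python
import numpy as np


def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k rows of doc_vecs most similar to query_vec."""
    # normalize both sides so that the dot product becomes cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(scores)[::-1][:k]
    return idx, scores[idx]


# stand-in data: 5 "documents" of dimension 768, not real BERT output
rng = np.random.RandomState(0)
doc_vecs = rng.randn(5, 768).astype(np.float32)
# a query that is a slightly perturbed copy of document 2
query_vec = doc_vecs[2] + 0.01 * rng.randn(768).astype(np.float32)

idx, scores = top_k_cosine(query_vec, doc_vecs, k=2)
print(idx, scores)
```

With real encodings, doc_vecs would be the output of bc.encode(questions) and query_vec the encoding of the user's query.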

Serving a fine-tuned BERT model

Pretrained BERT models often show quite "okayish" performance on many tasks. However, to release the true power of BERT, fine-tuning on the downstream task (or on domain-specific data) is necessary. In this example, I will show you how to serve a fine-tuned BERT model.

We follow the instruction in "Sentence (and sentence-pair) classification tasks" and use run_classifier.py to fine tune uncased_L-12_H-768_A-12 model on MRPC task. The fine-tuned model is stored at /tmp/mrpc_output/, which can be changed by specifying --output_dir of run_classifier.py.

If you look into /tmp/mrpc_output/, it contains something like:

checkpoint                                        128
eval                                              4.0K
eval_results.txt                                  86
eval.tf_record                                    219K
events.out.tfevents.1545202214.TENCENT64.site     6.1M
events.out.tfevents.1545203242.TENCENT64.site     14M
graph.pbtxt                                       9.0M
model.ckpt-0.data-00000-of-00001                  1.3G
model.ckpt-0.index                                23K
model.ckpt-0.meta                                 3.9M
model.ckpt-343.data-00000-of-00001                1.3G
model.ckpt-343.index                              23K
model.ckpt-343.meta                               3.9M
train.tf_record                                   2.0M

Don't be afraid of those mysterious files, as the only important one to us is model.ckpt-343.data-00000-of-00001 (looks like my training stops at the 343 step. One may get model.ckpt-123.data-00000-of-00001 or model.ckpt-9876.data-00000-of-00001 depending on the total training steps). Now we have collected all three pieces of information that are needed for serving this fine-tuned model:

  • The pretrained model is downloaded to /pretrained/uncased_L-12_H-768_A-12;
  • Our fine-tuned model is stored at /tmp/mrpc_output/;
  • Our fine-tuned model checkpoint is named model.ckpt-343 (the common prefix of the .data, .index and .meta files).

Now start a BertServer by putting three pieces together:

bert-serving-start -model_dir=/pretrained/uncased_L-12_H-768_A-12 -tuned_model_dir=/tmp/mrpc_output/ -ckpt_name=model.ckpt-343

After the server started, you should find this line in the log:

I:GRAPHOPT:[gra:opt: 50]:checkpoint (override by fine-tuned model): /tmp/mrpc_output/model.ckpt-343

This means the BERT parameters are overridden and successfully loaded from our fine-tuned /tmp/mrpc_output/model.ckpt-343. Done!

In short, find your fine-tuned model path and checkpoint name, then feed them to -tuned_model_dir and -ckpt_name, respectively.

Getting ELMo-like contextual word embedding

Start the server with pooling_strategy set to NONE.

bert-serving-start -pooling_strategy NONE -model_dir /tmp/english_L-12_H-768_A-12/

To get the word embedding corresponding to each token, you can simply index into the result as follows:

from bert_serving.client import BertClient

# server started with max_seq_len = 25 and pooling_strategy = NONE
bc = BertClient()
vec = bc.encode(['hey you', 'whats up?'])

vec  # shape [2, 25, 768]
vec[0]  # shape [25, 768], token-level embeddings for `hey you`
vec[0][0]  # shape [768], word embedding for `[CLS]`
vec[0][1]  # shape [768], word embedding for `hey`
vec[0][2]  # shape [768], word embedding for `you`
vec[0][3]  # shape [768], word embedding for `[SEP]`
vec[0][4]  # shape [768], word embedding for padding symbol
vec[0][25]  # error, index out of range!

Note that no matter how long your original sequence is, the service will always return a [max_seq_len, 768] matrix for every sequence. When using slice index to get the word embedding, beware of the special tokens padded to the sequence, i.e. [CLS], [SEP], 0_PAD.
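To make the warning about special tokens concrete, here is a small sketch that drops [CLS], [SEP] and padding rows from one sequence matrix. It assumes you already know how many real tokens the sequence has; the function name strip_special_and_padding and the fake input are illustrative only.

```python
import numpy as np


def strip_special_and_padding(seq_matrix, num_tokens):
    """Keep only the rows that belong to real tokens.

    seq_matrix: [max_seq_len, hidden] matrix for one sequence
    num_tokens: number of real tokens, excluding [CLS], [SEP] and padding
    """
    # row 0 is [CLS], rows 1..num_tokens are the tokens, row num_tokens + 1 is [SEP]
    return seq_matrix[1:1 + num_tokens]


# stand-in for one encoded sequence with max_seq_len=25 and hidden size 768
fake = np.zeros((25, 768), dtype=np.float32)
fake[:4] = 1.0  # pretend [CLS], `hey`, `you`, [SEP] have non-zero values

tokens_only = strip_special_and_padding(fake, num_tokens=2)
print(tokens_only.shape)  # (2, 768)
```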

Using your own tokenizer

Often you want to use your own tokenizer to segment sentences instead of the default one from BERT. Simply call encode(is_tokenized=True) on the client side as follows:

texts = ['hello world!', 'good day']

# a naive whitespace tokenizer
texts2 = [s.split() for s in texts]

vecs = bc.encode(texts2, is_tokenized=True)

This gives a [2, 25, 768] tensor, where the first [25, 768] slice corresponds to the token-level encoding of "hello world!". If you look into its values, you will find that only the first four rows (indices 0 to 3) have values; all the others are zeros. This is because BERT considers "hello world!" as four tokens: [CLS] hello world! [SEP]. The rest are padding symbols and are masked out before output.

Note that there is no need to start a separate server for handling tokenized/untokenized sentences. The server can tell and handle both cases automatically.

Sometimes you want to know explicitly the tokenization performed on the server side to better understand the embedding result. One such case is requesting word embeddings from the server (with -pooling_strategy NONE), where one wants to tell which words are tokenized and which are unrecognized. You can get such information with the following steps:

  1. enabling -show_tokens_to_client on the server side;
  2. calling the server via encode(..., show_tokens=True).

For example, a basic usage like

bc.encode(['hello world!', 'thisis it'], show_tokens=True)

returns a tuple, where the first element is the embedding and the second is the tokenization result from the server:

(array([[[ 0.        , -0.        ,  0.        , ...,  0.        , -0.        , -0.        ],
        [ 1.1100919 , -0.20474958,  0.9895898 , ...,  0.3873255  , -1.4093989 , -0.47620595],
        ..., -0.        , -0.        ]],

       [[ 0.        , -0.        ,  0.        , ...,  0.        , 0.        ,  0.        ],
        [ 0.6293478 , -0.4088499 ,  0.6022662 , ...,  0.41740108, 1.214456  ,  1.2532915 ],
        ..., 0.        ,  0.        ]]], dtype=float32),
         
          [['[CLS]', 'hello', 'world', '!', '[SEP]'], ['[CLS]', 'this', '##is', 'it', '[SEP]']])

When using your own tokenization, you may still want to check if the server respects your tokens. For example,

bc.encode([['hello', 'world!'], ['thisis', 'it']], show_tokens=True, is_tokenized=True)

returns:

(array([[[ 0.        , -0.        ,  0.        , ...,  0.       ,  -0.        ,  0.        ],
        [ 1.1111546 , -0.56572634,  0.37183186, ...,  0.02397121,  -0.5445367 ,  1.1009651 ],
        ..., -0.        ,  0.        ]],

       [[ 0.        ,  0.        ,  0.        , ...,  0.        ,  -0.        ,  0.        ],
        [ 0.39262453,  0.3782491 ,  0.27096173, ...,  0.7122045 ,  -0.9874849 ,  0.9318679 ],
        ..., -0.        ,  0.        ]]], dtype=float32),
         
         [['[CLS]', 'hello', '[UNK]', '[SEP]'], ['[CLS]', '[UNK]', 'it', '[SEP]']])

One can observe that world! and thisis are not recognized on the server, hence they are set to [UNK].
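The behaviour observed above can be mimicked with a toy vocabulary lookup. This is only an illustration of the outcome, not the server's actual code; the vocabulary and the function name lookup_tokens are made up for the example.

```python
def lookup_tokens(tokens, vocab):
    """Wrap pre-tokenized input with [CLS]/[SEP] and replace unknown tokens by [UNK]."""
    body = [t if t in vocab else '[UNK]' for t in tokens]
    return ['[CLS]'] + body + ['[SEP]']


# a tiny made-up vocabulary; the real BERT vocabulary is far larger
toy_vocab = {'hello', 'world', '!', 'this', 'it', '##is'}

print(lookup_tokens(['hello', 'world!'], toy_vocab))
print(lookup_tokens(['thisis', 'it'], toy_vocab))
```

Because the tokens are taken as given, world! and thisis are looked up as whole units and fall back to [UNK], whereas the default tokenizer would have split them into known pieces.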

Finally, beware that the pretrained BERT Chinese from Google is character-based, i.e. its vocabulary is made of single Chinese characters. Therefore it makes no sense if you use word-level segmentation algorithm to pre-process the data and feed to such model.

Extremely curious readers may notice that the first row in the above example is all-zero even though the tokenization result includes [CLS] (well done, detective!). The reason is that the tokenization result always includes [CLS] and [SEP], regardless of the setting of -mask_cls_sep. This could be useful when you want to align the tokens afterwards. Remember, -mask_cls_sep only masks [CLS] and [SEP] out of the computation. It doesn't affect the tokenization algorithm.

Using BertClient with tf.data API

The complete example can be found in example4.py. There is also an example in Keras.

The tf.data API enables you to build complex input pipelines from simple, reusable pieces. One can also use BertClient to encode sentences on-the-fly and use the vectors in a downstream model. Here is an example:

import json

import tensorflow as tf
from bert_serving.client import BertClient

batch_size = 256
num_parallel_calls = 4
num_clients = num_parallel_calls * 2  # should be greater than `num_parallel_calls`

# start a pool of clients
bc_clients = [BertClient(show_server_config=False) for _ in range(num_clients)]


def get_encodes(x):
    # x is `batch_size` of lines, each of which is a json object
    samples = [json.loads(l) for l in x]
    text = [s['raw_text'] for s in samples]  # List[str]
    labels = [s['label'] for s in samples]  # List[str]
    # get a client from available clients
    bc_client = bc_clients.pop()
    features = bc_client.encode(text)
    # after use, put it back
    bc_clients.append(bc_client)
    return features, labels


ds = (tf.data.TextLineDataset(train_fp).batch(batch_size)
        .map(lambda x: tf.py_func(get_encodes, [x], [tf.float32, tf.string]),  num_parallel_calls=num_parallel_calls)
        .map(lambda x, y: {'feature': x, 'label': y})
        .make_one_shot_iterator().get_next())

The trick here is to start a pool of BertClient and reuse them one by one. In this way, we can fully harness the power of num_parallel_calls of Dataset.map() API.
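The pool in the example is a plain list with pop() and append(). One possible variation, sketched below under the assumption that you would rather block than fail when every client is busy, is to keep the clients in a queue.Queue. FakeClient is a stand-in so the sketch runs without a server, and ClientPool is a hypothetical helper, not part of bert-as-service.

```python
import queue
import threading


class FakeClient:
    """Stand-in for BertClient so the sketch runs without a server."""

    def encode(self, texts):
        # pretend the "encoding" of a text is just its length
        return [[float(len(t))] for t in texts]


class ClientPool:
    def __init__(self, clients):
        self._q = queue.Queue()
        for c in clients:
            self._q.put(c)

    def encode(self, texts):
        client = self._q.get()  # blocks until a client is free
        try:
            return client.encode(texts)
        finally:
            self._q.put(client)  # always hand the client back


pool = ClientPool([FakeClient() for _ in range(4)])
results = []
threads = [threading.Thread(target=lambda: results.append(pool.encode(['hello', 'hi'])))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 8
```

In a real pipeline, FakeClient() would be replaced by BertClient(show_server_config=False) and pool.encode would be called inside get_encodes.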

Training a text classifier using BERT features and tf.estimator API

The complete example can be found in example5.py.

Following the last example, we can easily extend it to a full classifier using the tf.estimator API. One only needs a minor change to the input function as follows:

estimator = DNNClassifier(
    hidden_units=[512],
    feature_columns=[tf.feature_column.numeric_column('feature', shape=(768,))],
    n_classes=len(laws),
    config=run_config,
    label_vocabulary=laws_str,
    dropout=0.1)

input_fn = lambda fp: (tf.data.TextLineDataset(fp)
                       .apply(tf.contrib.data.shuffle_and_repeat(buffer_size=10000))
                       .batch(batch_size)
                       .map(lambda x: tf.py_func(get_encodes, [x], [tf.float32, tf.string]), num_parallel_calls=num_parallel_calls)
                       .map(lambda x, y: ({'feature': x}, y))
                       .prefetch(20))

train_spec = TrainSpec(input_fn=lambda: input_fn(train_fp))
eval_spec = EvalSpec(input_fn=lambda: input_fn(eval_fp), throttle_secs=0)
train_and_evaluate(estimator, train_spec, eval_spec)

The complete example can be found in example5.py, in which a simple MLP is built on BERT features for predicting the relevant articles according to the fact description in the law documents. The problem is a part of the Chinese AI and Law Challenge Competition.

Saving and loading with TFRecord data

The complete example can be found in example6.py.

The TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. You can also pre-encode all your sequences and store their encodings in a TFRecord file, then later load it to build a tf.data.Dataset. For example, to write encodings into a TFRecord file:

bc = BertClient()
list_vec = bc.encode(lst_str)
list_label = [0 for _ in lst_str]  # a dummy list of all-zero labels

# write to tfrecord
with tf.python_io.TFRecordWriter('tmp.tfrecord') as writer:
    def create_float_feature(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))

    def create_int_feature(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

    for (vec, label) in zip(list_vec, list_label):
        features = {'features': create_float_feature(vec), 'labels': create_int_feature([label])}
        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())

Now we can load from it and build a tf.data.Dataset:

def _decode_record(record):
    """Decodes a record to a TensorFlow example."""
    return tf.parse_single_example(record, {
        'features': tf.FixedLenFeature([768], tf.float32),
        'labels': tf.FixedLenFeature([], tf.int64),
    })

ds = (tf.data.TFRecordDataset('tmp.tfrecord').repeat().shuffle(buffer_size=100).apply(
    tf.contrib.data.map_and_batch(lambda record: _decode_record(record), batch_size=64))
      .make_one_shot_iterator().get_next())

To save word/token-level embeddings to TFRecord, one needs to first flatten the [max_seq_len, num_hidden] tensor into a 1D array as follows:

def create_float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values.reshape(-1)))

And later reconstruct the shape when loading it:

name_to_features = {
    "feature": tf.FixedLenFeature([max_seq_length * num_hidden], tf.float32),
    "label_ids": tf.FixedLenFeature([], tf.int64),
}
    
def _decode_record(record, name_to_features):
    """Decodes a record to a TensorFlow example."""
    example = tf.parse_single_example(record, name_to_features)
    example['feature'] = tf.reshape(example['feature'], [max_seq_length, -1])
    return example

Be careful, this will generate a huge TFRecord file.
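A quick sketch with NumPy shows the flatten/reshape round trip and gives a rough feel for the file size. The helper approx_payload_bytes is hypothetical and counts only the raw float32 payload; record framing and labels add a little more.

```python
import numpy as np

max_seq_len, num_hidden = 25, 768

# stand-in for one token-level encoding returned with pooling_strategy NONE
token_vecs = np.arange(max_seq_len * num_hidden, dtype=np.float32).reshape(max_seq_len, num_hidden)

flat = token_vecs.reshape(-1)                     # what gets written to the record
restored = flat.reshape(max_seq_len, num_hidden)  # what the loader reconstructs
assert np.array_equal(restored, token_vecs)


def approx_payload_bytes(num_sequences, max_seq_len, num_hidden, bytes_per_float=4):
    """Raw float payload only, ignoring TFRecord framing and labels."""
    return num_sequences * max_seq_len * num_hidden * bytes_per_float


# one million sequences at 25 x 768 floats each
print(approx_payload_bytes(1000000, max_seq_len, num_hidden) / 1e9, 'GB')  # 76.8 GB
```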

Asynchronous encoding

The complete example can be found in example2.py.

BertClient.encode() offers a nice synchronous way to get sentence encodes. However, sometimes we want to do it in an asynchronous manner by feeding all textual data to the server first, fetching the encoded results later. This can be easily done by:

# an endless data stream, generating data in an extremely fast speed
def text_gen():
    while True:
        yield lst_str  # yield a batch of text lines

bc = BertClient()

# get encoded vectors
for j in bc.encode_async(text_gen(), max_num_batch=10):
    print('received %d x %d' % (j.shape[0], j.shape[1]))

Broadcasting to multiple clients

The complete example can be found in example3.py.

The encoded result is routed to the client according to its identity. If you have multiple clients with the same identity, then they all receive the results! You can use this multicast feature to do some cool things, e.g. training multiple different models (some using scikit-learn, some using tensorflow) in multiple separate processes while calling BertServer only once. In the example below, bc and its two clones will all receive the encoded vectors.

import threading

# clone a client by reusing the identity
def client_clone(id, idx):
    bc = BertClient(identity=id)
    for j in bc.listen():
        print('clone-client-%d: received %d x %d' % (idx, j.shape[0], j.shape[1]))

bc = BertClient()
# start two cloned clients sharing the same identity as bc
for j in range(2):
    threading.Thread(target=client_clone, args=(bc.identity, j)).start()

for _ in range(3):
    bc.encode(lst_str)

Monitoring the service status in a dashboard

The complete example can be found in plugin/dashboard/.

As a part of the infrastructure, one may also want to monitor the service status and show it in a dashboard. To do that, we can use:

bc = BertClient(ip='server_ip')

json.dumps(bc.server_status, ensure_ascii=False)

This gives the current status of the server, including the number of requests, number of clients, etc., in JSON format. The only thing remaining is to start an HTTP server that returns this JSON to the frontend that renders it.

Alternatively, one may simply expose an HTTP port when starting a server via:

bert-serving-start -http_port 8081 -model_dir ...

This will allow one to use javascript or curl to fetch the server status at port 8081.

plugin/dashboard/index.html shows a simple dashboard based on Bootstrap and Vue.js.

Using bert-as-service to serve HTTP requests in JSON

Besides calling bert-as-service from Python, one can also call it via HTTP requests in JSON. This is especially useful when direct access to the low-level transport layer is prohibited. Behind the scenes, bert-as-service spawns a Flask server in a separate process and then reuses a BertClient instance as a proxy to communicate with the ventilator.

To enable the built-in HTTP server, we need to first (re)install the server with some extra Python dependencies:

pip install -U bert-serving-server[http]

Then simply start the server with:

bert-serving-start -model_dir=/YOUR_MODEL -http_port 8125

Done! Your server is now listening for HTTP requests at port 8125, alongside the regular TCP requests from BertClient.

To send an HTTP request, first prepare the payload in JSON as follows:

{
    "id": 123,
    "texts": ["hello world", "good day!"],
    "is_tokenized": false
}

Here, id is a unique identifier that helps you synchronize the results; is_tokenized has the same meaning as in the BertClient API and is false by default.

Then simply call the server at /encode via an HTTP POST request. You can use javascript or whatever you like; here is an example using curl:

curl -X POST http://xx.xx.xx.xx:8125/encode \
  -H 'content-type: application/json' \
  -d '{"id": 123,"texts": ["hello world", "good day!"], "is_tokenized": false}'

This returns a JSON response:

{
    "id": 123,
    "results": [[768 float-list], [768 float-list]],
    "status": 200
}

To get the server's status and client's status, you can send GET requests at /status/server and /status/client, respectively.
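For readers who prefer Python over curl, the same POST request can be assembled with the standard library. This is a sketch only: build_encode_request is a hypothetical helper, the host and port are placeholders, and the request is built but not sent so that the snippet runs without a server.

```python
import json
import urllib.request


def build_encode_request(host, port, texts, req_id=123, is_tokenized=False):
    """Build (but do not send) a POST request for the /encode endpoint."""
    payload = {'id': req_id, 'texts': texts, 'is_tokenized': is_tokenized}
    return urllib.request.Request(
        url='http://%s:%d/encode' % (host, port),
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST')


req = build_encode_request('localhost', 8125, ['hello world', 'good day!'])
print(req.get_method(), req.full_url)

# To actually send it (requires a server started with -http_port 8125):
# with urllib.request.urlopen(req) as resp:
#     result = json.loads(resp.read().decode('utf-8'))
```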

Finally, one may also configure CORS to restrict public access to the server by specifying -cors when starting bert-serving-start. By default -cors=*, meaning the server is publicly accessible.

Starting BertServer from Python

Besides shell, one can also start a BertServer from python. Simply do

from bert_serving.server.helper import get_args_parser
from bert_serving.server import BertServer
args = get_args_parser().parse_args(['-model_dir', 'YOUR_MODEL_PATH_HERE',
                                     '-port', '5555',
                                     '-port_out', '5556',
                                     '-max_seq_len', 'NONE',
                                     '-mask_cls_sep',
                                     '-cpu'])
server = BertServer(args)
server.start()

Note that it's basically mirroring the arg-parsing behavior in CLI, so everything in that .parse_args([]) list should be string, e.g. ['-port', '5555'] not ['-port', 5555].
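If your settings live in a Python dict, a small helper can produce the all-string list that .parse_args() expects. This is a sketch; to_cli_args is a hypothetical helper and not part of bert-as-service.

```python
def to_cli_args(options):
    """Turn {'port': 5555, 'cpu': True} into ['-port', '5555', '-cpu']."""
    args = []
    for name, value in options.items():
        if value is True:
            args.append('-' + name)  # boolean flag, takes no value
        elif value is False or value is None:
            continue  # leave the flag at its default
        elif isinstance(value, (list, tuple)):
            args.append('-' + name)
            args.extend(str(v) for v in value)
        else:
            args.extend(['-' + name, str(value)])
    return args


arg_list = to_cli_args({'model_dir': 'YOUR_MODEL_PATH_HERE', 'port': 5555, 'max_seq_len': 'NONE',
                        'pooling_layer': [-4, -3], 'mask_cls_sep': True, 'cpu': False})
print(arg_list)
```

The resulting list can then be passed to get_args_parser().parse_args(arg_list).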

To shut down the server, you may call the static method in the BertServer class via:

BertServer.shutdown(port=5555)

Or via shell CLI:

bert-serving-terminate -port 5555

This will terminate the server running on localhost at port 5555. You may also use it to terminate a remote server, see bert-serving-terminate --help for details.

💬 FAQ

Q: Do you have a paper or other written explanation to introduce your model's details?

A: The design philosophy and technical details can be found in my blog post.

Q: Where does the BERT code come from?

A: BERT code of this repo is forked from the original BERT repo with necessary modification, especially in extract_features.py.

Q: How large is a sentence vector?

A: In general, each sentence is translated to a 768-dimensional vector. Depending on the pretrained BERT you are using, pooling_strategy and pooling_layer, the dimensions of the output vector could be different.
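As a rough sketch of how these settings interact, the helper below estimates the sentence vector size. It assumes that concatenating several layers multiplies the hidden size by the number of layers and that REDUCE_MEAN_MAX doubles it; encoding_dim is a hypothetical name, and the calculation does not apply to pooling_strategy NONE, which returns one vector per token.

```python
def encoding_dim(hidden_size, pooling_layer=(-2,), pooling_strategy='REDUCE_MEAN'):
    """Estimate the length of one sentence vector (assumptions stated above)."""
    dim = hidden_size * len(pooling_layer)  # layers are concatenated on the last axis
    if pooling_strategy == 'REDUCE_MEAN_MAX':
        dim *= 2  # mean and max are concatenated
    return dim


print(encoding_dim(768))                                      # 768
print(encoding_dim(768, pooling_strategy='REDUCE_MEAN_MAX'))  # 1536
print(encoding_dim(1024, pooling_layer=(-4, -3, -2, -1)))     # 4096
```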

Q: How do you get the fixed representation? Did you do pooling or something?

A: Yes, pooling is required to get a fixed representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.
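To illustrate the idea on a toy input, here is a NumPy sketch of a few pooling strategies over one sequence. The masking of padded positions is an assumption made for the example, and pool is a hypothetical function, not the service's implementation.

```python
import numpy as np


def pool(hidden, mask, strategy='REDUCE_MEAN'):
    """hidden: [seq_len, dim] layer output; mask: [seq_len], 1 for real tokens, 0 for padding."""
    m = mask[:, None].astype(hidden.dtype)
    if strategy == 'REDUCE_MEAN':
        # average over real tokens only
        return (hidden * m).sum(axis=0) / m.sum()
    if strategy == 'REDUCE_MAX':
        # push padded positions to a very small value so they never win the max
        return np.where(m > 0, hidden, -1e30).max(axis=0)
    if strategy == 'REDUCE_MEAN_MAX':
        return np.concatenate([pool(hidden, mask, 'REDUCE_MEAN'),
                               pool(hidden, mask, 'REDUCE_MAX')])
    if strategy == 'CLS_TOKEN':
        return hidden[0]
    raise ValueError('unknown strategy: %s' % strategy)


hidden = np.array([[1., 2.], [3., 6.], [0., 0.]])  # 3 positions, the last one is padding
mask = np.array([1, 1, 0])
print(pool(hidden, mask, 'REDUCE_MEAN'))      # [2. 4.]
print(pool(hidden, mask, 'REDUCE_MEAN_MAX'))  # [2. 4. 3. 6.]
```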

Q: Are you suggesting using BERT without fine-tuning?

A: Yes and no. On the one hand, Google pretrained BERT on Wikipedia data, thus should encode enough prior knowledge of the language into the model. Having such feature is not a bad idea. On the other hand, these prior knowledge is not specific to any particular domain. It should be totally reasonable if the performance is not ideal if you are using it on, for example, classifying legal cases. Nonetheless, you can always first fine-tune your own BERT on the downstream task and then use bert-as-service to extract the feature vectors efficiently. Keep in mind that bert-as-service is just a feature extraction service based on BERT. Nothing stops you from using a fine-tuned BERT.

Q: Can I get a concatenation of several layers instead of a single layer?

A: Sure! Just pass a list of the layers you want to concatenate when starting the server. Example:

bert-serving-start -pooling_layer -4 -3 -2 -1 -model_dir /tmp/english_L-12_H-768_A-12/

Q: What are the available pooling strategies?

A: Here is a table summarizing all the pooling strategies I implemented. Choose your favorite one by specifying bert-serving-start -pooling_strategy.

| Strategy | Description |
|---|---|
| NONE | no pooling at all, useful when you want to use word embeddings instead of sentence embeddings. This results in a [max_seq_len, 768] encode matrix for a sequence. |
| REDUCE_MEAN | take the average of the hidden state of encoding layer on the time axis |
| REDUCE_MAX | take the maximum of the hidden state of encoding layer on the time axis |
| REDUCE_MEAN_MAX | do REDUCE_MEAN and REDUCE_MAX separately and then concat them together on the last axis, resulting in 1536-dim sentence encodes |
| CLS_TOKEN or FIRST_TOKEN | get the hidden state corresponding to [CLS], i.e. the first token |
| SEP_TOKEN or LAST_TOKEN | get the hidden state corresponding to [SEP], i.e. the last token |

Q: Why not use the hidden state of the first token as default strategy, i.e. the [CLS]?

A: Because a pre-trained model is not fine-tuned on any downstream tasks yet. In this case, the hidden state of [CLS] is not a good sentence representation. If later you fine-tune the model, you may use [CLS] as well.

Q: BERT has 12/24 layers, so which layer are you talking about?

A: By default this service works on the second-to-last layer, i.e. pooling_layer=-2. You can change it by setting pooling_layer to other negative values, e.g. -1 corresponds to the last layer.

Q: Why not the last hidden layer? Why second-to-last?

A: The last layer is too close to the target functions (i.e. masked language model and next sentence prediction) during pre-training, and may therefore be biased to those targets. If you question this argument and want to use the last hidden layer anyway, please feel free to set pooling_layer=-1.

Q: So which layer and which pooling strategy is the best?

A: It depends. Keep in mind that different BERT layers capture different information. To see that more clearly, here is a visualization on the UCI-News Aggregator Dataset, where I randomly sample 20K news titles, get sentence encodes from different layers and with different pooling strategies, and finally reduce them to 2D via PCA (one can of course do t-SNE as well, but that's not my point). There are only four classes in the data, illustrated in red, blue, yellow and green. To reproduce the result, please run example7.py.

Intuitively, pooling_layer=-1 is close to the training output, so it may be biased to the training targets. If you don't fine-tune the model, this could lead to a bad representation. pooling_layer=-12 is close to the word embedding and may preserve the very original word information (with no fancy self-attention etc.). On the other hand, you may achieve the very same performance by simply using word embeddings alone. That said, anything in between [-1, -12] is a trade-off.

Q: Could I use other pooling techniques?

A: For sure. But if you introduce new tf.variables to the graph, then you need to train those variables before using the model. You may also want to check some pooling techniques I mentioned in my blog post.

Q: Do I need to batch the data before encode()?

A: No, not at all. Just call encode and let the server handle the rest. If the batch is too large, the server will do the batching automatically, which is more efficient than doing it yourself. No matter how many sentences you have, 10K or 100K, as long as you can hold them in the client's memory, just send them to the server. Please also read the benchmark on the client batch size.

Q: Can I start multiple clients and send requests to one server simultaneously?

A: Yes! That's the purpose of this repo. In fact you can start as many clients as you want. One server can handle all of them (given enough time).

Q: How many requests can one service handle concurrently?

A: The maximum number of concurrent requests is determined by num_worker in bert-serving-start. If you are sending more than num_worker requests concurrently, the new requests will be temporarily stored in a queue until a free worker becomes available.

Q: So one request means one sentence?

A: No. One request means a list of sentences sent from a client. Think of the size of a request as the batch size. A request may contain 256, 512 or 1024 sentences. The optimal size of a request is often determined empirically. One large request can certainly improve the GPU utilization, yet it also increases the transmission overhead. You may run python example/example1.py for a simple benchmark.

Q: How about the speed? Is it fast enough for production?

A: It highly depends on the max_seq_len and the size of a request. On a single Tesla M40 24GB with max_seq_len=40, you should get about 470 samples per second using a 12-layer BERT. In general, I'd suggest smaller max_seq_len (25) and larger request size (512/1024).

Q: Did you benchmark the efficiency?

A: Yes. See Benchmark.

To reproduce the results, please run bert-serving-benchmark.

Q: What is backend based on?

A: ZeroMQ.

Q: What is the parallel processing model behind the scenes?

Q: Why does the server need two ports?

A: One port is for pushing text data into the server; the other port is for publishing the encoded result to the client(s). In this way, we get rid of back-chatter, meaning that at every level recipients never talk back to senders. The overall message flow is strictly one-way, as depicted in the above figure. Killing back-chatter is essential to real scalability, allowing us to use BertClient in an asynchronous way.

Q: Do I need Tensorflow on the client side?

A: No. Think of BertClient as a general feature extractor, whose output can be fed to any ML model, e.g. scikit-learn, pytorch, tensorflow. The only file that the client needs is client.py. Copy this file to your project and import it, then you are ready to go.

Q: Can I use multilingual BERT model provided by Google?

A: Yes.

Q: Can I use my own fine-tuned BERT model?

A: Yes. In fact, this is suggested. Make sure you have the following three items in model_dir:

  • A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).
  • A vocab file (vocab.txt) to map WordPiece to word id.
  • A config file (bert_config.json) which specifies the hyperparameters of the model.

Q: Can I run it in python 2?

A: Server side no, client side yes. This is based on the consideration that python 2.x might still be a major piece in some tech stack. Migrating the whole downstream stack to python 3 for supporting bert-as-service can take quite some effort. On the other hand, setting up BertServer is just a one-time thing, which can be even run in a docker container. To ease the integration, we support python 2 on the client side so that you can directly use BertClient as a part of your python 2 project, whereas the server side should always be hosted with python 3.

Q: Do I need to do segmentation for Chinese?

A: No. If you are using the pretrained Chinese BERT released by Google, you don't need word segmentation, as this Chinese BERT is a character-based model. It won't recognize words or phrases even if you intentionally add spaces in between. To see that more clearly, this is what the BERT model actually receives after tokenization:

bc.encode(['hey you', 'whats up?', '你好么?', '我 还 可以'])
tokens: [CLS] hey you [SEP]
input_ids: 101 13153 8357 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

tokens: [CLS] what ##s up ? [SEP]
input_ids: 101 9100 8118 8644 136 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

tokens: [CLS] 你 好 么 ? [SEP]
input_ids: 101 872 1962 720 8043 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

tokens: [CLS] 我 还 可 以 [SEP]
input_ids: 101 2769 6820 1377 809 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

That means the word embedding is actually the character embedding for Chinese-BERT.

Q: Why is my (English) word tokenized to ##something?

A: Because your word is out-of-vocabulary (OOV). The tokenizer from Google uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:

input = "unaffable"
tokenizer_output = ["un", "##aff", "##able"]
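
The greedy longest-match-first idea can be sketched in a few lines. This is a simplified illustration with a toy three-entry vocabulary, not the tokenizer code shipped with BERT; the function name and the vocabulary are assumptions made for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able"}  # toy vocabulary, not the real vocab.txt
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
```
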
Q: Can I use my own tokenizer?

A: Yes. If you have already tokenized the sentences on your own, simply call encode with List[List[str]] as input and turn on is_tokenized, i.e. bc.encode(texts, is_tokenized=True).
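A short sketch of what that input looks like. The whitespace splitter below is only a stand-in for your own tokenizer, and both function names are hypothetical; bc is expected to be a BertClient connected to a running server.

```python
def whitespace_tokenize(sentences):
    """Stand-in for a custom tokenizer: split each sentence on whitespace."""
    return [s.split() for s in sentences]


def encode_pretokenized(bc, sentences):
    """Tokenize on the client side and send List[List[str]] to the server."""
    texts = whitespace_tokenize(sentences)  # e.g. [['hey', 'you'], ['whats', 'up?']]
    return bc.encode(texts, is_tokenized=True)


# Usage (requires a running server):
#   from bert_serving.client import BertClient
#   vecs = encode_pretokenized(BertClient(), ['hey you', 'whats up?'])
```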

Q: I encounter zmq.error.ZMQError: Operation cannot be accomplished in current state when using BertClient, what should I do?

A: This is often due to the misuse of BertClient in a multi-thread/process environment. Note that you can't reuse one BertClient among multiple threads/processes; you have to make a separate instance for each thread/process. For example, the following won't work at all:

# BAD example
bc = BertClient()

# in Proc1/Thread1 scope:
bc.encode(lst_str)

# in Proc2/Thread2 scope:
bc.encode(lst_str)

Instead, please do:

# in Proc1/Thread1 scope:
bc1 = BertClient()
bc1.encode(lst_str)

# in Proc2/Thread2 scope:
bc2 = BertClient()
bc2.encode(lst_str)
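
For threads, one way to guarantee "one client per thread" is to keep the client in thread-local storage. This is a sketch of that pattern, not something provided by the package; get_client is a hypothetical helper and client_factory would normally be BertClient.

```python
import threading

_local = threading.local()


def get_client(client_factory):
    """Return the calling thread's own client, creating it on first use."""
    if not hasattr(_local, "client"):
        _local.client = client_factory()
    return _local.client


# Usage inside each thread (requires a running server):
#   from bert_serving.client import BertClient
#   vecs = get_client(BertClient).encode(lst_str)
```

For multiple processes, the equivalent is to construct the client inside the worker process rather than before it is started.
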
Q: After running the server, I have several garbage tmpXXXX folders. How can I change this behavior?

A: These folders are used by ZeroMQ to store sockets. You can choose a different location by setting the environment variable ZEROMQ_SOCK_TMP_DIR, e.g. export ZEROMQ_SOCK_TMP_DIR=/tmp/

Q: The cosine similarity of two sentence vectors is unreasonably high (e.g. always > 0.8), what's wrong?

A: A decent representation for a downstream task doesn't mean that it will be meaningful in terms of cosine distance, since cosine distance treats the embedding as a linear space where all dimensions are weighted equally. If you want to use cosine distance anyway, then please focus on the rank, not the absolute value. Namely, do not use:

if cosine(A, B) > 0.9, then A and B are similar

Please consider the following instead:

if cosine(A, B) > cosine(A, C), then A is more similar to B than C.
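
A minimal sketch of the rank-based usage, in plain Python so it runs without extra packages. The vectors are toy 2-dimensional values rather than real BERT encodings, and the function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
print(rank_by_similarity(query, candidates))  # [1, 2, 0]
```

Only the ordering is used; no fixed threshold on the similarity value is involved.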

The graph below illustrates the pairwise similarity of 3000 Chinese sentences randomly sampled from the web (char. length < 25). We compute cosine similarity based on the sentence vectors and Rouge-L based on the raw text. The diagonal (self-correlation) is removed for the sake of clarity. As one can see, there is some positive correlation between these two metrics.

Q: I'm getting bad performance, what should I do?

A: This often suggests that the pretrained BERT could not generate a decent representation for your downstream task. In that case, you can fine-tune the model on the downstream task and then use bert-as-service to serve the fine-tuned BERT. Note that bert-as-service is just a feature extraction service based on BERT. Nothing stops you from using a fine-tuned BERT.

Q: Can I run the server side on CPU-only machine?

A: Yes, please run bert-serving-start -cpu -max_batch_size 16. Note that CPUs do not scale as well as GPUs to large batches; therefore the max_batch_size on the server side needs to be smaller, e.g. 16 or 32.

Q: How can I choose num_worker?

A: Generally, the number of workers should be less than or equal to the number of GPUs or CPUs that you have. Otherwise, multiple workers will be allocated to one GPU/CPU, which may not scale well (and may cause out-of-memory on GPU).

Q: Can I specify which GPU to use?

A: Yes, you can specify -device_map as follows:

bert-serving-start -device_map 0 1 4 -num_worker 4 -model_dir ...

This will start four workers and allocate them to GPU0, GPU1, GPU4 and again GPU0, respectively. In general, if num_worker is larger than the number of devices in device_map, then devices will be reused and shared by the workers (which may scale suboptimally or cause OOM); if num_worker is smaller, only device_map[:num_worker] will be used.

Note, device_map is ignored when running on CPU.
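
The allocation rule described above can be written as a one-line round-robin. This is a sketch of the rule as stated in this answer, not the server's actual code, and assign_devices is a hypothetical name.

```python
def assign_devices(device_map, num_worker):
    """Map worker index to GPU id, cycling through device_map when workers outnumber devices."""
    return [device_map[i % len(device_map)] for i in range(num_worker)]


print(assign_devices([0, 1, 4], 4))  # [0, 1, 4, 0]
print(assign_devices([0, 1, 4], 2))  # [0, 1]
```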

Benchmark

The primary goal of benchmarking is to test the scalability and the speed of this service, which is crucial for using it in a dev/prod environment. The benchmark was run on a Tesla M40 24GB; each experiment was repeated 10 times and the average value is reported.

To reproduce the results, please run

bert-serving-benchmark --help

Common arguments across all experiments are:

| Parameter | Value |
|---|---|
| num_worker | 1,2,4 |
| max_seq_len | 40 |
| client_batch_size | 2048 |
| max_batch_size | 256 |
| num_client | 1 |

Speed wrt. max_seq_len

max_seq_len is a parameter on the server side, which controls the maximum length of a sequence that a BERT model can handle. Sequences larger than max_seq_len will be truncated on the left side. Thus, if your client wants to send long sequences to the model, please make sure the server can handle them correctly.

Performance-wise, longer sequences mean slower speed and a higher chance of OOM, as the multi-head self-attention (the core unit of BERT) needs to do dot products and matrix multiplications between every two symbols in the sequence.

| max_seq_len | 1 GPU | 2 GPU | 4 GPU |
|---|---|---|---|
| 20 | 903 | 1774 | 3254 |
| 40 | 473 | 919 | 1687 |
| 80 | 231 | 435 | 768 |
| 160 | 119 | 237 | 464 |
| 320 | 54 | 108 | 212 |

Speed wrt. client_batch_size

client_batch_size is the number of sequences sent from a client when invoking encode(). For performance reasons, please consider encoding sequences in batches rather than one by one.

For example, do:

# prepare your sentences in advance
bc = BertClient()
my_sentences = [s for s in my_corpus.iter()]
# encode everything in one shot
vec = bc.encode(my_sentences)

DON'T:

bc = BertClient()
vec = []
for s in my_corpus.iter():
    vec.append(bc.encode(s))

It's even worse if you put BertClient() inside the loop. Don't do that.

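| client_batch_size | 1 GPU | 2 GPU | 4 GPU |
|---|---|---|---|
| 1 | 75 | 74 | 72 |
| 4 | 206 | 205 | 201 |
| 8 | 274 | 270 | 267 |
| 16 | 332 | 329 | 330 |
| 64 | 365 | 365 | 365 |
| 256 | 382 | 383 | 383 |
| 512 | 432 | 766 | 762 |
| 1024 | 459 | 862 | 1517 |
| 2048 | 473 | 917 | 1681 |
| 4096 | 481 | 943 | 1809 |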

Speed wrt. num_client

num_client represents the number of concurrent clients connected to the server at the same time.

| num_client | 1 GPU | 2 GPU | 4 GPU |
|---|---|---|---|
| 1 | 473 | 919 | 1759 |
| 2 | 261 | 512 | 1028 |
| 4 | 133 | 267 | 533 |
| 8 | 67 | 136 | 270 |
| 16 | 34 | 68 | 136 |
| 32 | 17 | 34 | 68 |

As one can observe in the table above, 1 client on 1 GPU reaches 473 seqs/s, 2 clients on 2 GPUs reach 512 seqs/s, and 4 clients on 4 GPUs reach 533 seqs/s. This shows the efficiency of our parallel pipeline and job scheduling, as the service can leverage the GPU time more exhaustively as concurrent requests increase.

Speed wrt. max_batch_size

max_batch_size is a parameter on the server side, which controls the maximum number of samples per batch per worker. If an incoming batch from a client is larger than max_batch_size, the server will split it into smaller batches, each of size less than or equal to max_batch_size, before sending them to the workers.

| max_batch_size | 1 GPU | 2 GPU | 4 GPU |
|---|---|---|---|
| 32 | 450 | 887 | 1726 |
| 64 | 459 | 897 | 1759 |
| 128 | 473 | 931 | 1816 |
| 256 | 473 | 919 | 1688 |
| 512 | 464 | 866 | 1483 |
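
The splitting behavior controlled by max_batch_size can be pictured with a simple chunking function. This is only a sketch of the idea; the real server may organize its batches differently, and split_into_batches is a hypothetical name.

```python
def split_into_batches(items, max_batch_size):
    """Split items into consecutive chunks of at most max_batch_size elements."""
    return [items[i:i + max_batch_size] for i in range(0, len(items), max_batch_size)]


request = ["sentence %d" % i for i in range(600)]  # one incoming client request
batches = split_into_batches(request, 256)
print([len(b) for b in batches])  # [256, 256, 88]
```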

Speed wrt. pooling_layer

pooling_layer determines the encoding layer that pooling operates on. For example, in a 12-layer BERT model, -1 represents the layer closest to the output and -12 represents the layer closest to the embedding layer. As one can observe below, the depth of the pooling layer affects the speed.

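| pooling_layer | 1 GPU | 2 GPU | 4 GPU |
|---|---|---|---|
| [-1] | 438 | 844 | 1568 |
| [-2] | 475 | 916 | 1686 |
| [-3] | 516 | 995 | 1823 |
| [-4] | 569 | 1076 | 1986 |
| [-5] | 633 | 1193 | 2184 |
| [-6] | 711 | 1340 | 2430 |
| [-7] | 820 | 1528 | 2729 |
| [-8] | 945 | 1772 | 3104 |
| [-9] | 1128 | 2047 | 3622 |
| [-10] | 1392 | 2542 | 4241 |
| [-11] | 1523 | 2737 | 4752 |
| [-12] | 1568 | 2985 | 5303 |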

Speed wrt. -fp16 and -xla

bert-as-service supports two additional optimizations: half-precision and XLA, which can be turned on by adding -fp16 and -xla to bert-serving-start, respectively. To enable these two options, you have to meet the following requirements:

  • your GPU supports FP16 instructions;
  • your Tensorflow is self-compiled with XLA and -march=native;
  • your CUDA and cudnn are not too old.

On Tesla V100 with tensorflow=1.13.0-rc0 it gives:

FP16 achieves ~1.4x speedup (round-trip) compared to the FP32 counterpart. To reproduce the result, please run python example/example1.py.

Citing

If you use bert-as-service in a scientific publication, we would appreciate references to the following BibTeX entry:

@misc{xiao2018bertservice,
  title={bert-as-service},
  author={Xiao, Han},
  howpublished={\url{https://github.com/hanxiao/bert-as-service}},
  year={2018}
}
Comments
  • feat: add three new open clip roberta base models

    Goals : align with openclip v2.7.0 Changes :

    • Add three new models: roberta-ViT-B-32::laion2b-s12b-b32k xlm-roberta-base-ViT-B-32::laion5b-s13b-b90k, and xlm-roberta-large-ViT-H-14::frozen_laion5b_s13b_b90k;
    • Add LayernormFp32 (original Layernorm handles fp16) (Default precision: fp32 on cpu and fp16 on gpu);
    • Split original CLIP to TextTransformer, VisionTransformer and add _build_text_tower _build_vision_tower for seperately building;
    • Rearrange modules;
    • Fix bugs on flash attention (only use on cuda).
    • Docs will be updated in #862
    component/server area/housekeeping area/cicd area/testing size/l 
    opened by OrangeSodahub 24
  • graph optimization fail

    This issue shows up: I can successfully use this application on my MacBook. Then, I tried to apply it on a cloud server, but it cannot work and return error: FileNotFoundError: graph optimization fails and returns empty result. Could you please help me to figure out this issue? Thanks! ...

    opened by chiyuzhang94 21
  • not enough values to unpack in server.py

    Hello, I tried to run example5.py on a single-GPU ec2 instance. I set num of worker to 1.

    After a few hours running, it will popup following errors:

    image

    on client (same instance, another session) image

    I tried twice, always the case. Do you have any suggestion? Thanks.

    Also, usually how long it takes to finish the training of example5.py on a single-gpu machine?

    opened by thetuxedo 21
  • On Windows: FileNotFoundError: graph optimization fails and returns empty result

    Prerequisites

    • TensorFlow version:1.12.0
    • Python version:3.6.5

    I'm using this command to start the server:

    python bert-serving-start -model_dir=../../pre-trained-model/
    

    Then this issue shows up: image

    bug 
    opened by jacky7788 17
  • On Windows, I got zmq.error:ZMQError: Protocal not supported when executing 'python app.py'

    Your project is awesome. But I'm not sure if it will work on Windows 10 platform. I just cloned your project, and downloaded the BERT pre-trained model. The moment I run python app.py -model_dir F:\data\chinese_L-12_H-768_A-12\chinese_L-12_H-768_A-12/ -num_worker=4 , I got an error:

    λ python app.py -model_dir F:\data\chinese_L-12_H-768_A-12\chinese_L-12_H-768_A-12/ -num_worker=4
    usage: app.py -model_dir F:\data\chinese_L-12_H-768_A-12\chinese_L-12_H-768_A-12/ -num_worker=4
                     ARG   VALUE
    __________________________________________________
          max_batch_size = 256
             max_seq_len = 25
               model_dir = F:\data\chinese_L-12_H-768_A-12\chinese_L-12_H-768_A-12/
              num_worker = 4
           pooling_layer = -2
        pooling_strategy = REDUCE_MEAN
                    port = 5555
    
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "D:\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
        self.run()
      File "F:\Work\Github\bert-as-service\service\server.py", line 72, in run
        self.backend.bind('ipc://*')
      File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq\backend\cython\socket.c:5653)
      File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq\backend\cython\socket.c:10014)
    zmq.error.ZMQError: Protocol not supported
    

    I have no idea what this zmq is, and I googled, it seems that 'ipc' is not supported on Windows, we should use 'tcp' instead. I tried to just change 'ipc' to 'tcp' on line 72, but still got the similar error:

    λ python app.py -model_dir F:\data\chinese_L-12_H-768_A-12\chinese_L-12_H-768_A-12/ -num_worker=4
    usage: app.py -model_dir F:\data\chinese_L-12_H-768_A-12\chinese_L-12_H-768_A-12/ -num_worker=4
                     ARG   VALUE
    __________________________________________________
          max_batch_size = 256
             max_seq_len = 25
               model_dir = F:\data\chinese_L-12_H-768_A-12\chinese_L-12_H-768_A-12/
              num_worker = 4
           pooling_layer = -2
        pooling_strategy = REDUCE_MEAN
                    port = 5555
    
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "D:\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
        self.run()
      File "F:\Work\Github\bert-as-service\service\server.py", line 72, in run
        self.backend.bind('tcp://*')
      File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq\backend\cython\socket.c:5653)
      File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq\backend\cython\socket.c:10014)
    zmq.error.ZMQError: Invalid argument
    

    Any idea on how to correct this?

    will do 
    opened by BruceDai003 14
  • 'bert-serving-start' is not recognized as an internal or external command

    Hi, This is a very silly question..... I have python 3.6.6, tensorflow 1.12.0, doing everything in conda environment, Windows 10. I pip installed bert-serving-server/client and it shows Successfully installed GPUtil-1.4.0 bert-serving-client-1.7.2 bert-serving-server-1.7.2 pyzmq-17.1.2 but when I run the following as CLI bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4 it says 'bert-serving-start' is not recognized as an internal or external command

    I found bert-serving library is located under C:\Users\Name\Anaconda\Lib\site-packages. So I tried to run bert-serving-start again under these three folders:

    1. site-packages
    2. site-packages\bert_serving
    3. site-packages\bert_serving_server-1.7.2.dist-info

    However, the result is same as not recognized. Can anyone help me?

    opened by moon-home 13
  • why so slow when I get result from estimator.predict(input_fn)?

    Hi,

    Thanks in advance for all the help.

    I try to implement a simple version. just create estimator in flask and do predict

    what confused me is, after results=estimator.predict(), this loop segment always cost more than 10 seconds, like this:

    res = []
    for r in results:
        res.append(r['predict_results'].tolist())
    

    I've no idea why...does anyone meet issues like this?

    btw. I notice that zmq is used

    for r in estimator.predict(self.input_fn_builder(receivers, tf, sink_token), yield_single_examples=False):
                send_ndarray(sink_embed, r['client_id'], r['encodes'], ServerCmd.data_embed)
                logger.info('job done\tsize: %s\tclient: %s' % (r['encodes'].shape, r['client_id']))
    

    but I don't think that's the reason for me....

    opened by nlp4whp 12
  • the server seems hangs

    I started a server, it seems ok. When I run a client to encode a demo sentence, the server will log a new job, but no more output after than. The server hangs. (The server log a new job only once, when I try to run a client second time, the server has no new job logged.) When I use ctr+z to stop the server, the linux shows segment fault. How to debug this problem? Thank you.


    server log:

    usage: app.py -model_dir ./models/chinese_L-12_H-768_A-12/ -num_worker=1
    ARG VALUE
    gpu_memory_fraction = 0.5
    max_batch_size = 256
    max_seq_len = 25
    model_dir = ./models/chinese_L-12_H-768_A-12/
    num_worker = 1
    pooling_layer = [-2]
    pooling_strategy = REDUCE_MEAN
    port = 5555
    port_out = 5556

    I:VENTILATOR:[ser:__i: 79]:frontend-sink ipc: ipc://tmpt7XjXt/socket
    W:VENTILATOR:[ser:run:100]: only 0 GPU(s) is available, but ask for 1
    I:SINK:[ser:run:230]:new job b'9be1e3dd-88f6-47c4-a51b-61b74bd6e65e#ea138d3d-ac8c-479e-ba3d-18b417fd8caa' size: 2 is registered!

    client code:

    from service.client import BertClient
    bc = BertClient()
    print('client inited.')
    x = ['hey you', 'whats up?']

    print(bc.encode(x))  # [2, 25, 768]
    print(bc.encode(x)[0])  # [1, 25, 768], sentence embeddings for hey you
    print(bc.encode(x)[0][0])  # [1, 1, 768], word embedding for [CLS]

    client side log:

    you should NOT see this message multiple times! if you see it appears repeatedly, consider moving "BertClient()" out of the loop.
    client inited.

    bug 
    opened by iamadog3333 12
  • Cannot send tensor to the encode() function

    The CLIPEncoder only works with the .uri and .blob objects when sending a encode() request. Here -

    https://github.com/jina-ai/clip-as-service/blob/fa42dc50f6c766c60fe246802ecc8c15e37fbdf4/server/clip_server/executors/clip_torch.py#L40

    If a .tensor is available in the Document, there shouldn't be a need to use .uri or .blob to create a PIL Image which is sent to the _preprocess function, and the tensor itself could be used for that purpose.

    opened by shubham-goel-zefr 11
  • KeyError on server side , then  i can't connect the server

    Process BertSink-1:
    Traceback (most recent call last):
      File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/site-packages/bert_serving/server/__init__.py", line 226, in run
        job_checksum[job_id]))
    KeyError: b'd4853427-8e43-48ee-8b8f-a24eb357face#1'

    After this, I can't connect to the server to get sentence encodes. How can I solve it?

    release: python3.6
    pip list:
    bert-serving-client 1.5.0
    bert-serving-server 1.5.0
    server is started by: nohup bert-serving-start -model_dir ../multi_cased_L-12_H-768_A-12 -num_worker=8 -cpu &

    opened by mat7213 11
  • Why the result is not good on the Dataset——2018CCFBDCI汽车行业用户观点主题及情感识别?

    Prerequisites

    Please fill in by replacing [ ] with [x].

    System information

    Some of this information can be collected via this script.

    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
    • TensorFlow installed from (source or binary):source
    • TensorFlow version:1.12.0
    • Python version:3.6.7
    • bert-as-service version: the former version,not the lastest
    • GPU model and memory:10G
    • CPU model and memory:

    Description

    Please replace YOUR_SERVER_ARGS and YOUR_CLIENT_ARGS accordingly. You can also write your own description for reproducing the issue.

    I'm using this command to start the server:

    python app.py -pooling_strategy NONE -model_dir ./chinese_L-12_H-768_A-12/ -num_worker=1
    

    and calling the server via:

    bc = BertClient()
    x_= bc.encode(x_batch) # (batch_size, seq_len) and get x_(batch_size, seq_len, 768)
    

    Then this issue shows up: I have used the sentence embedding to the model—— text classification, just connected a DNN and a classifier to predict the label(such as price, power, space, security and so on) base on comment on cars. But the result is very bad. Then I used the word embedding to the RNN and softmax,is bad too.I do not know why. ...

    opened by pengwei-iie 11
  • Support  huggingface transformers clip model?

    can you Support the huggingface transformers clip model,such as https://huggingface.co/docs/transformers/model_doc/chinese_clip & https://huggingface.co/docs/transformers/model_doc/clip?

    opened by ScottishFold007 1
  • fix: check if results are empty

    Fixing encoding of a dataset using cas. To reproduce: download this file https://jinaai.slack.com/files/U026WRMRL6A/F04GE6JKX0U/dataset.csv

    then:

    from docarray import DocumentArray, Document
    with open('dataset.csv') as fp:
      da = DocumentArray.from_csv(fp)
    
    from clip_client import Client
    
    c = Client(
        'grpcs://api.clip.jina.ai:2096', credential={'Authorization': 'atuh-token'}
    )
    
    encoded_da = c.encode(da, show_progress=True)
    
    size/xs component/client 
    opened by alaeddine-13 1
  •  bugs

    I just want to generate embedded vectors, but there are too many bugs. The environment uses 0.8.0 docker pulled. When inputting vectors, input them once every 500 words. This time, 4511 errors are reported, which is inexplicable and cannot be captured

    CRITI… [email protected] inputs is not valid! FileNotFoundError('struct StreamityCrowdsale.Ico is not a URL or a valid local path') [12/21/22 00:29:40]
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/jina/clients/request/__init__.py", line 71, in request_generator
        for batch in batch_iterator(data, request_size):
      File "/opt/conda/lib/python3.9/site-packages/jina/helper.py", line 268, in batch_iterator
        chunk = tuple(islice(iterator, batch_size))
      File "/opt/conda/lib/python3.9/site-packages/clip_client/client.py", line 178, in _iter_doc
        d = Document(
      File "/opt/conda/lib/python3.9/site-packages/docarray/document/mixins/blob.py", line 19, in load_uri_to_blob
        self.blob = _uri_to_blob(self.uri, **kwargs)
      File "/opt/conda/lib/python3.9/site-packages/docarray/document/mixins/helper.py", line 24, in _uri_to_blob
        raise FileNotFoundError(f'{uri} is not a URL or a valid local path')
    FileNotFoundError: struct StreamityCrowdsale.Ico is not a URL or a valid local path

    WARNI… [email protected] process error: CancelledError() [12/21/22 00:29:40]
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/jina/clients/mixin.py", line 262, in _get_results
        async for resp in c._get_results(*args, **kwargs):
      File "/opt/conda/lib/python3.9/site-packages/jina/clients/base/grpc.py", line 124, in _get_results
        async for resp in stub.Call(
      File "/opt/conda/lib/python3.9/site-packages/grpc/aio/_call.py", line 326, in _fetch_stream_responses
        await self._raise_for_status()
      File "/opt/conda/lib/python3.9/site-packages/grpc/aio/_call.py", line 233, in _raise_for_status
        raise asyncio.CancelledError()
    asyncio.exceptions.CancelledError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      ...
        return _2vec.encode(data)
      File "/opt/conda/lib/python3.9/site-packages/clip_client/client.py", line 295, in encode
        self._client.post(
      File "/opt/conda/lib/python3.9/site-packages/jina/clients/mixin.py", line 271, in post
        return run_async(
      File "/opt/conda/lib/python3.9/site-packages/jina/helper.py", line 1334, in run_async
        return asyncio.run(func(*args, **kwargs))
      File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
        return future.result()
    asyncio.exceptions.CancelledError

    opened by 161424 1
  • Error when using Client.encode() on a docarray created from the amazon-berkeley-objects-dataset

    Here is the trace.

    Traceback (most recent call last):
    File "C:\Python\Python37\lib\site-packages\jina\clients\helper.py", line 47, in _arg_wrapper
    return func(*args, **kwargs)
    File "C:\Python\Python37\lib\site-packages\clip_client\client.py", line 169, in _gather_result
    results[r[:, 'id']][:, attribute] = r[:, attribute]
    File "C:\Python\Python37\lib\site-packages\docarray\array\mixins\getitem.py", line 102, in __getitem__
    elif isinstance(index[0], bool):
    IndexError: list index out of range

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File ".\main.py", line 15, in <module>
    encoded_docarray = clt.encode(doc_array, show_progress=True)
    File "C:\Python\Python37\lib\site-packages\clip_client\client.py", line 336, in encode
    parameters=parameters,
    File "C:\Python\Python37\lib\site-packages\jina\clients\mixin.py", line 288, in post
    **kwargs,
    File "C:\Python\Python37\lib\site-packages\jina\helper.py", line 1342, in run_async
    return future.result()
    File "C:\Python\Python37\lib\site-packages\jina\clients\mixin.py", line 264, in _get_results
    async for resp in c._get_results(*args, **kwargs):
    return future.result()
    File "C:\Python\Python37\lib\site-packages\jina\clients\mixin.py", line 264, in _get_results
    async for resp in c._get_results(*args, **kwargs):
    File "C:\Python\Python37\lib\site-packages\jina\clients\base\grpc.py", line 146, in _get_results
    logger=self.logger,
    File "C:\Python\Python37\lib\site-packages\jina\clients\helper.py", line 83, in callback_exec
    _safe_callback(on_done, continue_on_error, logger)(response)
    File "C:\Python\Python37\lib\site-packages\jina\clients\helper.py", line 49, in _arg_wrapper
    err_msg = f'uncaught exception in callback {func.__name__}(): {ex!r}'
    AttributeError: 'functools.partial' object has no attribute '__name__'
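
    The last error in the trace can be reproduced in isolation: functools.partial objects do not define a __name__ attribute, so formatting a message with it fails. The snippet below is only a sketch of that behaviour; callback_name and on_done are hypothetical names used for illustration and are not taken from jina or clip_client.

```python
import functools


def callback_name(func) -> str:
    """Return a printable name for any callable (hypothetical helper)."""
    # Plain functions have __name__; functools.partial objects do not,
    # so fall back to repr() instead of raising AttributeError.
    return getattr(func, "__name__", None) or repr(func)


def on_done(resp, tag):
    return (tag, resp)


# Wrapping a callback with partial() hides the original function name.
wrapped = functools.partial(on_done, tag="demo")

print(hasattr(wrapped, "__name__"))
print(callback_name(on_done))
print(callback_name(wrapped))
```

    Reading the name through getattr with a fallback is one way to keep such an error message from masking the original exception (here, the IndexError).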

    opened by devinzli 2
  • Which framework maximizes inference speed ?

    Hello,

    Is there a speed comparison of the different CLIP models across the following dimensions?

    • hardware : T4, V100, A40, A100 etc.
    • model : ViT-B, ViT-L, ViT-H etc.
    • batch size
    • tensor type
    • framework : pytorch JIT, TensorRT, AiTemplate, FasterTransformer by NVIDIA etc.

    I am especially interested in the last dimension: which framework or compiler is currently the best for maximizing the speed of vision transformers?

    Thank you, Simon

    opened by SimJeg 2
Releases(v0.8.1)
  • v0.8.1(Nov 15, 2022)

    Release Note (0.8.1)

    Release time: 2022-11-15 11:15:48

    This release contains 1 new feature, 1 performance improvement, 2 bug fixes and 4 documentation improvements.

    🆕 Features

    Allow custom callback in clip_client (#849)

    This feature allows clip-client users to send a request to a server and then process the response with a custom callback function. There are three callbacks that users can process with custom functions: on_done, on_error and on_always.

    The following code snippet shows how to send a request to a server and save the response to a database.

    from clip_client import Client
    
    db = {}
    
    def my_on_done(resp):
        for doc in resp.docs:
            db[doc.id] = doc
    
    
    def my_on_error(resp):
        # file.write() only accepts str, so format the response before writing
        with open('error.log', 'a') as f:
            f.write(f'{resp}\n')
    
    
    def my_on_always(resp):
        print(f'{len(resp.docs)} docs processed')
    
    
    c = Client('grpc://0.0.0.0:12345')
    c.encode(
        ['hello', 'world'], on_done=my_on_done, on_error=my_on_error, on_always=my_on_always
    )
    

    For more details, please refer to the CLIP client documentation.

    🚀 Performance

    Integrate flash attention (#853)

    We have integrated the flash attention module as a faster replacement for nn.MultiheadAttention. To take advantage of this feature, you will need to install the flash attention module manually:

    pip install git+https://github.com/HazyResearch/flash-attention.git
    

    If flash attention is present, clip_server will automatically try to use it.
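
    A minimal sketch of this kind of optional-dependency check is shown below. It is not the clip_server implementation: the module name flash_attn and both helper functions are assumptions made for illustration.

```python
import importlib.util


def flash_attention_available(module_name: str = "flash_attn") -> bool:
    """Return True if the optional module can be found (module name is an assumption)."""
    # find_spec() looks the module up without importing it.
    return importlib.util.find_spec(module_name) is not None


def pick_attention_backend() -> str:
    """Prefer the optional fast path, otherwise fall back to the standard layer."""
    if flash_attention_available():
        return "flash_attention"
    return "nn.MultiheadAttention"


print(pick_attention_backend())
```

    Checking for the package at start-up means the server can run unchanged whether or not the extra dependency is installed.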

    The table below compares CLIP performance with and without the flash attention module. We conducted all tests on a Tesla T4 GPU and timed how long it took to encode a batch of documents 100 times.

    | Model    | Input data | Input shape       | w/o flash attention | flash attention | Speedup |
    |----------|------------|-------------------|---------------------|-----------------|---------|
    | ViT-B-32 | text       | (1, 77)           | 0.42692             | 0.37867         | 1.1274  |
    | ViT-B-32 | text       | (8, 77)           | 0.48738             | 0.45324         | 1.0753  |
    | ViT-B-32 | text       | (16, 77)          | 0.4764              | 0.44315         | 1.07502 |
    | ViT-B-32 | image      | (1, 3, 224, 224)  | 0.4349              | 0.40392         | 1.0767  |
    | ViT-B-32 | image      | (8, 3, 224, 224)  | 0.47367             | 0.45316         | 1.04527 |
    | ViT-B-32 | image      | (16, 3, 224, 224) | 0.51586             | 0.50555         | 1.0204  |

    Based on our experiments, performance improvements vary depending on the model and GPU, but in general, the flash attention module improves performance.
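
    The Speedup column is consistent with dividing the timing without flash attention by the timing with it. The sketch below recomputes that ratio from the reported timings; the speedup helper is a hypothetical name, and small differences in the last digit are expected because the published timings are rounded.

```python
# (input data, input shape, timing without flash attention, timing with flash attention)
rows = [
    ("text", (1, 77), 0.42692, 0.37867),
    ("text", (8, 77), 0.48738, 0.45324),
    ("text", (16, 77), 0.4764, 0.44315),
    ("image", (1, 3, 224, 224), 0.4349, 0.40392),
    ("image", (8, 3, 224, 224), 0.47367, 0.45316),
    ("image", (16, 3, 224, 224), 0.51586, 0.50555),
]


def speedup(baseline: float, optimized: float) -> float:
    """Ratio of baseline time to optimized time; values above 1 mean faster."""
    return baseline / optimized


for kind, shape, without_fa, with_fa in rows:
    print(f"{kind:5s} {str(shape):18s} {speedup(without_fa, with_fa):.4f}")
```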

    🐞 Bug Fixes

    Increase timeout at startup for Executor docker images (#854)

    During Executor initialization, downloading model parameters can take a long time. If a model is very large and downloads slowly, the Executor may time out before it even starts. We have increased the timeout to 3000000ms (50 minutes).

    Install transformers for Executor docker images (#851)

    We have added the transformers package to Executor docker images, in order to support the multilingual CLIP model.

    📗 Documentation Improvements

    • Update Finetuner docs (#843)
    • Add tips for client parallelism usage (#846)
    • Move benchmark conclusion to beginning (#847)
    • Add instructions for using clip server hosted by Jina (#848)

    🤟 Contributors

    We would like to thank all contributors to this release:

    • Ziniu Yu (@ZiniuYu)
    • Jie Fu (@jemmyshin)
    • felix-wang (@numb3r3)
    • YangXiuyu (@OrangeSodaHub)
  • v0.8.0(Oct 12, 2022)

    Release Note (0.8.0)

    Release time: 2022-10-12 08:11:40

    This release contains 3 new features, 1 performance improvement, and 1 documentation improvement.

    🆕 Features

    Support large ONNX model files (#828)

    Before this release, ONNX model files were limited to 2GB. We now support large ONNX models, which are archived into zip files that store several smaller ONNX files, one per subgraph. As a result, we are now able to serve all of the CLIP models via onnxruntime.

    Support ViT-B-32, ViT-L-14, ViT-H-14 and ViT-g-14 trained on laion-2b (#825)

    Users can now serve four new CLIP models from OpenCLIP trained on the Laion-2B dataset:

    • ViT-B-32::laion2b-s34b-b79k
    • ViT-L-14::laion2b-s32b-b82k
    • ViT-H-14::laion2b-s32b-b79k
    • ViT-g-14::laion2b-s12b-b42k

    The ViT-H-14 model achieves 78.0% zero-shot top-1 accuracy on ImageNet and 73.4% on zero-shot image retrieval at Recall@5 on MS COCO. This is the best-performing open source CLIP model. To use the new models, simply specify the model name, e.g., ViT-H-14::laion2b-s32b-b79k, in the Flow YAML. For example:

    jtype: Flow
    version: '1'
    with:
      port: 51000
    executors:
      - name: clip_t
        uses:
          jtype: CLIPEncoder
          with:
            name: ViT-H-14::laion2b-s32b-b79k
          metas:
            py_modules:
              - clip_server.executors.clip_torch
    

    Please refer to model support to see the full list of supported models.
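
    The model names listed above appear to follow an "architecture::pretrained-tag" pattern. The helper below is a small sketch for splitting such a name; split_model_name is a hypothetical function, and the interpretation of the two parts is an assumption drawn from the names in this release note rather than from the clip_server code.

```python
from typing import Optional, Tuple


def split_model_name(name: str) -> Tuple[str, Optional[str]]:
    """Split 'ViT-H-14::laion2b-s32b-b79k' into ('ViT-H-14', 'laion2b-s32b-b79k')."""
    # partition() returns the separator only if it was found in the string.
    arch, sep, tag = name.partition("::")
    return arch, (tag if sep else None)


for model in [
    "ViT-B-32::laion2b-s34b-b79k",
    "ViT-L-14::laion2b-s32b-b82k",
    "ViT-H-14::laion2b-s32b-b79k",
    "ViT-g-14::laion2b-s12b-b42k",
]:
    print(split_model_name(model))
```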

    In-place result in clip_client; preserve output order by uuid (#815)

    The clip_client module now supports in-place embedding. This means the result of a call to the CLIP server to get embeddings is stored in the input DocumentArray, instead of creating a new DocumentArray. Consequently, the DocumentArray returned by a call to Client.encode now has the same order as the input DocumentArray.

    This could cause a breaking change if code depends on Client.encode to return a new DocumentArray instance.

    If you run the following code, you can verify that the input DocumentArray now contains the embeddings and that the order is unchanged.

    from docarray import DocumentArray, Document
    from clip_client import Client
    
    c = Client('grpc://0.0.0.0:51000')
    
    # Wrap the inputs in a DocumentArray so that .embeddings is available after encoding
    da = DocumentArray(
        [
            Document(text='she smiled, with pain'),
            Document(uri='apple.png'),
            Document(uri='apple.png').load_uri_to_image_tensor(),
            Document(blob=open('apple.png', 'rb').read()),
            Document(uri='https://clip-as-service.jina.ai/_static/favicon.png'),
            Document(
                uri=''
            ),
        ]
    )
    
    c.encode(da)
    print(da.embeddings)
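
    The idea of writing results back in place, matched by id so that the input order is preserved, can be sketched without any server. The example below uses plain dictionaries instead of Documents; write_back_in_place and the sample data are hypothetical and only illustrate the behaviour described above, not the clip_client internals.

```python
from typing import Dict, List


def write_back_in_place(inputs: List[dict], responses: List[dict]) -> None:
    """Attach each returned embedding to the input item with the same id."""
    by_id: Dict[str, dict] = {doc["id"]: doc for doc in inputs}
    for resp in responses:
        # Responses may arrive in any order; the id decides where the result goes.
        by_id[resp["id"]]["embedding"] = resp["embedding"]


docs = [{"id": "a", "text": "hello"}, {"id": "b", "text": "world"}]

# Simulated responses arriving out of order.
responses = [
    {"id": "b", "embedding": [0.3, 0.4]},
    {"id": "a", "embedding": [0.1, 0.2]},
]

write_back_in_place(docs, responses)
print([d["id"] for d in docs])
print([d["embedding"] for d in docs])
```

    Because the input list itself is mutated, code that expected a new object to be returned needs to read the embeddings from the original container instead.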
    

    🚀 Performance

    Drop image content to boost latency (#824)

    Calls to Client.encode no longer return the input image with the embedding. Since embeddings are now inserted into the original DocumentArray instance, this is unnecessary network traffic. As a result, the system is now faster and more responsive. Performance improvement is dependent on the size of the image and network bandwidth.

    📗 Documentation Improvements

    CLIP benchmark on zero-shot classification and retrieval tasks (#832)

    We now provide benchmark information for CLIP models on zero-shot classification and retrieval tasks. This information should help users to choose the best CLIP model for their specific use-cases. For more details, please read the Benchmark page in the CLIP-as-Service User Guide.

    🤟 Contributors

    We would like to thank all contributors to this release:

    • felix-wang (@numb3r3)
    • Ziniu Yu (@ZiniuYu)
    • Jie Fu (@jemmyshin)

  • v0.7.0(Sep 13, 2022)

    Release Note (0.7.0)

    Release time: 2022-09-13 13:47:54

    🙇 We'd like to thank all contributors for this new release! In particular, numb3r3, felix-wang, Jie Fu, Ziniu Yu, Jina Dev Bot, 🙇

    🆕 New Features

    • [a07a5218] - support clip retrieval (#816) (felix-wang)

    🐞 Bug fixes

    • [213ecc28] - always return docarray as search result (#821) (felix-wang)
    • [eca57745] - readme: use new demo server (#819) (felix-wang)

    📗 Documentation

    • [8d9725fb] - update clip search (#820) (felix-wang)
    • [fa7e5776] - docs for retrieval (#808) (Jie Fu)
    • [47144c23] - enable horizontal scrolling in wide tables (#818) (Ziniu Yu)

    🍹 Other Improvements

    • [53636cea] - bump version to 0.7.0 (numb3r3)
    • [eda4aa8e] - version: the next version will be 0.6.3 (Jina Dev Bot)
    • [f7ee26a1] - improve model not found error msg (#812) (Ziniu Yu)
  • v0.6.2(Sep 1, 2022)

    Release Note (0.6.2)

    Release time: 2022-09-01 04:16:27

    🙇 We'd like to thank all contributors for this new release! In particular, Ziniu Yu, Jina Dev Bot, felix-wang, 🙇

    🐞 Bug fixes

    • [ea239685] - grpc meta auth (#811) (felix-wang)

    📗 Documentation

    • [4461d2e9] - update model support table (#813) (Ziniu Yu)

    🍹 Other Improvements

    • [f7ee26a1] - improve model not found error msg (#812) (Ziniu Yu)
    • [f1c0057d] - version: the next version will be 0.6.2 (Jina Dev Bot)
  • v0.6.1(Aug 30, 2022)

    Release Note (0.6.1)

    Release time: 2022-08-30 13:57:32

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Jina Dev Bot, numb3r3, 🙇

    🐞 Bug fixes

    • [ea239685] - grpc meta auth (#811) (felix-wang)

    🍹 Other Improvements

    • [83a8120c] - version: the next version will be 0.6.1 (Jina Dev Bot)
    • [2a80235c] - bump version to 0.6.0 (numb3r3)
  • v0.6.0(Aug 30, 2022)

    Release Note (0.6.0)

    Release time: 2022-08-30 04:19:21

    🙇 We'd like to thank all contributors for this new release! In particular, numb3r3, Ziniu Yu, felix-wang, Jina Dev Bot, 🙇

    🆕 New Features

    • [3c43eed3] - do not send blob from server when it is loaded in client (#804) (Ziniu Yu)
    • [f852dfc8] - add warning if input is too large (#796) (Ziniu Yu)
    • [65032f02] - encode text first when both text and uri are presented (#795) (Ziniu Yu)

    🐞 Bug fixes

    • [bb2c142b] - cast dtype for fp16 (#801) (felix-wang)

    📗 Documentation

    • [a5893c70] - update jcloud gpu usage (#809) (Ziniu Yu)
    • [b4fb0dd2] - fix hub table typo (#803) (Ziniu Yu)

    🍹 Other Improvements

    • [2a80235c] - bump version to 0.6.0 (numb3r3)
    • [59b9f771] - update protobuf version (#810) (Ziniu Yu)
    • [89205f06] - update executor docstring (#806) (Ziniu Yu)
    • [25c91e21] - version: the next version will be 0.5.2 (Jina Dev Bot)
  • v0.5.1(Aug 8, 2022)

    Release Note (0.5.1)

    Release time: 2022-08-08 05:11:18

    🙇 We'd like to thank all contributors for this new release! In particular, Ziniu Yu, Jina Dev Bot, numb3r3, 🙇

    🆕 New Features

    • [65032f02] - encode text first when both text and uri are presented (#795) (Ziniu Yu)

    📗 Documentation

    • [7c6708fa] - update hub readme (#794) (Ziniu Yu)

    🍹 Other Improvements

    • [a7c4f490] - version: the next version will be 0.5.1 (Jina Dev Bot)
    • [b00963c4] - bump version to 0.5.0 (numb3r3)
  • v0.5.0(Aug 3, 2022)

    Release Note (0.5.0)

    Release time: 2022-08-03 05:13:06

    🙇 We'd like to thank all contributors for this new release! In particular, numb3r3, Ziniu Yu, Alex Shan, felix-wang, Sha Zhou, Jina Dev Bot, Han Xiao, 🙇

    🆕 New Features

    • [3402b1d1] - replace traversal_paths with access_paths (#791) (Ziniu Yu)
    • [87928a7b] - update onnx models and md5 (#785) (Ziniu Yu)
    • [8bd83896] - support onnx backend for openclip (#781) (felix-wang)
    • [f043b4d9] - update openclip loader (#782) (Alex Shan)
    • [fa62d8e9] - support openclip&mclip models + refactor model loader (#774) (Alex Shan)
    • [32b11cd6] - allow model selection in client (#775) (Ziniu Yu)
    • [0ff4e252] - allow credential in client (#765) (Ziniu Yu)
    • [ee7da10d] - support custom onnx file and update model signatures (#761) (Ziniu Yu)
    • [ed1b92d1] - docs: add qabot (#759) (Sha Zhou)

    🐞 Bug fixes

    • [e48a7a38] - change onnx and trt default model name to ViT-B-32::openai (#793) (Ziniu Yu)
    • [8b8082a9] - mclip cuda device (#792) (felix-wang)
    • [8681b88e] - fp16 inference (#790) (felix-wang)
    • [ab00c2ae] - upgrade jina (#788) (felix-wang)
    • [1db43b48] - no allow client to change server batch size (#787) (Ziniu Yu)
    • [58772079] - add models and md5 (#783) (Ziniu Yu)
    • [7c8285bb] - async progress bar does not display (#779) (Ziniu Yu)
    • [79e85eed] - miscalling clip_server in clip_client (Han Xiao)

    📗 Documentation

    • [c67a7f59] - add model support (#784) (Alex Shan)
    • [bc6b72e6] - add finetuner docs (#771) (Ziniu Yu)
    • [2b78b12e] - improve model support (#768) (Ziniu Yu)

    🍹 Other Improvements

    • [b00963c4] - bump version to 0.5.0 (numb3r3)
    • [c458dd65] - remove clip_hg (#786) (Ziniu Yu)
    • [ca03dca3] - fix markdown-table extention (#772) (felix-wang)
    • [7b19bffe] - version: the next version will be 0.4.21 (Jina Dev Bot)
  • v0.4.20(Jun 21, 2022)

    Release Note (0.4.20)

    Release time: 2022-06-21 15:45:06

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [79e85eed] - miscalling clip_server in clip_client (Han Xiao)

    📗 Documentation

    • [6e054db8] - read config from stdin to allow pipe (Han Xiao)

    🍹 Other Improvements

    • [c3e75133] - version: the next version will be 0.4.20 (Jina Dev Bot)
  • v0.4.19(Jun 20, 2022)

    Release Note (0.4.19)

    Release time: 2022-06-20 16:32:32

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    🆕 New Features

    • [6902d2df] - read config from stdin to allow pipe (#758) (Han Xiao)

    📗 Documentation

    • [6e054db8] - read config from stdin to allow pipe (Han Xiao)

    🍹 Other Improvements

    • [4a298d4f] - add docker image docs (Han Xiao)
    • [1e931e8b] - version: the next version will be 0.4.19 (Jina Dev Bot)
    • [a0c2661b] - fix tag docker build job (Han Xiao)
  • v0.4.18(Jun 20, 2022)

    Release Note (0.4.18)

    Release time: 2022-06-20 11:21:16

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    🍹 Other Improvements

    • [a0c2661b] - fix tag docker build job (Han Xiao)
    • [23f738ec] - version: the next version will be 0.4.18 (Jina Dev Bot)
    • [9e469bf7] - fix readme (Han Xiao)
  • v0.4.17(Jun 20, 2022)

    Release Note (0.4.17)

    Release time: 2022-06-20 10:56:12

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Ziniu Yu, numb3r3, felix-wang, Jina Dev Bot, 🙇

    🆕 New Features

    • [03541dd7] - add cas server dockerfile (#757) (Han Xiao)
    • [4d069a84] - upload torch executor (#723) (Ziniu Yu)

    🐞 Bug fixes

    • [eca1e700] - add integerate test for client (#753) (felix-wang)

    📗 Documentation

    • [7c2faae2] - update jcloud docs (#754) (Ziniu Yu)
    • [9d872f2e] - add disk usage / memory usage benchmark table (#751) (Ziniu Yu)

    🍹 Other Improvements

    • [9e469bf7] - fix readme (Han Xiao)
    • [4c4e74b2] - upload executor in cd workflow (numb3r3)
    • [96923f12] - fix docker cd (#755) (felix-wang)
    • [1869e61f] - add visual reasoning to docs (Han Xiao)
    • [2083f097] - version: the next version will be 0.4.17 (Jina Dev Bot)
  • v0.4.16(Jun 14, 2022)

    Release Note (0.4.16)

    Release time: 2022-06-14 08:52:07

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Ziniu Yu, Han Xiao, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [eca1e700] - add integerate test for client (#753) (felix-wang)
    • [b5c339fe] - fix client concurrent issue (#752) (Ziniu Yu)

    🍹 Other Improvements

    • [e5ab22f5] - update slack (Han Xiao)
    • [5503becb] - fix docs (Han Xiao)
    • [909cdb11] - add cas on colab section (Han Xiao)
    • [3d3ef936] - version: the next version will be 0.4.16 (Jina Dev Bot)
  • v0.4.15(Jun 13, 2022)

    Release Note (0.4.15)

    Release time: 2022-06-13 13:06:16

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, felix-wang, Ziniu Yu, Jina Dev Bot, 🙇

    🆕 New Features

    • [e022bd46] - add traversal paths (#750) (felix-wang)
    • [4fe5a1b1] - add traversal paths (#748) (felix-wang)

    🐞 Bug fixes

    • [752202f8] - monitor documentation (#745) (felix-wang)

    🍹 Other Improvements

    • [dab8341e] - add cas on colab section (Han Xiao)
    • [29bd68a4] - add replicas field in all yamls (Han Xiao)
    • [d5be8c2f] - Revert "feat: add traversal paths (#748)" (#749) (Han Xiao)
    • [7f2d8fe8] - update links in docs (#747) (Ziniu Yu)
    • [52a8b0a6] - version: the next version will be 0.4.15 (Jina Dev Bot)
  • v0.4.14(Jun 9, 2022)

    Release Note (0.4.14)

    Release time: 2022-06-09 13:39:46

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [752202f8] - monitor documentation (#745) (felix-wang)

    🧼 Code Refactoring

    • [5eb5d7e8] - monitor (#743) (felix-wang)

    🍹 Other Improvements

    • [06097f20] - version: the next version will be 0.4.14 (Jina Dev Bot)
  • v0.4.13(Jun 9, 2022)

    Release Note (0.4.13)

    Release time: 2022-06-09 04:42:07

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Ziniu Yu, Han Xiao, Jina Dev Bot, 🙇

    🆕 New Features

    • [d675148b] - add clip_hg executor (#740) (Ziniu Yu)

    🧼 Code Refactoring

    • [5eb5d7e8] - monitor (#743) (felix-wang)

    📗 Documentation

    • [130108c1] - add JCloud deployment docs (#739) (Ziniu Yu)
    • [5e06667a] - update monitoring feature (#737) (felix-wang)

    🍹 Other Improvements

    • [4b88e992] - fix docs (Han Xiao)
    • [b130d645] - add grafana dashboard (#741) (felix-wang)
    • [12ede839] - version: the next version will be 0.4.13 (Jina Dev Bot)
  • v0.4.12(Jun 1, 2022)

    Release Note (0.4.12)

    Release time: 2022-06-01 08:28:41

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Ziniu Yu, Jina Dev Bot, samsja, 🙇

    🆕 New Features

    • [60a986a0] - add monitoring (#674) (samsja)

    🐞 Bug fixes

    • [bb8c4ce0] - better monitoring (#738) (felix-wang)
    • [751cf9de] - does not require port (#735) (Ziniu Yu)

    📗 Documentation

    • [5e06667a] - update monitoring feature (#737) (felix-wang)

    🍹 Other Improvements

    • [b523c624] - version: the next version will be 0.4.12 (Jina Dev Bot)
  • v0.4.11(May 27, 2022)

    Release Note (0.4.11)

    Release time: 2022-05-27 07:44:46

    🙇 We'd like to thank all contributors for this new release! In particular, samsja, Shubham Goel, Han Xiao, Ziniu Yu, Roshan Jossy, Jina Dev Bot, 🙇

    🆕 New Features

    • [60a986a0] - add monitoring (#674) (samsja)

    🐞 Bug fixes

    • [59f48e60] - windows file name conflict (#729) (Ziniu Yu)
    • [0054b47c] - server: fix content assignment (#727) (Han Xiao)

    📗 Documentation

    • [2f3a2077] - tracking: remove utm source in links (#728) (Roshan Jossy)

    🍹 Other Improvements

    • [c7c96251] - Corrected replicas indentation in server.md (#731) (Shubham Goel)
    • [8d112275] - fix docs (Han Xiao)
    • [7323d99e] - version: the next version will be 0.4.11 (Jina Dev Bot)
  • v0.4.10(May 24, 2022)

    Release Note (0.4.10)

    Release time: 2022-05-24 07:46:48

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [0054b47c] - server: fix content assignment (#727) (Han Xiao)
    • [a7311fbf] - server: recover original contents of the input da (#726) (Han Xiao)

    🍹 Other Improvements

    • [926621bc] - version: the next version will be 0.4.10 (Jina Dev Bot)
  • v0.4.9(May 23, 2022)

    Release Note (0.4.9)

    Release time: 2022-05-23 15:13:23

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, numb3r3, felix-wang, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [a7311fbf] - server: recover original contents of the input da (#726) (Han Xiao)
    • [42ef75b1] - server: remove embeddings to save bandwidth (Han Xiao)
    • [2d2da147] - docker push cd (numb3r3)
    • [994635fa] - k8s dockerize (#725) (felix-wang)
    • [d12c5115] - docker file (#719) (felix-wang)
    • [65991a3f] - client: fix https args to tls (#722) (Han Xiao)

    🍹 Other Improvements

    • [b6adcf8b] - docs: add multi gpu setting (Han Xiao)
    • [3d8c552a] - version: the next version will be 0.4.9 (Jina Dev Bot)
  • v0.4.8(May 13, 2022)

    Release Note (0.4.8)

    Release time: 2022-05-13 09:24:42

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, numb3r3, felix-wang, Jina Dev Bot, 🙇

    ⚡ Performance Improvements

    • [72f1bc4a] - server: use await gather in rank function (#716) (Han Xiao)

    🐞 Bug fixes

    • [65991a3f] - client: fix https args to tls (#722) (Han Xiao)
    • [1002a913] - docker release cd (#717) (felix-wang)
    • [71d2c867] - docker build push (#714) (felix-wang)

    🏁 Unit Test and CICD

    • [38043676] - fix force release (numb3r3)

    🍹 Other Improvements

    • [0da311e4] - docs: change http to https (Han Xiao)
    • [741ad796] - docs: add playground (Han Xiao)
    • [a2b6d337] - version: the next version will be 0.4.8 (Jina Dev Bot)
  • v0.4.7(May 11, 2022)

    Release Note (0.4.7)

    Release time: 2022-05-11 16:25:08

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    ⚡ Performance Improvements

    • [72f1bc4a] - server: use await gather in rank function (#716) (Han Xiao)
    • [cda93fdd] - server: use await gather in rank function (Han Xiao)

    🍹 Other Improvements

    • [66b14fc6] - version: the next version will be 0.4.7 (Jina Dev Bot)
  • v0.4.6(May 11, 2022)

    Release Note (0.4.6)

    Release time: 2022-05-11 15:10:52

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    ⚡ Performance Improvements

    • [cda93fdd] - server: use await gather in rank function (Han Xiao)

    🐞 Bug fixes

    • [6ed4c484] - convert distance to score (Han Xiao)

    🍹 Other Improvements

    • [06fcd07b] - version: the next version will be 0.4.6 (Jina Dev Bot)
  • v0.4.5(May 11, 2022)

    Release Note (0.4.5)

    Release time: 2022-05-11 12:10:29

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [6ed4c484] - convert distance to score (Han Xiao)

    🧼 Code Refactoring

    • [59c06986] - server: remove redundant logics of rank (#715) (Han Xiao)

    🍹 Other Improvements

    • [d565d31f] - version: the next version will be 0.4.5 (Jina Dev Bot)
  • v0.4.4(May 11, 2022)

    Release Note (0.4.4)

    Release time: 2022-05-11 12:00:46

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, felix-wang, Jina Dev Bot, 🙇

    🆕 New Features

    • [edf0d862] - add dockerfiles and cd workflow (#712) (felix-wang)

    🐞 Bug fixes

    • [bb520d14] - keep logit_scale on same device (#710) (felix-wang)

    🧼 Code Refactoring

    • [59c06986] - server: remove redundant logics of rank (#715) (Han Xiao)

    🍹 Other Improvements

    • [72d69c75] - docs: update readme (Han Xiao)
    • [f898c8ce] - version: the next version will be 0.4.4 (Jina Dev Bot)
  • v0.4.3(May 9, 2022)

    Release Note (0.4.3)

    Release time: 2022-05-09 10:23:15

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Roshan Jossy, Han Xiao, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [bb520d14] - keep logit_scale on same device (#710) (felix-wang)
    • [835eb13f] - use cosine as the rank score (#708) (felix-wang)

    📗 Documentation

    • [da87d13a] - tracking: update external links' source (#711) (Roshan Jossy)

    🍹 Other Improvements

    • [099e2218] - docs: update readme (Han Xiao)
    • [ce5806d3] - update index html (#709) (felix-wang)
    • [a1651079] - version: the next version will be 0.4.3 (Jina Dev Bot)
  • v0.4.2(May 9, 2022)

    Release Note (0.4.2)

    Release time: 2022-05-09 05:32:39

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Han Xiao, Jina Dev Bot, 🙇

    🆕 New Features

    • [f66b145b] - add ranker endpoint for all backends (#707) (felix-wang)

    🐞 Bug fixes

    • [835eb13f] - use cosine as the rank score (#708) (felix-wang)

    🍹 Other Improvements

    • [706fa624] - docs: update readme (Han Xiao)
    • [7fd04d2d] - docs: add cas async usage to readme (Han Xiao)
    • [90bb4c5c] - version: the next version will be 0.4.2 (Jina Dev Bot)
  • v0.4.1(May 4, 2022)

    Release Note (0.4.1)

    Release time: 2022-05-04 17:38:48

    🙇 We'd like to thank all contributors for this new release! In particular, felix-wang, Han Xiao, Jina Dev Bot, 🙇

    🆕 New Features

    • [f66b145b] - add ranker endpoint for all backends (#707) (felix-wang)
    • [f7b9af40] - add tensorrt support (#688) (felix-wang)
    • [33efcb00] - add async rerank (#701) (Han Xiao)

    🐞 Bug fixes

    • [618dbdb2] - cd workflow (#706) (felix-wang)

    🍹 Other Improvements

    • [3f34d46d] - docs: add cas async usage to readme (Han Xiao)
    • [0f941660] - version: the next version will be 0.4.1 (Jina Dev Bot)
  • v0.4.0(Apr 30, 2022)

    Release Note (0.4.0)

    Release time: 2022-04-30 20:25:29

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    🆕 New Features

    • [33efcb00] - add async rerank (#701) (Han Xiao)
    • [12d33c49] - add async rerank (Han Xiao)

    🧼 Code Refactoring

    • [050c34e0] - use packaging instead of distutil (#700) (Han Xiao)

    🍹 Other Improvements

    • [20e66b95] - version: the next version will be 0.3.6 (Jina Dev Bot)
  • v0.3.5(Apr 30, 2022)

    Release Note (0.3.5)

    Release time: 2022-04-30 18:55:10

    🙇 We'd like to thank all contributors for this new release! In particular, Han Xiao, Jina Dev Bot, 🙇

    🐞 Bug fixes

    • [8ac2e9bb] - torch: fix oom in rerank endpoint (#699) (Han Xiao)

    🧼 Code Refactoring

    • [050c34e0] - use packaging instead of distutil (#700) (Han Xiao)

    🍹 Other Improvements

    • [d2c2c872] - version: the next version will be 0.3.5 (Jina Dev Bot)
Owner
Han Xiao
Founder & CEO @jina-ai | We're hiring 👐
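A Chinese word segmenter based on CRF

Genius Genius is an open-source Python component for Chinese word segmentation that uses the CRF (Conditional Random Field) algorithm. Features: supports Python 2.x, Python 3.x, and PyPy 2.x; supports simple pinyin segmentation; supports user-defined breaks; supports user-defined merged words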

duanhongyi 237 Nov 04, 2022
SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

Meta Research 363 Dec 30, 2022
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022
PyJPBoatRace: Python-based Japanese boatrace tools 🚤

pyjpboatrace :speedboat: provides you with useful tools for data analysis and auto-betting for boatrace.

5 Oct 29, 2022
Code for Editing Factual Knowledge in Language Models

KnowledgeEditor Code for Editing Factual Knowledge in Language Models (https://arxiv.org/abs/2104.08164). @inproceedings{decao2021editing, title={Ed

Nicola De Cao 86 Nov 28, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022
An automated program that helps customers of Pizza Palour place their pizza orders

PIzza_Order_Assistant Introduction An automated program that helps customers of Pizza Palour place their pizza orders. The program uses voice commands

Tindi Sommers 1 Dec 26, 2021
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification, seq2seq, attention, beam search

multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification, seq2seq, attention, beam search

hellonlp 30 Dec 12, 2022
Shared, streaming Python dict

UltraDict Synchronized, streaming Python dictionary that uses shared memory as a backend Warning: This is an early hack. There are only few unit tests

Ronny Rentner 192 Dec 23, 2022
The entmax mapping and its loss, a family of sparse softmax alternatives.

entmax This package provides a PyTorch implementation of entmax and entmax losses: a sparse family of probability mappings and corresponding loss func

DeepSPIN 330 Dec 22, 2022
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
A Python package implementing a new model for text classification with visualization tools for Explainable AI

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 02, 2023
The on-the-fly Pieton translator. Used worldwide by more than 1,000 successful companies!

Lau 1 Dec 17, 2021
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 07, 2022
A PyTorch-based model pruning toolkit for pre-trained language models

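TextPruner is a model pruning toolkit designed for pre-trained language models. It applies lightweight, fast structured pruning to reduce model size and increase model speed. Other related resources: the knowledge distillation toolkit TextBrewer: https://github.com/airaria/TextBrewe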

Ziqing Yang 231 Jan 08, 2023
ConvBERT-Prod

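ConvBERT Table of Contents 0. Repository structure 1. Introduction 2. Dataset and reproduced accuracy 3. Preparing the data and environment 3.1 Preparing the environment 3.2 Preparing the data 3.3 Preparing the model 4. Getting started 4.1 Model training 4.2 Model evaluation 4.3 Model prediction 5. Model inference and deployment 5.1 Inference based on Inference 5.2 Based on Serv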

yujun 7 Apr 08, 2022
This plugin is a remake of the pcrjjc plugin and can run independently of a backend API

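pcrjjc2 This plugin is a remake of pcrjjc. It does not need any other backend API, but you must configure the client yourself. This project is open-sourced under the AGPL v3 license; because of the nature of the project, any commercial use based on it is prohibited. Configuration: Environment requirements: .NET Framework 4.5 or later, JRE 8. Don't forget to install JRE 8. Don't forget to install JRE 8. Don't forget to install JRE 8.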

132 Dec 26, 2022
AI and Machine Learning workflows on Anthos Bare Metal.

Hybrid and Sovereign AI on Anthos Bare Metal Table of Contents Overview Terraform as IaC Substrate ABM Cluster on GCE using Terraform TensorFlow ResNe

Google Cloud Platform 8 Nov 26, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022
Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2.

Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2. It is trained (finetuned) on a curated list of approximately 45K Python (~470MB) files gathered from the

Galois Autocompleter 91 Sep 23, 2022