Model Serving made Efficient in the Cloud.
Introduction
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backends and microservices. It bridges the gap between the machine learning model you just trained and an efficient online service API.
- Highly performant: web layer and task coordination built with Rust 🦀, which offers blazing speed in addition to efficient CPU utilization powered by async I/O
- Ease of use: user interface purely in Python 🐍, by which users can serve their models in an ML framework-agnostic manner using the same code as they do for offline testing
- Dynamic batching: aggregate requests from different users for batched inference and distribute results back
- Pipelined stages: spawn multiple processes for pipelined stages to handle CPU/GPU/IO mixed workloads (a minimal sketch follows this list)
- Cloud friendly: designed to run in the cloud, with model warmup, graceful shutdown, and Prometheus monitoring metrics, easily managed by Kubernetes or any container orchestration system
- Do one thing well: focus on the online serving part, so users can pay attention to model performance and business logic
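To make the pipelined-stages and dynamic-batching features concrete, here is a minimal sketch of a two-stage server. The worker names Preprocess and Inference are illustrative, not part of the library; append_worker and its num argument appear in the usage section below, while the max_batch_size argument and the list-in/list-out convention for batched workers are assumptions based on the dynamic batching feature described above.

from mosec import Server, Worker


class Preprocess(Worker):
    def forward(self, req: dict) -> dict:
        # CPU-bound stage: runs in several processes in parallel
        return {"x": float(req["x"])}


class Inference(Worker):
    def forward(self, batch: list) -> list:
        # batched stage: with batching enabled, forward receives a
        # list of inputs and must return a list of the same length
        return [{"y": item["x"] * 2} for item in batch]


if __name__ == "__main__":
    server = Server()
    server.append_worker(Preprocess, num=4)                   # scale the CPU stage
    server.append_worker(Inference, num=1, max_batch_size=8)  # batch the model stage
    server.run()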
Installation
Mosec requires Python 3.6 or above. Install the latest PyPI package with:
pip install -U mosec
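To verify the installation, you can ask pip to show the installed package metadata:

pip show mosec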
Usage
Write the server
Import the libraries and set up a basic logger to better observe what happens.
import logging

from mosec import Server, Worker
from mosec.errors import ValidationError

# configure the root logger so every worker process reports
# timestamps, process IDs, and source locations
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
formatter = logging.Formatter(
    "%(asctime)s - %(process)d - %(levelname)s - %(filename)s:%(lineno)s - %(message)s"
)
sh = logging.StreamHandler()
sh.setFormatter(formatter)
logger.addHandler(sh)
Then, we build an API to calculate the exponential with base e for a given number. To achieve that, we simply inherit the Worker class and override the forward method. Note that the input req is by default a JSON-decoded object, e.g., a dictionary here (ideally it receives data like {"x": 1}). We also enclose the input-parsing part in a try...except block to reject invalid input (e.g., a missing key "x", or a field "x" that cannot be converted to float).
import math


class CalculateExp(Worker):
    def forward(self, req: dict) -> dict:
        try:
            x = float(req["x"])
        except KeyError:
            raise ValidationError("cannot find key 'x'")
        except ValueError:
            raise ValidationError("cannot convert 'x' value to float")
        y = math.exp(x)  # f(x) = e ^ x
        logger.debug(f"e ^ {x} = {y}")
        return {"y": y}
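Raising a ValidationError inside forward short-circuits the request: the server rejects it with a client-error response carrying the error message, so invalid input never reaches the computation that follows.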
Finally, we append the worker to the server to construct a single-stage workflow, and we specify the number of processes we want it to run in parallel. Then we run the server.
if __name__ == "__main__":
    server = Server()
    # spawn two processes for parallel computing
    server.append_worker(CalculateExp, num=2)
    server.run()
Run the server
After merging the snippets above into a file named server.py, we can first have a look at the command line arguments:
python server.py --help
Then let's start the server...
python server.py
and in another terminal, test it:
curl -X POST http://127.0.0.1:8000/inference -d '{"x": 2}'
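If everything is up, you should receive the computed result e^2, in a response similar to:

{"y": 7.38905609893065}

You can also exercise the validation path with malformed input, which should be rejected with a client error:

curl -X POST http://127.0.0.1:8000/inference -d '{"z": 2}'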
or check the metrics:
curl http://127.0.0.1:8000/metrics
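The metrics endpoint exposes request counters and latency statistics in the Prometheus text exposition format, ready to be scraped by a Prometheus server.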
That's it! You have just hosted your exponential-computing model as a server!
Example
More ready-to-use examples can be found in the Example section, including:
- Multi-stage workflow
- Batch processing worker
- PyTorch deep learning models:
  - sentiment analysis
  - image recognition
Contributing
We welcome any kind of contribution. Please give us feedback by raising issues, or contribute your code directly by opening a pull request!