Run faster with Go: use Golang to serve machine learning

2022-07-05

So our requirement is to serve 3 million predictions per second with as few resources as possible. Thankfully, this is a relatively simple kind of recommender model: a multi-armed bandit (MAB). Multi-armed bandits usually involve sampling from a distribution such as the Beta distribution, and that sampling is where most of the time goes. If we can run as many samples concurrently as possible, we make good use of the hardware. Maximizing resource utilization is the key to reducing the total resources the model needs.
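To make the sampling step concrete, here is a minimal sketch of Thompson sampling over Beta posteriors using only the Go standard library. It is not the production sampler described in this post (that one lives in C++). The names `Arm`, `newRNG`, `sampleGamma`, `sampleBeta` and `pickArm` are hypothetical, and the Beta draw is built from two Gamma draws generated with the Marsaglia-Tsang method.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// newRNG returns a seeded random source (hypothetical helper).
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// sampleGamma draws from Gamma(shape, 1) with the Marsaglia-Tsang method.
func sampleGamma(rng *rand.Rand, shape float64) float64 {
	if shape < 1 {
		// Boost small shapes: Gamma(a) = Gamma(a+1) * U^(1/a), with U in (0, 1].
		u := 1 - rng.Float64()
		return sampleGamma(rng, shape+1) * math.Pow(u, 1/shape)
	}
	d := shape - 1.0/3.0
	c := 1 / math.Sqrt(9*d)
	for {
		x := rng.NormFloat64()
		v := 1 + c*x
		if v <= 0 {
			continue
		}
		v = v * v * v
		u := 1 - rng.Float64()
		if math.Log(u) < 0.5*x*x+d-d*v+d*math.Log(v) {
			return d * v
		}
	}
}

// sampleBeta draws from Beta(alpha, beta) as X/(X+Y),
// where X ~ Gamma(alpha, 1) and Y ~ Gamma(beta, 1).
func sampleBeta(rng *rand.Rand, alpha, beta float64) float64 {
	x := sampleGamma(rng, alpha)
	y := sampleGamma(rng, beta)
	return x / (x + y)
}

// Arm holds the Beta posterior parameters of one candidate item.
type Arm struct {
	Alpha, Beta float64
}

// pickArm samples a score for every arm and returns the index of the
// highest one. This is one round of Thompson sampling.
func pickArm(rng *rand.Rand, arms []Arm) int {
	best, bestScore := -1, -1.0
	for i, a := range arms {
		if s := sampleBeta(rng, a.Alpha, a.Beta); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	rng := newRNG(42)
	arms := []Arm{{Alpha: 2, Beta: 8}, {Alpha: 5, Beta: 5}, {Alpha: 9, Beta: 1}}
	counts := make([]int, len(arms))
	for i := 0; i < 10000; i++ {
		counts[pickArm(rng, arms)]++
	}
	// The last arm has the highest posterior mean, so it should win most often.
	fmt.Println(counts)
}
```

Each `pickArm` call scores one request; scoring thousands of pairs per request means thousands of independent Beta draws, which is why the sampler is the hot spot.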

Our current prediction services are microservices written in Python, and they follow this general structure:

request -> feature fetch -> predict -> post-processing -> return

A single request may ask us to score thousands of (user, content) pairs. To get good performance out of Python despite the GIL, we implemented batch sampling in Cython and C++, and to get around the GIL we run multiple workers, sized to the number of cores, to handle requests concurrently.

Currently, a single-node Python service can handle 192 RPS, with about 400 pairs per request. Average CPU utilization is only about 20%. The limiting factors now are the language, the serving framework, and the network calls to the feature store.

Why Golang?

Golang is a statically typed language with great tooling. This means errors are caught early and code is easy to refactor. Concurrency is native to Golang, which matters both for machine learning algorithms that can run in parallel and for concurrent network calls to the feature store. It is one of the fastest languages for serving in benchmarks. It is also a compiled language, so it can be optimized at compile time.

Porting the existing MAB to Golang

The basic idea is to divide the system into 3 parts:

  • A basic REST API with stubs for prediction and health
  • Feature store fetching, implemented as its own module
  • Lifting and shifting the C++ sampling code using cgo

The first part was easy. I chose the Fiber framework for the REST API. It seemed to be the most popular, it is well documented, and its API is similar to Express.js. It also performs quite well in benchmarks.

Early code :

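package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/gofiber/fiber/v2"
	"github.com/gofiber/fiber/v2/middleware/recover"
	// plus the project's own model package (import path not shown in the original)
)

func main() {
	// set up fiber
	app := fiber.New()
	// catch all panics (the recover middleware is assumed here)
	app.Use(recover.New())
	// load model struct
	ctx := context.Background()
	md, err := model.NewModel(ctx)
	if err != nil {
		// keep serving so that /health can report the failure
		log.Printf("model couldn't load: %v", err)
	} else {
		defer md.Close()
	}

	// health API
	app.Get("/health", func(c *fiber.Ctx) error {
		if err != nil {
			// status code is an assumption; it is missing from the original
			return fiber.NewError(fiber.StatusServiceUnavailable,
				fmt.Sprintf("Model couldn't load: %v", err))
		}
		// response body is an assumption; it is missing from the original
		return c.JSON(&fiber.Map{"status": "ok"})
	})
	// predict API
	app.Post("/predict", func(c *fiber.Ctx) error {
		var request map[string]interface{}
		if err := json.Unmarshal(c.Body(), &request); err != nil {
			return err
		}
		return c.JSON(md.Predict(request))
	})
	// port is an assumption; the original does not show the Listen call
	log.Fatal(app.Listen(":8000"))
}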

That's it, part one done. It took less than an hour.

The second part required learning a bit about how to write structs with methods and goroutines. One of the main differences from C++ and Python is that Golang does not support full object-oriented programming; mainly, it has no inheritance. The way methods are defined on structs is also completely different from the other languages I have encountered.

The feature store we use has a Golang client, so all I had to do was write a wrapper around it that reads a large number of entities concurrently.

The basic structure I want is :

type VertexFeatureStoreClient struct {
	// client: reference to GCP's client
}

func NewVertexFeatureStoreClient(ctx context.Context) (*VertexFeatureStoreClient, error) {
	// client creation code (elided in the original)
}

func (vfs *VertexFeatureStoreClient) GetFeaturesByIdsChunk(ctx context.Context, featurestore, entityName string, entityIds []string, featureList []string) (map[string]map[string]interface{}, error) {
	// fetch code for 100 items (elided in the original)
}

func (vfs *VertexFeatureStoreClient) GetFeaturesByIds(ctx context.Context, featurestore, entityName string, entityIds []string, featureList []string) (map[string]map[string]interface{}, error) {
	const chunkSize = 100 // limit from GCP
	// Run each chunk fetch concurrently. The channels are buffered so that
	// goroutines can still send (and exit) after an early error return.
	numChunks := (len(entityIds) + chunkSize - 1) / chunkSize
	featureChannel := make(chan map[string]map[string]interface{}, numChunks)
	errorChannel := make(chan error, numChunks)
	for i := 0; i < len(entityIds); i += chunkSize {
		end := i + chunkSize
		if end > len(entityIds) {
			end = len(entityIds)
		}
		go func(ents []string) {
			features, err := vfs.GetFeaturesByIdsChunk(ctx, featurestore, entityName, ents, featureList)
			if err != nil {
				errorChannel <- err
				return
			}
			featureChannel <- features
		}(entityIds[i:end])
	}
	// Collect exactly one message per chunk and stop at the first error.
	results := make(map[string]map[string]interface{}, len(entityIds))
	for count := numChunks; count > 0; count-- {
		select {
		case err := <-errorChannel:
			return nil, err
		case res := <-featureChannel:
			for k, v := range res {
				results[k] = v
			}
		}
	}
	return results, nil
}

func (vfs *VertexFeatureStoreClient) Close() error {
	// close code (elided in the original)
}

A tip about goroutines

Use channels as much as possible. Many tutorials use sync WaitGroups with goroutines. Those are lower-level APIs that you won't need in most cases. Channels are an elegant way to run goroutines: even if you don't need to pass data, you can send flags through a channel and collect them. Goroutines are cheap virtual threads, so you don't have to worry about creating too many of them or about spreading them over multiple cores. Recent versions of Golang run them across cores for you.
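The channel-only style can be sketched as follows. This is an illustrative example, not code from the service: `scoreChunk`, `sumConcurrently` and `runAll` are hypothetical names. The first function collects results over a channel; the second collects bare "done" flags when there is no data to pass back.

```go
package main

import "fmt"

// scoreChunk is a stand-in for the real per-chunk work.
func scoreChunk(chunk []int) int {
	sum := 0
	for _, v := range chunk {
		sum += v
	}
	return sum
}

// sumConcurrently processes each chunk in its own goroutine and collects
// the partial results over a channel. chunkSize must be positive.
func sumConcurrently(items []int, chunkSize int) int {
	numChunks := (len(items) + chunkSize - 1) / chunkSize
	results := make(chan int, numChunks) // buffered, so senders never block
	for i := 0; i < len(items); i += chunkSize {
		end := i + chunkSize
		if end > len(items) {
			end = len(items)
		}
		go func(chunk []int) {
			results <- scoreChunk(chunk)
		}(items[i:end])
	}
	total := 0
	for i := 0; i < numChunks; i++ {
		total += <-results
	}
	return total
}

// runAll runs every task in its own goroutine and waits for all of them
// by counting empty "done" flags received on a channel.
func runAll(tasks []func()) {
	done := make(chan struct{}, len(tasks))
	for _, task := range tasks {
		go func(t func()) {
			t()
			done <- struct{}{}
		}(task)
	}
	for range tasks {
		<-done
	}
}

func main() {
	items := make([]int, 1000)
	for i := range items {
		items[i] = i + 1
	}
	fmt.Println(sumConcurrently(items, 100)) // prints 500500

	squares := make([]int, 4)
	var tasks []func()
	for i := range squares {
		i := i // capture a per-iteration copy (needed before Go 1.22)
		tasks = append(tasks, func() { squares[i] = i * i })
	}
	runAll(tasks)
	fmt.Println(squares) // prints [0 1 4 9]
}
```

Counting one receive per goroutine plays the role that `WaitGroup.Wait` would otherwise play, and the receive also guarantees that each task's writes are visible to the collector.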

The third part was the hardest. It took about a day to debug. So if your use case does not require complex sampling and C++, I suggest using Gonum directly; you'll save yourself a lot of time.

What I didn't realize, coming from Cython, is that I had to compile the C++ file manually and load it through the cgo include flags.

The header file :

#ifndef BETA_DIST_H
#define BETA_DIST_H

#ifdef __cplusplus
extern "C"
{
#endif

    double beta_sample(double, double, long);

#ifdef __cplusplus
}
#endif

#endif


Note the extern "C". It is required for C++ code to be usable from Go because of name mangling; plain C does not need it. Another gotcha: I couldn't put any #include statements in the header file, because cgo linking failed when I did (I don't know the cause for certain; most likely it is because cgo compiles the header as C, so C++ headers can't be parsed). So I moved those statements to the .cpp file.

Compile it :

g++ -fPIC -I/usr/local/include -L/usr/local/lib  betadist.cpp -shared -o libbetadist.so

Once it is compiled, you can use it from cgo.

The cgo wrapper file:

package sampling // hypothetical name; the original omits the package clause

/*
#cgo CPPFLAGS: -I${SRCDIR}/cbetadist
#cgo CPPFLAGS: -I/usr/local/include
#cgo LDFLAGS: -Wl,-rpath,${SRCDIR}/cbetadist
#cgo LDFLAGS: -L${SRCDIR}/cbetadist
#cgo LDFLAGS: -L/usr/local/lib
#cgo LDFLAGS: -lstdc++
#cgo LDFLAGS: -lbetadist
#include <betadist.hpp>
*/
import "C"

// Betasample calls the C++ sampler through its C-compatible wrapper.
func Betasample(alpha, beta float64, random int) float64 {
	return float64(C.beta_sample(C.double(alpha), C.double(beta), C.long(random)))
}

Note that -lbetadist in LDFLAGS is what links libbetadist.so. You must also run export DYLD_LIBRARY_PATH=/fullpath_to/folder_containing_so_file/ (on Linux the variable is LD_LIBRARY_PATH). After that I could run go run . and it worked like any other Go project.

Integrating these pieces with a simple model struct and a predict method was straightforward and took less time.


Metric           Python    Go
Max RPS          192       819
Max latency      78ms      110ms
Max CPU util.    ~20%      ~55%

That is a 4.3x improvement in RPS, which reduces our minimum number of nodes from 80 to 19, a huge cost advantage. Max latency is slightly higher, but that is acceptable: the Python service is already saturated at 192 RPS and degrades significantly if traffic exceeds that figure.
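The arithmetic behind those two numbers can be checked with a few lines of Go. This sketch assumes the node count scales inversely with per-node RPS; `speedup` and `nodesNeeded` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// speedup returns how many times more requests per second the new service handles.
func speedup(oldRPS, newRPS float64) float64 {
	return newRPS / oldRPS
}

// nodesNeeded scales the old node count by oldRPS/newRPS and rounds up,
// assuming capacity grows linearly with per-node RPS.
func nodesNeeded(oldNodes int, oldRPS, newRPS float64) int {
	return int(math.Ceil(float64(oldNodes) * oldRPS / newRPS))
}

func main() {
	fmt.Printf("speedup: %.1fx\n", speedup(192, 819)) // prints "speedup: 4.3x"
	fmt.Println("nodes:", nodesNeeded(80, 192, 819))  // prints "nodes: 19"
}
```

80 nodes at 192 RPS is 15,360 RPS in total; at 819 RPS per node that is 18.75 nodes, which rounds up to 19.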

Should I convert all my models to Golang?

Short answer: no.

Long answer: Go has great advantages for serving, but Python is still king for experimentation. I only recommend Go for base models that are simple and long-running, not for experiments. Go is still not very mature for complex ML use cases.

So, the elephant in the room: why not Rust?

Well, Schiff did it. Have a look. It's even faster than Go.


