How can services start smoothly under tens of millions of QPS
2022-08-02 20:45:00 【InfoQ】
Background
In both test and production environments, the first requests after a service has just started are much slower than normal, typically taking anywhere from a few hundred milliseconds to 1 second. If the calling service sets a timeout, calls to a callee that has just started are very likely to hit that limit and fail with a timeout exception. Under very high traffic, a service may be overwhelmed as soon as it starts because of these slow responses, so it never manages to start and production operations are affected.
The figure below, taken from the Alibaba microservice governance white paper, compares the latency of the first and second remote calls made through RestTemplate in a Spring Cloud application:
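Root cause analysis
OpenJDK uses JIT (just-in-time) compilation, which dynamically compiles Java bytecode into highly optimized machine code to improve execution efficiency. Before compilation, however, Java code runs in the relatively slow interpreter mode. In the short period after the application has started and business traffic has just begun to arrive, large numbers of Java methods start being JIT-compiled while business requests are still executed by the slower interpreter. The result is a spike in system load that can cause many user requests to time out. Spring Cloud Ribbon is also lazily loaded: the first request has to load a lot of code as well, which lengthens request processing and further raises the risk of timeouts.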
Solution
- Step 1: make sure traffic is admitted only after the system has fully started;
- Step 2: warm up with low traffic. A small amount of traffic lets the JVM compile hot code into machine code and cache it inside the JVM, so that later use does not trigger on-the-fly loading;
- Step 3: gradually ramp traffic up to full volume. By the time heavy traffic hits the hot code, it no longer has to be interpreted on every call, which achieves a smooth start.
Full system startup
How can we make sure traffic arrives only after the service has fully started? The simplest approach is to configure the reverse proxy by hand: before a service instance goes offline, remove its IP from the proxy server; after the restart, watch the node's logs to confirm that startup has finished; then add the node's IP address back to the proxy server. With hundreds or thousands of nodes to operate, however, the workload is very large. In general we rely on a health-check mechanism to ensure that traffic is routed to an instance only after it has fully started.
F5 health checks
For a traditional monolithic service running on virtual machines, you can configure a health-check port on F5. F5 calls the specified port of the service over HTTP or TCP. When the number of failed calls exceeds the configured threshold, F5 stops sending requests to that node. After the service instance has started and the number of successful calls exceeds the configured threshold, F5 automatically routes traffic to the instance again.
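The threshold behaviour described above can be sketched as a small state machine. This is only an illustration of the idea, not F5's implementation: the class name `ProbeState`, the threshold values, and the choice to count consecutive results are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Tracks one backend node the way a threshold-based health check might."""
    fail_threshold: int = 3      # hypothetical: consecutive failures before removal
    success_threshold: int = 2   # hypothetical: consecutive successes before re-adding
    in_rotation: bool = False    # a freshly started node receives no traffic yet
    _fails: int = 0
    _successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the node should receive traffic."""
        if probe_ok:
            self._successes += 1
            self._fails = 0
            if not self.in_rotation and self._successes >= self.success_threshold:
                self.in_rotation = True
        else:
            self._fails += 1
            self._successes = 0
            if self.in_rotation and self._fails >= self.fail_threshold:
                self.in_rotation = False
        return self.in_rotation


node = ProbeState()
# One failed probe while starting, two successes, three failures, one success.
history = [node.record(ok) for ok in [False, True, True, False, False, False, True]]
print(history)
```

A single success or failure never flips the state by itself, which keeps a node that is still starting (or briefly flapping) from being added to or dropped from rotation too early.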
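Registering with the registry
In a microservice scenario, taking the Eureka registry as an example, the Provider first registers its information with the registry. Once registration is complete, the Consumer periodically pulls the service list from the registry, updates its local copy, and uses Ribbon to load-balance requests across the corresponding Provider instances.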
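Disabling Ribbon lazy loading
For small and medium-sized companies, disabling Ribbon lazy loading through configuration is also a simple, blunt, and effective solution: the clients are loaded when the container starts, so the first request does not have to wait for the client configuration classes to be created, which reduces request processing time. The configuration is as follows: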
ribbon.eager-load.enabled=true
ribbon.eager-load.clients=service_id1,service_id2
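Probe detection
In a container environment, especially when services call each other through a Kubernetes Service (Spring Boot + k8s Service) or a service mesh (such as Istio + Envoy), you need the Kubernetes probe mechanism. Readers unfamiliar with the concept can refer to the probe articles listed in the references. Put simply, Liveness is the liveness probe: it checks whether the container is still running (Running state), and a container found to be unhealthy is restarted. Readiness is the readiness probe: it determines whether the service is available (Ready state), and a Pod receives traffic only once it reaches the Ready state.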
Suppose there are two Pods, podA and podB, both labeled App=app1. With the label selector App=app1, the Hollowapp Service makes them reachable inside Kubernetes. If Pod A's readiness check is currently Failure, Kubernetes removes Pod A's IP from the Endpoints of the Service, so requests through the Service reach only Pod B and not Pod A. Once Pod A's readiness probe turns to Success, calls through the Service can reach Pod A again.
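The effect of readiness on a Service's endpoints can be modelled in a few lines. This is a toy model of the behaviour described above, not Kubernetes code: the `Pod` class, the `service_endpoints` function, and the IP addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    ip: str        # hypothetical addresses, for illustration only
    labels: dict
    ready: bool    # result of the readiness probe


def service_endpoints(pods, selector):
    """Return the IPs of pods that match every selector label and are Ready."""
    return [
        p.ip
        for p in pods
        if p.ready and all(p.labels.get(k) == v for k, v in selector.items())
    ]


pod_a = Pod("podA", "10.0.0.1", {"App": "app1"}, ready=False)
pod_b = Pod("podB", "10.0.0.2", {"App": "app1"}, ready=True)
selector = {"App": "app1"}

before = service_endpoints([pod_a, pod_b], selector)  # podA still failing readiness
pod_a.ready = True                                    # readiness probe turns to Success
after = service_endpoints([pod_a, pod_b], selector)
print(before, after)
```

Both Pods match the selector the whole time; only the readiness result decides whether Pod A's address is handed out to callers.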
Spring Boot has supported liveness and readiness probes since version 2.3, with two new actuator endpoints: /actuator/health/liveness and /actuator/health/readiness. The configuration is as follows:
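<-- irrelevant fields omitted -->
spec:
  containers:
  - name: *****
    image: *****
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness
        port: 8090
      initialDelaySeconds: 5   # delay after container start before probing begins
      failureThreshold: 10     # 10 consecutive failed probes count as a failure
      timeoutSeconds: 10       # the container must respond within 10s, otherwise the probe fails
      periodSeconds: 5         # probe interval: every 5s
    readinessProbe:
      httpGet:
        path: /actuator/health/readiness
        port: 8090
      initialDelaySeconds: 10  # delay after container start before probing begins
      timeoutSeconds: 2        # the container must respond within 2s, otherwise the probe fails
      periodSeconds: 30        # probe interval: every 30s
      successThreshold: 1      # 1 successful probe counts as a success
      failureThreshold: 3      # 3 consecutive failed probes count as a failure
<-- irrelevant fields omitted -->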
Interested readers can refer to "Mastering Spring Boot 2.3 container probes: hands-on" (掌握SpringBoot-2.3的容器探针:实战篇), which describes this in more detail.
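Low-traffic warm-up
In production releases, a cold system that has just started often has to handle a large volume of requests right away. Because its internal resources are not yet fully initialized, this can cause large numbers of requests to time out, block, or fail, and can even bring down the newly released application. The low-traffic warm-up method is generally used to solve this kind of problem.
Normally, a newly released microservice instance shares the total online QPS evenly with the other healthy instances. With low-traffic warm-up, the service consumer calculates a weight for each provider instance from its startup time and combines it with the load-balancing algorithm, so that traffic to the new instance grows gradually to the normal level. This gives the newly started instance time to warm up. The figure shows how QPS changes over time.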
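The low-traffic warm-up process implemented by open-source Dubbo works as follows:
- While registering with the registry, the service provider also registers its warm-up duration WarmupTime and its startup time StartTime as metadata;
- The service consumer subscribes to the list of service instances in the registry and, during each call, calculates the call weight assigned to each instance from WarmupTime and StartTime;
- An instance that has just started, whose StartTime is close to the time of the call, gets a low weight, so the new instance is allocated less traffic and is warmed up with low traffic.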
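The weight calculation in the steps above can be sketched as a linear ramp. This is a minimal sketch of the idea, not Dubbo's source code: the function name `warmup_weight`, the 10-minute warm-up duration, the configured weight of 100, and the three already-warm peers are assumptions chosen for the example.

```python
def warmup_weight(start_time_ms: int, now_ms: int, warmup_ms: int, weight: int) -> int:
    """Scale `weight` linearly with uptime during the warm-up window.

    Returns at least 1 so a new instance still receives a trickle of traffic,
    and the full weight once the warm-up window has elapsed.
    """
    uptime_ms = now_ms - start_time_ms
    if uptime_ms < 0:
        return 1                      # clock skew: treat as just started
    if uptime_ms >= warmup_ms:
        return weight                 # warm-up finished (also covers warmup_ms == 0)
    scaled = int(uptime_ms * weight / warmup_ms)
    return max(1, min(scaled, weight))


WARMUP_MS = 10 * 60 * 1000   # assumed warm-up duration: 10 minutes
WEIGHT = 100                 # assumed configured weight

for seconds in (0, 60, 300, 600):
    w = warmup_weight(0, seconds * 1000, WARMUP_MS, WEIGHT)
    # Share of traffic next to three fully warmed peers of weight 100 each.
    share = w / (w + 3 * WEIGHT)
    print(f"uptime={seconds:>3}s weight={w:>3} share={share:.1%}")
```

With weighted load balancing on the consumer side, the new instance's share of traffic follows its weight, so it climbs from almost nothing to an equal quarter of the load over the warm-up window.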
For reasons of space we will not go into more detail here; see the Alibaba microservice governance white paper for details.
Meituan's open-source OCTO-RPC
The principle is the same (the core code listing is not reproduced here).
Service Mesh istio-proxy
With a service mesh, Envoy also provides a warm-up mechanism called slow start mode. It currently supports the Round Robin and Least Request load-balancing policies, and it also provides a low-traffic warm-up model:
Envoy does not enable slow start by default. To enable it, refer to the following configuration:
message SlowStartConfig {
  // Represents the size of slow start window.
  // If set, the newly created host remains in slow start mode starting from its creation time
  // for the duration of slow start window.
  google.protobuf.Duration slow_start_window = 1;

  // This parameter controls the speed of traffic increase over the slow start window. Defaults to 1.0,
  // so that endpoint would get linearly increasing amount of traffic.
  // When increasing the value for this parameter, the speed of traffic ramp-up increases non-linearly.
  // The value of aggression parameter should be greater than 0.0.
  // By tuning the parameter, is possible to achieve polynomial or exponential shape of ramp-up curve.
  //
  // During slow start window, effective weight of an endpoint would be scaled with time factor and aggression:
  // `new_weight = weight * time_factor ^ (1 / aggression)`,
  // where `time_factor=(time_since_start_seconds / slow_start_time_seconds)`.
  //
  // As time progresses, more and more traffic would be sent to endpoint, which is in slow start window.
  // Once host exits slow start, time_factor and aggression no longer affect its weight.
  core.v3.RuntimeDouble aggression = 2;
}
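The formula in the comment above, `new_weight = weight * time_factor ^ (1 / aggression)`, can be evaluated directly to see how `aggression` shapes the ramp. This is a sketch of the documented formula only, not Envoy's implementation; the function name `slow_start_weight` and the sample values (weight 100, a 60-second window) are assumptions for illustration.

```python
def slow_start_weight(weight: float, seconds_since_start: float,
                      window_seconds: float, aggression: float = 1.0) -> float:
    """Effective weight of an endpoint inside the slow start window."""
    if aggression <= 0.0:
        raise ValueError("aggression must be greater than 0.0")
    if window_seconds <= 0.0 or seconds_since_start >= window_seconds:
        return weight                 # outside the window: full weight
    time_factor = max(seconds_since_start, 0.0) / window_seconds
    return weight * time_factor ** (1.0 / aggression)


# A quarter of the way through a 60-second window, with weight 100.
for aggression in (0.5, 1.0, 2.0):
    w = slow_start_weight(100.0, 15.0, 60.0, aggression)
    print(f"aggression={aggression}: weight={w:.2f}")
```

At the same point in the window, an aggression of 1.0 gives a linear ramp, a larger value hands the endpoint more traffic earlier, and a smaller value holds traffic back for longer.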
Summary
Bringing an instance online without losing requests looks simple. The concrete solutions differ between virtual machine and container environments, but the idea is the same and has three key parts:
- Instance startup: this covers initialization of the JVM, Spring, and so on, as well as microservice components such as the configuration center. Note that the instance should not register with the registry before its resources are fully loaded; you can configure delayed registration so that the application registers and starts serving only after it has fully initialized;
- Low-traffic warm-up: client-side load balancing and a weight algorithm make traffic to the newly started instance grow linearly until it reaches the normal level;
- Full-volume operation: check the JVM, interface response-time metrics, and logs to confirm that the newly launched instance has gone online without problems.
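References
Graceful startup: how to keep traffic away from nodes that have not finished starting (优雅启动:如何避免流量打到没有启动完成的节点?)
Mastering Spring Boot 2.3 container probes: hands-on (掌握SpringBoot-2.3的容器探针:实战篇)
Kubernetes Liveness and Readiness Probes
Configure Liveness, Readiness and Startup Probes
config.cluster.v3.Cluster.SlowStartConfig
Envoy slow start mode
Alibaba microservice governance white paper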