
A series of problems caused by IPVS connection reuse in Kubernetes

2022-06-24 15:14:00 imroc

This article is excerpted from my Kubernetes learning notes.

Background

There is a long-discussed bug in the Kubernetes community (#81775): when a client initiates a large number of new TCP connections to a Service, some of the new connections are forwarded to Pods that are Terminating or already completely destroyed, causing persistent packet loss (the error no route to host). The root cause is the kernel's ipvs connection reuse behavior, and this article breaks it down in detail.

A brief introduction to conn_reuse_mode

Before getting into the cause, let's first introduce the conn_reuse_mode kernel parameter. It was introduced by the following two patches:

  1. 2015: d752c364571743d696c2a54a449ce77550c35ac5
  2. 2016: f719e3754ee2f7275437e61a6afd520181fdd43b

Its purpose is:

  1. When client ip:client port reuse occurs, reschedule the ip_vs_conn that is in TIME_WAIT state, so that connections are spread more evenly across the rs (real servers) and performance improves.
  2. If the mode is 0, reuse the rs recorded in the old ip_vs_conn, which makes the connection distribution less balanced.

So conn_reuse_mode set to 0 means ipvs connection reuse is enabled, and 1 means it is disabled. A bit counterintuitive, isn't it? This naming has indeed been controversial.
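To make the parameter concrete, here is a minimal Go sketch that reads net.ipv4.vs.conn_reuse_mode through the proc filesystem, the same interface kube-proxy manipulates. It assumes a Linux node with the ip_vs module loaded; the write path is left commented out because it requires root:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// sysctl net.ipv4.vs.conn_reuse_mode exposed via procfs.
const path = "/proc/sys/net/ipv4/vs/conn_reuse_mode"

func main() {
	raw, err := os.ReadFile(path)
	if err != nil {
		panic(err) // e.g. the ip_vs module is not loaded
	}
	fmt.Printf("conn_reuse_mode = %s\n", strings.TrimSpace(string(raw)))

	// To set it to 0 (enable ipvs connection reuse), as kube-proxy does
	// by default in ipvs mode on older kernels (requires root):
	// if err := os.WriteFile(path, []byte("0"), 0o644); err != nil { panic(err) }
}
```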

The conn_reuse_mode=1 bug

Enabling this kernel parameter (conn_reuse_mode=1) is meant to improve new-connection performance, but in practice it degrades it dramatically: in real tests, connections per second dropped from about 30,000 to 1,500. This also suggests that some kernel community patches are not rigorously performance-tested.

Enabling this parameter actually means that ipvs does not reuse connections when forwarding: each new connection is scheduled to an rs afresh and a new ip_vs_conn is created. But the implementation has a flaw: when a new connection is created (a SYN packet arrives), if the client ip:client port matches an old ipvs connection in TIME_WAIT state and conntrack is in use, the first SYN packet is dropped, and the connection is only established after the retransmission (1s later). This makes connection-setup performance plummet.
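A crude way to check whether a client is hitting this stall is to time TCP handshakes against the Service and count dials that take close to the 1s SYN retransmission timeout. A minimal Go probe, with the Service address as a placeholder:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	const svc = "10.96.0.100:80" // hypothetical Service clusterIP:port
	var slow int
	for i := 0; i < 200; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", svc, 3*time.Second)
		elapsed := time.Since(start)
		if err == nil {
			conn.Close()
		}
		if elapsed > 900*time.Millisecond {
			slow++ // the first SYN was likely dropped and retransmitted
		}
	}
	fmt.Printf("%d/200 dials waited on a SYN retransmit\n", slow)
}
```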

The Kubernetes community discovered this bug as well, so when kube-proxy uses the ipvs forwarding mode, it sets conn_reuse_mode to 0 by default to avoid the problem; see PR #71114 and issue #70747.

Problems caused by conn_reuse_mode=0

To avoid the conn_reuse_mode=1 performance problem, Kubernetes has kube-proxy set conn_reuse_mode to 0 on startup in ipvs mode, i.e. it relies on ipvs's connection reuse capability. But ipvs connection reuse has two problems:

  1. As long as the client ip:client port matches an existing ip_vs_conn (i.e. reuse occurs), the packet is forwarded directly to the rs recorded in that entry, regardless of the rs's current state. Even if the rs's weight is 0 (the old entry is usually in TIME_WAIT state), the packet is still forwarded. Such an rs is usually a destroyed Pod in Terminating state, so connections forwarded to it inevitably fail (see the sketch after this list for a way to observe these entries).
  2. Under high concurrency, reuse happens on a large scale: new connections are never scheduled to an rs but are forwarded straight to the rs of the reused connection, so many new connections get "pinned" to a subset of the rs.
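A large pool of TIME_WAIT ip_vs_conn entries is the raw material for both problems. As a way to observe them, here is a minimal Go sketch that tallies ipvs connection entries by state by scanning /proc/net/ip_vs_conn (it assumes the ip_vs module is loaded and parses the kernel's standard column layout):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/net/ip_vs_conn")
	if err != nil {
		panic(err) // requires the ip_vs module to be loaded
	}
	defer f.Close()

	states := map[string]int{}
	s := bufio.NewScanner(f)
	s.Scan() // skip the header line
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) >= 8 {
			states[fields[7]]++ // the State column, e.g. TIME_WAIT
		}
	}
	for state, n := range states {
		fmt.Printf("%-12s %d\n", state, n)
	}
}
```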

The symptoms a business may run into include:

  1. Connection errors during rolling updates. While the target service rolls, some Pods are newly created and some are destroyed; when ipvs connection reuse kicks in, traffic is forwarded to destroyed Pods, producing connection errors (no route to host).
  2. Uneven load after rolling updates. Because reused connections are never rescheduled, new connections also get "pinned" to a subset of Pods.
  3. Newly scaled-up Pods receive little traffic. For the same reason, many new connections stay "pinned" to the Pods that existed before the scale-up (the probe sketch after this list shows one way to measure the skew).
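One way to measure this skew is to open many short-lived connections to the Service and tally which backend answered each one. The sketch below is only an illustration: it assumes a hypothetical echo backend that writes its Pod hostname as the first line of every connection, and the clusterIP:port is a placeholder:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

func main() {
	const svc = "10.96.0.100:8080" // hypothetical Service clusterIP:port
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		conn, err := net.DialTimeout("tcp", svc, 2*time.Second)
		if err != nil {
			counts["dial-error"]++
			continue
		}
		conn.SetReadDeadline(time.Now().Add(2 * time.Second))
		// Assumed backend behavior: it echoes its Pod hostname plus '\n'.
		host, err := bufio.NewReader(conn).ReadString('\n')
		conn.Close()
		if err != nil {
			counts["read-error"]++
			continue
		}
		counts[strings.TrimSpace(host)]++
	}
	// With reuse "pinning" connections, newly added Pods show far fewer hits.
	for pod, n := range counts {
		fmt.Printf("%-30s %d\n", pod, n)
	}
}
```

(The snippet also needs "strings" in its import list; it is a diagnostic sketch, not production code.)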

Workarounds

Knowing the cause, how can we avoid the problem while staying in ipvs forwarding mode? Let's consider north-south and east-west traffic separately.

North-south traffic

  1. Use a load balancer that goes directly to Pods. North-south traffic is usually exposed via NodePort: the upstream load balancer sends traffic to a NodePort, and ipvs then forwards it to backend Pods. Many cloud vendors now support an LB-direct-to-Pod mode, where the load balancer forwards requests straight to Pods without passing through NodePort; ipvs is not involved, so the problem is avoided at the traffic entry layer.
  2. Use an ingress. Deploy an ingress controller in the cluster (such as nginx ingress). When traffic reaching the ingress is forwarded onward (to Pods in the cluster), it does not go through the Service; instead it goes directly to the Service's backend Pod IP:Port, bypassing ipvs. Combining the ingress controller with the LB-direct-to-Pod deployment mode above works even better.

East-west traffic

Service-to-service calls inside the cluster (east-west traffic) still go through ipvs forwarding by default. For businesses with such high-concurrency scenarios, consider using a Service Mesh (such as istio) to manage traffic: service-to-service forwarding is then done by the sidecar proxy and does not pass through ipvs.

The ultimate solution: fixing the kernel

The performance degradation caused by conn_reuse_mode=1 is fundamentally a kernel bug. The open-source TencentOS-kernel provided by Tencent Cloud has fixed it (see its PR #17). TKE's solution is to use this kernel patch and, on top of it, disable ipvs connection reuse (conn_reuse_mode=1), which resolves the whole series of problems caused by ipvs connection reuse; this has been verified in large-scale production.

The fix above, however, was not merged into upstream Linux directly. Two related patches have since been merged into the Linux mainline kernel (as of v5.9), fixing the bugs under conn_reuse_mode 0 and 1 respectively; one of them borrows the idea of Tencent Cloud's fix. See k8s issue #93297.

If you run a kernel of v5.9 or later, in theory the problems described in this article no longer exist. Since v5.9 kernels have these bugs fixed, kube-proxy no longer needs to set the conn_reuse_mode kernel parameter explicitly, which is exactly what PR #102122 does. Note, however, that the community patches have not yet been verified in large-scale production, so adopting them early carries some risk.
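Conceptually, the gate introduced by PR #102122 boils down to a kernel version check. The Go sketch below illustrates the idea only; it is not kube-proxy's actual code, and the version parsing is simplified (distro suffixes after the first '-' are ignored):

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/sys/unix"
)

// kernelRelease returns the running kernel's major.minor version.
func kernelRelease() (major, minor int, err error) {
	var uts unix.Utsname
	if err = unix.Uname(&uts); err != nil {
		return
	}
	rel := unix.ByteSliceToString(uts.Release[:]) // e.g. "5.4.0-144-generic"
	_, err = fmt.Sscanf(strings.SplitN(rel, "-", 2)[0], "%d.%d", &major, &minor)
	return
}

func main() {
	major, minor, err := kernelRelease()
	if err != nil {
		panic(err)
	}
	if major > 5 || (major == 5 && minor >= 9) {
		fmt.Println("kernel >= 5.9: leave conn_reuse_mode at the kernel default")
	} else {
		fmt.Println("kernel < 5.9: the workarounds in this article still apply")
	}
}
```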


Copyright notice: this article was written by [imroc]. Please include a link to the original when reposting:
https://yzsam.com/2021/05/20210519220400427o.html