Talking about TCP TIME_WAIT

2022-07-26 21:34:00 Brother Xing plays with the clouds

Background

Recently a colleague was load testing a service with ab. When QPS hit a ceiling, he suspected that the load-generating machine was the bottleneck and asked to borrow one of my test machines. I took the opportunity to analyze how the load generator itself might become the bottleneck of a stress test. Besides network I/O and machine performance, I also considered the network protocol.

Stress testing is not the protagonist of this article, though. Later analysis showed that my colleague was overthinking it: the bottleneck was on the server side.

While analyzing the load generator, a conjecture about the TCP TIME_WAIT state caught my interest. I had run into this state briefly during earlier troubleshooting but did not understand it well, so I decided to take the time to work through my conjecture.

TCP state transitions

We all know TCP's three-way handshake and four-way close. They sound simple, but on an unreliable physical network every step can fail. To make sure data is delivered, TCP implementations add a great deal of handling for these failure cases.

State analysis

First, a diagram to recall the TCP state transitions.

At first glance, the many states and the lines running in every direction are a little confusing. Looked at carefully, though, there is a pattern to follow.

The diagram divides into three parts: the upper half is connection establishment, the lower left is the active close, and the lower right is the passive close.

Taking the parts one at a time: connection establishment is the familiar three-way handshake, except that the diagram also shows the server's LISTEN state. The active close and the passive close are the two sides of the four-way close.
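The three parts described above can be sketched as a small transition table. This is a simplified illustration rather than the full state diagram: it omits simultaneous open and close, the CLOSING state, and resets, and the event names are labels invented for this sketch, not kernel identifiers.

```python
# Simplified TCP state table: (current state, event) -> next state.
# Event names are illustrative labels made up for this sketch.
TRANSITIONS = {
    # Connection establishment (three-way handshake)
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open/send_SYN"): "SYN_SENT",
    ("LISTEN", "recv_SYN/send_SYN+ACK"): "SYN_RCVD",
    ("SYN_SENT", "recv_SYN+ACK/send_ACK"): "ESTABLISHED",
    ("SYN_RCVD", "recv_ACK"): "ESTABLISHED",
    # Active close (the side that sends the first FIN)
    ("ESTABLISHED", "close/send_FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv_ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_FIN/send_ACK"): "TIME_WAIT",
    ("TIME_WAIT", "2MSL_timeout"): "CLOSED",
    # Passive close (the side that receives the first FIN)
    ("ESTABLISHED", "recv_FIN/send_ACK"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "close/send_FIN"): "LAST_ACK",
    ("LAST_ACK", "recv_ACK"): "CLOSED",
}


def walk(start, events):
    """Return the list of states visited from `start` for the given events."""
    states = [start]
    for event in events:
        states.append(TRANSITIONS[(states[-1], event)])
    return states


ACTIVE_CLOSE = ["close/send_FIN", "recv_ACK", "recv_FIN/send_ACK", "2MSL_timeout"]
PASSIVE_CLOSE = ["recv_FIN/send_ACK", "close/send_FIN", "recv_ACK"]

print(walk("ESTABLISHED", ACTIVE_CLOSE))
print(walk("ESTABLISHED", PASSIVE_CLOSE))
```

Only the active-close path passes through TIME_WAIT; the passive closer goes from LAST_ACK straight to CLOSED.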

Checking connection states

On Linux we often use netstat to view the state of network connections. The more efficient ss (Socket Statistics) can be used in its place.

Both tools list the state of every socket, and simple counting over their output is enough to analyze the current network state of the server.
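The "simple counting" can be sketched in a few lines of Python. This is a minimal example that assumes the text comes from `ss -tan`, where the first column of each row is the state; the sample rows and addresses below are made up for illustration.

```python
from collections import Counter


def count_states(ss_output):
    """Count connections per TCP state in the text printed by `ss -tan`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose first column is "State".
        if not fields or fields[0] == "State":
            continue
        counts[fields[0]] += 1
    return counts


# Made-up sample in the shape of `ss -tan` output.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:8090        0.0.0.0:*
ESTAB      0      0      10.0.0.5:8090       10.0.0.9:51234
TIME-WAIT  0      0      10.0.0.5:8090       10.0.0.9:51230
TIME-WAIT  0      0      10.0.0.5:8090       10.0.0.9:51231
"""

print(count_states(SAMPLE))
```

On a live machine the input could come from `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`. Note that ss abbreviates some state names, for example ESTAB and TIME-WAIT.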

TIME_WAIT

Definition

As the diagram above shows, a TCP connection passes through the TIME_WAIT state on the side that actively closes it. If we curl a URL on a machine to create a TCP connection, tools such as ss will keep showing that connection in TIME_WAIT for a while afterwards.

So TIME_WAIT is this state: after the four-way close completes, the two sides no longer exchange messages, but the side that closed actively keeps the connection around, unusable, for a period of time.

So what is the point of keeping such a state?

Why it exists

As mentioned above, TCP implementations contain many countermeasures for a complicated network, and TIME_WAIT was introduced to deal with some of those abnormal situations.

To understand why TIME_WAIT is necessary, let's assume it does not exist and see what goes wrong. Call the two ends of the TCP connection A and B, with A being the side that closes actively.

  • In the four-way close, A sends FIN, B replies with ACK, B sends its own FIN, and A replies with ACK and closes the connection. If A's final ACK is lost, B assumes A never received its FIN and retransmits the FIN. Without TIME_WAIT, A no longer keeps any record of the connection, so it answers this segment for a nonexistent connection with RST, and B sees an abnormal termination. Here TIME_WAIT ensures that the full-duplex TCP connection terminates cleanly.
  • The IP layer beneath TCP does not guarantee the order in which packets are delivered. Suppose both sides finish closing and the four-tuple (src/dst ip/port) is released while a delayed packet destined for B is still in the network. If A's application immediately creates a new connection with the same four-tuple, the late packet arrives at B, and B takes it for data A has just sent on the new connection. Here TIME_WAIT ensures that stray packets from the old connection expire in the network first.

For these two reasons, the TIME_WAIT state is well justified.

Choosing the duration

With the reasons clear, the duration of TIME_WAIT is easy to understand. It is determined mainly by the second case above: all packets belonging to the connection must have expired from the network after the connection is closed.

Expiration brings up another concept: the Maximum Segment Lifetime (MSL). It is the longest time a TCP segment can exist in the internetwork. It is defined by the TCP implementation, and segments older than this lifetime are discarded.

TIME_WAIT is held by the active closer, A. So consider, from A's point of view, the longest time after which a packet from the old connection might still arrive. A segment A has just sent can live for up to one MSL. When it reaches B, B has already closed the connection and replies with RST, and that RST can itself take up to one MSL to reach A. So if A stays in TIME_WAIT for 2MSL, all packets from the old connection are guaranteed to have disappeared from the network.

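The RFC (RFC 793) defines MSL as 2 minutes, but the value differs between Unix implementations. On CentOS, which we use a lot, TIME_WAIT lasts 60 s, which corresponds to an MSL of 30 s. On Linux this duration is the compile-time constant TCP_TIMEWAIT_LEN in the kernel source. The file /proc/sys/net/ipv4/tcp_fin_timeout is often cited as the place to view and change it, but that setting controls how long an orphaned connection stays in FIN_WAIT_2, not the length of TIME_WAIT.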

The "strange" behavior of ab

Conjecture

From the above, we know that because of TIME_WAIT, every actively closed connection is kept for 2MSL (60 s), and its four-tuple is frozen for the same 60 s. By default, about 30,000 port numbers are available for our machine to assign (see /proc/sys/net/ipv4/ip_local_port_range).

So when we send requests to the server with curl, each request uses one local port on the client. If all the ports are not to be used up within 60 s, the request rate must stay at or below about 500 QPS. Above that, the system can no longer assign port numbers.
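The arithmetic behind the 500 QPS figure can be written out as below. This is a rough sketch under the same assumptions as the text: a single destination ip:port, every connection closed by the client, and no port reuse during TIME_WAIT. The function names are invented for this example.

```python
def parse_port_range(text):
    """Parse the two integers in ip_local_port_range into (low, high)."""
    low, high = (int(value) for value in text.split())
    return low, high


def max_connection_rate(port_count, time_wait_seconds):
    """Upper bound on new connections per second to one destination when
    every closed connection holds its local port for time_wait_seconds."""
    return port_count / time_wait_seconds


# The round figure used in the text: about 30,000 ports, 60 s of TIME_WAIT.
print(max_connection_rate(30000, 60))  # 500.0

# With a concrete range string (32768-60999 is assumed here as an example).
low, high = parse_port_range("32768\t60999\n")
print(max_connection_rate(high - low + 1, 60))
```

On a real machine the range string would be read from /proc/sys/net/ipv4/ip_local_port_range.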

But when I ran a stress test with ab at 4,000 QPS for several minutes, the load generator kept working normally, and ss showed not a single connection in the TIME_WAIT state.

Analysis

At first I thought ab was reusing connections, but a closer look at the ss output showed that the local port number kept changing. What was going on?

So I started a simple service on a test machine, listening on port 8090, generated load from another machine, and captured the packets with tcpdump at the same time.

The result: the first FIN of every connection was sent by the server. In other words, ab never actively closes the connection.
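Deciding which side closes first comes down to finding the sender of the first FIN on each connection. The sketch below shows that check on an already-parsed packet list. The tuple format, the addresses, and the function name are assumptions made for this illustration; it does not parse real tcpdump output.

```python
def first_fin_senders(packets):
    """Map each connection to the endpoint that sent its first FIN.

    `packets` is an iterable of (src, dst, flags) tuples in capture order,
    where src and dst are "ip:port" strings and flags is a set of flag names.
    """
    closers = {}
    for src, dst, flags in packets:
        if "FIN" not in flags:
            continue
        connection = frozenset((src, dst))  # same key for both directions
        closers.setdefault(connection, src)  # keep only the first FIN sender
    return closers


CLIENT = "10.0.0.9:51234"  # hypothetical ab client endpoint
SERVER = "10.0.0.5:8090"   # hypothetical server endpoint (port 8090 as in the text)

# Made-up capture of one short connection in which the server closes first.
CAPTURE = [
    (CLIENT, SERVER, {"SYN"}),
    (SERVER, CLIENT, {"SYN", "ACK"}),
    (CLIENT, SERVER, {"ACK"}),
    (CLIENT, SERVER, {"PSH", "ACK"}),  # request
    (SERVER, CLIENT, {"PSH", "ACK"}),  # response
    (SERVER, CLIENT, {"FIN", "ACK"}),  # first FIN: the server is the active closer
    (CLIENT, SERVER, {"ACK"}),
    (CLIENT, SERVER, {"FIN", "ACK"}),
    (SERVER, CLIENT, {"ACK"}),
]

print(first_fin_senders(CAPTURE))
```

The endpoint reported for a connection is the one that will hold it in TIME_WAIT.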

A look at the server confirmed it: a large number of connections in the TIME_WAIT state.

But because all of these connections share the port the server listens on, the TIME_WAIT connections have no significant impact on the server. They only take up some system resources.

Summary

Of course, under high concurrency, too many TIME_WAIT connections still put considerable pressure on the server, since maintaining that many sockets consumes resources. For ways to deal with too many TIME_WAIT connections, see "A complete guide to solving the TIME_WAIT problem of short TCP connections".

The TCP connection is the most basic concept in network programming. Depending on the usage scenario, we generally divide connections into "long-lived connections" and "short-lived connections". Their pros and cons are not covered in detail here; interested readers can search Google. This article focuses on how to solve the TIME_WAIT problem of short-lived TCP connections.

The biggest advantage of short-lived connections is convenience, especially for scripting languages: a script's process exits once it has finished running, so its connections are basically all short-lived. The biggest disadvantage is that they consume a lot of system resources, such as local ports and socket handles. The reason is simple: the TCP protocol layer has no notion of long or short connections, so either way the process of connection establishment -> data transfer -> connection close is handled the same.

Normally, after a TCP client closes a connection, the connection enters the TIME_WAIT state, usually for 1 to 4 minutes. When the number of connections is low, 1 to 4 minutes is not long and does not affect the system. But if a large number of short-lived connections are made in a short time (for example, within 1 s), the client's operating system may run out of socket ports and handles, and the system can no longer initiate new connections.

For example, suppose we establish 1,000 short-lived connections per second (very common in web scenarios, for instance when every request accesses memcached) and TIME_WAIT lasts 1 minute. Then 60,000 short-lived connections are created in 1 minute, and because TIME_WAIT lasts 1 minute, all of them are still in TIME_WAIT and have not been released. The default local port range on Linux is net.ipv4.ip_local_port_range = 32768 61000, which is fewer than 30,000 ports, so in this situation new requests cannot be established for lack of a local port.
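The example above implies that the ports run out well before the first minute is over. A small sketch of that calculation follows. It assumes a single destination, that the client closes every connection, that no port is reused while in TIME_WAIT, and that both ends of the port range are usable; the function name is invented for this example.

```python
def seconds_until_ports_exhausted(conn_per_second, time_wait_seconds, low, high):
    """Return the time until no local port is free, or None if ports never run out."""
    available = high - low + 1  # treats both ends of the range as usable
    if conn_per_second * time_wait_seconds <= available:
        # Ports leave TIME_WAIT at least as fast as they are consumed.
        return None
    # No port is released during the first time_wait_seconds, so the pool
    # simply drains at conn_per_second.
    return available / conn_per_second


# 1,000 connections per second, 60 s of TIME_WAIT, range 32768-61000 as in the text.
print(seconds_until_ports_exhausted(1000, 60, 32768, 61000))

# A lower rate that fits within the range never exhausts it.
print(seconds_until_ports_exhausted(400, 60, 32768, 61000))
```

At 1,000 connections per second the pool of roughly 28,000 ports is gone after about 28 seconds, long before the first TIME_WAIT entries expire.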

This problem can be addressed in the following ways:
1) Switch to long-lived connections. This is costly, though: too many long-lived connections can cause performance problems on the server, and scripting languages such as PHP need proxy-like software to achieve long-lived connections.
2) Modify net.ipv4.ip_local_port_range to widen the range of available ports. This only alleviates the problem; it does not solve it.
3) Set the SO_LINGER option on the socket in the client program.
4) Enable the tcp_tw_recycle and tcp_timestamps options on the client machine (note that tcp_tw_recycle was removed in Linux 4.12).
5) Enable the tcp_tw_reuse and tcp_timestamps options on the client machine.
6) Set tcp_max_tw_buckets to a very small value on the client machine.
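As an illustration of option 3), the sketch below sets SO_LINGER with linger enabled and a timeout of 0, which makes close() abort the connection with RST instead of the normal FIN exchange, so the closing side does not enter TIME_WAIT. It assumes the Linux layout of struct linger (two C ints); the helper names are invented for this example, and this is not necessarily the exact configuration the original verification used.

```python
import socket
import struct

LINGER_FORMAT = "ii"  # struct linger { int l_onoff; int l_linger; } on Linux


def enable_abortive_close(sock):
    """Set SO_LINGER to (on, 0 seconds) so that close() sends RST."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FORMAT, 1, 0))


def linger_setting(sock):
    """Return the socket's current (l_onoff, l_linger) pair."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FORMAT))
    return struct.unpack(LINGER_FORMAT, raw)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_abortive_close(sock)
print(linger_setting(sock))
sock.close()
```

The trade-off is that an abortive close discards any unsent data and the peer sees a connection reset, so it only suits protocols where the client already knows the exchange is complete.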

While solving the short-connection problem of PHP connecting to Memcached, we mainly verified options 3), 4), 5), and 6). The verification consisted of basic functional tests and code checks, with no performance stress testing. In real use, therefore, keep an eye on how the business behaves: if you see packet loss, dropped connections, or failures to connect, check whether these options are the cause.

Although information about these methods can be found through Google, most of it is vague and copied from elsewhere, so it has little reference value. While locating and handling these problems we ran into some puzzles and difficulties, which also took time to track down and resolve. The following is a summary of the relevant experience.

Only by understanding the principles better can we find the root cause faster. I will keep consolidating my networking knowledge.

Copyright notice
This article was written by [Brother Xing plays with the clouds]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/207/202207262035394746.html