Netflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming and gaming services. Utilizing Amazon AWS, our infrastructure is hosted across multiple geographic regions worldwide. This global distribution allows our applications to deliver content more effectively by serving traffic closer to our customers. Like any distributed system, our applications occasionally require data synchronization between regions to maintain seamless service delivery.
The following diagram shows a simplified cloud network topology for cross-region traffic.
Our Cloud Network Engineering on-call team received a request to address a network issue affecting an application with cross-region traffic. Initially, it appeared that the application was experiencing timeouts, likely due to suboptimal network performance. As we all know, the longer the network path, the more devices the packets traverse, increasing the likelihood of issues. For this incident, the client application is located in an internal subnet in the US region while the server application is located in an external subnet in a European region. Therefore, it is natural to blame the network since packets need to travel long distances through the internet.
As network engineers, our initial reaction when the network is blamed is typically, “No, it can’t be the network,” and our task is to prove it. Given that there were no recent changes to the network infrastructure and no reported AWS issues impacting other applications, the on-call engineer suspected a noisy neighbor issue and sought assistance from the Host Network Engineering team.
In this context, a noisy neighbor issue occurs when a container shares a host with other network-intensive containers. These noisy neighbors consume excessive network resources, causing other containers on the same host to suffer from degraded network performance. Despite each container having bandwidth limitations, oversubscription can still lead to such issues.
Upon investigating other containers on the same host (most of which were part of the same application), we quickly eliminated the possibility of noisy neighbors. The network throughput for both the problematic container and all others was significantly below the set bandwidth limits. We attempted to resolve the issue by removing these bandwidth limits, allowing the application to utilize as much bandwidth as necessary. However, the problem persisted.
We observed some TCP packets in the network marked with the RST flag, a flag indicating that a connection should be immediately terminated. Although the frequency of these packets was not alarmingly high, the presence of any RST packets still raised suspicion on the network. To determine whether this was indeed a network-induced issue, we ran tcpdump on the client. In the packet capture file, we spotted one TCP stream that was closed after exactly 30 seconds.
SYN at 18:47:06
After the 3-way handshake (SYN, SYN-ACK, ACK), the traffic started flowing normally. Nothing strange until FIN at 18:47:36 (30 seconds later).
The packet capture results clearly indicated that it was the client application that initiated the connection termination by sending a FIN packet. Following this, the server continued to send data; however, since the client had already decided to close the connection, it responded with RST packets to all subsequent data from the server.
To ensure that the client wasn’t closing the connection due to packet loss, we also conducted a packet capture on the server side to verify that all packets sent by the server were received. This task was complicated by the fact that the packets passed through a NAT gateway (NGW), which meant that on the server side, the client’s IP and port appeared as those of the NGW, differing from those seen on the client side. Consequently, to accurately match TCP streams, we needed to identify the TCP stream on the client side, locate the raw TCP sequence number, and then use this number as a filter on the server side to find the corresponding TCP stream.
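One way to carry a stream across the NAT boundary is to filter on the raw sequence number, which occupies the four bytes at offset 4 of the TCP header. The sketch below only builds such a capture filter string. The helper name `seq_filter` and the sample sequence number are hypothetical, and it assumes the NAT gateway leaves sequence numbers untouched, which is what made the matching described above possible.

```python
def seq_filter(raw_seq: int) -> str:
    """Build a pcap filter matching TCP segments with this raw sequence number.

    tcp[4:4] reads 4 bytes at offset 4 of the TCP header, which is the
    sequence number field.
    """
    if not 0 <= raw_seq <= 0xFFFFFFFF:
        raise ValueError("TCP sequence numbers are 32-bit unsigned values")
    return f"tcp[4:4] = 0x{raw_seq:08x}"


# Hypothetical raw sequence number of the client's SYN, read from the client capture.
print(seq_filter(305419896))
```

The resulting string can be used as the filter expression when reading the server-side capture with tcpdump. Tools that display relative sequence numbers by default need that option turned off first, otherwise the values on the two sides will not line up.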
With packet capture results from both the client and server sides, we confirmed that all packets sent by the server were correctly received before the client sent a FIN.
Now, from the network point of view, the story is clear. The client initiated the connection requesting data from the server. The server kept sending data to the client with no problem. However, at a certain point, despite the server still having data to send, the client chose to terminate the reception of data. This led us to suspect that the issue might be related to the client application itself.
In order to fully understand the problem, we now need to understand how the application works. As shown in the diagram below, the application runs in the us-east-1 region. It reads data from cross-region servers and writes the data to consumers within the same region. The client runs as containers, whereas the servers are EC2 instances.
Notably, the cross-region read was problematic while the write path was smooth. Most importantly, there is a 30-second application-level timeout for reading the data. The application (client) errors out if it fails to read an initial batch of data from the servers within 30 seconds. When we increased this timeout to 60 seconds, everything worked as expected. This explains why the client initiated a FIN: it lost patience waiting for the server to transfer data.
Could it be that the server was updated to send data more slowly? Could it be that the client application was updated to receive data more slowly? Could it be that the data volume became too large to be completely sent out within 30 seconds? Sadly, we received negative answers to all three questions from the application owner. The server had been operating without changes for over a year, there were no significant updates in the latest rollout of the client, and the data volume had remained consistent.
If both the network and the application weren’t changed recently, then what changed? In fact, we discovered that the issue coincided with a recent Linux kernel upgrade from version 6.5.13 to 6.6.10. To test this hypothesis, we rolled back the kernel upgrade and it did restore normal operation to the application.
Honestly speaking, at that time I didn’t believe it was a kernel bug because I assumed the TCP implementation in the kernel should be solid and stable (Spoiler alert: How wrong was I!). But we were also out of ideas from other angles.
There were about 14k commits between the good and bad kernel versions. Engineers on the team methodically and diligently bisected between the two versions. When the bisecting was narrowed to a couple of commits, a change with “tcp” in its commit message caught our attention. The final bisecting confirmed that this commit was our culprit.
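To give a feel for why bisecting 14k commits is tractable, here is a rough back-of-the-envelope sketch. It assumes an idealized binary search over a linear history; the function name is hypothetical, and a real bisect over a history with merges can take a few more or fewer rounds.

```python
import math


def bisect_steps(commit_count: int) -> int:
    """Worst-case number of build-and-test rounds for a binary search over commits."""
    return math.ceil(math.log2(commit_count))


print(bisect_steps(14_000))
```

Each round still means building and booting a kernel and rerunning the failing workload, so a small number of rounds is not the same as a small amount of work.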
Interestingly, while reviewing the email history related to this commit, we found that another user had reported a Python test failure following the same kernel upgrade. Although their solution was not directly applicable to our situation, it suggested that a simpler test might also reproduce our problem. Using strace, we observed that the application configured the following socket options when communicating with the server:
[pid 1699] setsockopt(917, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
[pid 1699] setsockopt(917, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
[pid 1699] setsockopt(917, SOL_TCP, TCP_NODELAY, [1], 4) = 0
We then developed a minimal client-server C application that transfers a file from the server to the client, with the client configuring the same set of socket options. During testing, we used a 10M file, which represents the volume of data typically transferred within 30 seconds before the client issues a FIN. On the old kernel, this cross-region transfer completed in 22 seconds, whereas on the new kernel, it took 39 seconds to finish.
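The reproduction itself was written in C and is not shown here. As a minimal sketch of the client-side setup only, the Python snippet below applies the same option values seen in the strace output. It assumes an IPv4 socket, so `IPV6_V6ONLY` is left out, and the function name and the commented-out endpoint are hypothetical.

```python
import socket


def configure_like_client(sock: socket.socket) -> None:
    """Apply the socket options seen in the strace output (IPv4 sketch).

    IPV6_V6ONLY is omitted because this sketch uses an AF_INET socket.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 131072)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Options are applied before connect() so the buffer size is in place
# when the handshake happens.
configure_like_client(sock)
# sock.connect(("server.example", 9000))  # hypothetical endpoint
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
keepalive = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
sock.close()
```

Only the `SO_RCVBUF` line turns out to matter for this incident; the others are kept so the sketch mirrors what the application actually did.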
With the help of the minimal reproduction setup, we were ultimately able to pinpoint the root cause of the problem. In order to understand the root cause, it’s essential to have a grasp of the TCP receive window.
TCP Receive Window
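Simply put, the TCP receive window is how the receiver tells the sender “This is how many bytes you can send me without me ACKing any of them”. Assuming the sender is the server and the receiver is the client, the server can have at most one window’s worth of unacknowledged data in flight, so throughput is bounded by roughly the window size divided by the round-trip time.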
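The relationship between window size and throughput can be sketched with simple arithmetic. This assumes the transfer is purely window-limited, with one full window in flight per round trip, and uses an assumed round-trip time of 100 ms that is not a measurement from this incident. The function names are hypothetical.

```python
def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (bytes/second) with one window in flight per round trip."""
    return window_bytes / rtt_seconds


def min_transfer_seconds(size_bytes: int, window_bytes: int, rtt_seconds: float) -> float:
    """Lower bound on transfer time under the same window-limited assumption."""
    return size_bytes / max_throughput(window_bytes, rtt_seconds)


RTT = 0.1  # assumed round-trip time in seconds, not a measured value
SIZE = 10 * 1024 * 1024  # a 10 MiB payload

for window in (65536, 32768):
    print(window, round(min_transfer_seconds(SIZE, window, RTT), 1))
```

Under these assumptions, halving the window doubles the lower bound on transfer time, which is why the window size is the quantity to look at on a long cross-region path.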
The Window Size
Now that we know the TCP receive window size could affect the throughput, the question is, how is the window size calculated? As an application writer, you can’t decide the window size; however, you can decide how much memory you want to use for buffering received data. This is configured using the SO_RCVBUF socket option we saw in the strace result above. However, note that the value of this option means how much application data can be queued in the receive buffer. In man 7 socket, there is
SO_RCVBUF
Sets or gets the maximum socket receive buffer in bytes.
The kernel doubles this value (to allow space for
bookkeeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2). The
default value is set by the
/proc/sys/net/core/rmem_default file, and the maximum
allowed value is set by the /proc/sys/net/core/rmem_max
file. The minimum (doubled) value for this option is 256.
This means, when the user gives a value X, then the kernel stores 2X in the variable sk->sk_rcvbuf. In other words, the kernel assumes that the bookkeeping overhead is as much as the actual data (i.e. 50% of the sk_rcvbuf).
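The doubling is easy to observe from user space. The sketch below sets SO_RCVBUF and reads it back; the function name is hypothetical. On Linux the reported value is expected to be twice the requested one as long as the request does not exceed rmem_max, while other operating systems may report the value unchanged.

```python
import socket


def effective_rcvbuf(requested: int) -> int:
    """Set SO_RCVBUF to `requested` and return what getsockopt reports afterwards."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


requested = 65536
reported = effective_rcvbuf(requested)
print(f"requested={requested} reported={reported}")
```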
sysctl_tcp_adv_win_scale
However, the assumption above may not be true because the actual overhead really depends on a lot of factors such as Maximum Transmission Unit (MTU). Therefore, the kernel provided this sysctl_tcp_adv_win_scale which you can use to tell the kernel what the actual overhead is. (I believe 99% of people also don’t know how to set this parameter correctly and I’m definitely one of them. You’re the kernel, if you don’t know the overhead, how can you expect me to know?).
According to the sysctl doc,
tcp_adv_win_scale - INTEGER
Obsolete since linux-6.6 Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is <= 0.
Possible values are [-31, 31], inclusive.
Default: 1
For 99% of people, we’re just using the default value 1, which in turn means the overhead is calculated by rcvbuf/2^tcp_adv_win_scale = 1/2 * rcvbuf. This matches the assumption when setting the SO_RCVBUF value.
Let’s recap. Assume you set SO_RCVBUF to 65536, which is the value set by the application as shown in the setsockopt syscall. Then we have:
- SO_RCVBUF = 65536
- rcvbuf = 2 * 65536 = 131072
- overhead = rcvbuf / 2 = 131072 / 2 = 65536
- receive window size = rcvbuf - overhead = 131072 - 65536 = 65536
(Note, this calculation is simplified. The real calculation is more complicated.)
In short, the receive window size before the kernel upgrade was 65536. With this window size, the application was able to transfer 10M data within 30 seconds.
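The recap above can be written out as a small function that follows the overhead rule quoted from the sysctl doc. This is a simplified sketch of the pre-6.6 arithmetic only, not the kernel's actual code path, and the function name is hypothetical.

```python
def window_before_6_6(rcvbuf: int, adv_win_scale: int = 1) -> int:
    """Simplified receive window derived from buffer space and tcp_adv_win_scale."""
    if adv_win_scale > 0:
        # overhead = bytes / 2^scale
        overhead = rcvbuf >> adv_win_scale
    else:
        # overhead = bytes - bytes / 2^(-scale)
        overhead = rcvbuf - (rcvbuf >> -adv_win_scale)
    return rcvbuf - overhead


so_rcvbuf = 65536       # value passed to setsockopt by the application
rcvbuf = 2 * so_rcvbuf  # the kernel stores the doubled value
print(window_before_6_6(rcvbuf))
```

With the default scale of 1, half of the doubled buffer is treated as overhead, so the window ends up equal to the value the application originally asked for.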
The Change
This commit obsoleted sysctl_tcp_adv_win_scale and introduced a scaling_ratio that can more accurately calculate the overhead or window size, which is the right thing to do. With the change, the window size is now rcvbuf * scaling_ratio.
So how is scaling_ratio calculated? It is calculated using skb->len/skb->truesize where skb->len is the length of the TCP data in an skb and truesize is the total size of the skb. This is surely a more accurate ratio based on real data rather than a hardcoded 50%. Now, here is the next question: during the TCP handshake before any data is transferred, how do we decide the initial scaling_ratio? The answer is, a magic and conservative ratio was chosen with the value being roughly 0.25.
Now we have:
- SO_RCVBUF = 65536
- rcvbuf = 2 * 65536 = 131072
- receive window size = rcvbuf * 0.25 = 131072 * 0.25 = 32768
In short, the receive window size halved after the kernel upgrade. Hence the throughput was cut in half, causing the data transfer time to double.
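The same arithmetic for the new kernel can be sketched as follows. It uses a plain floating-point ratio for readability, whereas the kernel works with a fixed-point value, and the function names are hypothetical.

```python
def scaling_ratio(payload_len: int, truesize: int) -> float:
    """Fraction of an skb's memory footprint that is actual TCP payload."""
    return payload_len / truesize


def window_from_ratio(rcvbuf: int, ratio: float) -> int:
    """Simplified post-6.6 receive window: buffer space scaled by the payload ratio."""
    return int(rcvbuf * ratio)


rcvbuf = 2 * 65536      # doubled SO_RCVBUF, as stored by the kernel
initial_ratio = 0.25    # conservative ratio used before any data has been seen
print(window_from_ratio(rcvbuf, initial_ratio))
```

Compared with the 65536-byte window on the old kernel, the same SO_RCVBUF setting now yields half as much advertised space at connection start.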
Naturally, you may ask, I understand that the initial window size is small, but why doesn’t the window grow when we have a more accurate ratio of the payload later (i.e. skb->len/skb->truesize)? With some debugging, we eventually found out that the scaling_ratio does get updated to a more accurate skb->len/skb->truesize, which in our case is around 0.66. However, another variable, window_clamp, is not updated accordingly. window_clamp is the maximum receive window allowed to be advertised, which is also initialized to 0.25 * rcvbuf using the initial scaling_ratio. As a result, the receive window size is capped at this value and can’t grow bigger.
In theory, the fix is to update window_clamp along with scaling_ratio. However, in order to have a simple fix that doesn’t introduce other unexpected behaviors, our final fix was to increase the initial scaling_ratio from 25% to 50%. This will make the receive window size backward compatible with the original default sysctl_tcp_adv_win_scale.
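The interaction between scaling_ratio and window_clamp, and the effect of raising the initial ratio, can be illustrated with a toy model. This is not kernel code: the class name, method names and the skb sizes used to reach a ratio of 0.66 are hypothetical, and the model only captures the behaviour described in the text.

```python
class ReceiveWindowModel:
    """Toy model of the receive window with a clamp fixed at connection setup."""

    def __init__(self, rcvbuf: int, initial_ratio: float = 0.25):
        self.rcvbuf = rcvbuf
        self.scaling_ratio = initial_ratio
        # The clamp is computed once from the initial ratio.
        self.window_clamp = int(rcvbuf * initial_ratio)

    def on_data(self, payload_len: int, truesize: int) -> None:
        # The ratio is refined from real traffic, but the clamp is left untouched.
        self.scaling_ratio = payload_len / truesize

    def advertised_window(self) -> int:
        return min(int(self.rcvbuf * self.scaling_ratio), self.window_clamp)


buggy = ReceiveWindowModel(131072, initial_ratio=0.25)
buggy.on_data(66, 100)  # hypothetical sizes giving a ratio of 0.66
fixed = ReceiveWindowModel(131072, initial_ratio=0.50)
fixed.on_data(66, 100)
print(buggy.advertised_window(), fixed.advertised_window())
```

In the model, the refined ratio never helps because the clamp wins the `min`. Raising the initial ratio raises the clamp with it, which brings the window back to the pre-upgrade size without touching the clamp logic.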
Meanwhile, notice that the problem is not only caused by the changed kernel behavior but also by the fact that the application sets SO_RCVBUF and has a 30-second application-level timeout. In fact, the application is Kafka Connect and both settings are the default configurations (receive.buffer.bytes=64k and request.timeout.ms=30s). We also created a Kafka ticket to change receive.buffer.bytes to -1 to allow Linux to auto tune the receive window.
This was a very interesting debugging exercise that covered many layers of Netflix’s stack and infrastructure. While it technically wasn’t the “network” to blame, this time it turned out the culprit was the software components that make up the network (i.e. the TCP implementation in the kernel).
If tackling such technical challenges excites you, consider joining our Cloud Infrastructure Engineering teams. Explore opportunities by visiting Netflix Jobs and searching for Cloud Engineering positions.
Special thanks to our stellar colleagues Alok Tiagi, Artem Tkachuk, Ethan Adams, Jorge Rodriguez, Nick Mahilani, Tycho Andersen and Vinay Rayini for investigating and mitigating this issue. We would also like to thank Linux kernel network expert Eric Dumazet for reviewing and applying the patch.