Azure Load Balancer health probes and the four-way handshake
It's always the fun little things that cause you pain.
We've got an Azure Load Balancer running over a RabbitMQ cluster with a health probe set to check port 5672 every 60 seconds.
The RabbitMQ logs were filling up with "handshake_timeout" errors every 60 seconds. Very odd.
Time for a packet capture, where we find the following:
1. Load balancer SYN
2. RabbitMQ SYN-ACK
3. Load Balancer ACK
4. 10 seconds later RabbitMQ RST
5. Another 50 seconds later Load Balancer FIN
Eh?
The Azure Load Balancer documentation says that it does a four-way handshake to terminate a probe. What it fails to tell you is that the FIN isn't sent until the start of the next probe.
This leaves RabbitMQ sitting there waiting for data on an open connection. It gets none within its default 10-second handshake timeout, so it terminates the connection and logs it as an error.
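To see the same behaviour without waiting for the load balancer, you can imitate the probe: open a TCP connection, send nothing, and time how long the server puts up with it. The following is a minimal sketch, not anything Azure or RabbitMQ ship; the function name `time_until_peer_closes` and its parameters are made up for illustration.

```python
import socket
import time
from typing import Optional


def time_until_peer_closes(host: str, port: int, max_wait_s: float) -> Optional[float]:
    """Connect, send nothing, and time how long the peer tolerates the silence.

    Returns the seconds elapsed until the peer closes or resets the
    connection, or None if it is still open (or the peer sent data)
    after max_wait_s seconds.
    """
    with socket.create_connection((host, port), timeout=max_wait_s) as sock:
        start = time.monotonic()
        sock.settimeout(max_wait_s)
        try:
            data = sock.recv(1)  # blocks until data, close, reset, or timeout
        except socket.timeout:
            return None  # peer kept the idle connection open
        except ConnectionResetError:
            return time.monotonic() - start  # peer sent an RST
        if data == b"":
            return time.monotonic() - start  # peer closed with a FIN
        return None  # peer sent data instead of closing
```

Calling something like `time_until_peer_closes("rabbit-host", 5672, 15.0)` against a broker with the default settings should come back after roughly 10 seconds, matching step 4 in the capture. The host name here is a placeholder.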
So the horrible workaround/compromise was to set the handshake_timeout config in RabbitMQ to 30 seconds and the load balancer probe interval to 25 seconds, so the FIN from the next probe turns up before RabbitMQ gives up waiting.
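The rule behind the workaround is simply that the probe interval has to be shorter than the handshake timeout. One thing to watch: RabbitMQ takes `handshake_timeout` in milliseconds, so 30 seconds is `handshake_timeout = 30000` in rabbitmq.conf. A small sketch of the check, with a hypothetical helper name:

```python
def probe_triggers_handshake_timeout(probe_interval_s: float,
                                     handshake_timeout_ms: int) -> bool:
    """Return True if an idle probe connection would outlive the timeout.

    Assumes, as in the capture above, that the load balancer holds each
    probe connection open and silent until the next probe starts, i.e.
    for about probe_interval_s seconds.
    """
    return probe_interval_s * 1000 >= handshake_timeout_ms


# Original setup: 60 s interval against the default 10000 ms timeout.
original = probe_triggers_handshake_timeout(60, 10_000)

# Workaround: 25 s interval against a 30000 ms timeout.
workaround = probe_triggers_handshake_timeout(25, 30_000)
```

Here `original` is `True` (RabbitMQ times out first and logs the error) and `workaround` is `False` (the FIN arrives with 5 seconds to spare).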
Why on earth the load balancer would leave the probe connections hanging like that is beyond me.
Update: MS have said they've no plans to fix this and suggested the workaround I'm already doing or moving to ServiceBus. Thanks for that.
Thanks for this. I can verify that they stuck to their word, and haven't fixed it.