Google Cloud Platform Blog
Debugging Health Checks in Load Balancing on Google Compute Engine
Wednesday, July 15, 2015
When a health check fails, how can you debug it? It's easier to understand how to debug a health check if you know what a correct load-balancing configuration looks like. In this post, I'll walk you through a correct configuration, talk a bit about how health checks work, and then discuss some typical kinds of failures and how to think about debugging health checks in general. I'll assume that you have some experience with load balancing on Compute Engine. If you're new to the subject, first try the steps in Network Load Balancing in the Compute Engine documentation.
Load balancing configuration
Let's look at a Debian GNU/Linux 7.8 (wheezy) instance running on Compute Engine. There is a package called google-compute-daemon that owns the /etc/init.d/google-address-manager startup script, as shown by running the following command:
$ dpkg-query -S /etc/init.d/google-address-manager
google-compute-daemon: /etc/init.d/google-address-manager
The address manager's job is to configure the network settings for the instance, including settings for load-balanced IP addresses. Starting with an instance that is not part of a load balancer's target pool, you can see the IP configuration by running the following commands:
$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 42:01:0a:f0:6c:91
inet addr:192.0.2.0 Bcast:192.0.2.0 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1460 Metric:1
RX packets:263618 errors:0 dropped:0 overruns:0 frame:0
TX packets:311301 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:114548264 (109.2 MiB) TX bytes:29265762 (27.9 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
$ ip route list table local
local 192.0.2.0 dev eth0 proto kernel scope host src 192.0.2.0
broadcast 192.0.2.0 dev eth0 proto kernel scope link src 192.0.2.0
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
Now, let's look at what happens when you add the instance to a network load balancing target pool. For this example, assume that the load balancer has the IP address 198.51.100.0. When you add the instance to the target pool, the address manager logs the change in syslog:
$ grep google-address /var/log/syslog
Feb 19 00:22:04 instance-1 google-address-manager: INFO Changing public IPs from None to ['198.51.100.0'] by adding ['198.51.100.0'] and removing None
The route list also has a new entry for the load balancer's IP address:
$ ip route list table local
local 192.0.2.0 dev eth0 proto kernel scope host src 192.0.2.0
broadcast 192.0.2.0 dev eth0 proto kernel scope link src 192.0.2.0
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
local 198.51.100.0 dev eth0 proto 66 scope host
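To check for this entry without reading the table by eye, a small shell helper can scan the route list for a local entry. This is a minimal sketch: has_local_route is a hypothetical name, not part of any Google package, and it assumes the output format shown above, where the first field is the route type and the second is the address.

```shell
#!/bin/sh
# has_local_route ADDRESS
# Reads the output of `ip route list table local` on stdin and succeeds
# (exit status 0) only if there is a "local" entry for ADDRESS.
# Hypothetical helper for illustration.
has_local_route() {
  awk -v ip="$1" '$1 == "local" && $2 == ip { found = 1 } END { exit !found }'
}

# Usage on an instance (198.51.100.0 is the example load balancer address):
#   ip route list table local | has_local_route 198.51.100.0 \
#     && echo "load balancer address is configured"
```

The helper reads from standard input rather than calling ip itself, so it can also be run against route lists saved from several instances.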
When the load balancer sends a packet to the backend, the packet is forwarded, not rewritten. In other words, when an instance receives load-balanced traffic, the destination IP address of the packet matches the external address of the load balancer. This is different from traffic that's directed to the external address of the instance itself. Such traffic goes through 1:1 network address translation (NAT) and arrives with its destination IP address set to the instance's internal IP address.
Health checks
Now that you know how traffic flows from the load balancer to the instance, you can see how the health check works. The metadata server at IP address 169.254.169.254 is responsible for sending traffic to the health check URL. The destination address of the health check is the load balancer's external address. This process mimics real incoming traffic.
The health check must be answered with an HTTP 200 status followed by a normal TCP connection closure within the time specified by the timeoutSec setting. For more information about health check options, see the documentation.
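The status and timing requirements can be approximated from the instance itself with curl. The sketch below rests on assumptions: health_verdict and probe are hypothetical names, the URL and timeout are whatever your health check is configured with, and a request issued from the instance is only a stand-in for the real health checker. It also does not inspect how the TCP connection is closed.

```shell
#!/bin/sh
# health_verdict CODE ELAPSED TIMEOUT
# Prints "healthy" only when the status code is exactly 200 and the response
# finished within TIMEOUT seconds. Hypothetical helper for illustration.
health_verdict() {
  code=$1
  elapsed=$2
  timeout=$3
  if [ "$code" != "200" ]; then
    echo "unhealthy: status $code"
  elif awk -v e="$elapsed" -v t="$timeout" 'BEGIN { exit !(e + 0 > t + 0) }'; then
    echo "unhealthy: took ${elapsed}s, timeout is ${timeout}s"
  else
    echo "healthy"
  fi
}

# probe URL TIMEOUT
# Requests URL once and prints the verdict. curl does not follow redirects
# unless -L is given, so a redirect shows up as a non-200 status.
probe() {
  result=$(curl -s -o /dev/null --max-time "$2" \
    -w '%{http_code} %{time_total}' "$1")
  health_verdict "${result% *}" "${result#* }" "$2"
}

# Usage (assumed address, path and timeout):
#   probe http://198.51.100.0/ 5
```

Keeping the verdict separate from the curl call makes it easy to reuse the same rule on status codes and timings collected some other way.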
Types of health check failures
Here are some common reasons that health checks fail.
Failure 1: Not listening on the load balancer's address
The most common cause of health check failure is a service that is bound only to the instance's own IP address. Here's an example set up with the following netcat command:
$ sudo nc -l -p 80 -s 192.0.2.0
$ netstat -an | grep :80
tcp 0 0 192.0.2.0:80 0.0.0.0:* LISTEN
You can see that there is a service listening on port 80 but, because it's bound to the instance's address, it will never answer queries for the load balancer's external address. It's easy to fix this problem: have your server process listen on 0.0.0.0 so it responds for any address. A server configured like this also responds on port 80 for the load balancer's external address:
$ netstat -an | grep :80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
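The same check can be scripted by parsing the netstat output. This is a sketch under assumptions: listening_addresses and accepts_any_address are hypothetical helper names, and the parsing relies on the column layout shown above, with the local address in the fourth field and the state in the last.

```shell
#!/bin/sh
# listening_addresses PORT
# Reads `netstat -an` output on stdin and prints the local address of every
# TCP socket in the LISTEN state on PORT.
listening_addresses() {
  awk -v port="$1" '$1 ~ /^tcp/ && $NF == "LISTEN" {
    n = split($4, parts, ":")
    if (parts[n] == port) print $4
  }'
}

# accepts_any_address PORT
# Succeeds if some listener on PORT is bound to the wildcard address
# (0.0.0.0 for IPv4, :: for IPv6).
accepts_any_address() {
  listening_addresses "$1" | grep -q -e "^0\.0\.0\.0:$1\$" -e "^:::$1\$"
}

# Usage on an instance:
#   netstat -an | accepts_any_address 80 || echo "port 80 is not on 0.0.0.0"
```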
Failure 2: Address not configured
In an earlier version, there was a race condition in the google-address-manager startup script between the address manager and syslog. You can see the fix on GitHub.
When this race condition occurs, the instance is never configured to accept traffic on the load balancer's external address because there's no entry in the routing table. Recall that you can view the routing table by running ip route list table local. A similar issue can also occur if the Linux out-of-memory (OOM) killer runs and kills the google-address-manager daemon. If this is the case, you need to fix whatever is preventing the daemon from running. In the meantime, you can start it manually as a workaround.
Failure 3: Sending an RST packet
The web server on the instance may be configured to close the health check's TCP connection with a reset (RST) packet instead of the usual TCP four-way closing handshake; some streaming media servers, for example, offer this option. In this case, running tcpdump will show what seems to be good traffic from the web server, until you look at the flags. You can see the R(ST) flag in the following output:
Flags [R.], seq 59, ack 92, win 8096
If your web server offers this option, make sure it is disabled for the health check URL.
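Spotting the flag in a long capture is easier with a filter. The sketch below assumes tcpdump's usual text output, in which flags appear inside square brackets after the word Flags, as in the line above; has_reset is a hypothetical helper name.

```shell
#!/bin/sh
# has_reset
# Reads tcpdump text output on stdin and succeeds if any packet has the RST
# flag set, which tcpdump prints as an "R" inside the Flags brackets.
has_reset() {
  grep -q 'Flags \[[^]]*R'
}

# Alternatively, let tcpdump capture only resets exchanged with the health
# checker by matching on the TCP flags byte:
#   sudo tcpdump -n 'host 169.254.169.254 and tcp[tcpflags] & tcp-rst != 0'
```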
Failure 4: Taking too long to answer
If the web server does not finish responding to the health check within the configured timeout, the instance is deemed unhealthy, even if the server eventually sends an HTTP 200 response code with a proper TCP connection closure. This is an example of the kind of failure that health checks are designed to catch. However, if this happens on a server that you consider healthy, you can address it by increasing the health check timeout period.
Failure 5: Not answering directly with a 200 response code
The web server may be configured to redirect to a page that returns an HTTP 200 response code. The health check will not follow the redirect; it expects the health check page to return a 200 directly.
Debugging
Here are some things to think about when you're trying to debug a load balancer that has failing health checks.
Run ip route list table local to check whether the load balancer address is properly configured. If it's not, look into why the address manager is not running. If the address is configured, run tcpdump on the instances in the load balancer's pool that are in an unhealthy state. For example, run the following command:
$ tcpdump host 169.254.169.254
This command prints all the packets from the metadata server, which is the server that issues the health checks. You may need to filter further, because the instance sometimes queries the metadata server for project metadata. You don't need to see those queries.
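One way to filter further relies on the fact that health checks are addressed to the load balancer's external address, while the instance's own metadata queries are not. The helper below is a sketch: health_check_filter is a hypothetical name, and the address and port are the example values from this post.

```shell
#!/bin/sh
# health_check_filter LB_ADDRESS PORT
# Prints a tcpdump filter expression that keeps only TCP traffic between the
# metadata server and the load balancer's address on the health check port.
health_check_filter() {
  echo "host 169.254.169.254 and host $1 and tcp port $2"
}

# Usage on an unhealthy instance:
#   sudo tcpdump -n "$(health_check_filter 198.51.100.0 80)"
```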
You might be tempted to try to debug health checks by browsing to the instance's external address, but this approach is insufficient because it doesn't probe for failures 1 and 2 above, and can also miss failure 3.
One last point to be aware of when you're debugging is how health checks affect the load balancer. If any instance is marked healthy, the load balancer will send traffic to it. If all the instances are marked unhealthy, they'll be sent traffic anyway, so that traffic is not dropped. (See items 3 and 4 in the backupPool section of the Target pools documentation.) Therefore, sometimes your load balancer looks like it's working even though all instances are marked unhealthy. This could be the case for failure modes 3 or 4; either way, you should still fix the health checks so you can properly distinguish between nodes that should get traffic and ones that should not.
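The rule described above can be written down as a small model. This is a simplified sketch, not the load balancer's implementation: traffic_targets is a hypothetical name, the input format is made up for illustration, and backup pools and failover ratios are ignored.

```shell
#!/bin/sh
# traffic_targets
# Reads "INSTANCE STATE" lines on stdin, where STATE is HEALTHY or UNHEALTHY,
# and prints the instances that would receive traffic: the healthy ones, or
# every instance when none is healthy.
traffic_targets() {
  awk '
    { all[NR] = $1 }
    $2 == "HEALTHY" { healthy[++h] = $1 }
    END {
      if (h > 0) {
        for (i = 1; i <= h; i++) print healthy[i]
      } else {
        for (i = 1; i <= NR; i++) print all[i]
      }
    }'
}

# Usage:
#   printf 'instance-1 UNHEALTHY\ninstance-2 UNHEALTHY\n' | traffic_targets
```

With every instance unhealthy, the model still lists all of them, which is why a pool with broken health checks can appear to work.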
- Posted by Charles Bacon, Technical Solutions Engineer