Monday, April 26, 2004

Analysis of why creating socket in Java takes 3 minutes


Here's what we think is happening, thanks to Evan's suggestion to use gdb and Carol's assistance in recreating the loopback's IP address and route:
Java's first attempt to open a socket is preceded by an initialization of its socket code.
The socket initialization code calls java.net.PlainDatagramSocketImpl.leave, as indicated in this stack trace from gdb:
#0  0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb75e3bf8 in connect () from /lib/tls/libpthread.so.0
#2  0xaa4b7c24 in Java_java_net_PlainDatagramSocketImpl_leave ()
   from /opt/scalent/jre/lib/i386/libnet.so
#3  0xaa4b8029 in Java_java_net_PlainSocketImpl_initProto ()
   from /opt/scalent/jre/lib/i386/libnet.so
#4  0xb2fa6bf2 in ?? ()
#5  0xb2fa0ddb in ?? ()
#6  0xb2f9e104 in ?? ()
#7  0xb721bb44 in JavaCalls::call_helper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#8  0xb72cfa6d in os::os_exception_wrapper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#9  0xb721bd96 in JavaCalls::call ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#10 0xb7200f6f in instanceKlass::call_class_initializer_impl ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#11 0xb720569c in instanceKlass::call_class_initializer ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#12 0xb72001cb in instanceKlass::initialize_impl ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#13 0xb72059af in instanceKlass::initialize ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#14 0xb720d6d4 in InterpreterRuntime::_new ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#15 0xb2fad510 in ?? ()
#16 0xb2fa0ddb in ?? ()
#17 0xb2fa0ddb in ?? ()
#18 0xb2fa0ddb in ?? ()
#19 0xb2fa0ddb in ?? ()
#20 0xb2fa0ddb in ?? ()
#21 0xb2fa0d04 in ?? ()
#22 0xb2fa0ddb in ?? ()
#23 0xb2fa10e1 in ?? ()
#24 0xb2f9e104 in ?? ()
#25 0xb721bb44 in JavaCalls::call_helper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#26 0xb72cfa6d in os::os_exception_wrapper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#27 0xb721bd96 in JavaCalls::call ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#28 0xb721b666 in JavaCalls::call_virtual ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#29 0xb721c1df in JavaCalls::call_virtual ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#30 0xb7274f25 in thread_entry ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#31 0xb7319caa in JavaThread::thread_main_inner ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#32 0xb7315674 in JavaThread::run ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#33 0xb72d1083 in _start () from /opt/scalent/jre/lib/i386/client/libjvm.so
#34 0xb75dedac in start_thread () from /lib/tls/libpthread.so.0
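
Nothing in our code ever calls leave() directly; the trace shows it running from PlainSocketImpl's static initializer, so just constructing the first socket in a JVM should be enough to trigger the hang. Here is a minimal sketch (my own harness, not our product code) that reproduces and times it:

import java.net.Socket;

public class FirstSocketTimer {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Constructing the very first Socket loads PlainSocketImpl, whose
        // static initializer (initProto) is where the trace above shows the
        // hang; no connect() has even been attempted yet.
        Socket s = new Socket();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("first Socket constructed in " + elapsed + " ms");
        s.close();
    }
}

On a healthy box this prints a few milliseconds; on the broken box it should sit for roughly 3.5 minutes before printing anything.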
  • The call in question seems to be the one that makes the machine leave the multicast group. See here.
  • Leaving the multicast group must involve connecting to loopback on a random port (the port it chooses varies, and is always above 32768) and then shoving some random bytes through. That's my theory. Update: This is almost certainly rtnetlink.
  • It just sits there trying to communicate with itself, and times out after ~3.5 minutes.
  • It appears to have this problem because the loopback device is not configured with an IP address and is not in the route table.
  • After we issued the following commands, everything worked just fine and the 3.5-minute delay turned into an 18 ms delay:
    • ip addr add 127.0.0.1/8 dev lo
    • /sbin/route add -net 127.0.0.0/8 dev lo
  • We found a most interesting thread on the Java forum that seems to mirror our problem. But the guy apparently never figured it out. Maybe I should post a solution there.
  • We went through the following other possible problems:
    • I always thought it was some kind of NFS file locking problem, but that's not the case at all.
    • We redirected the logging output to a local disk on the machine. That didn't help at all.
    • We saw the process was blocked reading /dev/random; /dev/random must use loopback to generate random numbers. To remove the block we switched to /dev/urandom, which is not as random (see the sketch after this list). The connection delay persisted, but this might explain why it took 50 minutes to send the first message on the SSL connection: once we removed the block on /dev/random, the 50-minute message-send delay seemed to go away.
    • We wrote some code to try to connect without SSL, but I'm not convinced that ever worked. It's still in the code and can be activated with a configuration setting, and I tested that configuration setting in my client.
    • We then thought it was a delay caused by doing a reverse DNS lookup on the peer's IP address (likely so it could do certificate validation or throw nice exceptions). We saw in gdb that the stack trace was deep in some Java code that was trying to do some kind of DNS operation. /etc/resolv.conf was empty, so we added our name server to it and rebooted the machine. That didn't help, but the stack trace changed.
    • Then the stack trace was stuck in Java_java_net_PlainDatagramSocketImpl_leave; I thought that might have still been some DNS hosage, so I changed /etc/hosts to include the addresses of the peers. That didn't help and didn't change the stack trace.
    • Finally we typed /sbin/ifconfig. That showed us that lo did not have an IP address.
    • Carol told us the correct magic commands to type.
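
For the /dev/urandom workaround mentioned above: on the Sun JDK we run, the java.security.egd system property selects the seed source, so seeding can be timed with a harness like this (the class is mine; only the property name comes from the JDK):

import java.security.SecureRandom;

public class SeedTimer {
    public static void main(String[] args) {
        // Run as: java -Djava.security.egd=file:/dev/urandom SeedTimer
        // to steer seeding away from the blocking /dev/random.
        long start = System.currentTimeMillis();
        byte[] seed = new SecureRandom().generateSeed(20);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("generated " + seed.length
                + "-byte seed in " + elapsed + " ms");
    }
}

The same switch can be made permanent by pointing the securerandom.source line in jre/lib/security/java.security at /dev/urandom.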

Creating a socket in Java takes 3 minutes

On some of our Linux boxes we would sometimes see a 3-minute delay between an attempt to open a socket and a successful connection. This did not make sense... but I eventually determined that it was caused by the loopback device not having an IP address, or not having a route in the route table.

You can probably fix this by typing the following:
/sbin/ip addr add 127.0.0.1/8 dev lo
/sbin/route add -net 127.0.0.0/8 dev lo
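
If you would rather check for the broken state from Java itself, something like this should work (this checker is mine, and it needs a 1.4+ JRE for java.net.NetworkInterface):

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Enumeration;

public class LoopbackCheck {
    public static void main(String[] args) throws Exception {
        NetworkInterface lo = NetworkInterface.getByName("lo");
        if (lo == null) {
            // On the broken box lo may not be visible to Java at all,
            // since it has no address bound.
            System.out.println("lo not visible (no address bound?)");
            return;
        }
        Enumeration addrs = lo.getInetAddresses();
        if (!addrs.hasMoreElements()) {
            System.out.println("lo exists but has no IP address");
        }
        while (addrs.hasMoreElements()) {
            System.out.println("lo: " + (InetAddress) addrs.nextElement());
        }
    }
}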

Also, we had these other symptoms:
  • /dev/random would block for about 3 minutes, probably because it depends on loopback to get its results.
  • Java trying to do a reverse DNS lookup would block for about 3 minutes, probably because /etc/resolv.conf was empty, so it tried to get results from 0.0.0.0, which was being interpreted as 127.0.0.1... Update: see this post about a related fix...
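
To watch the reverse-lookup stall in isolation, a small timer like this exercises the same path (the peer address below is a placeholder; substitute one of your own):

import java.net.InetAddress;

public class ReverseLookupTimer {
    public static void main(String[] args) throws Exception {
        InetAddress peer = InetAddress.getByName("10.0.0.1"); // placeholder
        long start = System.currentTimeMillis();
        String name = peer.getHostName(); // this triggers the reverse DNS lookup
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(name + " resolved in " + elapsed + " ms");
    }
}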

Here is a thread where I replied with this information; the thread also suggests other solutions, perhaps to the same or similar problems.

Now if I could just figure out how that Linux box got into that situation...