Too many open files

We had just upgraded our application to the latest version of Grails. Once deployed, our Jetty servers started mysteriously crashing after just a few hours of routine testing, complaining about “Too many open files”.

java.io.IOException: Too many open files
    at java.io.File.createNewFile(File.java:1006)
    at org.grails.plugin.resource.AggregatedResourceMeta.initFile(AggregatedResourceMeta.groovy:61)
    at org.grails.plugin.resource.CSSBundleResourceMeta.beginPrepare(CSSBundleResourceMeta.groovy:19)
    at org.grails.plugin.resource.ResourceProcessor.prepareResource(ResourceProcessor.groovy:545)

Follow along to understand how an ordinary combination of factors led to a potentially catastrophic failure.

So what, can’t we just increase the limit?

After some preliminary Google searches we found that this problem could be alleviated temporarily by increasing the limit on the number of open file descriptors on our Linux servers. On CentOS and Ubuntu we could do this simply by adding the lines below to the /etc/security/limits.conf file.

*    soft    nofile    1500
*    hard    nofile    1500

The default value of this limit is 1024, which is rather low for modern operating systems and hardware. IBM recommends a value of at least 8000 for running WebSphere.

We can then verify that the change has taken effect by executing the following command in a new login session.

ulimit -n

Now this is fine, but what if there is a leak in the application that would render setting any limit useless?

Diagnostic logging

In order to help us identify the source of the leak, we used a diagnostic tool called File Leak Detector. This tool, when installed and configured correctly, dumps the list of all open file descriptors whenever the “Too many open files” exception occurs.

To use the leak detector, we just have to modify the command that starts the Java process. For example,

java -javaagent:/path/to/file-leak-detector-1.8-jar-with-dependencies.jar=trace=/path/to/file-leak-trace.log,error=/path/to/file-leak-error.log,strong,dumpatshutdown

More options and explanations can be found on the tool’s website.

Anyway, after configuring the Java agent, restart the JVM (the Jetty server in our case). Now the application is ready to be pounded. Once we start seeing the exception in the logs, it is probably a good idea to kill the server before analysing the log file further.

One thing to keep in mind is that the tool dumps all the open file handles each time the JVM tries to open a file past the limit.

Here is a sample log. Notice that the line “Too many open files” appears twice, implying that this sample contains two separate dumps.

Too many open files
974 descriptors are open
#1 /usr/java/jdk1.7.0_45/jre/lib/resources.jar by thread:main on Tue Dec 30 11:38:53 GMT 2014
    at java.util.zip.ZipFile.<init>(ZipFile.java:146)
    at java.util.jar.JarFile.<init>(JarFile.java:153)
#2 /usr/java/jdk1.7.0_45/jre/lib/resources.jar by thread:main on Tue Dec 30 11:38:53 GMT 2014
    at java.util.zip.ZipFile.<init>(ZipFile.java:146)
    at java.util.jar.JarFile.<init>(JarFile.java:153)
#3 /opt/jetty/start.jar by thread:main on Tue Dec 30 11:38:53 GMT 2014
    at java.util.zip.ZipFile.<init>(ZipFile.java:146)
    at java.util.jar.JarFile.<init>(JarFile.java:153)
#4 /opt/jetty/lib/jetty-util-9.0.2.v20130417.jar by thread:main on Tue Dec 30 11:38:53 GMT 2014
Too many open files
820 descriptors are open
#1 /usr/java/jdk1.7.0_45/jre/lib/resources.jar by thread:main on Tue Dec 30 11:38:53 GMT 2014
    at java.util.zip.ZipFile.<init>(ZipFile.java:146)
    at java.util.jar.JarFile.<init>(JarFile.java:153)
#2 /usr/java/jdk1.7.0_45/jre/lib/resources.jar by thread:main on Tue Dec 30 11:38:53 GMT 2014
    at java.util.zip.ZipFile.<init>(ZipFile.java:146)
    at java.util.jar.JarFile.<init>(JarFile.java:153)

Along with files, the log can also contain open sockets.

#844 socket channel by thread:qtp1136976529-226 on Tue Dec 30 11:45:18 GMT 2014
    at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:107)
    at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)

Analysing the logs

Let’s start by extracting the line range which contains the first dump.

grep -n 'Too many' file-leak-error.log

Which prints

1:Too many open files
103544:Too many open files
207087:Too many open files
310630:Too many open files
414173:Too many open files
547724:Too many open files
681275:Too many open files
814826:Too many open files
948377:Too many open files

So, lines 1 to 103543 should contain our first dump, which is all we care about (line 103544 is just the header of the second dump, so including it in the command below does no harm).

Next, let’s get a list of all file names in this range, count how many times each one appears, and take the top 10. In other words,

head -103544 file-leak-error.log | grep "#" | awk '{print $2}' | sort | uniq -c | sort -nr | head -10

Which prints

106 socket
42 /opt/jetty/lib/servlet-api-3.0.jar
42 /opt/jetty/lib/jetty-xml-9.0.2.v20130417.jar
37 /opt/jetty/lib/jetty-http-9.0.2.v20130417.jar
30 /opt/jetty/lib/jetty-continuation-9.0.2.v20130417.jar
27 /opt/jetty/lib/jetty-server-9.0.2.v20130417.jar
23 /opt/jetty/lib/jetty-security-9.0.2.v20130417.jar
20 /opt/jetty/lib/jetty-servlet-9.0.2.v20130417.jar
17 /opt/jetty/lib/jetty-webapp-9.0.2.v20130417.jar
14 /opt/jetty/lib/jetty-deploy-9.0.2.v20130417.jar

The first line is a bit worrying because it means there are 106 open socket connections, which is quite a large number relative to the number of open files. It looks like our prime candidate for a leak. In fact, I later found out from the full stack traces that all 106 open sockets were coming from the same piece of code.

Cross verification

We can easily view the open sockets by using netstat (just remember to run the command before killing the process).

sudo netstat -p | grep CLOSE_WAIT

The output would look something like this.

tcp    1  0 localhost:59132    localhost:http-alt   CLOSE_WAIT  28079/java
tcp    1  0 localhost:59336    localhost:http-alt   CLOSE_WAIT  28079/java
tcp    1  0 localhost:59115    localhost:http-alt   CLOSE_WAIT  28079/java
tcp    1  0 localhost:59411    localhost:http-alt   CLOSE_WAIT  28079/java
tcp    1  0 localhost:59299    localhost:http-alt   CLOSE_WAIT  28079/java
tcp    1  0 localhost:59265    localhost:http-alt   CLOSE_WAIT  28079/java
tcp    1  0 localhost:59197    localhost:http-alt   CLOSE_WAIT  28079/java

Unsurprisingly, the number of sockets in CLOSE_WAIT state closely matched the total number of open sockets as reported in the error log.

Yet another foolproof way to view all the files opened by a process is to execute sudo ls -l /proc/<pid>/fd, which is a great alternative in case lsof isn’t available on your machine.

Tip: I would typically use a one-liner such as the one below to monitor the number of open files.

watch "sudo ls -l /proc/$(ps -ef | grep java | grep -v grep | awk '{print $2}')/fd | wc -l"

Diagnosis

So what’s causing so many unclosed socket connections? When I took another look at the stack traces in file-leak-error.log, I noticed that every single one of the socket-related stack traces had a reference to net.spy.memcached.MemcachedClient. As it turned out, a Grails controller method was creating multiple instances of MemcachedClient and not letting go of the references.

Also, since controllers in Grails are prototype scoped by default, a new instance of the controller is created for each request. This combination meant that every invocation of the faulty method created several more MemcachedClient instances.

Just invoking the constructor of the MemcachedClient class is enough to open a new socket connection; this is by design of the library.
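
To make the pattern concrete, here is a minimal Groovy sketch of the kind of leaky controller code that can produce this behaviour. The controller name, action name and endpoint are made up for illustration and are not our actual code; only the use of net.spy.memcached.MemcachedClient mirrors what we found.

import net.spy.memcached.MemcachedClient

// Hypothetical illustration of the leaky pattern, not our real controller.
class HealthCheckController {

    def check() {
        // A new client, and therefore a new socket, on every single request.
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress('localhost', 8080))

        // ... exercise the deliberately broken connection, then drop the
        // reference without ever calling client.shutdown(), so the
        // underlying socket is never closed.
        render(status: 200)
    }
}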

But why are the connections in CLOSE_WAIT state? Remember I mentioned right at the beginning of this article that this was discovered during routine testing? That’s the key: MemcachedClient was connecting to a fake endpoint, localhost:8080, in order to simulate a broken connection. And since a Jetty server was already running on port 8080, the connection attempts did not fail with a connection refused or socket timeout error; Jetty accepted the connections and eventually closed the idle ones from its end, while our code never closed them from ours. A socket whose peer has closed the connection, but which the local application has not yet closed, sits in CLOSE_WAIT, which is exactly what we were seeing.

All this makes perfect sense now.

More proof

We know where the problem lies. Now, let’s go a step further and put the final nail in the coffin. To do this, we shall fire up Java VisualVM using the command jvisualvm (it ships with the JDK).

Then, simply connect to the Java process, navigate to the Sampler tab and click on the Memory button. I also added .*MemcachedClient to the filter box to eliminate any unwanted noise.

[Screenshot: VisualVM Sampler, before GC]

Observe the unusually high number of MemcachedClient instances. Now watch carefully what happens when I click Perform GC.

[Screenshot: VisualVM Sampler, after GC]

The number of instances immediately drops down to 1. This means there isn’t any memory leak involved. It is just that, since we are connecting to a fake endpoint, the Java process never handles the CLOSE_WAIT state gracefully and leaves the sockets open. And since sockets are also files on Unix / Linux, they get counted towards the total number of open files.

Solution

It can’t get simpler than this. We just decided to use a static reference to our broken Memcached connection instead of creating it a gazillion times. This ensured that the number of MemcachedClient instances, and therefore the number of socket connections, always stayed at 1.
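
Roughly speaking, the fix amounts to something like the sketch below: a single, lazily created MemcachedClient shared by the whole application. The class name and endpoint here are hypothetical and only illustrate the idea, not our exact code.

import net.spy.memcached.MemcachedClient

// Hypothetical sketch: one shared MemcachedClient for the whole application,
// created lazily and reused, instead of a new client per request.
class MemcachedHolder {

    private static volatile MemcachedClient instance

    static MemcachedClient getClient() {
        if (instance == null) {
            synchronized (MemcachedHolder) {
                if (instance == null) {
                    instance = new MemcachedClient(
                            new InetSocketAddress('localhost', 8080))
                }
            }
        }
        return instance
    }

    // Call once on application shutdown to close the single socket.
    static void close() {
        instance?.shutdown()
    }
}

A Grails service would serve the same purpose, since services are singleton scoped by default; the important part is that the client, and with it its socket, is created exactly once and shut down exactly once.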
