Mysterious question

It was the Christmas code freeze of 2013. We had successfully upgraded our production servers to Jetty 9. Us developers were quite content about not having to maintain the app for an extended period of time. However, none of us had anticipated the mysterious errors that would eventually consume the better parts of our holidays.

The problem

It had been nearly three weeks since our last deployment. Someone on the team tried to access an admin URL and all they got to see was an error page. When we looked into the logs, it looked like a class that was used to render the page had mysteriously disappeared. Consequently, a NoClassDefFoundError was being thrown.

The next day we picked up a few more stacktraces in the logs, for seemingly random classes. This sent a chill down our spines as the admin URLs that had stopped working were critical control points of our application. Everyone feared that if the inevitable were to happen, without access to these admin pages, we would be in a lot of trouble. Besides, it was the holiday period and as we were operating with a fourth of our normal team, no one wanted to take any chances.

Desperate attempts

Since this problem only manifested after leaving the servers running for an extended period of time (our usual deployments are fortnightly), we initially suspected it to be some kind of JVM issue, even though our routine performance and soak tests hadn’t highlighted any issues.

So, we started looking closely at possible JVM settings. One particular setting that caught our attention was CMSClassUnloadingEnabled. In theory this setting should unload classes that are no longer needed.

From Oracle’s official documentation,

-XX:+CMSClassUnloadingEnabled

Enables class unloading when using the concurrent mark-sweep (CMS) garbage collector. This option is enabled by default. To disable class unloading for the CMS garbage collector, specify -XX:-CMSClassUnloadingEnabled.

Could this setting have been pushing the JVM into an over aggressive mode, deleting the classes altogether instead of just unloading them? Bear in mind that our servers were highly underutilised in terms of memory, heap space, perm gen space and CPU utilisation.

Since CMSClassUnloadingEnabled option is enabled by default, we tried turning it off. Unfortunately, it made no difference and we were still seeing the problem after about 10 days.

We even tried bumping up the heap and perm gen sizes even further with no positive outcome.

Looking in the right place

So, what could be causing the mysterious exceptions? One interesting clue is in the exact number of days it takes to first spot the exceptions after a Jetty restart — 10 days.

We run Jetty servers on CentOS. Out of desperation, I started comparing the directory layout of the exploded WAR file from just after deployment to the point in time when the exception had started to show up.

To my disbelief, I found some .class files missing from the 10+ day old snapshot. It was as if someone or something had simply deleted those class files from a running server. Leaving the server in the same state for another day or two resulted in even more files, including compiled JSP classes being deleted.

Now that was worrying and at the same time hopeful. At this stage, I must highlight the fact that Jetty extracts WAR files to a subdirectory under /tmp. For example, /tmp/jetty-0.0.0.0-8080-test.war-_test-any-8900275691885214790.dir. This is Jetty’s documented default behaviour.

By default, jetty will create this directory inside the directory named by the java.io.tmpdir System property.

On unix / Linux, /tmp is designed to live up to its name — Most programs use /tmp as a temporary storage and delete any files created in it upon exit. Sometimes the operating system even offers a helping hand in keeping /tmp clean by running housekeeping services.

tmpwatch

On CentOS this housekeeping service is called tmpwatch. Here is a brief description of tmpwatch from its manpage.

tmpwatch recursively removes files which haven’t been accessed for a given time. Normally, it’s used to clean up directories which are used for temporary holding space such as /tmp.

When changing directories, tmpwatch is very sensitive to possible race conditions and will exit with an error if one is detected. It does not follow symbolic links in the directories it’s cleaning (even if a symbolic link is given as its argument), will not switch filesystems, skips lost+found directories owned by the root user, and only removes empty directories, regular files, and symbolic links.

By default, tmpwatch dates files by their atime (access time), not their mtime (modification time). If files aren’t being removed when ls -l implies they should be, use ls -u to examine their atime to see if that explains the problem.

Also, CentOS has a script under /etc/cron.daily, also called tmpwatch:

$ cat /etc/cron.daily/tmpwatch

#! /bin/sh
flags=-umc
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
	-x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
	-X '/tmp/hsperfdata_*' 10d /tmp
/usr/sbin/tmpwatch "$flags" 30d /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
	/usr/sbin/tmpwatch "$flags" -f 30d "$d"
    fi
done

The line /usr/sbin/tmpwatch "$flags" -x /tmp/foo -X '/tmp/bar*' 10d /tmp can be translated to English by saying:

Delete all files under /tmp which haven’t been accessed in the last 10 days except /tmp/foo and the ones matching the pattern /tmp/bar*.

In other words, tmpwatch had been deleting all files in the Jetty’s work directory that had not been accessed in the last 10 days.

This was a eureka moment for me!

Solution

So we have finally nailed down the root cause of this problem. But how can we fix it? There are at least two possible solutions for this.

1. Exclude Jetty’s directory from tmpwatch

Remember the -x and -X options of tmpwatch command from the previous section? Yes, these are exclude options and can be used to exclude specific file or file name pattern from being purged. So, we could simply add a -X '/tmp/jetty*' option and tmpwatch would leave our Jetty directory alone.

But something about this solution doesn’t seem right.

2. Make Jetty extract WAR files in a different location

How? Jetty has already made a provision for this. Again quoting from their documentation,

You can instruct Jetty to use a different parent directory by setting the context attribute org.eclipse.jetty.webapp.basetempdir to the name of the desired parent directory. The directory named by this attribute must exist and be writeable.

There are ways to set this context attribute. But an even cleaner solution exists.

Mostly for backward compatibility, from jetty-9.1.1 onwards, it will be possible to create a directory named "work" in the ${jetty.base} directory. If such a directory is found, it is assumed you want to use it as the parent directory for all of the temporary directories of the webapps in that ${jetty.base}.

So that was it. All we had to do was to create an empty directory called work under Jetty home. The new directory layout of Jetty home looks as follows.

|
|- bin
|- etc
|- lib
|- logs
|- modules
|- resources
|- webapps
|- work
|- notice.html
|- README.TXT
|- start.ini
|- start.jar
\- VERSION.txt

We restarted the servers after this change, and everything went back to normal as before.

Comments


comments powered by Disqus