Thursday, September 19, 2013

Amazon EC2 Instance inaccessible because of s3fs

Hi again!

It has been a long time since my last post, but I always wanted to continue this blog - as it turns out, some people might actually be reading it ;)

During the time I was gone, I changed jobs a couple of times, even changed countries. I am now based in London, UK - a rather exciting city. Now working a lot with Amazon AWS and Java EE.

This blog post is about problems we had setting up auto scaling on the Amazon AWS EC2 cloud. We were using Asgard (the home of the gods!), a GUI front-end by Netflix, to set up clusters and test them.

As we are moving more and more of our infrastructure into EC2, some problems become more and more apparent, especially the ability (or in our case - inability) of applications to work in clusters. There are some things that need to be considered before creating a cluster of your application, but they are not in the scope of this post.

Long story short, we needed to take some shortcuts, and instead of using Amazon S3 as it should be used - via its HTTP API - we went the s3fs way... It is a neat helper tool but can have nasty effects like memory leaks (I saw it grab 50% of the machine's RAM) and wild CPU usage... as it had in our case.

I was in the middle of a JMeter performance test of our two migrated clusters: one is a PHP frontend app cluster and the other is a Perl service application cluster. Both of them started off as small instance clusters, however the tests revealed that our Perl apps have a rather annoying problem of going really well and then just dying when overwhelmed. So we thought we would go with medium instance sizes for the Perl cluster...

The medium instances, however, did not start properly: we couldn't SSH into them, and the "get system log" option revealed that the setup didn't finish. The problematic thing about this was that even though the small instances started without issue, the medium or large ones (we didn't try bigger ones as there didn't seem to be a point) wouldn't start. In the AWS console the instances passed all checks (which seem like rather shitty tests IMHO), but still there was zero movement.

After a few hours of trying different options to connect to the instances, we actually paid for Amazon's business level support - which is amazing. It costs you about 10% of your monthly fee or $100, whichever is higher, but considering we could otherwise have spent a week figuring this out... well worth it. After about 30 minutes with one of their support staff, we finally found the problem.

We had an s3fs share mounted straight to /mnt and without the _netdev option in fstab (which basically tells the system that the filesystem needs the network connection in order to be mounted, so the mount has to wait until networking is up). After changing the mount point to some other folder and adding the fstab option, everything went back to normal and the instances booted properly.

What we assumed is that the small instance was slow enough to start s3fs after it had started networking, therefore finishing the boot sequence correctly.
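For anyone who wants to check their own images for this, here is a rough Python sketch of the fstab check described above. It is only an illustration: the helper name, the bucket name and the mount points are all made up, and it assumes s3fs entries appear either as s3fs#bucket with type fuse or with type fuse.s3fs.

```python
# Hypothetical helper (not part of s3fs or AWS tooling): flags s3fs entries
# in fstab content that do not carry the _netdev mount option.

def find_missing_netdev(fstab_text):
    """Return the mount points of s3fs entries that lack _netdev."""
    problems = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # A usable entry needs at least: device, mount point, type, options.
        if len(fields) < 4:
            continue
        device, mount_point, fs_type, options = fields[:4]
        # Assumption: these are the two ways an s3fs mount shows up in fstab.
        is_s3fs = device.startswith("s3fs#") or fs_type == "fuse.s3fs"
        if is_s3fs and "_netdev" not in options.split(","):
            problems.append(mount_point)
    return problems


# Example fstab content; bucket name and mount points are invented.
SAMPLE_FSTAB = """\
/dev/xvda1      /          ext4  defaults             1 1
s3fs#my-bucket  /mnt       fuse  allow_other          0 0
s3fs#my-bucket  /data/s3   fuse  _netdev,allow_other  0 0
"""

if __name__ == "__main__":
    # Prints the mount points that would be attempted before networking is up.
    print(find_missing_netdev(SAMPLE_FSTAB))
```

On a real machine you would pass in the contents of /etc/fstab instead of the sample string. Any mount point it reports is a candidate for the kind of stuck boot we ran into.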

Hope this helps anyone stuck with the same problem. If an instance is not booting properly - look at fstab, and use s3fs only as a temporary option.