someGreatTechName

Saturday, April 5, 2014

Fixing a problem with s3fs, updatedb and high S3 bills

S3FS is a very convenient piece of software. It lets you easily keep all your data in "the cloud" without the possibility of ever overfilling. It is pretty good for things like back end backups copies and infrequently accessed files, but I would never advise for it to be used in conjunction with a web application that may actually serve the files from the mounted resource.

In continuous effort to get our AWS bill as low as possible, I've stumbled upon an extraordinary number of Tier2 and Tier1 request on one of our Regions.
Out S3 buckets had over 400 million Tier 1, and almost 1 million Tier2 requests per month which I thought is odd, as we paid more for accessing S3 than for storing several TB of data in it.

The problem

As our AWS infrastructure is quite modest compared to some, there is no way we could have as much traffic to generate this amount of S3 usage.

Amazon Simple Storage Service EU-Requests-Tier1
$0.005 per 1,000 PUT, COPY, POST, or LIST requests - 86,694,315 Requests

Amazon Simple Storage Service EU-Requests-Tier2

$0.004 per 10,000 GET and all other requests - 447,163,447 Requests

Looking at our traffic usage on our monitoring TV, powered by ElasticSearch and Kibana (very cool) I could quickly see that 447mln requests is more than we get in 6 months on our servers. There had to be another explanation for this, than just plain traffic.

The search

Amazon gives you a few possibilities to drill down into your usage of their resources. The easiest would be to look at usage reports, under Billing and Cost Management. You get a CSV file in which you can easily check, which actual buckets are responsible for the high usage on a per-month, per-day and per-hour basis.

Looking at the usage patterns on the buckets, it was very quickly apparent, that 95% of the usage was on 2 buckets, used by a legacy perl application, which in order to connect to S3, unfortunately had to use S3FS.

After googling around a bit I found the culprit, it was updatedb ! The indexing process had hammered s3fs with indexing a 4TB bucket every day, all day, for who knows how long...

The solution

On our MAPS cluster, I added 2 lines in the /etc/init/updatedb.conf file, one with the path to be excluded, the other one with the file system:

PRUNEPATHS="/tmp /var/spool /media /home/.ecryptfs /mnt/s3"

PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre tmpfs usbfs udf fuse.glusterfs fuse.sshfs curlftpfs ecryptfs fusesmb devtmpfs fuse.s3fs"

The last step was baking new AMI's on the cluster and launching a new one that was to replace the faulty one. I would also suggest to check in a couple of hours the hourly based s3 usage report if it has gone down.

The other thing you absolutely have to do is to enable tag based billing and programmatic access, as it will work for the months after you have enabled it. This will help to get insight into billing in the future with tools like Netflix's Ice.

Hope you enjoyed this solution.

Cheers,
P

Friday, October 4, 2013

From zero to cloud - guide to setting up a dev process with the Amazon AWS Cloud pt. 1

Hi all,

once again I managed to get some time off to write.

There have been interesting things happening in my career as I had the opportunity to plan and execute an entire companies move from Rackspace to Amazon AWS with a very diverse app base with PHP Symfony 1, 1.4, 2.X apps, Perl (mod perl) apps and Java (Red5) apps, Magento stores, vendor specific tools like Active Collab.

It has taken me quite some time to figure out some problems as there is no real comprehensive thing out there to start easy with the Amazon AWS adventure and the Amazon guides are a bit dry.

To move you closer to the problem domain faced here, I will point out the steps that you need to take to have a complete application development cycle. Later I will hopefully manage to write a post about every step of the way, so it is much easier to understand the big picture, how things work together in an AWS based application.

I will not write about all the problems, that we had to overcome, because of the nature of the applications themselves, as the capacity for an app to be clustered is beyond the scope of this article. The most important rule however is - do not write anything on the server that is important to anyone, either upload it to Amazon S3, some attached webdav or the database (some db's have the option to save files like MongoDB's GridFS).

So the first usual step in the apps development is the actual development no surprises here.

1. Version control and simple collaboration (e.g. code reviews).

GIT and github.com to keep the code, collaborate and make it easy to pull from anywhere. I have actually pushed the company to move away from CVS, which wasn't very difficult as it is frustrating to work with it. Github has a fantastic interface and with the addition of their in-code search it is quite more practical than the dreadful search in Eclipse.

All you need is to register with github.com and follow simple instructions on getting started. We have a pro account with github, so we can keep our repositories private, for small teams you can however go to their competition - bitbucket.org, which has a pay-by-users model and lets you have free private repos.

2. Continuous Integration - making sure your code stays in good condition

Jenkins to run test and build. Well that is also one of the easy ones, there are some more sophisticated vendor solutions out there, but Jenkins with some extensions offers everything we needed. I added GIT, SVN, Google OpenID login, so that everyone from the organisation can log in and Apache Maven support.

The other cool thing Jenkins lets you do is have code statistics, like phpmd and copy/paste detection, code style checking and keeping the statistics for later. We also plan on making Jenkins directly fire up testing infrastructure in Amazon AWS via CloudFormation scripts so basically a file describing exactly your deployment and the relationships between machines. For the sake of simplicity and cost (which will come later in the post) just put Jenkins locally or on a traditional server.

3. Figuring out if everything belongs in the cloud

Now you have the means to code, collaborate and make it a really nice product so now you need to figure out which parts of your infrastructure should go into Amazon AWS.

As with everything, the cloud is only the answer to a very specific question, but certainly no to all of it. The true advantage of cloud computing isn't scaling up, as you can do that with normal infrastructures - it's scaling down in minutes. The advantage of being able to scale the number and size of instances is only relevant if you have fluctuating traffic. That means if the differences of usage during the day are quite higher than 1-2 servers of difference (in terms of computing power) than it's a classic case of lets-go-cloud.

There are some applications however, that are not really suited or would be missing the point of cloud computing. We have found that some applications simply don't belong there:

Windows based stuff - it's just a question of cost. If you don't need the scalability for windows boxes, it's far less expensive to put them with a traditional provider, like OVH
All utility apps - everything you need to have running in single-server mode, so in other words there will never be need to make 2+ of them, but also making less than 1 would not be good. Every utility app, like Jenkins for instance, may or may not need additional nodes to work with, if it doesn't buy a normal server (or 2 if you like resiliency) and put it on there, make regular backups though. In our deployment Jenkins went onto a normal server.
Everything requiring constantly needing hight computing power. Well every app that doesn't need to scale beyond one/two powerful servers and is constantly on-line is IMHO better off on a traditional server.

What you have to take into account, that the AWS instances, even though even called "large" are nowhere near a £200 per month server, when it comes to computing power and usually costs the same or more. You get however the ability to run 1 tiny instance during the nights and 20 of them during heave days traffic which at the end might save you some major money.

Our traffic is not only fluctuating weekly, daily but is also strongly seasonal, as we are a ed-tech company. During the summer and all school breaks there is just need for a minimum infrastructure.

4. Moving domains to Route53

Route53 is Amazons DNS service. It offers a very nice interface for your dns administration and it's certainly easier to configure than bind9. This step of course isn't necessary, but it is good to move your domains so that you would have "one (key)ring to rule them all". As with AWS you get API access to almost everything so it is much easier to automate if everything is in one place, plus it's a lot tidier than having to log into somewhere else (or several locations) to control your domains.
We found a small Perl script that runs in CRON, that actually checks the instance tags of each instance for a specific tag and creates domains based on that name, so you can have a bit more simple to write name than the usual public DNS AWS instance domain names (e.g. ec2-54-228-109-17.eu-west-1.compute.amazonaws.com). In order for the script to work, you need to have the Amazon CLI tools installed.

This concludes the things you need to know to start getting familiar with AWS. In the next posts I will write about:

Getting your team to have access to AWS services with IAM
Managing some aspects of your deployments with the AWS Console
Setting up and managing your workflows with Netflix's Asgard control panel
Deploying your applications to your cloud servers with Capistrano
Setting up and testing your auto scaling cloud deployments with jMeter and AWS CloudWatch

Hope you enjoyed reading, See you soon.

Thursday, September 19, 2013

Amazon EC2 Instance inaccessible because of s3fs

Hi again!

Long time since my last post but always wanted to continue this blog - as it turns out some might be actually reading it ;)

During the time I was gone, I changed jobs a couple of times, even changed countries. I am now based in London, UK - a rather exciting city. Now working a lot with Amazon AWS and Java EE.

This blog post is about problems we had setting up auto scaling on Amazon AWS EC2 cloud. We were using a GUI front-end named Asgard (the home of the gods!) by Netflix trying to set up clusters and test them.

As we are moving more and more of our infrastructure into EC2, some problems become more and more apparent, especially the ability (or in our case - inability) of applications to work in clusters. There are some things that need to be considered, before creating a cluster of you application, which are not in the scope of this post.

Long story short, we needed to make some short cuts and instead of using Amazon S3 as it should - via its HTTP API, we went the s3fs way... it is a neat helper tool but can have nasty effects like memory leaks (saw it grab 50% of the machines RAM) and wild CPU usage... as it had in our case.

I was during a jMeter performance test of our two moved clusters, one is a PHP frontend app cluster and the other is a Perl service application cluster. Both of them started of as small instance clusters however the tests revealed that our Perl apps have a rather annoying problem with going really well and then just dying when overwhelmed. So we thought to go with medium instance sizes for the Perl cluster...

The medium instances hover did not start properly, they couldn't be ssh'd and the "get system log" option revealed that the setup didn't finish. The problematic thing about this was, that even though the small instances started without issue, the medium or large (we didn't try higher ones as it didn't seam there is a point) won't start. In the AWS console the instances passed all checks (which seam like a bit of shitty tests IMHO) but still there was zero movement.

After a few hours of trying different options to connect to the instances, we actually paid for Amazon's business level support - which is amazing. It costs you about 10% of your monthly fee or $100 whichever is higher, but assuming we would spend a week on figuring this out... well worth it. After about 30 minutes with one of their support staff, we finally found the problem.

We had a s3fs share mounted straight to /mnt and with no __netdev options (which basically means, that it need the network connection in otder to be mounted). After changing the mount point to some other folder and adding the fstab option, everything went back to normal and the instances

What we assumed is that the small instance was slow enough to start s3fs after it had started networking, therefore finishing the boot sequence correctly.

Hope this helps anyone stuck with the same problem. If an instance is not booting properly - look at fstab and use s3fs as a temporary option.

Cheers,
P