
Firstly, I appreciate your patience in reading and thinking through the problem I've described here.

I had a unique problem on one of my AWS EC2 instances (Ubuntu 14.04): the instance would simply go unreachable over both HTTP and ping, and it would also lock me out of SSH access. I had to log in to the AWS console every time and reboot the instance manually. As a workaround, I have configured CloudWatch monitoring to reboot the instance automatically and send me a notification email whenever a system status check fails.

So far, so good.

Now, what I really want is the root cause of the instance going unreachable. I'm assuming it is a memory issue. I have gone through the get-system-logs output, which helped a bit. But is there any way I can configure CloudWatch to send me the failure logs, or something similar, along with the alert email? Or is there any way I can alert myself with enough diagnostic information (for example, memory usage at 80%, network not responding, and so on) when the instance goes unreachable? I have heard of the swap tool, but I am looking for something more generic, not limited to memory monitoring.

Anything? Does anyone have any ideas?

CloudWatch cannot send you log files, but you can configure your instance to ship its log files to CloudWatch Logs (and export them to S3) for later analysis. I would start off by setting up a CloudWatch dashboard with various monitoring metrics (memory, CPU, disk I/O, etc.) to give you an idea of the system state at the time of failure. You can also write custom software (e.g. Python scripts) to send your own metrics to CloudWatch, such as the number of processes, free disk space, and so on. – John Hanley
docs.aws.amazon.com/AWSEC2/latest/UserGuide/… This is a very clean demo of how to collect the memory usage metric from your EC2 instance and send it to CloudWatch so you can keep monitoring it. If this is a memory usage issue, then after you implement this solution and the same thing happens again, check the memory usage in CloudWatch for that time. – Ashwini
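
Not part of the comments above, but here is a minimal sketch of that custom-metric idea using the AWS CLI. The Custom/System namespace and MemoryUtilization metric name are my own placeholders, and it assumes the awscli package is installed, a default region is configured, and the instance role allows cloudwatch:PutMetricData:

#!/bin/bash
# Sketch only: push a rough memory-utilisation figure to CloudWatch.
# "Custom/System" and "MemoryUtilization" are placeholder names, not AWS defaults.

# Rough used-memory percentage from /proc/meminfo (total minus free, buffers and cache)
USED_PCT=$(awk '/MemTotal/ {t=$2} /MemFree/ {f=$2} /^Buffers/ {b=$2} /^Cached/ {c=$2} END {printf "%.1f", (t-f-b-c)/t*100}' /proc/meminfo)

# Instance ID from the EC2 instance metadata service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

aws cloudwatch put-metric-data \
    --namespace "Custom/System" \
    --metric-name MemoryUtilization \
    --dimensions InstanceId="$INSTANCE_ID" \
    --value "$USED_PCT" \
    --unit Percent

Run from cron every minute or so, this gives you a memory graph in CloudWatch to look back at after the next lockup.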

1 Answer


I would go old skool and use a script on the server to log to a file.

Presumably there is a particular program running on the system that is giving you this problem (you don't mention this detail above).

System programs usually store their PID in a file. Let's assume the file is /var/run/nginx.pid; you can work out the right file for your particular system.

Write a script that reads the PID and records the memory use, for example by adding this file as /usr/local/bin/mymemory:

#!/bin/bash
# Read the PID of the process we want to watch (adjust the path for your system)
PID=$(cat /var/run/nginx.pid)
# The 3 fields are %MEM, VSZ and RSS
DATA=$(ps uhp "$PID" | awk '{print $4, $5, $6}')
NOW=$(date --rfc-3339=sec)
echo "$NOW $DATA" >> /var/log/memory.log

Make the script executable (chmod +x /usr/local/bin/mymemory) and add a line to the root crontab:

* * * * * /usr/local/bin/mymemory

This will produce an ever-growing file with one memory sample per minute. I suggest you log in once a day to check it, download it if it looks interesting, and then delete it. (In a real production context, log rotation could be used; see the sketch below.)
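
For completeness, a minimal logrotate sketch for that production case, assuming the log path above; the file name /etc/logrotate.d/mymemory is just a suggestion:

# /etc/logrotate.d/mymemory
/var/log/memory.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}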

Every time there is a crash, the file should contain memory usage data leading up to it.