3
votes

I'm running a WordPress site on an AWS EC2 t2.micro instance. Nothing fancy, just a simple WordPress site.

The following has been happening pretty consistently, every few weeks:

  1. my page becomes unreachable, and SSH can't reach the instance either
  2. when I check the AWS dashboard, everything looks fine; no warnings or complaints
  3. when I reboot it from the AWS console, one of the status checks fails: "Instance reachability check failed at (time)"

The system log shows that there was a kernel panic (full log copied below). What could cause this? Bad hardware on AWS's side? This really puzzles me, please help. Thanks!

[2950123.794183] end_request: I/O error, dev xvda, sector 13514688
[2950123.797618] end_request: I/O error, dev xvda, sector 13514712
[2950123.798170] end_request: I/O error, dev xvda, sector 13514776
[2950123.798170] end_request: I/O error, dev xvda, sector 13514816
[2950123.798170] end_request: I/O error, dev xvda, sector 13514872
[2950123.798170] end_request: I/O error, dev xvda, sector 12894512
[2950123.798170] end_request: I/O error, dev xvda, sector 12875536
[2950123.798170] end_request: I/O error, dev xvda, sector 511456
[2950123.798170] end_request: I/O error, dev xvda, sector 13403944
[2950123.798170] end_request: I/O error, dev xvda, sector 515968
[2950124.114201] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007
[2950124.114201] 
[2950124.118093] CPU: 0 PID: 1 Comm: init Not tainted 3.14.35-28.38.amzn1.x86_64 #1
[2950124.118093] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015
[2950124.118093]  ffff88003d578ae0 ffff88003da2bc80 ffffffff814867ca ffffffff81788cf0
[2950124.118093]  ffff88003da2bcf8 ffffffff814825ab ffffffff00000010 ffff88003da2bd08
[2950124.118093]  ffff88003da2bca8 ffffffff81c9af20 0000000000000007 ffff88003da30480
[2950124.118093] Call Trace:
[2950124.118093]  [<ffffffff814867ca>] dump_stack+0x45/0x56
[2950124.118093]  [<ffffffff814825ab>] panic+0xc8/0x1cd
[2950124.118093]  [<ffffffff8105ffd1>] do_exit+0xa41/0xa50
[2950124.118093]  [<ffffffff8106005f>] do_group_exit+0x3f/0xa0
[2950124.118093]  [<ffffffff8106f707>] get_signal_to_deliver+0x1c7/0x6e0
[2950124.118093]  [<ffffffff81014458>] do_signal+0x48/0x6f0
[2950124.118093]  [<ffffffff811e7c38>] ? fsnotify+0x228/0x2f0
[2950124.118093]  [<ffffffff81014b68>] do_notify_resume+0x68/0x90
[2950124.118093]  [<ffffffff8148d822>] retint_signal+0x48/0x86
what happens if you move to other t2.micros? – tedder42
Not sure if that makes a difference; did I just happen to land on hardware that's a "lemon"? :) – Kitetaka
it would rule that out, yes. – tedder42
Instead of rebooting, if I stop the instance and then start it again, it works (for another few weeks before the hiccup recurs). But stopping and starting changes the instance's IP, and it's a hassle to go to Route 53 and point the DNS at the new IP address/endpoint. – Kitetaka
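Aside on the changing-IP hassle: an Elastic IP survives stop/start cycles, so the Route 53 record never has to change. A minimal Python 3 sketch with boto3, assuming credentials with the appropriate EC2 permissions; the instance ID below is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")

    # Allocate a VPC Elastic IP.
    alloc = ec2.allocate_address(Domain="vpc")
    print("Allocated", alloc["PublicIp"])

    # Associate it with the WordPress instance. The association survives
    # stop/start cycles, so DNS can keep pointing at a stable address.
    ec2.associate_address(
        AllocationId=alloc["AllocationId"],
        InstanceId="i-0123456789abcdef0",  # placeholder: your instance ID
    )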

1 Answer

3
votes

You should design your solution to tolerate failure. Failure is inevitable, but AWS provides all the services you need to deal with it.

Set up your EC2 instance in an Auto Scaling group, and create a health check that AWS can use to determine whether your instance is running OK.

If you set it up correctly, when AWS sees that your instance is failing or has failed, it will automatically replace it with another (a sketch follows the documentation link below).

This will require some work on your part to architect things correctly, but you will no longer have to watch your instance yourself and spin up a new one by hand when something goes wrong.

http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/WhatIsAutoScaling.html
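To make this concrete, here is a minimal Python 3 sketch with boto3. It creates a one-instance Auto Scaling group that terminates and replaces the instance whenever the load balancer's health check reports it unhealthy. The launch template name, subnet ID, and target group ARN are placeholders; it assumes you have already baked your WordPress setup into an AMI referenced by the launch template, and that a load balancer target group exists.

    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="wordpress-asg",
        LaunchTemplate={
            "LaunchTemplateName": "wordpress-template",  # placeholder; points at your WordPress AMI
            "Version": "$Latest",
        },
        # Min = Max = Desired = 1: the group never scales, it only replaces
        # the single instance when it becomes unhealthy.
        MinSize=1,
        MaxSize=1,
        DesiredCapacity=1,
        VPCZoneIdentifier="subnet-0123456789abcdef0",  # placeholder subnet ID
        TargetGroupARNs=[
            # placeholder target group ARN
            "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/wp/0123456789abcdef",
        ],
        HealthCheckType="ELB",       # trust the load balancer's HTTP health check,
        HealthCheckGracePeriod=300,  # not just the basic EC2 status check; give
    )                                # new instances 300 s to boot before checking

One design note: the replacement instance starts from the AMI, not from the dead instance's disk, so the WordPress database and uploads must live somewhere durable (e.g. RDS and S3/EFS) rather than only on the instance's root volume.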

Don't treat your EC2 instance like a package from a traditional hosting provider, i.e. one you buy, put your solution on, and expect to run as-is forever. If that's your plan, you are better off going with a regular hosting provider; they will take care of keeping your website running by managing the underlying hardware and software for you.

If you are going to be on AWS, take advantage of their platform.

As to your specific problem, I would tend to suspect a memory leak; the symptoms sound right: you start fresh, it runs for days or weeks at a time, and then crashes. Logging memory usage over time would confirm or rule that out; see the sketch below.
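If you want to check that before re-architecting anything, log free memory periodically and look for a steady downward trend between crashes. A minimal stdlib-only Python 3 sketch (the log path is a placeholder; run it from cron, e.g. every 5 minutes):

    #!/usr/bin/env python3
    import datetime

    def mem_available_kb():
        # /proc/meminfo is standard on Linux; the MemAvailable field exists
        # since kernel 3.14, which matches the kernel in your panic log.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
        return None

    if __name__ == "__main__":
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        with open("/var/log/mem-trend.log", "a") as out:  # placeholder path
            out.write(f"{stamp} MemAvailable_kB={mem_available_kb()}\n")

If the available memory ratchets down over days and the crash arrives when it nears zero, a leak (or an undersized instance with no swap) is the likely culprit.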