Why is my EC2 Linux instance unreachable and failing one or both of its status checks?

March 30, 2021

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance is unreachable and is failing one or both of its status checks. How do I troubleshoot the status check failure?

Short Description

Amazon EC2 monitors the health of each EC2 instance with two status checks:

System status check: The system status check detects issues with the underlying host that your instance runs on. If the underlying host is unresponsive or unreachable due to network, hardware, or software issues, then this status check fails.

Instance status check: An instance status check failure indicates a problem with the instance due to operating system-level errors such as the following:

Failure to boot the operating system
Failure to mount volumes correctly
File system issues
Incompatible drivers
Kernel panic

Instance status checks might also fail because of severe memory pressure caused by over-utilization of the instance's resources.

Resolution

View the status check metrics of your instance to determine if the instance failed the system status check or the instance status check.
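As a rough illustration of this first step, the following Python sketch reads a response shaped like the one that boto3's describe_instance_status call returns and reports which check is impaired for each instance. The function name failed_checks, the sample response, and the instance IDs are made up for this example; it is a sketch under those assumptions, not an official tool.

```python
def failed_checks(response):
    """Map each instance ID to the status checks reported as 'impaired'."""
    result = {}
    for entry in response.get("InstanceStatuses", []):
        failed = []
        if entry["SystemStatus"]["Status"] == "impaired":
            failed.append("system")
        if entry["InstanceStatus"]["Status"] == "impaired":
            failed.append("instance")
        result[entry["InstanceId"]] = failed
    return result


# Hypothetical sample data. With boto3 installed and credentials configured,
# a real response could come from:
#   boto3.client("ec2").describe_instance_status(
#       InstanceIds=["i-..."], IncludeAllInstances=True)
sample_response = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0123456789abcdef0",
            "SystemStatus": {"Status": "ok"},
            "InstanceStatus": {"Status": "impaired"},
        },
        {
            "InstanceId": "i-0fedcba9876543210",
            "SystemStatus": {"Status": "impaired"},
            "InstanceStatus": {"Status": "impaired"},
        },
    ]
}

print(failed_checks(sample_response))
```

An instance that lists only "instance" points to the operating system-level causes, while "system" points to the underlying host.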

If the system status check failed, see My instance failed the system status check. How do I troubleshoot this?

If the instance status check failed, it might be due to operating system-level issues causing boot errors or over-utilization of the instance's resources. Check the instance's system logs for errors. The following are common errors you might see in the system logs:

Boot errors

If the system logs contain boot errors, see My EC2 Linux instance failed the instance status check due to operating system issues. How do I troubleshoot this?

Out-of-memory or disk full errors

If the system logs contain out-of-memory or disk full errors, the instance might have run out of memory, or it might have entered emergency mode because the root device is full. For instructions on how to troubleshoot this, see My EC2 Linux instance failed the instance status check due to over-utilization of its resources. How do I troubleshoot this?
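The log checks described so far can be sketched as a simple pattern scan over the system log text. The patterns below are common Linux kernel and systemd messages chosen for illustration, not an exhaustive or official list, and the sample log lines are fabricated. The names LOG_PATTERNS and classify_log are hypothetical.

```python
import re

# Illustrative patterns only; real logs vary by distribution and kernel version.
LOG_PATTERNS = {
    "boot": [
        r"Kernel panic",
        r"Unable to mount root fs",
        r"Welcome to emergency mode",
    ],
    "memory": [
        r"Out of memory",
        r"invoked oom-killer",
    ],
    "disk full": [
        r"No space left on device",
    ],
}


def classify_log(text):
    """Return the categories whose patterns appear in the log text."""
    found = []
    for category, patterns in LOG_PATTERNS.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            found.append(category)
    return found


# Fabricated sample log for demonstration.
sample_log = """\
[    1.234567] EXT4-fs (xvda1): mounted filesystem with ordered data mode
[  812.004211] Out of memory: Kill process 2345 (java) score 901 or sacrifice child
cloud-init: OSError: [Errno 28] No space left on device
"""

print(classify_log(sample_log))
```

The log text itself can be retrieved from the Amazon EC2 console, or with the get_console_output API if you prefer to script the retrieval.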

Spike in CPU usage

If the system logs don't contain disk full errors, view the CPUUtilization metric for your instance. If the CPUUtilization metric is at or near 100%, the instance might not have enough compute capacity for the kernel to run.

For T2 or T3 instances, check the CPU credit metrics in the CloudWatch metrics table to determine if the CPU credits are at or near zero. If the CPU credits are at zero, the CPUUtilization metric shows a saturation plateau at the baseline performance for the instance. The baseline performance might be 20%, 40%, and so on, depending on the instance type.

CloudWatch metrics indicating CPU utilization at or near 100%, or at a saturation plateau for T2 or T3 instances, indicate that the status check failed due to over-utilization of the instance's resources. For instructions on how to troubleshoot this, see My EC2 Linux instance failed the instance status check due to over-utilization of its resources. How do I troubleshoot this?
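The two CPU conditions above can be expressed as small checks over metric values. This is a minimal sketch that assumes you have already collected the Average values of the CPUUtilization and CPUCreditBalance metrics into plain lists. The function names and the thresholds (95%, a tolerance of 2 percentage points, and a credit balance of 1) are assumptions for illustration, not AWS-defined limits.

```python
from statistics import mean


def cpu_saturated(cpu_averages, threshold=95.0):
    """True if average CPUUtilization over the window is at or near 100%."""
    return bool(cpu_averages) and mean(cpu_averages) >= threshold


def burstable_throttled(cpu_averages, credit_balances, baseline_percent,
                        tolerance=2.0, near_zero=1.0):
    """True if CPU credits stayed near zero and CPU plateaued at the baseline."""
    if not cpu_averages or not credit_balances:
        return False
    credits_exhausted = max(credit_balances) <= near_zero
    plateau = all(abs(v - baseline_percent) <= tolerance for v in cpu_averages)
    return credits_exhausted and plateau


# Made-up datapoints: a non-burstable instance pinned near 100% CPU ...
print(cpu_saturated([99.5, 100.0, 98.7]))
# ... and a burstable instance with a 20% baseline and no credits left.
print(burstable_throttled([20.1, 19.8, 20.0], [0.0, 0.0, 0.3], baseline_percent=20.0))
```

If either check is true, the over-utilization article referenced above is the relevant next step.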

Block device errors, software bugs, or other memory errors

Warning: Before stopping and starting your instance, be sure you understand the following:

Instance store data is lost when you stop and start an instance. If your instance is instance store-backed or has instance store volumes containing data, the data is lost when you stop the instance. For more information, see Determining the root device type of your instance.
If your instance is part of an Amazon EC2 Auto Scaling group, stopping the instance might terminate the instance. If you launched the instance with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk, your instance might be part of an Auto Scaling group. Instance termination in this scenario depends on the instance scale-in protection settings for your Auto Scaling group. If your instance is part of an Auto Scaling group, then temporarily remove the instance from the Auto Scaling group before starting the resolution steps.
Stopping and starting the instance changes the public IP address of your instance. It's a best practice to use an Elastic IP address instead of a public IP address when routing external traffic to your instance. If you're using Amazon Route 53, you might have to update the Route 53 DNS records when the public IP address changes.
If the shutdown behavior of the instance is set to Terminate, the instance is terminated when stopped. You can change the instance shutdown behavior to avoid this.

Block device errors, software bugs, or other unusual system issues might cause an unusual spike in CPU usage. If the CPUUtilization metric is at 100% and the system logs contain errors related to block devices, memory issues, or other unusual system errors, then reboot the instance, or stop and start it.
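Before stopping and starting the instance, the items in the warning above can be turned into a quick pre-flight check. The sketch below assumes an instance dictionary shaped like one entry from boto3's describe_instances response and a shutdown behavior string ("stop" or "terminate") such as the value returned by describe_instance_attribute for instanceInitiatedShutdownBehavior. The function name stop_start_risks and the uses_elastic_ip flag are hypothetical, and the sample values are made up.

```python
def stop_start_risks(instance, shutdown_behavior, uses_elastic_ip=False):
    """List the warnings that apply before a stop and start of the instance."""
    risks = []
    # Only the root device type is checked here; additional instance store
    # volumes on an EBS-backed instance must be verified separately.
    if instance.get("RootDeviceType") == "instance-store":
        risks.append("Instance store-backed: data is lost on stop.")
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    group = tags.get("aws:autoscaling:groupName")
    if group:
        risks.append(f"Part of Auto Scaling group '{group}': stop might terminate it.")
    if instance.get("PublicIpAddress") and not uses_elastic_ip:
        risks.append("Public IP address changes after stop and start.")
    if shutdown_behavior == "terminate":
        risks.append("Shutdown behavior is Terminate: stopping terminates the instance.")
    return risks


# Hypothetical instance description for demonstration.
sample_instance = {
    "InstanceId": "i-0123456789abcdef0",
    "RootDeviceType": "ebs",
    "PublicIpAddress": "203.0.113.10",
    "Tags": [{"Key": "aws:autoscaling:groupName", "Value": "web-asg"}],
}

for risk in stop_start_risks(sample_instance, "terminate"):
    print(risk)
```

An empty list does not guarantee that a stop and start is safe; it only means that none of the conditions checked here were detected.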