The goal of monitoring is to:

  • Be alerted when an issue arises so that it can be fixed quickly and without any interruption to services.
  • Prepare to scale nodes before they are oversaturated.
  • Tune clusters for best performance.
  • Keep an eye on traffic spikes.
  • Receive notifications about misconfigured log or temp file rotation.
  • Be aware of hardware degradation and failures.
  • Handle anything else that the "chaos monkey" may throw at a server.

Server-level monitoring

You need to ensure that your servers are well optimized and operating within pre-defined parameters. Useful metrics to monitor include CPU, RAM, disk, I/O, and network usage.
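As an illustration, the server-level readings above can be collected with nothing more than the Python standard library. This is a hedged sketch, not Appcelerator's actual tooling; the threshold values are examples only.

```python
import os
import shutil

def server_metrics(path="/"):
    """Collect a few basic server-level readings."""
    load1, load5, load15 = os.getloadavg()   # CPU load averages (Unix only)
    disk = shutil.disk_usage(path)           # total/used/free bytes
    return {
        "load_1m": load1,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }

def within_limits(metrics, max_load=4.0, max_disk_pct=85.0):
    """Return True if the readings are inside the pre-defined parameters.

    The limits here are illustrative defaults, not production settings.
    """
    return (metrics["load_1m"] <= max_load
            and metrics["disk_used_pct"] <= max_disk_pct)
```

A real deployment would feed readings like these into an alerting pipeline rather than checking them ad hoc.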

App-level monitoring

In addition to server monitoring, you need to keep an eye on your app services and ensure that they are operating within pre-defined parameters. Useful app metrics include requests per minute (RPM), response times, database transactions, errors, and memory leaks.
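To make the first two app metrics concrete, here is a minimal in-process tracker for RPM and response times over a rolling window. It is a toy sketch for illustration; in practice an APM agent such as New Relic collects these automatically.

```python
import time
from collections import deque

class AppMetrics:
    """Rolling-window tracker for requests per minute and response times."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.samples = deque()           # (timestamp, duration) pairs

    def _expire(self, now):
        # Drop samples that fell out of the rolling window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def record(self, duration, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, duration))
        self._expire(now)

    def rpm(self, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        return len(self.samples)

    def avg_response_time(self):
        if not self.samples:
            return 0.0
        return sum(d for _, d in self.samples) / len(self.samples)
```

The `now` parameter exists so the window logic can be exercised deterministically in tests.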

Monitoring tools

There are many tools to choose from for various monitoring duties. At Appcelerator, we use the following services and features to monitor both servers and apps:

  • Pingdom for endpoint monitoring. The best approach is to implement health checks within Pingdom. Each check should exercise top-down functionality, including the remote resources that the app uses for its most common queries.
  • For deeper app-level monitoring, we use New Relic and the application agents bundled with our services. New Relic is also Appcelerator's tool for watching server-level metrics.
  • AWS CloudWatch is used to monitor and trigger scaling processes whenever additional server-level capacity is needed.
  • PagerDuty is used for all critical metrics and endpoint checks. The on-call rotation schedule is one week of both day and night duty.
  • Custom scripts
    • In addition to these services, we have several scripts running at 5, 10, and 15-minute intervals that perform complex health checks to evaluate cluster service health. These scripts are a subset of our full QE testing packages and cover all basic functionality of a cluster. Reports from these scripts are displayed as graphical feedback (green, yellow, or red dots) on an office monitor next to the DigOps team.
    • To monitor more complex metrics (expressions such as true, false, greater, and lesser), we use custom scripts running on Jenkins nodes. For example, we periodically check the age of every database node volume snapshot. This script rotates snapshots according to a grandfather-father-son rotation policy. The exact rotation settings depend on the specific needs of the cluster; for example, enterprise snapshots have a longer shelf life and are usually more granular than non-enterprise snapshots.
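The grandfather-father-son rotation mentioned above can be sketched as a pure retention decision. The retention counts below are illustrative, not the actual cluster settings; as noted, enterprise clusters keep more and longer-lived snapshots.

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, daily=7, weekly=4, monthly=12):
    """Grandfather-father-son retention: keep the last `daily` days (sons),
    one snapshot per week for `weekly` weeks (fathers), and one per month
    for `monthly` months (grandfathers). Returns the set of dates to keep;
    everything else is eligible for deletion."""
    dates = sorted(set(snapshot_dates), reverse=True)   # newest first
    keep = set(dates[:daily])                           # sons: recent dailies
    weeks, months = set(), set()
    for d in dates:
        wk = (d.isocalendar()[0], d.isocalendar()[1])   # (ISO year, week)
        if wk not in weeks and len(weeks) < weekly:
            weeks.add(wk)
            keep.add(d)                                 # fathers: weeklies
        mo = (d.year, d.month)
        if mo not in months and len(months) < monthly:
            months.add(mo)
            keep.add(d)                                 # grandfathers: monthlies
    return keep
```

A real rotation job would map these dates back to snapshot IDs and delete the rest via the cloud provider's API.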

Hardened monitoring

To ensure our monitoring system works as designed, we need to monitor the monitoring system itself. For that, we use additional tools such as DRILL scripts. One such script imitates a failure, and we then check whether an alert was sent. All alerts go to a robot channel and/or an email inbox (such as a specific channel in Flowdock). When a health check fails, the alert should appear in the robot channel.
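The drill pattern above can be sketched as follows. This is a hypothetical skeleton, not the actual DRILL script: `notify` stands in for whatever posts to the robot channel (Flowdock, email, etc.).

```python
def run_drill(health_check, notify):
    """Run a (possibly simulated) health check; on failure, send an alert
    to the robot channel via `notify`. Returns True if an alert was sent,
    which is what the drill verifies about the alerting path."""
    alerts_sent = []
    if not health_check():
        message = "DRILL: simulated health-check failure - alerting path verified"
        notify(message)
        alerts_sent.append(message)
    return bool(alerts_sent)
```

During a drill, `health_check` is replaced with a function that always reports failure, so the only thing under test is whether the alert actually reaches the channel.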

A second safeguard runs the same set of monitoring scripts on two Jenkins nodes in different regions. This ensures high availability, and each Jenkins node is also periodically checked to verify that all monitoring jobs are running.

Scaling up

When a node (like PEM or ArrowCloud) starts trending above its defined metrics, the nodes are scaled up by AWS Auto Scaling groups. Scaling events (up or down) are triggered by CloudWatch traffic and CPU metrics. However, the triggering event depends on the nature of the service. For example, if a service is near its CPU limit, the whole fleet may become CPU saturated; in this case, we scale up on the CPU load threshold.
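The decision logic above, combining a traffic trigger with a CPU trigger, amounts to something like the following. The thresholds are placeholder examples; in production this lives in CloudWatch alarms attached to the Auto Scaling group, not in application code.

```python
def should_scale_up(avg_cpu_pct, requests_per_min,
                    cpu_threshold=70.0, traffic_threshold=5000):
    """Scale up when the fleet is CPU-bound or when traffic alone is high.

    For a CPU-sensitive service, the CPU threshold fires first even at
    moderate traffic; for an I/O-bound one, the traffic threshold dominates.
    """
    return (avg_cpu_pct >= cpu_threshold
            or requests_per_min >= traffic_threshold)
```

Scale-down uses the same kind of comparison with lower thresholds and a cooldown period to avoid flapping.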

In some cases, horizontal scaling will not help. One such case is a load balancer running close to its limits; another is when other metrics trigger an increase in node size. In these cases, we scale vertically instead.

For database nodes, we only scale vertically, as the data is very tricky to handle. Horizontal scaling would require implementing sharding services, but for now we keep performance at good levels via replica sets. A basic replica set consists of a primary node, a secondary node, and an arbiter node. Data is replicated asynchronously from the primary node to the secondary, while the arbiter node watches the other nodes. For database nodes, the first limit is disk I/O. There can never be enough RAM on a database node, but at some level it will hold most of the hot data for quick reads (depending on query traffic patterns).
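The replica-set roles described above can be modeled with a toy sketch: writes go only to the primary, the secondary catches up asynchronously, and the arbiter holds no data but contributes a vote so elections have an odd quorum. This is an illustration of the topology, not a database implementation.

```python
class ReplicaSet:
    """Toy model of a primary/secondary/arbiter replica set."""

    def __init__(self):
        self.primary = []        # authoritative data; the only write target
        self.secondary = []      # asynchronous copy of the primary

    def write(self, doc):
        self.primary.append(doc)             # writes never hit the secondary

    def replicate(self):
        """Asynchronous replication: copy entries the secondary is missing."""
        self.secondary.extend(self.primary[len(self.secondary):])

    def voting_members(self):
        # Primary + secondary + arbiter = 3 voters; the arbiter breaks ties
        # in elections even though it stores no data.
        return 3
```

The gap between `write` and `replicate` is exactly the replication lag that monitoring on database nodes needs to watch.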

Related Links