2016 Cartell IT QOS Report

2016
Uptime percentage (excluding scheduled maintenance) 100%
Uptime percentage (including scheduled maintenance) 99.99%
Scheduled downtime 0 hrs
Unscheduled downtime 21 min

System Outages

During November we experienced two incidents that severely degraded service for our customers. In each case our services first slowed down, then began to time out, and in the latter stages of the incident our servers began rejecting new connections.

Date          Incident Start-End   Duration     Reason
Tue Nov 1st   16:54-17:09          15 minutes   SI DOS (see below)
Wed Nov 2nd   13:53-13:59          6 minutes    SI DOS (see below)

2016-11-01 Incident Description

On Tuesday 1st November at 17:01 our customer service team began reporting issues with our online applications, quickly followed by calls from customers. SysAdmin investigation of the infrastructure showed no failures; however, a large number of DB connections were open and blocking new connections. Analysis of application logs showed no unusual activity: our servers were processing data requests, but the thread count was maxed out, and one instance had run out of heap memory. To clear the jam we rebooted the Tomcat processes at 17:09, which immediately restored service. No specific cause for the outage was identified at the time.

2016-11-02 Incident Description

On Wednesday 2nd November at 13:57 our customer service team reported similar issues. A quick analysis of the situation showed it was almost identical to the previous day. We immediately restarted the Tomcat processes, but kept one in its ‘locked’ state for analysis. Service was fully restored at 13:59.

Analysis of Outage

We checked our codebase for code changes in the previous release and found nothing that would have caused this; likewise there were no changes to the AWS environment and no evidence of suspicious activity. Analysis of the Java heap showed large numbers of vehicle objects but did not pinpoint the cause.

Differential analysis of the logs from both incidents showed a common pattern of lookups from a single source. Tracking this back, we found the same registration being looked up repeatedly at a rate of five per second. This rate of lookup is normal; however, investigation of the service showed it to be massively resource-hungry in terms of DB search time and memory, but only for one particular vehicle. Normally the lookup takes about 250ms, but for this vehicle it took up to 1.5 seconds, as it loaded 15,000 distinct data sets into memory on every invocation (normally about 500).

Many concurrent requests for the same vehicle over a prolonged period tied up the HTTP listeners, database pools and, most critically, heap memory, slowing access for other customers until the pools became exhausted and the system could no longer accept new connections. Cartell systems worked as expected, and although we began rejecting new requests to allow recovery, it was too late: the load already taken on was abnormal. The massive memory footprint could not be cleared by GC, and this prevented the system from recovering.
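The numbers above imply a simple queueing estimate (Little's Law, L = λ × W): at five lookups per second, a 250ms lookup keeps roughly 1.25 requests in flight, while a 1.5 second lookup keeps roughly 7.5, a sixfold increase in held threads, connections and heap before any retries are counted. A minimal sketch of this back-of-envelope calculation (class and method names are illustrative, not part of our codebase):

```java
// Little's Law: average concurrency = arrival rate * time each request holds resources.
public class OutageMath {
    public static double steadyStateConcurrency(double arrivalsPerSecond,
                                                double serviceTimeSeconds) {
        return arrivalsPerSecond * serviceTimeSeconds;
    }

    public static void main(String[] args) {
        double normal = steadyStateConcurrency(5.0, 0.25); // ~1.25 requests in flight
        double outage = steadyStateConcurrency(5.0, 1.5);  // ~7.5 requests in flight
        System.out.printf("normal=%.2f outage=%.2f%n", normal, outage);
    }
}
```

With each of those in-flight requests also holding roughly 30 times the normal number of data sets in heap, pool and memory exhaustion follows quickly.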

Self Infilicted Denial of Service (SI DOS)

Tracing the source of the requests led to one of our own workstations. One of our developers was stress-testing a new product in a test environment to see how it would handle timeouts and retries. To generate a sufficient volume of timeouts he picked the slowest service, used a vehicle with a slow response time, set a very short timeout, and hit the service at a rate of 300 requests per minute. As it turned out, the test system was connected to a live data feed on the back end, and the developer was running this test on both occasions. A perfect case of Self Inflicted Denial of Service.
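One simple defence against this class of mistake is to make load-test tooling refuse to run against anything but an explicitly approved test host. A minimal sketch, assuming an allow-list guard (host names and class names here are illustrative only):

```java
import java.util.Set;

// Hypothetical safeguard for load-test tooling: fail fast unless the
// target host is on an explicit allow-list of test endpoints.
public class LoadTestGuard {
    private static final Set<String> ALLOWED_TEST_HOSTS =
            Set.of("test.internal", "staging.internal");

    public static void requireTestTarget(String host) {
        if (!ALLOWED_TEST_HOSTS.contains(host)) {
            throw new IllegalStateException(
                "Refusing to load-test non-test host: " + host);
        }
    }
}
```

A guard like this would not have stopped the test environment itself from calling a live back-end feed, but it makes accidental targeting of production endpoints an immediate, visible failure.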

Preventative Action

A series of unusual events led to this situation; nevertheless, we will learn from it and prevent it recurring. Our recommendations are:

  • Educating developers to check the effect their performance tests may have on third-party systems;
  • Applying rate throttling per IP address at the network level as a short-term fix;
  • Rate limiting services on a per-customer basis, likely limiting each customer to no more than five concurrent connections;
  • Investigating the abnormal memory footprint for this vehicle and updating code to limit the resources consumed by certain processes, facilitating natural rate limiting of services based on available capacity.
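The per-customer concurrency limit proposed above can be sketched with a semaphore per customer. This is an illustrative design sketch, not Cartell's actual implementation; the limit of five matches the recommendation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch: cap each customer at a fixed number of concurrent requests.
// Requests beyond the cap are rejected immediately instead of queueing,
// so one customer's slow lookups cannot exhaust shared pools.
public class CustomerRateLimiter {
    private static final int MAX_CONCURRENT = 5;
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    /** Returns true if the customer may proceed; caller must call release() when done. */
    public boolean tryAcquire(String customerId) {
        return permits
            .computeIfAbsent(customerId, id -> new Semaphore(MAX_CONCURRENT))
            .tryAcquire();
    }

    public void release(String customerId) {
        Semaphore s = permits.get(customerId);
        if (s != null) s.release();
    }
}
```

Rejecting the sixth concurrent request up front keeps HTTP listeners, database pools and heap available for other customers even when one customer's lookups become pathologically slow.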

We apologise for any inconvenience that this outage may have caused to our customers.