3 Replies - 1332 Views - Last Post: 23 July 2013 - 09:23 AM

#1 JackOfAllTrades

My Non-Code Nightmare

Posted 29 June 2013 - 06:13 AM

Not a code issue per se, but this nightmare is currently occurring in my place of employ.

We have a third-party product that pretty much runs the guts of our system. This product is pretty shitty overall, so one of my tasks is to create our own, in-house version of that system (totally re-architected, of course). This is an ongoing process that will likely get lots of pressure to complete post-haste after yesterday.

In this system, there is a master server that marshals all of the other sub-servers, each of which in turn runs other servers. We have been having some issues with a particular part of the third-party product -- not directly involved with running processes -- so we engaged their support to look into that issue.

Yesterday afternoon, Operations started noticing the sub-servers taking themselves offline and rebooting. We immediately contacted the third party's upper-level support, who logged into the system, noticed the lack of network connectivity (DUH! Because shit was rebooting!!!) and said it was a networking issue. Knowing that wasn't the case, Operations was pissed, but too busy firefighting to address it at that time. We started fixing the symptoms and end result with no one having a CLUE what was causing it.

As things started to settle down a bit, the boss noticed that the master server had run out of memory, apparently triggering this chain of events. In the end it took hours of downtime for my boss and operations just to get things back to a mainly-functional state. Customers were pissed off that their systems all went offline for extended periods. It was a clusterfuck of epic proportions.

During this time I started looking for the cause.

I went searching through the master server logs. What process could possibly have spiked? At first glance, it appeared that the process eating all the memory was...vi! vi eating all the memory on a server? WTF was this about?

I went through the command line history trying to figure out what the hell was going on. It was then that I noticed someone using vi to open an unusual file. I used stat on the file and determined that the last access time was about 45 minutes prior to the out-of-memory condition. The size of this file? About 5GB. I went to the boss about 10 minutes before closing up shop yesterday and told him this, to which he said -- "Oh, that's the file that I've been having issues with and was working with third-party's support on. Are you telling me THEY caused this by opening the file?"
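For anyone who wants to avoid the same trap, here is a minimal sketch of the check I did by hand with stat: look at a file's size and last access time before pointing an editor at it, and read only the tail if it is huge. The function names and the 100 MB limit are my own assumptions for illustration, not anything from the third-party product.

```python
import os
from datetime import datetime, timezone

# Hypothetical threshold: anything over 100 MB is not editor material.
MAX_EDITOR_BYTES = 100 * 1024 * 1024


def file_summary(path):
    """Return (size_in_bytes, last_access_time_utc) without reading the file."""
    st = os.stat(path)
    accessed = datetime.fromtimestamp(st.st_atime, tz=timezone.utc)
    return st.st_size, accessed


def safe_to_open(path, limit=MAX_EDITOR_BYTES):
    """True if the file is small enough to load into an editor."""
    size, _ = file_summary(path)
    return size <= limit


def tail_bytes(path, count=4096):
    """Read only the last `count` bytes instead of loading the whole file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - count))
        return f.read()
```

Note that the access time is only as reliable as the filesystem's mount options allow; with noatime or relatime it may lag behind the real last read.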

The head of Operations sent a strongly worded email to the support person who had suggested a network connectivity issue. This was before my discovery.

The shit is about to hit the fan.

Replies To: My Non-Code Nightmare

#2 ogadit

Re: My Non-Code Nightmare

Posted 03 July 2013 - 03:04 PM

reading this post is a nightmare!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

#3 DblAAssassin

Re: My Non-Code Nightmare

Posted 17 July 2013 - 12:05 PM

I feel bad for that support person.

#4 deery5000

Re: My Non-Code Nightmare

Posted 23 July 2013 - 09:23 AM

In fairness, the third-line support agent should have been able to track this. I mean, come on, it was in the logs (like most issues).

An out-of-memory issue? This guy should have got the boot, haha.