An equipment failure in the central network location on the TJC campus led to a near-complete shutdown of access to TJC servers, websites and phone systems.
Chief Information Officer Dr. Larry Mendez said a core hardware failure led to the massive shutdown.
“That core networking piece of equipment basically had a hardware failure. That hardware failure is something we’ve never experienced. I’ve been with the college since 1996 and we haven’t experienced that,” said Mendez. “When we think about it, in 1996 to current that’s basically when networks started to bloom and blossom. We’ve just never had catastrophic failure to where everything is down.”
The piece of equipment that failed has an expected lifespan of eight to 15 years. The one that failed was about four years old. The shutdown started around 4:15 a.m., and Mendez was informed of the situation at 7:30 a.m.
Mendez said the only way to architect a solution for this specific problem comes down to money: the college would need two of the core devices. At half a million dollars, a second unit is outside the department's budget.
“If I had a half of a million dollar piece of equipment that I add as extra, not really doing a workload but there just in case, I kind of consider that to be an expense that we don’t need to absorb. We don’t need to pass that cost onto you guys [students] from a tuition perspective,” said Mendez.
Mendez said there is much more to IT than hardware, but he described the hardware portion of the IT department as the highway. For minor failures, such as when a classroom network goes down, staff can put in a new switch and be back up in 20 minutes.
Most outages are one-offs limited to a certain number of users or to a single software package, such as Canvas or Banner (Apache Access).
“The redundancy built into that core has something called supervisor modules. Those supervisor modules really provide the connectivity to the rest of the equipment in the other closets,” said Mendez. “We have two of those and that is our redundancy built into that. Usually if you have a supervisor module die you don’t have a second one die. We had both of them die.”
The support for that piece of equipment is not just a standby unit on a shelf. It is an enhanced support contract. The company that TJC works with for that equipment guarantees full-hour, on-site "hero kits," meaning it brings equipment and people to take care of the problem.
Another partner of the college was already on campus when more help arrived.
“Once the first module got here, it took about three hours. So around 11:30 a.m. the first one showed up. We plugged it in and it started showing us a few errors. The second one arrived about an hour after that. So around 1:30 p.m. we plugged that one in and it was dead on arrival. Which basically means we had to reset the counter,” said Mendez. “So another four hours later, around 8 o’clock that evening we got the second module. We got it in place, brought it up and tested it. Things seemed okay, logs were kind of funny though.”
The rebooting process began early Thursday morning. At that time, Mendez and the rest of the IT department began replacing the whole module case. Inside that case stood the problem that started it all.
“If you have ever opened up a computer, the back of that case, these cards slide into and it provides that connectivity. So that goes the full length of the chassis itself. That’s actually what was causing trouble,” said Mendez. “They sent two more supervisor modules and a new chassis. That’s what we worked on Thursday night.”
The team was ready to take care of that problem around 2 p.m., but Mendez objected to starting the work then.
“My problem was from my perspective, I can’t have students continuing not to be able to get to class or do things so about every four hours it would re-boot on its own but it would come back up in about five minutes. So I told everyone as long as it comes back up, we’re just holding tight until classes are done,” said Mendez.
Classes that night ended a little before 10 p.m. That is when the crew got to work on the equipment and began replacing parts. The physical replacement took about five hours. The testing took a few more hours to complete.
TJC is in year three of a five-year plan that will move all of the hosting equipment off-campus.
“You noticed the website stayed up and running; that was phase one [of the plan]. Year one was to get our website disconnected from here, not only hosted somewhere else but backed up and generally managed. We deal with the content, not necessarily the IT component of that,” said Mendez.
Canvas, the learning management system, is also hosted off-site. The problem began after Canvas was pulled up: Canvas requires authentication with a username and password before a user can log in. That authentication is hosted on-site, so it could not be performed during the network outage. Year three of the plan is meant to address that by moving the authentication pieces off campus.
“We just recently updated the portal, part of that upgrade of the portal was to provide true, single sign on. So if you clicked on Canvas, you are not prompted with another username and password,” said Mendez. “As we continue to help things so that you would have one username, one password, it creates a lot of complexity on how that works.”