Saturday, 24 June 2023

User story 1: The client has to pay because their IT guy refuses to replace two patch cables.


Introduction

Actors:
$dude - The DevOps guy hired by the client company.
$colleague - My colleague, stuck in the same quagmire as I am.

In short, we are an internet provider. We are not IT support. Nothing mentioned in this post is in our job description, nor is it part of the client's contract or any SLA. As an ISP, our responsibility ends at the end of the fiber-optic strand, unless the customer's router is owned by us or a contractual obligation or an SLA says otherwise.

However, in the past, when we were a lot smaller and thus had a lot more free time, we used to give our customers a helping hand pro bono, even for things that were not our direct responsibility. Now we try to fend that off when possible. It is OK if it is something like five mails a year (single mails, not mail chains) and maybe a single visit to the customer that turns out not to be our fault or responsibility. We tolerate that much, but not more. And of course that is only if it was not our mistake; when only a single customer has an issue, it tends not to be one of ours. When an ISP makes a mistake, it is usually at least the whole block that goes down :). Besides, we have other, more important work to do, like making sure our WAN network works as expected and has enough capacity to carry ever-increasing amounts of traffic.

https://i.imgflip.com/2wok2z.jpg 

The plot thickens

Now, we have an old client from many years ago, back from the time when we did those 'pro bono' favours. We did them more to court clients than out of any real obligation, and that was OK. It was not too much: a couple of mails here and there, a couple of minutes of work.

All until they hired their new DevOps, $dude. He came from another company that was also our client, except with that client we had mistakenly signed an SLA that was really unfavourable for us. We still haven't gotten over what a mistake that SLA was. But that is a story for another day.

However, spoiler: this story does have a good ending. Even before all of this, we had already learned the best way to 'educate' such clients. But more about that later.

The thing is, $dude was used to us having to be his obedient soldiers from his last job. My $colleague even explicitly told him that he is now at a new company with a very different business relationship with us. He said 'OK', but in reality he did not care at all. It went in one ear and out the other. His stance, in both his old and new job, was 'not my job'. The only thing I have ever seen the guy do was crack the whip over external contractors (I am kidding a bit, but his former company had that disgusting outsource-everything mentality).

$dude is a little obsessed with replacing every router he sees with IDS/IPS firewalls. Which is not bad, but... a little-known thing about ISPs is that they don't really do much with firewalls, if anything at all, mostly because the amount of traffic going through an ISP makes hardware firewalls prohibitively expensive and impractical. Hardware firewalls capable of doing IPS on traffic flowing over multiple 100G interfaces cost about as much as real estate where I live. Therefore ISPs work with routers, not firewalls, and the only firewall-like feature on those routers is usually just access-lists. In my six years working at an ISP, there has been barely a handful of times when I have configured a firewall with IPS/IDS capabilities.
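To make the 'just access-lists' part concrete, here is a minimal sketch of the kind of filtering an ISP edge router actually does. The post does not name a router vendor or a policy, so the IOS-style syntax, the interface name and the RFC1918 prefixes below are purely illustrative (Python just renders the stanza):

# Render a minimal "ACL on a router instead of a firewall" stanza.
# IOS-style syntax, the interface name and the RFC1918 prefixes are
# illustrative only; the post does not name a vendor or a policy.

BOGONS = [                      # (network, wildcard mask)
    ("10.0.0.0", "0.255.255.255"),
    ("172.16.0.0", "0.15.255.255"),
    ("192.168.0.0", "0.0.255.255"),
]

def render_acl(name: str, interface: str) -> str:
    lines = [f"ip access-list extended {name}"]
    lines += [f" deny   ip {net} {wild} any" for net, wild in BOGONS]
    lines += [" permit ip any any",
              "!",
              f"interface {interface}",
              f" ip access-group {name} in"]
    return "\n".join(lines)

print(render_acl("EDGE-IN", "TenGigabitEthernet0/0/0"))

Running it just prints the access-list and applies it inbound on one interface. That is the whole 'firewall' on a typical ISP router: no IPS signatures, no deep packet inspection.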

And this time we were smart. We predicted that tickets were going to start flowing hard and fast, a lot of tickets. So we did not lease the firewall as part of our service; we sold it to them. Because we sold that firewall, it is their device, and we have no responsibility to support it. They accepted the offer and bought the device. We still have a user account on it, but so do they, and we are willing to delete our account whenever they ask.

Months passed, tickets came, and there was a lot of passing the ball back and forth over email, plus two or three visits to the client.

Then one day $dude calls my $colleague. For problematic cases like these, $colleague prefers to put the call on speakerphone so that everyone can chime in and understand the situation.

$dude: Hello, this is $dude from company X. There is no wireless in part of our office; we can barely reach speeds of 20-30 Mbps. Can you please come over to us?
$colleague: Can you please tell me what colour the lights on the affected AP are?
$dude: It is white. (Which means the AP cannot reach the firewall, which also acts as the AP controller.)
$dude: I see that the patch cable is damaged.
$colleague: Have you tried physically rebooting the AP and changing the patch cable?
$dude (starts shouting): I DON'T GIVE A DAMN. I AM THE CLIENT. THAT IS YOUR JOB. THE PATCH CABLE IS RUN OVER, DAMAGED...
$colleague (visibly irritated, turns off his microphone): Yes, you are a client, but a client of what? Which of our services are you a client of? IT support is not in your package, because we don't even offer it in the first place.
$colleague (turns on his microphone): OK, someone will come to take a look. Goodbye.

As soon as the call ended, $colleague went to the secretary and asked her nicely if she could tell him what services the client had contracted with us. The secretary told him: "Only an internet service, nothing more. I don't remember the exact speed, some 200M or 300M symmetric, nothing much." And my colleague had a huge grin on his face.

At the same time, I logged into the firewall/AP controller and confirmed that, out of 10 APs, one was inaccessible and another was meshed (it was not supposed to be meshed). Everything else was normal. A problem existed, but alas, it was not under our 'jurisdiction'.
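For context, 'confirmed' here just means checking which APs still answer from the wired side. The controller brand is not named in the post, so as a rough stand-in, a reachability sweep like this is all it takes (the AP names and the 10.0.50.0/24 management addresses are made up):

import subprocess

# Hypothetical management addresses for the ten APs (made up)
APS = {f"ap-{n:02d}": f"10.0.50.{n}" for n in range(1, 11)}

def is_reachable(ip: str) -> bool:
    """True if the AP answers ICMP echo (Linux 'ping' flags)."""
    return subprocess.run(
        ["ping", "-c", "2", "-W", "1", ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

for name, ip in APS.items():
    print(f"{name:8s} {ip:12s} {'OK' if is_reachable(ip) else 'UNREACHABLE'}")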

We sent two of our new interns/technicians out to the client and explicitly told them to fill out the work order in two copies and round their time up to the full hour. None of this was our equipment, our job or our responsibility; we are an ISP NOC.

And we charged them for that :)

The good ending

2 technicians x 1 hour. Around here the market rate for a technician-hour is around $100, so that ended up costing them $200 just to have two patch cables replaced. And they had their own IT guy, $dude, who had even identified the damaged cables but was unwilling to do anything about them himself.

That is the good ending. We had learned long ago that the best way to wean clients off 'little favours', once it becomes too much... is to simply start charging for the favours.
 
Even better: later, $colleague proactively noticed that the firewall needed a firmware upgrade and contacted the client about it. The upgrade was genuinely needed, and we did not plan to charge them for it.

However, the first thing $dude asked was: "Will you also charge us for that?"

We did charge them.

A little unrelated, but why do so many DevOps people, not all of them, but a big part of them... why do they hate the 'ops' side of their work? I don't get it. On one side you have boring programming, clunky automation and flaky cloud providers, while on the other side you have interesting networks, sexy servers and innovative on-prem infrastructure...

 

Saturday, 17 June 2023

Friends don't let friends use Dell switches

    I have had the displeasure of using Dell network equipment, more specifically the Dell N4032F and Dell N4064F. These switches were descendants of the older Dell PowerConnect 8000 series. All of them were bought straight from the factory, not refurbished. The N4032F is a 24-port 10G switch and the N4064F is a 48-port 10G switch, and both are stackable. And yes, these switches could push the data. But that was about all that was any good about them. Their reliability was TERRIBLE. They were not PoE switches, but it seems Dell had invented the PoS (piece of shit) switch.

    We had (past tense) about 20 of those switches in our core network. We used them as plain layer 2 switches, as layer 3 was done on routers. At its height, they were our biggest source of pain, as some of those switches were able to take down most of our network via hardware/software bugs in Dell's spanning tree implementation, among other things. Nowadays there is only one Dell switch left in our network, outside the core layer, and even that one is planned to be decommissioned by the end of summer. We tried upgrading the firmware, but no firmware update ever really helped. Let me show you some examples of those switches' 'behaviour'...


Case 1 - complete stop of traffic

    A single Dell N4032F switch. Every couple of months it would completely stop forwarding traffic. Suddenly and completely, any and all traffic going through the switch would just STOP. All port lights would stay bright orange. When you connected directly to the console, you could not do anything. All you could do was RUN to the data center and manually reboot the switch by unplugging the power and plugging it back in. We tried firmware updates, but they did not help.

MTBF (mean time between failures) was 2.5 months. Yes, months, that is how bad it was.
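For anyone wondering where a figure like '2.5 months' comes from: it is just the average gap between incidents in the reboot log. A quick sketch with invented dates (the post only gives the resulting average):

from datetime import date

# Hypothetical hard-freeze incidents for one N4032F (dates invented)
failures = [
    date(2019, 1, 10),
    date(2019, 3, 28),
    date(2019, 6, 12),
    date(2019, 8, 30),
]

# Days between consecutive failures
gaps_days = [(b - a).days for a, b in zip(failures, failures[1:])]

# Average gap, expressed in average-length months
mtbf_months = sum(gaps_days) / len(gaps_days) / 30.44

print(f"MTBF ~= {mtbf_months:.1f} months")  # -> MTBF ~= 2.5 months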

[Meme: Skynet has been activated. Thank you for choosing Dell switches.]

Case 2 - complete loss of management

    A single Dell N4064 switch. Every couple of months this switch would lose its management plane, and after about another 12-24 hours it would eventually stop forwarding traffic. The switch remained pingable, although with increasing packet loss. You could not log in via SSH. SNMP stopped working. And if you connected directly via the console, you could not do anything. All you could do was plan an after-hours manual reboot. Again, we tried applying firmware updates, but they were again useless. You can't polish a turd, after all.

MTBF - 1-3 months.
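Since the box stayed pingable while SSH and SNMP were already dead, plain ping monitoring never caught this state early. A crude probe that separates 'answers ICMP' from 'actually manageable' would have flagged it; the hostnames below are placeholders and the ping flags are the Linux ones:

import socket
import subprocess

# Placeholder hostnames, not the real ones
SWITCHES = ["dell-n4064-01.example.net", "dell-n4032-02.example.net"]

def answers_ping(host: str) -> bool:
    """True if the host answers ICMP echo (Linux 'ping' flags)."""
    return subprocess.run(
        ["ping", "-c", "2", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def ssh_port_open(host: str, timeout_s: float = 3.0) -> bool:
    """True if TCP/22 accepts a connection."""
    try:
        with socket.create_connection((host, 22), timeout=timeout_s):
            return True
    except OSError:
        return False

for sw in SWITCHES:
    if answers_ping(sw) and not ssh_port_open(sw):
        print(f"{sw}: pingable but SSH is dead -- plan the after-hours reboot now")
    elif not answers_ping(sw):
        print(f"{sw}: completely unreachable")
    else:
        print(f"{sw}: looks fine (for now)")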

 

Case 3 - stacking misadventures

    Yes, those switches are stackable. But if you stack them, instead of getting higher uptime you will get even less. Dell's stacking implementation is incredibly buggy. When stacked, those switches often hit a stack timer misalignment and the stack falls apart. So what happens after the stack falls apart? Every switch in the stack (falsely) presumes itself to be the stack master, and all of them continue forwarding traffic. But you probably have port-channels spread across multiple switches in the stack, as is sensible when your switches are not the spawn of Satan. So when a Dell switch stack falls apart, two now-independent switches start sending data, and more importantly spanning tree updates, over a single port-channel.

    What happens next is a massive spanning tree wave that affects all interconnected switches, even though there technically is no loop, just two formerly stacked switches that have broken up their stack. The network will keep dancing on and off every 2-3 minutes as spanning tree tries in vain to reconverge. If you are lucky, you catch those 2-3 minutes and are fast enough to manually shut down the ports. If you can't, you RUN to the data center to manually reboot BOTH switches. In fact, sometimes it was better not to bother and just go straight to the data center.

    We had about 4 stacks of 2 switches each. That was (not) fun.

MTBF - 1 year. But once it happens you are in a world of pain.

Case 4 - round robin of random reboots

    A further 6 or so of those switches would randomly reboot. At least those rebooted; the switch in case 1 just stopped working. Rebooting was an improvement, as it at least meant they would come back up in a couple of minutes. How do we know it was not an electrical power issue? Well, those pieces of crap at least have redundant power supplies, and we also saw that other equipment connected to the same power source had longer uptimes. So yeah, it was definitely a Dell issue. Sometimes they rebooted multiple times a week, sometimes once a year.

MTBF - completely damn random.

Case 5 - oil leak

    Yeah, when removing one switch, we noticed it was leaking oil from its fan bearings. Can you say... build quality?

MTBF - WTF?

So that is what, out of 20 switches, about 15 with regular catastrophic issues? And firmware updates were futile? Yeah, it is crap.

But there is a probable cause of the issues

    Of course, only Dell knows the real root cause. However, word on the street is... the issue with those switches is the Broadcom switch chip they use. Not all Broadcom switch chips are bad, but this specific one was based on an earlier Broadcom design from the late '90s/early 2000s that was barely meant to push 1G of traffic, let alone 10G. So imagine an early-2000s switch chip put on steroids, frankensteined and pushed way beyond its limits. It was like putting a 500 HP engine in a Mk1 VW Golf: sure, it was fast and could hit its top speed, but stability really suffered. That is, however, just rumour.

The conclusion

    The worst network equipment I have had the displeasure of working with so far. I would rather work with TP-Link, D-Link, Linksys, etc. than with Dell network equipment. Maybe their new gear is better now, but after the Hell I went through with Dell (it rhymes), I just can't ever trust them again. Their servers are OK, their laptops are fine too. But their network equipment? Not even once. Just no.

Overall rating 1/10.

 
