Saturday 17 June 2023

Friends don't let friends use Dell switches

    I have had the displeasure of using Dell network equipment, specifically the Dell N4032F and Dell N4064F. These switches were descendants of the older Dell PowerConnect 8000 series. All of ours were bought straight from the factory, not refurbished. The N4032F was a 24 port 10G switch and the N4064F was a 48 port 10G switch, and both were stackable. And yes, these switches could push the data. But that was about the only good thing about them. Their reliability was TERRIBLE. They were not PoE switches, but it seems Dell had invented PoS (piece of shit) switches.

    We had (past tense) about 20 of those switches in our core network. We used them purely as layer 2 switches, since layer 3 was done on routers. At their height they were our biggest source of pain, as some of those switches were able to take down most of our network via hardware/software bugs in Dell's spanning tree implementation, among other things. Nowadays there is only one Dell switch left in our network, outside the core layer, and even that one is planned to be decommissioned by the end of summer. We tried upgrading the firmware, but no firmware update ever really helped. Let me show you some cases of those switches' 'behaviour'...

Case 1 - complete stop of traffic

    A single Dell N4032F switch. Every couple of months it would completely stop forwarding traffic. Suddenly, any and all traffic going through the switch would just STOP. All port lights would stay bright orange. When you connected directly to the console, you could not do anything. All you could do was RUN to the data center and manually reboot the switch by unplugging the power and plugging it back in. We tried firmware updates, but they did not help.

MTBF (mean time between failures) was 2.5 months. Yes, months, that is how bad it was.
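To put that number in perspective, here is a quick back-of-the-envelope sketch that turns an MTBF in months into expected outages per year for a single switch. The function name is made up for illustration, and it assumes failures are spread evenly over time, which is a simplification.

```python
def failures_per_year(mtbf_months: float) -> float:
    """Expected failures per year for one device, given its MTBF in months.

    Assumes failures are spread evenly over time (a simplification).
    """
    if mtbf_months <= 0:
        raise ValueError("MTBF must be positive")
    return 12.0 / mtbf_months


if __name__ == "__main__":
    # Case 1: one switch with an MTBF of 2.5 months.
    print(f"{failures_per_year(2.5):.1f} outages per year")  # prints "4.8 outages per year"
```

So a single switch in this state means roughly five sprints to the data center every year.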

[Meme: "Skynet has been activated. Thank you for choosing Dell switches."]

Case 2 - complete loss of management

    A single Dell N4064F switch. Every couple of months this switch would lose its management plane, and about 12-24 hours later it would stop forwarding traffic as well. The switch remained pingable, although with increasing packet loss. You could not log in via SSH. SNMP stopped working. And if you connected directly via the console, you could not do anything. All you could do was plan an after-hours manual reboot. Again, we tried applying firmware updates, and again they were useless. You can't polish a turd after all.

MTBF - 1-3 months.
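Since the switch kept answering pings while management was already dead, a ping-only check would not have caught this. Below is a minimal sketch of how the symptom pattern could be told apart in a monitoring script. It is not what we actually ran: the function names and state labels are hypothetical, and a successful TCP connect to port 22 only shows that the port accepts connections, not that an SSH login would work.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


def classify(ping_ok: bool, ssh_ok: bool) -> str:
    """Map two probe results onto a coarse switch state (labels are illustrative)."""
    if ping_ok and ssh_ok:
        return "healthy"
    if ping_ok:
        # Pingable but no management access: the Case 2 pattern.
        return "management-lost"
    return "unreachable"
```

For example, `classify(ping_ok=True, ssh_ok=tcp_port_open("switch.example.net"))` would flag the Case 2 state while there are still some hours left before forwarding stops; the hostname here is a placeholder.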

 

Case 3 - stacking misadventures

    Yes, those switches are stackable. But if you stack them, instead of getting higher uptime you get even less. Dell's stacking implementation is incredibly buggy. When stacked, those switches often get a stack timer misalignment, and the stack then falls apart. So what happens after the stack falls apart? Every switch in the stack (falsely) presumes itself to be the stack master, and all of them continue forwarding traffic. But you probably have port-channels spread across multiple switches in the stack, as is sensible when you don't have spawn-of-Satan switches. So when a Dell switch stack falls apart, two now independent switches start sending data and, more importantly, spanning tree updates over a single port-channel.

    What happens next is a massive wave of spanning tree recalculations that affects all interconnected switches, even though there technically is no loop, just two stacked switches that have broken up their stack. The network will keep dancing on and off every 2-3 minutes as spanning tree tries in vain to converge. If you are lucky, you will catch one of those 2-3 minute windows and be fast enough to manually shut down the ports. If you can't, you RUN to the data center to manually reboot BOTH switches. In fact, sometimes it was better not to bother and just go straight to the data center.

    We have had about 4 stacks of 2 switches each. That was (not) fun.

MTBF - 1 year. But once it happens you are in a world of pain.

Case 4 - round robin of random reboots

    About 6 more of those switches would randomly reboot. At least those rebooted; the one in case 1 just stopped working. Rebooting was an improvement, as it at least meant they would be back up in a couple of minutes. How do we know it was not an electrical power issue? Well, those pieces of crap at least have redundant power. We also saw that other equipment connected to the same power source had longer uptime. So yeah, it was definitely a Dell issue. Sometimes they rebooted multiple times a week, sometimes once a year.

MTBF - completely damn random.

Case 5 - oil leak

    Yeah, when removing one switch, we noticed it was leaking oil from the bearings of its fans. Can you say... build quality?

MTBF - WTF?

So that is what? Out of 20 switches about 15 had regular catastrophic issues? Firmware updates are futile? Yeah, it is crap.

But there is a probable cause of the issues

    Of course, only Dell knows the real root cause. However, word on the street is that the issue with those switches is the Broadcom switch chip they use. Not all Broadcom switch chips are bad. But this specific chip was supposedly based on an earlier Broadcom design from the late 90s/early 2000s that was barely designed to push 1G of traffic, let alone 10G. So imagine an early 2000s switch chip put on steroids, frankensteined and pushed way beyond its limits. It was like putting a 500HP engine in a VW Golf 1. Sure, it was fast and could reach its speed, but the stability really suffered. Those are, however, just rumors.

The conclusion

    This is the worst network equipment I have had the displeasure to work with so far. I would rather work with TP-Link, D-Link, Linksys, etc. than with Dell network equipment. Maybe their new equipment is better now. But after the Hell I went through with Dell (it rhymes), I just can't ever trust them again. Their servers are OK, their laptops are fine too. But network equipment, not even once. Just no.

Overall rating 1/10.

 
