Recently, one of our branch sites completed a renovation that added a new wing. To feed the area, we had to build a new telecom closet. We added another stack of Cisco 3750s, connected back to the pre-existing closet via a 1 Gig fiber link (that closet, in turn, connects back to our core via Optiman). We'd basically added another stack to the same subnet for that site. Nothing out of the ordinary for us networkers.
Then, a few weeks after the wing opened, they started having issues with certain PCs in that new area. Approximately 15 PCs would get a daily Duplicate IP Address Conflict error message. So, Help Desk sends a ticket to our group. It was a Saturday afternoon when I called the lady back, and she confirmed she'd had these issues since they opened. A reboot fixed the problem, but only temporarily, for that shift. I promised this user I would follow up until we found the cause.
Meanwhile, internal to my department, everyone is pointing fingers at the network – which is MY team, a very proud people, right?! I spoke to the DHCP server guys, the desktop engineering guys, the desktop hardware guys... I'm like, something is wrong with those PCs. Nonetheless, no one ever proved it was the network. We always have to prove it isn't the network first. And in proving that, we resolved the issue. It was a virus on a vendor machine – Conficker, to be exact.
How did we figure this out? I'd love to say that someone else did their job and thoroughly examined the desktop machines, but unfortunately, I didn't see a lot of motivation. Something that should have been resolved within days took many, many weeks. For several weeks we were sniffing packets all day long – UDP, ICMP, port 67 (DHCP) and port 137 (NetBIOS name service) – on 5 of the machines, deciphering the data. We could see that some machines wouldn't check in for days, but we couldn't tell why a machine wasn't sending its DHCP renew. We are network gurus, not DESKTOP gurus. And the only way to get at the IPCONFIG /all output was to break through the desktop security lockdown and do a reboot. The engineers said they could write a script to pull the logs, but again, not much initiative for that.
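For what it's worth, that kind of targeted capture is easy to reproduce. Here's a minimal sketch in Python with scapy – not the tool we actually used, and the host addresses are hypothetical placeholders – that watches exactly the traffic classes above for a few suspect machines:

```python
# Minimal sketch (hypothetical, not our actual tool): watch ICMP,
# DHCP (UDP 67) and NetBIOS name service (UDP 137) traffic for a
# few suspect hosts. Requires scapy and capture (root) privileges.
from scapy.all import sniff

SUSPECT_HOSTS = ["10.1.50.21", "10.1.50.22"]  # hypothetical PC addresses

# BPF capture filter: only the protocols we care about, and only
# to/from the machines under suspicion.
host_clause = " or ".join(f"host {ip}" for ip in SUSPECT_HOSTS)
bpf = f"(icmp or udp port 67 or udp port 137) and ({host_clause})"

def log_packet(pkt):
    # One line per packet is enough to see which hosts are
    # (or are not) sending their DHCP renews.
    print(pkt.time, pkt.summary())

sniff(filter=bpf, prn=log_packet, store=0)
```

Even a crude log like this shows you which machines went quiet, though, as we learned, it can't tell you why.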
Then one day, my fellow network guru decided to momentarily capture all broadcasts on the uplink. WHOA! We started seeing some weird stuff. It looked like a virus. I googled "DHCP virus" and there it was. We then went to our other network tools and found the infected machine. Next, we found a tool and were able to grab the IPCONFIG /all info daily, on a whim. So now the issue was resolved – by the network team – 6+ weeks later.
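If you want to run the same kind of quick-and-dirty broadcast census, here's a rough sketch of the idea, again in Python with scapy and again a hypothetical reconstruction rather than the capture we ran. It just tallies broadcast frames per source MAC for a minute; an infected machine flooding the segment tends to stand out immediately:

```python
# Rough sketch of a broadcast census on an uplink: count broadcast
# frames per source MAC over a short window and print the noisiest
# senders. A machine spewing virus traffic tends to top the list.
from collections import Counter
from scapy.all import sniff

counts = Counter()

def tally(pkt):
    counts[pkt.src] += 1  # source MAC of each broadcast frame

# Capture only Ethernet broadcasts for 60 seconds, then report.
sniff(filter="ether dst ff:ff:ff:ff:ff:ff", timeout=60, prn=tally, store=0)

for mac, n in counts.most_common(10):
    print(f"{mac}: {n} broadcasts in 60s")
```

From there, looking up the top source MAC in the switch CAM tables points you at the exact port, and the exact machine.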
In this case, from the beginning, people kept asking me what was different about that new 'site'. How many times can I tell you it is not a site-wide issue? 12 out of 50 machines is not site-wide, and no one ever proved otherwise. How can you say that to someone? As with any argument, you really need to have the facts. This site had been up and working for 3 years prior to the expansion; all I could point out was that everything was fine until those PCs went in. Recreating the DHCP scope is not the test OR the answer. Neither is making a completely new scope.
It is always annoying to have to prove it isn't the network before the analysts or vendors have to prove it is. I've come to accept that. What I haven't come to accept is being constantly pulled off my important projects to do the proving. In this particular case, plenty of money was wasted on the time it took to resolve. I'm glad we fixed it, but that comes with a catch: as always, we do too good a job, and I'm beginning to understand that some people have come to expect that we will ultimately fix everything.
I've learned some lessons on this one, for sure. I've had something like this happen before, when someone had a DHCP server set up on their laptop, and that was the path I was initially looking down for this issue. I guess I'm still learning to expect anything.