Recently, we had a branch site complete the renovation of a new wing. In doing so, we had to build a new telecom closet to feed the area. We added another stack of 3750s that connected back to the pre-existing closet via a 1 Gig fiber link (that closet, in turn, connects back to our core via Optiman). We basically added another stack to the same subnet for that site. Nothing out of the ordinary for us networkers.
Then, a few weeks after they opened, they started having issues with certain PCs in that new wing. There were approximately 15 PCs that would throw a daily Duplicate IP Address Conflict error message. So, Help Desk sent a ticket to our group. It was a Saturday afternoon when I called the lady back, and she confirmed she had been having these issues since they opened. A reboot fixed the issue, but only temporarily, for that shift. I promised this user that I would follow up until we found the issue.
Meanwhile, internal to my department, everyone was pointing fingers at the network – which is MY team, a very proud people, right?! I spoke to the DHCP server guys, the desktop engineering guys, the desktop hardware guys… I kept saying something was wrong with those PCs. Nonetheless, no one ever proved it was the network. We always have to prove it isn't the network first. And in proving this, we resolved the issue. It was a virus on a vendor machine – a Conficker virus, to be exact.
How did we figure this out? I'd love to say that someone else did their job and thoroughly looked at the desktop machines, but unfortunately, I didn't see a lot of motivation. Something that should have been resolved within days took many, many weeks. We were sniffing packets all day long for UDP, ICMP, port 67 (DHCP) and port 137 (NetBIOS), on 5 of the machines, for several weeks – deciphering the data… We could see where some wouldn't check in for days, but we couldn't tell why a machine was not sending the renew. We are network gurus, not DESKTOP gurus. And the only way to get at an IPCONFIG /all was to get around the desktop security lockdown and do a reboot. The Engineers said they could write a script to get the logs, but again, not much initiative for that.
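For what it's worth, the kind of collection script the Engineers talked about wouldn't have had to be fancy. A minimal sketch of one piece of it, assuming English-locale `ipconfig /all` output (the sample text, hostnames, and field labels below are illustrative, not from our actual environment):

```python
import re

# Illustrative sample of the lease-related lines in `ipconfig /all` output.
# Field labels and date formats vary by Windows version and locale.
SAMPLE = """\
   IPv4 Address. . . . . . . . . . . : 10.1.2.15
   Lease Obtained. . . . . . . . . . : Monday, March 1, 2010 8:00:12 AM
   Lease Expires . . . . . . . . . . : Monday, March 1, 2010 4:00:12 PM
"""

def parse_lease_fields(text):
    """Pull the DHCP lease fields we care about out of ipconfig text."""
    fields = {}
    for label in ("IPv4 Address", "Lease Obtained", "Lease Expires"):
        # Match the label, the dot-leader padding, then the value.
        m = re.search(rf"^\s*{label}[ .]*: (.+)$", text, re.MULTILINE)
        if m:
            fields[label] = m.group(1).strip()
    return fields

print(parse_lease_fields(SAMPLE))
```

Run on a schedule against each machine, that would have given us the lease-renewal history we spent weeks trying to infer from packet captures.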
Then one day, my fellow network guru decided to momentarily capture all broadcasts on the uplink. WHOA! We started seeing some weird stuff. It looked like a virus. I googled “DHCP virus” and there it was. We then went to our other network tools and found the infected machine. Next, we found a tool and were able to grab the IPCONFIG /all info daily, on a whim. So, now we had resolved the issue – the network team – 6+ weeks later.
In this case, from the beginning, people were asking me what was different about that new ‘site’. How many times can I tell you it is not a site-wide issue? 12 out of 50 machines is not site-wide. No one proved it to be. How can you say that to someone? As with any argument, you really need to have the facts. This site had been up and working for 3 years prior to the expansion. I can only point out that everything was fine until you put your PCs in. Recreating the scope is not the test OR the answer. Making a completely new scope is not the test OR the answer either.
It is always annoying to have to prove it isn't the network before the analysts or vendors have to prove it is. I've come to accept that. However, I haven't come to accept the fact that I'm constantly pulled off my important projects to prove it isn't. In this particular case, there was plenty of money wasted on the time it took to resolve. I'm glad we fixed it, but that comes with a catch. As always, we do too good a job, and I'm beginning to understand that some people have come to expect that we will ultimately fix everything.
I’ve learned some lessons on this one, for sure. I’ve had this happen before, when someone had a DHCP server set up on their laptop, and that was the path I was looking down for this issue, initially. I guess I’m still learning to expect anything.
Hi there Josh,
I could be missing something, but wouldn’t DHCP snooping & dynamic ARP inspection mitigate this entirely?
Unfortunately, in this circumstance, the features mentioned above would not have worked. We did have DHCP Snooping turned on, but the problem was that the Conficker virus on the one client was affecting the DHCP service on the other clients without infecting them. Therefore, everything we saw communicating with the server seemed fine. There were no signs of ARP poisoning or anything like that.
The problem was that the clients weren’t checking in, and the server was giving out the IP address to someone else. This looked like normal behavior on the network. It wasn’t until we installed the McAfee Conficker Detection Tool that we were able to gather the data on a scheduled basis and see which machines were not checking in during their lease time. All in all, not something a network engineer should be spinning her cycles on, but it needed to be done.
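The check that scheduled collection effectively gave us is simple once you have the data. A hypothetical sketch (the hostnames and times are made up for illustration): a renewal pushes a machine's lease expiry forward, so any machine whose recorded expiry is already in the past never renewed.

```python
from datetime import datetime

def stale_leases(snapshot, now):
    """snapshot: {hostname: lease_expiry_datetime} from the latest
    collection pass. Returns hosts whose lease has already expired at
    time `now` -- i.e. machines that never sent a renewal."""
    return sorted(h for h, expires in snapshot.items() if expires <= now)

# Made-up example data: NURSE-01 sailed past its expiry without renewing;
# NURSE-02 renewed, so its expiry is still in the future.
snapshot = {
    "NURSE-01": datetime(2010, 3, 1, 16, 0),
    "NURSE-02": datetime(2010, 3, 2, 9, 30),
}
now = datetime(2010, 3, 1, 18, 0)
print(stale_leases(snapshot, now))  # → ['NURSE-01']
```

Those stale hosts are exactly the machines whose addresses the DHCP server will eventually hand to someone else, producing the duplicate-IP conflicts.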
You are right, and given the weirdness of the whole thing, we did try those options, but they didn’t help. From where we sat, machines were getting their renewals just fine – but we didn’t have tangible evidence of which devices were misbehaving.
I should add, nothing else has been done with this. These are all assumptions on my part as to how the virus actually worked. Once we removed the infected device, things went back to working as expected – and I never heard anything else related to this subject.
Yeah, I understand how it goes. Especially with it affecting this subset of users for so long, it would have been a relief for you to find the solution and move on to your core tasks. :)
I haven’t confirmed this, but I suspect Snooping & ARP inspection might have helped – e.g. a client gets an 8-hour lease and continues to use it after the lease has expired. The DHCP server gives the address to another PC. At this point, the switch’s snooping database will have the IP-to-MAC mapping installed for the second PC – from that point on, if the first PC replies to ARP queries, the switch will drop those frames (from the switch’s perspective, it looks like a poisoning attack by the infected PC).
And if this was detected by Snooping & DAI, IP Source Guard would have further protected you by blocking IP traffic from the infected PC.
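For reference, a sketch of what enabling that combination looks like on a Catalyst switch (IOS syntax; the VLAN number and interface names below are placeholders, not from any real config):

```
! Global: enable DHCP snooping on the access VLAN
ip dhcp snooping
ip dhcp snooping vlan 10
! Dynamic ARP Inspection validates ARP against the snooping bindings
ip arp inspection vlan 10
!
! Uplink toward the DHCP server is trusted
interface GigabitEthernet1/0/1
 ip dhcp snooping trust
 ip arp inspection trust
!
! Access ports stay untrusted; IP Source Guard filters on the binding
interface GigabitEthernet1/0/2
 switchport mode access
 ip verify source
```

Snooping builds the IP-to-MAC binding table from observed DHCP exchanges, DAI drops ARP replies that contradict it, and `ip verify source` extends the same check to all IP traffic on the port.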
Plus, I suspect that if you had something like arpwatch installed and listening on the relevant VLAN, you would have detected the MAC flapping between the authoritative PC (the second DHCP lease holder) and the un-authoritative PC (the first, which never renewed).
These technologies are great – I haven’t actually deployed them too often, due to customers wanting to take this on in-house, but if I were a resident network admin, I’d be taking advantage of every one of them.
Thanks Chris, most helpful. I expect I will be turning to these tools and your advice next time… because we know there is always a next time.
The only other thing I should add is that I work in the healthcare industry, and the users at the site were constantly rebooting to “fix” their immediate issues – because they always had patients in these rooms. It was kind of hard to get tangible data.
I’m still not convinced these tools could’ve led me to the source virus faster, but maybe. I just wanted others to maybe think of me next time they’ve “got a weird one”.