We’re now rolling through week four of the Windows 10 migration at DWR with only a few disruptions. One issue, though, popped up rather suddenly during wave two and showed up in force in wave three: a growing number of computers had networking problems. Specifically, they were all failing to obtain DHCP leases, both on startup and through manual renewal.
After realizing the issue went away once the firewall service was stopped, I loaded up Wireshark on one of the affected laptops. From the packet capture I could clearly see the DHCPDISCOVER packet leave the client and a DHCPOFFER packet return. But the communication stopped there: the DHCPREQUEST step never appeared to leave the client, and after several seconds the protocol reset and a new discover packet was sent.
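To make the symptom concrete, here is a minimal Python sketch of how the stalled handshake could be spotted from raw DHCP payloads. It is not what I ran at the time (I just read the Wireshark trace), and the helper names `dhcp_message_type`, `build_payload`, and `handshake_stalled` are made up for illustration. It assumes the standard BOOTP layout: a 236-byte fixed header, the magic cookie, then options, with option 53 carrying the message type.

```python
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"
BOOTP_FIXED_LEN = 236  # fixed BOOTP header size that precedes the magic cookie
MESSAGE_TYPES = {
    1: "DHCPDISCOVER", 2: "DHCPOFFER", 3: "DHCPREQUEST", 4: "DHCPDECLINE",
    5: "DHCPACK", 6: "DHCPNAK", 7: "DHCPRELEASE", 8: "DHCPINFORM",
}


def dhcp_message_type(payload):
    """Return the DHCP message type name from a UDP payload, or None."""
    start = BOOTP_FIXED_LEN + len(DHCP_MAGIC_COOKIE)
    if len(payload) < start or payload[BOOTP_FIXED_LEN:start] != DHCP_MAGIC_COOKIE:
        return None
    i = start
    while i < len(payload):
        code = payload[i]
        if code == 0:        # pad option has no length byte
            i += 1
            continue
        if code == 255 or i + 1 >= len(payload):  # end option or truncated data
            break
        length = payload[i + 1]
        data = payload[i + 2:i + 2 + length]
        if code == 53 and len(data) == 1:         # option 53: message type
            return MESSAGE_TYPES.get(data[0])
        i += 2 + length
    return None


def build_payload(op, xid, msg_type):
    """Build a bare-bones DHCP payload for testing (hypothetical helper)."""
    header = struct.pack("!BBBBIHH", op, 1, 6, 0, xid, 0, 0)
    header += b"\x00" * (BOOTP_FIXED_LEN - len(header))
    return header + DHCP_MAGIC_COOKIE + bytes([53, 1, msg_type, 255])


def handshake_stalled(payloads):
    """True when an offer was seen but no request followed it."""
    types = [dhcp_message_type(p) for p in payloads]
    if "DHCPOFFER" not in types:
        return False
    return "DHCPREQUEST" not in types[types.index("DHCPOFFER"):]
```

Fed the two packets from the affected laptop (a discover and an offer), `handshake_stalled` returns `True`. Once a request follows the offer, it returns `False`.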
As it turns out, this is actually a pretty old Windows 7 bug that somehow managed to get carried forward through the migration. Microsoft KB2344959 documents a bug in the Windows 7 firewall that causes the DHCP handshake to be disrupted when the following two policies are set:
- Do not allow local exceptions
- Prohibit unicast response to multicast or broadcast requests
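During the DHCP handshake, the client broadcasts its DHCPDISCOVER, and unless the client sets the broadcast flag the server normally sends the DHCPOFFER back as a unicast packet. With both policies set, the Windows firewall treats that offer as a unicast response to a broadcast request and drops it before the DHCP client ever sees it. That matches the capture: Wireshark sits below the firewall and still records the offer arriving, but the client never gets far enough to send its DHCPREQUEST.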
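As a rough illustration of how the two policies combine, here is a toy Python model. It is a simplification of the behavior described above, not the actual Windows filtering logic, and the names `FirewallPolicy` and `unicast_response_blocked` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallPolicy:
    """The two group policy settings named above (simplified to booleans)."""
    no_local_exceptions: bool
    prohibit_unicast_response: bool


def unicast_response_blocked(policy, request_cast, response_cast):
    """Return True when a unicast message answering a broadcast or multicast
    message would be dropped under this toy model.

    Assumption: the drop only happens when both policies are set, which is
    how the bug is described above.
    """
    answers_broadcast = (
        request_cast in ("broadcast", "multicast") and response_cast == "unicast"
    )
    return (
        answers_broadcast
        and policy.no_local_exceptions
        and policy.prohibit_unicast_response
    )


baseline = FirewallPolicy(no_local_exceptions=True, prohibit_unicast_response=True)
relaxed = FirewallPolicy(no_local_exceptions=True, prohibit_unicast_response=False)
```

Under `baseline`, a unicast DHCP message that answers a broadcast one is blocked. Under `relaxed`, with only the unicast response setting changed, it passes. That is the lever we ended up pulling.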
Microsoft identified this as a bug at least as far back as October 2010 but never released more than a hotfix to address the issue. Worse, we found the hotfix was not applicable to Win7 SP1 x64, so we couldn’t work it into our migration task sequence.
Before the upgrade, our Windows 7 policies disabled the firewall entirely, but with Windows 10 we’re applying Microsoft’s security baseline as part of the upgrade, and the firewall is now activated for all profiles. We also determined the issue was more prevalent on newer hardware due to a race condition between the network stack and the firewall service. On older machines the firewall service started slowly enough that DHCP consistently completed first. Newer hardware with SSDs, however, nearly always lost the race: the firewall came up before networking had obtained a lease.
The fix ended up being simple: we adjusted the unicast response rule on the affected clients by hand and then through domain group policy. Unfortunately, it took more than a week for the appropriate channels to decide upon and approve the change.
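For anyone who wants to script the by-hand step, here is a hedged Python sketch that wraps the `netsh advfirewall` setting `unicastresponsetomulticast`. This is an assumption about one way to make the change, not a record of exactly what we ran, and the function names are hypothetical. Keep in mind that a setting enforced through group policy takes precedence over a local change, so on domain machines the GPO edit is the durable fix.

```python
import subprocess

VALID_PROFILES = ("allprofiles", "domainprofile", "privateprofile", "publicprofile")


def unicast_response_command(profile="allprofiles", allow=True):
    """Build the netsh command that allows or prohibits unicast responses
    to multicast or broadcast requests for the given firewall profile."""
    if profile not in VALID_PROFILES:
        raise ValueError(f"unknown firewall profile: {profile}")
    state = "enable" if allow else "disable"
    return ["netsh", "advfirewall", "set", profile,
            "settings", "unicastresponsetomulticast", state]


def apply_fix(profile="allprofiles", allow=True, dry_run=True):
    """Return the command line; only execute it when dry_run is False.

    Running it for real requires an elevated prompt on a Windows client.
    """
    cmd = unicast_response_command(profile, allow)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)
```

Calling `apply_fix()` with the default dry run just returns the command line, which makes it easy to review before running it elevated with `dry_run=False`.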