Hi all,
I've been worrying at this one all weekend, and woke up this morning at 3:30 AM with it all whirling round my head. I think I have an outline of how we're going to manage this, and I would like to get your comments/additions/brickbats before I add it all to the BP:
1) Move over to MAC-address-based DHCP serving for all infrastructure. Currently the IP address of every device in the infrastructure (cloud nodes, gateway, control, etc.) is statically assigned. This is bad for a number of reasons. My only concern here (from lack of knowledge rather than well-founded fear) is control itself. Can the DHCP server serve itself an IP address, or is this going to be the one exception?
2) Reserve 192.168.0.x for the public IP addresses for cloud instances and 192.168.1.x for infrastructure. I'm pretty sure I can do this in dhcp.conf. Essentially this is a block list: DHCP only serves addresses in those ranges if they are explicitly assigned by MAC address. The reason for this is to preserve the small existing cloud IP pool, which can't be assigned by DHCP MAC address because it's managed by the cloud. (See my argument with myself in point 4.)
3) Go round every infrastructure node and move it to a DHCP-served address.
4) Drop the cloud pool of IP addresses, create the new range, and restart the cloud instances. Alternatively, I could just extend the pool to add the 192.168.0.x range. This is less disruptive because the existing instances won't have to be restarted/re-created. I think I +1 this myself. :D
5) Reconfigure dhcp.conf to 192.168.0.0/16
6) Restart DHCP
7) Restart networking on every node
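To make steps 1, 2 and 5 a bit more concrete, here is a rough Python sketch of how the reservations could be checked and generated before anything touches the live config. Everything in it is an assumption on my part: the node names, MAC addresses and IPs in INVENTORY are made up, the helper names check_inventory and render_host are hypothetical, and the output assumes ISC dhcpd style "host" declarations, which may not match the DHCP server we actually run. It deliberately leaves control out, since whether control can serve itself is still the open question in point 1.

```python
import ipaddress

# Ranges taken from the plan above.
CLOUD_NET = ipaddress.ip_network("192.168.0.0/24")   # 192.168.0.x, cloud instances
INFRA_NET = ipaddress.ip_network("192.168.1.0/24")   # 192.168.1.x, infrastructure
SERVED_NET = ipaddress.ip_network("192.168.0.0/16")  # the /16 from step 5

# Hypothetical inventory: names, MACs and addresses are invented for illustration.
INVENTORY = {
    "gateway": ("02:00:00:00:00:02", "192.168.1.2"),
    "node01": ("02:00:00:00:00:11", "192.168.1.11"),
    "node02": ("02:00:00:00:00:12", "192.168.1.12"),
}


def check_inventory(inventory):
    """Return a list of problems: IPs outside INFRA_NET, duplicate MACs, duplicate IPs."""
    problems = []
    seen_macs = {}
    seen_ips = {}
    for name, (mac, ip) in inventory.items():
        addr = ipaddress.ip_address(ip)
        if addr not in INFRA_NET:
            problems.append(f"{name}: {ip} is outside {INFRA_NET}")
        mac_key = mac.lower()
        if mac_key in seen_macs:
            problems.append(f"{name}: MAC {mac} already used by {seen_macs[mac_key]}")
        else:
            seen_macs[mac_key] = name
        if addr in seen_ips:
            problems.append(f"{name}: IP {ip} already used by {seen_ips[addr]}")
        else:
            seen_ips[addr] = name
    return problems


def render_host(name, mac, ip):
    """Render one MAC-based reservation, assuming ISC dhcpd 'host' syntax."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Sanity-check the addressing plan itself before looking at any nodes.
    assert CLOUD_NET.subnet_of(SERVED_NET) and INFRA_NET.subnet_of(SERVED_NET)
    assert not CLOUD_NET.overlaps(INFRA_NET)

    issues = check_inventory(INVENTORY)
    if issues:
        for issue in issues:
            print("PROBLEM:", issue)
    else:
        for name, (mac, ip) in sorted(INVENTORY.items()):
            print(render_host(name, mac, ip))
```

The idea would be to fill INVENTORY from the real node list during step 3, fix anything the check flags, and only then paste the generated stanzas into the config.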
My concern here is that this will mean some disruption, so I recommend that we wait until I have the new scheduler backup server in place so that we don't lose any jobs. Once control is back up, we should run some test jobs through it just to be on the safe side.
All thoughts very welcome!
Thanks
Dave