Hi all,
I've been worrying this one all weekend, and woke up this morning at 3:30AM with it all whirling round my head. I think I have an outline of how we're going to manage this, and I would like to get your comments/additions/brickbats before I add it all to the BP:
1) Move over to MAC address based DHCP serving for all infrastructure
Currently the IP address of every device in the infrastructure (cloud nodes, gateway, control etc etc etc) is all done by static assignment. This is bad for a number of reasons. My only concern here (lack of knowledge, rather than well-founded fear) is on control itself. Can the DHCP server serve itself an IP address, or is this going to be the one exception?
2) Reserve 192.168.0.x for the public IP addresses for Cloud instances and 192.168.1.x for infrastructure. I'm pretty sure I can do this in dhcp.conf. Essentially these become reserved blocks: DHCP only serves addresses in those ranges if they are explicitly assigned by MAC address. The reason for this is to preserve the small existing cloud IP pool, which can't be assigned by DHCP MAC address because it's managed by the cloud. (See my argument with myself in point 4.) There's a rough dhcp.conf sketch of what I mean just after the list.
3) Go round every infrastructure node, and move it to a DHCP-served address
4) Drop the cloud pool of IP addresses and create the new range and restart the instances of the cloud. Alternatively, I could just extend the pool to add the 192.168.0.x range. This is less disruptive because it means the existing instances won't have to be re-started/re-created. I think I +1 this myself. :D
5) Reconfigure dhcp.conf to 192.168.x.x/16
6) Restart DHCP
7) Restart networking on every node
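To make points 1 and 2 a bit more concrete, the sort of entry I have in mind in dhcp.conf (or whatever the file actually ends up being called) is roughly the following; the MAC, host name and address below are made-up placeholders, and the exact syntax would need checking against whichever DHCP server we end up on:

# sketch only -- MAC, host name and address are placeholders
# an infrastructure node pinned to a fixed 192.168.1.x address by its MAC
host control {
    hardware ethernet 00:11:22:33:44:55;
    fixed-address 192.168.1.10;
}
# 192.168.0.x gets no host entries and sits outside any dynamic range,
# because the cloud hands those addresses out itself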
My concern here would be that this will mean some disruption, so I would recommend that we wait until I have the new scheduler backup server in place so that we don't lose any jobs. Once control is back up we should run some test jobs through it just to be on the safe side.
All thoughts very welcome!
Thanks
Dave
On 12/03/2012 02:37 AM, Dave Pigott wrote:
Hi all,
I've been worrying this one all weekend, and woke up this morning at 3:30AM with it all whirling round my head. I think I have an outline of how we're going to manage this, and I would like to get your comments/additions/brickbats before I add it all to the BP:
- Move over to MAC address based DHCP serving for all
infrastructure
Currently the IP address of every device in the infrastructure (cloud nodes, gateway, control etc etc etc) is all done by static assignment. This is bad for a number of reasons. My only concern here (lack of knowledge, rather than well-founded fear) is on control itself. Can the DHCP server serve itself an IP address, or is this going to be the one exception?
The DHCP server should have its IP configured statically. Also - I think we should use dnsmasq's DHCP server. This would allow us to manage DHCP from our ssh-gateway (192.168.1.32). However, I'm not sure if that would interfere with our main lab router setup. dnsmasq is cool because it would allow us to define things in one line, like:
# somewhere in /etc/dnsmasq.conf:
dhcp-host=11:22:33:44:55:66,foo,192.168.0.10
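To give a slightly fuller picture, a minimal sketch might look like this (the pool bounds, MACs and host names are all placeholders, and as far as I remember reservations don't have to sit inside the dynamic range):

# /etc/dnsmasq.conf sketch -- ranges, MACs and names are placeholders
# dynamic pool for anything we haven't pinned
dhcp-range=192.168.2.50,192.168.2.150,12h
# MAC-pinned reservations, one line per host
dhcp-host=11:22:33:44:55:66,foo,192.168.0.10
dhcp-host=aa:bb:cc:dd:ee:ff,bar,192.168.1.50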
- Reserve 192.168.0.x for the public IP addresses for Cloud
instances and 192.168.1.x for infrastructure. I'm pretty sure I can do this in dhcp.conf. Essentially these become reserved blocks: DHCP only serves addresses in those ranges if they are explicitly assigned by MAC address. The reason for this is to preserve the small existing cloud IP pool, which can't be assigned by DHCP MAC address because it's managed by the cloud. (See my argument with myself in point 4.)
I'm not sure how much we have to worry about MACs and whether an address goes in 192.168.0.x or 192.168.1.x. I *think* dnsmasq will register the IP address with its DNS server when the DHCP server gives out an address. dhcpd has something similar as well (a rough link, not verified):
http://www.held.org.il/blog/2011/01/make-dhcp-auto-update-dynamic-dns/
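With dnsmasq I believe the relevant knobs are roughly these (the domain below is made up, not what we'd actually use):

# /etc/dnsmasq.conf sketch -- the domain name is a placeholder
# host names sent by DHCP clients get added to dnsmasq's DNS automatically;
# these two lines make them resolvable with the domain appended as well
domain=lab.example
expand-hosts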
- Go round every infrastructure node, and move it to dhcp served
address
+1
- Drop the cloud pool of IP addresses and create the new range and
restart the instances of the cloud. Alternatively, I could just extend the pool to add the 192.168.0.x range. This is less disruptive because it means the existing instances won't have to be re-started/re-created. I think I +1 this myself. :D
I don't know enough about OpenStack, but it sounds sensible.
- Reconfigure dhcp.conf to 192.168.x.x/16
if you do 4, you might want to make sure you exclude 192.168.0.x from the range it assigns. Also - maybe not dhcp.conf :)
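If it stays on ISC dhcpd, I think that just means declaring the /16 but keeping the dynamic pool clear of the reserved blocks, roughly like this (the pool bounds are only an example):

# dhcpd.conf sketch -- pool bounds are just an example
subnet 192.168.0.0 netmask 255.255.0.0 {
    # dynamic leases start above the reserved blocks, so 192.168.0.x (cloud)
    # and 192.168.1.x (infrastructure reservations) never get handed out dynamically
    range 192.168.2.1 192.168.3.254;
}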
- Restart DHCP
- Restart networking on every node
My concern here would be that this will mean some disruption, so I would recommend that we wait until I have the new scheduler backup server in place so that we don't lose any jobs. Once control is back up we should run some test jobs through it just to be on the safe side.
I think we'll have to assume some downtime for this. I'm not sure how having a backup scheduler helps much. These changes are probably going to knock everything offline for a period of time, no?
On 3 Dec 2012, at 17:25, Andy Doan <andy.doan@linaro.org> wrote:
On 12/03/2012 02:37 AM, Dave Pigott wrote:
Hi all,
I've been worrying this one all weekend, and woke up this morning at 3:30AM with it all whirling round my head. I think I have an outline of how we're going to manage this, and I would like to get your comments/additions/brickbats before I add it all to the BP:
- Move over to MAC address based DHCP serving for all
infrastructure
Currently the IP address of every device in the infrastructure (cloud nodes, gateway, control etc etc etc) is all done by static assignment. This is bad for a number of reasons. My only concern here (lack of knowledge, rather than well-founded fear) is on control itself. Can the DHCP server serve itself an IP address, or is this going to be the one exception?
The DHCP server should have its IP configured statically. Also - I think we should use dnsmasq's DHCP server. This would allow us to manage DHCP from our ssh-gateway (192.168.1.32). However, I'm not sure if that would interfere with our main lab router setup. dnsmasq is cool because it would allow us to define things in one line, like:
# somewhere in /etc/dnsmasq.conf:
dhcp-host=11:22:33:44:55:66,foo,192.168.0.10
OK - So you're suggesting that gateway becomes our dhcp server, right? I was thinking about doing this as well, but I worried that we might be changing too many things at once. However, I'll take a look at dnsmasq as a way forward.
- Reserve 192.168.0.x for the public IP addresses for Cloud
instances and 192.168.1.x for infrastructure. I'm pretty sure I can do this in dhcp.conf. Essentially these become reserved blocks: DHCP only serves addresses in those ranges if they are explicitly assigned by MAC address. The reason for this is to preserve the small existing cloud IP pool, which can't be assigned by DHCP MAC address because it's managed by the cloud. (See my argument with myself in point 4.)
I'm not sure how much we have to worry about MACs and whether an address goes in 192.168.0.x or 192.168.1.x. I *think* dnsmasq will register the IP address with its DNS server when the DHCP server gives out an address. dhcpd has something similar as well (a rough link, not verified):
http://www.held.org.il/blog/2011/01/make-dhcp-auto-update-dynamic-dns/
- Go round every infrastructure node, and move it to dhcp served
address
+1
- Drop the cloud pool of IP addresses and create the new range and
restart the instances of the cloud. Alternatively, I could just extend the pool to add the 192.168.0.x range. This is less disruptive because it means the existing instances won't have to be re-started/re-created. I think I +1 this myself. :D
I don't know enough about OpenStack, but it sounds sensible.
- Reconfigure dhcp.conf to 192.168.x.x/16
if you do 4, you might want to make sure you exclude 192.168.0.x from the range it assigns. Also - maybe not dhcp.conf :)
Yeah, that was sort of what I was trying to say in point 2.
- Restart DHCP
- Restart networking on every node
My concern here would be that this will mean some disruption, so I would recommend that we wait until I have the new scheduler backup server in place so that we don't lose any jobs. Once control is back up we should run some test jobs through it just to be on the safe side.
I think we'll have to assume some downtime for this. I'm not sure how having a backup scheduler helps much. These changes are probably going to knock everything offline for a period of time, no?
Well, yes, but not much, at least not the way I was thinking of doing it. Essentially, I would make the changes to dhcp, restart dhcp, and then go round each node reconfiguring and restarting networking. That *should* mean very little disruption. If we're going to move to dnsmasq, a similar approach could be taken, in that we simultaneously take down dhcp on control and start dnsmasq on gateway, and then go round restarting everything.
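For the per-node step it's really just flipping each box's interfaces file over to DHCP and bouncing networking, something like this (the interface name is only an example and will vary per node):

# /etc/network/interfaces on each node (sketch -- interface name will vary)
auto eth0
iface eth0 inet dhcp

and then an ifdown/ifup (or a networking restart) on that interface, followed by a quick check that it picked up the address we expect.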
Dave