SOLVED
(mostly)
… to handle failures at scale, we need to automatically restart VMs somehow.
… we have to resurrect very carefully in order to avoid any zombie pets!
Different cloud operators will want to support different SLAs with different workflows, e.g.
nova-compute fails, VMs are still perfectly healthy but unmanageable.
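To see that failure mode concretely (a hypothetical check, not one of this lab's steps; assumes the nova CLI and a sourced .openrc):

nova service-list    # nova-compute on the failed node shows State "down"
nova list            # ...but its instances still report status ACTIVE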
The obvious workarounds are ugly!
crowbar.c$cloud_number
blacher.arch.suse.de
ssh controller1
ssh compute2
etc.

root@crowbar:~ # tail -f /var/log/crowbar/production.log
root@crowbar:~ # tail -f /var/log/crowbar/chef-client/*.log
Login to one of the controller nodes, and do:
NovaCompute / NovaEvacuate OCF agents

- from the openstack-resource-agents repo
- NovaCompute wraps nova-compute, handling the case where nova fails during recovery
- NovaEvacuate handles nova setup and invokes nova evacuate, nova's recovery API
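For illustration only, a NovaEvacuate primitive might be configured via crmsh roughly like this; the parameter values are placeholders and the exact parameter names vary between releases:

# hypothetical sketch: Keystone credentials let the agent call nova's recovery API
primitive nova-evacuate ocf:openstack:NovaEvacuate \
    params auth_url="http://keystone.example.com:5000/v2.0" \
           username="admin" password="secret" tenant_name="admin" \
    op monitor interval="10s"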
nova evacuate does not really mean evacuation!

- In nova terminology: nova live-migration vs. nova evacuate ?!
- nova developers considered a rename
- In any nova-related context, pretend you saw “resurrect”
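For example, “evacuating” (i.e. resurrecting) an instance whose host has died is a single nova call; the instance name here is a placeholder:

nova evacuate myinstance    # rebuild the instance on another available host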
Two key areas:

- /var/lib/glance/images on controller nodes
- /var/lib/nova/instances on compute nodes

If /var/lib/nova/instances is shared: resurrected VMs keep their ephemeral disks.
Otherwise: VMs are rebuilt from their images, so anything on the ephemeral disks is lost.

Either way, /var/lib/glance/images should be shared across all controllers (unless using Swift / Ceph); otherwise nova might fail to retrieve the image from glance.
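As a hypothetical sketch of the shared case above: an NFS entry for the instance store in /etc/fstab on each compute node could look like this (server name and options are placeholders, not this lab's exact setup):

admin:/var/lib/nova/instances   /var/lib/nova/instances   nfs   defaults   0 0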
crowbar batch set up shared storage. We're using the admin server's NFS server:

- nfs_client barclamp
- /etc/exports on admin server
- /etc/fstab on controller / compute nodes
- mount on controller / compute nodes
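Roughly what that looks like, purely as an illustration (paths, network and options are placeholders rather than this lab's exact values):

# /etc/exports on the admin server
/var/lib/glance/images   192.168.124.0/24(rw,no_root_squash,sync)

# then on a controller node
mount -t nfs admin:/var/lib/glance/images /var/lib/glance/images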
crowbar batch
Unattended batch setup of barclamps:
root@crowbar:~ # crowbar batch build my-cloud.yaml
Dump current barclamps as YAML:
root@crowbar:~ # crowbar batch export
- barclamp: pacemaker
  name: services
  attributes:
    stonith:
      mode: libvirt
      libvirt:
        hypervisor_ip: 192.168.217.1
    drbd:
      enabled: true
  deployment:
    elements:
      hawk-server:
        - "@@controller1@@"
        - "@@controller2@@"
      pacemaker-cluster-member:
        - "@@controller1@@"
        - "@@controller2@@"
      pacemaker-remote:
        - "@@compute1@@"
        - "@@compute2@@"
- barclamp: nova
  attributes:
    use_migration: true
    kvm:
      ksm_enabled: true
  deployment:
    elements:
      nova-controller:
        - cluster:cluster1
      nova-compute-kvm:
        - remotes:cluster1
Let's boot a VM to test compute node HA!
Connect to one of the controller nodes, and get image / flavor / net names:
source .openrc
openstack image list
openstack flavor list
neutron net-list
Boot the VM using these ids:
nova boot --image image --flavor flavor --nic net-id=net testvm
Test it's booted:
nova show testvm
Create floating IP:
neutron floatingip-create floatingnet
Get VM IP:
nova list
Get port id:
neutron port-list | grep vmIP
Associate floating IP with VM port:
neutron floatingip-associate floatingipID portID
The VM uses the default security group. Make sure it allows ICMP.
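If it does not, a rule can be added from a controller node (assuming the neutron CLI used elsewhere in this lab, with .openrc sourced):

neutron security-group-rule-create --protocol icmp --direction ingress default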
Ping VM:
ping vmFloatingIP
Ping host where the VM is running:
nova list --fields host,name
ping hostIP
Check log messages for the NovaEvacuate workflow:
tail -f /var/log/messages | grep NovaEvacuate
Monitor cluster status:
crm_mon
Login to compute node where VM runs, and type:
pkill -9 -f pacemaker_remoted
This will cause fencing! (Why?)
In /var/log/messages on the DC, you should see:

NovaEvacuate [...] Initiating evacuation
NovaEvacuate [...] Completed evacuation

Also check:

- crm status shows the compute node offline (then back online)
- nova list --fields host,name
Pacemaker monitors compute nodes via pacemaker_remote.
If compute node failure is detected:

- crm_mon etc. will show the node unclean / offline
- fence-nova acts as a secondary fencing resource (crm configure show fencing_topology)
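As an illustration only (the primary fencing device name here is an assumption, not taken from this lab), a two-level fencing topology in crmsh could look like:

# level 1: normal STONITH device; level 2: fence-nova for the compute remotes
fencing_topology \
    compute1: stonith-libvirt fence-nova \
    compute2: stonith-libvirt fence-nova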
Find the node running fence_compute:

crm resource show fence-nova
The fence_compute script:

- tells the nova server that the node is down

Log files:

- /var/log/nova/fence_compute.log
- /var/log/messages on the DC and on the node running fence-nova
Verify attribute state via:
attrd_updater --query --all --name=evacuate
NovaEvacuate spots the attribute and calls nova evacuate:
root@controller1:~ # crm resource show nova-evacuate
resource nova-evacuate is running on: d52-54-77-77-77-02
nova resurrects the VM on the other node:

root@controller2:~ # grep nova-evacuate /var/log/messages
NovaEvacuate [...] Initiating evacuation
NovaEvacuate [...] Completed evacuation
Warning: no retries if resurrection fails!
pacemaker_remote looks after key compute node services.

Use crmsh on cl-g-nova-compute to find out which services it looks after.
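For example, using standard crmsh (only the resource name cl-g-nova-compute comes from this setup):

crm configure show cl-g-nova-compute    # shows the cloned group of compute node services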