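This article documents a complete KubeVirt live migration lab. The goal is not just to show that a virtual machine can move from one node to another, but to verify three capabilities step by step:
- Install Kube-OVN and KubeVirt in a multi-node kind cluster.
- Place a KubeVirt VM in a Kube-OVN VPC subnet and verify that its IP and network membership are unchanged after migration.
- Run application-level requests, memory writes, and external access during the migration, and observe whether application requests are interrupted.
The lab pins the versions of its main components so that a moving "latest" cannot make it unreproducible later: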
- kind: v0.30.0
- Kubernetes node image: kindest/node:v1.34.0
- Kube-OVN: v1.16.2
- KubeVirt: v1.8.3
- virtctl: v1.8.3
Lab directory:
|
/home/jimyag/src/github/jimyag/homelab/labs/kubevirt-kubeovn-kind
|
The lab runs on a Linux host and depends on Docker and kind. KubeVirt prefers hardware virtualization, so the host needs /dev/kvm.
Check the prerequisites:
|
kind version
kubectl version --client=true
docker version
ls -l /dev/kvm
|
The output should look similar to:
|
kind v0.30.0 go1.26.1 linux/amd64
Client Version: v1.35.0
crw-rw---- 1 root kvm 10, 232 ... /dev/kvm
|
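If /dev/kvm does not exist, KubeVirt can still run with software emulation enabled, at a significant performance cost. The patch below modifies the KubeVirt CR, so it can only be applied after KubeVirt has been installed: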
|
kubectl patch kubevirt kubevirt -n kubevirt --type merge -p \
'{"spec":{"configuration":{"developerConfiguration":{"useEmulation":true}}}}'
|
In this lab the kind nodes can see /dev/kvm, so KubeVirt ends up on the hardware virtualization path.
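The choice between the two paths can be made explicit before the cluster is created. The following is a minimal sketch, not part of the original lab: the function name kvm_mode and the labels it prints are illustrative, and it only checks that the device node exists and is accessible to the current user.

```shell
# kvm_mode [DEVICE]: print "hardware" if DEVICE is a character device that
# the current user can read and write, otherwise print "emulation".
# The name and the two labels are illustrative, not KubeVirt terms.
kvm_mode() {
  dev=${1:-/dev/kvm}
  if [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]; then
    echo hardware
  else
    echo emulation
  fi
}

kvm_mode /dev/kvm
```

If this prints emulation, the useEmulation patch shown earlier is the path to take once KubeVirt is installed.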
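Live migration needs at least a source node and a target node. This lab creates 1 control-plane node and 2 workers:
- kv-ovn-control-plane
- kv-ovn-worker
- kv-ovn-worker2
The KubeVirt VM migrates between the two workers.
The main migration path is verified first with a containerDisk, because it introduces no persistent storage variable. A VM with a persistent disk is then verified with an RWX PVC.
kind's default local StorageClass is node-local storage and is not suitable for directly verifying cross-node migration of a running VM. To migrate a VM with a persistent disk, the storage must be accessible from the source and target nodes at the same time. This lab uses a lightweight NFS server and nfs-subdir-external-provisioner to create a ReadWriteMany StorageClass named nfs-rwx.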
File path:
|
labs/kubevirt-kubeovn-kind/kind-config.yaml
|
Full YAML:
|
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: kv-ovn
networking:
  disableDefaultCNI: true
  podSubnet: 10.16.0.0/16
  serviceSubnet: 10.96.0.0/12
nodes:
  - role: control-plane
    image: kindest/node:v1.34.0
    extraMounts:
      - hostPath: /dev/kvm
        containerPath: /dev/kvm
  - role: worker
    image: kindest/node:v1.34.0
    extraMounts:
      - hostPath: /dev/kvm
        containerPath: /dev/kvm
  - role: worker
    image: kindest/node:v1.34.0
    extraMounts:
      - hostPath: /dev/kvm
        containerPath: /dev/kvm
|
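Two settings matter here:
- disableDefaultCNI: true: skips kind's default CNI so that Kube-OVN can take over cluster networking later.
- extraMounts: mounts the host's /dev/kvm into every kind node so that KubeVirt can use hardware virtualization.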
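Because every node entry repeats the same image and mount, the file can also be generated. The sketch below is an illustration rather than part of the lab: gen_kind_config is a hypothetical helper that prints the same structure for one control-plane node and a configurable number of workers.

```shell
# gen_kind_config [WORKERS]: print a kind config with one control-plane node
# and WORKERS worker nodes (default 2), each with /dev/kvm mounted.
# Hypothetical helper; the values mirror the lab's kind-config.yaml.
gen_kind_config() {
  workers=${1:-2}
  image=kindest/node:v1.34.0
  cat <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: kv-ovn
networking:
  disableDefaultCNI: true
  podSubnet: 10.16.0.0/16
  serviceSubnet: 10.96.0.0/12
nodes:
EOF
  i=0
  while [ "$i" -le "$workers" ]; do
    # Node 0 is the control plane, the rest are workers.
    if [ "$i" -eq 0 ]; then role=control-plane; else role=worker; fi
    cat <<EOF
  - role: $role
    image: $image
    extraMounts:
      - hostPath: /dev/kvm
        containerPath: /dev/kvm
EOF
    i=$((i + 1))
  done
}
```

Running gen_kind_config 2 and redirecting the output to a file should yield a config equivalent to the one used in this lab, which can then be passed to kind create cluster --config.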
|
cd /home/jimyag/src/github/jimyag/homelab
kind create cluster --config labs/kubevirt-kubeovn-kind/kind-config.yaml
|
After the cluster is created, check the nodes:
|
kubectl get nodes -o wide
|
Before Kube-OVN is installed there is no CNI, so the nodes may briefly report NotReady. This is expected.
Check that /dev/kvm is visible inside the kind nodes:
|
docker exec kv-ovn-control-plane ls -l /dev/kvm
docker exec kv-ovn-worker ls -l /dev/kvm
docker exec kv-ovn-worker2 ls -l /dev/kvm
|
All three nodes should show /dev/kvm.
Download the Kube-OVN install script and run it:
|
curl -fsSL https://raw.githubusercontent.com/kubeovn/kube-ovn/v1.16.2/dist/images/install.sh \
-o /tmp/kube-ovn-install.sh
chmod +x /tmp/kube-ovn-install.sh
ENABLE_LIVE_MIGRATION_OPTIMIZE=true bash /tmp/kube-ovn-install.sh
|
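ENABLE_LIVE_MIGRATION_OPTIMIZE=true is the key switch kept for this lab. It enables Kube-OVN's network optimization for VM live migration.
At the end, the install script may try to write kubectl-ko to /usr/local/bin. If the current user lacks permission, an error like this may appear: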
|
error: open /usr/local/bin/kubectl-ko: permission denied
|
This error only affects the installation of the local kubectl plugin, not the Kube-OVN components in the cluster. Judge the installation by component status:
|
kubectl -n kube-system get pods -l app=kube-ovn-controller -o wide
kubectl -n kube-system get pods -l app=kube-ovn-cni -o wide
kubectl -n kube-system get pods -l app=ovs -o wide
kubectl -n kube-system get pods -l app=kube-ovn-pinger -o wide
kubectl get nodes -o wide
|
Expected:
- kube-ovn-controller is Running
- kube-ovn-cni runs on every node
- ovs-ovn runs on every node
- kube-ovn-pinger is Running on the worker nodes
- all three Kubernetes nodes become Ready
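These expectations can be checked mechanically instead of by eye. The following sketch assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS first); not_ready_pods is a hypothetical helper name, and it reads the listing from standard input so that it can be tried without a cluster.

```shell
# not_ready_pods: read `kubectl get pods --no-headers` output on stdin and
# print the name of every pod that is not Running with all containers ready.
# The exit status is 1 if at least one such pod was printed.
not_ready_pods() {
  awk '
    {
      split($2, r, "/")   # READY column looks like "1/1"
      if ($3 != "Running" || r[1] != r[2]) { print $1; bad = 1 }
    }
    END { exit bad }
  '
}

# Example, assuming a running cluster:
#   kubectl -n kube-system get pods -l app=kube-ovn-cni --no-headers | not_ready_pods
```

An empty output with exit status 0 means every listed pod is Running and fully ready.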
|
kubectl get crd vpcs.kubeovn.io subnets.kubeovn.io ips.kubeovn.io
|
All of these CRDs should be listed.
Install the KubeVirt operator and CR:
|
export KUBEVIRT_VERSION=v1.8.3
kubectl apply -f "https://github.com/kubevirt/kubevirt/releases/download/${KUBEVIRT_VERSION}/kubevirt-operator.yaml"
kubectl apply -f "https://github.com/kubevirt/kubevirt/releases/download/${KUBEVIRT_VERSION}/kubevirt-cr.yaml"
|
Enable the LiveMigration feature gate:
|
kubectl patch kubevirt kubevirt -n kubevirt --type merge -p \
'{"spec":{"configuration":{"developerConfiguration":{"featureGates":["LiveMigration"]}}}}'
|
Wait for KubeVirt to become available:
|
kubectl -n kubevirt wait kv kubevirt --for condition=Available --timeout=15m
kubectl -n kubevirt get pods -o wide
|
Expected:
- virt-api is Running
- virt-controller is Running
- virt-handler is Running on both worker nodes
- virt-operator is Running
Check the virtualization labels on the nodes:
|
kubectl get nodes --show-labels | tr ',' '\n' | rg 'kubevirt.io/schedulable|cpu-feature.node.kubevirt.io/vmx'
|
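If the worker nodes carry kubevirt.io/schedulable=true along with CPU/KVM-related labels, KubeVirt has detected the virtualization capability.
The virtctl already installed on the host may not match the KubeVirt version in the cluster. Downloading v1.8.3, the same version as KubeVirt, is recommended: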
|
curl -fL https://github.com/kubevirt/kubevirt/releases/download/v1.8.3/virtctl-v1.8.3-linux-amd64 \
-o /tmp/virtctl-v1.8.3
chmod +x /tmp/virtctl-v1.8.3
/tmp/virtctl-v1.8.3 version --client
|
The rest of this article uses /tmp/virtctl-v1.8.3 throughout.
The lab creates a namespace, VPC, and Subnet dedicated to the VMs:
- Namespace: kubevirt-vpc-test
- VPC: vm-live-migration-vpc
- Subnet: vm-live-migration-subnet
- CIDR: 10.250.0.0/24
- Gateway: 10.250.0.1
Note: Kube-OVN v1.16.2 does not allow a VPC and a Subnet to share a name. If both are called vm-live-migration, the Subnet enters an error state:
|
subnet vm-live-migration and vpc vm-live-migration cannot have the same name
|
For that reason the VPC and the Subnet are explicitly given two different names.
File path:
|
labs/kubevirt-kubeovn-kind/manifests/01-vpc-subnet.yaml
|
Full YAML:
|
apiVersion: v1
kind: Namespace
metadata:
  name: kubevirt-vpc-test
  annotations:
    ovn.kubernetes.io/logical_switch: vm-live-migration-subnet
---
apiVersion: kubeovn.io/v1
kind: Vpc
metadata:
  name: vm-live-migration-vpc
spec:
  namespaces:
    - kubevirt-vpc-test
---
apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: vm-live-migration-subnet
spec:
  protocol: IPv4
  provider: ovn
  vpc: vm-live-migration-vpc
  namespaces:
    - kubevirt-vpc-test
  cidrBlock: 10.250.0.0/24
  gateway: 10.250.0.1
  excludeIps:
    - 10.250.0.1..10.250.0.10
  natOutgoing: true
|
Apply the resources:
|
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/01-vpc-subnet.yaml
|
Check the Subnet:
|
kubectl get subnet vm-live-migration-subnet -o yaml
|
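Expect Validated=True and Ready=True.
This scenario leaves persistent storage out of the picture and only verifies that a VM can complete a live migration on the Kube-OVN VPC network.
File path: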
|
labs/kubevirt-kubeovn-kind/manifests/02-vm-containerdisk.yaml
|
Full YAML:
|
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: cirros-live-migration
  namespace: kubevirt-vpc-test
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        kubevirt.io/domain: cirros-live-migration
      annotations:
        kubevirt.io/allow-pod-bridge-network-live-migration: "true"
    spec:
      evictionStrategy: LiveMigrate
      domain:
        resources:
          requests:
            memory: 128Mi
        devices:
          interfaces:
            - name: default
              bridge: {}
          disks:
            - name: containerdisk
              disk:
                bus: virtio
            - name: cloudinitdisk
              disk:
                bus: virtio
      networks:
        - name: default
          pod: {}
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo:latest
        - name: cloudinitdisk
          cloudInitNoCloud:
            userData: |
              #cloud-config
              password: gocubsgo
              chpasswd:
                expire: false
              ssh_pwauth: true
|
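Two settings here relate to live migration:
- evictionStrategy: LiveMigrate: live-migrates the VM instead of shutting it down when it is evicted, for example during node maintenance.
- kubevirt.io/allow-pod-bridge-network-live-migration: "true": allows live migration when the VM is attached to the pod network in bridge mode.
Create the VM and wait for it to become ready: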
|
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/02-vm-containerdisk.yaml
kubectl -n kubevirt-vpc-test wait vmi/cirros-live-migration \
--for condition=Ready --timeout=10m
|
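One caveat with waiting right after an apply: the VMI is created asynchronously by the VM controller, and kubectl wait can return a NotFound error if the object does not exist yet. A small retry wrapper is one way to absorb that window. This is a generic sketch, not something the original lab used; retry is a hypothetical helper.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds,
# at most ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Returns 0 on the first success and 1 if every attempt failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example, assuming the lab's namespace and VM name:
#   retry 30 2 kubectl -n kubevirt-vpc-test get vmi cirros-live-migration
#   kubectl -n kubevirt-vpc-test wait vmi/cirros-live-migration \
#     --for condition=Ready --timeout=10m
```

Polling with kubectl get first keeps the subsequent kubectl wait focused on readiness rather than on existence.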
List the VM, VMI, and Pod:
|
kubectl -n kubevirt-vpc-test get vm,vmi,pod -o wide
kubectl get ips | rg 'cirros-live-migration'
|
Expected:
|
VMI PHASE: Running
VMI READY: True
LIVE-MIGRATABLE: True
IP: 10.250.0.x
logical_switch: vm-live-migration-subnet
|
In this run the VM was first assigned 10.250.0.11, and the source node was kv-ovn-worker2.
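Whether an address such as 10.250.0.11 really belongs to the VPC subnet can be checked with plain shell arithmetic. The sketch below is illustrative: ip_to_int and ip_in_cidr are hypothetical helpers, they handle IPv4 only, and they assume well-formed input without leading zeros.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# ip_in_cidr IP CIDR: succeed if IP falls inside CIDR (e.g. 10.250.0.0/24).
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example with the lab's values:
#   ip_in_cidr 10.250.0.11 10.250.0.0/24 && echo "in VPC subnet"
```

An address from the cluster's default pod subnet (10.16.0.0/16) fails this check, which is a quick way to notice a VM that landed on the wrong logical switch.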
|
kubectl -n kubevirt-vpc-test get vmi cirros-live-migration -o wide
/tmp/virtctl-v1.8.3 -n kubevirt-vpc-test migrate cirros-live-migration
kubectl -n kubevirt-vpc-test get virtualmachineinstancemigration -w
|
Once the migration reaches Succeeded, check again:
|
kubectl -n kubevirt-vpc-test get vmi cirros-live-migration -o wide
kubectl -n kubevirt-vpc-test get pod -o wide
kubectl get ips | rg 'cirros-live-migration'
|
Results from this run:
|
source node: kv-ovn-worker2
target node: kv-ovn-worker
IP before/after: 10.250.0.11
migration phase: Succeeded
VMI phase after migration: Running
LIVE-MIGRATABLE after migration: True
|
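The two facts that matter in the result above (the node changed, the IP did not) can be asserted by a script. A minimal sketch follows; check_migration is a hypothetical helper, and the commented kubectl calls assume the VMI status fields nodeName and interfaces[0].ipAddress.

```shell
# check_migration SRC_NODE DST_NODE IP_BEFORE IP_AFTER
# Succeeds only if the node changed and the (non-empty) IP did not.
check_migration() {
  if [ "$1" = "$2" ]; then
    echo "node unchanged: $1" >&2
    return 1
  fi
  if [ -z "$3" ] || [ "$3" != "$4" ]; then
    echo "IP changed: '$3' -> '$4'" >&2
    return 1
  fi
  echo "moved $1 -> $2, IP kept $3"
}

# Example: capture these once before and once after the migration.
#   kubectl -n kubevirt-vpc-test get vmi cirros-live-migration \
#     -o jsonpath='{.status.nodeName}'
#   kubectl -n kubevirt-vpc-test get vmi cirros-live-migration \
#     -o jsonpath='{.status.interfaces[0].ipAddress}'
```

With the values from this run, check_migration kv-ovn-worker2 kv-ovn-worker 10.250.0.11 10.250.0.11 succeeds.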
|
kubectl -n kubevirt-vpc-test run ping-vm --rm -i --restart=Never \
--image=busybox:1.36 -- ping -c 3 -W 2 10.250.0.11
|
Results from this run:
|
3 packets transmitted, 3 packets received, 0% packet loss
|
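Note: ping only proves layer-3 connectivity; it does not prove that the application is unaffected. An application-level test is added later.
This scenario verifies live migration of a VM with a persistent disk. The key requirement is that the storage is accessible from the source and target nodes at the same time. This lab provides the ReadWriteMany PVC over NFS.
File path: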
|
labs/kubevirt-kubeovn-kind/manifests/03-nfs-rwx-storage.yaml
|
Full YAML:
|
apiVersion: v1
kind: Namespace
metadata:
  name: nfs-provisioner
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-server
  namespace: nfs-provisioner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-server
  template:
    metadata:
      labels:
        app: nfs-server
    spec:
      containers:
        - name: nfs-server
          image: itsthenetwork/nfs-server-alpine:12
          securityContext:
            privileged: true
          env:
            - name: SHARED_DIRECTORY
              value: /exports
          ports:
            - name: nfs
              containerPort: 2049
            - name: mountd
              containerPort: 20048
            - name: rpcbind
              containerPort: 111
          volumeMounts:
            - name: exports
              mountPath: /exports
      volumes:
        - name: exports
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
  namespace: nfs-provisioner
spec:
  clusterIP: 10.96.200.200
  selector:
    app: nfs-server
  ports:
    - name: nfs
      port: 2049
      targetPort: 2049
    - name: mountd
      port: 20048
      targetPort: 20048
    - name: rpcbind-tcp
      port: 111
      targetPort: 111
      protocol: TCP
    - name: rpcbind-udp
      port: 111
      targetPort: 111
      protocol: UDP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-subdir-external-provisioner
  namespace: nfs-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: nfs-subdir-external-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: run-nfs-subdir-external-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-subdir-external-provisioner
    namespace: nfs-provisioner
roleRef:
  kind: ClusterRole
  name: nfs-subdir-external-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-subdir-external-provisioner
  namespace: nfs-provisioner
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-subdir-external-provisioner
  namespace: nfs-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-subdir-external-provisioner
    namespace: nfs-provisioner
roleRef:
  kind: Role
  name: leader-locking-nfs-subdir-external-provisioner
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-subdir-external-provisioner
  namespace: nfs-provisioner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-subdir-external-provisioner
  template:
    metadata:
      labels:
        app: nfs-subdir-external-provisioner
    spec:
      serviceAccountName: nfs-subdir-external-provisioner
      containers:
        - name: nfs-subdir-external-provisioner
          image: registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2
          env:
            - name: PROVISIONER_NAME
              value: homelab.local/nfs-rwx
            - name: NFS_SERVER
              value: 10.96.200.200
            - name: NFS_PATH
              value: /
          volumeMounts:
            - name: nfs-client-root
              mountPath: /persistentvolumes
      volumes:
        - name: nfs-client-root
          nfs:
            server: 10.96.200.200
            path: /
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-rwx
provisioner: homelab.local/nfs-rwx
parameters:
  archiveOnDelete: "false"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
|
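Two pitfalls are worth noting:
- The NFS Service uses the fixed ClusterIP 10.96.200.200, so the kubelet does not depend on cluster DNS when it mounts NFS on the node.
- The itsthenetwork/nfs-server-alpine:12 image exports /exports as the NFSv4 pseudo-root, so the provisioner mounts / rather than /exports.
Apply the manifest and wait for both deployments: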
|
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/03-nfs-rwx-storage.yaml
kubectl -n nfs-provisioner rollout status deployment/nfs-server --timeout=5m
kubectl -n nfs-provisioner rollout status deployment/nfs-subdir-external-provisioner --timeout=5m
kubectl get storageclass nfs-rwx
|
Confirm that the worker nodes have the NFS mount helpers:
|
docker exec kv-ovn-worker sh -c 'command -v mount.nfs || command -v mount.nfs4 || true'
docker exec kv-ovn-worker2 sh -c 'command -v mount.nfs || command -v mount.nfs4 || true'
|
File path:
|
labs/kubevirt-kubeovn-kind/manifests/04-vm-rwx-pvc.yaml
|
Full YAML:
|
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cirros-rwx-data
  namespace: kubevirt-vpc-test
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-rwx
  resources:
    requests:
      storage: 1Gi
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: cirros-live-migration-rwx
  namespace: kubevirt-vpc-test
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        kubevirt.io/domain: cirros-live-migration-rwx
      annotations:
        kubevirt.io/allow-pod-bridge-network-live-migration: "true"
    spec:
      evictionStrategy: LiveMigrate
      domain:
        resources:
          requests:
            memory: 128Mi
        devices:
          interfaces:
            - name: default
              bridge: {}
          disks:
            - name: containerdisk
              disk:
                bus: virtio
            - name: datadisk
              disk:
                bus: virtio
            - name: cloudinitdisk
              disk:
                bus: virtio
      networks:
        - name: default
          pod: {}
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo:latest
        - name: datadisk
          persistentVolumeClaim:
            claimName: cirros-rwx-data
        - name: cloudinitdisk
          cloudInitNoCloud:
            userData: |
              #cloud-config
              password: gocubsgo
              chpasswd:
                expire: false
              ssh_pwauth: true
|
|
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/04-vm-rwx-pvc.yaml
kubectl -n kubevirt-vpc-test wait pvc/cirros-rwx-data \
--for jsonpath='{.status.phase}'=Bound --timeout=3m
kubectl -n kubevirt-vpc-test wait vmi/cirros-live-migration-rwx \
--for condition=Ready --timeout=10m
kubectl -n kubevirt-vpc-test get pvc,pv
kubectl -n kubevirt-vpc-test get vmi cirros-live-migration-rwx -o wide
|
Trigger the migration:
|
/tmp/virtctl-v1.8.3 -n kubevirt-vpc-test migrate cirros-live-migration-rwx
kubectl -n kubevirt-vpc-test wait virtualmachineinstancemigration \
-l kubevirt.io/vmi-name=cirros-live-migration-rwx \
--for jsonpath='{.status.phase}'=Succeeded --timeout=5m
|
Check the result:
|
kubectl -n kubevirt-vpc-test get vmi cirros-live-migration-rwx -o wide
kubectl get ips | rg 'cirros-live-migration-rwx'
|
Results from this run:
|
PVC: cirros-rwx-data
PVC access mode: RWX
StorageClass: nfs-rwx
source node: kv-ovn-worker
target node: kv-ovn-worker2
IP before/after: 10.250.0.13
migration phase: Succeeded
VMI phase after migration: Running
LIVE-MIGRATABLE after migration: True
|
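ping is not enough to prove that the workload is unaffected. In this scenario the VM runs a simple application while also generating memory writes and outbound HTTP requests. A client inside the cluster requests the VM continuously, and the number of failed application requests during the migration is counted.
The test passes when:
- the migration object reaches Succeeded
- the VM IP stays the same
- the VMI is Running and Ready=True after the migration
- the client's consecutive HTTP requests have 0 failures
File path: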
|
labs/kubevirt-kubeovn-kind/manifests/05-vm-app-workload.yaml
|
Full YAML:
|
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: cirros-app-live-migration
  namespace: kubevirt-vpc-test
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        kubevirt.io/domain: cirros-app-live-migration
      annotations:
        kubevirt.io/allow-pod-bridge-network-live-migration: "true"
    spec:
      evictionStrategy: LiveMigrate
      domain:
        resources:
          requests:
            memory: 256Mi
        devices:
          interfaces:
            - name: default
              bridge: {}
          disks:
            - name: containerdisk
              disk:
                bus: virtio
            - name: cloudinitdisk
              disk:
                bus: virtio
      networks:
        - name: default
          pod: {}
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo:latest
        - name: cloudinitdisk
          cloudInitNoCloud:
            userData: |
              #!/bin/sh
              echo "cirros:gocubsgo" | chpasswd
              (
                while true; do
                  dd if=/dev/zero of=/tmp/memload bs=1M count=96 >/dev/null 2>&1
                  rm -f /tmp/memload
                  wget -q -T 3 -O /tmp/external.out http://example.com >/dev/null 2>&1 || true
                  sleep 1
                done
              ) >/tmp/workload.log 2>&1 &
              (
                while true; do
                  printf 'HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\nok\n' | nc -l -p 8080
                done
              ) >/tmp/nc-http.log 2>&1 &
|
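After boot, the VM does three things:
- Uses nc to return a fixed HTTP response on port 8080.
- Repeatedly writes and deletes a 96 MiB /tmp/memload file to generate memory and I/O churn.
- Repeatedly requests http://example.com to simulate the workload inside the VM connecting to an external service.
Create the VM and look up its IP: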
|
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/05-vm-app-workload.yaml
kubectl -n kubevirt-vpc-test wait vmi/cirros-app-live-migration \
--for condition=Ready --timeout=10m
kubectl -n kubevirt-vpc-test get vmi cirros-app-live-migration -o wide
kubectl get ips | rg 'cirros-app-live-migration'
|
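In this run the application VM's IP was 10.250.0.19, and the source node was kv-ovn-worker.
Send a single request first to confirm that the application responds: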
|
kubectl -n kubevirt-vpc-test run app-http-check --rm -i --restart=Never \
--image=curlimages/curl:8.10.1 -- \
curl -fsS --connect-timeout 2 --max-time 3 http://10.250.0.19:8080/
|
The response body should be ok.
Then start a client pod that sends 180 requests to the VM, pausing 0.2 s between them:
|
kubectl -n kubevirt-vpc-test run app-migration-client --restart=Never \
  --image=curlimages/curl:8.10.1 -- sh -c '
fail=0
ok=0
max_ms=0
for i in $(seq 1 180); do
  start=$(date +%s%3N)
  code=$(curl -sS -o /tmp/body --connect-timeout 1 --max-time 1 \
    -w "%{http_code}" http://10.250.0.19:8080/ || echo 000)
  end=$(date +%s%3N)
  dur=$((end-start))
  if [ "$code" = "200" ] && grep -q ok /tmp/body; then
    ok=$((ok+1))
  else
    fail=$((fail+1))
    echo "FAIL i=$i code=$code dur_ms=$dur"
  fi
  if [ "$dur" -gt "$max_ms" ]; then max_ms=$dur; fi
  echo "i=$i code=$code dur_ms=$dur ok=$ok fail=$fail max_ms=$max_ms"
  sleep 0.2
done
echo "SUMMARY ok=$ok fail=$fail max_ms=$max_ms"
'
|
Confirm that the client has started sending requests:
|
kubectl -n kubevirt-vpc-test logs app-migration-client --tail=10
|
Normally the log shows a continuous run of code=200 lines.
Trigger the migration while the client is still running:
|
/tmp/virtctl-v1.8.3 -n kubevirt-vpc-test migrate cirros-app-live-migration
kubectl -n kubevirt-vpc-test wait virtualmachineinstancemigration \
-l kubevirt.io/vmi-name=cirros-app-live-migration \
--for jsonpath='{.status.phase}'=Succeeded --timeout=5m
|
|
kubectl -n kubevirt-vpc-test wait pod/app-migration-client \
--for jsonpath='{.status.phase}'=Succeeded --timeout=2m
kubectl -n kubevirt-vpc-test logs app-migration-client | tail -30
kubectl -n kubevirt-vpc-test get vmi cirros-app-live-migration -o wide
kubectl -n kubevirt-vpc-test get virtualmachineinstancemigration \
-l kubevirt.io/vmi-name=cirros-app-live-migration -o wide
kubectl get ips | rg 'cirros-app-live-migration'
|
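The pass/fail decision for the client can be read from the final SUMMARY line of its log rather than judged by eye. The following sketch assumes the log format produced by the client script above; check_summary is a hypothetical helper that reads the log from standard input.

```shell
# check_summary: read the client log on stdin, take the last SUMMARY line,
# print its counters, and fail if the line is missing or fail is not 0.
# Exit status: 0 = no failed requests, 1 = failures, 2 = no SUMMARY line.
check_summary() {
  awk '
    /^SUMMARY / {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")      # fields look like "ok=180"
        v[kv[1]] = kv[2]
      }
      seen = 1
    }
    END {
      if (!seen) { print "no SUMMARY line"; exit 2 }
      printf "ok=%s fail=%s max_ms=%s\n", v["ok"], v["fail"], v["max_ms"]
      exit ((v["fail"] + 0 == 0) ? 0 : 1)
    }
  '
}

# Example, assuming the client pod has finished:
#   kubectl -n kubevirt-vpc-test logs app-migration-client | check_summary
```

A missing SUMMARY line usually means the client pod was still running or was killed before it finished, so it is reported separately from a run with failed requests.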
Results from this run:
|
source node: kv-ovn-worker
target node: kv-ovn-worker2
IP before/after: 10.250.0.19
migration phase: Succeeded
VMI phase after migration: Running
VMI Ready after migration: True
client summary: ok=180 fail=0 max_ms=1
|
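This shows that, under this small application workload, no HTTP request failure was observed during the migration. It does not prove that every real workload will be unaffected, but it offers a verification method closer to the application layer. A real workload should turn "must not be affected" into explicit metrics, for example:
- number or rate of failed requests
- p99/p999 latency
- number of TCP reconnects
- number of external dependency errors
- migration duration
- whether the migration still converges within the timeout when the VM's memory dirty rate is high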
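As an illustration of turning one of these metrics into a number, the per-request dur_ms values that the lab's client already logs can be reduced to a percentile. The sketch below uses the nearest-rank method; latency_percentile is a hypothetical helper, and with only 180 samples and millisecond resolution the result is a rough indicator rather than a production-grade p99.

```shell
# latency_percentile P: read the client log on stdin and print the P-th
# percentile (nearest-rank, integer P in 1..100) of dur_ms across the
# per-request lines. Lines starting with "FAIL" are duplicates and skipped.
latency_percentile() {
  p=$1
  sed -n 's/^i=.* dur_ms=\([0-9][0-9]*\) .*/\1/p' |
    sort -n |
    awk -v p="$p" '
      { d[NR] = $1 }
      END {
        if (NR == 0) exit 1
        rank = int((p * NR + 99) / 100)   # ceil(p/100 * NR) for integer p
        if (rank < 1) rank = 1
        print d[rank]
      }
    '
}

# Example, assuming the client pod has finished:
#   kubectl -n kubevirt-vpc-test logs app-migration-client | latency_percentile 99
```

Failed requests stay in the sample with the duration they actually took, so a timeout during the migration window shows up in the tail instead of being silently dropped.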
Symptom:
|
subnet vm-live-migration and vpc vm-live-migration cannot have the same name
|
Fix: give the VPC and the Subnet different names, for example:
|
Vpc: vm-live-migration-vpc
Subnet: vm-live-migration-subnet
|
Symptom:
|
mount.nfs: mounting 10.96.200.200:/exports failed, reason given by server: No such file or directory
|
Cause: itsthenetwork/nfs-server-alpine:12 exports /exports as the NFSv4 pseudo-root, so clients should mount /.
Fix: use the following in the provisioner:
|
NFS_PATH: /
...
nfs:
  server: 10.96.200.200
  path: /
|
Symptom:
|
error: open /usr/local/bin/kubectl-ko: permission denied
|
This only affects writing the local plugin and does not mean the Kube-OVN installation failed. Checking the pod status is enough:
|
kubectl -n kube-system get pods -l app=kube-ovn-controller -o wide
kubectl -n kube-system get pods -l app=kube-ovn-cni -o wide
kubectl -n kube-system get pods -l app=ovs -o wide
|
Symptom:
|
You are using a client virtctl version that is different from the KubeVirt version running in the cluster
|
Fix: download the virtctl that matches KubeVirt:
|
curl -fL https://github.com/kubevirt/kubevirt/releases/download/v1.8.3/virtctl-v1.8.3-linux-amd64 \
-o /tmp/virtctl-v1.8.3
chmod +x /tmp/virtctl-v1.8.3
|
Normal cleanup command:
|
kind delete cluster --name kv-ovn
|
If Docker reports that a kind node container cannot be killed, look at what is left first:
|
docker ps -a --filter name=kv-ovn --format '{{.Names}} {{.Status}} {{.ID}}'
|
If the container state is stuck, inspect the process:
|
docker inspect kv-ovn-worker2 --format '{{.State.Status}} {{.State.Running}} {{.State.Pid}}'
|
If necessary, remove the leftover container:
|
docker rm -f -v kv-ovn-worker2
kind delete cluster --name kv-ovn
|
If Docker itself never receives the exit event, the container's main process or the containerd shim has to be dealt with. This is a Docker/containerd state convergence problem, not a KubeVirt or Kube-OVN live migration problem.
The following is the command sequence for reproducing the lab from scratch. The manifest paths are relative to the repository root, which the first command changes into:
|
cd /home/jimyag/src/github/jimyag/homelab
kind create cluster --config labs/kubevirt-kubeovn-kind/kind-config.yaml
curl -fsSL https://raw.githubusercontent.com/kubeovn/kube-ovn/v1.16.2/dist/images/install.sh \
-o /tmp/kube-ovn-install.sh
chmod +x /tmp/kube-ovn-install.sh
ENABLE_LIVE_MIGRATION_OPTIMIZE=true bash /tmp/kube-ovn-install.sh
kubectl -n kube-system get pods -l app=kube-ovn-controller -o wide
kubectl -n kube-system get pods -l app=kube-ovn-cni -o wide
kubectl -n kube-system get pods -l app=ovs -o wide
kubectl get nodes -o wide
export KUBEVIRT_VERSION=v1.8.3
kubectl apply -f "https://github.com/kubevirt/kubevirt/releases/download/${KUBEVIRT_VERSION}/kubevirt-operator.yaml"
kubectl apply -f "https://github.com/kubevirt/kubevirt/releases/download/${KUBEVIRT_VERSION}/kubevirt-cr.yaml"
kubectl patch kubevirt kubevirt -n kubevirt --type merge -p \
'{"spec":{"configuration":{"developerConfiguration":{"featureGates":["LiveMigration"]}}}}'
kubectl -n kubevirt wait kv kubevirt --for condition=Available --timeout=15m
curl -fL https://github.com/kubevirt/kubevirt/releases/download/v1.8.3/virtctl-v1.8.3-linux-amd64 \
-o /tmp/virtctl-v1.8.3
chmod +x /tmp/virtctl-v1.8.3
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/01-vpc-subnet.yaml
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/02-vm-containerdisk.yaml
kubectl -n kubevirt-vpc-test wait vmi/cirros-live-migration \
--for condition=Ready --timeout=10m
/tmp/virtctl-v1.8.3 -n kubevirt-vpc-test migrate cirros-live-migration
kubectl -n kubevirt-vpc-test wait virtualmachineinstancemigration \
-l kubevirt.io/vmi-name=cirros-live-migration \
--for jsonpath='{.status.phase}'=Succeeded --timeout=5m
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/03-nfs-rwx-storage.yaml
kubectl -n nfs-provisioner rollout status deployment/nfs-server --timeout=5m
kubectl -n nfs-provisioner rollout status deployment/nfs-subdir-external-provisioner --timeout=5m
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/04-vm-rwx-pvc.yaml
kubectl -n kubevirt-vpc-test wait pvc/cirros-rwx-data \
--for jsonpath='{.status.phase}'=Bound --timeout=3m
kubectl -n kubevirt-vpc-test wait vmi/cirros-live-migration-rwx \
--for condition=Ready --timeout=10m
/tmp/virtctl-v1.8.3 -n kubevirt-vpc-test migrate cirros-live-migration-rwx
kubectl -n kubevirt-vpc-test wait virtualmachineinstancemigration \
-l kubevirt.io/vmi-name=cirros-live-migration-rwx \
--for jsonpath='{.status.phase}'=Succeeded --timeout=5m
kubectl apply -f labs/kubevirt-kubeovn-kind/manifests/05-vm-app-workload.yaml
kubectl -n kubevirt-vpc-test wait vmi/cirros-app-live-migration \
--for condition=Ready --timeout=10m
kubectl -n kubevirt-vpc-test get vmi cirros-app-live-migration -o wide
|
The IP in the application-level test must be replaced with the IP actually assigned to the VM. To look it up:
|
kubectl -n kubevirt-vpc-test get vmi cirros-app-live-migration -o jsonpath='{.status.interfaces[0].ipAddress}{"\n"}'
|
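The lookup can return an empty string if the interface status has not been populated yet, and substituting that into the client script would make every request fail for an unrelated reason. A small guard avoids this. The sketch below is illustrative: vm_url is a hypothetical helper, and it only checks that the value looks like a dotted-quad address, not that each octet is in range.

```shell
# vm_url IP [PORT]: print the HTTP URL for the VM (default port 8080), or
# fail if IP does not look like a dotted-quad IPv4 address, e.g. when empty.
vm_url() {
  ip=$1
  port=${2:-8080}
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: '$ip'" >&2
    return 1
  fi
  echo "http://$ip:$port/"
}

# Example, assuming the lab's namespace and VM name:
#   VM_IP=$(kubectl -n kubevirt-vpc-test get vmi cirros-app-live-migration \
#     -o jsonpath='{.status.interfaces[0].ipAddress}')
#   URL=$(vm_url "$VM_IP") || exit 1
```

The resulting URL can then replace the hard-coded http://10.250.0.19:8080/ in the client script.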
Assuming the result is 10.250.0.19, run the continuous requests and the migration:
|
kubectl -n kubevirt-vpc-test run app-migration-client --restart=Never \
  --image=curlimages/curl:8.10.1 -- sh -c '
fail=0
ok=0
max_ms=0
for i in $(seq 1 180); do
  start=$(date +%s%3N)
  code=$(curl -sS -o /tmp/body --connect-timeout 1 --max-time 1 \
    -w "%{http_code}" http://10.250.0.19:8080/ || echo 000)
  end=$(date +%s%3N)
  dur=$((end-start))
  if [ "$code" = "200" ] && grep -q ok /tmp/body; then
    ok=$((ok+1))
  else
    fail=$((fail+1))
    echo "FAIL i=$i code=$code dur_ms=$dur"
  fi
  if [ "$dur" -gt "$max_ms" ]; then max_ms=$dur; fi
  echo "i=$i code=$code dur_ms=$dur ok=$ok fail=$fail max_ms=$max_ms"
  sleep 0.2
done
echo "SUMMARY ok=$ok fail=$fail max_ms=$max_ms"
'
/tmp/virtctl-v1.8.3 -n kubevirt-vpc-test migrate cirros-app-live-migration
kubectl -n kubevirt-vpc-test wait virtualmachineinstancemigration \
-l kubevirt.io/vmi-name=cirros-app-live-migration \
--for jsonpath='{.status.phase}'=Succeeded --timeout=5m
kubectl -n kubevirt-vpc-test wait pod/app-migration-client \
--for jsonpath='{.status.phase}'=Succeeded --timeout=2m
kubectl -n kubevirt-vpc-test logs app-migration-client | tail -30
|
Delete the kind cluster:
|
kind delete cluster --name kv-ovn
|
Confirm that nothing is left behind:
|
kind get clusters
docker ps -a --filter name=kv-ovn --format '{{.Names}} {{.Status}} {{.ID}}'
|
Expected:
|
No kind clusters found.
|
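The lab confirmed three things:
- A multi-node kind cluster can host basic live migration verification for Kube-OVN v1.16.2 and KubeVirt v1.8.3.
- When a KubeVirt VM sits in a Kube-OVN VPC subnet, the VMI keeps its original IP after live migration, and the Kube-OVN IP object is updated to the target node.
- Under a simple application workload, consecutive HTTP requests did not fail during the migration; the result was ok=180 fail=0 max_ms=1.
These results do not generalize to all workloads. A real application should be retested against its own SLOs, such as request failure rate, long-lived connection reconnects, external dependency errors, p99 latency, and migration duration. Whether a live migration is truly "transparent to the workload" must ultimately be demonstrated by those metrics.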