Deploying a k8s cluster with kubespray

2024-10-22

This post deploys a k8s cluster, version v1.28.6, with kubespray. The cluster is named test and runs on nodes node1 through node7. The machines are described below.

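Machine type	Hostname	Notes
ansible control node	node1	The machine that runs the ansible commands; it can be a cluster member or a machine outside the cluster.
master nodes	node1 node2 node3
worker nodes	node4 node5 node6 node7
etcd nodes	node1 node2 node3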

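Installation workflow

Complete the pre-installation preparation and checks.
Build the cluster configuration files; clone the kubespray repository on node1.
Run the deployment command on the ansible control node.
Verify the cluster after deployment.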

Pre-installation preparation

ansible node

Check the python version

The current installation scripts require python 3.10 through 3.12.
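A quick way to confirm that the interpreter falls in that range is sketched below. The helper name python_version_supported is made up for this example, and the 3.10-3.12 bounds are simply the ones stated above.

```shell
# Return 0 when a "major.minor[.patch]" version string is within 3.10-3.12.
# python_version_supported is a hypothetical helper name.
python_version_supported() {
  major="${1%%.*}"      # text before the first dot
  rest="${1#*.}"        # text after the first dot
  minor="${rest%%.*}"   # minor version, without any patch suffix
  [ "$major" -eq 3 ] && [ "$minor" -ge 10 ] && [ "$minor" -le 12 ]
}

# Ask the local interpreter for its major.minor version; empty if python3 is missing
current="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || true)"
if [ -n "$current" ] && python_version_supported "$current"; then
  echo "python $current is within 3.10-3.12"
else
  echo "python ${current:-not found} is outside 3.10-3.12"
fi
```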

sudo apt update && sudo apt install python3 python3-pip

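Change the PyPI mirror. The ansible version available from the Aliyun mirror configured by default on the machine is too old, so switch to another mirror.

python3 -m pip install --upgrade pip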
pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

Generate an ssh key

ssh-keygen -t ed25519
# Press Enter at every prompt to accept the defaults, then print the public key
cat ~/.ssh/id_ed25519.pub
ssh-ed25519 somekey root@node1

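Cluster nodes

Perform the following steps on every node.

Check systemd-resolved

kubelet reads the configuration in /run/systemd/resolve/resolv.conf by default, so systemd-resolved needs to be running.

Check whether systemd-resolved is running: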

systemctl status systemd-resolved.service

If the state is not running, enable the service at boot and start it immediately:

systemctl enable systemd-resolved.service  --now

If this reports an error, troubleshoot it before continuing.

Add the ansible machine's public key

echo "ssh-ed25519 somekey root@node1" >> /root/.ssh/authorized_keys
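Running the echo above more than once leaves duplicate lines in authorized_keys. A minimal sketch of an idempotent variant follows; add_authorized_key is a hypothetical helper name, and the key is the same placeholder used above.

```shell
# Append a public key to an authorized_keys file only if it is not there yet.
# add_authorized_key is a hypothetical helper, not part of ssh or kubespray.
add_authorized_key() {
  key="$1"
  file="$2"
  mkdir -p "$(dirname "$file")"
  touch "$file"
  chmod 600 "$file"
  # -x matches the whole line, -F treats the key as a fixed string
  grep -qxF "$key" "$file" || printf '%s\n' "$key" >> "$file"
}

# Usage on a cluster node:
# add_authorized_key "ssh-ed25519 somekey root@node1" /root/.ssh/authorized_keys
```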

On ubuntu, check `/etc/systemd/resolved.conf`

Check whether /run/systemd/resolve/resolv.conf contains both the k8s internal DNS and the DNS servers from /etc/resolv.conf. The expected content is:

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 8.8.4.4
nameserver 1.1.1.1
nameserver 8.8.8.8
# Too many DNS servers configured, the following entries may be ignored.
nameserver 169.254.25.10
search default.svc.test.local svc.test.local

If it does not, edit /etc/systemd/resolved.conf:

[Resolve]
DNS=8.8.4.4 1.1.1.1 8.8.8.8
#FallbackDNS=
#Domains=
#LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#DNSOverTLS=no
#Cache=no-negative
#DNSStubListener=yes
#ReadEtcHosts=yes
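After editing the file, the service has to be restarted before the change shows up in /run/systemd/resolve/resolv.conf. The sketch below checks that the expected upstream servers are listed; missing_nameservers is a hypothetical helper, and the addresses in the usage comment are the ones from the DNS= line above.

```shell
# Print each expected DNS server that is missing from a resolv.conf-style file.
# First argument: the file to inspect; remaining arguments: expected addresses.
missing_nameservers() {
  file="$1"
  shift
  for ip in "$@"; do
    # -x: whole-line match, -F: fixed string, so the dots are literal
    grep -qxF "nameserver $ip" "$file" || echo "$ip"
  done
}

# Usage on a node after editing /etc/systemd/resolved.conf:
# systemctl restart systemd-resolved.service
# missing_nameservers /run/systemd/resolve/resolv.conf 8.8.4.4 1.1.1.1 8.8.8.8
```

An empty result means every listed server is present.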

Check `/etc/resolv.conf`

If /etc/resolv.conf contains nameserver 127.0.0.1 (as opposed to nameserver 127.0.0.53), remove the nameserver 127.0.0.1 line. A local loopback address configured there causes a DNS loop.
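This check can be scripted so that it is easy to repeat on all seven nodes. The following is a minimal sketch; has_loopback_nameserver is a hypothetical helper name, and it only detects the entry without removing it.

```shell
# Return 0 when the given resolv.conf lists exactly 127.0.0.1 as a nameserver.
# 127.0.0.53 (the systemd-resolved stub address) does not match.
has_loopback_nameserver() {
  grep -Eq '^[[:space:]]*nameserver[[:space:]]+127\.0\.0\.1[[:space:]]*$' "$1"
}

# Report on the local file if it is readable
if [ -r /etc/resolv.conf ] && has_loopback_nameserver /etc/resolv.conf; then
  echo "/etc/resolv.conf contains nameserver 127.0.0.1, remove that line"
fi
```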

Build the kubespray configuration

# Copy the sample configuration
cp -rfp inventory/sample inventory/test

Generate the hosts file for the machines to install

declare -a IPS=(192.168.2.10 192.168.2.11 192.168.2.12 192.168.2.13 192.168.2.14 192.168.2.15 192.168.2.16)
CONFIG_FILE=inventory/test/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

The generated file needs editing: adjust the node names and the hosts assigned to the master, worker, and etcd groups.

all:
  hosts:
    node1:
      ansible_host: 192.168.2.10
      ip: 192.168.2.10
      access_ip: 192.168.2.10
    node2:
      ansible_host: 192.168.2.11
      ip: 192.168.2.11
      access_ip: 192.168.2.11
    node3:
      ansible_host: 192.168.2.12
      ip: 192.168.2.12
      access_ip: 192.168.2.12
    node4:
      ansible_host: 192.168.2.13
      ip: 192.168.2.13
      access_ip: 192.168.2.13
    node5:
      ansible_host: 192.168.2.14
      ip: 192.168.2.14
      access_ip: 192.168.2.14
    node6:
      ansible_host: 192.168.2.15
      ip: 192.168.2.15
      access_ip: 192.168.2.15
    node7:
      ansible_host: 192.168.2.16
      ip: 192.168.2.16
      access_ip: 192.168.2.16
  children:
    kube_control_plane:
      hosts:
        node1:
        node2:
        node3:
    kube_node:
      hosts:
        node4:
        node5:
        node6:
        node7:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
        calico_rr:
    calico_rr:
      hosts: {}

Switch the installation to use mirrors. This setup uses the DaoCloud mirrors:

cp inventory/test/group_vars/all/offline.yml inventory/test/group_vars/all/mirror.yml
sed -i -E '/# .*\{\{ files_repo/s/^# //g' inventory/test/group_vars/all/mirror.yml
tee -a inventory/test/group_vars/all/mirror.yml <<EOF
gcr_image_repo: "gcr.m.daocloud.io"
kube_image_repo: "k8s.m.daocloud.io"
docker_image_repo: "docker.m.daocloud.io"
quay_image_repo: "quay.m.daocloud.io"
github_image_repo: "ghcr.m.daocloud.io"
files_repo: "https://files.m.daocloud.io"
EOF
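To see what the sed expression does, the sketch below applies the same expression to text on stdin instead of editing a file in place. The helper name uncomment_files_repo_lines and the two sample lines in the usage comment are illustrative and are not copied from offline.yml.

```shell
# Apply the same sed expression as above to stdin: only commented lines that
# reference {{ files_repo }} lose their leading "# "; everything else is untouched.
uncomment_files_repo_lines() {
  sed -E '/# .*\{\{ files_repo/s/^# //g'
}

# Usage with two made-up lines:
# printf '%s\n' '# example_url: "{{ files_repo }}/example"' '# registry_host: "example"' \
#   | uncomment_files_repo_lines
```

Only the first sample line is uncommented. The uncommented URLs all depend on files_repo, which is why the tee step appends a files_repo definition to the same file.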

Edit group_vars/etcd.yml

etcd_deployment_type: host

Edit group_vars/k8s_cluster/addons.yml

helm_enabled: true
registry_enabled: true
metrics_server_enabled: true
ingress_nginx_enabled: true
metallb_speaker_enabled: true

Edit group_vars/k8s_cluster/k8s-cluster.yml

kube_pods_subnet: 10.199.0.0/16
cluster_name: test.local # name of the cluster
auto_renew_certificates: true

Double-check that every setting under inventory/test/group_vars is correct.
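The sample files are mostly comments, so one way to review them is to list only the active lines. This is a minimal sketch; show_active_settings is a hypothetical helper name.

```shell
# Print every non-comment, non-blank line under a directory, prefixed with its file name.
show_active_settings() {
  # -r: recurse, -H: always print the file name, -v: invert the match,
  # -E: extended regex matching comment-only or blank lines
  grep -rHEv '^[[:space:]]*(#|$)' "$1"
}

# Usage from the kubespray checkout:
# show_active_settings inventory/test/group_vars
```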

Deployment

cd kubespray
pip install -U -r requirements.txt

# Running the command below starts the cluster installation, which takes roughly half an hour
ansible-playbook -i inventory/test/hosts.yaml  --become --become-user=root cluster.yml
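Since the run is long, it may be worth confirming first that the tools are installed and that every node is reachable over ssh. The sketch below is one way to do that; require_commands is a hypothetical helper, while the ping module call in the usage comment is a standard ansible ad-hoc command.

```shell
# Succeed only when every named command is found on PATH.
require_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || {
      echo "missing command: $cmd" >&2
      return 1
    }
  done
}

# Usage on the ansible control node, from the kubespray checkout, before cluster.yml:
# require_commands ansible ansible-playbook &&
#   ansible -i inventory/test/hosts.yaml all -m ping --become --become-user=root
```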

Post-deployment checks

Run the following on any master node.

# List pods that are not running
kubectl get po -A | grep -v "Running"
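The grep above also prints the header row and pods of finished jobs in the Completed state. A slightly stricter filter is sketched below; not_ready_pods is a hypothetical helper, and it assumes the column layout of kubectl get po -A, where STATUS is the fourth column.

```shell
# Keep only pods whose STATUS column is neither Running nor Completed.
# Expects the output of "kubectl get po -A --no-headers" on stdin.
not_ready_pods() {
  awk '$4 != "Running" && $4 != "Completed"'
}

# Usage on a master node:
# kubectl get po -A --no-headers | not_ready_pods
```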

Troubleshooting a failed installation

Enable ansible's debug mode

vim /etc/ansible/ansible.cfg

stdout_callback=debug
stderr_callback=debug

Edit the command of the failing task and append -v 5 to it, as follows:

name: kubeadm | Initialize first master
command: >-
  timeout -k 600s 600s
  {{ bin_dir }}/kubeadm init
  --config={{ kube_config_dir }}/kubeadm-config.yaml
  --ignore-preflight-errors=all
  --skip-phases=addon/coredns
  --upload-certs
  -v 5

Check the kubelet logs

journalctl -u kubelet
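The full kubelet journal can be long. A rough first pass is to keep only lines that mention an error or failure, as in the sketch below; kubelet_errors is a hypothetical helper, and the keyword match is a simple heuristic rather than a parser for the kubelet log format.

```shell
# Keep only log lines that mention "error" or "fail", case-insensitively.
kubelet_errors() {
  grep -iE 'error|fail'
}

# Usage on the affected node:
# journalctl -u kubelet --no-pager | kubelet_errors
```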

#Kubespray #K8s