This guide uses Kubespray to deploy a Kubernetes v1.28.6 cluster named `test` on nodes node1 through node7. The table below describes the machines.

| Machine role         | Hostnames               | Notes                                                                         |
| -------------------- | ----------------------- | ----------------------------------------------------------------------------- |
| ansible control node | node1                   | The machine that runs the ansible commands; it may be inside or outside the cluster. |
| master nodes         | node1 node2 node3       |                                                                               |
| worker nodes         | node4 node5 node6 node7 |                                                                               |
| etcd nodes           | node1 node2 node3       |                                                                               |

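Installation workflow

1. Complete the pre-installation preparation and checks.
2. Build the cluster configuration files: clone the Kubespray repository on node1.
3. Run the deployment command on the ansible control node.
4. Verify the cluster after the deployment finishes.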

## Pre-installation preparation

### ansible node

### Check the python version

The installation scripts currently require a specific python version: `python3.10-3.12`.

```bash
sudo apt update && sudo apt install python3 python3-pip
```
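A quick way to confirm that the interpreter falls inside that range is sketched below. The helper name `python_version_ok` is hypothetical and not part of Kubespray; it only compares the major and minor numbers against the 3.10-3.12 range stated above.

```shell
# Hypothetical helper: succeeds when a "major.minor[.patch]" version is within 3.10-3.12.
python_version_ok() {
  local major minor
  major="${1%%.*}"      # text before the first dot
  minor="${1#*.}"       # text after the first dot
  minor="${minor%%.*}"  # drop a trailing patch component, if any
  [ "$major" -eq 3 ] && [ "$minor" -ge 10 ] && [ "$minor" -le 12 ]
}

# Check the interpreter on this machine, if one is installed.
if command -v python3 >/dev/null 2>&1; then
  ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  if python_version_ok "$ver"; then
    echo "python $ver is supported"
  else
    echo "python $ver is outside 3.10-3.12"
  fi
fi
```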

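Change the PyPI mirror. The ansible version available from the Aliyun mirror that the machine uses by default is too old, so switch to a different mirror.

```bash
python3 -m pip install --upgrade pip
pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```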

### Generate an ssh key

```bash
ssh-keygen -t ed25519
# Press Enter at every prompt, then print the public key
cat ~/.ssh/id_ed25519.pub
# Example output:
# ssh-ed25519 somekey root@node1
```

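### Cluster nodes

Perform the following steps on every node.

### Check systemd-resolved

By default, kubelet reads its DNS configuration from `/run/systemd/resolve/resolv.conf`, so `systemd-resolved` must be running.

Check whether `systemd-resolved` has been started: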

```bash
systemctl status systemd-resolved.service
```

If the status is not running, enable the service at boot and start it immediately:

```bash
systemctl enable systemd-resolved.service  --now
```

If the command reports an error, troubleshoot it before continuing.

### Add the ansible machine's public key

```bash
echo "ssh-ed25519 somekey root@node1" >> /root/.ssh/authorized_keys
```
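Running the `echo` line above more than once leaves duplicate entries. A minimal idempotent variant is sketched below; `add_authorized_key` is a hypothetical helper name, and the key in the usage comment is the same placeholder used above.

```shell
# Hypothetical helper: append a public key to an authorized_keys file only if it is absent.
add_authorized_key() {
  local key="$1" file="$2"
  mkdir -p "$(dirname "$file")"
  touch "$file"
  chmod 600 "$file"
  # -x matches the whole line, -F treats the key as a literal string
  grep -qxF -- "$key" "$file" || printf '%s\n' "$key" >> "$file"
}

# Usage on each cluster node:
#   add_authorized_key "ssh-ed25519 somekey root@node1" /root/.ssh/authorized_keys
```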

### On ubuntu, check `/etc/systemd/resolved.conf`

Check whether `/run/systemd/resolve/resolv.conf` contains both the k8s internal dns server and the dns servers from `/etc/resolv.conf`. The expected result looks like this:

```conf
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 8.8.4.4
nameserver 1.1.1.1
nameserver 8.8.8.8
# Too many DNS servers configured, the following entries may be ignored.
nameserver 169.254.25.10
search default.svc.test.local svc.test.local
```

If it does not, edit `/etc/systemd/resolved.conf`:

```conf
[Resolve]
DNS=8.8.4.4 1.1.1.1 8.8.8.8
#FallbackDNS=
#Domains=
#LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#DNSOverTLS=no
#Cache=no-negative
#DNSStubListener=yes
#ReadEtcHosts=yes

```
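The sample `resolv.conf` shown earlier carries the warning `Too many DNS servers configured` because the glibc resolver only uses the first three `nameserver` entries. As a rough check, the sketch below counts the entries in a resolv.conf-style file; `count_nameservers` is a hypothetical helper, not a systemd tool.

```shell
# Hypothetical helper: print how many nameserver lines a resolv.conf-style file has.
count_nameservers() {
  awk '$1 == "nameserver" { n++ } END { print n + 0 }' "$1"
}

resolv=/run/systemd/resolve/resolv.conf
if [ -r "$resolv" ]; then
  n="$(count_nameservers "$resolv")"
  echo "$resolv lists $n nameserver(s)"
  if [ "$n" -gt 3 ]; then
    echo "entries after the third may be ignored"
  fi
fi
```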

### Check `/etc/resolv.conf`

If `/etc/resolv.conf` contains `nameserver 127.0.0.1` (as opposed to `nameserver 127.0.0.53`), delete that line. A local loopback address configured there causes a dns resolution loop.
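This check can be scripted. The sketch below flags only the exact address `127.0.0.1` and deliberately leaves `127.0.0.53` alone; `has_loopback_nameserver` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the file has a "nameserver 127.0.0.1" line.
has_loopback_nameserver() {
  grep -Eq '^[[:space:]]*nameserver[[:space:]]+127\.0\.0\.1([[:space:]]|$)' "$1"
}

if [ -r /etc/resolv.conf ] && has_loopback_nameserver /etc/resolv.conf; then
  echo "remove 'nameserver 127.0.0.1' from /etc/resolv.conf"
fi
```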

## Build the Kubespray configuration files

```bash
# Copy the sample inventory as the starting point for the `test` cluster
cp -rfp inventory/sample inventory/test
```

Generate the hosts file for the machines to be installed:

```bash
declare -a IPS=(192.168.2.10 192.168.2.11 192.168.2.12 192.168.2.13 192.168.2.14 192.168.2.15 192.168.2.16)
CONFIG_FILE=inventory/test/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
```
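The generated file names the hosts `node1`, `node2`, and so on, in the order the IPs are passed. The sketch below only mimics that naming to make the IP-to-node mapping explicit; `print_hosts` is a hypothetical helper, not part of Kubespray, and the real script also writes the group sections.

```shell
# Hypothetical helper: print inventory-style host entries for the given IPs.
print_hosts() {
  local i=1 ip
  for ip in "$@"; do
    printf '    node%d:\n      ansible_host: %s\n      ip: %s\n      access_ip: %s\n' \
      "$i" "$ip" "$ip" "$ip"
    i=$((i + 1))
  done
}

# First two addresses from the IPS array above
print_hosts 192.168.2.10 192.168.2.11
```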

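The generated file needs to be edited: adjust the node names and the hosts listed in the master (`kube_control_plane`), worker (`kube_node`), and etcd groups.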

```yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.2.10
      ip: 192.168.2.10
      access_ip: 192.168.2.10
    node2:
      ansible_host: 192.168.2.11
      ip: 192.168.2.11
      access_ip: 192.168.2.11
    node3:
      ansible_host: 192.168.2.12
      ip: 192.168.2.12
      access_ip: 192.168.2.12
    node4:
      ansible_host: 192.168.2.13
      ip: 192.168.2.13
      access_ip: 192.168.2.13
    node5:
      ansible_host: 192.168.2.14
      ip: 192.168.2.14
      access_ip: 192.168.2.14
    node6:
      ansible_host: 192.168.2.15
      ip: 192.168.2.15
      access_ip: 192.168.2.15
    node7:
      ansible_host: 192.168.2.16
      ip: 192.168.2.16
      access_ip: 192.168.2.16
  children:
    kube_control_plane:
      hosts:
        node1:
        node2:
        node3:
    kube_node:
      hosts:
        node4:
        node5:
        node6:
        node7:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
        calico_rr:
    calico_rr:
      hosts: {}
```

Configure the installation to download through mirrors. This example uses the DaoCloud mirrors:

```bash
cp inventory/test/group_vars/all/offline.yml inventory/test/group_vars/all/mirror.yml
sed -i -E '/# .*\{\{ files_repo/s/^# //g' inventory/test/group_vars/all/mirror.yml
tee -a inventory/test/group_vars/all/mirror.yml <<EOF
gcr_image_repo: "gcr.m.daocloud.io"
kube_image_repo: "k8s.m.daocloud.io"
docker_image_repo: "docker.m.daocloud.io"
quay_image_repo: "quay.m.daocloud.io"
github_image_repo: "ghcr.m.daocloud.io"
files_repo: "https://files.m.daocloud.io"
EOF
```
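The `sed` expression uncomments only the lines that reference `{{ files_repo }}` and leaves every other commented line untouched. The sketch below applies the same expression to a small throwaway file; the two sample lines are illustrative stand-ins, not the full contents of `offline.yml`.

```shell
# Build a small sample that mimics commented-out entries (illustrative content only).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# kubeadm_download_url: "{{ files_repo }}/dl.k8s.io/release/kubeadm"
# registry_host: "myprivateregistry.com"
EOF

# Same expression as above, without -i so the result goes to stdout.
result="$(sed -E '/# .*\{\{ files_repo/s/^# //g' "$sample")"
printf '%s\n' "$result"
rm -f "$sample"
```

Only the `kubeadm_download_url` line loses its leading `# `; the `registry_host` line stays commented.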

Edit `group_vars/etcd.yml`:

```yaml
etcd_deployment_type: host

```

Edit `group_vars/k8s_cluster/addons.yml`:

```yaml
helm_enabled: true
registry_enabled: true
metrics_server_enabled: true
ingress_nginx_enabled: true
metallb_speaker_enabled: true
```

Edit `group_vars/k8s_cluster/k8s-cluster.yml`:

```yaml
kube_pods_subnet: 10.199.0.0/16
cluster_name: test.local # cluster name, also used as the DNS domain
auto_renew_certificates: true

```
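To get a feel for how much room `kube_pods_subnet: 10.199.0.0/16` provides, the arithmetic can be sketched as below. The per-node prefix of `/24` is an assumption for illustration (check `kube_network_node_prefix` and your network plugin's settings in your inventory); `cidr_size` is a hypothetical helper.

```shell
# Hypothetical helper: number of IPv4 addresses in a CIDR block with the given prefix length.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

pod_prefix=16    # from kube_pods_subnet: 10.199.0.0/16
node_prefix=24   # assumption: size of the block handed to each node

echo "pod addresses:      $(cidr_size "$pod_prefix")"
echo "per-node addresses: $(cidr_size "$node_prefix")"
echo "node blocks:        $(( 1 << (node_prefix - pod_prefix) ))"
```

Whatever range is chosen, it should not overlap the host network (`192.168.2.x` in this guide).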

Review every setting under `inventory/test/group_vars` once more and confirm that it is correct.

## Deploy

```bash
cd kubespray
pip install -U -r requirements.txt

# The command below starts the cluster installation, which takes roughly half an hour
ansible-playbook -i inventory/test/hosts.yaml  --become --become-user=root cluster.yml
```

## Post-deployment checks

Run the following on any master node:

```bash
# List pods that have not started
kubectl get po -A | grep -v "Running"
```
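Filtering with `grep -v "Running"` also prints the header row and pods of finished Jobs in the `Completed` state. A slightly stricter filter is sketched below, assuming the default `kubectl get po -A` column layout in which STATUS is the fourth column; `not_ready_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get po -A` output on stdin and keep only
# rows whose STATUS column is neither Running nor Completed (header skipped).
not_ready_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed"'
}

# Usage on a master node:
#   kubectl get po -A | not_ready_pods
```

Empty output means every pod is either running or has completed.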

## Troubleshooting a failed installation

Enable ansible's debug output:

```bash
vim /etc/ansible/ansible.cfg

stdout_callback=debug
stderr_callback=debug
```

Edit the command of the failing task and append `-v 5` to it, for example:

```yaml
name: kubeadm | Initialize first master
command: >-
  timeout -k 600s 600s {{ bin_dir }}/kubeadm init --config={{ kube_config_dir }}/kubeadm-config.yaml --ignore-preflight-errors=all --skip-phases=addon/coredns --upload-certs -v 5
```

Check the kubelet logs:

```bash
journalctl -u kubelet
```
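The kubelet log is long, so it can help to keep only error-level lines. The sketch below assumes the default journal output format and klog-style prefixes (`E` for error or `F` for fatal, followed by the month and day); `kubelet_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: read journal lines on stdin and keep kubelet
# messages whose klog prefix marks them as error (E) or fatal (F).
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: [EF][0-9]{4} '
}

# Usage on a node:
#   journalctl -u kubelet --no-pager --since "30 minutes ago" | kubelet_errors
```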