Forcing a Kubespray-built k8s cluster up to Ubuntu 22.04

2 years ago

文, 翔

5 minutes

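Introduction

All of the k8s clusters I have built with Kubespray run Ubuntu 20.04.

Support for 20.04 will continue for a while yet, but an upgrade to 22.04 has to be considered.

Fortunately I had a cluster I could experiment on, so I tried upgrading one of its worker nodes.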

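Environment

Kubespray v2.20.0 (Kubernetes v1.24.6) and later support Ubuntu 22.04 as the OS.

This upgrade was carried out in the following environment.

Number of nodes: 4

Preliminary checks

Temporarily cordon the node so that no new Pods are scheduled onto it.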

$ sudo kubectl cordon node4
node/node4 cordoned

Check the status.

$ sudo kubectl get node
NAME    STATUS                     ROLES           AGE    VERSION
node1   Ready                      control-plane   555d   v1.25.6
node2   Ready                      control-plane   555d   v1.25.6
node3   Ready                      <none>          555d   v1.25.6
node4   Ready,SchedulingDisabled   <none>          555d   v1.25.6
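
The same check can be scripted rather than read by eye. The following is a minimal sketch, assuming the default column layout of `kubectl get node` shown above; the function name `cordoned_nodes` is hypothetical.

```shell
#!/bin/sh
# cordoned_nodes: read `kubectl get node` output on stdin and print the
# names of nodes whose STATUS column contains SchedulingDisabled.
# (Hypothetical helper; assumes the default table layout with a header row.)
cordoned_nodes() {
    awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage (commented out so the sketch runs without a cluster):
# sudo kubectl get node | cordoned_nodes
```

Filtering on the STATUS column keeps the helper independent of cluster access, so it can also be applied to saved output.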

In this state, attempting do-release-upgrade on node4 results in an error.

$ sudo do-release-upgrade -d
Checking for a new Ubuntu release
Please install all available updates for your release before upgrading.

$ sudo apt dist-upgrade
...
The following packages have been kept back:
  containerd.io docker-ce docker-ce-cli
0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.

Kubespray keeps these packages on hold.

$ dpkg -l |grep ^h
hi  containerd.io                         1.6.4-1                           amd64        An open and reliable container runtime
hi  docker-ce                             5:20.10.20~3-0~ubuntu-focal       amd64        Docker: the open-source application container engine
hi  docker-ce-cli                         5:20.10.20~3-0~ubuntu-focal       amd64        Docker CLI: the open-source application container engine
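
To collect just the names of the held packages, for example to pass them to apt-mark later, the listing can be filtered. This is a minimal sketch, assuming the `dpkg -l` line format shown above; `held_packages` is a hypothetical helper name. On the node itself, `apt-mark showhold` prints the same information directly.

```shell
#!/bin/sh
# held_packages: read `dpkg -l` output on stdin and print the names of
# packages whose desired state is "hold" (first status letter is "h").
held_packages() {
    awk '$1 ~ /^h/ { print $2 }'
}

# Usage on the node (commented out so the sketch runs anywhere):
# dpkg -l | held_packages
```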

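Releasing the holds lets the work itself proceed, but what to do afterwards also has to be thought through.

Reference

Notes on deploying and upgrading Kubernetes with Kubespray: the containerd.io package being updated in v2.15.1

Since it does not matter if this cluster breaks, I first try simply unholding the packages and upgrading with do-release-upgrade.

Procedure

Cordon the node beforehand, then drain it.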

$ sudo kubectl cordon node4
$ sudo kubectl drain node4 --force --ignore-daemonsets

I did not specify a grace period.
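
If a bounded drain is preferred, the grace period and an overall timeout can be set explicitly. This is a sketch under assumptions, not the procedure used here: the 60-second grace period and 10-minute timeout are arbitrary illustrative values, `drain_node` is a hypothetical wrapper, and `--delete-emptydir-data` is only needed when Pods on the node use emptyDir volumes.

```shell
#!/bin/sh
# drain_node NODE: cordon the node, then evict its Pods with explicit limits.
# KUBECTL can be overridden (e.g. KUBECTL="sudo kubectl"); defaults to kubectl.
# The grace period and timeout below are assumed values, not recommendations.
drain_node() {
    node=$1
    ${KUBECTL:-kubectl} cordon "$node" &&
    ${KUBECTL:-kubectl} drain "$node" \
        --ignore-daemonsets \
        --delete-emptydir-data \
        --force \
        --grace-period=60 \
        --timeout=10m
}

# Usage (commented out so the sketch runs without a cluster):
# KUBECTL="sudo kubectl" drain_node node4
```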

First, release the hold on the three packages.

$ sudo apt-mark unhold containerd.io docker-ce docker-ce-cli
Canceled hold on containerd.io.
Canceled hold on docker-ce.
Canceled hold on docker-ce-cli.

Upgrade all packages.

$ sudo apt update
$ sudo apt dist-upgrade

Reboot the node so that it is running the latest kernel with all packages up to date.

$ sudo shutdown -r now

Now upgrade to Ubuntu 22.04.
GNU screen is started as part of this step, so I log in to node4 over ssh from another terminal and continue the work there.

$ sudo do-release-upgrade -d

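After that, I basically answer 'y' to the prompts, choose 'N' to keep each configuration file as it is, and watch the upgrade run.

I then reboot and see how the node behaves.

After the reboot

Reconfiguring node4 with kubespray

After the node boots, wait a few minutes, then check its status from the control plane and wait for it to become Ready.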

$ kcg node
NAME    STATUS                     ROLES           AGE    VERSION
node1   Ready                      control-plane   555d   v1.25.6
node2   Ready                      control-plane   555d   v1.25.6
node3   Ready                      <none>          555d   v1.25.6
node4   Ready,SchedulingDisabled   <none>          555d   v1.25.6

On the surface, everything appears to be working normally.
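
The waiting step can also be scripted: `kubectl wait` blocks until the node reports the Ready condition. A minimal sketch; the 10-minute default timeout is an assumed value and `wait_node_ready` is a hypothetical wrapper name.

```shell
#!/bin/sh
# wait_node_ready NODE [TIMEOUT]: block until the node's Ready condition is
# true, or fail once the timeout (default 600s, an assumed value) expires.
# KUBECTL can be overridden (e.g. KUBECTL="sudo kubectl").
wait_node_ready() {
    ${KUBECTL:-kubectl} wait --for=condition=Ready "node/$1" --timeout="${2:-600s}"
}

# Usage (commented out so the sketch runs without a cluster):
# KUBECTL="sudo kubectl" wait_node_ready node4 && echo "node4 is Ready"
```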

Next, apply kubespray's upgrade-cluster.yml to node4 only, using --limit. Run as is, this ends in an error, so "--skip-tags=multus" eventually has to be appended to the command.

$ . venv/k8s/bin/activate
(k8s) $ grep kube_version inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml 
kube_version: v1.25.6
(k8s) $ ansible-playbook upgrade-cluster.yml -b -i inventory/mycluster/hosts.yaml -e kube_version=v1.25.6 --limit=node4
...

Running this stops with an error.

...
TASK [kubernetes-apps/network_plugin/multus : Multus | Start resources] ***************************************************
failed: [node4 -> {{ groups['kube_control_plane'][0] }}] (item=None) => {"ansible_loop_var": "item", "changed": false, "item": null, "msg": "Failed to template loop_control.label: 'None' has no attribute 'item'", "skip_reason": "Conditional result was False"}

NO MORE HOSTS LEFT ********************************************************************************************************

PLAY RECAP ****************************************************************************************************************
node4                   : ok=312  changed=23   unreachable=0    failed=1    skipped=451  rescued=0    ignored=0

A similar case has been reported in GitHub Issues. Following Node-based upgrade fails with: "Failed to template loop_control.label: 'None' has no attribute 'item'" #9703, I add --skip-tags=multus and run it again.

(k8s) $ ansible-playbook upgrade-cluster.yml -b -i inventory/mycluster/hosts.yaml -e kube_version=v1.25.6 --limit=node4 --skip-tags=multus
...

This time a different error appears.

...
TASK [container-engine/docker : ensure docker packages are installed] *****************************************************
fatal: [node4]: FAILED! => {"attempts": 4, "cache_update_time": 1678337172, "cache_updated": true, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\"       install 'containerd.io=1.6.4-1' 'docker-ce-cli=5:20.10.20~3-0~ubuntu-jammy' 'docker-ce=5:20.10.20~3-0~ubuntu-jammy'' failed: E: Packages were downgraded and -y was used without --allow-downgrades.\n", "rc": 100, "stderr": "E: Packages were downgraded and -y was used without --allow-downgrades.\n", "stderr_lines": ["E: Packages were downgraded and -y was used without --allow-downgrades."], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nThe following packages were automatically installed and are no longer required:\n  docker-buildx-plugin docker-compose-plugin libpython2-stdlib\n  libpython2.7-minimal libpython2.7-stdlib python2 python2-minimal python2.7\n  python2.7-minimal\nUse 'sudo apt autoremove' to remove them.\nSuggested packages:\n  aufs-tools cgroupfs-mount | cgroup-lite\nThe following packages will be DOWNGRADED:\n  containerd.io docker-ce docker-ce-cli\n0 upgraded, 0 newly installed, 3 downgraded, 0 to remove and 4 not upgraded.\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "The following packages were automatically installed and are no longer required:", "  docker-buildx-plugin docker-compose-plugin libpython2-stdlib", "  libpython2.7-minimal libpython2.7-stdlib python2 python2-minimal python2.7", "  python2.7-minimal", "Use 'sudo apt autoremove' to remove them.", "Suggested packages:", "  aufs-tools cgroupfs-mount | cgroup-lite", "The following packages will be DOWNGRADED:", "  containerd.io docker-ce docker-ce-cli", "0 upgraded, 0 newly installed, 3 downgraded, 0 to remove and 4 not upgraded."]}

This is the same problem I ran into with v2.15.1, described in the reference above, so I add force: true to roles/container-engine/docker/tasks/main.yml and run it again.

## Activate the venv environment
$ . venv/k8s/bin/activate

## Check kube_version
(k8s) $ grep kube_version inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_version: v1.25.6

## Set force: true for the docker package installation
(k8s) $ vi roles/container-engine/docker/tasks/main.yml

## Run with kube_version specified and the --skip-tags=multus option added
(k8s) $ ansible-playbook upgrade-cluster.yml -b -i inventory/mycluster/hosts.yaml -e kube_version=v1.25.6 --limit=node4 --skip-tags=multus

This time it completed successfully.

PLAY RECAP ****************************************************************************************************************
node4                   : ok=451  changed=19   unreachable=0    failed=0    skipped=855  rescued=0    ignored=1
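
When the playbook output has been saved to a log, the recap can be checked mechanically. This is a minimal sketch, assuming the PLAY RECAP line format shown above; `recap_ok` is a hypothetical helper, and the exit status of ansible-playbook itself already signals failure for a live run.

```shell
#!/bin/sh
# recap_ok: read ansible-playbook output on stdin and succeed only if at
# least one recap line was seen and every host has failed=0 and unreachable=0.
recap_ok() {
    awk '/failed=/ {
        seen = 1
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^(failed|unreachable)=/) {
                split($i, kv, "=")
                if (kv[2] + 0 > 0) bad = 1
            }
        }
    }
    END { exit (seen && !bad) ? 0 : 1 }'
}

# Usage (the log file name is an assumption):
# recap_ok < upgrade-node4.log && echo "no failed or unreachable hosts"
```

The helper ignores the `ignored=` counter, so tasks whose errors were deliberately ignored do not count as failures.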

Finally, do not forget to uncordon the node.

$ sudo kubectl uncordon node4

With this, the node now works without problems.

The state of containerd.io, docker-ce and docker-ce-cli on node4 is as follows.

$ dpkg -l containerd.io docker-ce docker-ce-cli
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version                     Architecture Description
+++-==============-===========================-============-========================================================
hi  containerd.io  1.6.4-1                     amd64        An open and reliable container runtime
hi  docker-ce      5:20.10.20~3-0~ubuntu-jammy amd64        Docker: the open-source application container engine
hi  docker-ce-cli  5:20.10.20~3-0~ubuntu-jammy amd64        Docker CLI: the open-source application container engine

The versions are the same as before the upgrade.

The contents of /etc/apt/sources.list.d/ are as follows.

$ ls /etc/apt/sources.list.d
download_docker_com_linux_ubuntu.list  download_docker_com_linux_ubuntu.list.distUpgrade

$ cat /etc/apt/sources.list.d/download_docker_com_linux_ubuntu.list
deb [arch=amd64] https://download.docker.com/linux/ubuntu jammy stable # disabled on upgrade to jammy

$ cat /etc/apt/sources.list.d/download_docker_com_linux_ubuntu.list.distUpgrade 
deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable

Running an upgrade in this state gives the following. A few Docker-related packages are updated, and I installed them as they were.

$ sudo apt dist-upgrade
...
The following packages have been kept back:                                                    
  containerd.io docker-ce docker-ce-cli                                                                     
The following packages will be upgraded:                                               
  docker-buildx-plugin docker-ce-rootless-extras docker-compose-plugin docker-scan-plugin           
4 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.

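If I go on to upgrade all nodes to Ubuntu 22.04 in this way, I intend to look into how to migrate from Docker to containerd.

Dealing with the legacy trusted.gpg keyring

Some time later, I noticed the following warning.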

$ sudo apt update
Hit:1 https://download.docker.com/linux/ubuntu jammy InRelease
...
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.

Following the steps on docker.com sets up the gpg key again.

Install Docker Engine on Ubuntu

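That alone is not enough, though: the key also has to be removed from trusted.gpg.
Since Docker itself does not need to be upgraded, I plan to leave this as it is for now.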
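
For reference, the keyring-based setup in Docker's instructions amounts to storing the key in its own file and pointing the repository entry at it with `signed-by`. The sketch below only builds that entry as a string. The keyring path `/etc/apt/keyrings/docker.gpg` is an assumption (use whatever path the key was actually saved to), and `docker_repo_line` is a hypothetical helper. Kubespray manages this repository file itself, so a manual edit may be overwritten on a later run.

```shell
#!/bin/sh
# docker_repo_line ARCH CODENAME KEYRING: print an apt source entry that
# restricts the Docker repository to the key stored in KEYRING.
docker_repo_line() {
    printf 'deb [arch=%s signed-by=%s] https://download.docker.com/linux/ubuntu %s stable\n' \
        "$1" "$3" "$2"
}

# The keyring path here is an assumed example, not taken from the node.
docker_repo_line amd64 jammy /etc/apt/keyrings/docker.gpg

# After switching to a dedicated keyring, the legacy copy of the key can be
# located with `apt-key list` and removed with `apt-key del KEYID`.
```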

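Upgrading the control plane nodes

After publishing this article, I also upgraded the control plane nodes from Ubuntu 20.04 to 22.04.

The procedure did not differ noticeably from that for the worker node, but the reboot took a very long time. The cause is unknown, but the work took roughly twice as long as I had expected, which left an impression.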

Summary

The work itself did not feel particularly risky, but it takes some time from the reboot until the whole cluster settles down.

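While the node is still cordoned, it seems better not to skip verification steps such as running the kubespray playbook and rebooting the node, even if they feel tedious.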

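In the end, Rook/Ceph was not affected, and because I had already rolled back docker and containerd.io upgrades once before, the work went ahead without much confusion.

I suspect that, particularly in places like Japan, many K8s clusters are simply left alone without upgrades once they are running stably.

Each k8s version has a very short support period, so check the End-of-Life (EOL) information regularly and set up maintenance windows to keep upgrading your k8s clusters.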

https://endoflife.date/kubernetes

That's all.