Installing a 3-Node FATE v1.8.0 Cluster on Kubernetes
创始人
2024-03-12 09:00:39

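Cluster configuration

The configuration of the 3 nodes is shown in the figure below:
[Figure: configuration of the 3 nodes]

When the latest KubeFATE release was 1.9.0, the recommended versions of Kubernetes and ingress-nginx were: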
Recommended version of dependent software:
Kubernetes: v1.23.5
Ingress-nginx: v1.1.3

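Upgrading Kubernetes to 1.23.5

Reference blog posts:
https://blog.csdn.net/RivenDong/article/details/121213109
https://www.cnblogs.com/cloud-yongqing/p/16629666.html
Repeat the following steps several times to upgrade Kubernetes one minor version at a time, from 1.18.x to 1.23.5. The commands below show a single hop; substitute the target version on each pass.

Master node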
yum install -y kubeadm-1.19.16-0 --disableexcludes=kubernetes
kubeadm version
kubectl drain harbor.clife.io --delete-emptydir-data --ignore-daemonsets
kubeadm upgrade plan --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration
kubeadm upgrade apply v1.19.16  --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration
yum install -y kubelet-1.19.16-0 kubectl-1.19.16-0
systemctl daemon-reload
systemctl restart kubelet
kubectl uncordon harbor.clife.io

Node gpu-51
Run on the master node: kubectl drain gpu-51 --ignore-daemonsets
yum install -y kubeadm-1.20.15-0 --disableexcludes=kubernetes
kubeadm upgrade node
yum install -y kubelet-1.20.15-0 kubectl-1.20.15-0 --disableexcludes=kubernetes
systemctl daemon-reload
systemctl restart kubelet
Run on the master node: kubectl uncordon gpu-51
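
Because kubeadm does not support skipping minor versions, it can help to list the intermediate hops before starting. The following is a minimal sketch; the function name upgrade_hops is hypothetical, and it only enumerates minor versions. The patch release to install at each hop (such as 1.19.16 or 1.20.15 in the commands above) still has to be chosen by hand.

```shell
#!/bin/sh
# upgrade_hops FROM_MINOR TO_MINOR
# Prints every Kubernetes minor version (1.x) that must be passed through,
# one per line. Hypothetical helper: it lists hops, it does not upgrade anything.
upgrade_hops() {
  from=$1
  to=$2
  minor=$((from + 1))
  while [ "$minor" -le "$to" ]; do
    echo "1.$minor"
    minor=$((minor + 1))
  done
}

# Hops needed to go from 1.18.x to 1.23.x
upgrade_hops 18 23
```

Each printed version corresponds to one full pass through the master and worker steps above.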

Downloading KubeFATE

Link: link
Package: kubefate-k8s-v1.8.0.tar.gz

All of the following operations are performed on the master node.

Removing the old FATE installation

Check for a previously installed old version of FATE and delete it.
List the namespaces:
kubectl get ns

NAME                              STATUS        AGE
default                           Active        504d
fate-10000                        Active        459d
fate-9999                         Active        459d
ingress-nginx                     Active        465d
istio-system                      Active        497d
kube-fate                         Active        465d
kube-node-lease                   Active        504d
kube-public                       Active        504d
kube-system                       Active        504d
kubernetes-dashboard              Terminating   504d
kubernetes-dashboard2             Active        4d17h
kubesphere-controls-system        Active        489d
kubesphere-monitoring-federated   Active        489d
kubesphere-monitoring-system      Active        489d
minio                             Active        363d
monitoring                        Active        362d
seldon                            Active        159d
seldon-system                     Active        502d

Delete:
kubectl delete namespace fate-10000
kubectl delete namespace fate-9999
kubectl delete namespace kube-fate

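Deploying ingress-nginx

Reference: https://blog.csdn.net/qq_41296573/article/details/125809696
The following deploy.yaml deploys ingress-nginx (version 1.1.3; the latest release at the time was 1.5.0). Downloading it may require a proxy:
https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.3/deploy/static/provider/cloud/deploy.yaml
The file references 2 images that can only be pulled through a proxy. Replace them with mirror images hosted in China (3 occurrences in total):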

k8s.gcr.io/ingress-nginx/controller:v1.1.3@sha256:31f47c1e202b39fadecf822a9b76370bd4baed199a005b3e7d4d1455f4fd3fe2
change to:
registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:v1.1.3

k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660
change to:
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-webhook-certgen:v1.1.1
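
The image substitutions can also be scripted instead of edited by hand. This is a minimal sketch, assuming the image references in deploy.yaml have exactly the form shown above (tag followed by an @sha256 digest); the function name rewrite_ingress_images and the output file name deploy_mirror.yaml are hypothetical.

```shell
#!/bin/sh
# rewrite_ingress_images: reads a manifest on stdin and writes it to stdout
# with the two k8s.gcr.io images replaced by the Aliyun mirror images.
# The tag is kept and the @sha256 digest is dropped, as in the manual edit.
rewrite_ingress_images() {
  mirror='registry.cn-hangzhou.aliyuncs.com/google_containers'
  sed -E \
    -e "s#k8s\.gcr\.io/ingress-nginx/controller:([^@[:space:]]+)@sha256:[0-9a-f]+#${mirror}/nginx-ingress-controller:\1#" \
    -e "s#k8s\.gcr\.io/ingress-nginx/kube-webhook-certgen:([^@[:space:]]+)@sha256:[0-9a-f]+#${mirror}/kube-webhook-certgen:\1#"
}

# Example usage (writes a new file and leaves deploy.yaml untouched):
#   rewrite_ingress_images < deploy.yaml > deploy_mirror.yaml
#   grep 'image:' deploy_mirror.yaml    # confirm all 3 occurrences changed
```

Writing to a separate file keeps the downloaded manifest available for comparison if a pod later fails to pull its image.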

Then deploy ingress-nginx:
kubectl apply -f ./deploy.yaml
Check whether ingress-nginx was deployed successfully:

[root@harbor kubefate]#  kubectl get  pods -n ingress-nginx -o wide
NAME                                        READY   STATUS      RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
ingress-nginx-admission-create-zh96h        0/1     Completed   0          2d23h   10.244.1.26   gpu-51       <none>           <none>
ingress-nginx-admission-patch-hmgr5         0/1     Completed   1          2d23h   10.244.1.27   gpu-51       <none>           <none>
ingress-nginx-controller-6995ffb95b-m87gh   1/1     Running     0          2d18h   172.17.0.8    k8s-node02   <none>           <none>

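As shown, the ingress-nginx controller was scheduled onto node k8s-node02 rather than the master node. This is normal: even when the command is run on the master, the pod may be placed on another node.
Run the following command to check that the configuration took effect: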
kubectl -n ingress-nginx get svc

NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.1.196.14   <pending>     80:30428/TCP,443:30338/TCP   16m
ingress-nginx-controller-admission   ClusterIP      10.1.32.33    <none>        443/TCP                      16m

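The EXTERNAL-IP of ingress-nginx-controller is stuck in the pending state. After some research, I followed this blog post:
Link: link
Set the EXTERNAL-IP of the ingress-nginx-controller service to the IP of node k8s-node02:
kubectl edit -n ingress-nginx service/ingress-nginx-controller
Add externalIPs at roughly the following position: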

spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.1.86.240
  clusterIPs:
  - 10.1.86.240
  externalIPs:
  - 10.6.17.106
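
As a non-interactive alternative to kubectl edit, the same field can be set with kubectl patch. This is a minimal sketch, assuming the service name and namespace used above; the helper external_ip_patch is hypothetical and only builds the patch document. The kubectl call itself is left as a comment so the snippet can be inspected without touching a cluster.

```shell
#!/bin/sh
# external_ip_patch IP
# Prints a JSON patch document that sets spec.externalIPs to a single address.
external_ip_patch() {
  printf '{"spec":{"externalIPs":["%s"]}}\n' "$1"
}

# Show the patch that would be sent
external_ip_patch 10.6.17.106

# To apply it (replace the IP with the address of your own node):
#   kubectl -n ingress-nginx patch service ingress-nginx-controller \
#     -p "$(external_ip_patch 10.6.17.106)"
```

Unlike an interactive edit, the patch can be re-run after the ingress-nginx manifest is re-applied.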

Check again; the EXTERNAL-IP is now set:

[root@harbor kubefate]# kubectl -n ingress-nginx get svc
NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.1.86.240   10.6.17.106   80:31872/TCP,443:32412/TCP   2d23h
ingress-nginx-controller-admission   ClusterIP      10.1.41.173   <none>        443/TCP                      2d23h

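Deploying the KubeFATE service

1. Load the KubeFATE service image
Next, download the KubeFATE service image v1.4.4:
curl -LO https://github.com/FederatedAI/KubeFATE/releases/download/v1.8.0/kubefate-v1.4.4.docker
Note: the release tag in the URL is v1.8.0, while the image file is v1.4.4.
Then load it into the local Docker environment:
docker load < kubefate-v1.4.4.docker
Create a directory:
mkdir /home/FATE_V180
Copy kubefate-k8s-v1.8.0.tar.gz into the new directory and extract it:
tar -zxvf kubefate-k8s-v1.8.0.tar.gz
The extracted directory contains the kubefate executable, which can be moved to a directory on PATH for convenience:
chmod +x ./kubefate && sudo mv ./kubefate /usr/bin
Test whether the kubefate command works: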
kubefate version

* kubefate commandLine version=v1.4.4
* kubefate service connection error, resp.StatusCode=404, error: 

404 - Not Found

404 - Not Found

The error above is expected at this point and will be resolved later.

Apply rbac-config.yaml to create the namespace for the KubeFATE service:
kubectl apply -f ./rbac-config.yaml

Because Docker Hub recently changed its download rate-limit terms (Dockerhub latest limitation), I recommend using the NetEase Cloud (163) image registry in China instead of the official Docker Hub.

1. In kubefate.yaml, change the image federatedai/kubefate:v1.4.4 to hub.c.163.com/federatedai/kubefate:v1.4.4
2. sed 's/mariadb:10/hub.c.163.com\/federatedai\/mariadb:10/g' kubefate.yaml > kubefate_163.yaml
3. sed 's/registry: ""/registry: "hub.c.163.com\/federatedai"/g' cluster.yaml > cluster_163.yaml

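Deploy the KubeFATE service in the kube-fate namespace. The required YAML file is already in the working directory, so simply run kubectl apply:
kubectl apply -f ./kubefate_163.yaml
[Note] If you deleted kubefate and ingress-nginx and are re-running this step, an error may occur. For a fix, see: https://blog.csdn.net/qq_39218530/article/details/115372879

Wait a moment (ten seconds or so), then check whether the KubeFATE service has been deployed:
kubectl get all,ingress -n kube-fate
One possible problem can cause the kubefate pod to crash: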

Startup probe failed: Get "http://10.244.1.34:8080/": dial tcp 10.244.1.34:8080: connect: connection refused

If the output resembles the following (in particular, a kubefate pod whose STATUS is Running), the KubeFATE service has been deployed and is running normally:

[root@harbor kubefate]# kubectl get all,ingress -n kube-fate
NAME                            READY   STATUS                   RESTARTS   AGE
pod/kubefate-5bf485957b-9wltd   0/1     Evicted                  0          2d20h
pod/kubefate-5bf485957b-bh774   0/1     ContainerStatusUnknown   1          3d1h
pod/kubefate-5bf485957b-bs8zc   0/1     Evicted                  0          2d20h
pod/kubefate-5bf485957b-cj7j7   0/1     Evicted                  0          2d20h
pod/kubefate-5bf485957b-hn2xm   0/1     Evicted                  0          2d20h
pod/kubefate-5bf485957b-m4hn6   0/1     Evicted                  0          2d20h
pod/kubefate-5bf485957b-ncbc2   0/1     Evicted                  0          2d20h
pod/kubefate-5bf485957b-tznw6   1/1     Running                  0          2d20h
pod/mariadb-574d4679f8-f5wc2    1/1     Running                  0          2d20h
pod/mariadb-574d4679f8-mw9np    0/1     ContainerStatusUnknown   1          3d1h

NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
service/kubefate   NodePort    10.1.151.34    <none>        8080:30053/TCP   3d1h
service/mariadb    ClusterIP   10.1.150.151   <none>        3306/TCP         3d1h

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kubefate   1/1     1            1           3d1h
deployment.apps/mariadb    1/1     1            1           3d1h

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/kubefate-5bf485957b   1         1         1       3d1h
replicaset.apps/mariadb-574d4679f8    1         1         1       3d1h

NAME                                 CLASS   HOSTS         ADDRESS       PORTS   AGE
ingress.networking.k8s.io/kubefate   nginx   example.com   10.6.17.106   80      3d1h

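Add example.com to the hosts file
Because the KubeFATE service is accessed through the domain example.com (defined in the ingress; change it if needed), the hosts file must be configured on the machine where the kubefate command line runs. Note that the IP to map is that of the machine hosting ingress-nginx, not that of the Kubernetes machine, as described in the ingress-nginx section above. The FATE clusters deployed below also use example.com as their default domain. If your network has a DNS service, you can instead point example.com at the master machine's IP address and skip the hosts file. (Be sure to replace the IP address with your own.)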
sudo -- sh -c "echo \"10.6.17.106 example.com\" >> /etc/hosts"

[root@harbor kubefate]# ping example.com
PING example.com (10.6.17.106) 56(84) bytes of data.
64 bytes from k8s-master (10.6.17.106): icmp_seq=1 ttl=64 time=0.041 ms
64 bytes from k8s-master (10.6.17.106): icmp_seq=2 ttl=64 time=0.054 ms
64 bytes from k8s-master (10.6.17.106): icmp_seq=3 ttl=64 time=0.050 ms
^C
--- example.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.041/0.048/0.054/0.007 ms
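
The echo command above appends a new line every time it is run, so repeated runs leave duplicate entries in /etc/hosts. A minimal sketch of an idempotent variant follows; the function name add_host_entry is hypothetical.

```shell
#!/bin/sh
# add_host_entry FILE IP NAME
# Appends the line "IP NAME" to FILE unless that exact line already exists.
add_host_entry() {
  file=$1
  entry="$2 $3"
  # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string
  grep -qxF "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry" >> "$file"
}

# Example (run as root, or through sudo sh -c, to write /etc/hosts):
#   add_host_entry /etc/hosts 10.6.17.106 example.com
```

Matching the whole line as a fixed string avoids treating the dots in the IP address as regular-expression wildcards.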

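Edit config.yaml with vi. Only serviceurl needs to change: append the mapped port, for example serviceurl: example.com:31872. If you have forgotten the port, look up the NodePort mapped to port 80 again: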
kubectl -n ingress-nginx get svc

NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.1.86.240   <pending>     80:31872/TCP,443:32412/TCP   74m
ingress-nginx-controller-admission   ClusterIP      10.1.41.173   <none>        443/TCP                      74m
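
The mapped port can also be extracted programmatically rather than read off the table. This is a minimal sketch: the hypothetical helper node_port_for parses a PORT(S) value like the one in the listing above. With cluster access, kubectl's jsonpath output can return the same value directly, as shown in the trailing comment.

```shell
#!/bin/sh
# node_port_for PORTS SERVICE_PORT
# PORTS is the PORT(S) column of "kubectl get svc",
# e.g. 80:31872/TCP,443:32412/TCP
# Prints the NodePort mapped to SERVICE_PORT, or nothing if there is no match.
node_port_for() {
  printf '%s\n' "$1" | tr ',' '\n' |
    awk -F'[:/]' -v want="$2" '$1 == want { print $2 }'
}

# NodePort behind service port 80 in the listing above
node_port_for "80:31872/TCP,443:32412/TCP" 80

# Directly against the cluster:
#   kubectl -n ingress-nginx get svc ingress-nginx-controller \
#     -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}'
```

The value printed for port 80 is the one to append to example.com in serviceurl.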

After the change, run kubefate version again; the output is:

[root@harbor kubefate]# kubefate version
* kubefate commandLine version=v1.4.4
* kubefate service version=v1.4.4

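Installing FATE with KubeFATE

Following the earlier plan, we need to install 3 federation parties with IDs 9998, 9999 and 10000. In reality these 3 parties would be fully independent, isolated organizations. To simulate that, we first create a separate namespace for each of them on Kubernetes: fate-9998 for party 9998, fate-9999 for party 9999, and fate-10000 for party 10000.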

kubectl create namespace fate-9998
kubectl create namespace fate-9999
kubectl create namespace fate-10000

The examples directory contains 3 preset examples: /kubefate/examples/party-9998/, /kubefate/examples/party-9999/ and /kubefate/examples/party-10000. Each party's cluster.yaml (for example /kubefate/examples/party-9999/cluster.yaml) can be modified as follows:
party-9998:

name: fate-9998
namespace: fate-9998
chartName: fate
chartVersion: v1.8.0
partyId: 9998
registry: "hub.c.163.com/federatedai"    # switch to a mirror registry in China
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

backend: eggroll

ingress:
  fateboard:
    hosts:
    - name: party9998.fateboard.example.com
  client:
    hosts:
    - name: party9998.notebook.example.com

rollsite:
  type: NodePort
  nodePort: 30081
  partyList:
  - partyId: 10000
    partyIp: 10.6.17.104
    partyPort: 30101
  - partyId: 9999
    partyIp: 10.6.17.106
    partyPort: 30091

python:
  type: NodePort
  httpNodePort: 30087
  grpcNodePort: 30082
  logLevel: INFO

servingIp: 10.6.14.13
servingPort: 30085

party-9999:

name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.8.0
partyId: 9999
registry: "hub.c.163.com/federatedai"
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

backend: eggroll

ingress:
  fateboard:
    hosts:
    - name: party9999.fateboard.example.com
  client:
    hosts:
    - name: party9999.notebook.example.com

rollsite:
  type: NodePort
  nodePort: 30091
  partyList:
  - partyId: 10000
    partyIp: 10.6.17.104
    partyPort: 30101
  - partyId: 9998
    partyIp: 10.6.14.13
    partyPort: 30081

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092
  logLevel: INFO

servingIp: 10.6.17.106
servingPort: 30095

party-10000:

name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.8.0
partyId: 10000
registry: "hub.c.163.com/federatedai"
imageTag: 1.8.0-release
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

backend: eggroll

ingress:
  fateboard:
    hosts:
    - name: party10000.fateboard.example.com
  client:
    hosts:
    - name: party10000.notebook.example.com

rollsite:
  type: NodePort
  nodePort: 30101
  partyList:
  - partyId: 9999
    partyIp: 10.6.17.106
    partyPort: 30091
  - partyId: 9998
    partyIp: 10.6.14.13
    partyPort: 30081

python:
  type: NodePort
  httpNodePort: 30107
  grpcNodePort: 30102
  logLevel: INFO

servingIp: 10.6.17.104
servingPort: 30105

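Installing the FATE clusters
If everything is in order, use kubefate cluster install to deploy the three FATE clusters. (For steps where I hit no pitfalls, simply follow the official instructions.)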

kubefate cluster install -f ./examples/party-10000/cluster10000.yaml
kubefate cluster install -f ./examples/party-9999/cluster9999.yaml
kubefate cluster install -f ./examples/party-9998/cluster9998.yaml

KubeFATE now creates 3 jobs, one to deploy each of the three FATE clusters. View the jobs with kubefate job ls, or watch the cluster status in KubeFATE directly until it becomes Running:

[root@harbor kubefate]# watch kubefate cluster ls
UUID                                    NAME            NAMESPACE       REVISION        STATUS          CHART   ChartVERSION    AGE
7bca70c1-236c-4931-81f8-1350cce579d4    fate-9998       fate-9998       1               Running         fate    v1.8.0          18m
143378db-b84d-4045-8615-11d36335d5b2    fate-9999       fate-9999       0               Creating        fate    v1.8.0          17m
d3e27a39-c8de-4615-96f2-29012f3edc68    fate-10000      fate-10000      0               Creating        fate    v1.8.0          17m

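This step pulls about 10 GB of images from the NetEase Cloud registry, so the first run takes a while depending on your network. Wait patiently until the status becomes Running. To check the download progress, use: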

kubectl get po -n fate-9998
kubectl get po -n fate-9999
kubectl get po -n fate-10000

Once all images have been downloaded, the result looks like this:

[root@harbor kubefate]# kubectl get po -n fate-9998
NAME                             READY   STATUS    RESTARTS   AGE
client-7ccbc89559-rfr2l          1/1     Running   0          20m
clustermanager-fcb86747f-z9vq9   1/1     Running   0          20m
mysql-6d546bd578-r5fl2           1/1     Running   0          20m
nodemanager-0-66dfd58cdc-6z7mc   2/2     Running   0          20m
nodemanager-1-7b7c65c685-fz9bb   2/2     Running   0          20m
python-594cd5c47b-5l88p          2/2     Running   0          20m
rollsite-6b77d9f5f7-ll9sv        1/1     Running   0          20m

Verifying the FATE deployment

From the kubefate cluster ls output above, the cluster ID of fate-9998 is 7bca70c1-236c-4931-81f8-1350cce579d4, that of fate-9999 is 143378db-b84d-4045-8615-11d36335d5b2, and that of fate-10000 is d3e27a39-c8de-4615-96f2-29012f3edc68. Use kubefate cluster describe to query a cluster's access details:

[root@harbor kubefate]# kubefate cluster describe 7bca70c1-236c-4931-81f8-1350cce579d4
UUID            7bca70c1-236c-4931-81f8-1350cce579d4       
Name            fate-9998                                  
NameSpace       fate-9998                                  
ChartName       fate                                       
ChartVersion    v1.8.0                                     
Revision        1                                          
Age             27m                                        
Status          Running                                    
Spec            backend: eggroll
                chartName: fate
                chartVersion: v1.8.0
                imagePullSecrets:
                - name: myregistrykey
                imageTag: 1.8.0-release
                ingress:
                  client:
                    hosts:
                    - name: party9998.notebook.example.com
                  fateboard:
                    hosts:
                    - name: party9998.fateboard.example.com
                ingressClassName: nginx
                istio:
                  enabled: false
                modules:
                - rollsite
                - clustermanager
                - nodemanager
                - mysql
                - python
                - fateboard
                - client
                name: fate-9998
                namespace: fate-9998
                partyId: 9998
                persistence: false
                podSecurityPolicy:
                  enabled: false
                pullPolicy: null
                python:
                  grpcNodePort: 30082
                  httpNodePort: 30087
                  logLevel: INFO
                  type: NodePort
                registry: hub.c.163.com/federatedai
                rollsite:
                  nodePort: 30081
                  partyList:
                  - partyId: 10000
                    partyIp: 10.6.17.104
                    partyPort: 30101
                  - partyId: 9999
                    partyIp: 10.6.17.106
                    partyPort: 30091
                  type: NodePort
                servingIp: 10.6.14.13
                servingPort: 30085

Info            dashboard:
                - party9998.notebook.example.com
                - party9998.fateboard.example.com
                ip: 10.6.17.106
                port: 30081
                status:
                  containers:
                    client: Running
                    clustermanager: Running
                    fateboard: Running
                    mysql: Running
                    nodemanager-0: Running
                    nodemanager-0-eggrollpair: Running
                    nodemanager-1: Running
                    nodemanager-1-eggrollpair: Running
                    python: Running
                    rollsite: Running
                  deployments:
                    client: Available
                    clustermanager: Available
                    mysql: Available
                    nodemanager-0: Available
                    nodemanager-1: Available
                    python: Available
                    rollsite: Available

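In the returned content, Info->dashboard contains:

  1. The Jupyter Notebook address: party9998.notebook.example.com. This is the platform where data scientists carry out modeling and analysis; FATE-Clients is already integrated.
  2. The FATEBoard address: party9998.fateboard.example.com. FATEBoard lets us check the status of the current training job.

Describing fate-10000 in the same way shows that, although the dashboard hostnames differ, the ip is still 10.6.17.106, which is the address of ingress-nginx. So even a request to party10000.fateboard.example.com goes to 10.6.17.106 first, not to 10.6.17.104, the host where fate-10000 runs.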

Configuring host entries on the machine whose browser accesses the FATE clusters

On a Windows machine, add the hostname mappings to C:\WINDOWS\system32\drivers\etc\hosts:

10.6.17.106 party9998.notebook.example.com
10.6.17.106 party9998.fateboard.example.com
10.6.17.106 party9999.notebook.example.com
10.6.17.106 party9999.fateboard.example.com
10.6.17.106 party10000.notebook.example.com
10.6.17.106 party10000.fateboard.example.com
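
The six entries follow a regular pattern (one notebook and one fateboard hostname per party, all pointing at the ingress IP), so they can be generated rather than typed. A minimal sketch, assuming the party<ID>.<ui>.example.com naming used in the cluster.yaml files above; the function name fate_hosts_entries is hypothetical.

```shell
#!/bin/sh
# fate_hosts_entries INGRESS_IP PARTY_ID...
# Prints one hosts-file line per party and UI (notebook, fateboard).
fate_hosts_entries() {
  ip=$1
  shift
  for party in "$@"; do
    for ui in notebook fateboard; do
      printf '%s party%s.%s.example.com\n' "$ip" "$party" "$ui"
    done
  done
}

# Entries for the three parties in this walkthrough
fate_hosts_entries 10.6.17.106 9998 9999 10000
```

On Linux or macOS the output can be appended to /etc/hosts; on Windows it can be pasted into the hosts file named above.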

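Note that all of the hostnames above map to the IP 10.6.17.106.
Open party10000.fateboard.example.com:32415 to log in to the FATEBoard of party 10000. The username and password are shown in the figure below:
[Figure: FATEBoard login username and password]

References: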
https://blog.csdn.net/qq_32202885/article/details/125998028
https://blog.csdn.net/haveanybody/article/details/108253667
