k8s PaaS in Practice: Prometheus Monitoring

Administrator 2022-03-05

Features of Prometheus:

  • A multi-dimensional data model: time series identified by a metric name and key/value pairs
  • A built-in time series database (TSDB)
  • PromQL: a flexible query language that uses the multi-dimensional data for complex queries (a small example follows this list)
  • Time series are collected over HTTP using a pull model
  • Push-style collection is also supported through the Pushgateway component
  • Targets are discovered via service discovery or static configuration
  • Can be used as a data source for Grafana
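
As a quick illustration of the pull model and of PromQL, the queries below ask a Prometheus server for data over its HTTP API. This is a minimal sketch that assumes a reachable Prometheus instance at prometheus.od.com (the one deployed later in this article); the expressions themselves are standard PromQL.

    # "up" is 1 for every target whose last scrape succeeded, 0 otherwise
    curl -s 'http://prometheus.od.com/api/v1/query?query=up'
    
    # per-instance CPU busy rate over the last 5 minutes (node-exporter v0.15 metric name)
    curl -s -G 'http://prometheus.od.com/api/v1/query' \
      --data-urlencode 'query=sum by (instance) (irate(node_cpu{mode!="idle"}[5m]))'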

Official architecture diagram

Prometheus Server: the core component; it pulls metrics from the exporters, stores the monitoring data, and provides a flexible query language (PromQL).

pushgateway: acts as a relay. The Prometheus server only pulls data, but some jobs can only push their data for various reasons; the Pushgateway receives those pushed metrics and exposes them for the Prometheus server to pull. We will not deploy it here.

Exporters/Jobs: collect performance data from the target objects (hosts, containers, ...) and expose it over an HTTP endpoint for the Prometheus server to scrape.

Service Discovery: Prometheus supports many service-discovery mechanisms: files, DNS, Consul, Kubernetes, OpenStack, EC2 and so on. The process itself is simple: through the third party's API, Prometheus queries the list of targets to monitor and then polls those targets for metrics.

Alertmanager: after receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the configured receivers, and sends out the notifications. Common receivers include email, PagerDuty, etc.

Three ways to get a UI

  • Prometheus web UI: built in (not very convenient)
  • Grafana: a good-looking, powerful visualization tool for monitoring metrics
  • API clients: monitoring dashboards you develop yourself

Workflow: the Prometheus server periodically pulls metrics from the configured Exporters/Jobs, from metrics pushed to the Pushgateway, or from other sources. After collecting them it evaluates the configured alert.rules (covered later in this article), records the time series, and/or pushes alerts to Alertmanager.

Deploying Prometheus

Deploying kube-state-metrics

kube-state-metrics is an exporter that collects Kubernetes resource data for Prometheus. It can collect data about most built-in Kubernetes resources, such as Pods, Deployments, Services and so on. It also exposes metrics about itself, mainly the number of resources scraped and the number of scrape errors.

  1. Download the image on the 200 machine:

    [root@hdss7-200 ~]# docker pull quay.io/coreos/kube-state-metrics:v1.5.0
    [root@hdss7-200 ~]# docker images|grep kube-state
    [root@hdss7-200 ~]# docker tag 91599517197a harbor.od.com/public/kube-state-metrics:v1.5.0
    [root@hdss7-200 ~]# docker push harbor.od.com/public/kube-state-metrics:v1.5.0
    
    
  2. On the 200 machine, create the resource manifests in the /data/k8s-yaml/kube-state-metrics directory:

    rbac.yaml

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: kube-state-metrics
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: kube-state-metrics
    rules:
    - apiGroups:
      - ""
      resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
      verbs:
      - list
      - watch
    - apiGroups:
      - policy
      resources:
      - poddisruptionbudgets
      verbs:
      - list
      - watch
    - apiGroups:
      - extensions
      resources:
      - daemonsets
      - deployments
      - replicasets
      verbs:
      - list
      - watch
    - apiGroups:
      - apps
      resources:
      - statefulsets
      verbs:
      - list
      - watch
    - apiGroups:
      - batch
      resources:
      - cronjobs
      - jobs
      verbs:
      - list
      - watch
    - apiGroups:
      - autoscaling
      resources:
      - horizontalpodautoscalers
      verbs:
      - list
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: kube-state-metrics
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kube-state-metrics
    subjects:
    - kind: ServiceAccount
      name: kube-state-metrics
      namespace: kube-system
    

    dp.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "2"
      labels:
        grafanak8sapp: "true"
        app: kube-state-metrics
      name: kube-state-metrics
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          grafanak8sapp: "true"
          app: kube-state-metrics
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          labels:
            grafanak8sapp: "true"
            app: kube-state-metrics
        spec:
          containers:
          - name: kube-state-metrics
            image: harbor.od.com/public/kube-state-metrics:v1.5.0
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 8080
              name: http-metrics
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              httpGet:
                path: /healthz
                port: 8080
                scheme: HTTP
              initialDelaySeconds: 5
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 5
          serviceAccountName: kube-state-metrics
    
  3. Apply the manifests and verify that kube-state-metrics started correctly:

    [root@hdss7-21 ~]# for i in rbac dp ; do kubectl apply -f http://k8s-yaml.od.com/kube-state-metrics/$i.yaml;done
    
    # query the pod's IP address first, then probe the endpoints (see the sketch below)
    [root@hdss7-21 ~]# curl 172.7.21.5:8080/healthz
    [root@hdss7-21 ~]# curl 172.7.21.5:8080/metrics
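
The manifests above never show how the pod IP used in the curl commands (172.7.21.5) was obtained. A minimal sketch, assuming the pod carries the labels from dp.yaml:

    # find the kube-state-metrics pod and its IP
    kubectl -n kube-system get pod -l app=kube-state-metrics -o wide
    # spot-check one of the metrics it exposes, e.g. Deployment replica counts
    curl -s 172.7.21.5:8080/metrics | grep kube_deployment_status_replicas | head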
    

Deploying node-exporter

node-exporter monitors host-level resources on the worker (compute) nodes, so it must be deployed on every worker node.

node-exporter official Docker Hub page

  1. Download the image on the 200 machine:

    [root@hdss7-21 ~]# docker pull prom/node-exporter:v0.15.0
    v0.15.0: Pulling from prom/node-exporter
    
    [root@hdss7-21 ~]# docker images|grep node-exporter
    
    [root@hdss7-21 ~]# docker tag 12d51ffa2b22 harbor.od.com/public/node-exporter:v0.15.0
    [root@hdss7-21 ~]# docker push harbor.od.com/public/node-exporter:v0.15.0
    
  2. Prepare the resource manifest in the /data/k8s-yaml/node-exporter directory and apply it on a node:

    ds.yaml

    kind: DaemonSet
    apiVersion: apps/v1
    metadata:
      name: node-exporter
      namespace: kube-system
      labels:
        daemon: "node-exporter"
        grafanak8sapp: "true"
    spec:
      selector:
        matchLabels:
          daemon: "node-exporter"
          grafanak8sapp: "true"
      template:
        metadata:
          name: node-exporter
          labels:
            daemon: "node-exporter"
            grafanak8sapp: "true"
        spec:
          volumes:
          - name: proc
            hostPath:
              path: /proc
              type: ""
          - name: sys
            hostPath: 
              path: /sys
              type: ""
          containers:
          - name: node-exporter
            image: harbor.od.com/public/node-exporter:v0.15.0
            imagePullPolicy: IfNotPresent
            args:
            - --path.procfs=/host_proc
            - --path.sysfs=/host_sys
            ports:
            - name: node-exporter
              hostPort: 9100
              containerPort: 9100
              protocol: TCP
            volumeMounts:
            - name: sys
              readOnly: true
              mountPath: /host_sys
            - name: proc
              readOnly: true
              mountPath: /host_proc
          hostNetwork: true
    
    # check port 9100 on the node; nothing should be listening on it yet
    [root@hdss7-21 ~]# netstat -tlnp|grep 9100
    
    [root@hdss7-21 ~]# kubectl apply -f http://k8s-yaml.od.com/node-exporter/ds.yaml
    
    # check the port again
    [root@hdss7-21 ~]# netstat -tlnp|grep 9100
    tcp6       0      0 :::9100                 :::*                    LISTEN      34179/node_exporter
    [root@hdss7-21 ~]# curl localhost:9100
    # look at the metrics node-exporter exposes
    [root@hdss7-21 ~]# curl localhost:9100/metrics
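
To confirm that host metrics are really being collected (and not just that the HTTP endpoint answers), it helps to filter for a couple of well-known node-exporter series. A minimal sketch, using the v0.15 metric names that the alert rules later in this article also rely on:

    # CPU time counters per mode
    curl -s localhost:9100/metrics | grep '^node_cpu' | head
    # filesystem sizes
    curl -s localhost:9100/metrics | grep '^node_filesystem_size' | head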
    

Deploying cadvisor

cadvisor monitors the resources used inside containers. cadvisor official Docker Hub image

  1. Download the image on the 200 machine:

    [root@hdss7-200 node-exporter]# docker pull google/cadvisor:v0.28.3
    
    [root@hdss7-200 node-exporter]# docker images|grep cadvisor
    
    [root@hdss7-200 node-exporter]# docker tag 75f88e3ec33 harbor.od.com/public/cadvisor:v0.28.3
    [root@hdss7-200 node-exporter]# docker push harbor.od.com/public/cadvisor:v0.28.3
    
  2. Create the resource manifest in the /data/k8s-yaml/cadvisor directory and apply it on a node:

    ds.yaml

    kind: DaemonSet
    apiVersion: apps/v1
    metadata:
      name: cadvisor
      namespace: kube-system
      labels:
        app: cadvisor
    spec:
      selector:
        matchLabels:
          name: cadvisor
      template:
        metadata:
          labels:
            name: cadvisor
        spec:
          hostNetwork: true
          tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoExecute
          containers:
          - name: cadvisor
            image: harbor.od.com/public/cadvisor:v0.28.3
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: var-run
              mountPath: /var/run
              readOnly: true
            - name: docker
              mountPath: /var/lib/docker
              readOnly: true
            ports:
              - name: http
                containerPort: 4194
                protocol: TCP
            readinessProbe:
              tcpSocket:
                port: 4194
              initialDelaySeconds: 5
              periodSeconds: 10
            args:
              - --housekeeping_interval=10s
              - --port=4194
          terminationGracePeriodSeconds: 30
          volumes:
          - name: rootfs
            hostPath:
              path: /
          - name: var-run
            hostPath:
              path: /var/run
          - name: sys
            hostPath:
              path: /sys
          - name: docker
            hostPath:
              path: /data/docker
    

Adjusting Kubernetes scheduling

Three ways to influence the Kubernetes scheduler:

  • Taints and tolerations:

Taint: a taint placed on a compute node (e.g. kubectl taint nodes node1 key1=value1:NoSchedule); a node can carry multiple taints.

Toleration: whether a Pod can tolerate a given taint; see the Kubernetes documentation.

  • nodeName: run the Pod on a specific node;

  • nodeSelector: use a label selector to run the Pod on a specific class of nodes (a few example commands follow).
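
The commands below are a minimal sketch of working with taints and node labels from the command line (nodeName is simply set in the Pod spec); the taint key and label are only examples:

    # taint a node, then remove the taint again with the trailing minus
    kubectl taint node hdss7-21.host.com quarantine=true:NoSchedule
    kubectl taint node hdss7-21.host.com quarantine-
    
    # label a node so that a Pod whose nodeSelector asks for disktype=ssd lands on it
    kubectl label node hdss7-21.host.com disktype=ssd
    kubectl get nodes --show-labels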

How to configure a taint:

  1. Put a taint on the 21 machine by running the following command from the 22 machine:

    [root@hdss7-22 ~]# kubectl taint node hdss7-21.host.com node-role.kubernetes.io/master=master:NoSchedule
    node/hdss7-21.host.com tainted
    

    image-20220301145047349

  2. On machines 21 and 22, remount cgroup read-write and create the symlink:

    ~]# mount -o remount,rw /sys/fs/cgroup/
    ~]# ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
    ~]# ls -l /sys/fs/cgroup/
    
  3. Apply the resource manifest on the 22 machine:

    [root@hdss7-22 ~]# kubectl apply -f http://k8s-yaml.od.com/cadvisor/ds.yaml
    daemonset.apps/cadvisor created
    [root@hdss7-22 ~]# kubectl get pod -n kube-system -o wide
    NAME                                 READY   STATUS    RESTARTS   AGE     IP           NODE                NOMINATED NODE   READINESS GATES
    cadvisor-xcrd7                       0/1     Running   0          3s      10.4.7.22    hdss7-22.host.com   <none>           <none>
    
    

    Only the 22 machine is running the pod.

  4. Remove the taint from the 21 machine:

    [root@hdss7-22 ~]# kubectl taint node hdss7-21.host.com node-role.kubernetes.io/master-
    node/hdss7-21.host.com untainted
    

    image-20220301145534947

    image-20220301145603756

  5. Modify the cadvisor DaemonSet:

    image-20220301145728082
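
Once the DaemonSet pods are running on both nodes, the cadvisor endpoint can be spot-checked directly on a node; it uses hostNetwork and the port 4194 configured in ds.yaml. A minimal sketch:

    netstat -tlnp | grep 4194
    # container CPU counters are among the core cadvisor metrics
    curl -s localhost:4194/metrics | grep '^container_cpu_usage_seconds_total' | head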

Deploying blackbox-exporter

blackbox-exporter monitors the liveness of business containers:

  1. Download the image on the 200 machine:

    [root@hdss7-200 k8s-yaml]# docker pull prom/blackbox-exporter:v0.15.1
    
    [root@hdss7-200 k8s-yaml]# docker images|grep blackbox-exporter
    
    [root@hdss7-200 k8s-yaml]# docker tag 81b70b6158be harbor.od.com/public/blackbox-exporter:v0.15.1
    [root@hdss7-200 k8s-yaml]# docker push harbor.od.com/public/blackbox-exporter:v0.15.1
    
  2. Prepare the resource manifests in the /data/k8s-yaml/blackbox-exporter directory and apply them on a node:

    cm.yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      labels:
        app: blackbox-exporter
      name: blackbox-exporter
      namespace: kube-system
    data:
      blackbox.yaml: |-
        modules:
          http_2xx:
            prober: http
            timeout: 2s
            http:
              valid_http_versions: ["HTTP/1.1", "HTTP/2"]
              valid_status_codes: [200,301,302]
              method: GET
              preferred_ip_protocol: "ip4"
          tcp_connect:
            prober: tcp
            timeout: 2s
    

    dp.yaml

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: blackbox-exporter
      namespace: kube-system
      labels:
        app: blackbox-exporter
      annotations:
        deployment.kubernetes.io/revision: 1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: blackbox-exporter
      template:
        metadata:
          labels:
            app: blackbox-exporter
        spec:
          volumes:
          - name: config
            configMap:
              name: blackbox-exporter
              defaultMode: 420
          containers:
          - name: blackbox-exporter
            image: harbor.od.com/public/blackbox-exporter:v0.15.1
            imagePullPolicy: IfNotPresent
            args:
            - --config.file=/etc/blackbox_exporter/blackbox.yaml
            - --log.level=info
            - --web.listen-address=:9115
            ports:
            - name: blackbox-port
              containerPort: 9115
              protocol: TCP
            resources:
              limits:
                cpu: 200m
                memory: 256Mi
              requests:
                cpu: 100m
                memory: 50Mi
            volumeMounts:
            - name: config
              mountPath: /etc/blackbox_exporter
            readinessProbe:
              tcpSocket:
                port: 9115
              initialDelaySeconds: 5
              timeoutSeconds: 5
              periodSeconds: 10
              successThreshold: 1
              failureThreshold: 3
    

    svc.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: blackbox-exporter
      namespace: kube-system
    spec:
      selector:
        app: blackbox-exporter
      ports:
        - name: blackbox-port
          protocol: TCP
          port: 9115
    

    ingress.yaml

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: blackbox-exporter
      namespace: kube-system
    spec:
      rules: 
      - host: blackbox.od.com
        http:
          paths:
          - path: /
            backend:
              serviceName: blackbox-exporter
              servicePort: blackbox-port
    
    # apply them in one go
    [root@hdss7-21 ~]# for i in cm dp svc ingress;do kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/$i.yaml;done
    
  3. Configure DNS:

    2021121017 ; serial
    blackbox           A    10.4.7.10
    

    image-20220301161358709
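
With the ingress and DNS record in place, the exporter can be exercised by hand through the same /probe endpoint that Prometheus will call later. A minimal sketch (the probe targets are only examples):

    # probe an HTTP target through the http_2xx module defined in cm.yaml
    curl -s 'http://blackbox.od.com/probe?module=http_2xx&target=http://www.baidu.com'
    # probe a TCP port through the tcp_connect module
    curl -s 'http://blackbox.od.com/probe?module=tcp_connect&target=10.4.7.11:22'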

Deploying Prometheus server

The Prometheus server pulls metrics from the exporters, stores them, and provides a flexible query language (PromQL). Official prometheus Docker image page

  1. Prepare the image on the 200 machine:

    [root@hdss7-200 blackbox-exporter]# docker pull prom/prometheus:v2.14.0
    
    [root@hdss7-200 blackbox-exporter]# docker images|grep prometheus
    
    [root@hdss7-200 blackbox-exporter]# docker tag 7317640d555e harbor.od.com/infra/prometheus:v2.14.0
    [root@hdss7-200 blackbox-exporter]#  docker push harbor.od.com/infra/prometheus:v2.14.0
    
  2. Prepare the resource manifests in the /data/k8s-yaml/prometheus directory:

    rbac.yaml

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: prometheus
      namespace: infra
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: prometheus
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
    - nonResourceURLs:
      - /metrics
      verbs:
      - get
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: infra
    

    dp.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "5"
      labels:
        name: prometheus
      name: prometheus
      namespace: infra
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 7
      selector:
        matchLabels:
          app: prometheus
      strategy:
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          containers:
          - name: prometheus
            image: harbor.od.com/infra/prometheus:v2.14.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/prometheus
            args:
            - --config.file=/data/etc/prometheus.yml
            - --storage.tsdb.path=/data/prom-db
            - --storage.tsdb.min-block-duration=10m
            - --storage.tsdb.retention=72h
            ports:
            - containerPort: 9090
              protocol: TCP
            volumeMounts:
            - mountPath: /data
              name: data
            resources:
              requests:
                cpu: "1000m"
                memory: "1.5Gi"
              limits:
                cpu: "2000m"
                memory: "3Gi"
          imagePullSecrets:
          - name: harbor
          securityContext:
            runAsUser: 0
          serviceAccountName: prometheus
          volumes:
          - name: data
            nfs:
              server: hdss7-200.host.com
              path: /data/nfs-volume/prometheus
    

    svc.yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
      namespace: infra
    spec:
      ports:
      - port: 9090
        protocol: TCP
        targetPort: 9090
      selector:
        app: prometheus
    

    ingress.yaml

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      annotations:
        kubernetes.io/ingress.class: traefik
      name: prometheus
      namespace: infra
    spec:
      rules:
      - host: prometheus.od.com
        http:
          paths:
          - path: /
            backend:
              serviceName: prometheus
              servicePort: 9090
    
  3. Configure DNS:

    2021121018 ; serial
    prometheus         A    10.4.7.10
    
  4. Configure NFS:

    [root@hdss7-200 prometheus]# mkdir /data/nfs-volume/prometheus
    [root@hdss7-200 prometheus]# cd !$
    cd /data/nfs-volume/prometheus
    [root@hdss7-200 prometheus]# mkdir etc prom-db
    [root@hdss7-200 prometheus]# cd etc/
    [root@hdss7-200 etc]# cp /opt/certs/ca.pem .
    [root@hdss7-200 etc]# cp -a /opt/certs/client.pem .
    [root@hdss7-200 etc]# cp -a /opt/certs/client-key.pem .
    

    Edit the prometheus.yml configuration file:

    global:
      scrape_interval:     15s
      evaluation_interval: 15s
    scrape_configs:
    - job_name: 'etcd'
      tls_config:
        ca_file: /data/etc/ca.pem
        cert_file: /data/etc/client.pem
        key_file: /data/etc/client-key.pem
      scheme: https
      static_configs:
      - targets:
        - '10.4.7.12:2379'
        - '10.4.7.21:2379'
        - '10.4.7.22:2379'
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-kubelet'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10255
    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:4194
    - job_name: 'kubernetes-kube-state'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
        regex: .*true.*
        action: keep
      - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
        regex: 'node-exporter;(.*)'
        action: replace
        target_label: nodename
    - job_name: 'blackbox_http_pod_probe'
      metrics_path: /probe
      kubernetes_sd_configs:
      - role: pod
      params:
        module: [http_2xx]
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
        action: keep
        regex: http
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port,  __meta_kubernetes_pod_annotation_blackbox_path]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+);(.+)
        replacement: $1:$2$3
        target_label: __param_target
      - action: replace
        target_label: __address__
        replacement: blackbox-exporter.kube-system:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'blackbox_tcp_pod_probe'
      metrics_path: /probe
      kubernetes_sd_configs:
      - role: pod
      params:
        module: [tcp_connect]
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
        action: keep
        regex: tcp
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __param_target
      - action: replace
        target_label: __address__
        replacement: blackbox-exporter.kube-system:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'traefik'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: keep
        regex: traefik
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    
  5. Apply the manifests:

    # apply them in one go
    [root@hdss7-21 ~]# for i in rbac dp svc ingress;do kubectl apply -f http://k8s-yaml.od.com/prometheus/$i.yaml;done
    

    image-20220301172553519

    image-20220301172616712
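
After prometheus.od.com resolves and the pod is Running, the scrape jobs defined in prometheus.yml can be verified either on the Status > Targets page or through the HTTP API. A minimal sketch:

    # count discovered targets by health
    curl -s http://prometheus.od.com/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c
    # run an ad-hoc query against the etcd job
    curl -s -G 'http://prometheus.od.com/api/v1/query' --data-urlencode 'query=up{job="etcd"}'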

Prometheus configuration

Monitoring business containers

  1. Configure traefik: edit the traefik DaemonSet and add the following annotations:

    "annotations": {
      "prometheus_io_scheme": "traefik",
      "prometheus_io_path": "/metrics",
      "prometheus_io_port": "8080"
    }
    

    image-20220301213457467

    Then delete the two traefik pods so they are recreated:

    image-20220301213618538

  2. Check the traefik pods on the 22 machine; if they do not come up, force-delete them from the command line:

    ~]# kubectl get pods -n kube-system
    ~]# kubectl delete pods traefik-ingress-g26kw -n kube-system --force --grace-period=0
    
  3. Visit Prometheus again; the traefik job now shows 2/2 targets up:

    image-20220301213826361

  4. Since Apollo was shut down earlier, start a dubbo-service that does not use Apollo: find the dubbo-service image tagged master in Harbor:

    image-20220301214121895

  5. Modify the dubbo-service Deployment manifest, then scale it to 1 to start the pod:

    image-20220301214301548

    image-20220301215144384

  6. To monitor dubbo-service liveness, add the following to the dubbo-service Deployment manifest:

    "annotations": {
      "blackbox_port": "20880",
      "blackbox_scheme": "tcp"
    }
    

    image-20220301215425699

  7. Refresh Prometheus again; the business service has been discovered automatically:

    image-20220301215737270

    image-20220301215807603

  8. Likewise, add dubbo-consumer to monitoring: switch the dubbo-consumer Deployment to the non-Apollo image and add the annotations:

    "annotations":{
      "blackbox_path": "/hello?name=health",
      "blackbox_port": "8080",
      "blackbox_scheme": "http"
    }
    

    image-20220301221132702

  9. Check Prometheus; the service is discovered automatically:

    image-20220301221222984

    image-20220301221254939
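
The screenshots above add the annotations through the dashboard; the same change can be made from the command line with kubectl patch. A minimal sketch for the TCP probe, assuming the Deployment is named dubbo-demo-service and lives in the app namespace as in the earlier chapters:

    # add the blackbox annotations to the Pod template (this triggers a rolling update)
    kubectl -n app patch deployment dubbo-demo-service --patch '
    spec:
      template:
        metadata:
          annotations:
            blackbox_port: "20880"
            blackbox_scheme: "tcp"'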

Deploying Grafana

Grafana replaces the native Prometheus UI.

  1. Prepare the image on the 200 machine:

    [root@hdss7-200 ~]# docker pull grafana/grafana:5.4.2
    [root@hdss7-200 ~]# docker images|grep grafana
    [root@hdss7-200 ~]#  docker tag 6f18ddf9e552 harbor.od.com/infra/grafana:v5.4.2
    [root@hdss7-200 ~]# docker push harbor.od.com/infra/grafana:v5.4.2
    
  2. Create the resource manifests in the /data/k8s-yaml/grafana directory:

    rbac.yaml

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-server: "true"
      name: grafana
    rules:
    - apiGroups:
      - "*"
      resources:
      - namespaces
      - deployments
      - pods
      verbs:
      - get
      - list
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: grafana
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: grafana
    subjects:
    - kind: User
      name: k8s-node
    

    dp.yaml

    # https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: infra
      labels:
        app: grafana
        name: grafana
    spec:
      progressDeadlineSeconds: 600
      revisionHistoryLimit: 7
      selector:
        matchLabels:
          app: grafana
      replicas: 1
      strategy:
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: grafana
            name: grafana
        spec:
          containers:
            - name: grafana
              image: harbor.od.com/infra/grafana:v5.4.2
              imagePullPolicy: IfNotPresent
              ports:
                - containerPort: 3000
                  protocol: TCP
              volumeMounts:
                - name: data
                  mountPath: /var/lib/grafana
          imagePullSecrets:
            - name: harbor
          securityContext:
            runAsUser: 0
          volumes:
            - nfs:
                path: /data/nfs-volume/grafana
                server: hdss7-200.host.com
              name: data
    ---
    

    svc.yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: infra
    spec:
      ports:
        - port: 3000
          protocol: TCP
          targetPort: 3000
      selector:
        app: grafana
    

    ingress.yaml

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: grafana
      namespace: infra
    spec:
      rules:
      - host: grafana.od.com
        http:
          paths:
          - path: /
            backend:
              serviceName: grafana
              servicePort: 3000
    
  3. Configure DNS:

    2021121019 ; serial
    grafana            A    10.4.7.10
    
  4. Apply the manifests:

    # apply them in one go
    [root@hdss7-22 ~]# for i in rbac dp svc ingress;do kubectl apply -f http://k8s-yaml.od.com/grafana/$i.yaml;done
    
  5. Visit grafana.od.com; the default username and password are both admin. Change the password to admin123.

Configuring Grafana

  1. Change the basic Grafana settings as follows:

    image-20220302001852877

  2. Install the plugins: exec into the grafana container and run the install commands:

    [root@hdss7-22 ~]# kubectl exec -it grafana-58f5f84b59-8vc7t -n infra bash
    
    # plugin 1: Kubernetes App
    grafana-cli plugins install grafana-kubernetes-app
    # plugin 2: Clock Panel
    grafana-cli plugins install grafana-clock-panel
    # plugin 3: Pie Chart
    grafana-cli plugins install grafana-piechart-panel
    # plugin 4: D3 Gauge
    grafana-cli plugins install briangann-gauge-panel
    # plugin 5: Discrete
    grafana-cli plugins install natel-discrete-panel
    

    On the 200 machine, go to /data/nfs-volume/grafana/plugins and check the plugins:

    [root@hdss7-200 plugins]# ll
    total 0
    drwxr-xr-x 4 root root 253 Mar  2 00:23 briangann-gauge-panel
    drwxr-xr-x 3 root root 155 Mar  2 00:22 grafana-clock-panel
    drwxr-xr-x 4 root root 198 Mar  2 00:21 grafana-kubernetes-app
    drwxr-xr-x 4 root root 277 Mar  2 00:22 grafana-piechart-panel
    drwxr-xr-x 5 root root 216 Mar  2 00:23 natel-discrete-panel
    
  3. Delete the grafana pod so it restarts; after it is back, open Grafana again and add a data source:

    # fill in these parameters:
    URL:http://prometheus.od.com
    TLS Client Auth✔    With CA Cert✔
    

    image-20220302002852105

    # fetch the certificate contents on the 200 machine and paste them into Grafana
    ~]# cat /opt/certs/ca.pem
    ~]# cat /opt/certs/client.pem
    ~]# cat /opt/certs/client-key.pem
    

    image-20220302003107128

  4. Configure the Kubernetes plugin:

    image-20220302003252515

    image-20220302003305402

    image-20220302003338930

    # fill in these parameters:
    Name:myk8s
    URL:https://10.4.7.10:7443
    Access:Server
    TLS Client Auth✔    With CA Cert✔
    

    image-20220302003525974

    image-20220302003547509

  5. Delete Grafana's built-in dashboards:

    image-20220302211302668

    image-20220302211315885

    Click Cluster, Container, Deployment and Node in turn, open the settings of each, and delete the dashboard:

    image-20220302211422829

    image-20220302211445932

  6. Click Home, then Import dashboard:

    image-20220302211653938

    image-20220302211710027

  7. Go to the Grafana website, find a dashboard published by someone else, note its ID and enter it on the Import page; or simply import all the JSON files provided at k8s_PaaS/第七章——Promtheus监控k8s企业家应用.md at master · ben1234560/k8s_PaaS (github.com):

    image-20220302212039073

    image-20220302212055246

    image-20220302212123567

Adding dubbo data to Grafana

  1. Edit the dubbo-service Deployment manifest and add the following:

      "prometheus_io_scrape": "true",
      "prometheus_io_port": "12346",
      "prometheus_io_path": "/"
    

    image-20220302172951569

  2. Add the same to dubbo-consumer:

     "prometheus_io_scrape": "true",
      "prometheus_io_port": "12346",
      "prometheus_io_path": "/"
    

    image-20220302173013302

  3. When this is done, the JMX data shows up after a refresh:

    image-20220302213457720
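
The prometheus_io_port annotation points at the jmx_javaagent that the jre/tomcat base images start on port 12346 (the tomcat entrypoint.sh later in this article shows the same pattern). It can be checked by hand; a minimal sketch, where 172.7.21.6 stands in for the real pod IP:

    # the JMX exporter serves Prometheus text format on the annotated port and path
    curl -s http://172.7.21.6:12346/ | grep '^jvm_' | head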

Deploying and configuring Alertmanager

Deploying Alertmanager

After receiving alerts from the Prometheus server, Alertmanager deduplicates and groups them, routes them to the configured receivers, and sends out the notifications. Common receivers include email, PagerDuty, etc.

  1. Prepare the image on the 200 machine:

    [root@hdss7-200 k8s-yaml]# docker pull docker.io/prom/alertmanager:v0.14.0
    
    [root@hdss7-200 k8s-yaml]# docker images|grep alert
    
    [root@hdss7-200 k8s-yaml]# docker tag 23744b2d645c harbor.od.com/infra/alertmanager:v0.14.0
    [root@hdss7-200 k8s-yaml]#  docker push harbor.od.com/infra/alertmanager:v0.14.0
    
  2. Prepare the resource manifests in the /data/k8s-yaml/alertmanager directory:

    cm.yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: alertmanager-config
      namespace: infra
    data:
      config.yml: |-
        global:
          # how long with no matching firing alert before it is declared resolved
          resolve_timeout: 5m
          # e-mail sending settings
          # change these to match your own mailbox
          smtp_smarthost: 'smtp.163.com:25'
          smtp_from: 'xxx@163.com'
          smtp_auth_username: 'xxx@163.com'
          smtp_auth_password: 'xxx'
          smtp_require_tls: false
        # the root route that every incoming alert enters; it defines how alerts are dispatched
        route:
          # labels used to regroup incoming alerts; for example, alerts carrying cluster=A and alertname=LatencyHigh are aggregated into one group
          group_by: ['alertname', 'cluster']
          # after a new alert group is created, wait at least group_wait before the first notification, so that multiple alerts for the same group can be sent together
          group_wait: 30s
    
          # after the first notification, wait group_interval before sending notifications for new alerts added to the group
          group_interval: 5m
    
          # if an alert has already been sent successfully, wait repeat_interval before sending it again
          repeat_interval: 5m
    
          # default receiver: alerts that match no route are sent to it
          receiver: default
    
        receivers:
        - name: 'default'
          email_configs:
          - to: 'evobot@foxmail.com'
            send_resolved: true
    

    dp.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: alertmanager
      namespace: infra
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: alertmanager
      template:
        metadata:
          labels:
            app: alertmanager
        spec:
          containers:
            - name: alertmanager
              image: harbor.od.com/infra/alertmanager:v0.14.0
              args:
                - "--config.file=/etc/alertmanager/config.yml"
                - "--storage.path=/alertmanager"
              ports:
                - name: alertmanager
                  containerPort: 9093
              volumeMounts:
              - name: alertmanager-cm
                mountPath: /etc/alertmanager
          volumes:
            - name: alertmanager-cm
              configMap:
                name: alertmanager-config
          imagePullSecrets:
            - name: harbor
    

    svc.yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager
      namespace: infra
    spec:
      selector:
        app: alertmanager
      ports:
        - port: 80
          targetPort: 9093
    
    # apply the manifests in one go
    for i in cm dp svc; do kubectl apply -f http://k8s-yaml.od.com/alertmanager/$i.yaml;done
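
To confirm that Alertmanager started with the mounted configuration, its v1 status API can be queried; since the Service is ClusterIP-only, this sketch goes through a local port-forward (the local port choice is arbitrary):

    kubectl -n infra get pod -l app=alertmanager
    # Service port 80 forwards to container port 9093
    kubectl -n infra port-forward svc/alertmanager 9093:80 &
    sleep 2
    curl -s http://127.0.0.1:9093/api/v1/status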
    

Alert rule configuration

  1. On the 200 machine, create a rules.yml file in the /data/nfs-volume/prometheus/etc directory with the following content (a quick syntax check for this file is sketched at the end of this section):

    groups:
    - name: hostStatsAlert
      rules:
      - alert: hostCpuUsageAlert
        expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
      - alert: hostMemUsageAlert
        expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
      - alert: OutOfInodes
        expr: node_filesystem_free{fstype="overlay",mountpoint ="/"} / node_filesystem_size{fstype="overlay",mountpoint ="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Out of inodes (instance {{ $labels.instance }})"
          description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
      - alert: OutOfDiskSpace
        expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
      - alert: UnusualNetworkThroughputIn
        expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual network throughput in (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
      - alert: UnusualNetworkThroughputOut
        expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual network throughput out (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
      - alert: UnusualDiskReadRate
        expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read rate (instance {{ $labels.instance }})"
          description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
      - alert: UnusualDiskWriteRate
        expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write rate (instance {{ $labels.instance }})"
          description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
      - alert: UnusualDiskReadLatency
        expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
      - alert: UnusualDiskWriteLatency
        expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completedl[1m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
    - name: http_status
      rules:
      - alert: ProbeFailed
        expr: probe_success == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Probe failed (instance {{ $labels.instance }})"
          description: "Probe failed (current value: {{ $value }})"
      - alert: StatusCode
        expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Status Code (instance {{ $labels.instance }})"
          description: "HTTP status code is not 200-399 (current value: {{ $value }})"
      - alert: SslCertificateWillExpireSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "SSL certificate expires in 30 days (current value: {{ $value }})"
      - alert: SslCertificateHasExpired
        expr: probe_ssl_earliest_cert_expiry - time()  <= 0
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "SSL certificate has expired (instance {{ $labels.instance }})"
          description: "SSL certificate has expired already (current value: {{ $value }})"
      - alert: BlackboxSlowPing
        expr: probe_icmp_duration_seconds > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Blackbox slow ping (instance {{ $labels.instance }})"
          description: "Blackbox ping took more than 2s (current value: {{ $value }})"
      - alert: BlackboxSlowRequests
        expr: probe_http_duration_seconds > 2 
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Blackbox slow requests (instance {{ $labels.instance }})"
          description: "Blackbox request took more than 2s (current value: {{ $value }})"
      - alert: PodCpuUsagePercent
        expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"
    
  2. Edit the prometheus.yml file in the same directory and append the following at the end:

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager"]
    rule_files:
      - "/data/etc/rules.yml"
    

    image-20220303151831025

  3. Reload Prometheus gracefully: on the node where the Prometheus pod runs, execute:

    [root@hdss7-21 ~]# ps -ef|grep prometheus
    # send SIGHUP to the prometheus PID found above
    [root@hdss7-21 ~]# kill -SIGHUP 75838
    
  4. Check the Alerts page in Prometheus:

    image-20220303152250801

  5. Test alerting: scale dubbo-service to 0, so the consumer starts throwing errors and blackbox reports failures:

    image-20220303152454369

    Check the alert in Prometheus:

    image-20220303152605879

  6. Check the alert e-mail:

    image-20220303152751258

    When finished, scale dubbo-service back to 1, and shut down some pods to save resources:

    [root@hdss7-21 ~]# kubectl scale deployment grafana --replicas=0 -n infra
    deployment.apps/grafana scaled
    [root@hdss7-21 ~]# kubectl scale deployment alertmanager --replicas=0 -n infra
    deployment.apps/alertmanager scaled
    [root@hdss7-21 ~]# kubectl scale deployment prometheus --replicas=0 -n infra
    deployment.apps/prometheus scaled
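
Before reloading, the rule file can also be validated with promtool, which ships inside the official prom/prometheus image; a minimal sketch run from inside the Prometheus pod (where the NFS volume is mounted at /data):

    POD=$(kubectl -n infra get pod -l app=prometheus -o jsonpath='{.items[0].metadata.name}')
    kubectl -n infra exec -it "$POD" -- /bin/promtool check rules /data/etc/rules.yml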
    

Connecting dubbo to ELK

As container orchestration goes on, business containers are constantly created, destroyed, migrated and scaled. Faced with such a huge amount of log data scattered across different places, we need a centralized log collection and analysis system: ELK. It provides the following capabilities:

  • Collection: gather log data from many sources (a streaming log collector)
  • Transport: reliably ship the log data to a central system (a message queue)
  • Storage: store the logs as structured data (a search engine)
  • Analysis: convenient analysis and search with a GUI (the front end)
  • Alerting: error reporting and monitoring (a monitoring tool)

The figure above shows the ELK deployment architecture. c1/c2 are business containers; filebeat collects their logs, and the container and filebeat run in the same pod so they stay tightly coupled;

kafka: a high-throughput distributed publish/subscribe messaging system that can handle all the action-stream data of site users; filebeat publishes the collected data to kafka as topics;

Topic: the basic unit of kafka write operations;

logstash: consumes the topics from kafka and ships them to Elasticsearch (an asynchronous fetch-and-forward process);

index-pattern: splits the data by environment (prod and test) and feeds it into kibana;

kibana: displays the data.

Building the tomcat image

  1. On the 200 machine, download Tomcat from the official site; version 8.5.76 is used here:

    [root@hdss7-200 etc]# cd /opt/src/
    [root@hdss7-200 src]# wget https://dlcdn.apache.org/tomcat/tomcat-8/v8.5.76/bin/apache-tomcat-8.5.76.tar.gz
    [root@hdss7-200 src]# mkdir /data/dockerfile/tomcat
    [root@hdss7-200 src]# tar xvf apache-tomcat-8.5.76.tar.gz -C /data/dockerfile/tomcat/
    
  2. Comment out the AJP connector in apache-tomcat-8.5.76/conf/server.xml:

        <!-- Define an AJP 1.3 Connector on port 8009 -->
        <!--
        <Connector protocol="AJP/1.3"
                   address="::1"
                   port="8009"
                   redirectPort="8443" />
        -->
    
  3. Remove the redundant logging configuration (conf/logging.properties):

    # remove 3manager and 4host-manager from the handlers line
    handlers = 1catalina.org.apache.juli.AsyncFileHandler, 2localhost.org.apache.juli.AsyncFileHandler,  java.util.logging.ConsoleHandler
    # comment out the 3manager and 4host-manager handler settings
    #3manager.org.apache.juli.AsyncFileHandler.level = FINE
    #3manager.org.apache.juli.AsyncFileHandler.directory = ${catalina.base}/logs
    #3manager.org.apache.juli.AsyncFileHandler.prefix = manager.
    #3manager.org.apache.juli.AsyncFileHandler.encoding = UTF-8
    
    #4host-manager.org.apache.juli.AsyncFileHandler.level = FINE
    #4host-manager.org.apache.juli.AsyncFileHandler.directory = ${catalina.base}/logs
    #4host-manager.org.apache.juli.AsyncFileHandler.prefix = host-manager.
    #4host-manager.org.apache.juli.AsyncFileHandler.encoding = UTF-8
    
    # change the log levels to INFO
    1catalina.org.apache.juli.AsyncFileHandler.level = INFO
    2localhost.org.apache.juli.AsyncFileHandler.level = INFO
    java.util.logging.ConsoleHandler.level = INFO
    

    image-20220303161726503

  4. Prepare the Dockerfile:

    FROM harbor.od.com/public/jre:8u112
    RUN /bin/cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
        echo 'Asia/Shanghai' > /etc/timezone
    ENV CATALINA_HOME /opt/tomcat
    ENV LANG zh_CN.UTF-8
    ADD apache-tomcat-8.5.76 /opt/tomcat
    ADD config.yml /opt/prom/config.yml
    ADD jmx_javaagent-0.3.1.jar /opt/prom/jmx_javaagent-0.3.1.jar
    WORKDIR /opt/tomcat
    ADD entrypoint.sh /entrypoint.sh
    CMD ["/entrypoint.sh"]
    

    config.yml

    ---
    rules:
      - pattern: '_*'
    

    entrypoint.sh

    #!/bin/bash
    M_OPTS="-Duser.timezone=Asia/Shanghai -javaagent:/opt/prom/jmx_javaagent-0.3.1.jar=$(hostname -i):${M_PORT:-"12346"}:/opt/prom/config.yml"
    C_OPTS=${C_OPTS}
    MIN_HEAP=${MIN_HEAP:-"128m"}
    MAX_HEAP=${MAX_HEAP:-"128m"}
    JAVA_OPTS=${JAVA_OPTS:-"-Xmn384m -Xss256k -Duser.timezone=GMT+08  -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled -XX:LargePageSizeInBytes=128m -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+PrintClassHistogram  -Dfile.encoding=UTF8 -Dsun.jnu.encoding=UTF8"}
    CATALINA_OPTS="${CATALINA_OPTS}"
    JAVA_OPTS="${M_OPTS} ${C_OPTS} -Xms${MIN_HEAP} -Xmx${MAX_HEAP} ${JAVA_OPTS}"
    sed -i -e "1a\JAVA_OPTS=\"$JAVA_OPTS\"" -e "1a\CATALINA_OPTS=\"$CATALINA_OPTS\"" /opt/tomcat/bin/catalina.sh
    
    cd /opt/tomcat && /opt/tomcat/bin/catalina.sh run 2>&1 >> /opt/tomcat/logs/stdout.log
    
    # download the JMX java agent:
    wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.3.1/jmx_prometheus_javaagent-0.3.1.jar -O jmx_javaagent-0.3.1.jar
    
    [root@hdss7-200 tomcat]# chmod +x entrypoint.sh
    [root@hdss7-200 tomcat]# docker build . -t harbor.od.com/base/tomcat:v8.5.76
    [root@hdss7-200 tomcat]# docker push !$
    
  5. Create a new pipeline in Jenkins:

    image-20220303165757582

    image-20220303165915753

    image-20220303170426991

    image-20220303170557204

    image-20220303172336986

    image-20220303170634988

    image-20220303170723799

    image-20220303170759557

    image-20220303170824834

    image-20220303170911165

    image-20220303171554509

    image-20220303172959465

    image-20220303171944312

    pipeline {
      agent any 
        stages {
        stage('pull') { //get project code from repo 
          steps {
            sh "git clone ${params.git_repo} ${params.app_name}/${env.BUILD_NUMBER} && cd ${params.app_name}/${env.BUILD_NUMBER} && git checkout ${params.git_ver}"
            }
        }
        stage('build') { //exec mvn cmd
          steps {
            sh "cd ${params.app_name}/${env.BUILD_NUMBER}  && /var/jenkins_home/maven-${params.maven}/bin/${params.mvn_cmd}"
          }
        }
        stage('unzip') { //unzip  target/*.war -c target/project_dir
          steps {
            sh "cd ${params.app_name}/${env.BUILD_NUMBER} && cd ${params.target_dir} && mkdir project_dir && unzip *.war -d ./project_dir"
          }
        }
        stage('image') { //build image and push to registry
          steps {
            writeFile file: "${params.app_name}/${env.BUILD_NUMBER}/Dockerfile", text: """FROM harbor.od.com/${params.base_image}
    ADD ${params.target_dir}/project_dir /opt/tomcat/webapps/${params.root_url}"""
            sh "cd  ${params.app_name}/${env.BUILD_NUMBER} && docker build -t harbor.od.com/${params.image_name}:${params.git_ver}_${params.add_tag} . && docker push harbor.od.com/${params.image_name}:${params.git_ver}_${params.add_tag}"
          }
        }
      }
    }
    
  6. Click Build and fill in the following parameters:

    app_name:       dubbo-demo-web
    image_name:     app/dubbo-demo-web
    git_repo:       -b client git@gitee.com:evobot/dubbo-demo-web.git
    git_ver:        tomcat
    add_tag:        220303-1730
    mvn_dir:        ./
    target_dir:     ./dubbo-client/target
    mvn_cmd:        mvn clean package -Dmaven.test.skip=true
    base_image:     base/tomcat:v8.5.76
    maven:          3.6.1-8u242
    root_url:       ROOT
    # click Build and wait for the build to finish
    
  7. After the build succeeds, scale Apollo to 1, then in the test namespace edit the dubbo-demo-consumer Deployment: change the image tag, remove the 20880 port entry, and update the pod:

    image-20220303200633995

    image-20220303200537485

  8. Visit the demo-test.od.com/hello?name=tomcat page:

    image-20220303200819244

  9. Open a shell in the consumer pod and look at the logs; what is shown below is the log output we need to collect:

    image-20220303200926789

Deploying elasticsearch

Only a single elasticsearch node is deployed here.

  1. Download the package to the 12 machine (on the official download page, right-click the MACOS/LINUX link and copy it):

    [root@hdss7-12 src]# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.8.6.tar.gz
    [root@hdss7-12 src]# tar xvf elasticsearch-6.8.6.tar.gz -C /opt/
    [root@hdss7-12 src]# ln -s /opt/elasticsearch-6.8.6/ /opt/elasticsearch
    [root@hdss7-12 src]# cd /opt/elasticsearch
    [root@hdss7-12 elasticsearch]# mkdir -p /data/elasticsearch/{data,logs}
    
  2. Configure elasticsearch: edit config/elasticsearch.yml and change the following:

    cluster.name: es.od.com
    node.name: hdss7-12.host.com
    path.data: /data/elasticsearch/data
    path.logs: /data/elasticsearch/logs
    bootstrap.memory_lock: true
    network.host: 10.4.7.12
    http.port: 9200
    
  3. Edit config/jvm.options and change the following:

    -Xms512m
    -Xmx512m
    
  4. Create the es user and tune system parameters:

    [root@hdss7-21 elasticsearch]# useradd -s /bin/bash -M es
    [root@hdss7-21 elasticsearch]# chown -R es.es /opt/elasticsearch-6.8.6/
    [root@hdss7-21 elasticsearch]# chown -R es.es /opt/elasticsearch
    [root@hdss7-21 elasticsearch]# chown -R es.es /data/elasticsearch/
    # vi /etc/security/limits.d/es.conf and add the following
    es hard nofile 65536
    es soft fsize unlimited
    es hard memlock unlimited
    es soft memlock unlimited
    
    # tune kernel parameters
    [root@hdss7-21 elasticsearch]# sysctl -w vm.max_map_count=262144
    [root@hdss7-21 elasticsearch]# echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
    [root@hdss7-21 elasticsearch]# sysctl -p
    
    
  5. Start es and adjust the index template:

    [root@hdss7-21 elasticsearch]# su -c "/opt/elasticsearch/bin/elasticsearch -d" es
    [root@hdss7-21 elasticsearch]# netstat -tlunp|grep 9200
    # adjust the index template
    [root@hdss7-21 elasticsearch]# curl -H "Content-Type:application/json" -XPUT http://10.4.7.12:9200/_template/k8s -d '{
    > "template": "k8s*",
    > "index_patterns": ["k8s*"],
    > "settings": {
    >   "number_of_shards":5,
    >   "number_of_replicas": 0
    > }
    > }'
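
With the node started and the template applied, both can be verified over the HTTP API; a minimal sketch:

    # single-node cluster health (green or yellow is expected with 0 replicas)
    curl -s 'http://10.4.7.12:9200/_cluster/health?pretty'
    # the k8s index template created above
    curl -s 'http://10.4.7.12:9200/_template/k8s?pretty'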
    

Deploying kafka and kafka-manager

Deploying kafka

  1. Download the kafka package on the 11 machine:

    cd /opt/src/
    src]# wget https://archive.apache.org/dist/kafka/2.2.0/kafka_2.12-2.2.0.tgz
    src]# tar xfv kafka_2.12-2.2.0.tgz -C /opt/
    src]# ln -s /opt/kafka_2.12-2.2.0/ /opt/kafka
    src]# cd /opt/kafka
    kafka]# mkdir -pv /data/kafka/logs
    
  2. Configure kafka:

    # change the following settings; the zookeeper entry stays as it is
    kafka]# vi config/server.properties
    log.dirs=/data/kafka/logs
    zookeeper.connect=localhost:2181
    log.flush.interval.messages=10000
    log.flush.interval.ms=1000
    # append the following two lines at the end of the file
    delete.topic.enable=true
    host.name=hdss7-11.host.com
    
  3. Start kafka:

    [root@hdss7-11 kafka]# bin/kafka-server-start.sh -daemon config/server.properties
    # find the kafka process PID
    [root@hdss7-11 kafka]# ps aux|grep kafka
    # check which ports that PID is listening on
    [root@hdss7-11 kafka]# netstat -tlnp |grep 125525
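
A quick way to check that the broker accepts clients is to create, list and delete a throwaway topic; a minimal sketch (the topic name is only an example, and kafka-topics.sh in Kafka 2.2 already supports --bootstrap-server):

    bin/kafka-topics.sh --bootstrap-server 10.4.7.11:9092 --create --topic smoke-test --partitions 1 --replication-factor 1
    bin/kafka-topics.sh --bootstrap-server 10.4.7.11:9092 --list
    bin/kafka-topics.sh --bootstrap-server 10.4.7.11:9092 --delete --topic smoke-test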
    

Deploying kafka-manager

  1. On the 200 machine, prepare the image:

    [root@hdss7-200 src]# docker pull hlebalbau/kafka-manager
    [root@hdss7-200 src]# docker images|grep kafka
    [root@hdss7-200 src]# docker tag 2743fc8f2a3e harbor.od.com/infra/kafka-manager:latest
    [root@hdss7-200 src]# docker push !$
    
  2. Create the resource manifests in the /data/k8s-yaml/kafka-manager directory and apply them on a node:

    dp.yaml

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: kafka-manager
      namespace: infra
      labels:
        name: kafka-manager
    spec:
      template:
        metadata:
          labels:
            app: kafka-manager
        spec:
          containers:
            - name: kafka-manager
              image: harbor.od.com/infra/kafka-manager:latest
              imagePullPolicy: IfNotPresent
              ports:
                - containerPort: 9000
                  protocol: TCP
              env:
                - name: ZK_HOSTS
                  value: zk1.od.com:2181
                - name: APPLICATION_SECRET
                  value: letmein
          imagePullSecrets:
            - name: harbor
          terminationGracePeriodSeconds: 30
          securityContext:
            runAsUser: 0
      replicas: 1
      selector:
        matchLabels:
          app: kafka-manager
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
      revisionHistoryLimit: 7
      progressDeadlineSeconds: 600

    svc.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: kafka-manager
      namespace: infra
    spec:
      ports:
        - protocol: TCP
          port: 9000
          targetPort: 9000
      selector:
        app: kafka-manager

    ingress.yaml

    kind: Ingress
    apiVersion: extensions/v1beta1
    metadata:
      name: kafka-manager
      namespace: infra
    spec:
      rules:
        - host: km.od.com
          http:
            paths:
              - path: /
                backend:
                  serviceName: kafka-manager
                  servicePort: 9000

    # apply them in one go
    [root@hdss7-22 ~]# for i in  dp svc ingress;do kubectl apply -f http://k8s-yaml.od.com/kafka-manager/$i.yaml;done
  3. Configure DNS:

    2021121020 ; serial
    km                 A    10.4.7.10
    
  4. Visit km.od.com and add the cluster:

    image-20220304140241522

    If the error Yikes! KeeperErrorCode = Unimplemented for /kafka-manager/mutex Try again. appears, open the zookeeper CLI on zk1 and create the nodes manually:

    [root@hdss7-11 zookeeper]# ./bin/zkCli.sh
    [zk: localhost:2181(CONNECTED) 1] ls /kafka-manager
    [configs, deleteClusters, clusters]
    [zk: localhost:2181(CONNECTED) 2] create /kafka-manager/mutex ""
    Created /kafka-manager/mutex
    [zk: localhost:2181(CONNECTED) 3] create /kafka-manager/mutex/locks ""
    Created /kafka-manager/mutex/locks
    [zk: localhost:2181(CONNECTED) 4] create /kafka-manager/mutex/leases ""
    
  5. Check the cluster:

    image-20220304144426149

Connecting dubbo to filebeat

  1. On the Filebeat download page, copy the sha512 checksum of the LINUX 64-BIT build:

    image-20220304153225542

    Copy the highlighted part:

    image-20220304153326928

  2. Start all the pods in the test namespace, then prepare the image on the 200 machine:

    vi /data/dockerfile/filebeat/Dockerfile

    FROM debian:jessie
    
    ENV FILEBEAT_VERSION=7.6.1 \
        FILEBEAT_SHA1=887edb2ab255084ef96dbc4c7c047bfa92dad16f263e23c0fcc80120ea5aca90a3a7a44d4783ba37b135dac76618971272a591ab4a24997d8ad40c7bc23ffabf
    
    RUN set -x && \
        apt-get update && \
        apt-get install -y wget && \
        wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-${FILEBEAT_VERSION}-linux-x86_64.tar.gz -O /opt/filebeat.tar.gz && \
        cd /opt && \
        echo "${FILEBEAT_SHA1}  filebeat.tar.gz" | sha512sum -c - && \
        tar xzvf filebeat.tar.gz && \
        cd filebeat-* && \
        cp filebeat /bin && \
        cd /opt && \
        rm -rf filebeat* && \
        apt-get purge -y wget && \
        apt-get autoremove -y && \
        apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
    
    COPY docker-entrypoint.sh /
    ENTRYPOINT ["/docker-entrypoint.sh"]
    

    docker-entrypoint.sh

    #!/bin/bash
    
    ENV=${ENV:-"test"}
    PROJ_NAME=${PROJ_NAME:-"no-define"}
    MULTILINE=${MULTILINE:-"^\d{2}"}
    
    cat > /etc/filebeat.yaml << EOF
    filebeat.inputs:
    - type: log
      fields_under_root: true
      fields:
        topic: logm-${PROJ_NAME}
      paths:
        - /logm/*.log
        - /logm/*/*.log
        - /logm/*/*/*.log
        - /logm/*/*/*/*.log
        - /logm/*/*/*/*/*.log
      scan_frequency: 120s
      max_bytes: 10485760
      multiline.pattern: '$MULTILINE'
      multiline.negate: true
      multiline.match: after
      multiline.max_lines: 100
    - type: log
      fields_under_root: true
      fields:
        topic: logu-${PROJ_NAME}
      paths:
        - /logu/*.log
        - /logu/*/*.log
        - /logu/*/*/*.log
        - /logu/*/*/*/*.log
        - /logu/*/*/*/*/*.log
        - /logu/*/*/*/*/*/*.log
    output.kafka:
      hosts: ["10.4.7.11:9092"]
      topic: k8s-fb-$ENV-%{[topic]}
      version: 2.0.0
      required_acks: 0
      max_message_bytes: 10485760
    EOF
    
    set -xe
    
    # If the user doesn't provide any command,
    # run filebeat
    if [[ "$1" == "" ]]; then
         exec filebeat  -c /etc/filebeat.yaml 
    else
        # Otherwise allow the user to run arbitrary commands such as bash
        exec "$@"
    fi
    
    [root@hdss7-200 filebeat]# chmod +x docker-entrypoint.sh
    [root@hdss7-200 filebeat]# docker build . -t harbor.od.com/infra/filebeat:v7.6.1
    [root@hdss7-200 filebeat]# docker push !$
    
  3. Clear out the resource manifests under /data/k8s-yaml/test/dubbo-demo-consumer and use the configuration below:

    dp.yaml

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: dubbo-demo-consumer
      namespace: test
      labels:
        name: dubbo-demo-consumer
    spec:
      replicas: 1
      selector:
        matchLabels:
          name: dubbo-demo-consumer
      template:
        metadata:
          labels:
            app: dubbo-demo-consumer
            name: dubbo-demo-consumer
        spec:
          containers:
          - name: dubbo-demo-consumer
            image: harbor.od.com/app/dubbo-demo-web:tomcat_220303_1730
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 8080
              protocol: TCP
            env:
            - name: C_OPTS
              value: -Denv=fat -Dapollo.meta=http://apollo-configservice:8080
            volumeMounts:
            - mountPath: /opt/tomcat/logs
              name: logm
          - name: filebeat
            image: harbor.od.com/infra/filebeat:v7.6.1
            imagePullPolicy: IfNotPresent
            env:
            - name: ENV
              value: test
            - name: PROJ_NAME
              value: dubbo-demo-web
            volumeMounts:
            - mountPath: /logm
              name: logm
          volumes:
            - emptyDir: {}
              name: logm
          imagePullSecrets:
            - name: harbor
          restartPolicy: Always
          terminationGracePeriodSeconds: 30
          securityContext:
            runAsUser: 0
          schedulerName: default-scheduler
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
      revisionHistoryLimit: 7
      progressDeadlineSeconds: 600
    
  4. Apply the manifest and check filebeat (an end-to-end verification sketch follows this list):

    [root@hdss7-22 ~]# kubectl apply -f http://k8s-yaml.od.com/test/dubbo-demo-consumer/dp.yaml
    
    [root@hdss7-21 ~]# docker ps -a |grep consumer
    [root@hdss7-21 ~]# docker exec -it 9eb3ee5ed02c bash
    root@dubbo-demo-consumer-669f6845b7-xrvhr:/# cd /logm/
    root@dubbo-demo-consumer-669f6845b7-xrvhr:/logm# ls
    catalina.2022-03-04.log  localhost.2022-03-04.log  localhost_access_log.2022-03-04.txt  stdout.log
    # This log updates every time the demo page is refreshed
    root@dubbo-demo-consumer-669f6845b7-xrvhr:/logm# tail -f stdout.log
    

    image-20220304164309440
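
As referenced in step 4, the sidecar wiring can be verified end to end. A minimal sketch: dump the filebeat.yaml that docker-entrypoint.sh rendered, then read one message back from kafka. The pod name is a placeholder, and /opt/kafka is an assumed install path on hdss7-11:

    # 1) Inspect the config generated by the entrypoint inside the filebeat sidecar
    kubectl -n test get pods -l app=dubbo-demo-consumer
    kubectl -n test exec -it <consumer-pod-name> -c filebeat -- cat /etc/filebeat.yaml

    # 2) Consume one record from the topic the entrypoint template produces
    #    (k8s-fb-<ENV>-logm-<PROJ_NAME> => k8s-fb-test-logm-dubbo-demo-web)
    /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server 10.4.7.11:9092 \
      --topic k8s-fb-test-logm-dubbo-demo-web --from-beginning --max-messages 1
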

Deploying logstash

  1. Prepare the image; the logstash version must match the es version:

    [root@hdss7-200 src]# docker pull logstash:6.8.6
    [root@hdss7-200 src]# docker images|grep logstash
    [root@hdss7-200 src]# docker tag d0a2dac51fcb harbor.od.com/infra/logstash:v6.8.6
    [root@hdss7-200 src]# docker push harbor.od.com/infra/logstash:v6.8.6
    
  2. Create the configuration files

    [root@hdss7-200 src]# mkdir /etc/logstash
    [root@hdss7-200 src]# vi /etc/logstash/logstash-test.conf
    [root@hdss7-200 src]# vi /etc/logstash/logstash-prod.conf
    

    logstash-test.conf

    input {
      kafka {
        bootstrap_servers => "10.4.7.11:9092"
        client_id => "10.4.7.200"
        consumer_threads => 4
        group_id => "k8s_test"
        topics_pattern => "k8s-fb-test-.*"
      }
    }
    
    filter {
      json {
        source => "message"
      }
    }
    
    output {
      elasticsearch {
        hosts => ["10.4.7.12:9200"]
        index => "k8s-test-%{+YYYY.MM.DD}"
      }
    }
    

    logstash-prod.conf

    input {
      kafka {
        bootstrap_servers => "10.4.7.11:9092"
        client_id => "10.4.7.200"
        consumer_threads => 4
        group_id => "k8s_prod"
        topics_pattern => "k8s-fb-prod-.*"
      }
    }
    
    filter {
      json {
        source => "message"
      }
    }
    
    output {
      elasticsearch {
        hosts => ["10.4.7.12:9200"]
        index => "k8s-prod-%{+YYYY.MM.DD}"
      }
    }
    
  3. Start logstash (a config syntax check is sketched after this list):

    [root@hdss7-200 src]# docker run -d --name logstash-test -v /etc/logstash:/etc/logstash harbor.od.com/infra/logstash:v6.8.6 -f /etc/logstash/logstash-test.conf
    [root@hdss7-200 src]# docker ps -a |grep logstash
    
  4. Refresh the demo page so new log entries land in kafka, then check the es indices:

    image-20220304171534355

    [root@hdss7-200 src]# curl http://10.4.7.12:9200/_cat/indices?v
    health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    green  open   k8s-test-2022.03.63 -KUK0rBmR3SUlgY7QzklTw   5   0          2            0     24.8kb         24.8kb
    [root@hdss7-200 src]#
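
Relating to step 3: the pipeline definition can be dry-run before the container is started for real; a minimal sketch using logstash's --config.test_and_exit flag with the same image and mounted config:

    # Validate the pipeline syntax; the container exits after the check
    docker run --rm -v /etc/logstash:/etc/logstash harbor.od.com/infra/logstash:v6.8.6 \
      -f /etc/logstash/logstash-test.conf --config.test_and_exit

    # Once logstash-test is running, watch its logs for kafka/es connection errors
    docker logs -f logstash-test
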
    

Delivering kibana

  1. Prepare the image:

    [root@hdss7-200 src]# docker pull kibana:6.8.6
    [root@hdss7-200 src]# docker images|grep kibana
    [root@hdss7-200 src]# docker tag adfab5632ef4 harbor.od.com/infra/kibana:v6.8.6
    [root@hdss7-200 src]# docker push harbor.od.com/infra/kibana:v6.8.6
    
  2. On the 200 host, prepare the resource manifest files under /data/k8s-yaml/kibana:

    dp.yaml

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: kibana
      namespace: infra
      labels:
        name: kibana
    spec:
      replicas: 1
      selector:
        matchLabels:
          name: kibana
      template:
        metadata:
          labels:
            app: kibana
            name: kibana
        spec:
          containers:
            - name: kibana
              image: harbor.od.com/infra/kibana:v6.8.6
              imagePullPolicy: IfNotPresent
              ports:
                - containerPort: 5601
                  protocol: TCP
              env:
                - name: ELASTICSEARCH_URL
                  value: http://10.4.7.12:9200
          imagePullSecrets:
            - name: harbor
          securityContext:
            runAsUser: 0
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
      revisionHistoryLimit: 7
      progressDeadlineSeconds: 600
    

    svc.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: kibana
      namespace: infra
    spec:
      ports:
        - protocol: TCP
          port: 5601
          targetPort: 5601
      selector:
        app: kibana
    

    ingress.yaml

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: kibana
      namespace: infra
    spec:
      rules:
      - host: kibana.od.com
        http:
          paths:
          - path: /
            backend:
              serviceName: kibana
              servicePort: 5601
    ---
    
  3. Configure DNS resolution and apply the manifests:

    2021121021 ; serial
    kibana             A    10.4.7.10
    
    # Apply in batch on the 22 host
    [root@hdss7-22 ~]# for i in  dp svc ingress;do kubectl apply -f http://k8s-yaml.od.com/kibana/$i.yaml;done
    
  4. Visit kibana.od.com

    image-20220304173611794

    image-20220304173645761

    image-20220304173749427

    image-20220304173819033

    image-20220304173916129

    image-20220304173943079

    image-20220304174012736

    Then the logs are visible:

    image-20220304174102038

  5. Next, start configservice and admin in the prod namespace:

    image-20220304223817079

  6. On the 200 host, copy dp.yaml from test/dubbo-demo-consumer into the matching prod directory and change the namespace to prod and the Apollo env to pro (a grep check is sketched after this list):

    [root@hdss7-200 prod]# cd dubbo-demo-consumer/
    [root@hdss7-200 dubbo-demo-consumer]# ls
    dp.yaml  ingress.yaml  svc.yaml
    [root@hdss7-200 dubbo-demo-consumer]# cp ../../test/dubbo-demo-consumer/dp.yaml .
    cp: overwrite './dp.yaml'? y
    [root@hdss7-200 dubbo-demo-consumer]# sed -i 's/fat/pro/g' dp.yaml
    [root@hdss7-200 dubbo-demo-consumer]# sed -i 's/test/prod/g' dp.yaml
    

    image-20220304224355301
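
As noted in step 6, a quick grep confirms the fields the two sed commands were meant to change before the manifest is applied; a minimal sketch run from the same directory:

    # namespace should now be prod, the Apollo env pro, and the filebeat ENV prod
    grep -n -A1 -E 'namespace:|Denv=|name: ENV' dp.yaml
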

Kibana in production

Log collection

  1. Confirm that the Eureka at config-prod.od.com shows the config and admin services:

    image-20220304225417071

  2. Confirm that both the FAT and PRO environments exist in apollo

    image-20220304225606219

  3. Once confirmed, start the dubbo-demo-service pod in the prod namespace:

    image-20220304225707803

  4. Then apply the new consumer deployment:

    [root@hdss7-22 ~]# kubectl apply -f http://k8s-yaml.od.com/prod/dubbo-demo-consumer/dp.yaml
    deployment.apps/dubbo-demo-consumer configured
    
  5. On the 200 host, create the logstash for the prod environment:

    [root@hdss7-200 dubbo-demo-consumer]# docker run -d --name logstash-prod -v /etc/logstash:/etc/logstash harbor.od.com/infra/logstash:v6.8.6 -f /etc/logstash/logstash-prod.conf
    [root@hdss7-200 dubbo-demo-consumer]# docker ps|grep logstash
    
    # curl es; at this point only the test index exists
    [root@hdss7-200 dubbo-demo-consumer]# curl http://10.4.7.12:9200/_cat/indices?v
    health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    green  open   .monitoring-kibana-6-2022.03.04 Bv6cXIk4RqeqyJbJCeCL0w   1   0       2433            0    613.2kb        613.2kb
    green  open   .kibana_1                       yyHJ8l5qT3uC2Hz4lAeUPA   1   0          5            0     25.2kb         25.2kb
    green  open   .kibana_task_manager            Xalxgst9SjOE38Z-2HgQIA   1   0          2            0     12.5kb         12.5kb
    green  open   k8s-test-2022.03.63             -KUK0rBmR3SUlgY7QzklTw   5   0        153            0    465.8kb        465.8kb
    green  open   .monitoring-es-6-2022.03.04     BxxwSy44SkCy6Q69926TLg   1   0      22021           36      9.2mb          9.2mb
    
    
  6. Visit demo-prod.od.com in a browser, then check the filebeat logs:

    image-20220305003428134

    image-20220305003510295

    [root@hdss7-21 ~]# docker ps -a |grep consumer
    1c3fc912fbe4   b5194046a19c                         "/docker-entrypoint.…"   14 minutes ago   Up 14 minutes                                                     k8s_filebeat_dubbo-demo-consumer-6b4b988fbb-fp2lx_prod_0c2c12cd-11af-4f90-a8bc-4fc36d038fac_0
    a46c7fdf2bd6   80ef6604725e                         "/entrypoint.sh"         15 minutes ago   Up 14 minutes                                                     k8s_dubbo-demo-consumer_dubbo-demo-consumer-6b4b988fbb-fp2lx_prod_0c2c12cd-11af-4f90-a8bc-4fc36d038fac_0
    8158e0aeaa28   harbor.od.com/public/pause:latest    "/pause"                 15 minutes ago   Up 14 minutes                                                     k8s_POD_dubbo-demo-consumer-6b4b988fbb-fp2lx_prod_0c2c12cd-11af-4f90-a8bc-4fc36d038fac_0
    [root@hdss7-21 ~]# docker exec -it 1c3fc912fbe4 bash
    

    image-20220305005731006

    Check it on km.od.com:

    image-20220305010051803

  7. On the 200 host, query es (see the curl sketch after this list):

    image-20220305010143124

  8. Then go to kibana to configure it:

    image-20220305010303096

    image-20220305010322940
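
For step 7, the same _cat API used earlier shows whether the prod index has been created; a minimal sketch (the date suffix will differ):

    # The k8s-prod-* index should now appear alongside k8s-test-*
    curl -s 'http://10.4.7.12:9200/_cat/indices?v' | grep k8s-prod
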

Using kibana

Time range selection

image-20220305010821505

image-20220305010739608

Today is the most frequently used option:

image-20220305010657218

Environment selector

image-20220305011146037

Keyword selector

Select the message, log.file.path, and hostname fields

image-20220305011255543

Try to trigger an error by scaling the service in the test environment down to 0

image-20220305011600473
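
The scaling shown in the screenshot can also be done from the command line; a minimal sketch, assuming the provider side is a Deployment named dubbo-demo-service in the test namespace:

    # Take the test provider down so the consumer starts throwing exceptions
    kubectl -n test scale deployment dubbo-demo-service --replicas=0
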

Then refresh demo-test.od.com so that it throws errors:

image-20220305011655710

Search for the keyword exception in kibana:

image-20220305012038022

image-20220305012101263

With that, log collection for the consumer is complete; once done, restore the service's pods.
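
To bring it back afterwards (same assumption about the Deployment name as above):

    kubectl -n test scale deployment dubbo-demo-service --replicas=1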