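Fixing slow or timed-out Jenkins plugin installs once and for all

For well-known reasons, access to the Jenkins plugin site from mainland China is unreliable: requests are often slow or time out.

If you search for keywords such as "switch Jenkins to a domestic mirror", most articles tell you to follow these three steps:

  • Under "Manage Jenkins" ---> "Manage Plugins" ---> "Advanced", change the "Update Site" URL to a domestic mirror, e.g. https://mirrors.tuna.tsinghua.edu.cn/jenkins/updates/update-center.json
  • In /var/jenkins_home/updates/default.json, replace https://updates.jenkins.io/download with the mirror's base URL, e.g. https://mirrors.tuna.tsinghua.edu.cn/jenkins
  • Restart Jenkins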

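The first step has a pitfall: https://mirrors.tuna.tsinghua.edu.cn/jenkins/updates/update-center.json always lists the latest plugin versions, which may be incompatible with an older Jenkins. To get matching plugin versions, use https://mirrors.tuna.tsinghua.edu.cn/jenkins/updates/dynamic-stable-VERSION/update-center.json instead, where VERSION is the version of your Jenkins. If you are not running an LTS release, replace dynamic-stable-VERSION with dynamic-VERSION.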
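The URL rule above can be sketched as a small shell helper. This is only an illustration: the function name update_center_url and the "lts"/"weekly" channel labels are made up for this example, and the version numbers in the usage comments are placeholders.

```shell
# Print the version-specific update-center URL on the TUNA mirror.
# $1: Jenkins version, $2: "lts" (default) or "weekly" (hypothetical labels).
update_center_url() {
    uc_version=$1
    uc_channel=${2:-lts}
    uc_base=https://mirrors.tuna.tsinghua.edu.cn/jenkins/updates
    case $uc_channel in
        lts)    printf '%s/dynamic-stable-%s/update-center.json\n' "$uc_base" "$uc_version" ;;
        weekly) printf '%s/dynamic-%s/update-center.json\n' "$uc_base" "$uc_version" ;;
        *)      echo "unknown channel: $uc_channel" >&2; return 1 ;;
    esac
}

# Usage (version numbers are placeholders):
#   update_center_url 2.346.1
#   update_center_url 2.356 weekly
```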

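As for the second step: some time after making the change, when you want to install another plugin, you will find that /var/jenkins_home/updates/default.json has been restored to the official content. The fix is not permanent, because Jenkins refreshes this file periodically.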
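For reference, the manual replacement from step two might look like the sketch below. The function name rewrite_download_urls is hypothetical, and the defaults simply reuse the path and mirror mentioned above. It assumes the mirror URL contains neither "#" nor "&", since both are special in this sed expression.

```shell
# Rewrite plugin download URLs in default.json to point at a mirror.
# $1: path to default.json, $2: mirror base URL (both optional).
# A backup of the original file is kept next to it with a .bak suffix.
rewrite_download_urls() {
    json_file=${1:-/var/jenkins_home/updates/default.json}
    mirror=${2:-https://mirrors.tuna.tsinghua.edu.cn/jenkins}
    sed -i.bak "s#https://updates\.jenkins\.io/download#${mirror}#g" "$json_file"
}

# Usage:
#   rewrite_download_urls
#   rewrite_download_urls /path/to/default.json https://mirrors.tuna.tsinghua.edu.cn/jenkins
```

Because Jenkins refreshes the file, this would have to be rerun after every refresh, which is exactly the limitation described above.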

So is there a better way? There is. The rest of this post shows how to point Jenkins at any plugin mirror you like, once and for all.

Overview of the approach:

Jenkins.png

  • Edit hosts so that updates.jenkins.io resolves to the IP of an nginx server
  • Configure nginx as a reverse proxy that points at the Jenkins plugin site you want to use
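The hosts change in the first bullet could be scripted roughly as follows. This is a sketch: add_hosts_entry is a made-up helper name, and the IP address in the usage comment is a documentation placeholder, not a real nginx server.

```shell
# Map updates.jenkins.io to the given IP in a hosts file, unless a
# non-comment line already maps that name.
# $1: IP of the nginx server, $2: hosts file (defaults to /etc/hosts).
add_hosts_entry() (
    ip=$1
    hosts_file=${2:-/etc/hosts}
    if grep -Eq "^[^#]*[[:space:]]updates\.jenkins\.io([[:space:]]|#|\$)" "$hosts_file"; then
        echo "updates.jenkins.io is already mapped in $hosts_file" >&2
    else
        printf '%s updates.jenkins.io\n' "$ip" >> "$hosts_file"
    fi
)

# Usage (run as root on the Jenkins host; the IP is a placeholder):
#   add_hosts_entry 192.0.2.10
```

If Jenkins runs in a container, the entry has to end up in the container's hosts file rather than the host's.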

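One problem has to be solved first: the Jenkins update site is served over https, and Jenkins verifies the SSL certificate. I therefore use nginx's sub_filter module to rewrite the URLs in the update-center.json response to http.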

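Errors in a GitLab cluster caused by unsynchronized server clocks

A running GitLab cluster suddenly started misbehaving. The symptoms were as follows.

Environment: GitLab is deployed inside a k8s cluster using the official helm chart; the GitLab version is 14.10.x.

Symptoms: project-related operations fail intermittently

  • Create: creating a project sometimes fails with an error
  • View: inside a project, the file tree sometimes fails to load, with the error An error occurred while fetching folder content.
  • Delete: projects cannot be deleted

The Gitaly logs show that the errors come from Praefect calling Gitaly's health check endpoint; the key detail in the error is PermissionDenied.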

{"correlation_id":"01G7BX3DR59H35EYHPMKHNVBC0","error":"rpc error: code = PermissionDenied desc = permission denied","grpc.code":"PermissionDenied","grpc.meta.auth_version":"v2","grpc.meta.deadline_type":"unknown","grpc.meta.method_type":"unary","grpc.method":"Check","grpc.request.deadline":"2022-07-07T08:39:54.207","grpc.request.fullMethod":"/grpc.health.v1.Health/Check","grpc.request.payload_bytes":0,"grpc.response.payload_bytes":0,"grpc.service":"grpc.health.v1.Health","grpc.start_time":"2022-07-07T08:39:53.208","grpc.time_ms":0.285,"level":"warning","msg":"finished unary call with code PermissionDenied","peer.address":"10.42.1.134:51266","pid":12,"span.kind":"server","system":"grpc","time":"2022-07-07T08:39:53.208Z"}
{"correlation_id":"01G7BX3EQJZJFT996MD2KF2HXW","error":"rpc error: code = PermissionDenied desc = permission denied","grpc.code":"PermissionDenied","grpc.meta.auth_version":"v2","grpc.meta.deadline_type":"unknown","grpc.meta.method_type":"unary","grpc.method":"Check","grpc.request.deadline":"2022-07-07T08:39:55.212","grpc.request.fullMethod":"/grpc.health.v1.Health/Check","grpc.request.payload_bytes":0,"grpc.response.payload_bytes":0,"grpc.service":"grpc.health.v1.Health","grpc.start_time":"2022-07-07T08:39:54.213","grpc.time_ms":0.201,"level":"warning","msg":"finished unary call with code PermissionDenied","peer.address":"10.42.1.134:51266","pid":12,"span.kind":"server","system":"grpc","time":"2022-07-07T08:39:54.213Z"}

*** /var/log/gitaly/gitaly_ruby_json.log ***
{"type":"gitaly-ruby","grpc.start_time":"2022-07-07T08:39:53Z","grpc.time_ms":0.286,"grpc.code":"OK","grpc.method":"Check","grpc.service":"grpc.health.v1.Health","pid":35,"correlation_id":"c486ca8b2bbc3eb6736d533c38cf6017","time":"2022-07-07T08:39:53.642Z"}
{"type":"gitaly-ruby","grpc.start_time":"2022-07-07T08:39:53Z","grpc.time_ms":17.012,"grpc.code":"OK","grpc.method":"Check","grpc.service":"grpc.health.v1.Health","pid":36,"correlation_id":"71bff8c80424d633989f5daeb29111c3","time":"2022-07-07T08:39:53.658Z"}

*** /var/log/gitaly/gitaly.log ***
{"correlation_id":"01G7BX3FPZ8BTMCA0RY5754JGJ","error":"rpc error: code = PermissionDenied desc = permission denied","grpc.code":"PermissionDenied","grpc.meta.auth_version":"v2","grpc.meta.deadline_type":"unknown","grpc.meta.method_type":"unary","grpc.method":"Check","grpc.request.deadline":"2022-07-07T08:39:56.217","grpc.request.fullMethod":"/grpc.health.v1.Health/Check","grpc.request.payload_bytes":0,"grpc.response.payload_bytes":0,"grpc.service":"grpc.health.v1.Health","grpc.start_time":"2022-07-07T08:39:55.218","grpc.time_ms":0.135,"level":"warning","msg":"finished unary call with code PermissionDenied","peer.address":"10.42.1.134:51266","pid":12,"span.kind":"server","system":"grpc","time":"2022-07-07T08:39:55.218Z"}
{"correlation_id":"01G7BX3GPCRD1RPMN7F3WN6X14","error":"rpc error: code = PermissionDenied desc = permission denied","grpc.code":"PermissionDenied","grpc.meta.auth_version":"v2","grpc.meta.deadline_type":"unknown","grpc.meta.method_type":"unary","grpc.method":"Check","grpc.request.deadline":"2022-07-07T08:39:57.222","grpc.request.fullMethod":"/grpc.health.v1.Health/Check","grpc.request.payload_bytes":0,"grpc.response.payload_bytes":0,"grpc.service":"grpc.health.v1.Health","grpc.start_time":"2022-07-07T08:39:56.223","grpc.time_ms":0.189,"level":"warning","msg":"finished unary call with code PermissionDenied","peer.address":"10.42.1.134:51266","pid":12,"span.kind":"server","system":"grpc","time":"2022-07-07T08:39:56.223Z"}

Solution: keep the clocks of the servers synchronized, and the errors go away.
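A quick way to confirm clock drift is to compare Unix timestamps taken on two nodes. The helper below is a sketch under assumptions: clock_skew_ok is a made-up name, NODE is a placeholder host, and the 30-second default limit is an assumption for illustration; check the token validity window documented for your GitLab version.

```shell
# Succeed when two Unix timestamps differ by at most $3 seconds.
# $1: local timestamp, $2: remote timestamp, $3: limit (assumed default: 30).
clock_skew_ok() (
    limit=${3:-30}
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "$limit" ]
)

# Usage (NODE is a placeholder for another server in the cluster):
#   clock_skew_ok "$(date +%s)" "$(ssh NODE date +%s)" || echo "clocks differ too much"
```

The ssh round trip adds a little latency to the measurement, so treat small differences as noise.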

Upstream issue: Permission denied between Gitlab and Praefect

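Deleting queues in RabbitMQ

One of our projects uses RabbitMQ with a large number of one-off queues, but none of them were configured to expire or auto-delete. Over time this left a huge number of queues behind, which badly hurt performance and made troubleshooting slow. Here are several ways to delete queues in bulk, for reference.

Method 1: set an expiry policy

Pros: simple, and policies can target queues whose names follow a pattern
Cons: none that I can think of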

# Set the policy
rabbitmqctl set_policy delete_gen "amq.gen-.*" '{"expires":1}' --apply-to queues

# Remove the policy
rabbitmqctl clear_policy delete_gen

# To apply it to all queues
rabbitmqctl set_policy delete_all ".*" '{"expires":1}' --apply-to queues

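Method 2: reset the database

Pros: simple, deletes every queue
Cons: brute force; the reset wipes everything else on the node as well (users, vhosts, exchanges, policies)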

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app

Method 3: delete the vhost

Pros: removes all queues in a vhost, along with its exchanges
Cons: only suitable when the whole vhost can be deleted

curl -i -XDELETE http://USERNAME:PASSWORD@HOST:15672/api/vhosts/VHOST_NAME
# Example (%2F is the URL-encoded default vhost "/")
curl -i -XDELETE http://admin:PASSWORD@HOST:15672/api/vhosts/%2F

Method 4: delete through the HTTP API

Pros: the HTTP API is flexible
Cons: deletes one queue at a time

curl -i -XDELETE http://USERNAME:PASSWORD@HOST:PORT/api/queues/VHOST/QUEUE_NAME
# Example:
curl -i -XDELETE http://admin:PASSWORD@HOST:15672/api/queues/%2F/test_queue
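Since the HTTP API removes one queue per request, bulk deletion comes down to building the URL for each queue and looping. The sketch below shows one possible way to do that. The helper names (urlencode, queue_url, delete_queue) and the RABBIT_USER / RABBIT_PASS variables are made up for this example, and urlencode only handles ASCII names.

```shell
# Percent-encode a string for use as a URL path segment (ASCII only).
urlencode() (
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        s=$rest
        case "$c" in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s' "$out"
)

# Build the queue URL. Arguments: host, port, vhost, queue name.
queue_url() {
    printf 'http://%s:%s/api/queues/%s/%s\n' "$1" "$2" "$(urlencode "$3")" "$(urlencode "$4")"
}

# Delete one queue; credentials come from hypothetical environment variables.
delete_queue() {
    curl -i -u "$RABBIT_USER:$RABBIT_PASS" -XDELETE "$(queue_url "$@")"
}

# Usage: delete every queue whose name starts with "amq.gen-" in vhost "/"
#   rabbitmqctl list_queues -q name | grep '^amq\.gen-' | while IFS= read -r q; do
#       delete_queue HOST 15672 / "$q"
#   done
```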

Method 5: use the rabbitmqadmin tool

Pros: convenient
Cons: it is built on the HTTP API under the hood

rabbitmqadmin --host=HOST --port=15672 --ssl --vhost=VHOST --username=USERNAME --password=PASSWORD delete queue name=QUEUE_NAME
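To delete many queues with rabbitmqadmin, its queue listing can be piped into a loop. The function below is a sketch, not a tested production script: delete_matching_queues and the DRY_RUN switch are made up for this example, and the connection options (--host, --username and so on) are left out for brevity and would need to be added to both rabbitmqadmin calls.

```shell
# Read queue names from stdin (one per line), keep those matching the
# extended regex in $1, and delete each one with rabbitmqadmin.
# With DRY_RUN=1 the commands are printed instead of executed.
delete_matching_queues() {
    grep -E -- "$1" | while IFS= read -r queue; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf 'rabbitmqadmin delete queue name=%s\n' "$queue"
        else
            rabbitmqadmin delete queue name="$queue" </dev/null
        fi
    done
}

# Usage: preview first, then delete
#   rabbitmqadmin -f tsv -q list queues name | DRY_RUN=1 delete_matching_queues '^amq\.gen-'
#   rabbitmqadmin -f tsv -q list queues name | delete_matching_queues '^amq\.gen-'
```

Running with DRY_RUN=1 first makes it easy to check that the pattern only matches the queues you intend to remove.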

Redirecting http to https with haproxy

When http uses port 80 and https uses port 443:

frontend app
    bind *:80
    bind :443 ssl crt /etc/haproxy/server.pem no-sslv3
    mode http
    option httplog
    option forwardfor
    # rspidel was removed in HAProxy 2.1; http-response del-header replaces it
    http-response del-header Server
    redirect scheme https if !{ ssl_fc }
    default_backend app

backend app
    mode http
    option httpchk HEAD /
    server app01 server1:3000 check inter 2000 rise 2 fall 5

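When http and https are not on ports 80 and 443, the configuration above no longer works: redirect scheme only swaps the scheme and keeps the Host header, so the port in the redirect target is still the http one. Use the following instead: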

frontend app
    bind *:8080
    bind :8443 ssl crt /etc/haproxy/server.pem no-sslv3
    mode http
    option httplog
    option forwardfor
    # rspidel was removed in HAProxy 2.1; http-response del-header replaces it
    http-response del-header Server
    http-request redirect code 301 location https://www.haxi.cc:8443%[capture.req.uri] if !{ ssl_fc }
    default_backend app

backend app
    mode http
    option httpchk HEAD /
    server app01 server1:3000 check inter 2000 rise 2 fall 5
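One way to check that the redirect behaves as intended is to look at the Location header haproxy returns for a plain http request. The helper below is a small sketch; location_header is a made-up name, and the URL in the usage comment reuses the host and port from the configuration above with an arbitrary path.

```shell
# Print the value of the first Location header found in HTTP response
# headers read from stdin.
location_header() {
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Usage (the path /foo is arbitrary):
#   curl -sI http://www.haxi.cc:8080/foo | location_header
```

With the second configuration, the printed URL should start with https://www.haxi.cc:8443 and keep the original path.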

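Incident Management Process

  1. Objective

    Restore normal service operation as quickly as possible (within the SLA [Service-Level Agreement]) and minimize the adverse impact on business operations.

  2. Scope

    Includes:

    • Failures, problems, or questions reported by users and technical staff
    • Events detected and reported automatically by monitoring tools
  3. Value to the business

    • The ability to detect and resolve incidents
    • The ability to align IT activity with real-time business priorities
    • The ability to identify potential service improvements
    • The service desk can identify additional service or training needs
    • Incident management is highly visible within the business, which makes its value easier to demonstrate and helps justify investment.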
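  4. Basic concepts

    • Timescales:

      • Based on the overall incident response and resolution targets in the SLA, a specific timescale must be set for each stage of incident handling and written down as targets in the OLA [Operational Level Agreement] and UC [Underpinning Contract]
      • All support groups must be clearly aware of these timescales
      • Service management tools can be used to enforce the timescales automatically and escalate according to predefined rules
    • Incident models:

      • Predefined "standard" incident models help match an incident to the appropriate handling when it occurs
      • Once the information required by the model has been entered into the incident support tool, the tool can handle, manage, and escalate the process automatically
    • An incident model includes:

      • The steps to take to handle the incident
      • The chronological order of these steps and their dependencies
      • Responsibilities
      • Timescales and thresholds for completing the actions
      • Escalation procedures: who should be contacted and when
      • Any necessary evidence preservation
    • Major incidents:

      • The organization must define clearly what constitutes a major incident
      • Where necessary, a major incident team can be formed dynamically
      • If the cause of the incident needs to be investigated, the problem manager should be involved as well
      • The service desk must ensure that all activities are recorded and that users are kept informed of progress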