Building a Distributed Tracing Setup with Sleuth + Zipkin

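In the era of monolithic applications, working through logs to follow a program's call flow and locate a problem was not especially hard. In the era of distributed services, however, services are spread across many machines, sometimes across data centers in different regions, and tracing an application's call chain is no longer that easy. This article uses Sleuth + Zipkin from the Spring Cloud ecosystem to walk through setting up tracing in a distributed environment.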

1. Technical Background

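In a microservice architecture, the system's functionality is delivered by many microservices working together. A single request may pass through several microservice instances before the final result is returned to the user. For example, placing an order in an e-commerce system involves the order system, the inventory system, the payment system, the notification system, and so on. Each of these systems is made up of microservices that may be deployed across hundreds or thousands of machines. With messages passed around in such a tangle, a failure calls for a mechanism that quickly pinpoints where it happened and which service is at fault. Distributed tracing emerged to meet this need.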

So what is "distributed tracing"?

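Distributed tracing records the calls between services at runtime and presents them in a visual UI, so that developers and operators can quickly locate the point of failure. This greatly speeds up troubleshooting. It is fair to say that tracing is a foundation for operating a microservice architecture. Without it, the call chain is a black box to whoever is investigating a problem, like being in the deep sea where no sunlight reaches.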

2. Spring Cloud Sleuth + Zipkin

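Within the Spring family, the Spring Cloud ecosystem provides the Sleuth component, which implements tracing for microservices by extending the logging output. Once Sleuth is added to a microservice, its log lines take the following form: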

2021-05-14 19:08:25.370 INFO [a-service,1811eabb5ec36fe3,1811eabb5ec36fe3,true] 9872 --- [nio-7000-exec-4] m.m.a.RequestResponseBodyMethodProcessor : Writing ["-> Service A -> Service B -> Service C"]

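As you can see, a bracketed section appears after INFO. Its content is the tracing data that Sleuth attaches to the microservice's logs, and it has a fixed format: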

[application_Id, traceId, spanId, isExport]

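The four fields mean the following:

application_Id: the microservice ID, which tells you which microservice produced the log line.
traceId: the trace ID. One complete business operation is called a trace. For example, if a login requires service A to call service B and service B to call service C, that whole login is one trace. From the moment the front end sends the request until it receives the response, each complete business operation has its own unique traceId.
spanId: the span (step) ID. The login above involves three microservices, A through C. In processing order, the log lines of each microservice are given a different spanId. One traceId has multiple spanIds, while a spanId belongs to exactly one traceId.
isExport: the export flag, which says whether this log entry is exported. When it is true, the trace data may be collected and displayed by a tracing visualization service.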
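The bracketed section is regular enough to pick apart with a few lines of code. The following is a minimal sketch, assuming the four-field format shown above. The class name SleuthLogPrefix and its parse method are hypothetical helpers for illustration and are not part of Sleuth.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Sleuth): extracts the four fields that
// appear in brackets after the log level, e.g. [a-service,1811...,1811...,true].
final class SleuthLogPrefix {
    final String application;
    final String traceId;
    final String spanId;
    final boolean exported;

    // [application, traceId (hex), spanId (hex), true|false]
    private static final Pattern PREFIX = Pattern.compile(
            "\\[([^,\\]]*),([0-9a-fA-F]*),([0-9a-fA-F]*),(true|false)\\]");

    private SleuthLogPrefix(String application, String traceId, String spanId, boolean exported) {
        this.application = application;
        this.traceId = traceId;
        this.spanId = spanId;
        this.exported = exported;
    }

    // Returns null when the line carries no such bracketed section.
    static SleuthLogPrefix parse(String logLine) {
        Matcher m = PREFIX.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new SleuthLogPrefix(m.group(1), m.group(2), m.group(3),
                Boolean.parseBoolean(m.group(4)));
    }

    public static void main(String[] args) {
        String line = "2021-05-14 19:08:25.370 INFO "
                + "[a-service,1811eabb5ec36fe3,1811eabb5ec36fe3,true] 9872 --- [nio-7000-exec-4] ...";
        SleuthLogPrefix p = parse(line);
        System.out.println(p.application + " trace=" + p.traceId
                + " span=" + p.spanId + " exported=" + p.exported);
    }
}
```

In the sample line the traceId and spanId are identical, which is what you would expect for the first span of a trace.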

Sleuth exports this data and hands it to Zipkin for collection and display.

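Zipkin is a project open-sourced by Twitter. It collects tracing data from the individual service instances and visualizes it. The figure below shows an example of the Zipkin UI.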

This view gives a very direct picture of the dependencies between services during a business operation, along with processing time and status.
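To make the idea behind such a view concrete, here is a minimal sketch that rebuilds a call tree from a flat list of spans. It assumes a deliberately simplified span model with one span per service. SpanRecord and TraceTree are hypothetical names, the durations are made-up illustration values, and real Zipkin spans carry many more fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical span record used only for this illustration.
final class SpanRecord {
    final String id;
    final String parentId; // null for the root span
    final String service;
    final long durationMicros;

    SpanRecord(String id, String parentId, String service, long durationMicros) {
        this.id = id;
        this.parentId = parentId;
        this.service = service;
        this.durationMicros = durationMicros;
    }
}

final class TraceTree {
    // Renders the spans of one trace as an indented tree, children under their parent.
    static String render(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> childrenByParent = new LinkedHashMap<>();
        List<SpanRecord> roots = new ArrayList<>();
        for (SpanRecord s : spans) {
            if (s.parentId == null) {
                roots.add(s);
            } else {
                childrenByParent.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s);
            }
        }
        StringBuilder out = new StringBuilder();
        for (SpanRecord root : roots) {
            append(out, root, childrenByParent, 0);
        }
        return out.toString();
    }

    private static void append(StringBuilder out, SpanRecord span,
                               Map<String, List<SpanRecord>> childrenByParent, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(span.service).append(" (").append(span.durationMicros).append("us)\n");
        for (SpanRecord child : childrenByParent.getOrDefault(span.id, new ArrayList<>())) {
            append(out, child, childrenByParent, depth + 1);
        }
    }

    public static void main(String[] args) {
        List<SpanRecord> spans = new ArrayList<>();
        spans.add(new SpanRecord("1", null, "a-service", 30000)); // illustrative durations
        spans.add(new SpanRecord("2", "1", "b-service", 20000));
        spans.add(new SpanRecord("3", "2", "c-service", 5000));
        System.out.print(render(spans));
    }
}
```

The parent links are what let a tool show which service called which, and the per-span durations are what make slow hops visible.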

3. Hands-on Setup

3.1 Creating the Projects

Without further ado, let's build a tracing demo based on Spring Cloud Sleuth + Zipkin. Before starting, make sure the prerequisites for the setup, such as the JDK, are installed on your machine.

To demonstrate the whole process, we create three Spring Boot projects named a-service, b-service and c-service, and add the following dependencies to each pom.xml.

xml

<properties>
    <java.version>1.8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <spring-boot.version>2.3.7.RELEASE</spring-boot.version>
    <spring-cloud-alibaba.version>2.2.2.RELEASE</spring-cloud-alibaba.version>
    <spring-cloud.version>Hoxton.SR9</spring-cloud.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>com.alibaba.cloud</groupId>
        <artifactId>spring-cloud-starter-alibaba-nacos-discovery</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-openfeign</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-sleuth</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-zipkin</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
        <exclusions>
            <exclusion>
                <groupId>org.junit.vintage</groupId>
                <artifactId>junit-vintage-engine</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>

In the demo, a client calls the endpoint exposed by a-service, a-service calls b-service, and b-service calls c-service. That is why a-service and b-service additionally depend on OpenFeign for remote calls.

3.2 Configuration Files

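Next, configure each of the three projects by adding the following to its application.yml, adjusting the port and application name per service as noted in the comments: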

yml

server:
  port: 7000 #a-service:7000/b-service:8000/c-service:9000 
spring:
  cloud:
    nacos:
      discovery:
        server-addr: 192.168.110.129:8848
        username: nacos
        password: nacos

  application:
    name: a-service #a-service/b-service/c-service


logging:
  level:
    root: info

3.3 Implementing the Business Logic

a-service

In a-service, write the following code:

java

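import javax.annotation.Resource;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SampleController {
    // Feign client used to call b-service
    @Resource
    private BServiceFeignClient bService;

    // Entry point for clients: calls b-service and prepends this service's name
    @GetMapping("/a")
    public String methodA() {
        String result = bService.methodB();
        return "Service A" + result;
    }
}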

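methodA() is the endpoint exposed to clients. It in turn calls b-service through the BServiceFeignClient interface. This assumes Feign clients are enabled on the application class with @EnableFeignClients, which is not shown here. The code of BServiceFeignClient is as follows: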

java

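import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;

// Declarative client for the service registered as "b-service"
@FeignClient("b-service")
public interface BServiceFeignClient {
    @GetMapping("/b")
    String methodB();
}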

Similarly, write the following code in b-service:

java

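import javax.annotation.Resource;

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
public class SampleController {
    // Feign client used to call c-service
    @Resource
    private CServiceFeignClient cService;

    // Calls c-service and prepends this service's name to the result
    @GetMapping("/b")
    @ResponseBody
    public String methodB() {
        String result = cService.methodC();
        return " -> Service B" + result;
    }
}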

The code of CServiceFeignClient is as follows:

java

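import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;

// Declarative client for the service registered as "c-service"
@FeignClient("c-service")
public interface CServiceFeignClient {
    @GetMapping("/c")
    String methodC();
}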

c-service only needs to provide the endpoint that CServiceFeignClient calls. The code is as follows:

java

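import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

/**
 * SampleController: methodC returns the response string " -> Service C"
 * and is mapped to "/c".
 */
@RestController
public class SampleController {
    @GetMapping("/c")
    public String methodC() {
        return " -> Service C";
    }
}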

A complete call chain is now in place. After starting the three service instances, open http://localhost:7000/a on service A. The result is:

Code

Service A -> Service B -> Service C

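In the console logs you can see that Spring Cloud Sleuth has automatically added tracing data to each line. Now that the data is being produced, the next step is to collect it and display the call chain in a visual UI, which is where Zipkin comes in.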
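All three services can log the same traceId because the trace context travels with every outgoing call. Sleuth by default carries it in B3 HTTP headers such as X-B3-TraceId and X-B3-SpanId. The following is a minimal sketch of that idea only. TraceContextSketch is a hypothetical class, not Sleuth's implementation, and a real tracer creates separate client and server spans and also propagates the sampling decision.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical, simplified model of trace-context propagation.
final class TraceContextSketch {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the first span of a trace

    private TraceContextSketch(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    // 16 lowercase hex characters, the same shape as the IDs in the sample log line.
    private static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // On an incoming request: continue the caller's trace if the headers are
    // present, otherwise start a new trace.
    static TraceContextSketch fromIncoming(Map<String, String> headers) {
        String incomingTraceId = headers.get("X-B3-TraceId");
        if (incomingTraceId == null) {
            String id = newId();
            return new TraceContextSketch(id, id, null); // root span reuses the trace ID
        }
        return new TraceContextSketch(incomingTraceId, newId(), headers.get("X-B3-SpanId"));
    }

    // Before an outgoing call: the headers the next service will read.
    Map<String, String> toOutgoing() {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
        return headers;
    }

    public static void main(String[] args) {
        TraceContextSketch a = fromIncoming(new HashMap<>()); // client -> a-service
        TraceContextSketch b = fromIncoming(a.toOutgoing());  // a-service -> b-service
        TraceContextSketch c = fromIncoming(b.toOutgoing());  // b-service -> c-service
        System.out.println("a: trace=" + a.traceId + " span=" + a.spanId);
        System.out.println("b: trace=" + b.traceId + " span=" + b.spanId + " parent=" + b.parentSpanId);
        System.out.println("c: trace=" + c.traceId + " span=" + c.spanId + " parent=" + c.parentSpanId);
    }
}
```

Running it prints one shared trace ID across the three hops, with each hop pointing at the span that called it.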

3.4 Adding Zipkin

Deploying the Zipkin Server

Deploying the Zipkin server is very simple. The official site has a quick start guide.

Zipkin Quick Start

Run the following commands:

curl -sSL https://zipkin.io/quickstart.sh | bash -s

java -jar zipkin.jar

and Zipkin is up and running.

When the startup message appears, the Zipkin server has started normally.

Adding the Zipkin Client to the Projects

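Since the projects are built with Maven, make sure pom.xml contains the following dependency (it is already part of the dependency list in 3.1):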

xml

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>

Then open application.yml and add the following:

yml

spring:
  ...
  sleuth:
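    sampler: # sampler settings
      probability: 1.0 # sampling probability: the fraction of traces collected, default 0.1
      rate: 10000 # traces collected per second, at most n traces per second
  zipkin:
    # address of the Zipkin server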
    base-url: http://localhost:9411

Note that the zipkin key sits at the same level as the sleuth key. Watch the indentation, otherwise the configuration will not take effect.

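The two sampler settings deserve a closer look:

spring.sleuth.sampler.probability is the sampling probability. Suppose service instance a produced 10 traces in the past second. With the default probability of 0.1, only 1 of them is sent to the Zipkin server for analysis. With a value of 1, all 10 traces are sent to the server.
spring.sleuth.sampler.rate is the maximum number of traces collected per second. Anything beyond that limit is discarded.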
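The two settings describe two different ways of deciding which traces to keep. The following is a minimal sketch of both under simplifying assumptions. ProbabilitySampler and RateLimitSampler here are hypothetical stand-ins written for this article, not Sleuth's sampler classes, and the counting scheme is chosen only to keep the example deterministic.

```java
// Hypothetical stand-ins for illustration; these are not Sleuth's classes.
final class SamplerSketch {

    // Keeps a fixed share of traces, spread evenly: with 0.1 it keeps one
    // trace out of every ten decisions.
    static final class ProbabilitySampler {
        private final int percent;
        private long seen;

        ProbabilitySampler(double probability) {
            this.percent = (int) Math.round(probability * 100);
        }

        boolean isSampled() {
            long n = seen++;
            // Keep the trace whenever the running quota crosses a whole number.
            return ((n + 1) * percent) / 100 > (n * percent) / 100;
        }
    }

    // Keeps at most maxPerSecond traces in each one-second window and drops the rest.
    static final class RateLimitSampler {
        private final int maxPerSecond;
        private long windowSecond = Long.MIN_VALUE;
        private int usedInWindow;

        RateLimitSampler(int maxPerSecond) {
            this.maxPerSecond = maxPerSecond;
        }

        boolean isSampled(long nowMillis) {
            long second = nowMillis / 1000;
            if (second != windowSecond) { // a new window starts: reset the budget
                windowSecond = second;
                usedInWindow = 0;
            }
            if (usedInWindow >= maxPerSecond) {
                return false;
            }
            usedInWindow++;
            return true;
        }
    }

    public static void main(String[] args) {
        ProbabilitySampler tenPercent = new ProbabilitySampler(0.1);
        int kept = 0;
        for (int i = 0; i < 10; i++) {
            if (tenPercent.isSampled()) {
                kept++;
            }
        }
        System.out.println("probability 0.1 kept " + kept + " of 10 traces");

        RateLimitSampler limited = new RateLimitSampler(3);
        int keptInWindow = 0;
        for (int i = 0; i < 5; i++) {
            if (limited.isSampled(0L)) {
                keptInWindow++;
            }
        }
        System.out.println("rate 3 kept " + keptInWindow + " of 5 traces in one second");
    }
}
```

The time is passed in explicitly so the rate limiter can be exercised without waiting for real seconds to pass.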

After adding this configuration to all three projects, restart the services and open http://localhost:7000/a again.

Then open http://localhost:9411/zipkin and click the Run Query button. You will see the following screen:

Click any entry in the results to view its call chain in detail, as in Figure 1.
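The same trace data can also be read without the UI, since the Zipkin server exposes an HTTP API under /api/v2. The following is a minimal sketch that asks for recent traces of a-service. ZipkinQuerySketch is a hypothetical class, and the base URL and service name are taken from the setup above. The fetch only works while the Zipkin server is running.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration: queries the Zipkin v2 traces endpoint.
final class ZipkinQuerySketch {

    // Builds the query URL for the most recent traces of one service.
    static String tracesUrl(String baseUrl, String serviceName, int limit) {
        try {
            return baseUrl + "/api/v2/traces?serviceName="
                    + URLEncoder.encode(serviceName, "UTF-8") + "&limit=" + limit;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Performs a plain GET and returns the response body (JSON) as text.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(3000);
        conn.setReadTimeout(3000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes Zipkin is listening on the address configured in base-url.
        System.out.println(fetch(tracesUrl("http://localhost:9411", "a-service", 5)));
    }
}
```

The response is a JSON array of traces, each of which is itself an array of spans.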

At this point the setup is essentially complete.