redis-go-cluster

从redis-go-cluster报错源码分析

Posted by Liangjf on July 20, 2020

从redis-go-cluster报错源码分析

redis-go-cluster报错: client can't connect to Redis cluster, node.address is 172.16.7.16:8002@18002

现象: 可以连接集群, 但是执行命令出错

redis版本: redis-5.0.5 redis-go-cluster版本: 1.0.0

可以查看我这个issues:

https://github.com/wuxibin89/redis-go-cluster/issues/39

我的使用demo

```go
cluster, err := redis.NewCluster(
    &redis.Options{
        StartNodes:   nodes,
        ConnTimeout:  50 * time.Millisecond,
        ReadTimeout:  50 * time.Millisecond,
        WriteTimeout: 50 * time.Millisecond,
        KeepAlive:    16,
        AliveTime:    60 * time.Second,
    })
if err != nil {
    return nil, err
}

fmt.Println(redis.String(cluster.Do("GET", "name")))
```

但是在运行时报错:

```go
=== RUN   TestNewRedisPool
	--- FAIL: TestNewRedisPool (0.00s)
	panic: runtime error: invalid memory address or nil pointer dereference [recovered]
		panic: runtime error: invalid memory address or nil pointer dereference
	[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x544f2a]

	goroutine 7 [running]:
	testing.tRunner.func1(0xc0000ae100)
		/home/liangjf/app/go/src/testing/testing.go:874 +0x3a3
	panic(0x5749a0, 0x6d8890)
		/home/liangjf/app/go/src/runtime/panic.go:679 +0x1b2
	github.com/chasex/redis-go-cluster.(*redisConn).shutdown(...)
		/home/liangjf/ljf_home/code/go_home/pkg/mod/github.com/chasex/redis-go-cluster@v1.0.0/node.go:146
	github.com/chasex/redis-go-cluster.(*redisNode).do(0xc0000b5200, 0x5ab7fa, 0x3, 0xc00004c760, 0x1, 0x1, 0x10, 0x10, 0xc00004c760, 0x546b0a)
		/home/liangjf/ljf_home/code/go_home/pkg/mod/github.com/chasex/redis-go-cluster@v1.0.0/node.go:192 +0x23a
	github.com/chasex/redis-go-cluster.(*redisCluster).Do(0xc0000ac080, 0x5ab7fa, 0x3, 0xc00004c760, 0x1, 0x1, 0x1372e4f5, 0x1372e4f50000000f, 0x5ed5b3ba, 0xc00003af60)
		/home/liangjf/ljf_home/code/go_home/pkg/mod/github.com/chasex/redis-go-cluster@v1.0.0/cluster.go:254 +0x20f
	command-line-arguments.(*RedisPool).Get(...)
		/home/liangjf/opensource/gpusher/common/db/redis_string.go:40
	command-line-arguments.TestNewRedisPool(0xc0000ae100)
		/home/liangjf/opensource/gpusher/common/db/redis_test.go:23 +0x16b
	testing.tRunner(0xc0000ae100, 0x5b5de8)
		/home/liangjf/app/go/src/testing/testing.go:909 +0xc9
	created by testing.(*T).Run
		/home/liangjf/app/go/src/testing/testing.go:960 +0x350
	FAIL	command-line-arguments	0.006s
```

通过goland debug跟踪代码, 定位到这里出错net为nil, 不科学啊, 既然可以连接集群, 执行命令Do()时连接一般是可以分配的

```go
func (node *redisNode) getConn() (*redisConn, error) {
    ...
    c, err := net.DialTimeout("tcp", node.address, node.connTimeout)
    if err != nil {
        return nil, err
    }
    ...
} ```

执行命令(出错):

连接集群(正常):

我发现,发生panic时,node.address为172.16.7.16:8002@18002, 但连接集群时,node.address为172.16.7.16:8002

所以在调试模式下运行时,我将其手动修改为 172.16.7.16:8002@18002172.16.7.16:8002,连接和Do执行命令正常

继续跟踪代码调用栈,我发现node.address会这个函数里更新, 这个函数是client连接上redis cluster时会通过 CLUSTER NODES命令来获取集群节点的address

初步可以确定问题了, 是redis-go-cluster库的对CLUSTER NODES命令返回信息处理获取节点node节点的address出错, 导致在执行Do命令时, 由于会根据key来hash得到一个node来执行命令, 但是由于node.address 出错172.16.7.16:8002@18002(正确是172.16.7.16:8002),所以会导致执行Do()出错, 实质是net为nil

```go
//github.com/chasex/redis-go-cluster@v1.0.0/cluster.go:609
func (cluster *redisCluster) updateClustrInfo(node *redisNode) error {
    info, err := String(node.do("CLUSTER", "NODES"))
    infos := strings.Split(strings.Trim(info, "\n"), "\n")

    for i := range fields {
        fields[i] = strings.Split(infos[i], " ")
        if len(fields[i]) < kFieldSlot {
            return fmt.Errorf("missing field: %s [%d] [%d]", infos[i], len(fields[i]), kFieldSlot)
        }

        nodes[fields[i][kFieldAddr]] = &redisNode {
            name: fields[i][kFieldName],
            address: fields[i][kFieldAddr],
            slaves: make([]*redisNode, 0),
            connTimeout: cluster.connTimeout,
            readTimeout: cluster.readTimeout,
            writeTimeout: cluster.writeTimeout,
            keepAlive: cluster.keepAlive,
            aliveTime: cluster.aliveTime,
        }
    }
    ...
}
```

连接集群后,该函数很好的更新了集群信息,主要是处理cluster nodes命令返回的集群信息,然后获取每个节点的地址

例如在终端客户端执行cluster nodes会得到:

```go
	172.16.7.16:8001> cluster nodes
	78a8926d4ad002e88f7d499ff4e590f6385b8ffd 172.16.7.16:8005@18005 slave 37a21de453a1118217fed1ec3535463d1bfe5244 0 1591076153884 4 connected
	37a21de453a1118217fed1ec3535463d1bfe5244 172.16.7.16:8002@18002 master - 0 1591076151881 2 connected 5462-10922
	f83fc4f5b61e4886be681e3b19b952781de9d1bc 172.16.7.16:8003@18003 master - 0 1591076152000 0 connected 10923-16383
	b1a7ab40c42a0e8c9c5037aa3df7a701377567cd 172.16.7.16:8006@18006 slave f83fc4f5b61e4886be681e3b19b952781de9d1bc 0 1591076153000 5 connected
	3e1c4883e904adde6c4e917c9929954cf14a1a57 172.16.7.16:8001@18001 myself,master - 0 1591076152000 1 connected 0-5461
	4e6f4de2fab077ea612b05a4e264cb85a6a9eda7 172.16.7.16:8004@18004 slave 3e1c4883e904adde6c4e917c9929954cf14a1a57 0 1591076151000 3 connected
```

既然定位到问题, 所以我修改如下:

```go
//github.com/chasex/redis-go-cluster@v1.0.0/cluster.go:622
for i := range fields {
    fields[i] = strings.Split(infos[i], " ")
    if len(fields[i]) < kFieldSlot {
        return fmt.Errorf("missing field: %s [%d] [%d]", infos[i], len(fields[i]), kFieldSlot)
    }

    //这是我添加的: Handling the address field
    if strings.Contains(fields[i][kFieldAddr], "@") {
        fields[i][kFieldAddr] = strings.Split(fields[i][kFieldAddr], "@")[0]
    }

    nodes[fields[i][kFieldAddr]] = &redisNode {
        name: fields[i][kFieldName],
        address: fields[i][kFieldAddr],
        slaves: make([]*redisNode, 0),
        connTimeout: cluster.connTimeout,
        readTimeout: cluster.readTimeout,
        writeTimeout: cluster.writeTimeout,
        keepAlive: cluster.keepAlive,
        aliveTime: cluster.aliveTime,
    }
}
```

现在, 重新运行程序, client连接集群, 执行命令都没问题了. 果然是updateClustrInfo函数处理集群节点address的问题

通过查看最新master代码, 发现更新集群节点信息函数已经修改了, 换为使用CLUSTER SLOTS命令来获取节点node的address, 但是没有打tag, 因此使用go module拉取的还是tag v1.0.0的版本, 还是出现问题的那个版本.

```go
//cluster.go:314
func (cluster *Cluster) update(node *redisNode) error {
    info, err := Values(node.do("CLUSTER", "SLOTS"))
    ....
}
```

可以通过go mod replace来修改版本问题

replace github.com/chasex/redis-go-cluster => 
github.com/chasex/redis-go-cluster v1.0.1-0.20161207023922-222d81891f1d