从redis-go-cluster报错源码分析
redis-go-cluster报错: client can't connect to Redis cluster, node.address is 172.16.7.16:8002@18002
现象: 可以连接集群, 但是执行命令出错
redis版本: redis-5.0.5 redis-go-cluster版本: 1.0.0
可以查看我这个issues:
https://github.com/wuxibin89/redis-go-cluster/issues/39
我的使用demo
```go
cluster, err := redis.NewCluster(
&redis.Options{
StartNodes: nodes,
ConnTimeout: 50 * time.Millisecond,
ReadTimeout: 50 * time.Millisecond,
WriteTimeout: 50 * time.Millisecond,
KeepAlive: 16,
AliveTime: 60 * time.Second,
})
if err != nil {
return nil, err
}
fmt.Println(redis.String(cluster.Do("GET", "name")))
```
但是在运行时报错:
```go
=== RUN TestNewRedisPool
--- FAIL: TestNewRedisPool (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x544f2a]
goroutine 7 [running]:
testing.tRunner.func1(0xc0000ae100)
/home/liangjf/app/go/src/testing/testing.go:874 +0x3a3
panic(0x5749a0, 0x6d8890)
/home/liangjf/app/go/src/runtime/panic.go:679 +0x1b2
github.com/chasex/redis-go-cluster.(*redisConn).shutdown(...)
/home/liangjf/ljf_home/code/go_home/pkg/mod/github.com/chasex/redis-go-cluster@v1.0.0/node.go:146
github.com/chasex/redis-go-cluster.(*redisNode).do(0xc0000b5200, 0x5ab7fa, 0x3, 0xc00004c760, 0x1, 0x1, 0x10, 0x10, 0xc00004c760, 0x546b0a)
/home/liangjf/ljf_home/code/go_home/pkg/mod/github.com/chasex/redis-go-cluster@v1.0.0/node.go:192 +0x23a
github.com/chasex/redis-go-cluster.(*redisCluster).Do(0xc0000ac080, 0x5ab7fa, 0x3, 0xc00004c760, 0x1, 0x1, 0x1372e4f5, 0x1372e4f50000000f, 0x5ed5b3ba, 0xc00003af60)
/home/liangjf/ljf_home/code/go_home/pkg/mod/github.com/chasex/redis-go-cluster@v1.0.0/cluster.go:254 +0x20f
command-line-arguments.(*RedisPool).Get(...)
/home/liangjf/opensource/gpusher/common/db/redis_string.go:40
command-line-arguments.TestNewRedisPool(0xc0000ae100)
/home/liangjf/opensource/gpusher/common/db/redis_test.go:23 +0x16b
testing.tRunner(0xc0000ae100, 0x5b5de8)
/home/liangjf/app/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
/home/liangjf/app/go/src/testing/testing.go:960 +0x350
FAIL command-line-arguments 0.006s
```
通过goland debug跟踪代码, 定位到这里出错net为nil, 不科学啊, 既然可以连接集群, 执行命令Do()时连接一般是可以分配的
```go
func (node *redisNode) getConn() (*redisConn, error) {
...
c, err := net.DialTimeout("tcp", node.address, node.connTimeout)
if err != nil {
return nil, err
}
...
} ```
执行命令(出错):
连接集群(正常):
我发现,发生panic时,node.address为172.16.7.16:8002@18002
, 但连接集群时,node.address为172.16.7.16:8002
所以在调试模式下运行时,我将其手动修改为 172.16.7.16:8002@18002
为 172.16.7.16:8002
,连接和Do执行命令正常
继续跟踪代码调用栈,我发现node.address会这个函数里更新, 这个函数是client连接上redis cluster时会通过 CLUSTER NODES
命令来获取集群节点的address
初步可以确定问题了, 是redis-go-cluster库的对CLUSTER NODES
命令返回信息处理获取节点node节点的address出错, 导致在执行Do命令时, 由于会根据key来hash得到一个node来执行命令, 但是由于node.address
出错172.16.7.16:8002@18002
(正确是172.16.7.16:8002
),所以会导致执行Do()出错, 实质是net为nil
```go
//github.com/chasex/redis-go-cluster@v1.0.0/cluster.go:609
func (cluster *redisCluster) updateClustrInfo(node *redisNode) error {
info, err := String(node.do("CLUSTER", "NODES"))
infos := strings.Split(strings.Trim(info, "\n"), "\n")
for i := range fields {
fields[i] = strings.Split(infos[i], " ")
if len(fields[i]) < kFieldSlot {
return fmt.Errorf("missing field: %s [%d] [%d]", infos[i], len(fields[i]), kFieldSlot)
}
nodes[fields[i][kFieldAddr]] = &redisNode {
name: fields[i][kFieldName],
address: fields[i][kFieldAddr],
slaves: make([]*redisNode, 0),
connTimeout: cluster.connTimeout,
readTimeout: cluster.readTimeout,
writeTimeout: cluster.writeTimeout,
keepAlive: cluster.keepAlive,
aliveTime: cluster.aliveTime,
}
}
...
}
```
连接集群后,该函数很好的更新了集群信息,主要是处理cluster nodes命令返回的集群信息,然后获取每个节点的地址
例如在终端客户端执行cluster nodes
会得到:
```go
172.16.7.16:8001> cluster nodes
78a8926d4ad002e88f7d499ff4e590f6385b8ffd 172.16.7.16:8005@18005 slave 37a21de453a1118217fed1ec3535463d1bfe5244 0 1591076153884 4 connected
37a21de453a1118217fed1ec3535463d1bfe5244 172.16.7.16:8002@18002 master - 0 1591076151881 2 connected 5462-10922
f83fc4f5b61e4886be681e3b19b952781de9d1bc 172.16.7.16:8003@18003 master - 0 1591076152000 0 connected 10923-16383
b1a7ab40c42a0e8c9c5037aa3df7a701377567cd 172.16.7.16:8006@18006 slave f83fc4f5b61e4886be681e3b19b952781de9d1bc 0 1591076153000 5 connected
3e1c4883e904adde6c4e917c9929954cf14a1a57 172.16.7.16:8001@18001 myself,master - 0 1591076152000 1 connected 0-5461
4e6f4de2fab077ea612b05a4e264cb85a6a9eda7 172.16.7.16:8004@18004 slave 3e1c4883e904adde6c4e917c9929954cf14a1a57 0 1591076151000 3 connected
```
既然定位到问题, 所以我修改如下:
```go
//github.com/chasex/redis-go-cluster@v1.0.0/cluster.go:622
for i := range fields {
fields[i] = strings.Split(infos[i], " ")
if len(fields[i]) < kFieldSlot {
return fmt.Errorf("missing field: %s [%d] [%d]", infos[i], len(fields[i]), kFieldSlot)
}
//这是我添加的: Handling the address field
if strings.Contains(fields[i][kFieldAddr], "@") {
fields[i][kFieldAddr] = strings.Split(fields[i][kFieldAddr], "@")[0]
}
nodes[fields[i][kFieldAddr]] = &redisNode {
name: fields[i][kFieldName],
address: fields[i][kFieldAddr],
slaves: make([]*redisNode, 0),
connTimeout: cluster.connTimeout,
readTimeout: cluster.readTimeout,
writeTimeout: cluster.writeTimeout,
keepAlive: cluster.keepAlive,
aliveTime: cluster.aliveTime,
}
}
```
现在, 重新运行程序, client连接集群, 执行命令都没问题了. 果然是updateClustrInfo
函数处理集群节点address的问题
通过查看最新master代码, 发现更新集群节点信息函数已经修改了, 换为使用CLUSTER SLOTS
命令来获取节点node的address, 但是没有打tag, 因此使用go module拉取的还是tag v1.0.0的版本, 还是出现问题的那个版本.
```go
//cluster.go:314
func (cluster *Cluster) update(node *redisNode) error {
info, err := Values(node.do("CLUSTER", "SLOTS"))
....
}
```
可以通过go mod replace来修改版本问题
replace github.com/chasex/redis-go-cluster =>
github.com/chasex/redis-go-cluster v1.0.1-0.20161207023922-222d81891f1d