I originally hadn't planned to write this comparison. But some time ago I chatted for a while with the author of [...], and found some room for improvement in my coroutine library. I also happened to read the cc part of the new boost.context code, which gave me some ideas. So I wanted to make a few optimizations, mainly in these areas: fewer allocations (and thus less memory fragmentation), rvalue references in some places to avoid unnecessary copies on compilers that support them, fewer atomic operations, and fewer L1 cache misses.
After reworking a huge number of flows and interfaces, I released a v2. The optimization work isn't entirely finished, but the overall structure has settled, so it can already be used for stress testing. For convenience later on, I also wired the cppcheck and clang-analyzer static analysis tools into the dev scripts. Then came the complete surprise: with a large number of coroutines, the benchmark results showed performance had actually dropped by about one third compared with the original.
I analyzed the cause using the valgrind and perf reports. The original version's L1 data-cache read miss rate was around 68%, while in v2 it reached 98%. The reason: previously the coroutine, the coroutine task object, and the task's actor were each allocated with malloc, whereas in v2, as an optimization, they are all placed inside the allocated execution stack. That eliminates three malloc calls, but it also means that different coroutines, tasks, and actors end up very far apart in memory, far beyond a single cache line (typically 64 bytes), so L1 cache misses become inevitable. In the original benchmark, by contrast, these objects were allocated consecutively and therefore sat close to one another, which is exactly why the old version benchmarked faster.
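To make the locality argument concrete, here is a minimal toy sketch of my own (an assumption for illustration, not the library's actual layout). It walks the same 64-byte objects twice: once stored contiguously in a single allocation, and once reached through individually allocated pointers visited in shuffled order, which mimics objects scattered far apart on the heap:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Toy illustration of the cache-locality effect described above; it is
// not the coroutine library's real layout. `task` stands in for any
// per-coroutine object that occupies about one cache line (64 bytes).
type task struct {
	state [64]byte
}

const n = 1 << 20

func walkDense(dense []task) byte {
	var sum byte
	for i := range dense {
		sum += dense[i].state[0] // neighbors share/prefetch cache lines
	}
	return sum
}

func walkSparse(sparse []*task) byte {
	var sum byte
	for _, t := range sparse {
		sum += t.state[0] // each step may touch a distant cache line
	}
	return sum
}

func main() {
	// Contiguous: one allocation, objects packed back to back.
	dense := make([]task, n)

	// Scattered: n separate allocations, visited in shuffled order so
	// the access pattern no longer matches the allocation order.
	sparse := make([]*task, n)
	for i := range sparse {
		sparse[i] = new(task)
	}
	rand.Shuffle(n, func(i, j int) { sparse[i], sparse[j] = sparse[j], sparse[i] })

	start := time.Now()
	s1 := walkDense(dense)
	fmt.Println("contiguous walk:", time.Since(start), s1)

	start = time.Now()
	s2 := walkSparse(sparse)
	fmt.Println("scattered walk: ", time.Since(start), s2)
}
```

On a typical machine the scattered walk should come out several times slower, for exactly the reason described above.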
If anything, this shows that the original benchmark is even less suitable as a performance reference for real-world use. As I've said before, in actual application scenarios cache misses are almost inevitable: the logic is far more complex, and memory accesses and switches are far more frequent.
This suddenly reminded me of the [go][3] language's goroutines. I don't know whether they account for the cache miss cost of switching, or whether any optimization targets it. Since goroutines are implemented at the language level, cache-oriented optimizations should be comparatively easy to do there. Still, their dynamic stacks are bound to carry some overhead.
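As a toy sketch of that dynamic-stack overhead (again an illustration of mine, not a measurement of Go's scheduler): a goroutine starts with a small stack, a few KB depending on the Go version, and the runtime grows it by copying when a deep call chain needs more space, so the first deep call in a fresh goroutine pays a growth cost that a repeat call on the already-grown stack does not:

```go
package main

import (
	"fmt"
	"time"
)

//go:noinline
func deep(n int, pad [128]byte) byte {
	// Each frame carries ~128 bytes, so 2000 frames need ~256KB,
	// forcing several stack growths from the initial few-KB stack.
	if n == 0 {
		return pad[0]
	}
	return deep(n-1, pad)
}

func main() {
	var pad [128]byte
	for round := 1; round <= 3; round++ {
		done := make(chan struct{})
		go func() {
			t := time.Now()
			deep(2000, pad) // pays for the stack growth
			first := time.Since(t)
			t = time.Now()
			deep(2000, pad) // stack already grown: no copy cost
			second := time.Since(t)
			fmt.Printf("round %d: first %v, repeat %v\n", round, first, second)
			close(done)
		}()
		<-done
	}
}
```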
goroutine stress test
Here I used roughly the same test method as in [...]. The benchmark code is as follows:
```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// runCallback echoes every value received on `in` back on `out`
// until `in` is closed.
func runCallback(in, out chan int64) {
	for n, ok := <-in; ok; n, ok = <-in {
		out <- n
	}
}

func runTest(round int, coroutineNum, switchTimes int64) {
	fmt.Printf("##### Round: %v\n", round)

	// Create one input and one output channel per goroutine.
	start := time.Now()
	channelsIn, channelsOut := make([]chan int64, coroutineNum), make([]chan int64, coroutineNum)
	for i := int64(0); i < coroutineNum; i++ {
		channelsIn[i] = make(chan int64, 1)
		channelsOut[i] = make(chan int64, 1)
	}
	end := time.Now()
	fmt.Printf("Create %v goroutines and channels cost %vns, avg %vns\n", coroutineNum, end.Sub(start).Nanoseconds(), end.Sub(start).Nanoseconds()/coroutineNum)

	// Start the goroutines.
	start = time.Now()
	for i := int64(0); i < coroutineNum; i++ {
		go runCallback(channelsIn[i], channelsOut[i])
	}
	end = time.Now()
	fmt.Printf("Start %v goroutines and channels cost %vns, avg %vns\n", coroutineNum, end.Sub(start).Nanoseconds(), end.Sub(start).Nanoseconds()/coroutineNum)

	// Ping-pong with every goroutine switchTimes times; sum ends up
	// counting the total number of round trips.
	var sum int64 = 0
	start = time.Now()
	for i := int64(0); i < switchTimes; i++ {
		for j := int64(0); j < coroutineNum; j++ {
			channelsIn[j] <- 1
			sum += <-channelsOut[j]
		}
	}
	end = time.Now()
	fmt.Printf("Switch %v goroutines for %v times cost %vns, avg %vns\n", coroutineNum, sum, end.Sub(start).Nanoseconds(), end.Sub(start).Nanoseconds()/sum)

	// Closing the input channels lets every goroutine exit its loop.
	start = time.Now()
	for i := int64(0); i < coroutineNum; i++ {
		close(channelsIn[i])
		close(channelsOut[i])
	}
	end = time.Now()
	fmt.Printf("Close %v goroutines cost %vns, avg %vns\n", coroutineNum, end.Sub(start).Nanoseconds(), end.Sub(start).Nanoseconds()/coroutineNum)
}

func main() {
	// Defaults: 30000 goroutines and 1000 switch rounds each
	// (30000000 switches in total); both can be overridden via argv.
	var coroutineNum, switchTimes int64 = 30000, 1000
	fmt.Printf("### Run: ")
	for _, v := range os.Args {
		fmt.Printf(" \"%s\"", v)
	}
	fmt.Printf("\n")
	if len(os.Args) > 1 {
		v, _ := strconv.Atoi(os.Args[1])
		coroutineNum = int64(v)
	}
	if len(os.Args) > 2 {
		v, _ := strconv.Atoi(os.Args[2])
		switchTimes = int64(v)
	}
	for i := 1; i <= 5; i++ {
		runTest(i, coroutineNum, switchTimes)
	}
}
```
The test results are as follows:
```
PS D:\projs\test\go> .\test_goroutine.exe
### Run: "D:\projs\test\go\test_goroutine.exe"
##### Round: 1
Create 30000 goroutines and channels cost 6515200ns, avg 217ns
Start 30000 goroutines and channels cost 79505000ns, avg 2650ns
Switch 30000 goroutines for 30000000 times cost 42225426300ns, avg 1407ns
Close 30000 goroutines cost 15017500ns, avg 500ns
##### Round: 2
Create 30000 goroutines and channels cost 19868200ns, avg 662ns
Start 30000 goroutines and channels cost 22487700ns, avg 749ns
Switch 30000 goroutines for 30000000 times cost 44709165100ns, avg 1490ns
Close 30000 goroutines cost 15559000ns, avg 518ns
##### Round: 3
Create 30000 goroutines and channels cost 3999700ns, avg 133ns
Start 30000 goroutines and channels cost 17508400ns, avg 583ns
Switch 30000 goroutines for 30000000 times cost 50535999000ns, avg 1684ns
Close 30000 goroutines cost 36289900ns, avg 1209ns
##### Round: 4
Create 30000 goroutines and channels cost 5999600ns, avg 199ns
Start 30000 goroutines and channels cost 44500300ns, avg 1483ns
Switch 30000 goroutines for 30000000 times cost 45678842800ns, avg 1522ns
Close 30000 goroutines cost 13005600ns, avg 433ns
##### Round: 5
Create 30000 goroutines and channels cost 5000000ns, avg 166ns
Start 30000 goroutines and channels cost 14001000ns, avg 466ns
Switch 30000 goroutines for 30000000 times cost 47485810100ns, avg 1582ns
Close 30000 goroutines cost 17999800ns, avg 599ns
```
That said, goroutine memory overhead really is small: 30000 goroutines take only about 300MB.
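The 300MB figure was presumably observed externally (e.g. in the OS process monitor); a hedged sketch of approximating it in-process with runtime.MemStats could look like this (the 30000 count and the output labels are just illustrative):

```go
package main

import (
	"fmt"
	"runtime"
)

// Rough in-process estimate of per-goroutine memory cost; a sketch,
// not an exact accounting (MemStats granularity is coarse).
func main() {
	const num = 30000

	runtime.GC()
	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	done := make(chan struct{})
	for i := 0; i < num; i++ {
		go func() { <-done }() // park each goroutine on the channel
	}

	runtime.GC()
	var after runtime.MemStats
	runtime.ReadMemStats(&after)

	fmt.Printf("goroutines: %d\n", runtime.NumGoroutine())
	fmt.Printf("stack memory: %d KB\n", after.StackSys/1024)
	fmt.Printf("total growth: %d KB, avg %d B/goroutine\n",
		(after.Sys-before.Sys)/1024, (after.Sys-before.Sys)/num)
	close(done)
}
```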
For my own library I'll only paste the results with the same coroutine count and switch count:
```
###################### task (stack using stack pool) ###################
########## Cmd: .\sample_benchmark_task_stack_pool.exe 30000 1000 64
### Round: 1 ###
create 30000 task, cost time: 0 s, clock time: 104 ms, avg: 3466 ns
switch 30000 tasks 30000000 times, cost time: 18 s, clock time: 18500 ms, avg: 616 ns
remove 30000 tasks, cost time: 0 s, clock time: 28 ms, avg: 933 ns
### Round: 2 ###
create 30000 task, cost time: 0 s, clock time: 44 ms, avg: 1466 ns
switch 30000 tasks 30000000 times, cost time: 19 s, clock time: 18341 ms, avg: 611 ns
remove 30000 tasks, cost time: 0 s, clock time: 29 ms, avg: 966 ns
### Round: 3 ###
create 30000 task, cost time: 0 s, clock time: 44 ms, avg: 1466 ns
switch 30000 tasks 30000000 times, cost time: 18 s, clock time: 18188 ms, avg: 606 ns
remove 30000 tasks, cost time: 0 s, clock time: 28 ms, avg: 933 ns
### Round: 4 ###
create 30000 task, cost time: 0 s, clock time: 44 ms, avg: 1466 ns
switch 30000 tasks 30000000 times, cost time: 18 s, clock time: 18267 ms, avg: 608 ns
remove 30000 tasks, cost time: 0 s, clock time: 28 ms, avg: 933 ns
### Round: 5 ###
create 30000 task, cost time: 0 s, clock time: 44 ms, avg: 1466 ns
switch 30000 tasks 30000000 times, cost time: 19 s, clock time: 18772 ms, avg: 625 ns
remove 30000 tasks, cost time: 0 s, clock time: 26 ms, avg: 866 ns
```
These are actually v2's test numbers. The switch cost is somewhat higher than before, but judging from earlier results on Linux, the creation cost is already half that of the original version (on Linux, creation used to cost roughly 1us, while v2 takes about 500ns; I forget the original switch cost, but v2 is around 300-400ns).
Conclusion