之前有太多的堆内存分配了,导致很多碎片。那么第一个想法就是协程对象可以分配在栈上,runner也可以分配在栈上。然后还可以加一个自定义预留长度。每个对象对齐到sizeof(long),总长度对齐到64 Bytes。
原来栈分配和协程对象分配是分开的,现在必须合并了。并且创建时可以指定分配多少预留空间。栈空间回收的操作必须切只能在析构结束最后执行,也就是智能指针的析构需要自己定义。
原来很多地方用得std::shared_ptr,这些也会导致额外的内存分配。这个目的本来是为了线程安全和生命周期管理,可以用make_shared来合并对象和管理区。如果我把协程对象分配在栈上,那就不能用make_shared了,那么内存碎片也就又出现了。所以我想再引入侵入式智能指针来解决这个问题。同时还可以选择要不要考虑线程安全,如果可以不考虑线程安全就可以节省掉原子操作的L1 cache miss的开销。
所以减少原子操作,一方面是要简化cotask和copp里的两次,可以优化成一次。另外可以开启一个非线程安全选项,这时候吧原子整数换成普通整数。因为我们游戏进程里都是单线程的服务,这种开销也不需要。
本来大部分逻辑是可以用简单的方法,无视掉低量的复制消耗的。但是随着原子操作的消耗增加,我们就不得不注意类似智能指针复制导致的这种原子开销。所以就得更加智能地判定编译器并且使用右值,因为使用右值转移智能指针的话,不涉及计数器的增减,所以也就没有这个开销。
###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 1000 1000 2048
### Round: 1 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 5 ms, avg: 5000 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 82 ms, avg: 82 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 82 ms, avg: 82 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
allocate 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 coroutine contest 1000000 times, cost time: 0 s, clock time: 83 ms, avg: 83 ns
remove 1000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 1 3000000 16
### Round: 1 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 1 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 162 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 163 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
allocate 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 coroutine contest 3000000 times, cost time: 0 s, clock time: 163 ms, avg: 54 ns
remove 1 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
###################### context coroutine (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_coroutine_stack_pool 30000 100 64
### Round: 1 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 145 ms, avg: 4833 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 771 ms, avg: 257 ns
remove 30000 coroutine, cost time: 0 s, clock time: 6 ms, avg: 200 ns
### Round: 2 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 10 ms, avg: 333 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 812 ms, avg: 270 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns
### Round: 3 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 793 ms, avg: 264 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns
### Round: 4 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 0 s, clock time: 787 ms, avg: 262 ns
remove 30000 coroutine, cost time: 0 s, clock time: 5 ms, avg: 166 ns
### Round: 5 ###
allocate 30000 coroutine, cost time: 0 s, clock time: 0 ms, avg: 0 ns
create 30000 coroutine, cost time: 0 s, clock time: 9 ms, avg: 300 ns
switch 30000 coroutine contest 3000000 times, cost time: 1 s, clock time: 804 ms, avg: 268 ns
remove 30000 coroutine, cost time: 0 s, clock time: 4 ms, avg: 133 ns
###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 1000 1000 2048
### Round: 1 ###
create 1000 task, cost time: 0 s, clock time: 5 ms, avg: 5000 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 159 ms, avg: 159 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
create 1000 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1000 tasks 1000000 times, cost time: 0 s, clock time: 160 ms, avg: 160 ns
remove 1000 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 1 3000000 16
### Round: 1 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 1 s, clock time: 351 ms, avg: 117 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 2 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 347 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 3 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 4 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 1 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
### Round: 5 ###
create 1 task, cost time: 0 s, clock time: 0 ms, avg: 0 ns
switch 1 tasks 3000000 times, cost time: 0 s, clock time: 346 ms, avg: 115 ns
remove 1 tasks, cost time: 0 s, clock time: 0 ms, avg: 0 ns
###################### task (stack using stack pool) ###################
########## Cmd: ./sample_benchmark_task_stack_pool 30000 100 64
### Round: 1 ###
create 30000 task, cost time: 1 s, clock time: 150 ms, avg: 5000 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1407 ms, avg: 469 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 2 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1404 ms, avg: 468 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 3 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 2 s, clock time: 1420 ms, avg: 473 ns
remove 30000 tasks, cost time: 0 s, clock time: 14 ms, avg: 466 ns
### Round: 4 ###
create 30000 task, cost time: 0 s, clock time: 16 ms, avg: 533 ns
switch 30000 tasks 3000000 times, cost time: 1 s, clock time: 1419 ms, avg: 473 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
### Round: 5 ###
create 30000 task, cost time: 0 s, clock time: 14 ms, avg: 466 ns
switch 30000 tasks 3000000 times, cost time: 2 s, clock time: 1389 ms, avg: 463 ns
remove 30000 tasks, cost time: 0 s, clock time: 13 ms, avg: 433 ns
这一批优化过后,其实切换性能并没有提高,反而下降了,我查了一下是原先对read的L1 cache命中率也比较高,但是现在下降了。不过我认为这种miss的情况可能更能说明问题,因为现在是几乎0逻辑,一旦有逻辑之后也容易增加cache miss。究其原因,应该是新的组织方式把一个协程(特别是task)的维护数据放在了栈上,那么切换的时候切换地址就会比较遥远。而之前的方式,只要多个协程对象靠得近,而协程对象又足够小(task对象里的数据比较大,但是coroutine里的比较小),因为只有压测代码,所以new的时候malloc一般会把类似大小的对象放一起,这也增加了命中率。但是实际应用的话,必然中间会插入很多其他对象的,所以这个情况也会不贴近实际场景。但是另一方面,v2的优化也使得在使用栈池的情况下,分配开销被大幅降低了,我们的项目里由于开始需要大量的定时器,定时器回调的时候也需要在协程内进行,这也给大量定时器的触发的时候提供了更好的支持和保障。
不过我认为现在的v2任然有可以再优化的地方,本来想在优化完一波再放出这篇总结,但是一方面还没找到什么特别好的不破坏设计的点。另一方面最近看了下SSR-libev的源码,觉得他那种方式的适配的扩展性更好,所以后面可能回去优化服务器框架[asf4g-co][3]的网关层部分,可能要有比较大的改动,改变密钥协商流程,让atgateway支持协商加密方式、更多的加密算法和支持椭圆双曲线协商。所以libcopp的优化暂时搁置吧,不过以后会主要以v2版本的api为准,基本不会再变更接口了吧。