How to Write a TCP Congestion Control Algorithm with eBPF

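I did not really want to use this title; TCP topics just tend to catch people's eye. The subject of this article is eBPF, not TCP.

Writing a TCP congestion control algorithm in eBPF is only one very ordinary example of what this article covers.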

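Let's start with two problems, or rather two pain points:

1. The kernel is carrying more and more policy.

2. Kernel interfaces are unstable.

A few words on each.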

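By "more and more policy" I mean that a growing amount of clever algorithms, small tricks, and other frequently changing code is making its way into the kernel. Examples include, but are not limited to:

TCP congestion control algorithms.

TC queueing disciplines and packet scheduling algorithms.

Hash algorithms for all kinds of lookups.

…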

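Almost all of this policy code is implemented as callback functions. This also reflects the Linux kernel's modular design and its separation of mechanism from policy: when a new algorithm is needed, you only have to register a new set of callbacks.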

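And yet…

And yet it is not perfect, because of the second point above: kernel interfaces are unstable. The kernel does not guarantee that its internal data structures and APIs stay compatible from one version to the next.

What does that mean?

It means that even well-encapsulated algorithm module code needs a separately maintained variant for each kernel version. When a kernel module has to be upgraded because of a version change, adapting to the new data structures and APIs is usually time-consuming and thankless work.

In fact, though, many algorithms have nothing to do with kernel data structures or kernel APIs as such.

Between two kernel versions, a data structure may only have had fields moved, fields added, or fields renamed, and even so the algorithm module has to be recompiled…

If only references to functions and data-structure fields could be relocated when the module is loaded into the kernel!

Our goal: write once, run many times.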

Once again Facebook is ahead of the pack, with BPF CO-RE (Compile Once – Run Everywhere):
http://vger.kernel.org/bpfconf2019_talks/bpf-core.pdf
That's right: eBPF is the answer!

Here is how it is described:

BPF CO-RE talk discussed issues that developers currently run into when developing, testing, deploying, and running BPF applications at scale, taking Facebook’s experience as an example. Today, most types of BPF programs access internal kernel structures, which necessitates the need to compile BPF program’s C code “on the fly” on every single production machine due to changing struct/union layouts and definitions inside kernel. This causes many problems and inconveniences, starting from the need to have kernel sources available everywhere and in sync with running kernel, which is a hassle to set up and maintain. Reliance on embedded LLVM/Clang for compilation means big application binary size, increased memory usage, and some rare, but impactful production issues due to increased resource usage due to compilation. With current approach testing BPF programs against multitude of production kernels is a stressful, time-consuming, and error-prone process. The goal of BPF CO-RE is to solve all of those issues and move BPF app development flow closer to typical experience, one would expect when developing applications: compile BPF code once and distribute it as a binary. Having a good way to validate that BPF application will run without issues on all active kernels is also a must.

The complexity hides in the need to adjust compiled BPF assembly code to every specific kernel in production, as memory layout of kernel data structures changes between kernel versions and even different kernel build configurations. BPF CO-RE solution relies on self-describing kernel providing BTF type information and layout (ability to produce it was recently committed upstream). With the help from Clang compiler emitting special relocations during BPF compilation and with libbpf as a dynamic loader, it’s possible to reconciliate correct field offsets just before loading BPF program into kernel. As BPF programs are often required to work without modification (i.e., re-compilation) on multiple kernel versions/configurations with incompatible internal changes, there is a way to specify conditional BPF logic based on actual kernel version and configuration, also using relocations emitted from Clang. Not having to rely on kernel headers significantly improves the testing story and makes it possible to have a good tooling support to do pre-validation before deploying to production.

There are still issues which will have to be worked around for now. There is currently no good way to extract #define macro from kernel, so this has to be dealt with by copy/pasting the necessary definitions manually. Code directly relying on size of structs/unions has to be avoided as well, as it isn’t relocatable in general case. While there are some raw ideas how to solve issues like that in the future, BPF CO-RE developers prioritize providing basic mechanisms to allow “Compile Once - Run Everywhere” approach and significantly improve testing and pre-validation experience through better tooling, enabled by BPF CO-RE. As existing applications are adapted to BPF CO-RE, there will be new learning and better understanding of additional facilities that need to be provided to provide best developer experience possible.

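This mechanism can:

Implement a kernel module's set of callback functions as a set of eBPF bytecode programs.

Relocate the kernel data-structure fields the program uses so that they match the offsets in the running kernel.

The consequence:

Many in-kernel algorithm modules can now be written in eBPF.

Linux 5.6 added an example that uses a TCP congestion control algorithm. Let's take a look:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09903869f69f

As you can see, this eBPF program does not depend on the kernel version. Here is its definition of struct tcp_sock: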

struct tcp_sock {
    struct inet_connection_sock inet_conn;

    __u32 rcv_nxt;
    __u32 snd_nxt;
    __u32 snd_una;
    __u8 ecn_flags;
    __u32 delivered;
    __u32 delivered_ce;
    __u32 snd_cwnd;
    __u32 snd_cwnd_cnt;
    __u32 snd_cwnd_clamp;
    __u32 snd_ssthresh;
    __u8 syn_data:1,        /* SYN includes data */
         syn_fastopen:1,    /* SYN includes Fast Open option */
         syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
         syn_fastopen_ch:1, /* Active TFO re-enabling probe */
         syn_data_acked:1,  /* data in SYN is acked by SYN-ACK */
         save_syn:1,        /* Save headers of SYN packet */
         is_cwnd_limited:1, /* forward progress limited by snd_cwnd? */
         syn_smc:1;         /* SYN includes SMC */
    __u32 max_packets_out;
    __u32 lsndtime;
    __u32 prior_cwnd;
} __attribute__((preserve_access_index));

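Note two things here:

This struct is not the one from the kernel headers. It contains only the fields of the kernel's struct that the TCP CC algorithm uses; it is a subset of the kernel struct of the same name.

The preserve_access_index attribute means that accesses to this struct's fields are relocated when the eBPF bytecode is loaded, so that they match the field offsets of the same-named struct in the running kernel.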

Now let's see what a TCP CC callback implemented in eBPF looks like:

BPF_TCP_OPS_3(tcp_reno_cong_avoid, void, struct sock *, sk, __u32, ack, __u32, acked)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (!tcp_is_cwnd_limited(sk))
        return;

    /* In "safe" area, increase. */
    if (tcp_in_slow_start(tp)) {
        acked = tcp_slow_start(tp, acked);
        if (!acked)
            return;
    }
    /* In dangerous area, increase slowly. */
    tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
}

...

SEC(".struct_ops")
struct tcp_congestion_ops dctcp = {
    .init         = (void *)dctcp_init,
    .in_ack_event = (void *)dctcp_update_alpha,
    .cwnd_event   = (void *)dctcp_cwnd_event,
    .ssthresh     = (void *)dctcp_ssthresh,
    .cong_avoid   = (void *)tcp_reno_cong_avoid,
    .undo_cwnd    = (void *)dctcp_cwnd_undo,
    .set_state    = (void *)dctcp_state,
    .flags        = TCP_CONG_NEEDS_ECN,
    .name         = "bpf_dctcp",
};

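Nothing special: it is written almost exactly like a kernel module. The differences are:

It no longer depends on the kernel version. The .o bytecode you compile with llvm/clang can be loaded into any kernel that provides the required support (BTF type information and struct_ops).

It feels like user-space programming.

Indeed, you write this TCP CC algorithm much as you would write user-space code. The eBPF verifier checks your code and refuses to load eBPF code that could crash the kernel; dangerous code will almost never get past verification.

If you want to understand how all of this works behind the scenes, two files are enough:

net/ipv4/bpf_tcp_ca.c

kernel/bpf/bpf_struct_ops.c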

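Of course, the manager will not know what any of this means.

Leather shoes in Wenzhou, Zhejiang get wet; rain gets in, but they do not get fat.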

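Original title: 用eBPF写TCP拥塞控制算法 (Writing a TCP Congestion Control Algorithm with eBPF)

Source: WeChat official account Linuxer. Feel free to follow. Please credit the source when reposting.

Editor: haq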
