NetTLP: A nonexistent NIC


On the NetTLP platform, you can develop PCIe devices in software that interact with hardware root complexes. As an example of such a software device, we introduce an implementation of a simple NIC. The simple NIC, originally introduced by pcie-bench, is a theoretical model of a simplistic Ethernet NIC without performance optimizations or complicated DMA interfaces, intended for understanding PCIe interactions and calculating bandwidth. NetTLP makes it possible to implement such PCIe device models in software and to confirm that they actually work with an existing hardware root complex. In other words, the PCIe devices are implemented in software, but they behave as actual devices toward physical root complexes. Moreover, their PCIe transactions can be observed with tcpdump.

The PCIe interactions of the simple NIC model are described in the source code of the model's implementation. Our simple NIC implementation in NetTLP follows the steps described there. This page focuses on how to set up and run the simple NIC on the NetTLP platform.


To run the simple NIC, first set up a NetTLP environment by following the instructions, and confirm that the example applications work.

The simple NIC implementation consists of two components, split across the adapter host and the device host: (1) a device driver for the NetTLP adapter and (2) the software device implementation of the simple NIC using LibTLP. The driver treats the NetTLP adapter as a physical Ethernet NIC and creates a netdevice interface (like the usual Ethernet interfaces in Linux that the ifconfig command shows). Packets transmitted to the interface are delivered to the device host as TLPs, first over the PCIe link between the root complex and the NetTLP adapter and then over the Ethernet link between the adapter host and the device host. The simple NIC implementation receives the TLPs using LibTLP and sends the packets to a tap interface.

Figure 1: Overview of an environment to run the simple NIC.

Figure 1 shows an example setup of the simple NIC. Descriptions of network interfaces in the figure are:

The simple NIC implementation, illustrated as simple-nic in Figure 1, runs on the device host and is the substance of the NIC that appears as eth2 on the adapter host. simple-nic receives TLPs from and sends TLPs to the NetTLP adapter through LibTLP. The interactions between simple-nic and the NetTLP adapter fully follow the simple NIC model, and they can be observed with tcpdump on the 10G Ethernet link.

Adapter host

First, install the simple NIC driver at the adapter host.

$ ls
$ git clone
Cloning into 'simple-nic'...
remote: Enumerating objects: 112, done.
remote: Counting objects: 100% (112/112), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 112 (delta 43), reused 98 (delta 29), pack-reused 0
Receiving objects: 100% (112/112), 31.81 KiB | 552.00 KiB/s, done.
Resolving deltas: 100% (43/43), done.

$ ls
libtlp  simple-nic
$ cd simple-nic/
$ ls
device  driver  include
$ cd driver/
$ ls
Makefile  nettlp_msg.c	nettlp_msg.h  nettlp_snic.c
$ make
make -C /lib/modules/4.20.2-tsukumo1-nopti/build M=/home/upa/work/test/nettlp/simple-nic/driver V=0 modules
make[1]: Entering directory '/home/upa/src/linux-4.20.2'

  WARNING: Symbol version dump ./Module.symvers
           is missing; modules will have no dependencies and modversions.

  CC [M]  /home/upa/work/test/nettlp/simple-nic/driver/nettlp_snic.o
  CC [M]  /home/upa/work/test/nettlp/simple-nic/driver/nettlp_msg.o
  LD [M]  /home/upa/work/test/nettlp/simple-nic/driver/nettlp_snic_driver.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/upa/work/test/nettlp/simple-nic/driver/nettlp_snic_driver.mod.o
  LD [M]  /home/upa/work/test/nettlp/simple-nic/driver/nettlp_snic_driver.ko
make[1]: Leaving directory '/home/upa/src/linux-4.20.2'

$ ls
Makefile        nettlp_msg.c  nettlp_snic.c          nettlp_snic_driver.mod.c
Module.symvers  nettlp_msg.h  nettlp_snic.o          nettlp_snic_driver.mod.o
modules.order   nettlp_msg.o  nettlp_snic_driver.ko  nettlp_snic_driver.o

$ sudo insmod nettlp_snic_driver.ko

Then, you can see dmesg output and a netdev interface representing the simple NIC.

[1382475.456615] nettlp_snic_driver: nettlp_snic (v0.0.1) is loaded
[1382475.456710] nettlp_snic_driver: nettlp_snic_pci_init: register nettlp simple nic device 0000:1b:00.0
[1382475.456722] nettlp_snic_driver 0000:1b:00.0: enabling device (0100 -> 0102)
[1382475.456861] nettlp_snic_driver: BAR4 0xb0000000 is mapped to 00000000fdb0aee4
[1382475.456871] nettlp_snic_driver: BAR0 0xc0100000 is mapped to 0000000003c10e6c
[1382475.456904] nettlp_snic_driver: BAR2 0xa0000000 is mamped to 000000007bf72a9a
[1382475.456923] nettlp_snic_driver: nettlp_snic_pci_init: rx_buf is 0x2f003000
[1382475.456934] nettlp_snic_driver: nettlp_snic_init
[1382475.457303] initialize nettlp message socket
[1382475.457357] nettlp_snic_driver: nettlp_snic_pci_init: probe finished.
[1382475.457359] nettlp_snic_driver: nettlp_snic_pci_init: tx desc is 0x2f001000, rx desc is 0x2f002000
[1382475.466553] nettlp_snic_driver 0000:1b:00.0 enp27s0: renamed from eth0
$ ifconfig enp27s0
enp27s0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 00:11:22:33:44:55  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

$ ethtool -i enp27s0
driver: nettlp_snic_driver
bus-info: 0000:1b:00.0
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

To keep these instructions consistent with Figure 1, we rename enp27s0 to eth2.

$ sudo ip link set dev enp27s0 name eth2
$ ifconfig eth2
eth2: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 00:11:22:33:44:55  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Warning: DO NOT set eth2 up at this moment. The simple-nic process, the substance of the simple NIC, must be running before eth2 is set up. If you set eth2 up first, the kernel tries to send packets, e.g., IPv6 router solicitations, and the driver sends an MWr TLP to inform the device about a TX descriptor. However, there is no response from the device, because the device, which is the software simple-nic process on the device host, is not running yet.

Device host

Before making eth2 up, start the simple-nic process at the device host.

$ ls
$ git clone
Cloning into 'simple-nic'...
remote: Enumerating objects: 112, done.
remote: Counting objects: 100% (112/112), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 112 (delta 43), reused 98 (delta 29), pack-reused 0
Receiving objects: 100% (112/112), 31.81 KiB | 552.00 KiB/s, done.
Resolving deltas: 100% (43/43), done.

$ ls
libtlp  simple-nic
$ cd simple-nic/
$ ls
device  driver  include
$ cd device/
$ ls
Makefile  nettlp_snic_device.c
$ make
gcc -g -Wall -I../../libtlp/include -I../include  -L../../libtlp/lib  nettlp_snic_device.c  -ltlp -lpthread -o nettlp_snic_device
$ ls
Makefile  nettlp_snic_device  nettlp_snic_device.c

nettlp_snic_device is the simple-nic process.

Note: In the current implementation, the libtlp directory and the simple-nic directory must be located in the same directory. I will develop an install procedure soon.

Then, start the simple-nic process at the device host.

$ ./nettlp_snic_device -h
./nettlp_snic_device: invalid option -- 'h'
    -r remote addr
    -l local addr
    -R remote host addr (not TLP NIC)

    -t tunif name (default tap0)
$ sudo ./nettlp_snic_device -r -l -R
Device is 1b00
BAR4 start address is 0xb0000000
TX IRQ address is 0xfee1a000, data is 0x00004022
RX IRQ address is 0xfee03000, data is 0x00004022
create tap read thread
start nettlp callback

The -r and -l options are identical to those of the example applications: -r is the address of the NetTLP adapter, and -l is the local address connected to the adapter, which is eth1 in Figure 1. The -R option specifies the IP address of the adapter host on the management network. When nettlp_snic_device starts, it gathers the necessary information from the adapter host, for example, the address of BAR4 and the IRQ information of the adapter. For this purpose, nettlp_snic_device uses a simple UDP-based messaging API implemented as a side channel in the driver.

If nettlp_snic_device successfully gets the information through the management network, it starts to act as the simple NIC: it waits for TLPs and follows the PCIe interactions of the simple NIC model.

nettlp_snic_device creates tap0 and sets it up automatically. This is the actual port of the simple NIC. To ping in this closed environment, add tap0 as a bridge port of a Linux bridge and assign an IP address to the bridge.

Please keep nettlp_snic_device running. Use other terminals on the device host.

$ ifconfig tap0
tap0: flags=67<UP,BROADCAST,RUNNING>  mtu 1500
        inet6 fe80::fc9d:a9ff:fe99:15cd  prefixlen 64  scopeid 0x20
        ether fe:9d:a9:99:15:cd  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13  bytes 1006 (1.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

$ sudo ip link add br0 type bridge
$ sudo ip link set dev br0 up
$ sudo ip addr add dev br0
$ sudo ip link set dev tap0 master br0
$ ifconfig br0
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet  netmask  broadcast
        inet6 fe80::4c0d:f8ff:fe25:4b40  prefixlen 64  scopeid 0x20
        ether fe:9d:a9:99:15:cd  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 906 (906.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


Now we can set the simple NIC up at the adapter host.

$ sudo ip addr add dev eth2
$ sudo ip link set dev eth2 up
$ ifconfig eth2
eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet  netmask  broadcast
        inet6 fe80::211:22ff:fe33:4455  prefixlen 64  scopeid 0x20
        ether 00:11:22:33:44:55  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 656 (656.0 B)
        TX errors 0  dropped 1 overruns 0  carrier 0  collisions 0

Then, the kernel tries to send some packets, so you will see some output messages from nettlp_snic_device.

BAR4 start address is 0xb0000000
TX IRQ address is 0xfee1a000, data is 0x00004022
RX IRQ address is 0xfee03000, data is 0x00004022
create tap read thread
start nettlp callback
nettlp_snic_mwr: dma_addr is 0xb0000008
RX desc base is 0x2f005000
nettlp_snic_mwr: dma_addr is 0xb0000014
RX desc update: dma to 0xb0000014, rx desc ptr is 0x2f005000
RX desc update: new rx_desc addr=0x2f006000 len=0
nettlp_snic_mwr: dma_addr is 0xb0000018
nettlp_snic_mwr: dma_addr is 0xb0000000
TX desc base is 0x2f004000
nettlp_snic_mwr: dma_addr is 0xb0000010
idx 0, dma to 0xb0000010, tx desc ptr is 0x2f004000
TX: pkt length is 90, addr is 0x3bb15000
TX: generate interrupt to 0xfee1a000
TX done
nettlp_snic_mwr: dma_addr is 0xb0000010
idx 0, dma to 0xb0000010, tx desc ptr is 0x2f004000
TX: pkt length is 86, addr is 0x3bb15800
TX: generate interrupt to 0xfee1a000
TX done
nettlp_snic_mwr: dma_addr is 0xb0000010
idx 0, dma to 0xb0000010, tx desc ptr is 0x2f004000
TX: pkt length is 90, addr is 0x3bb16000
TX: generate interrupt to 0xfee1a000
TX done

Now you can ping via the simple NIC from the adapter host.

$ ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.297 ms
64 bytes from icmp_seq=2 ttl=64 time=0.246 ms
64 bytes from icmp_seq=3 ttl=64 time=0.239 ms
--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2049ms
rtt min/avg/max/mdev = 0.239/0.260/0.297/0.031 ms

This ping sends an ICMP echo packet to eth2, and the driver then notifies nettlp_snic_device about the TX descriptor for the packet (MWr TLP). nettlp_snic_device issues a DMA Read to fetch the descriptor and then reads the packet content specified by the descriptor. After nettlp_snic_device writes the packet to the tap interface, it sends an MWr TLP as the TX interrupt (MSI-X).

The corresponding ICMP reply packet is generated by the kernel on the device host when the echo is received by br0. nettlp_snic_device reads the reply packet from the tap interface and writes it to the packet buffer on the adapter host by DMA Write. After the DMA Write, nettlp_snic_device sends an MWr TLP as the RX interrupt (MSI-X).

Observing the PCIe transactions

The PCIe transactions between the driver and the device can be observed on the Ethernet link. This is one of the powerful functionalities of the NetTLP platform. Let us see the TLPs of the simple NIC.

An easy way to see the TLPs is to use the modified tcpdump, which can parse TLPs encapsulated in Ethernet, IP, UDP, and NetTLP headers. Capture on eth1 on the device host using this tcpdump.

# At device host
$ sudo tcpdump -ni eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
01:18:00.269163 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 00:00, tag 0x01, last 0x0, first 0xf, Addr 0xb0000010
01:18:00.269202 IP > NetTLP: MRd, 3DW, tc 0, flags [none], attrs [none], len 4, requester 1b:00, tag 0x01, last 0xf, first 0xf, Addr 0x2f004000
01:18:00.269215 IP > NetTLP: CplD, 3DW, WD, tc 0, flags [none], attrs [none], len 4, completer 00:00, success, byte count 16, requester 1b:00, tag 0x01, lowaddr 0x00
01:18:00.269234 IP > NetTLP: MRd, 3DW, tc 0, flags [none], attrs [none], len 25, requester 1b:00, tag 0x01, last 0x3, first 0xf, Addr 0x3bdc1000
01:18:00.269247 IP > NetTLP: CplD, 3DW, WD, tc 0, flags [none], attrs [none], len 25, completer 00:00, success, byte count 98, requester 1b:00, tag 0x01, lowaddr 0x00
01:18:00.269277 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 1b:00, tag 0x01, last 0x0, first 0xf, Addr 0xfee1a000
01:18:00.269300 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 25, requester 1b:00, tag 0x00, last 0x3, first 0xf, Addr 0x2f006000
01:18:00.269326 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 4, requester 1b:00, tag 0x00, last 0xf, first 0xf, Addr 0x2f005000
01:18:00.269337 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 1b:00, tag 0x00, last 0x0, first 0xf, Addr 0xfee03000
01:18:00.272141 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 00:00, tag 0x02, last 0x0, first 0xf, Addr 0xb0000014
01:18:00.272173 IP > NetTLP: MRd, 3DW, tc 0, flags [none], attrs [none], len 4, requester 1b:00, tag 0x0f, last 0xf, first 0xf, Addr 0x2f005000
01:18:00.272191 IP > NetTLP: CplD, 3DW, WD, tc 0, flags [none], attrs [none], len 4, completer 00:00, success, byte count 16, requester 1b:00, tag 0x0f, lowaddr 0x00
12 packets captured
12 packets received by filter
0 packets dropped by kernel
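
When a capture grows beyond a handful of TLPs, it can help to turn each tcpdump line into a record. The following is only a minimal sketch that assumes the text format printed above; parse_tlp and its field names are hypothetical helpers for illustration, not part of NetTLP or tcpdump.

```python
import re


def _field(pattern, line):
    """Return the first capture group of pattern in line, or None."""
    m = re.search(pattern, line)
    return m.group(1) if m else None


def parse_tlp(line):
    """Extract a few TLP fields from one line of the tcpdump output above.

    Assumes the text layout shown on this page; returns None for lines
    that do not contain a NetTLP record.
    """
    kind = _field(r"NetTLP: (\w+),", line)
    if kind is None:
        return None
    length = _field(r"\blen (\d+)", line)
    addr = _field(r"Addr (0x[0-9a-f]+)", line)
    return {
        "kind": kind,                                   # MWr, MRd, or CplD
        "len_dw": int(length) if length else None,      # length in DWORDs
        "requester": _field(r"requester ([0-9a-f]{2}:[0-9a-f]{2})", line),
        "tag": _field(r"tag (0x[0-9a-f]+)", line),
        "addr": int(addr, 16) if addr else None,        # absent on CplD
    }


if __name__ == "__main__":
    sample = ("01:18:00.269202 IP > NetTLP: MRd, 3DW, tc 0, flags [none], "
              "attrs [none], len 4, requester 1b:00, tag 0x01, last 0xf, "
              "first 0xf, Addr 0x2f004000")
    print(parse_tlp(sample))
```

Piping the tcpdump output into such a parser makes it easy to filter by requester or address range.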

It is time to dive into PCIe. The PCIe interaction of the simple NIC model is shown below:

On the TX side:

  1. Host updates a 4-byte TX queue tail pointer (PCIe write)
  2. Device reads a 16-byte TX queue descriptor (PCIe read)
  3. Device reads the packet specified by the descriptor (PCIe read)
  4. Device sends the packet to wire
  5. Device generates 4-byte interrupt (PCIe write)

On the RX side:

  1. Host updates a 4-byte RX queue tail pointer (PCIe write)
  2. Device reads a 16-byte RX queue descriptor (PCIe read)
  3. Device writes a received packet to the buffer specified by the descriptor (PCIe write)
  4. Device writes back 16-byte RX descriptor to host (PCIe write)
  5. Device generates 4-byte interrupt (PCIe write)
  6. Host reads 4-byte RX queue tail pointer (PCIe read)
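
Since the model is meant for calculating bandwidth, the two step lists above can be turned into a rough payload tally. This is only a sketch under stated assumptions: it counts TLP payload bytes, ignores TLP headers and link-layer overhead, and treats each read as costing its completion payload. The names tx_steps, rx_steps, and payload_bytes are hypothetical.

```python
# "down" means host to device, "up" means device to host.
# MRd requests carry no payload; the data travels in their completions.


def tx_steps(pkt_len):
    """Payload-carrying transactions for sending one packet of pkt_len bytes."""
    return [
        ("down", "MWr: TX queue tail pointer", 4),
        ("down", "CplD: TX descriptor", 16),
        ("down", "CplD: packet data", pkt_len),
        ("up", "MWr: interrupt", 4),
    ]


def rx_steps(pkt_len):
    """Payload-carrying transactions for receiving one packet of pkt_len bytes."""
    return [
        ("down", "MWr: RX queue tail pointer", 4),
        ("down", "CplD: RX descriptor", 16),
        ("up", "MWr: packet data", pkt_len),
        ("up", "MWr: RX descriptor write-back", 16),
        ("up", "MWr: interrupt", 4),
        ("up", "CplD: RX queue tail pointer", 4),
    ]


def payload_bytes(steps):
    """Sum payload bytes per direction."""
    totals = {"up": 0, "down": 0}
    for direction, _description, nbytes in steps:
        totals[direction] += nbytes
    return totals


if __name__ == "__main__":
    print("TX:", payload_bytes(tx_steps(98)))
    print("RX:", payload_bytes(rx_steps(98)))
```

A fuller estimate would add per-TLP header bytes and split large transfers by the maximum payload size, which this sketch deliberately leaves out.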

All of these PCIe transactions can be observed in the tcpdump output. Let us see them step by step.

TX side

01:18:00.269163 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 00:00, tag 0x01, last 0x0, first 0xf, Addr 0xb0000010

This is the first TLP for sending the ICMP echo packet. It is an MWr TLP from the root complex (requester 00:00) to address 0xb0000010, the register for the TX queue tail pointer in BAR4 (0xb0000000) on the adapter. This is the first transaction on the TX side: 1. Host updates a 4-byte TX queue tail pointer (PCIe write).

01:18:00.269202 IP > NetTLP: MRd, 3DW, tc 0, flags [none], attrs [none], len 4, requester 1b:00, tag 0x01, last 0xf, first 0xf, Addr 0x2f004000

Next, the simple-nic reads the 16-byte TX descriptor indicated by the tail pointer update in the previous MWr TLP. The simple-nic (requester 1b:00 corresponds to the PCIe bus where the NetTLP adapter is installed) sends an MRd TLP for 16 bytes (len 4 in DWORDs) from 0x2f004000, which is the address of the TX descriptor in main memory.

01:18:00.269215 IP > NetTLP: CplD, 3DW, WD, tc 0, flags [none], attrs [none], len 4, completer 00:00, success, byte count 16, requester 1b:00, tag 0x01, lowaddr 0x00

This TLP is the response for the MRd TLP for reading the TX descriptor.

01:18:00.269234 IP > NetTLP: MRd, 3DW, tc 0, flags [none], attrs [none], len 25, requester 1b:00, tag 0x01, last 0x3, first 0xf, Addr 0x3bdc1000

Next, as step 3 on the TX side, the simple-nic reads the actual packet to be sent from main memory on the adapter host. 0x3bdc1000 is the address of the packet (skb->data, or skb_mac_header(skb)). len 25 means 100 bytes, but the Last DW Byte Enable field is 0x3, so the actual read length is 98 bytes, which is the length of the ICMP echo packet.
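
The length arithmetic can be checked with a small helper. This is an illustrative sketch, not code from LibTLP: dma_byte_length is a hypothetical name, and it simply counts the bytes enabled by the First and Last DW Byte Enable fields.

```python
def dma_byte_length(len_dw, first_be, last_be):
    """Number of bytes actually enabled in a memory request.

    len_dw   -- Length field of the TLP, in DWORDs (4 bytes each)
    first_be -- First DW Byte Enable (4-bit mask)
    last_be  -- Last DW Byte Enable (4-bit mask, 0 for a 1-DWORD request)
    """
    first_bytes = bin(first_be & 0xF).count("1")
    if len_dw == 1:
        # A single-DWORD request uses only the first byte enables.
        return first_bytes
    last_bytes = bin(last_be & 0xF).count("1")
    # All DWORDs between the first and the last are fully enabled.
    return 4 * (len_dw - 2) + first_bytes + last_bytes


if __name__ == "__main__":
    # The packet read above: len 25, first 0xf, last 0x3.
    print(dma_byte_length(25, 0xF, 0x3))
```

For the packet read above, 23 full DWORDs plus 4 bytes of the first DWORD and 2 bytes of the last give 98 bytes.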

01:18:00.269247 IP > NetTLP: CplD, 3DW, WD, tc 0, flags [none], attrs [none], len 25, completer 00:00, success, byte count 98, requester 1b:00, tag 0x01, lowaddr 0x00

Then the root complex sends a response for the DMA read for the packet. After receiving this CplD TLP, the simple-nic sends the packet to a tap interface.

01:18:00.269277 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 1b:00, tag 0x01, last 0x0, first 0xf, Addr 0xfee1a000

After transmitting the packet, the simple-nic generates a TX interrupt to inform the driver that the transmission has completed. In MSI-X, an interrupt is simply an MWr TLP that writes specified data to a specified address. In this case, the address is 0xfee1a000.
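
To make "an interrupt is just an MWr TLP" concrete, the sketch below packs the 3DW memory write header for the TLP shown above, using the standard PCIe header layout. It is an illustration only: mwr_3dw_header is a hypothetical helper, not the LibTLP API, and it omits the NetTLP encapsulation header and the 4-byte message data that follows the TLP header.

```python
import struct


def mwr_3dw_header(requester, tag, addr, len_dw, first_be=0xF, last_be=0x0):
    """Pack a 3DW memory write TLP header (32-bit addressing).

    requester -- 16-bit requester ID, e.g. 0x1b00 for 1b:00
    tag       -- 8-bit tag
    addr      -- 32-bit target address (the low two bits are dropped)
    len_dw    -- payload length in DWORDs
    """
    # DW0: Fmt=0b010 (3DW header, with data) and Type=0b00000 give 0x40;
    # TC, flags, and attributes are left at zero; the low 10 bits are Length.
    dw0 = (0x40 << 24) | (len_dw & 0x3FF)
    # DW1: requester ID, tag, Last DW BE, First DW BE.
    dw1 = ((requester & 0xFFFF) << 16) | ((tag & 0xFF) << 8) \
        | ((last_be & 0xF) << 4) | (first_be & 0xF)
    # DW2: DWORD-aligned address.
    dw2 = addr & 0xFFFFFFFC
    return struct.pack(">III", dw0, dw1, dw2)


if __name__ == "__main__":
    # The TX interrupt above: requester 1b:00, tag 0x01, len 1, Addr 0xfee1a000.
    print(mwr_3dw_header(0x1B00, 0x01, 0xFEE1A000, 1).hex())
```

The fields map directly onto what tcpdump prints: len 1, requester 1b:00, tag 0x01, last 0x0, first 0xf, and the MSI-X address.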

RX side

01:18:00.269300 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 25, requester 1b:00, tag 0x00, last 0x3, first 0xf, Addr 0x2f006000

On the RX side, the interaction starts with writing the received packet to host memory (step 3), because the driver had already passed the RX buffer to the simple-nic before the packet arrived. So, this first TLP shows the simple-nic writing the ICMP reply packet to the host memory pointed to by an RX descriptor that was provided in advance.

01:18:00.269326 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 4, requester 1b:00, tag 0x00, last 0xf, first 0xf, Addr 0x2f005000

Next, the simple-nic updates the RX descriptor to notify the driver of information about the received packet, i.e., packet length.

01:18:00.269337 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 1b:00, tag 0x00, last 0x0, first 0xf, Addr 0xfee03000

And then the simple-nic generates an RX interrupt by MSI-X as with the TX side.

01:18:00.272141 IP > NetTLP: MWr, 3DW, WD, tc 0, flags [none], attrs [none], len 1, requester 00:00, tag 0x02, last 0x0, first 0xf, Addr 0xb0000014

After the host has consumed the received packet, the driver returns the RX buffer to the device. The driver writes the updated RX queue tail pointer to its register in BAR4 (0xb0000014) on the NetTLP adapter, and the simple-nic receives this MWr TLP. This is actually the first step on the RX side of the simple NIC model.
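
The BAR4 addresses seen so far suggest a small register map. The sketch below is inferred only from the nettlp_snic_device log and the captured TLPs on this page (TX desc base at 0xb0000000, RX desc base at 0xb0000008, tail pointers at 0xb0000010 and 0xb0000014); the register names and classify_mwr are hypothetical, and the authoritative layout is in the simple-nic source code.

```python
# BAR4 start address as reported by nettlp_snic_device on this setup.
BAR4_BASE = 0xB0000000

# Offsets inferred from the log output and tcpdump capture on this page.
# The names are made up for this illustration.
REGISTERS = {
    0x00: "tx_desc_base",
    0x08: "rx_desc_base",
    0x10: "tx_queue_tail",
    0x14: "rx_queue_tail",
}


def classify_mwr(addr, bar4_base=BAR4_BASE):
    """Name the BAR4 register targeted by an MWr TLP from the host."""
    return REGISTERS.get(addr - bar4_base, "unknown")


if __name__ == "__main__":
    for addr in (0xB0000010, 0xB0000014, 0xB0000018):
        print(hex(addr), classify_mwr(addr))
```

Offsets that do not appear in the map, such as 0x18 in the log, are reported as unknown rather than guessed.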

01:18:00.272173 IP > NetTLP: MRd, 3DW, tc 0, flags [none], attrs [none], len 4, requester 1b:00, tag 0x0f, last 0xf, first 0xf, Addr 0x2f005000

Then the simple-nic reads the new RX descriptor notified by updating the RX queue tail pointer. 0x2f005000 is the address of the 16-byte (len 4 in DWORD) RX descriptor on the main memory.

01:18:00.272191 IP > NetTLP: CplD, 3DW, WD, tc 0, flags [none], attrs [none], len 4, completer 00:00, success, byte count 16, requester 1b:00, tag 0x0f, lowaddr 0x00

And this TLP is the response for the DMA read for the RX descriptor.

In this manner, you can observe, understand, and debug actual PCIe transactions with IP networking tools.