NetTLP: Micro benchmark


This section describes how to conduct a micro benchmark on a NetTLP platform. A NetTLP platform carries PCIe transactions in two directions: (1) from LibTLP to the NetTLP adapter, and (2) from the NetTLP adapter to LibTLP. In the former direction, which this page describes first, an application acting as a software PCIe device issues DMAs to the memory on the adapter host.


To generate PCIe transactions from software, we developed a LibTLP-based benchmark application called tlpperf. With tlpperf, users on the device host can send memory read and write requests through the NetTLP adapter to the memory on the adapter host. tlpperf is included in the apps directory of the LibTLP repository.

$ ./tlpperf -h
./tlpperf: invalid option -- 'h'
tlpperf usage

  basic parameters
    -r X.X.X.X  remote addr at NetTLP link
    -l X.X.X.X  local addr at NetTLP link
    -b XX:XX    bus number of requester

  DMA parameters
    -d read|write  DMA direction
    -a 0xADDR      DMA target region address (physical)
    -s u_int       DMA target region size
    -L u_int       DMA length (spilited into MPS and MRRS)

  benchmark style parameters
    -N u_int                  number of thread
    -R same|diff              how to split DMA region for threads
    -P fix|seq|seq512|random  access pattern on each reagion
    -M                        measuring latency mode

    -c int   count of interations on each thread
    -i msec  interval for each iteration
    -t sec   duration
    -D       debug mode

  for target host
    -S size  size to allocate hugepage as tlpperf target

First, a target memory region for benchmarking is needed on the adapter host. tlpperf provides the -S option for this purpose: it allocates a memory region of the specified size and then enters while (1) sleep(1);. This target mode allocates the region from hugepages, so please set up hugepages in advance.

# At adapter host
# setup hugepage
$ cat
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/hugepages
mount -t hugetlbfs nodev /mnt/hugepages
$ sudo ./

# start tlpperf in target mode: allocate a 2MB region
$ sudo ./tlpperf -S $(( 1024 * 1024 * 2))
2101248-byte allocated, physical address is 0x74ee00000
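
Before starting target mode, it can help to confirm that the hugepage pool was actually reserved. The following is a minimal sketch, assuming a Linux host that exposes the usual HugePages_Total, HugePages_Free, and Hugepagesize fields in /proc/meminfo. The helper name meminfo_field is hypothetical and is not part of tlpperf or LibTLP.

```shell
#!/bin/sh
# Hypothetical helper (not part of tlpperf): print the numeric value of
# one field from a meminfo-style file. The file defaults to /proc/meminfo.
meminfo_field() {
	awk -v key="$1" '$1 == (key ":") { print $2 }' "${2:-/proc/meminfo}"
}

# Report the hugepage pool only where /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
	echo "hugepages total:    $(meminfo_field HugePages_Total)"
	echo "hugepages free:     $(meminfo_field HugePages_Free)"
	echo "hugepage size (kB): $(meminfo_field Hugepagesize)"
fi
```

If the free count is 0, the -S allocation is likely to fail, and the hugepage setup above should be rechecked.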

Next, run tlpperf on the device host. The example below issues 512-byte DMA reads to the memory region on the adapter host (0x74ee00000 is the physical address reported by tlpperf in target mode) with a single thread. The throughput is approximately 133 Mbps. Note that this throughput does not include any headers, i.e., Ethernet, IP, UDP, NetTLP, and TLP headers; it is the goodput of DMA.

# At device host
$ ./tlpperf -r -l -b 1b:00 -d read -a 0x74ee00000 -s 2097152 -L 512     
============ tlpperf ============
-r remote:    
-l local:     
-b requester:           1b:00

-d direction:           read
-a DMA region:          0x74ee00000
-s DMA region size:     2097152
-L DMA length           512

-N nthreads:            1
-R how to split:        same
-P pattern:             seq
-M latency mode:        off

-c count:               0
-i interval:            0
-t duration             0
-D debug:               off
count_thread: start count thread
benchmark_thread: start on cpu 0, address 0x74ee00000, size 2097152, len 512
   1: 133246976 bps
   1: 32531 tps
   2: 133255168 bps
   2: 32533 tps
   3: 133263360 bps
   3: 32535 tps
   4: 133255168 bps
   4: 32533 tps
^Cstop_all: stopping...
   5: 48467968 bps
   5: 11833 tps
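
The bps and tps lines in the output above are consistent with bps = tps × DMA length × 8, that is, payload bits only. This relation is inferred from the printed numbers, not taken from the tlpperf source, and the function name goodput_bps is hypothetical. A small sketch of the arithmetic:

```shell
#!/bin/sh
# Hypothetical helper: payload bits per second for TPS transactions
# per second, each carrying LEN bytes of DMA data.
# usage: goodput_bps TPS LEN
goodput_bps() {
	echo $(( $1 * $2 * 8 ))
}

# First sample of the run above: 32531 tps with -L 512
goodput_bps 32531 512   # prints 133246976
```

The same calculation reproduces the other samples, for example 32533 tps gives 133255168 bps.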

tlpperf provides various options for benchmarking DMA on a NetTLP platform: DMA direction, region size, access pattern, number of threads, and so on. The detailed benchmark results are published in the paper.


To improve throughput, NetTLP exploits the TLP tag field to distribute received encapsulated TLPs among multiple hardware queues of a NIC and CPU cores at the device host. The tag field distinguishes individual non-posted transactions, which can be processed independently. The NetTLP adapter embeds the lower 4 bits of the tag value into the lower 4 bits of the UDP port numbers when encapsulating TLPs. As a result, PCIe transactions to the NetTLP adapter are delivered through 16 different UDP flows based on the tag field, and the device host can receive the flows on different NIC queues.
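
The tag-to-port mapping can be sketched as follows. This is an illustration under stated assumptions, not the adapter's implementation: the base port 12288 (0x3000) is taken from the sample Flow Director script on this page, and the function name tag_to_port is hypothetical.

```shell
#!/bin/sh
# Assumed base port, taken from the sample Flow Director script
# on this page (12288 = 0x3000, so its lower 4 bits are zero).
BASE_PORT=12288

# Hypothetical helper: UDP port whose lower 4 bits equal the
# lower 4 bits of the TLP tag.
# usage: tag_to_port TAG
tag_to_port() {
	echo $(( BASE_PORT | ($1 & 15) ))
}

for tag in 0 1 15 16 255; do
	echo "tag ${tag} -> port $(tag_to_port "${tag}")"
done
```

Tags that share the same lower 4 bits, such as 0 and 16, fall into the same flow, which is why there are exactly 16 flows regardless of the tag width.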

To receive the UDP flows efficiently, we used Intel Ethernet Flow Director. Flow Director allows us to assign specific flows to specific CPU cores. With it, each of the 16 UDP flows from the NetTLP adapter can be assigned to the CPU core where the corresponding tlpperf thread is running.

The sample script below assigns the Nth flow to the Nth core. tlpperf assigns the thread for each tag value to a core by the same rule.



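# ETH (NIC interface name), SADDR, and DADDR (source and destination
# IP addresses of the UDP flows) must be set before this point.

# turn ntuple filtering off, then on again
ethtool --features ${ETH} ntuple off
ethtool --features ${ETH} ntuple on

for x in `seq 0 15`; do

	idx=$(( $x % 16 ))
	PORT=$(( 12288 + $idx ))

	# action N directs matching packets to RX queue N
	cmd="ethtool --config-ntuple ${ETH} flow-type udp4 \
		src-ip ${SADDR} dst-ip ${DADDR} \
		src-port $PORT dst-port $PORT action $idx"
	echo $cmd
	$cmd
done

ethtool --show-ntuple ${ETH}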

From NetTLP adapter to LibTLP

This direction requires uncommon equipment and some tricks. To generate PCIe transactions in this direction, we used pcie-bench. pcie-bench running on a NetFPGA-SUME issued PCIe transactions to BAR4 of the NetTLP adapter instead of main memory, and psmem on the device host responded to the transactions. This setup requires a PCIe switch and P2P DMA.

The adapter host needs PCIe switches to accommodate both the NetFPGA-SUME for pcie-bench and the NetTLP adapter, so we used an ASUS WS X299 SAGE motherboard. For P2P DMA, we modified the pcie-bench implementation for NetFPGA-SUME (ToDo: clean up the modified pcie-bench and publish it here).

How to use pcie-bench is described here, and the benchmark results are also published in the paper.
