Overview
This section describes how to run a micro benchmark on the NetTLP platform. In the NetTLP platform, there are two directions of PCIe transactions: (1) from LibTLP to the NetTLP adapter, and (2) from the NetTLP adapter to LibTLP. This page describes the former direction, in which an application implementing a software PCIe device issues DMAs to the memory of the adapter host.
tlpperf
To generate PCIe transactions from software, we developed a LibTLP-based benchmark application called tlpperf. By using tlpperf on the device host, users can send memory read and write requests through the NetTLP adapter to the memory of the adapter host. tlpperf is also contained in the apps directory of the LibTLP repository.
$ ./tlpperf -h
./tlpperf: invalid option -- 'h'
tlpperf usage
basic parameters
-r X.X.X.X remote addr at NetTLP link
-l X.X.X.X local addr at NetTLP link
-b XX:XX bus number of requester
DMA parameters
-d read|write DMA direction
-a 0xADDR DMA target region address (physical)
-s u_int DMA target region size
-L u_int DMA length (split into MPS and MRRS)
benchmark style parameters
-N u_int number of threads
-R same|diff how to split DMA region for threads
-P fix|seq|seq512|random access pattern on each region
-M measuring latency mode
options
-c int count of iterations on each thread
-i msec interval for each iteration
-t sec duration
-D debug mode
for target host
-S size size to allocate hugepage as tlpperf target
First, a target memory region for benchmarking is needed on the adapter host. tlpperf provides the -S option for this purpose: it allocates a memory region of the specified size and then enters while (1) sleep(1);. This target mode allocates the region from hugepages, so please set up hugepages in advance.
# At adapter host
# setup hugepage
$ cat setup-hugepage.sh
#!/bin/bash
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/hugepages
mount -t hugetlbfs nodev /mnt/hugepages
$ sudo ./setup-hugepage.sh
# start tlpperf in target mode: allocate a 2MB region
$ sudo ./tlpperf -S $(( 1024 * 1024 * 2))
2101248-byte allocated, physical address is 0x74ee00000
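If the allocation in target mode fails, check that hugepages were actually reserved. This is a generic Linux check, not a tlpperf feature:
# At adapter host
$ grep Huge /proc/meminfo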
Next, run tlpperf on the device host. The example shown below issues 512-byte DMA reads to the memory region on the adapter host (0x74ee00000, the address reported by tlpperf in target mode) with a single thread. The throughput is approximately 133 Mbps, which corresponds to the transaction rate times the DMA length (32,533 tps x 512 bytes x 8 bits = 133,255,168 bps). Note that this throughput does not include any headers, i.e., Ethernet, IP, UDP, NetTLP, and TLP headers; it indicates the goodput of the DMA.
# At device host
$ ./tlpperf -r 192.168.10.1 -l 192.168.10.3 -b 1b:00 -d read -a 0x74ee00000 -s 2097152 -L 512
============ tlpperf ============
-r remote: 192.168.10.1
-l local: 192.168.10.3
-b requester: 1b:00
-d direction: read
-a DMA region: 0x74ee00000
-s DMA region size: 2097152
-L DMA length 512
-N nthreads: 1
-R how to split: same
-P pattern: seq
-M latency mode: off
-c count: 0
-i interval: 0
-t duration 0
-D debug: off
=================================
count_thread: start count thread
benchmark_thread: start on cpu 0, address 0x74ee00000, size 2097152, len 512
1: 133246976 bps
1: 32531 tps
2: 133255168 bps
2: 32533 tps
3: 133263360 bps
3: 32535 tps
4: 133255168 bps
4: 32533 tps
^Cstop_all: stopping...
5: 48467968 bps
5: 11833 tps
tlpperf provides various options for benchmarking DMA on a NetTLP platform: DMA direction, region size, access patterns, number of threads, and so on; an example combining several of them is shown below. The detailed benchmark results are published in the paper.
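For instance, the following command is an illustrative combination of the options listed above, reusing the addresses and region size from the previous example: it runs a 4-thread DMA write benchmark with 256-byte DMAs and a random access pattern, lets each thread work on a different part of the region (-R diff), and stops after 10 seconds.
# At device host
$ ./tlpperf -r 192.168.10.1 -l 192.168.10.3 -b 1b:00 -d write -a 0x74ee00000 -s 2097152 -L 256 -N 4 -R diff -P random -t 10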
Optimization
To improve throughput, NetTLP exploits the TLP tag field to distribute received encapsulated TLPs across multiple hardware queues of a NIC and multiple CPU cores at the device host. The tag field is used to distinguish individual non-posted transactions, which can be processed independently. The NetTLP adapter embeds the lower 4 bits of the tag value into the lower 4 bits of the UDP port number when encapsulating TLPs. As a result, PCIe transactions to the NetTLP adapter are delivered through 16 different UDP flows based on the tag field, and the device host can receive these flows on different NIC queues.
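As a rough sketch of this mapping, the encapsulating UDP port for a given tag can be computed as below. The base port 12288 (0x3000) is inferred from the Flow Director sample script later in this section; confirm it against your NetTLP adapter configuration.
# tag-to-UDP-port mapping sketch (base port assumed from the script below)
BASE_PORT=12288
for tag in 0 1 15 16 17; do
    echo "tag ${tag} -> UDP port $(( BASE_PORT + (tag & 0xf) ))"
done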
To receive these UDP flows efficiently, we used Intel Ethernet Flow Director, which allows us to assign specific flows to specific CPU cores. By using it, the 16 UDP flows from the NetTLP adapter can be assigned to the CPU cores where the corresponding tlpperf threads are running.
The sample script shown below assigns the Nth flow to the Nth core. tlpperf assigns the threads corresponding to the tag values to the cores following the same rule.
#!/bin/bash
ETH=eth1
SADDR=192.168.10.1
DADDR=192.168.10.3
ethtool --features ${ETH} ntuple off
ethtool --features ${ETH} ntuple on
for x in `seq 0 15`; do
    idx=$(( $x % 16 ))
    PORT=$(( 12288 + $idx ))
    cmd="ethtool --config-ntuple ${ETH} flow-type udp4 \
        src-ip ${SADDR} dst-ip ${DADDR} \
        src-port $PORT dst-port $PORT action $idx"
    echo $cmd
    $cmd
done
ethtool --show-ntuple ${ETH}
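Once the ntuple rules are installed, running tlpperf with 16 threads lets each of the 16 UDP flows be steered to the core running the corresponding thread. The command below is illustrative, reusing the parameters from the single-thread example:
# At device host
$ ./tlpperf -r 192.168.10.1 -l 192.168.10.3 -b 1b:00 -d read -a 0x74ee00000 -s 2097152 -L 512 -N 16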
From NetTLP adapter to LibTLP
Benchmarking this direction needs rare equipment and some tricks. To generate PCIe transactions in this direction, we used pcie-bench. pcie-bench with NetFPGA-SUME issued PCIe transactions to BAR4 of the NetTLP adapter instead of to main memory, and psmem on the device host responded to the transactions. This setup requires a PCIe switch and P2P DMA.
For the adapter host, which needs PCIe switches to accommodate both the NetFPGA-SUME for pcie-bench and the NetTLP adapter, we used an ASUS WS X299 SAGE motherboard. For P2P DMA, we modified the pcie-bench implementation for NetFPGA-SUME (ToDo: clean up the modified pcie-bench and publish it here).
How to use pcie-bench is described at https://github.com/pcie-bench/pciebench-netfpga, and the benchmark results are also published in the paper.