GPU Benchmarking on the Edge with Computer Vision Object Detection

January 7th, 2021


Principal Engineer

Introduction

With the growing popularity of Artificial Intelligence and Machine Learning applications, GPU availability has become more critical than ever. For use cases like compute offload and remote rendering, GPUs are essential at the Edge. At MobiledgeX, we are in the process of deploying GPUs to more of our cloudlet locations. As of today, GPU cloudlets are a hot commodity, but still relatively scarce. Because of this, it is vital to understand how many concurrent users a single GPU can support. The use case we will study is Object Detection. We will run our open source ComputerVision server on a GPU-enabled cloudlet and connect to it with multiple clients simultaneously to stress the server.

CPU/GPU Comparisons

There’s no question that the types of computer vision processing that we’re interested in benefit significantly from a GPU. This table shows a comparison between CPU-only and GPU-enabled processing for a couple of different computer vision activities. The values shown are the average times to process one frame.

Activity            CPU-only        GPU Support    Performance Improvement
Pose Detection      14408.680 ms    64.589 ms      223.08x
Object Detection    403.652 ms      41.340 ms      9.76x

Both activities clearly benefit from GPU support. We have found Object Detection to be the more popular computer vision demonstration, so let's dig in and see how much performance we can squeeze out of a single GPU.

The Use Case

We are testing for a use case where a ComputerVision app instance runs on a cloudlet and serves multiple clients. Those clients might be surveillance cameras, drones, self-driving cars, etc. In this use case, the number of frames processed per second is important to each client. The time it takes for an object detection result to be returned for an image might make the difference between a smooth, uneventful ride, and a severe car crash. We want to discover the configuration that will give us the lowest possible latency for each client to receive each result. We know Edge computing gives us the lowest possible network latency, and that a GPU-equipped server will process object detection requests much faster than a CPU-only server. But is that enough? This image shows an example of our use case:

Testing Methodology

Using Edge to Test Edge

As most of our GPUs are deployed in Germany, we also use simulated clients in Germany to approximate a low-latency Edge environment. We use the /client/benchmark web service, which runs as part of every ComputerVision instance, allowing us to use any instance in Germany as a client that connects to any specified server. One nice thing about this setup is that no software installation at all is required to run a test. As long as you have access to the "curl" command, you can start a benchmarking session.

The following diagram shows a laptop in the U.S. initiating tests on multiple ComputerVision app instances in Germany, acting as clients and connecting to the server in Frankfurt.

Here’s an example command line that launches a client in Dusseldorf which connects to a server in Berlin and processes each frame of a specified video:

curl -X POST https://cv-gpu-cluster.dusseldorf-main.tdg.mobiledgex.net:8008/client/benchmark/ -d "-s cv-gpu-cluster.berlin-main.tdg.mobiledgex.net --tls -e /object/detect/ -c websocket -f objects_320x180.mp4 -n PING --server-stats"

These are the results returned:

========================================================================================
Grand totals for cv-gpu-cluster.berlin-main.tdg.mobiledgex.net /object/detect/ websocket
1 threads repeated 1 times on 1 files. 313 total frames. FPS=17.51
========================================================================================
====> Average Latency Full Process=54.830 ms (stddev=26.618) FPS=18.24
====> Average Latency Network Only=11.763 ms (stddev=0.522)
====> Average Server Processing Time=31.972 ms (stddev=3.005)
====> Average CPU Utilization=17.1%
====> Average Memory Utilization=55.0%
====> Average GPU Utilization=33.5%
====> Average GPU Memory Utilization=20.4%

Details of what the client actually does can be seen in the source of the multi_client.py script in our repo. This script can be run from the command line to initiate a test, or it can be called from the /client/benchmark endpoint to remotely launch a client benchmark against any ComputerVision server.

With another script, we can launch multiple clients simultaneously, and aggregate the results. Here is an example command line for remote_bench.py and a snippet of the results:

python remote_bench.py -n 2

2/2 clients reporting:
Num Clients, FPS/Client, Total FPS, % CPU, %Mem, %GPU, %GPU Mem
2, 13.41, 26.81, 23.08, 17.83, 25.02, 15.05

The last couple of lines are CSV data that can be imported into a spreadsheet. We will see several such tables in the "Benchmark Results" section below.
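As a convenience, those CSV lines can be collected across runs. Here is a small hypothetical helper (not part of remote_bench.py; the function name and file path are illustrative assumptions) showing one way to pull the summary lines out of a run's captured output and append them to a file a spreadsheet can import:

# Hypothetical helper (not part of remote_bench.py): pull the CSV summary out of
# a captured benchmark run and append it to a file for spreadsheet import.
def append_csv_summary(output_text, results_path="results.csv"):
    lines = output_text.splitlines()
    # The summary starts at the "Num Clients, ..." header line.
    start = next(i for i, line in enumerate(lines) if line.startswith("Num Clients"))
    with open(results_path, "a") as f:
        for line in lines[start + 1:]:
            if line.strip():
                f.write(line + "\n")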

Server Hardware Architecture

CPU-only 

This configuration is used as a baseline to collect results with no GPU support. The CPU is an Intel Xeon Processor (Skylake, IBRS) with four cores running at 2992.968 MHz. Server configuration summary:

RAM Size (GB)      8
Number of vCPUs    4
Disk Space (GB)    80
Number of GPUs     0

GPU Support 

This configuration keeps the same CPU and RAM as above, but adds a GPU.

RAM Size (GB)      8
Number of vCPUs    4
Disk Space (GB)    160
Number of GPUs     1

The GPU is an NVIDIA Tesla T4 with 16 GB of memory. The NVIDIA GPU driver version is 440.64 and the CUDA version is 10.2.

Our ComputerVision server was developed using the PyTorch package and the Django framework. With Django, multiple worker processes can handle incoming requests simultaneously, more efficiently than a single multi-threaded process. In the multi-worker case, there is no shared memory between workers, which means each worker must load all of the required object detection libraries and pre-trained models into both system and GPU memory.
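As a rough illustration of why this matters, consider a module-level model load like the sketch below, which every Django worker executes independently when it imports the module. The torchvision detector here is an assumption for illustration only; the actual ComputerVision server loads its own model:

# Illustrative sketch only: each Django worker imports this module separately,
# so each worker holds its own copy of the weights in system RAM and GPU memory
# (on the order of 3 GiB of GPU memory per worker in our measurements).
import torch
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loaded once per worker at import time; nothing is shared between workers.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device)
model.eval()

def detect(image_tensor):
    # image_tensor: (C, H, W) float tensor with values in [0, 1]
    with torch.no_grad():
        return model([image_tensor.to(device)])[0]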

We found that with this 8 GB configuration and the GPU version of our ComputerVision server, only a single worker could run at a time without exhausting system memory. With a single worker process, memory utilization was 55%; two workers would require roughly 110%, which is not possible, so a single worker is all we can run in this configuration.

GPU Support with High System RAM

This is the exact same CPU and GPU configuration as above, but with 32 GB of system RAM, allowing multiple Django worker processes to be executed simultaneously.

RAM Size (GB)      32
Number of vCPUs    4
Disk Space (GB)    160
Number of GPUs     1

GPU Benchmarks

Measurement

To measure CPU and GPU utilization, we have a web service endpoint called “/server/usage/”. Here is an example curl command and some sample output while the server is under load:

curl https://cv-gpu-cluster.frankfurt-main.tdg.mobiledgex.net:8008/server/usage/

{"cpuutil": 34.6, "memutil": 49.9, "gpuutil": "46", "gpumem_util": "29"}

Internally, the real web service likewise uses the psutil Python package and the nvidia-smi command. Here is an example nvidia-smi command line, followed by example output. The first two samples were taken before the test started and show utilization values of 0%. The rest of the output is from a single-client run with a single worker process:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used --format=csv -l 1

utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB]
0 %, 0 %, 15109 MiB, 3155 MiB
0 %, 0 %, 15109 MiB, 3155 MiB
40 %, 24 %, 15109 MiB, 3155 MiB
30 %, 18 %, 15109 MiB, 3155 MiB
46 %, 27 %, 15109 MiB, 3155 MiB
45 %, 27 %, 15109 MiB, 3155 MiB
39 %, 24 %, 15109 MiB, 3155 MiB

From this, we can see that even at idle, 3155 MiB of GPU memory is allocated or "used", even though the "utilization.memory [%]" value is "0 %". This shows that a worker process consumes GPU memory even when it is not actively processing. When the worker goes from idle to actively performing object detection, the "memory.used" value remains constant; only the "utilization.gpu" and "utilization.memory" values increase and may become limiting factors. Defining exactly what these utilization numbers mean is beyond the scope of this post; the nvidia-smi documentation describes what each value represents. While 100% utilization does not necessarily mean the GPU is at its true maximum capacity, it gives us a rough approximation. To get a true reflection of the maximum capacity of a single GPU, we would need GPU profiling, which is an upcoming feature in the MobiledgeX platform.

Adding additional workers increases the “memory.used [MiB]” value. With our 16GB GPU, we found that four workers were the most we could use.

Number of Workers    memory.used [MiB]    Worker Startup Status
1                    3155                 CUDNN_STATUS_SUCCESS
2                    6033                 CUDNN_STATUS_SUCCESS
3                    8911                 CUDNN_STATUS_SUCCESS
4                    11789                CUDNN_STATUS_SUCCESS
5                    14927                CUDNN_STATUS_INTERNAL_ERROR

It seems that we should still have GPU memory free after starting the fifth worker, but in fact, it crashes with a CUDNN_STATUS_INTERNAL_ERROR. More investigation is required to find a solution to this error, but for the time being, the maximum number of worker processes we can start is four.
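A quick arithmetic check on the table values shows why the failure is surprising; even with the fifth worker's allocation counted, the table still shows some memory nominally free:

# Per-worker GPU memory growth, taken from the table above (values in MiB).
memory_used = [3155, 6033, 8911, 11789, 14927]   # 1 through 5 workers
per_worker = memory_used[1] - memory_used[0]      # 2878 MiB per additional worker
total = 15109                                     # Tesla T4 memory reported by nvidia-smi
print(per_worker)              # 2878
print(total - memory_used[4])  # 182 MiB nominally still free with five workers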

Max GPU Usage

To get a feel for how the GPU can perform for our object detection implementation, we started with an artificial test that involved no network transfers, instead processing frames from a local video file.

It is important to note that all processing times and FPS values are based on image inference times specific to our Object Detection implementation using the PyTorch package. They are also specific to the frame size and contents of the video file used as input, which remain constant throughout the tests. Using a different video or resolution is expected to change the values, possibly quite significantly. Additionally, the FPS values calculated here are unrelated to any video rendering FPS values the reader might be familiar with.
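For a concrete picture of what processing frames from a local video file involves, here is a minimal sketch of such a frame-by-frame benchmark loop. It assumes OpenCV for decoding and a torchvision pre-trained detector; the project's actual object_detector.py implementation differs:

import time
import cv2
import torch
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device)
model.eval()

def benchmark_video(path):
    cap = cv2.VideoCapture(path)
    times = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # BGR (OpenCV) -> RGB float tensor in [0, 1], shape (C, H, W)
        tensor = torch.from_numpy(frame[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255.0
        start = time.time()
        with torch.no_grad():
            model([tensor.to(device)])
        times.append(time.time() - start)
    cap.release()
    avg_ms = 1000 * sum(times) / len(times)
    print(f"Average Processing Time={avg_ms:.3f} ms FPS={1000 / avg_ms:.2f}")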

The ComputerVision Server’s object_detector.py source file contains the code for the ObjectDetector class that does the actual object detection. It also contains code for creating an ObjectDetector instance and benchmarking it from the command line. In this case, we use the option to take an MP4 video file as an input, and perform object detection on each individual frame. Example command line and output:

python object_detector.py -f ../client/objects_320x180.mp4 --server-stats

==================================================================
Grand totals
1 threads repeated 1 times on 1 files. 313 total frames.
==================================================================
====> Average Processing Time=27.688 ms (stddev=3.865) FPS=36.12
====> Average CPU Utilization=26.5%
====> Average Memory Utilization=34.4%
====> Average GPU Utilization=72.8%
====> Average GPU Memory Utilization=42.2%

We see that the average GPU utilization is 72.8% and 36.12 FPS are processed. This is the maximum throughput possible for a single worker process. If we extrapolate the utilization and FPS numbers, we calculate that the theoretical maximum FPS at 100% GPU utilization would be 49.6 FPS (36.12/72.8*100).
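The extrapolation, spelled out as a quick calculation:

# Extrapolate single-worker throughput to 100% GPU utilization.
measured_fps = 36.12
gpu_utilization = 72.8                       # percent
theoretical_max_fps = measured_fps / gpu_utilization * 100
print(round(theoretical_max_fps, 1))         # 49.6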

How close could we get to this theoretical maximum? To find out, we launched three simultaneous instances of object_detector.py and aggregated the results. The three instances finished with FPS results of 15.37, 14.35, and 14.85, totaling 44.57 FPS, and these were the CPU, RAM, and GPU stats:

====> Average CPU Utilization=76.3%
====> Average Memory Utilization=87.8%
====> Average GPU Utilization=95.9%
====> Average GPU Memory Utilization=49.7%

At 95.9% GPU utilization, we are approaching the max. Checking the math, we see that 95.9% of our theoretical max is 47.57 FPS, so this measured 44.57 FPS isn’t too far off. Note that the CPU utilization is still significantly less than the GPU utilization, confirming that our implementation is GPU-bound, at least in this local-only test, where no network transfer is in play.

Lowest Latency Single Client

In a real-world client-server configuration, we don't expect to reach the single-client performance achieved with local-only processing. A full round trip includes the time to upload the image, process it with the object detection code, and return the results to the client.

Since each frame goes through this “full process latency”, it is the measurement we use to calculate FPS. For example, if the average full process time=48.028 ms, we calculate 1/48.028*1000=20.82 FPS.

Let's take a look at the best-case client/server scenario. Since we are doing all of our testing in Germany, we measured which cloudlets have the lowest latency between them. Of the five German cloudlets, Dusseldorf and Frankfurt are the closest to each other on the map, and not coincidentally, they have the lowest latency between them.

For this test we used a single client in Dusseldorf to run the test script, connecting to the server on the Frankfurt ComputerVision app instance. Here is the command line used, and the results:

$ curl -X POST https://cv-gpu-cluster.dusseldorf-main.tdg.mobiledgex.net:8008/client/benchmark/ -d "-s cv-gpu-cluster.frankfurt-main.tdg.mobiledgex.net --tls -e /object/detect/ -c websocket -f objects_320x180.mp4 -n PING --server-stats"

===========================================================================================
Grand totals for cv-gpu-cluster.frankfurt-main.tdg.mobiledgex.net /object/detect/ websocket
1 threads repeated 1 times on 1 files. 313 total frames.
===========================================================================================
====> Average Latency Full Process=48.028 ms (stddev=32.025) FPS=20.82
====> Average Latency Network Only=3.845 ms (stddev=0.263)
====> Average Server Processing Time=31.907 ms (stddev=2.935)
====> Average CPU Utilization=19.4%
====> Average Memory Utilization=17.9%
====> Average GPU Utilization=33.0%
====> Average GPU Memory Utilization=19.6%

This test resulted in a 3.8 ms network latency and 20.82 FPS. GPU utilization was 33% -- a perfect one-third of the max. We would soon find that this didn't mean we could simply connect three clients to hit 100%.

Benchmark Results

Now that we have established the theoretical maximum FPS for object detection on our GPU, and the measured results for the best-case single-client and single-worker scenarios, it's time to examine multi-client and multi-worker scenarios. For each configuration, we start with one client and take measurements. We then increase the client count and retake the measurement, repeating until we reach five clients, the most we tested with. Then we increase the worker process count and run the whole test again.
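A sweep like this is easy to automate. The following is a hypothetical driver script (not part of the repo) that runs remote_bench.py with an increasing client count and captures each run's output for the tables below:

import subprocess

for num_clients in range(1, 6):
    print(f"--- {num_clients} client(s) ---")
    result = subprocess.run(
        ["python", "remote_bench.py", "-n", str(num_clients)],
        capture_output=True, text=True, check=True)
    print(result.stdout)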

Single Worker Process

The single worker configuration is the default for all of our deployed ComputerVision-GPU app instances. When a single client is connected, we see the same ~20 FPS measurement we've seen before. When a second client connects, we see an immediate drop-off in FPS per client, though the total FPS does increase slightly, to almost 24 FPS. The third, fourth, and fifth clients cause both the FPS per client and the total FPS to drop significantly. Neither the CPU nor the GPU is seriously taxed; both max out in the two-client scenario. From this, we can assume there is an I/O bottleneck, which we hope additional worker processes will alleviate.

Note that the system RAM usage (%MEM) stayed constant throughout these tests. This is because once the worker process is instantiated and has loaded in the object detection libraries and pre-trained models, no further memory is consumed during image processing.

Num Clients    FPS/Client    Total FPS    %CPU    %MEM    %GPU    %GPU MEM
1              20.0          20.0         23.8    17.0    38.3    23.1
2              11.7          23.5         28.9    17.0    39.9    23.8
3              6.2           18.5         28.5    17.0    33.3    20.1
4              3.6           14.4         28.4    17.0    28.0    16.6
5              2.4           11.9         28.2    17.0    27.1    15.1

Two Worker Processes

Our second scenario adds a second worker process. We see better results for multiple clients, and both CPU and GPU utilization are higher, as expected, though we never seriously tax either. The highest total FPS occurs with three clients, which is also where the highest GPU usage is seen. Adding clients 4 and 5 increases %CPU, but does not increase %GPU or Total FPS. The bottleneck has been reached again.

While the FPS/Client did increase when the second worker was added, we were surprised that the two-client configuration did not hit 20 FPS for each client, since each client had its own dedicated worker. This suggests an I/O bottleneck between the Python worker implementation and the GPU. We would need to investigate optimizing how the workload is sent to the GPU.

Again note that %MEM remains steady no matter how many clients are served.

Num Clients    FPS/Client    Total FPS    %CPU    %MEM    %GPU    %GPU MEM
1              20.6          20.6         27.9    28.0    39.4    23.7
2              13.7          27.4         42.4    28.1    50.5    27.6
3              10.8          32.3         50.8    28.0    64.4    35.0
4              7.5           30.1         55.1    28.0    63.9    34.8
5              6.5           32.4         54.2    28.1    58.1    32.6

Three Worker Processes

The next test adds a third worker. Here, at four clients, we are starting to tax the GPU at over 75%. We now see a clear pattern where the best overall throughput occurs when the number of clients exceeds the number of workers.

Num Clients    FPS/Client    Total FPS    %CPU    %MEM    %GPU    %GPU MEM
1              20.7          20.7         21.2    39.2    36.3    21.6
2              15.2          30.5         45.8    39.1    56.8    30.6
3              11.2          33.7         71.0    39.1    73.3    38.0
4              8.7           34.7         73.8    39.2    75.9    39.5
5              7.0           35.0         63.2    39.4    65.4    35.2

Four Worker Processes

The final test scenario increases the worker process count to four, which is the most our GPU supports. The pattern we were seeing doesn’t quite hold, as using four clients hits the highest total FPS, %CPU, and %GPU of all the tests.

The max total FPS seen is 41.5 with a GPU usage of 84.8%. Another math check: 84.8% of our theoretical max of 49.6 comes out to 42.1 FPS. This measured 41.5 FPS is very close.

Num Clients    FPS/Client    Total FPS    %CPU    %MEM    %GPU    %GPU MEM
1              19.6          19.6         31.5    50.2    36.8    22.3
2              14.4          28.7         55.2    50.2    64.4    33.5
3              11.7          35.2         62.6    50.2    74.1    39.3
4              10.4          41.5         83.4    50.2    84.8    43.2
5              7.8           39.2         80.4    50.3    76.2    39.5

Conclusion

For a processing-intensive activity like object detection, a GPU improves processing time tremendously. Still, if the server is being used in a multi-client scenario, the GPU processing power is quickly exhausted.

It is clear that to get the most out of our GPU, we must enable multiple worker processes (at the expense of system RAM). Revisiting the earlier diagram of our use case, we present an updated image showing each client connecting to its own worker process. This is the ideal case.

Supporting multiple workers with our current implementation requires a high-RAM server, which may not be financially viable for many organizations. We must continue to investigate ways to reduce our memory footprint and remove bottlenecks. These are some topics on our roadmap that we hope will improve our ComputerVision server's performance:

  • Video streaming - Our current implementation sends a single frame at a time, and when processing results are received, it sends the next one. A video streaming implementation could improve the overall FPS, but possibly at an upstream bandwidth cost.

  • DeepStream - Early experiments with NVIDIA’s DeepStream SDK show enormous promise in terms of sheer FPS throughput. Defining our use case and implementing it should prove exciting and rewarding.