Measuring the performance of C++ code, using an std::vector example

28/07/2017

We know that using std::vector comes with a performance hit, both in terms of memory and execution time, but how much exactly? And more importantly, how can we measure the performance of our code, both in this post's example scenario and more generally?

In this post, I'll be using a recent change I made to the MONSTR plugin to demonstrate the performance impact of using a vector, and in doing so will explain my methodology for measuring the performance of the code. As you'll see, my measurements are by no means an exhaustive investigation of how the code performs, but I believe they provide enough information to begin to make more informed decisions about code implementation.

The code change which this post is based on is intended for realtime audio applications, so when referring to "performance" in the rest of this post, my priority is execution time, and memory usage is not the primary concern.

Implementation

Below is the original implementation of the code that I'll be using as an example. (You can find the full change here.)
    void Process2in2out(double* leftSample, double* rightSample, size_t numSamples) {
        // make a copy of the buffers for each band to process in parallel
        std::vector<double> band1LeftBuffer(numSamples);
        std::vector<double> band1RightBuffer(numSamples);
        std::vector<double> band2LeftBuffer(numSamples);
        std::vector<double> band2RightBuffer(numSamples);
        std::vector<double> band3LeftBuffer(numSamples);
        std::vector<double> band3RightBuffer(numSamples);

        for (size_t iii {0}; iii < numSamples; iii++) {
            band1LeftBuffer[iii] = leftSample[iii];
            band2LeftBuffer[iii] = leftSample[iii];
            band3LeftBuffer[iii] = leftSample[iii];

            band1RightBuffer[iii] = rightSample[iii];
            band2RightBuffer[iii] = rightSample[iii];
            band3RightBuffer[iii] = rightSample[iii];
        }

        // let each band do its processing
        band1.process2in2out(band1LeftBuffer, band1RightBuffer);
        band2.process2in2out(band2LeftBuffer, band2RightBuffer);
        band3.process2in2out(band3LeftBuffer, band3RightBuffer);

        // combine the output from each band, and write to output
        for (size_t iii {0}; iii < numSamples; iii++) {
            leftSample[iii] = band1LeftBuffer[iii] + band2LeftBuffer[iii] + band3LeftBuffer[iii];
            rightSample[iii] = band1RightBuffer[iii] + band2RightBuffer[iii] + band3RightBuffer[iii];
        }
    }
In summary, this method is given a pointer to the start of a buffer of samples corresponding to the left audio channel, and a pointer to the start of a buffer of samples corresponding to the right channel. Its job is to make three copies of these buffers and pass them to the processing methods of the objects band1, band2, and band3. Finally, it must sum the buffers and write the output back to the original buffers.

It's worth noting that when each vector is created, it is already pre-sized to numSamples. This is so that values can be copied into the vector without needing to use its push_back method. It's more efficient to do this where possible, as it prevents the vector from needing to grow and allocate more memory as elements are added (which can become quite an expensive operation!).
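As a quick illustration of the difference, here is a minimal standalone sketch (not taken from the plugin) comparing a pre-sized vector with one that grows via push_back:

    #include <cstddef>
    #include <vector>

    // Pre-sized: a single allocation up front, then plain element writes.
    void copyPreSized(const double* input, size_t numSamples, std::vector<double>& out) {
        out.resize(numSamples);
        for (size_t iii {0}; iii < numSamples; iii++) {
            out[iii] = input[iii];
        }
    }

    // push_back: the vector may have to reallocate and move its contents
    // several times as it grows towards numSamples elements.
    void copyWithPushBack(const double* input, size_t numSamples, std::vector<double>& out) {
        out.clear();
        for (size_t iii {0}; iii < numSamples; iii++) {
            out.push_back(input[iii]);
        }
    }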

And now the same method using double pointers instead of vectors:
    void Process2in2out(double* leftSample, double* rightSample, size_t numSamples) {
        // make a copy of the buffers for each band to process in parallel
        double* band1LeftBuffer {new double[numSamples]};
        double* band1RightBuffer {new double[numSamples]};
        double* band2LeftBuffer {new double[numSamples]};
        double* band2RightBuffer {new double[numSamples]};
        double* band3LeftBuffer {new double[numSamples]};
        double* band3RightBuffer {new double[numSamples]};

        for (size_t iii {0}; iii < numSamples; iii++) {
            band1LeftBuffer[iii] = leftSample[iii];
            band2LeftBuffer[iii] = leftSample[iii];
            band3LeftBuffer[iii] = leftSample[iii];

            band1RightBuffer[iii] = rightSample[iii];
            band2RightBuffer[iii] = rightSample[iii];
            band3RightBuffer[iii] = rightSample[iii];
        }

        // let each band do its processing
        band1.process2in2out(band1LeftBuffer, band1RightBuffer, numSamples);
        band2.process2in2out(band2LeftBuffer, band2RightBuffer, numSamples);
        band3.process2in2out(band3LeftBuffer, band3RightBuffer, numSamples);

        // combine the output from each band, and write to output
        for (size_t iii {0}; iii < numSamples; iii++) {
            leftSample[iii] = band1LeftBuffer[iii] + band2LeftBuffer[iii] + band3LeftBuffer[iii];
            rightSample[iii] = band1RightBuffer[iii] + band2RightBuffer[iii] + band3RightBuffer[iii];
        }

        delete[] band1LeftBuffer;
        delete[] band1RightBuffer;
        delete[] band2LeftBuffer;
        delete[] band2RightBuffer;
        delete[] band3LeftBuffer;
        delete[] band3RightBuffer;
    }
It's not entirely clear from this snippet, but the process methods for the objects band1, band2, and band3 have also been modified to accept double pointers rather than vectors.
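To make that concrete, the band classes' declarations now look roughly like this (the class and parameter names here are illustrative, not the exact declarations from the plugin):

    #include <cstddef>

    class MONSTRBand {  // hypothetical declaration, for illustration only
    public:
        // Before: buffers passed as vectors, which carry their own length.
        // void process2in2out(std::vector<double>& leftSample, std::vector<double>& rightSample);

        // After: raw pointers, so the length must be passed explicitly.
        void process2in2out(double* leftSample, double* rightSample, size_t numSamples);
    };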

The obvious downside here is that we now have to remember to deallocate memory when we're done with it. In a short method like this it's not really a problem, but in scenarios where the scope of your dynamically allocated objects is larger, things often get more complicated.
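One option worth mentioning (not used in the change above) is std::unique_ptr with an array type, which keeps the raw-buffer behaviour but handles the deallocation automatically. A minimal sketch:

    #include <cstddef>
    #include <memory>

    // Copies one channel into a scratch buffer that frees itself when the
    // unique_ptr goes out of scope - no matching delete[] required.
    std::unique_ptr<double[]> copyChannel(const double* input, size_t numSamples) {
        std::unique_ptr<double[]> buffer {new double[numSamples]};
        for (size_t iii {0}; iii < numSamples; iii++) {
            buffer[iii] = input[iii];
        }
        return buffer;
    }

    // The raw pointer can still be handed to code expecting a double*:
    // band1.process2in2out(leftCopy.get(), rightCopy.get(), numSamples);

Whether the small amount of extra syntax is worth it is again a judgement call, but it removes the easiest mistake to make with the manual version.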

Measuring Performance

Now for the interesting part: how does the new implementation compare to the old one? To measure this I've used two methods:
  1. Cachegrind
  2. Unit tests which measure execution time

Cachegrind

Cachegrind is part of the Valgrind suite of tools, and provides an estimate of the time spent in different parts of the code. Strictly speaking we'll be invoking Callgrind, Cachegrind's call-graph extension, as that is the output KCachegrind is built to visualise. There are three main steps to performing this analysis:

Downloading the required tools

You'll need two components to use Cachegrind effectively: the Valgrind tool suite to generate the results, and KCachegrind or QCachegrind to analyse and view them.

I use the following script in my build to install valgrind:
    #!/bin/bash

    svn co svn://svn.valgrind.org/valgrind/trunk valgrind

    cd valgrind
    ./autogen.sh
    ./configure
    make
    sudo make install
Then to view the results, you'll need either KCachegrind (Windows and Linux) or QCachegrind (macOS).

KCachegrind can be installed from the following website: http://kcachegrind.sourceforge.net/html/Home.html

QCachegrind can be installed using Homebrew:
    brew install qcachegrind

Running your binary with Cachegrind

This part is fairly easy: just run the following to start your binary under Valgrind:
    valgrind/coregrind/valgrind --tool=callgrind <path to your binary>
When your program completes, Valgrind will create a results file with a name of the form callgrind.out.<Process ID>.

Analysing the results

From now on, wherever I mention KCachegrind you can assume I'm also referring to QCachegrind.

The results file on its own won't tell you much, so you'll need to start KCachegrind, go to File -> Open, and select your results file. You should now see the main KCachegrind window, which is split into two panes; the rows of buttons above the top pane and below the bottom pane control the kind of information presented in the adjacent pane. If you're new to KCachegrind I recommend clicking through these to see what's available, but I find the Callee Map and Callees views to be the most useful.

But Cachegrind only provides an estimate of performance, which brings us to the next method...

Unit Tests

To get a more tangible feel for the effect of the change, I created a performance-oriented unit test. This simply sets up 100 buffers, each containing 1024 samples, and measures the time taken to process each buffer. In reality buffer sizes as small as 64 samples may be used, but 1024 was chosen as it is representative of the setting used on the worst performing systems, where a performance saving could be most crucial.
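The timing itself can be done with std::chrono. The sketch below is a simplified illustration of the approach rather than the exact code from the test suite:

    #include <chrono>
    #include <cstddef>

    // Measures how long a single call to Process2in2out takes, in microseconds.
    // Processor stands in for whichever class is under test.
    template <typename Processor>
    double timeSingleBuffer(Processor& processor,
                            double* leftSample,
                            double* rightSample,
                            size_t numSamples) {
        const auto start = std::chrono::steady_clock::now();
        processor.Process2in2out(leftSample, rightSample, numSamples);
        const auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::micro>(end - start).count();
    }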

The tests were designed to measure 3 performance criteria, and fail if the limits of any of these were breached. The criteria are as follows:
  1. Execution time for each individual buffer
  2. Average execution time of the buffers
  3. Standard deviation of the execution times for each buffer
Together, the average and standard deviation provide a more complete picture of what is happening. As this code is designed for realtime applications, the standard deviation is in some ways more important than the average alone, since you won't get any bonus points for processing most buffers extremely quickly if, every once in a while, a buffer is processed slowly enough that it doesn't arrive on time.
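Computing those two statistics from the collected per-buffer timings is straightforward. Again, this is a minimal sketch rather than the exact code used in the tests:

    #include <cmath>
    #include <vector>

    // Average of the per-buffer execution times.
    double mean(const std::vector<double>& times) {
        double sum {0};
        for (double time : times) {
            sum += time;
        }
        return sum / times.size();
    }

    // Standard deviation of the per-buffer execution times.
    double standardDeviation(const std::vector<double>& times) {
        const double average {mean(times)};
        double sumSquares {0};
        for (double time : times) {
            sumSquares += (time - average) * (time - average);
        }
        return std::sqrt(sumSquares / times.size());
    }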

One disadvantage of this method of measuring performance is that your results will be influenced by the hardware used to run the tests, as well as any other processes running on the hardware at the same time.

The unit tests can be found in the following file, which also contains performance tests for several other parts of the library: https://github.com/jd-13/WE-Core/blob/master/Tests/PerformanceTests.cpp

Results

    [Screenshots: KCachegrind output for the buffer processing method, before and after the change]

Above you can see the before and after comparison in KCachegrind. As you can see, the change has reduced the cycle estimate for our method by 118162, or around 11%. But what does this actually mean in practice? Let's compare it to the results of the unit tests:

               Average Execution Time (us)   Standard Deviation (us)
    Before     0.48156                       0.0491323
    After      0.44672                       0.0812017

That's an improvement in average execution time of around 7%, a little less than the improvement indicated by KCachegrind. But the standard deviation has increased substantially. This could just be a one-off, potentially caused by resource contention on my system, or it could be an indication of a genuine issue. To be sure, I ran the test multiple times in both the before and after configurations. As it turns out, the increased standard deviation did not appear reliably in further runs, so I decided this was unlikely to be a serious issue.

The results of these additional test runs are below.

                   Average Execution Time (us)   Standard Deviation (us)
    Before (1)     0.45164                       0.0254561
    Before (2)     0.46164                       0.0324235
    Before (3)     0.46671                       0.0313653
    After (1)      0.43127                       0.0314517
    After (2)      0.43227                       0.0289865
    After (3)      0.42754                       0.0324029

Next Steps

In this post, I inspected the performance of only a portion of the code which is executed when a buffer needs to be processed. The code that was measured does include the sections I would expect to account for the majority of the execution time, but without measuring all of the code that is executed we can't be sure of the execution time of the code as a whole, and therefore don't know whether our performance gains are meaningful. Sure, we may have improved the performance of this section of code by somewhere in the region of 7-11%, but that doesn't mean much if a portion of code outside the areas we measured takes an order of magnitude longer to execute.

In this scenario I can be fairly certain that this is not the case, but nonetheless a more complete approach would measure the entirety of the code executed. There are several reasons I was unable to do that here, one being that I can't currently spend enough time on this project to justify setting up a CI system that can build and test the entire plugin, though this is on my wishlist.

A further performance optimisation would be to have the class to which this method belongs allocate a reasonably sized buffer upon construction, and deallocate it upon destruction. I expect that this would yield a significant performance increase; however, there is one key issue: we don't know how big the buffer needs to be until the method is called. This could be worked around by implementing a mechanism which grows the buffer if the numSamples parameter is larger than the currently allocated buffer, but this leads to an increase in complexity that I'm not sure the performance benefit can justify. It is, however, something that I will be considering in future implementations.
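A rough sketch of what that growing buffer might look like, purely as a hypothetical illustration of the idea rather than anything implemented in the plugin:

    #include <cstddef>
    #include <vector>

    // A scratch buffer owned by the processing class. It only grows when a
    // larger block is requested, so in the steady state the audio thread
    // performs no allocations at all.
    class ScratchBuffer {
    public:
        double* get(size_t numSamples) {
            // Growing allocates, so ideally this happens only once, when the
            // host first requests its largest block size.
            if (_buffer.size() < numSamples) {
                _buffer.resize(numSamples);
            }
            return _buffer.data();
        }

    private:
        std::vector<double> _buffer;
    };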

Final Thoughts

So we've seen that manually allocating memory can be faster than using vectors, but whether the performance increase is worth the extra risk involved is something that must be decided on a case by case basis. In general I prioritise maintainability and readability over performance, but in this case I've decided that the risk is low and the reduction in execution time is not insignificant, and that therefore std::vector might not be the best choice.

But more important, I think, is the emphasis on taking a more scientific approach and actually measuring the performance of your code, so that you can make more informed implementation decisions.