Tuesday, 1 July 2014

C++ Performance - Data Structure and Memory Fragmentation

Computer science is all about data structures and algorithms. For a given algorithm, a good or bad data structure makes a great difference to performance. Here I would like to present a real example I experienced recently that shows how a data structure affects memory fragmentation and, in turn, the performance of an algorithm.

As a follow-up to the performance issue I discussed in my other blog entry, Performance - Running out of memory (heap), this article supplies a real example in which a better data structure reduces memory fragmentation and improves performance.

1. Scenario Background
I work on a scientific/mathematical modelling application. Assume you have a system consisting of half a million to a million variables and equations. To improve speed (and/or accuracy), various top-level algorithms are employed to check the sanity of the system and decompose the large system into sub-systems. These algorithms are usually the hot-spots of the program, both for structural analysis and for initialization/re-initialization when the system encounters a discontinuity. Optimizing them is therefore vital for the overall performance of the system.

No surprise, the system can be represented as a sparse matrix (equations as rows and variables as columns), mathematically known as the Jacobian matrix. For such a huge system, which easily reaches hundreds of millions of elements, crafting a good data structure means a great deal to the performance of the algorithms. A good design saves process memory, reduces or removes memory fragmentation, and hence boosts performance. In other words, the implementation does matter.

2. Problems
This algorithm was originally implemented in Fortran, and it worked well on thousands of tests with consistent performance. About five years ago the company decided to replace the Fortran code with a C/C++ implementation. A bit of extra responsibility was introduced into the C/C++ implementation, and some optimization of the algorithm was made along the way.

However, a problem popped up: for some test cases, release by release, this algorithm showed dramatic performance degradation, overall taking ~5x longer than usual. Initially I suspected the slowdown was inherent to the algorithm, as the developer had spotted the performance issue while developing/testing the C/C++ implementation. He mentioned the defect at the time but did not investigate further and simply blamed the algorithm. (Worse, I initially followed his lead. This is what developers often do under pressure, with limited resources and time.)

However, when the same data is fed into the Fortran implementation, it gives consistent and much faster performance. After ruling out one guess after another, I came to believe that the C++ implementation itself was causing the worsening performance. After verifying that it was a correct implementation, just a slower one, further experiments proved that it was the data structure of the C++ implementation that caused the performance issue.

3. Data Structure 
Surprisingly, the system still uses Fortran 77. This means all memory has to be allocated in advance, before the variables come into use. (It is hard to imagine how reluctant the industry is to adopt new techniques, even when the business is so far behind the technology. Again, it shows why programmers should try to get things right the first time; otherwise you will not have many chances to improve them, or you simply decide not to.) Of course, Fortran 77 has no feature-laden containers like C++. So the data structure is very simple - big arrays, as sketched in the snippet after this list:
    - A huge array containing all the Jacobian elements (size: hundreds of millions to a few billion)
    - A big array containing the start and end indices of each row's Jacobian elements (size: hundreds of thousands to a million)
    - Big arrays for the properties of each equation/variable (size: hundreds of thousands to a million)
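
To make that layout concrete, here is a minimal C++ sketch of the Fortran-77-style storage; all names (FlatJacobian, row_start, and so on) are illustrative, not from the original code:

// ********************************************************************************
#include <vector>

// Minimal sketch (illustrative names): the whole sparse Jacobian lives in a few
// big contiguous arrays, allocated once up front, Fortran-77 style.
struct FlatJacobian {
    std::vector<double> jacobian;  // all Jacobian elements in one big array
    std::vector<int> row_start;    // index of the first element of each equation's row
    std::vector<int> row_end;      // index one past the last element of each row
    std::vector<int> column;       // column (variable) index of each element
};

// Iterating over one equation's row touches a single contiguous block of memory.
double row_sum(const FlatJacobian& J, int row) {
    double sum = 0.0;
    for (int k = J.row_start[row]; k < J.row_end[row]; ++k)
        sum += J.jacobian[k];
    return sum;
}
// ********************************************************************************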

Data structure used in C/C++ implementation
The C++ data structure is quite intuitive. Simply chop the Jacobian matrix into rows and columns and save the dependent variables/equations of each row/column in a std::vector.

// ********************************************************************************
struct JacobianRowOrColumn {
    std::vector<int> dependent; // dependent variables/equations
    bool assigned;              // assignment flag used by the algorithm
    // ......
};

typedef JacobianRowOrColumn JacobianRow;    // equation has dependent variables
typedef JacobianRowOrColumn JacobianColumn; // variable has dependent equations
// ********************************************************************************

This data structure is very straightforward and simple, and it would not be a bad design if the scale of the data and the complexity of the system were not considered. The case I am investigating has ~130K variables and ~130K equations (the number of variables equals the number of equations), and roughly ~25 million Jacobian elements. The tests show:
    - The C++ version uses ~4x as much memory as the Fortran version
    - The C++ version runs ~10x slower than the Fortran version

The higher memory usage in C++ is no surprise, because std::vector is used: more memory is allocated/committed than is actually needed. The performance, however, was a real surprise. I am convinced that the worsening performance is caused by memory fragmentation, because a plain C++ implementation I developed has performance similar to Fortran, roughly ~25% slower. (From my personal experience this gap is simply the language difference between C/C++ and Fortran.)
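
Part of that overhead is easy to see in isolation. The following is a rough illustration (exact numbers are implementation-dependent) of what every single std::vector costs on top of its payload:

// ********************************************************************************
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 100; ++i) v.push_back(i);
    // The vector object itself (typically three pointers on a 64-bit platform) ...
    std::printf("sizeof(std::vector<int>) = %zu bytes\n", sizeof v);
    // ... plus a separately allocated heap buffer, usually over-committed
    // by the growth strategy: capacity() >= size().
    std::printf("size = %zu, capacity = %zu\n", v.size(), v.capacity());
    return 0;
}
// ********************************************************************************

Multiply that by hundreds of thousands of small vectors, and both the extra memory and the scattered heap allocations add up quickly.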

Digging a bit deeper into what actually causes the performance degradation, it has to be memory fragmentation. In the Fortran and plain C++ implementations, the whole Jacobian matrix is allocated in one big array, and pointers into that array are passed around for computation. In the C++ implementation, by contrast, the data structure means that ~260K small std::vector instances are created and committed on the stack and heap (the instances plus their data). With that per-vector overhead on top, higher memory consumption is inevitable, and the fragmented small allocations are the only explanation for the slowdown. Besides, as I explained in Section 1, these top-level algorithms are hot-spots: the data is allocated and released many times while solving the system/equations.

Fix of this data structure
Following the plain C++ implementation, the idea is to reduce the frequency of small memory allocations/releases while keeping the data structure for readability and maintainability:
    - Keep one big array for the Jacobian matrix to reduce memory fragmentation
    - Use a pointer into the big array to record the references to dependent variables/equations for better design

The data structure is modified as follows:
// ********************************************************************************
struct JacobianRowOrColumn {
    int* dependent; // dependent variables/equations 
    size_t count_of_dependent; // number of dependent variables/equations
    bool assgined; //
    // ......
};

typedef JacobianRowOrColumn JacobianRow; // equation has dependent variables
typedef JacobianRowOrColumn JacobinColumn; // variable has dependent equations
// ********************************************************************************

This solution is quite straightforward: replace std::vector with a pointer plus an extra data member, the count of dependent variables/equations. Here the pointer is not assigned with the new operator or malloc. Following the Fortran implementation, one big array is allocated to contain the Jacobian matrix, and the pointer in each JacobianRow and JacobianColumn is assigned the big array's base address plus an offset. This data structure removes all the small memory allocations made by std::vector, hence removing the memory fragmentation, and at the same time memory usage goes down because the per-vector overhead is gone. Better memory usage and faster memory access.
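
A minimal sketch of that wiring, with illustrative names (this is not the production code), might look like this:

// ********************************************************************************
#include <cstddef>
#include <vector>

struct JacobianRowOrColumn {
    int* dependent;             // points into the shared big array
    size_t count_of_dependent;  // number of dependent variables/equations
    bool assigned;
};

// One big allocation for all rows; each row's pointer is base + offset.
// Note: big_array must be fully sized up front and never resized afterwards,
// or the pointers into it would be invalidated.
void build_rows(std::vector<int>& big_array,
                const std::vector<size_t>& row_counts,   // dependents per row
                std::vector<JacobianRowOrColumn>& rows) {
    rows.resize(row_counts.size());
    size_t offset = 0;
    for (size_t r = 0; r < row_counts.size(); ++r) {
        rows[r].dependent = big_array.data() + offset;   // no new/malloc per row
        rows[r].count_of_dependent = row_counts[r];
        rows[r].assigned = false;
        offset += row_counts[r];
    }
}
// ********************************************************************************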

The result shows that the plain C++ implementation is only marginally slower (~25%) than Fortran, and its peak and committed memory usage is <10% more than the Fortran implementation's. In my experience these numbers are very good, and the remaining gap comes down to the language difference.

4. Summary
Tuning performance can be fun as well as tough work. Having the right tools and methodology is the key. The right tools help developers quickly find the hot-spots, for instance the most expensive path. The right methodology keeps you navigating in the correct direction: always work on the hot-spots in descending order of cost, and find a benchmark if working on a performance regression.

The first thing to focus on is the algorithm rather than the implementation. It is a rare case that performance degradation is caused by the implementation; most of the cases I have worked on so far were algorithmic, for instance improving an O(N^2) algorithm to O(log(N)), or O(log(N)) to constant time, such as using a hash table to replace std::map.
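
For example, a generic sketch of that last swap (not code from this project):

// ********************************************************************************
#include <map>
#include <string>
#include <unordered_map>

// std::map is a balanced tree: O(log N) per lookup, but keeps keys ordered.
std::map<std::string, int> ordered_index;

// std::unordered_map is a hash table: average O(1) per lookup,
// a common drop-in replacement when key ordering is not needed.
std::unordered_map<std::string, int> hashed_index;
// ********************************************************************************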

Obviously this is a very good example of how the implementation matters for performance. The key is to understand your business domain and gain deep knowledge of your application. Be clear about the data size, the system complexity and the data flow. A big-data application and/or a high-performance real-time application definitely has extra design and implementation challenges. Understand the requirements before starting design and implementation.

