Read/Upgradable/Write Lock Performance

Introduction

This document presents some performance data and design justification for the mutex/lock design in this companion document. This document is specifically meant to address statements such as:

This last point I really won't directly dispute. Rather I hope to show that the cost of supporting upgradable locks is negligible, and that there do exist cases where they really are the best solution. And therefore I argue that to have this tool in the tool chest can only do good, and no harm.

Implementation

The interface described here has been implemented in two different ways:

  1. In a relatively portable fashion utilizing the OS mutex (such as pthreads) and condition variables to synchronize changes to the rw_mutex state.
  2. Still using OS mutexes and condition variables, but greatly reducing the need for them by also taking advantage of PowerPC specific instructions: lwarx, stwcx., isync and sync to atomically update the rw_mutex state when possible (not quite lock free, but lock reduced).

Both implementations use an algorithm inspired by Alexander Terekhov which attempts to side step the issue of reader priority or writer priority by having all threads initially wait at the same entry gate. This leaves the priority for entry into the first gate up to the thread scheduler. Once the first gate is entered by a writer thread, it blocks until all threads currently holding read locks have released those locks. While the writer thread is blocked on the second gate, no new read locks can be acquired.

The second implementation above drives the arguments for this document. The first implementation is significantly more expensive. It is a given that a lock reduced implementation can be achieved on platforms other than PowerPC, but this document does not provide such evidence.

All tests presented herein were run on a Mac OS 10.3 system with dual G4 processors. The dual processor spec is key to demonstrating the desirability of the upgradable characteristic of the mutex. The goal is to minimize calendar time spent on a computation by keeping all available processors as busy as possible. The test below demonstrates that in some circumstances the lack of an upgradable mutex can have the effect of idling an available processor. As the number of available processors in a system increase, the possibility of idle processors becomes stronger.

The actual implementation of this interface isn't presented herein as it is the property of Freescale, however I can speak about the costs of adding upgradable support.

A single word is used to record the state of the exclusive/upgradable/sharable mutex so that the lock reduced implementation can atomically update the state of the mutex when there is no contention for the lock. When there is contention for the lock, one or more threads will block on a condition variable which of course requires a locked OS mutex.

The cost to support upgradable is a single extra bit out of the word holding the mutex state. This extra bit reduces the number of concurrent sharable threads that can run by a factor of 2. That is, on a 32 bit system, the number of sharable threads that can run concurrently in this design, but without supporting upgradable, is 536,870,911, and adding upgradable support reduces that figure to 268,435,455 concurrent sharable threads. If a thread attempts to obtain a sharable lock when the mutex has already allocated the maximum number of sharable locks, that thread will block until another thread releases a sharable lock.

There is no run time complexity required to support upgradable beyond the extra mutex functions that deal specifically with upgradable. The presence of the upgradable functionality does not adversely affect the performance or code size of the read/write mutex functionality that does not specifically deal with upgradable (that code is affected, just there is no code size or run time cost ... constants differ, that's all).

So the chief cost of supporting upgradable is a reduction in the number of concurrent sharable locks that can be granted from 536,870,911, to 268,435,455 (these numbers will of course vary predictably for non-32 bit systems), plus the code that deals specifically with upgradable locking such as lock_upgradable(). Out of 16 mutex functions, 8 exist to support upgradable (this excludes timed lock functions). They are all listed here.

Test Case

The test code is meant to motivate the existence of read/upgradable/write mutexes. Therefore it presents a use cases which attempt to expose the advantage of such mutexes. The prevalence of such use cases found in the wild is not addressed here.

The test case below was inspired by a recent posting by Peter Dimov to boost which presented a read/write mutex performance test case. This test case expands on that idea by providing configurable-cost protected sections, and by adding the upgradable concept. The test below also adds a rudimentary check that the protected section is actually being protected.

The test case below is offered, knowing it is not portable, in a spirit of openness and transparency.

#include <msl_thread>
#include <iostream>
#include <sstream>
#include <vector>
#include <assert.h>

#define RU  1
#define RW  2
#define WW  3
#define SL  4

#define MODE RU

unsigned const n = 1000000;
unsigned const vec_len = 10;

unsigned const n_readers = 8;
unsigned const n_upgradables = 2;
unsigned const n_writers = 2;

#if MODE == SL
    typedef Metrowerks::mutex mutex;
    typedef mutex::scoped_lock read_lock;
    typedef mutex::scoped_lock upgradable_lock;
    typedef mutex::scoped_lock write_lock;
#else
    typedef Metrowerks::rw_mutex mutex;
    typedef mutex::read_lock read_lock;
    typedef mutex::upgradable_lock upgradable_lock;
    typedef mutex::write_lock write_lock;
#endif

mutex mtx;
mutex cout_mut;

std::vector<int> v(vec_len);

void clearscr()
{
    std::cerr << "\033[2J";
}

void gotoxy(int row, int col)
{
    std::ostringstream out;
    out << "\033[" << row << ';' << col << 'H';
    std::cerr << out.str();
}

void reader_thread(int id)
{
    for( unsigned i = 0; i < n; ++i )
    {
        if ((i+1) % (n/100) == 0)
        {
            write_lock lock(cout_mut);
            gotoxy(id, 0);
            std::cerr << "read thread "<< id << " at " << (float)(i+1)/n * 100 << '%';
            gotoxy(30,0);
        }
#if MODE == WW
        write_lock lock(mtx);
#else
        read_lock lock(mtx);
#endif
        int m = v.front();
        for (std::vector<int>::const_iterator i = v.begin(), e = v.end();
            i < e; ++i)
        {
            assert(*i == m);
        }
    }
}

void writer_thread(int id)
{
    for( unsigned i = 0; i < n; ++i )
    {
        if ((i+1) % (n/100) == 0)
        {
            write_lock lock(cout_mut);
            gotoxy(id, 0);
            std::cerr << "writ thread "<< id << " at " << (float)(i+1)/n * 100 << '%';
            gotoxy(30,0);
        }
        write_lock lock(mtx);
        int m = v.front();
        for (std::vector<int>::iterator i = v.begin(), e = v.end();
            i < e; ++i)
        {
            *i += 1;
            assert(*i == m+1);
        }
    }
}

void upgradable_thread(int id)
{
    for( unsigned i = 0; i < n; ++i )
    {
        if ((i+1) % (n/100) == 0)
        {
            write_lock lock(cout_mut);
            gotoxy(id, 0);
            std::cerr << "upgr thread "<< id << " at " << (float)(i+1)/n * 100 << '%';
            gotoxy(30,0);
        }
#if MODE == RU
        upgradable_lock lock(mtx);
#else
        write_lock lock(mtx);
#endif
        int m = v.front();
        for (std::vector<int>::const_iterator i = v.begin(), e = v.end();
            i < e; ++i)
        {
            assert(*i == m);
        }
        if (i % 10 == 0)
        {
#if MODE == RU
            write_lock wlock(std::move(lock));
#endif
            for (std::vector<int>::iterator i = v.begin(), e = v.end();
                i < e; ++i)
            {
                *i += 1;
                assert(*i == m+1);
            }
        }
    }
}

#include <bind>
#include <vector>

int main()
{
    using std::tr1::bind;
    using namespace std::tr1::placeholders;
    using Metrowerks::thread;
    clearscr();
    std::vector<thread> tg;
    int id = 0;
    Metrowerks::universal_time t1;
    for( int i = 0; i < n_writers; ++i )
        tg.push_back(thread(bind(writer_thread, ++id)));
    for( int i = 0; i < n_readers; ++i )
        tg.push_back(thread(bind(reader_thread, ++id)));
    for( int i = 0; i < n_upgradables; ++i )
        tg.push_back(thread(bind(upgradable_thread, ++id)));
    std::for_each(tg.begin(), tg.end(), bind(&thread::join, _1));
    gotoxy(++id, 0);
    Metrowerks::universal_time t2;
    Metrowerks::elapsed_time d = t2 - t1;
    std::cout << d.sec_ + d.nsec_ / 1000000000. << '\n';
}

In a nutshell, the test creates a vector<int> of length vec_len, with all elements of the vector equal. The read test does nothing but assure that all elements of the vector are equal. The write test atomically increments all elements of the vector by 1. The upgradable test always performs the read test, and 10% of the time also performs the write test, confirming that the resultant vector is only 1 plus the value read under the read lock.

For purposes of monitoring, the status of each thread is output in a non-portable fashion (I hope it works for you). It is believed that the output operation (which also uses mutex operations) will not significantly impact the test results as status is reported for only 0.01% of the iterations.

The test code is configurable across several axes:

The above test is run for 60 different configurations:

For each of the above cases, the constant vec_len is varied from 10 to 100 to 1,000 to 10,000. And for each of those values, a test is run for:

  1. A portable read/write/upgradable implementation (P).
  2. A pthreads based implementation where all locks are exclusive (Pt).
  3. A lock reduced implementation including read, write and upgradable locks (RU).
  4. A lock reduced implementation including read and write but not upgradable locks (RW).
  5. A lock reduced implementation including only write (exclusive) locks (WW).

The results are:

1 read thread, 1 upgradable thread
seconds
vec_len P Pt RU RW WW
10 21.1 4.90 1.03 1.13 1.18
100 26.6 16.2 4.59 4.48 4.03
1000 18.0 45.5 15.9 25.6 28.8
10000 134 267 133 236 249
normalized by fastest
vec_len P Pt RU RW WW
10 20.5 4.76 1 1.10 1.15
100 6.60 4.02 1.14 1.11 1
1000 1.13 2.86 1 1.61 1.81
10000 1.01 2.01 1 1.77 1.87

2 read threads, 1 upgradable thread
seconds
vec_len P Pt RU RW WW
10 56.6 17.5 1.24 1.48 1.97
100 66.9 37.4 4.61 4.68 10.1
1000 31.8 76.2 21.1 36.4 70.5
10000 191 398 187 242 374
normalized by fastest
vec_len P Pt RU RW WW
10 45.6 14.1 1 1.19 1.59
100 14.5 8.11 1 1.02 2.19
1000 1.51 3.61 1 1.73 3.34
10000 1.02 2.13 1 1.29 2.00

8 read threads, 2 upgradable threads, 2 write threads
seconds
vec_len P Pt RU RW WW
10 510 190 5.31 5.71 8.88
100 581 202 16.1 18.1 36.9
1000 720 307 130 171 336
10000 1200 1670 997 1270 2200
normalized by fastest
vec_len P Pt RU RW WW
10 96.0 35.8 1 1.08 1.67
100 36.1 12.5 1 1.12 2.29
1000 5.54 2.36 1 1.32 2.58
10000 1.20 1.68 1 1.27 2.21

Tables on the left represent the average time in seconds for each configuration. Tables on the right represent the same data, but are normalized by the fastest time in each row.

For all runs it is apparent that for very short protected regions (vec_len small) the overhead of locking dominates the computation. The lock reduced configurations (RU, RW, WW) are often able to acquire their locks with only atomic operations, and no contention. As the protected region grows larger, the lock overhead becomes less important, the likelihood of contention grows, and the locking strategy (read/upgradable/write) becomes paramount.

In the cases where there are 2 or more read threads, both RU and RW significantly outperform WW, even for very short protected regions. As the protected region grows larger, RW fails to keep up with RU and eventually is even outperformed by the portable version of RU, P, even in the case of only 1 read thread, and 1 upgradable thread.

Conclusion

A lock reduced implementation of a read/upgradable/write mutex can be a useful tool for high performance concurrent computing when simultaneous read-only access is common. Furthermore, for situations where a thread may need to atomically promote from read to write access, the upgradable aspect of this design can provide a performance benefit.