I'm building a memory allocator for multi-threaded parallel computation. Here is a list of the problems I've identified and the goals I've defined to solve them.

    1. It is widely recognized that on a shared memory computer, bus contention is a serious bottleneck, and scalable computing will inevitably be distributed. New code with scalability as a major goal should be written with distributed computer architectures in mind. But we have to recognize that shared memory is still the most convenient programming model, and it is much easier to retrofit single-threaded, computation-intensive code for shared memory parallel computing. For this reason, I've adopted the ideas of Cilk, using the function call as the unit of work to be parallelized. However, Cilk is a source-to-source compiler that only accepts the Cilk language, which is based on ANSI C; many programs use modern C99 features that make them hard to port to Cilk. I implemented a lightweight work-stealing parallel computing framework in C++ that an application program can use as a library.
    2. Modern shared memory architecture is actually distributed by nature in hardware. Each CPU comes with private memory called a cache. The caches speak to each other over the bus using a cache coherence protocol to maintain the illusion of a consistent view of the shared main memory, and sharing between caches is a known bottleneck. It is worse when objects that are not supposed to be shared happen to fall on the same cache line as shared objects, a problem called false sharing. Moreover, many parallel programs operate mostly on a private working set, and only exchange work when a computation yields a result. I designed a memory allocator that meticulously segregates the address space into private and shared regions to minimize false sharing while still allowing objects to be shared and unshared quickly.
    3. The memory allocator uses only mmap() as its source of memory, because sbrk() is not thread-safe if you want to return memory to the operating system. We also recognize that munmap() can be a slow call on a multi-processor system: the cost of TLB shootdown involuntarily stalls threads on all processors. I designed a caching layer that coalesces mmap() and munmap() calls to improve performance.
    4. Benchmarks are not an accurate measure of fragmentation, because the average performance of different allocation strategies is actually very close, and many strategies (e.g. best-fit, binary buddy) do not expose their weaknesses except in worst-case scenarios. By Robson's 1976 result, fragmentation is trivially bounded by 3 times the amount of memory in use if the ratio of the largest to the smallest object size is within a factor of 2. I combine this with segregated heaps, made possible by virtual memory, to implement an allocation strategy that bounds worst-case fragmentation to a factor of 3. The object indexing scheme is similar to that of Doug Lea's allocator.
    5. The allocator is designed to be extremely modular, using a combination of mixins and the curiously recurring template pattern (CRTP). Each heap implementation, for example, delegates the specification of the memory range that the heap manages to its derived sub-class. This allows experimentation with heap layouts, e.g. a static heap (start/end addresses are hard-coded), an internal format (the header is embedded in the same memory page as the user objects), and an external format (the header is kept separate from the page, allocated from a metadata allocator). Non-blocking remote free is also an optional add-on feature for many of the heap implementations.
    6. In order to track ownership of objects and prevent memory leaks inside the allocator, I make extensive use of linear pointers, an idea based on auto_ptr<>, to manipulate data structures. Using RAII, the allocator crashes the instant a resource leak happens. Everything is then unit tested to ensure correctness. Much of the work went into implementing a splay tree algorithm that respects the linearity of objects.
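To make item 1 concrete, here is a minimal sketch of the data structure at the heart of a work-stealing scheduler: a per-worker deque where the owner pushes and pops at the bottom and idle thieves steal from the top. This is not the framework's actual code; the class name is illustrative, and a real implementation would use a lock-free Chase-Lev deque rather than a mutex.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Hypothetical sketch: each worker owns one of these. The owner pushes and
// pops tasks at the bottom (LIFO order, good cache locality); idle workers
// steal from the top (FIFO order, so a thief grabs the oldest, typically
// largest, pending unit of work). A production deque would be lock-free.
class WorkStealingDeque {
public:
    using Task = std::function<void()>;

    void push_bottom(Task t) {               // called by the owning worker
        std::lock_guard<std::mutex> g(m_);
        q_.push_back(std::move(t));
    }
    std::optional<Task> pop_bottom() {       // called by the owning worker
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.back());
        q_.pop_back();
        return t;
    }
    std::optional<Task> steal_top() {        // called by a thief
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Task t = std::move(q_.front());
        q_.pop_front();
        return t;
    }

private:
    std::mutex m_;
    std::deque<Task> q_;
};
```

The LIFO/FIFO asymmetry is the point: the owner keeps working on what it just spawned, while a thief takes the task most likely to represent a large untouched subtree.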
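To illustrate the false sharing problem in item 2, here is a minimal sketch (not taken from the allocator) of two per-thread counters: packed together they land on one cache line and every write ping-pongs the line between caches, while aligning each to its own line removes the interference. The 64-byte line size is an assumption for typical x86 hardware.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size; query it in real code

// Bad layout: 'a' and 'b' are logically private to two different threads,
// but they sit on the same cache line, so the coherence protocol bounces
// the line between the two CPUs on every write -- false sharing.
struct BadCounters {
    long a;  // thread 1 writes this
    long b;  // thread 2 writes this, on the same line as 'a'
};

// Fixed layout: each counter is aligned and padded to a full cache line,
// so the two threads never invalidate each other's line.
struct alignas(kCacheLine) PaddedCounter {
    long value;
    char pad[kCacheLine - sizeof(long)];
};

struct GoodCounters {
    PaddedCounter a;  // thread 1's own line
    PaddedCounter b;  // thread 2's own line
};
```

The allocator applies the same principle at the level of whole address-space regions rather than individual fields.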
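The caching layer in item 3 can be sketched roughly as follows: freed spans are pooled and handed back out before falling through to mmap(), and munmap() only happens when the pool overflows, deferring the TLB shootdown. This is a simplified, hypothetical illustration (fixed-size spans, unbounded until a cap); the names and structure are mine, not the allocator's.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <vector>

// Hypothetical sketch of an mmap/munmap caching layer. get_span() serves
// requests from a pool of previously freed spans when possible, avoiding a
// syscall; free_span() caches spans instead of unmapping them, so the
// expensive munmap() -- with its cross-processor TLB shootdown -- only runs
// when the pool is already full.
class SpanCache {
public:
    SpanCache(std::size_t span_bytes, std::size_t max_cached)
        : span_bytes_(span_bytes), max_cached_(max_cached) {}

    void* get_span() {
        if (!pool_.empty()) {                 // reuse a cached span: no syscall
            void* p = pool_.back();
            pool_.pop_back();
            return p;
        }
        void* p = mmap(nullptr, span_bytes_, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? nullptr : p;
    }

    void free_span(void* p) {
        if (pool_.size() < max_cached_)
            pool_.push_back(p);               // cache it; defer the munmap
        else
            munmap(p, span_bytes_);           // pool full: really release it
    }

    std::size_t cached() const { return pool_.size(); }

private:
    std::size_t span_bytes_, max_cached_;
    std::vector<void*> pool_;
};
```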
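The size segregation in item 4 comes down to a simple mapping, sketched below under the assumption of power-of-two classes (the function names are illustrative): every request is rounded up to the next power of two, so within one class the largest and smallest sizes differ by at most a factor of 2, which is exactly the precondition for Robson's factor-of-3 worst-case bound to hold per class.

```cpp
#include <cassert>
#include <cstddef>

// Map a request size to a size class such that all sizes in one class are
// within a factor of 2 of each other. Each class gets its own segregated
// heap (carved out of virtual address space), so Robson's 1976 result
// bounds that heap's worst-case fragmentation to 3x the memory in use.
inline std::size_t size_class(std::size_t n) {
    std::size_t cls = 0, cap = 1;
    while (cap < n) { cap <<= 1; ++cls; }   // round up to next power of two
    return cls;
}

// The allocation size actually served to every object in class 'cls'.
inline std::size_t class_capacity(std::size_t cls) {
    return std::size_t{1} << cls;
}
```

A 600-byte request, for example, lands in the 1024-byte class: at most a factor of 2 of internal slack per object, in exchange for the provable bound on external fragmentation.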
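The mixin/CRTP structure in item 5 can be illustrated with a minimal sketch (class names are mine, not the allocator's): the base heap implements the allocation policy but delegates the managed memory range to the derived class, resolved statically with no virtual-call overhead.

```cpp
#include <cassert>
#include <cstddef>

// CRTP sketch: HeapBase implements a trivial bump allocator but delegates
// the memory range it manages to Derived::begin()/end(). Swapping the
// derived class swaps the heap layout (static range, internal header,
// external header, ...) without touching the allocation logic.
template <class Derived>
class HeapBase {
public:
    void* allocate(std::size_t n) {
        Derived& self = static_cast<Derived&>(*this);
        char* p = self.begin() + used_;
        if (p + n > self.end()) return nullptr;   // range exhausted
        used_ += n;
        return p;
    }
    std::size_t used() const { return used_; }

private:
    std::size_t used_ = 0;
};

// One possible layout: a "static heap" whose range is a hard-coded buffer.
class StaticHeap : public HeapBase<StaticHeap> {
public:
    char* begin() { return buf_; }
    char* end()   { return buf_ + sizeof(buf_); }

private:
    char buf_[256];
};
```

A bump allocator stands in here for the real heap logic; the point is only the delegation pattern, where each layout experiment is a new derived class.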
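Finally, the linear pointer idea in item 6 might look roughly like the sketch below (again hypothetical, not the thesis code): a move-only handle, in the spirit of auto_ptr<>, whose destructor aborts the program if it still owns a resource, so a leak inside the allocator crashes at the exact point it happens instead of going unnoticed.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>

// Hypothetical "linear pointer": ownership must be explicitly consumed
// (moved away or release()d) before destruction. Copying is forbidden
// (linearity: exactly one owner), and destroying or overwriting a
// still-owning pointer calls abort() -- RAII turned into a leak detector.
template <class T>
class LinearPtr {
public:
    explicit LinearPtr(T* p = nullptr) : p_(p) {}
    LinearPtr(LinearPtr&& o) noexcept : p_(o.p_) { o.p_ = nullptr; }
    LinearPtr& operator=(LinearPtr&& o) noexcept {
        if (p_) std::abort();          // overwriting an owned pointer leaks it
        p_ = o.p_;
        o.p_ = nullptr;
        return *this;
    }
    LinearPtr(const LinearPtr&) = delete;             // no aliasing, ever
    LinearPtr& operator=(const LinearPtr&) = delete;
    ~LinearPtr() { if (p_) std::abort(); }            // leak: crash right here

    T* release() { T* p = p_; p_ = nullptr; return p; }
    T* get() const { return p_; }
    explicit operator bool() const { return p_ != nullptr; }

private:
    T* p_;
};
```

Data-structure code written against such a handle (e.g. the splay tree mentioned above) is forced to account for every pointer it touches, which is what makes the unit tests meaningful.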

This has been a work in progress for the past year, and it is the basis of my Ph.D. thesis. I welcome any interest in or discussion of this work.