C# volatile demystified - my take

According to MSDN (http://msdn.microsoft.com/en-us/library/6kac2kdh.aspx),
a thread is the basic operating system entity used to execute a task defined by a set of instructions. A thread therefore has a context, which includes all the information it needs to resume execution, such as a set of CPU registers and a stack of local variables, all in the address space of the thread’s host process.

Consider the following class:


public class EncryptionService
{
    public static readonly EncryptionService Instance = new EncryptionService();
    private EncryptionService() { }

    private int _counter = 0;

    public int Encrypt(string input)
    {
        // run some encryption algorithm
        return ++_counter;
    }
}

This class is a singleton: it has a private constructor, so the only way to use it is through the single Instance field. Since there is only one instance of EncryptionService, there is a chance that multiple threads could be running Encrypt at the same time and updating the instance field _counter. In other words, the class is not thread safe.

Now, recall the thread context we talked about. When a thread is created, the CLR allocates it a 1 MB stack, at least on x86 CPU architectures.

The stack space is used for passing arguments to a method and for holding the local variables defined in the method. Before executing a method, the CLR runs prologue code [REF CLR via C#] that initializes the method, such as setting up its return address and allocating memory for its local variables on the thread’s stack, which is part of the thread’s execution context. As the thread executes more methods, its stack fills up with their local variables, until each method’s epilogue code runs and frees the stack space of the variables that have gone out of scope.

While the unit of execution is a thread, the physical entity that executes the code, once it has been translated to machine instructions, is the CPU. Naturally, this means that variables on the thread’s stack end up being cached by the CPU. While a thread is executing code on a specific CPU, its variables are therefore cached in that particular CPU’s cache.

So what exactly happens when multiple threads execute a method on a single instance of our EncryptionService? We first need to understand how the above C# program is converted into the machine instructions that the CPU actually executes.

The Microsoft C# compiler generates Microsoft Intermediate Language (MSIL), which is what is stored in the assembly. When the CLR loads this assembly and first attempts to invoke the Encrypt method, it Just-in-Time (JIT) compiles the MSIL into CPU-specific machine instructions and stores them at a location in memory, at which point execution transfers to the memory location where the compiled Encrypt method is saved.

When the executing thread invokes Encrypt, it encounters the field _counter, stored at a certain memory location. This value has to be loaded from its memory location into a CPU register. If this memory location was not already in any of the CPU’s caches, a cache miss occurs: this data, along with the surrounding bytes as defined by the cache line size, is retrieved from memory, saved in the CPU cache, and then loaded into the register. The CPU then updates the register value, as part of the add instruction, and stores the changed value back in its cache. For this new value to become visible to other processors, and the threads running on them, it needs to be written back to main memory. When exactly that happens is defined by what is known as the CPU’s write policy.

If you have two threads doing the same thing, it can happen that the value of _counter in each CPU’s cache is not consistent, since each CPU was working on its own local copy of the _counter variable in its registers and cache. Depending on your use case, this may not be a desirable outcome.
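A minimal sketch of this lost-update effect. The class name, thread count, and iteration count here are illustrative, not from the original:

```csharp
using System;
using System.Threading;

class CounterRace
{
    private static int _counter;

    // Runs threadCount threads that each increment the shared field
    // `iterations` times, then returns the final counter value.
    public static int RunRace(int threadCount, int iterations)
    {
        _counter = 0;
        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                    _counter++; // not atomic: load, add, and store can interleave
            });
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
        return _counter;
    }

    static void Main()
    {
        // Would be 400000 if increments never interleaved; typically prints less.
        Console.WriteLine(RunRace(4, 100_000));
    }
}
```

Because each `_counter++` is a separate load, add, and store, two threads can load the same old value and each store back old + 1, losing one of the increments.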

More on Static Variables

When you decorate a variable as static, interesting things start to happen.

A static field is defined as part of the type definition. What this effectively means is that a static field is stored in the header block of the type definition object, not in any particular instance.

This means that the bytes backing a static field are part of the type’s header block. And since there is just one type header block, there is only one copy of each static field in the entire application domain.
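A quick way to see this one-copy-per-AppDomain behavior. The class and field names here are made up for illustration:

```csharp
using System;

class StaticDemo
{
    public static int Shared;   // one storage location for the whole AppDomain
    public int PerInstance;     // one storage location per object

    static void Main()
    {
        var a = new StaticDemo();
        var b = new StaticDemo();

        a.PerInstance = 1;       // affects only a
        StaticDemo.Shared = 42;  // visible through the type, from anywhere

        Console.WriteLine(b.PerInstance);     // prints 0: b has its own copy
        Console.WriteLine(StaticDemo.Shared); // prints 42: everyone shares it
    }
}
```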

Since a static field is backed by memory defined as part of the type’s header block, which sits at a fixed location on the managed heap, several threads can potentially overwrite each other’s work when they update it.

This is a similar effect to having a single instance object being operated upon by multiple threads without any synchronization mechanism.

On Chip Cache Memory
As noted before, a static member lives at a fixed location in the type’s header block in memory. This means that instructions executing on the CPU need to fetch this static variable from memory in order to read from it or write to it.

To improve performance, today’s CPUs have cache memory: small, fast memory banks placed on or very close to the CPU chip. These are called level caches, based on how close they are to the CPU: there is an L1 cache, an L2 cache, and an L3 cache, with the L1 cache resident on the CPU core itself (“L” standing for level). This means that instead of accessing RAM via the memory bus, the CPU can often retrieve the value of a variable straight from its cache.

The first time a thread needs to read a value from memory, such as our static variable, the CPU will read this value along with various surrounding bytes, together known as a cache line, and store this cache line in its on-chip cache. You can imagine an on-chip cache as simply a dictionary of key/value pairs of cache lines, keyed on memory address or address range. These address keys are known as tags, and each value of the key-value pair is a cache entry comprising the actual data plus some flags, including one indicating whether the line has been modified since it was last read from memory.
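The address-to-line mapping can be sketched with simple arithmetic. A 64-byte cache line is assumed here (typical for x86, but an assumption, not something stated above):

```csharp
using System;

class CacheLineSketch
{
    const ulong LineSize = 64; // bytes per cache line -- an assumption, typical of x86

    // Splits an address into the line it belongs to (the tag side of the
    // cache lookup) and the byte offset within that line.
    public static (ulong line, ulong offset) Split(ulong address)
    {
        ulong offset = address % LineSize;  // low 6 bits select a byte within the line
        ulong line = address - offset;      // remaining bits identify the line
        return (line, offset);
    }

    static void Main()
    {
        var (line, offset) = Split(0x7FFE_1234_5678);
        Console.WriteLine($"line 0x{line:X}, offset {offset}");

        // Two addresses on the same 64-byte line map to the same cache entry:
        Console.WriteLine(Split(0x1000).line == Split(0x103F).line); // True
        Console.WriteLine(Split(0x1000).line == Split(0x1040).line); // False
    }
}
```

This is why reading one field drags its neighbors into the cache: everything sharing the line comes along for the ride.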

 

When the value needs to be updated, the CPU updates its on-chip cache instead of writing directly to the static field in memory. With a write-back policy, the CPU then flushes modified cache lines out to memory (RAM) at some later point.

Static Variables Concurrency
With all of this in mind, let us examine what happens to our static variable within a multi-threaded, multi-core execution context.


public class EncryptionService
{
    public static readonly EncryptionService Instance = new EncryptionService();
    private EncryptionService() { }

    private int _counter = 0;

    public int Encrypt(string input)
    {
        // run some encryption algorithm
        return ++_counter;
    }

    public int Decrypt(string input)
    {
        // run some decryption algorithm
        return ++_counter;
    }
}

Say thread A, running on CPU A, is about to execute Encrypt, so it reads the bytes in memory for the _counter variable. This variable is now in CPU A’s cache.

Now a second thread, thread B running on CPU B, attempts to execute Encrypt and copies the value of _counter from memory just before CPU A has had a chance to flush its update out to memory (recall this is a singleton, so both threads share the same field). The content of CPU B’s cache is now different from what CPU A holds for the value of _counter.

This is a classic cache concurrency problem, and it happens because CPU flushes occur at an unpredictable time in the future. Some CPUs do provide the ability to control this, such as the IA64 with its volatile reads and writes carrying acquire and release semantics.

Volatile Read/Write
Recall that a cache concurrency problem happens when one CPU updates a value in its on-chip cache and, before it has flushed this change to memory, another CPU reads the previous value from memory. The newly flushed value is then not reflected in the second CPU’s cache.

A volatile write means that the CPU writes a value from a register out to its cache and then flushes the content of the CPU cache to memory. Interesting choice of word, volatile.

A volatile read means that the CPU reads a value from memory and invalidates its cache. Invalidating the cache means that the cached value is no longer considered current, so the next subsequent read of this value will come from memory. Hence the word volatile, which means liable to change rapidly or unpredictably: the fact that these reads and writes cause rapid flushing and invalidation of the CPU cache, which is the opposite of what caches are meant to do, is why they are referred to as volatile operations.

There is also the memory fence, which both flushes the cache content to memory and then invalidates the cache.
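In current .NET these three operations are exposed as Volatile.Read, Volatile.Write, and Thread.MemoryBarrier (Thread.VolatileRead/VolatileWrite are the older equivalents used below). A sketch of each, with the class and member names made up for illustration:

```csharp
using System.Threading;

class MemoryPrimitives
{
    private int _counter;

    public int ReadCounter()
    {
        // Volatile (acquire) read: the value is freshly loaded, and no
        // later load or store may be reordered before it.
        return Volatile.Read(ref _counter);
    }

    public void WriteCounter(int value)
    {
        // Volatile (release) write: the value is pushed out, and no
        // earlier load or store may be reordered after it.
        Volatile.Write(ref _counter, value);
    }

    public void Fence()
    {
        // Full memory fence: no load or store may cross it in either direction.
        Thread.MemoryBarrier();
    }
}
```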

How does this prevent cache concurrency?


public class EncryptionService
{
    public static readonly EncryptionService Instance = new EncryptionService();
    private EncryptionService() { }

    private int _counter = 0;

    public int Encrypt(string input)
    {
        // run some encryption algorithm
        var current = Thread.VolatileRead(ref _counter); //1
        Thread.VolatileWrite(ref _counter, current + 1); //2
        return _counter;
    }

    public int Decrypt(string input)
    {
        // run some decryption algorithm
        var current = _counter;
        _counter = current + 1;
        return _counter;
    }
}

With these changes, this is what happens now.

When thread A on CPU A executes line 1, it performs a volatile read, meaning that it reads the current value of _counter from memory and invalidates its cache, so that the next subsequent read of this value comes from memory. When thread B then modifies that same memory location, it does so via the volatile write on line 2, so CPU B will also flush the cache line containing this variable out to memory.

In C#, this is what the Thread.VolatileRead and Thread.VolatileWrite methods do.

C# volatile To The Rescue

All of the above is nice but cumbersome to use, so the C# team provided the volatile keyword, which simply wraps every read of and write to the variable in a volatile read or volatile write.


public class EncryptionService
{
    public static readonly EncryptionService Instance = new EncryptionService();
    private EncryptionService() { }

    private volatile int _counter = 0;

    public int Encrypt(string input)
    {
        var current = _counter;  //1
        _counter = current + 1;  //2
        return _counter;
    }

    public int Decrypt(string input)
    {
        // run some decryption algorithm
        var current = _counter;
        _counter = current + 1;
        return _counter;
    }
}

Effectively, this means that the CPU does not keep working off a stale cached copy of this variable: reads and writes go through to memory. So while this addresses cache concurrency, there is still a problem: multiple threads can attempt the read-increment-write sequence on this memory location at the same time, and volatile does nothing to make that sequence atomic.

To prevent this you have to use C# thread synchronization constructs, such as lock or the Interlocked family of methods.

Note that this cache problem, while not the same as two threads writing to or reading from the same memory location at once, manifests the same way as an application problem. It is a problem that results from running on a multi-core machine.
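A sketch of such a fix using Interlocked.Increment, which performs the whole read-modify-write as one atomic operation with a full fence, addressing both the atomicity and the visibility problems at once:

```csharp
using System.Threading;

public class EncryptionService
{
    public static readonly EncryptionService Instance = new EncryptionService();
    private EncryptionService() { }

    private int _counter = 0;

    public int Encrypt(string input)
    {
        // run some encryption algorithm
        // The increment happens atomically, so concurrent callers cannot
        // lose updates, and each caller gets back the value it produced
        // (not whatever some other thread wrote in the meantime).
        return Interlocked.Increment(ref _counter);
    }
}
```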

 

References:

http://computer.howstuffworks.com/computer-memory3.htm

http://arstechnica.com/gadgets/2002/07/caching/2/

http://en.wikipedia.org/wiki/CPU_cache

 
