What are the dangers when creating a thread with a stack size of 50x the default?

Question

I'm currently working on a very performance-critical program, and one path I decided to explore that may help reduce resource consumption was increasing my worker threads' stack size so I can move most of the data (float[]s) that I'll be accessing onto the stack (using stackalloc).

I've read that the default stack size for a thread is 1 MB, so in order to move all my float[]s I would have to expand the stack by approximately 50 times (to ~50 MB).
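For readers unfamiliar with how a larger stack is requested, here is a minimal sketch. It relies on the `Thread(ThreadStart, int maxStackSize)` constructor overload, where the size is given in bytes. The class name `BigStackDemo`, the method names, and the 64 MB figure (50 MB of data plus headroom for ordinary call frames) are assumptions made for illustration, not values from the question. The code must be compiled with unsafe code enabled.

```csharp
using System;
using System.Threading;

public static class BigStackDemo
{
    // Assumed sizes for illustration: 12.5M floats = 50 MB of data,
    // plus headroom for ordinary call frames.
    const int SampleCount = 12500000;
    const int StackBytes = 64 * 1024 * 1024;

    // Fills a stack-allocated buffer; only safe on a thread whose
    // stack is large enough to hold it.
    public static unsafe float FillOnStack()
    {
        float* samples = stackalloc float[SampleCount];
        for (int i = 0; i < SampleCount; i++)
        {
            samples[i] = 32768f;
        }
        return samples[SampleCount - 1];
    }

    // Runs FillOnStack on a worker thread created with an enlarged stack.
    public static float RunOnBigStackThread()
    {
        float result = 0f;
        var worker = new Thread(() => result = FillOnStack(), StackBytes);
        worker.Start();
        worker.Join();
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(RunOnBigStackThread());
    }
}
```

Calling `FillOnStack` directly on a thread with the default stack size would be expected to overflow the stack, which is why the sketch only ever invokes it from the enlarged worker thread.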

I understand this is generally considered "unsafe" and isn't recommended, but after benchmarking my current code against this method, I've discovered a 530% increase in processing speed! So I cannot simply pass by this option without further investigation, which leads me to my question: what are the dangers associated with increasing the stack to such a large size (what could go wrong), and what precautions should I take to minimise such dangers?

My test code:

public static unsafe void TestMethod1()
{
    float* samples = stackalloc float[12500000];

    for (var ii = 0; ii < 12500000; ii++)
    {
        samples[ii] = 32768;
    }
}

public static void TestMethod2()
{
    var samples = new float[12500000];

    for (var i = 0; i < 12500000; i++)
    {
        samples[i] = 32768;
    }
}

+1. Seriously. You ask what LOOKS like an idiotic question out of the norm, and then you make a VERY good case that in your particular scenario it is a sensible thing to consider, because you did your homework and measured the outcome. This is VERY good - I miss that with many questions. Very nice - good that you consider something like this; sadly, many C# programmers are not aware of those optimization opportunities. Yes, often not needed - but sometimes it is critical and makes a huge difference. — TomTom
I'm interested to see the two codes that have 530% difference in processing speed, solely on account of moving array to stack. That just does not feel right. — Dialecticus
Before you leap down that road: have you tried using Marshal.AllocHGlobal (don't forget to FreeHGlobal too) to allocate the data outside of managed memory? Then cast the pointer to a float*, and you should be sorted. — Marc Gravell♦
It does feel right if you do a lot of allocations. Stackalloc bypasses all the GC issues, which can also create a very strong locality at the processor level. This is one of the things that look like micro-optimizations - unless you write a high-performance mathematical program, are seeing exactly this behavior, and it makes a difference ;) — TomTom
My suspicion: one of these methods triggers bounds-checking on every loop iteration while the other one does not, or it is optimized away. — pjc50
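The `Marshal.AllocHGlobal` suggestion from the comments above can be sketched roughly as follows. The class and method names are hypothetical; only `Marshal.AllocHGlobal`, `Marshal.FreeHGlobal`, and `IntPtr.ToPointer` are real framework calls. The code must be compiled with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;

public static class UnmanagedBufferDemo
{
    // Allocates 'count' floats outside the managed heap, fills them,
    // and always releases the block, even if the loop throws.
    public static unsafe float FillUnmanaged(int count)
    {
        // AllocHGlobal takes a size in bytes.
        IntPtr block = Marshal.AllocHGlobal(new IntPtr((long)count * sizeof(float)));
        try
        {
            float* samples = (float*)block.ToPointer();
            for (int i = 0; i < count; i++)
            {
                samples[i] = 32768f;
            }
            return samples[count - 1];
        }
        finally
        {
            // The GC does not track this memory, so it must be freed manually.
            Marshal.FreeHGlobal(block);
        }
    }

    public static void Main()
    {
        Console.WriteLine(FillUnmanaged(12500000));
    }
}
```

This keeps the data away from the garbage collector without requiring a non-default thread stack size.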

Vercas · Accepted Answer · 2014-06-13T14:50:04

Upon comparing test code with Sam, I determined that we are both right!
However, about different things:

Accessing memory (reading and writing) is just as fast wherever it is - stack, global or heap.
Allocating it, however, is fastest on stack and slowest on heap.

It goes like this: stack < global < heap. (allocation time)
Technically, stack allocation isn't really an allocation, the runtime just makes sure a part of the stack (frame?) is reserved for the array.

I strongly advise being careful with this, though.
I recommend the following:

1. When you frequently need to create arrays that never leave the function (e.g. by having a reference passed out of it), using the stack will be an enormous improvement.
2. If you can recycle an array, do so whenever you can! The heap is the best place for long-term object storage. (Polluting global memory isn't nice; stack frames can disappear.)

(Note: 1. only applies to value types; reference types will be allocated on the heap and the benefit will be reduced to 0)
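The two recommendations above might look like this in code. This is a sketch under assumptions: the class name, the method names, and the 4096-element capacity are all hypothetical, and the shared array is not thread-safe as written. The code must be compiled with unsafe code enabled.

```csharp
using System;

public static class BufferChoiceDemo
{
    const int Capacity = 4096; // assumed upper bound for this sketch

    // Recycling: one long-lived heap array, reused across calls
    // instead of being reallocated. Not thread-safe as written.
    static readonly float[] Reused = new float[Capacity];

    public static float SumWithReusedArray(int count)
    {
        CheckCount(count);
        for (int i = 0; i < count; i++) Reused[i] = i;
        float sum = 0f;
        for (int i = 0; i < count; i++) sum += Reused[i];
        return sum;
    }

    // Stack allocation: a scratch buffer of value types that never
    // leaves the method, so it can live in the stack frame.
    public static unsafe float SumWithStackBuffer(int count)
    {
        CheckCount(count);
        float* scratch = stackalloc float[Capacity];
        for (int i = 0; i < count; i++) scratch[i] = i;
        float sum = 0f;
        for (int i = 0; i < count; i++) sum += scratch[i];
        return sum;
    }

    static void CheckCount(int count)
    {
        if (count < 0 || count > Capacity)
            throw new ArgumentOutOfRangeException("count");
    }

    public static void Main()
    {
        Console.WriteLine(SumWithReusedArray(100));
        Console.WriteLine(SumWithStackBuffer(100));
    }
}
```

Neither method allocates on the managed heap per call: the first reuses an existing array, the second only reserves space in its own stack frame.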

To answer the question itself: I have not encountered any problem at all with any large-stack test.
I believe the only possible problems are a stack overflow, if you are not careful with your function calls, and running out of memory when creating your thread(s) if the system is running low.

The section below is my initial answer. It is wrong-ish and the tests aren't correct. It is kept only for reference.

My test indicates that stack-allocated and globally allocated memory are at least 15% slower than (take about 120% of the time of) heap-allocated memory when used for arrays!

This is my test code, and this is a sample output:

Stack-allocated array time: 00:00:00.2224429
Globally-allocated array time: 00:00:00.2206767
Heap-allocated array time: 00:00:00.1842670
------------------------------------------
Fastest: Heap.

  |    S    |    G    |    H    |
--+---------+---------+---------+
S |    -    | 100.80 %| 120.72 %|
--+---------+---------+---------+
G |  99.21 %|    -    | 119.76 %|
--+---------+---------+---------+
H |  82.84 %|  83.50 %|    -    |
--+---------+---------+---------+
Rates are calculated by dividing the row's value by the column's.
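To make the table arithmetic concrete, here is a small sketch that recomputes one cell from the sample timings printed above. The class and method names are hypothetical; the timings are the ones from the sample output, expressed in seconds.

```csharp
using System;

public static class RateTableDemo
{
    // Row time divided by column time, expressed as a percentage.
    public static double RatePercent(double rowSeconds, double columnSeconds)
    {
        return 100.0 * rowSeconds / columnSeconds;
    }

    public static void Main()
    {
        double stack = 0.2224429; // stack-allocated array time
        double heap = 0.1842670;  // heap-allocated array time

        // Row S, column H: how long the stack run took relative to the heap run.
        Console.WriteLine(RatePercent(stack, heap).ToString("F2") + " %");
    }
}
```

A value above 100 % therefore means the row's method took longer than the column's.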

I tested on Windows 8.1 Pro (with Update 1), using an i7 4700 MQ, under .NET 4.5.1
I tested both with x86 and x64 and the results are identical.

Edit: I increased the stack size of all threads to 201 MB, increased the sample size to 50 million, and decreased iterations to 5.
The results are the same as above:

Stack-allocated array time: 00:00:00.4504903
Globally-allocated array time: 00:00:00.4020328
Heap-allocated array time: 00:00:00.3439016
------------------------------------------
Fastest: Heap.

  |    S    |    G    |    H    |
--+---------+---------+---------+
S |    -    | 112.05 %| 130.99 %|
--+---------+---------+---------+
G |  89.24 %|    -    | 116.90 %|
--+---------+---------+---------+
H |  76.34 %|  85.54 %|    -    |
--+---------+---------+---------+
Rates are calculated by dividing the row's value by the column's.

Though, it seems the stack is actually getting slower.
