Ten years later, things have changed. I honestly can't believe this (but the geek in me is ultra-happy).
As you have noted, there is a chance that String::hashCode is zero for some Strings, and that value was not cached (will get to that). A lot of people asked (including in this Q&A) why a field was not simply added to java.lang.String, something like hashAlreadyComputed, and used for that. The problem is obvious: extra space for every single String instance. There is, by the way, a reason java-9 introduced compact Strings: many benchmarks have shown that this is a rather (over)used class in the majority of applications. Adding more space? The decision was: no. Especially since the smallest possible addition would have been 1 byte, not 1 bit (for 32-bit JVMs, the extra space would have been 8 bytes: 1 for the flag, 7 for alignment).
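To make the "zero is not cached" problem concrete, here is a minimal sketch of a cache where 0 doubles as the "not computed yet" marker, which is roughly how String behaved before jdk-13. LegacyHashCache and its computations counter are hypothetical names made up for illustration, not JDK code; only the 31-based polynomial follows the documented String::hashCode formula.

```java
// Hypothetical toy class; mimics a hash cache where 0 means "not computed yet".
final class LegacyHashCache {
    private final char[] value;
    private int hash;      // 0 is ambiguous: "not computed" or "the hash really is 0"
    int computations;      // counts how often the hash loop actually ran

    LegacyHashCache(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            computations++;
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h;      // storing 0 here caches nothing
        }
        return h;
    }

    public static void main(String[] args) {
        LegacyHashCache regular = new LegacyHashCache("hello");
        LegacyHashCache zero = new LegacyHashCache("\0\0\0"); // all-NUL chars hash to 0
        for (int i = 0; i < 3; i++) {
            regular.hashCode();
            zero.hashCode();
        }
        System.out.println(regular.computations); // 1
        System.out.println(zero.computations);    // 3
    }
}
```

The regular string pays for the loop once; the zero-hash string pays for it on every call, and the cost grows with the length of the string.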
So, Compact Strings came along in java-9, and if you look carefully (or care) they did add a field to java.lang.String: coder. Didn't I just argue against that? It's not that easy. It seems that the importance of compact strings outweighed the "extra space" argument. It is also important to say that the extra space matters for 32-bit VMs only (because there was no gap in alignment there). In contrast, on a 64-bit JVM, the jdk-8 layout of java.lang.String is:
java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                               VALUE
      0    12          (object header)                           N/A
     12     4   char[] String.value                              N/A
     16     4      int String.hash                               N/A
     20     4          (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
Notice an important thing right there:
Space losses: ... 4 bytes total.
Because every Java object is aligned (the exact layout depends on the JVM and on some start-up flags, UseCompressedOops for example), there is an unused gap of 4 bytes in String. So when coder was added, it simply took 1 byte of that gap without making the instance any bigger. As such, after Compact Strings were added, the layout changed to:
java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                               VALUE
      0    12          (object header)                           N/A
     12     4   byte[] String.value                              N/A
     16     4      int String.hash                               N/A
     20     1     byte String.coder                              N/A
     21     3          (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 3 bytes external = 3 bytes total
coder eats 1 byte and the gap shrank to 3 bytes. So the "damage" was already done in jdk-9. For a 32-bit JVM there was an increase of 8 bytes (1 coder + 7 gap), and for a 64-bit JVM there was no increase: coder occupied some space from the gap.
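The padding arithmetic behind these numbers can be sketched in a few lines. The 12-byte header and 4-byte field sizes are taken from the dumps above; the 8-byte alignment is an assumption (the usual default on a 64-bit HotSpot JVM), and LayoutMath, alignUp and instanceSize are hypothetical helper names, not a JVM or JOL API.

```java
// Hypothetical helper; reproduces the padding arithmetic from the layout dumps.
final class LayoutMath {
    static final int HEADER_BYTES = 12; // object header size reported in the dumps
    static final int ALIGNMENT = 8;     // assumed object alignment (must be a power of two)

    // Rounds size up to the next multiple of alignment.
    static int alignUp(int size, int alignment) {
        return (size + alignment - 1) & -alignment;
    }

    // Total instance size for the given number of field bytes.
    static int instanceSize(int fieldBytes) {
        return alignUp(HEADER_BYTES + fieldBytes, ALIGNMENT);
    }

    public static void main(String[] args) {
        int jdk8 = 4 + 4;           // value reference + hash
        int jdk9 = 4 + 4 + 1;       // ... + coder
        int oneMore = 4 + 4 + 1 + 1; // ... + one more 1-byte field
        for (int fields : new int[] {jdk8, jdk9, oneMore}) {
            int used = HEADER_BYTES + fields;
            int size = instanceSize(fields);
            System.out.println(used + " bytes used, " + size + " bytes allocated, gap " + (size - used));
        }
    }
}
```

Under these assumptions all three cases round up to 24 bytes, with gaps of 4, 3 and 2 bytes, so a further 1-byte field would still be free in terms of instance size.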
And now, in jdk-13, they decided to leverage that gap, since it exists anyway. Let me just remind you that the probability of a String having a zero hashCode is roughly 1 in 4 billion; still, there are people who say: so what? Let's fix this! Voilà: the jdk-13 layout of java.lang.String:
java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                               VALUE
      0    12          (object header)                           N/A
     12     4   byte[] String.value                              N/A
     16     4      int String.hash                               N/A
     20     1     byte String.coder                              N/A
     21     1  boolean String.hashIsZero                         N/A
     22     2          (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 2 bytes external = 2 bytes total
And here it is: boolean String.hashIsZero. And here it is in the code base:
public int hashCode() {
    int h = hash;
    if (h == 0 && !hashIsZero) {
        h = isLatin1() ? StringLatin1.hashCode(value)
                       : StringUTF16.hashCode(value);
        if (h == 0) {
            hashIsZero = true;
        } else {
            hash = h;
        }
    }
    return h;
}
Wait! h == 0 and a hashIsZero field? Shouldn't that be named something like hashAlreadyComputed? Why isn't the implementation something along the lines of:
@Override
public int hashCode() {
    if (!hashCodeComputed) {
        // or any other sane computation
        hash = 42;
        hashCodeComputed = true;
    }
    return hash;
}
Even though I had read the comment in the source code:
// The hash or hashIsZero fields are subject to a benign data race,
// making it crucial to ensure that any observable result of the
// calculation in this method stays correct under any possible read of
// these fields. Necessary restrictions to allow this to be correct
// without explicit memory fences or similar concurrency primitives is
// that we can ever only write to one of these two fields for a given
// String instance, and that the computation is idempotent and derived
// from immutable state
It only made sense after I read this. Rather tricky, but the point is that this implementation does only one write per String instance; lots more details are in the discussion above.
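My reading of that restriction, sketched with a toy class: with a separate hashAlreadyComputed flag the method writes two fields (hash, then the flag), and without fences another thread may observe the flag as set while still reading the old hash of 0, so it would return a wrong value. If only one of the two fields is ever written per instance, every combination a racing reader can see either yields the correct hash or leads to a harmless recomputation. RacyHash and mismatches are hypothetical names, and this is a simplified illustration, not the JDK source.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy class; a simplified illustration of the single-write rule, not JDK code.
final class RacyHash {
    private final char[] value;   // immutable state the hash is derived from
    private int hash;             // written only when the computed hash is non-zero
    private boolean hashIsZero;   // written only when the computed hash is zero

    RacyHash(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;                   // read the racy field once into a local
        if (h == 0 && !hashIsZero) {
            for (char c : value) {
                h = 31 * h + c;         // idempotent: same input, same result
            }
            if (h == 0) {
                hashIsZero = true;      // the only write for a zero hash
            } else {
                hash = h;               // the only write for a non-zero hash
            }
        }
        return h;
    }

    // Calls hashCode() from two threads and counts results that differ from expected.
    static int mismatches(RacyHash target, int expected) {
        AtomicInteger wrong = new AtomicInteger();
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                if (target.hashCode() != expected) {
                    wrong.incrementAndGet();
                }
            }
        };
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return wrong.get();
    }

    public static void main(String[] args) {
        System.out.println(mismatches(new RacyHash("hello"), "hello".hashCode()));
        System.out.println(mismatches(new RacyHash("\0\0\0"), 0));
    }
}
```

A stress loop like this cannot prove the absence of a race; it only shows the shape of the idiom. The argument for correctness is the one from the comment: a single write per instance, an idempotent computation, and immutable input.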