Is a significant part of git's typical profile spent computing hashes? I'm genuinely asking because I don't know the answer. I'd expect all the diffing and (potentially fuzzy) merging to be significantly more expensive operations, at least as far as big-O is concerned.
> Is a significant part of git's typical profile spent computing hashes?
No.
Hashes are really cheap.
This annoys me a bit, because every discussion about hashing turns into endless bikeshedding over which hash function to use. The simple truth is: SHA2, SHA3, and Blake2/3 are all good enough, from both a security and a performance perspective, for almost any use case, and their advantages and disadvantages are so minor that it really doesn't matter.
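As a rough illustration that hashing is cheap, here is a sketch using Python's hashlib to measure single-core throughput for several of the hashes mentioned. The numbers depend heavily on the machine and on how the interpreter's OpenSSL was built, so treat this as a way to check your own hardware, not as a benchmark:

```python
import hashlib
import time

data = b"\x00" * (16 * 1024 * 1024)  # 16 MiB of input

for name in ("sha256", "sha512", "sha3_256", "blake2b"):
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(data)
    h.digest()
    elapsed = time.perf_counter() - start
    # Throughput in MB/s; on typical modern x86 this is hundreds of
    # MB/s or more for every one of these functions.
    print(f"{name}: {len(data) / elapsed / 1e6:.0f} MB/s")
```

On hardware with SHA-NI, SHA-256 in particular will come out far faster than these portable-software numbers suggest.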
Length extension is an unnecessary problem in Merkle–Damgård constructions, and it makes sense to get rid of it. So if you are building a new thing today, there's some sense in not picking SHA-256, so that you won't later hit your head on a length extension attack. SHA-512/256 (that's a single hash in the SHA2 family, not a choice between two) is a reasonable choice, though. And of course, if Git were somehow vulnerable to length extension, it would have been in trouble years ago, so for Git, why not SHA-256.
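The classic place length extension bites is a naive MAC built as `sha256(key || message)`: an attacker who sees that digest can compute a valid tag for `message || padding || extra` without knowing the key. HMAC avoids this by wrapping the hash twice; SHA-512/256 avoids it by truncating the internal state. A minimal sketch of the broken pattern next to the correct one, using the stdlib (the key and message are made up for illustration):

```python
import hashlib
import hmac

key = b"secret-key"
msg = b"amount=100&to=alice"

# Broken MAC: vulnerable to length extension, because sha256(key || msg)
# exposes the full internal state of the hash after processing the key
# and message, letting an attacker keep appending data.
naive_tag = hashlib.sha256(key + msg).hexdigest()

# HMAC wraps the hash in an inner and an outer invocation, so the
# attacker never sees an extendable intermediate state.
safe_tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
```

Git only hashes object contents for identity, not keyed authentication, which is why length extension is a non-issue for its use case.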
The length extension attack is a non-issue for Git’s use case, and SHA-256 (unlike SHA-512) benefits from having hardware acceleration in the new Ice Lake Intel chips (as well as on the AMD side of things), and has been around 11 years longer than SHA-512/256. And, yes, there are places which say “If you will use a hash, you will use SHA-256”.
Personally, the last time I was in a place where I had to choose which cryptography to use, I used SHA3’s direct predecessor, RadioGatún, because I needed a combined hash + stream cipher and, at the time (late 2007), RadioGatún was the only option.
RadioGatún also benefits from being about as fast as BLAKE2 (it would be faster in hardware, FWIW, having SHA3’s hardware advantages), and is approaching 14 years old without being broken by cryptanalysis. Also, unlike BLAKE2/3, and like SHA3 and all sponge functions, it’s computationally expensive to “fast forward” in RadioGatún’s XOF (stream cipher, if you will) mode, which is beneficial for things like password hashing. Another nice thing about RadioGatún: it doesn’t have any magic constants in its specification, allowing a useful implementation to fit on my coffee mug.
If someone asked me which hash algorithm to use, I would suggest SHA-256, unless I thought they needed protection from length extension attacks (then SHA-512/256), or needed an XOF (stream cipher-like) construction (then SHAKE256).
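For anyone unfamiliar with what an XOF gives you over a fixed-width hash: you choose the output length when you squeeze the digest, and shorter outputs are prefixes of longer ones from the same input. A small sketch with the stdlib's SHAKE256:

```python
import hashlib

# SHAKE256 is an extendable-output function (XOF): the output length is
# a parameter of the digest call, not fixed by the algorithm.
xof = hashlib.shake_256(b"some input")
out_short = xof.hexdigest(16)  # squeeze 16 bytes
out_long = xof.hexdigest(64)   # squeeze 64 bytes from the same input

# A defining XOF property: the shorter output is a prefix of the longer.
assert out_long.startswith(out_short)
```

This prefix property is also why an XOF doubles as a stream cipher-like keystream generator: you can keep squeezing as many bytes as you need.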
If performance mattered more than a conservative security margin, BLAKE3 (software performance) or KangarooTwelve (a SHA3 variant with excellent hardware performance) would be good choices. If I were to choose a hash + XOF for use today, I would use KangarooTwelve’s variant with a slightly larger security margin: MarsupilamiFourteen.
Cryptographically strong random numbers in MaraDNS 2.0. The hash nature of RadioGatún allows me to combine multiple entropy sources with varying amounts of randomness together to seed it, then use it as a stream cipher to generate good random numbers. This way, the DNS query ID and source port are hard to guess, making blind DNS spoofing harder.
The nice thing about RadioGatún is that it only takes about 2k of compiled code (and can fit in under 600 bytes of source code, as seen in the parent) to pull all this off.
This was the best way to pull it off back in 2007, when RadioGatún was the only secure Extendable-Output Function (XOF) that existed.
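The entropy-mixing scheme described above can be sketched with a stdlib XOF standing in for RadioGatún (which is not in the standard library); the specific entropy sources and field widths here are illustrative, not MaraDNS's actual implementation:

```python
import hashlib
import os
import struct
import time

# Absorb several entropy sources of varying quality into one XOF state.
# SHAKE256 stands in for RadioGatún's hash + stream cipher role here.
pool = hashlib.shake_256()
pool.update(os.urandom(32))                  # strong: OS entropy
pool.update(struct.pack(">d", time.time()))  # weak: current time
pool.update(struct.pack(">I", os.getpid()))  # weak: process id

# Squeeze keystream bytes from the XOF, stream cipher-style.
stream = pool.digest(4)

# Hypothetical use: a hard-to-guess DNS query ID and source port.
query_id = int.from_bytes(stream[0:2], "big")
src_port = 1024 + int.from_bytes(stream[2:4], "big") % (65536 - 1024)
```

The point of hashing the sources together is that the output is unpredictable as long as at least one input was, even if the others are guessable.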
Linux also has an ethos of choosing boring technology. SHA2 has been around for a long time and is battle tested. For the majority of us, it is the natural choice. I'm not implying anything negative about SHA3/Blake/Keccak.
The decision was made before the release of Blake3. The article did mention that the algorithm is no longer hardcoded (hence the ability to support both SHA1 & SHA256). This means it's possible to transition to Blake3 (or any other algorithm) in the future, though it won't be trivial.
Of course, processors that use one of the Atom/Celeron/Pentium microarchitectures are not the best choice if you desire maximum speed, but otherwise they are surprisingly interesting processors (IMHO much more interesting than what Intel delivers with the Core series).
At this time, Intel often experiments with or introduces features that are particularly interesting for embedded usages first on the Atom. For example, the already mentioned SHA-NI. Another example is the MOVBE instruction (insanely useful if you handle big-endian data, for example in network packets; I am aware that older x86 processors have the BSWAP instruction), which was first introduced with Atom.
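For context on why byte-order instructions matter: network protocols put multi-byte fields in big-endian order, while x86 is little-endian, so every field load implies a byte swap, which is exactly the work MOVBE (load-plus-swap in one instruction) or BSWAP does in hardware. The same operation in high-level code, using a made-up two-field header for illustration:

```python
import struct

# A hypothetical packet header: a 16-bit message type followed by a
# 32-bit payload word, both in big-endian (network) byte order.
packet = bytes([0x12, 0x34, 0xDE, 0xAD, 0xBE, 0xEF])

# ">" selects big-endian; H = unsigned 16-bit, I = unsigned 32-bit.
# On a little-endian CPU each unpack implies a byte swap under the hood.
msg_type, payload = struct.unpack(">HI", packet)

assert msg_type == 0x1234
assert payload == 0xDEADBEEF
```

A compiler targeting a MOVBE-capable CPU can fold the equivalent C load-and-swap into that single instruction.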
There are organisations that can only use approved crypto for various certifications and government contracts. It would be bad to drive such users away from git.