The majority of mdcrack was written prior to 2001. It isn't "hand crafted" assembly, and even if it were, a lot has changed since it was written. The "sse" version of mdcrack is merely compiled with SSE optimizations.
Ah, I'm wrong! I always figured it was SSE2 assembly, as it was touted to be so much better than the competition. However, I do have a working example: the SHA1 and MD5 implementations in john the ripper. The original MD5 is just plain x86, and not too fast. For 32-bit builds I provided an SSE implementation years ago, with the reverse trick (for MD5, obviously). I admit it could probably be made much faster, but not without breaking the very convenient macros I used, and thus ending up with a nightmare of hand-adjusting everything.
Or you could just use the intrinsics code I provided years later, after I looked at that barswf program in IDA and realized why it was fast. It's much easier to write than the assembly and doesn't include the reverse trick, but it is probably more than twice as fast (I don't have hard numbers, but you are free to download it and check). The difference is even more dramatic in 64-bit builds.
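To give an idea of what the intrinsics approach looks like, here is a minimal sketch of the first MD5 round-1 step computed for four candidates at once with SSE2. This is not the code from john the ripper; the function names (`rotl32x4`, `md5_f_x4`, `md5_step1_x4`, `md5_step1_x4_matches_scalar`) and the sample message words are made up for illustration. Only the step formula, the initial state, the constant 0xd76aa478 and the shift of 7 come from the MD5 definition.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate each of the four 32-bit lanes left by s bits (0 < s < 32). */
static inline __m128i rotl32x4(__m128i x, int s)
{
    return _mm_or_si128(_mm_sll_epi32(x, _mm_cvtsi32_si128(s)),
                        _mm_srl_epi32(x, _mm_cvtsi32_si128(32 - s)));
}

/* MD5 round-1 function F(b,c,d) = (b & c) | (~b & d), on four lanes. */
static inline __m128i md5_f_x4(__m128i b, __m128i c, __m128i d)
{
    /* _mm_andnot_si128(b, d) computes (~b) & d */
    return _mm_or_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
}

/* One round-1 step on four lanes: returns b + rotl(a + F(b,c,d) + m + k, s). */
static inline __m128i md5_step1_x4(__m128i a, __m128i b, __m128i c, __m128i d,
                                   __m128i m, uint32_t k, int s)
{
    __m128i t = _mm_add_epi32(a, md5_f_x4(b, c, d));
    t = _mm_add_epi32(t, m);
    t = _mm_add_epi32(t, _mm_set1_epi32((int)k));
    return _mm_add_epi32(b, rotl32x4(t, s));
}

/* Plain scalar version of the same step, used as a reference. */
static uint32_t md5_step1_scalar(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                                 uint32_t m, uint32_t k, int s)
{
    uint32_t t = a + ((b & c) | (~b & d)) + m + k;
    return b + ((t << s) | (t >> (32 - s)));
}

/* Returns 1 if the four-lane step agrees with the scalar step on every lane. */
int md5_step1_x4_matches_scalar(void)
{
    /* MD5 initial state, identical in all four lanes. */
    const uint32_t a0 = 0x67452301u, b0 = 0xefcdab89u;
    const uint32_t c0 = 0x98badcfeu, d0 = 0x10325476u;
    /* Hypothetical first message word of four different candidates. */
    const uint32_t m[4] = { 0x61616161u, 0x62626262u, 0x63636363u, 0x64646464u };
    const uint32_t k = 0xd76aa478u;  /* constant of the first MD5 step */
    const int s = 7;                 /* shift of the first MD5 step */
    uint32_t out[4];
    int i;

    __m128i r = md5_step1_x4(_mm_set1_epi32((int)a0), _mm_set1_epi32((int)b0),
                             _mm_set1_epi32((int)c0), _mm_set1_epi32((int)d0),
                             _mm_loadu_si128((const __m128i *)m), k, s);
    _mm_storeu_si128((__m128i *)out, r);

    for (i = 0; i < 4; i++)
        if (out[i] != md5_step1_scalar(a0, b0, c0, d0, m[i], k, s))
            return 0;
    return 1;
}
```

The point is that the source stays close to the scalar C version, and register allocation and instruction scheduling are left to the compiler, which is where I'd expect most of the gain over my old macro-based assembly to come from.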
I agree I'm far from being a good developer, but I still believe that:
* in the worst case, with applications that are not designed to break ICC, it is on par with hand crafted assembly
* on certain applications (like these hash functions) it will give you a much better result than most hand crafted assembly
* it will always be faster to write something fast with it
* it will be more portable (yes, "more portable", not portable)