Can someone please shed some light on how popular languages like Python, Ruby implements hash tables internally for symbol lookup? Do they use the classic “array with linked-list” method, or use a balanced tree?
The classic “array of hash buckets” you mention is used in every implementation I’ve seen.
One of the most educative versions is the hash implementation in the Tcl language, in file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, different hash table types, strategies, etc. Sidenote: the code implementating the Tcl language is reallyreadable.
Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There’s also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under “HV”. If you’re truly adventurous you can dig into hv.c.
The hashing algorithm used to be pretty simple but its probably a lot more complicated now with Unicode. Because the algorithm was predictable there was a DoS attack whereby the attacker generated data which would cause hash collisions. For example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash which then shoved it all into one bucket. The resulting hash was O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result Perl now perturbs the hash function with a bit of random data.
You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.
Lua tables use an utterly ingenious implemenation which for arbitrary keys behaves like ‘array of buckets’, but if you use consecutive integers as keys, it has the same representation and space overhead as an array. In the implementation each table has a hash part and an array part.