In ##java, a question came up about generating unique identifiers based on Strings, with someone suggesting that hashCode() would generate generally usable numbers, without any guarantee of uniqueness. However, duplicate hash codes can be generated from very small strings, with ordinary character sets – see this Stack Overflow answer – and therefore I thought it’d be interesting to find other short strings with the same hash values.
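To make the risk concrete, here is a minimal sketch of what can go wrong when a hash code is used as an identifier. The class HashAsId and its registerAll method are hypothetical names made up for this illustration; the pair "Aa" and "BB" is a well-known example of two short strings with the same hash code.

```java
import java.util.HashMap;
import java.util.Map;

public class HashAsId {
    // Hypothetical registry that (unwisely) uses hashCode() as the identifier.
    static Map<Integer, String> registerAll(String... names) {
        Map<Integer, String> byId = new HashMap<>();
        for (String name : names) {
            // A later name silently replaces an earlier one with the same hash code.
            byId.put(name.hashCode(), name);
        }
        return byId;
    }

    public static void main(String[] args) {
        Map<Integer, String> byId = registerAll("Aa", "BB");
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(byId.size());                        // 1, not 2
        System.out.println(byId);                               // {2112=BB}
    }
}
```

Two different names end up under a single identifier, and the first one is lost without any error.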
The discussion turned to how prone to collisions String’s hashCode() method is, especially for small strings. It is tempting to assume that a hash code returned as an int – and thus with roughly 4.3 billion (2^32) possible values – will be unique for small and simple strings, but that assumption does not hold.
Here’s a simple class to demonstrate this:
package org.javachannel.collisions;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

/**
 * Simple example class to show amount of hashCode collisions in Strings.
 *
 * TODO: Make the number of characters and the character set to be more
 * configurable
 *
 * @author Michael Stummvoll
 */
public class HashCodeCollision {
    public static void main(String[] args) {
        // Maps each hash code to every generated string that produces it.
        Map<Integer, List<String>> hashMap = new HashMap<>();

        // Character set: a-z followed by A-Z.
        String str = "abcdefghijklmnopqrstuvwxyz";
        str += str.toUpperCase();

        // Generate every two-letter string. Uncomment the c3 lines
        // (and the matching brace) to generate three-letter strings instead.
        for (char c1 : str.toCharArray()) {
            for (char c2 : str.toCharArray()) {
                // for (char c3 : str.toCharArray()) {
                String s = c1 + "" + c2; // + "" + c3;
                int code = s.hashCode();
                hashMap.computeIfAbsent(code, k -> new ArrayList<String>()).add(s);
                // }
            }
        }

        // Count the hash codes shared by more than one string
        // and remember the largest group.
        int collisions = 0;
        int max = 0;
        List<String> maxList = null;
        for (Entry<Integer, List<String>> e : hashMap.entrySet()) {
            List<String> l = e.getValue();
            if (l.size() > max) {
                max = l.size();
                maxList = l;
            }
            if (l.size() > 1) {
                System.out.println("Collision: " + l);
                ++collisions;
            }
        }
        System.out.println("collisions found: " + collisions);
        System.out.println("biggest collision: " + maxList);
    }
}
This reveals that among all 2,704 two-letter strings made up of ASCII letters we already have 1,250 collisions (with two strings for each affected hash code). When using three-letter strings, we’d see 37,500 collisions, with up to four strings per hash code.
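The same count can be sketched with the string length as a parameter instead of commented-out loops. This is only one possible way to do it, not the original class: CollisionCounter, groupByHash, countCollisions and largestGroup are names invented for this sketch, and the alphabet is assumed to be the same 52 ASCII letters as above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollisionCounter {
    // Assumed character set: the same 52 ASCII letters as in the class above.
    static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Groups every string of the given length over ALPHABET by its hash code.
    static Map<Integer, List<String>> groupByHash(int length) {
        Map<Integer, List<String>> groups = new HashMap<>();
        fill("", length, groups);
        return groups;
    }

    // Recursively appends one character at a time until the length is reached.
    private static void fill(String prefix, int remaining,
                             Map<Integer, List<String>> groups) {
        if (remaining == 0) {
            groups.computeIfAbsent(prefix.hashCode(), k -> new ArrayList<>())
                  .add(prefix);
            return;
        }
        for (char c : ALPHABET.toCharArray()) {
            fill(prefix + c, remaining - 1, groups);
        }
    }

    // Number of hash codes that are shared by more than one string.
    static long countCollisions(Map<Integer, List<String>> groups) {
        return groups.values().stream().filter(l -> l.size() > 1).count();
    }

    // Size of the largest group of strings sharing one hash code.
    static int largestGroup(Map<Integer, List<String>> groups) {
        return groups.values().stream().mapToInt(List::size).max().orElse(0);
    }

    public static void main(String[] args) {
        for (int length = 2; length <= 3; length++) {
            Map<Integer, List<String>> groups = groupByHash(length);
            System.out.println(length + " letters: "
                    + countCollisions(groups) + " collisions, up to "
                    + largestGroup(groups) + " strings per hash code");
        }
    }
}
```

Longer strings grow the map quickly (52 to the power of the length entries), so this approach is only practical for small lengths.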
When reviewing the implementation of String’s hashCode() method, you can conclude that collisions are very easy to provoke, both intentionally and accidentally. So you shouldn’t rely on hash codes being unique for your Strings.
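As a rough illustration of why collisions are so easy to produce, the sketch below recomputes the hash the way the String.hashCode() Javadoc describes it: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], in int arithmetic. The class ManualHash and the helpers manualHash and collidingTwin are hypothetical names for this example. collidingTwin assumes a two-character string whose second character has a code of at least 31, which holds for all ASCII letters.

```java
public class ManualHash {
    // Recomputes the documented formula with Horner's rule:
    // h = 31 * h + nextChar, starting from 0.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Builds a different two-character string with the same hash:
    // raising the first character by 1 adds 31 to the hash, and
    // lowering the second character by 31 takes it away again.
    // Assumes s.length() == 2 and s.charAt(1) >= 31.
    static String collidingTwin(String s) {
        char first = (char) (s.charAt(0) + 1);
        char second = (char) (s.charAt(1) - 31);
        return "" + first + second;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("Aa"));      // 2112 = 65 * 31 + 97
        System.out.println("Aa".hashCode());       // 2112
        System.out.println(collidingTwin("Aa"));   // BB
        System.out.println("BB".hashCode());       // 2112 = 66 * 31 + 66
    }
}
```

Since the multiplier is only 31 and the distance between an uppercase letter and a lowercase letter can be exactly 31, plain alphabetic strings already contain many such pairs.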