Interesting Links, 15 Sep 2016

  • It may be a little early to suggest that Eclipse is dead, but “The Fall of Eclipse” says it anyway, and says why. The Eclipse community would disagree, of course.
  • User liste pointed out MariaDB4j, which is… MariaDB embedded into a jar, suitable for integration testing with MariaDB in a build tool. Sure, H2 and Derby exist, but this allows you to test against MySQL and MariaDB, because if you have to use MySQL in whatever terrible world you happen to live in, you definitely want to test against it instead of a good database, to help you work out what awful bugs you need to avoid.
  • User yawkat also pointed out something that your humble author was unaware of: the hash code of Java Strings isn’t calculated until hashCode() is called. That makes perfect sense, actually. The javadoc for String’s hashCode() points out how the hashcode is calculated, but not when.
  • User cheeser pointed out O’Reilly’s (legal and free) Data Ebook Archive: “An archive of all O’Reilly data ebooks is available below for free download. Dive deep into the latest in data science and big data, compiled by O’Reilly editors, authors, and Strata speakers.”
  • The channel has mentioned tries (pronounced “trees”) as a form of data structure a few times lately; in case you don’t know what a trie is, or how it differs from a tree, see The Trie: A Neglected Data Structure.
  • Using NPM (Node.js‘ package manager) as part of a Java build came up one morning. Without any further context, here are a few references show up to integrating NPM into a maven build, none of which has been tried and tested by the person writing this up for you:

Finding hash collisions in Java Strings

In ##java, a question came up about generating unique identifiers based on Strings, with someone suggesting that hashCode() would generate generally usable numbers, without any guarantee of uniqueness. However, duplicate hash codes can be generated from very small strings, with ordinary character sets – see this stackoverflow answer – and therefore I thought it’d be interesting to find other short strings with the same hash values.
It was discussed how prone to collisions the Strings hashCode() method is, especially when using small strings. You would naturally assume a hashCode with an int as result – and thus 2 billion possible values – will be unique for small and simple strings.
Here’s a simple class to demonstrate this:

package org.javachannel.collisions;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
/**
* Simple example class to show amount of hashCode collisions in Strings.
*
* TODO: Make the number of characters and the character set to be more
* configurable
*
* @author Michael Stummvoll
*/
public class HashCodeCollision {
  public static void main(String[] args) {
    Map<Integer, List<String>> hashMap = new HashMap<>();
    String str = "abcdefghijklmnopqrstuvwxyz";
    str += str.toUpperCase();
    for (char c1 : str.toCharArray()) {
      for (char c2 : str.toCharArray()) {
        // for (char c3 : str.toCharArray()) {
        String s = c1 + "" + c2; // + "" + c3;
        int code = s.hashCode();
        if (!hashMap.containsKey(code)) {
          hashMap.put(code, new ArrayList<String>());
        }
      hashMap.get(code).add(s);
      // }
    }
  }
  int collisions = 0;
  int max = 0;
  List<String> maxList = null;
  for (Entry<Integer, List<String>> e : hashMap.entrySet()) {
    List<String> l = e.getValue();
    if (l.size() > max) {
      max = l.size();
      maxList = l;
    }
    if (l.size() > 1) {
      System.out.println("Collision: " + l);
      ++collisions;
    }
  }
  System.out.println("collisions found: " + collisions);
  System.out.println("biggest collision: " + maxList);
  }
}

This reveals that in all permutations of 2 letter strings consisting of letters we already have 1250 collisions (with two strings for each given hash code). When using 3 letter strings, we’d see that we have 37,500 collisions with up to four strings per hash code.
When reviewing the implementation of String‘s hashCode() method, you can conclude that it’s very easy to provoke collisions both ways, both intentionally and accidentally. So you shouldn’t rely on hash codes being unique for your Strings.

Guava's Equivalence: strategy for equals()/hashCode()

Introduction

Given a class C, if you want to implement ordering, you have two choices:

  1. Have your class implement Comparable<c> (or Comparable<x>, where X is a superclass of C).
  2. Define a Comparator<c> to use with this class (or a Comparator<x>, where X is a superclass of C).

In fact, many classes in the JDK will have you supply a Comparator if your class does not implement Comparable; examples include Collections.sort() and Arrays.sort().
It can be said that for a given class C, a Comparator defines a strategy for ordering, and that you need to supply a Comparator if the class itself does not define this strategy (that is, does not implement Comparable).
And while the JDK offers a means to provide different strategies for ordering, it does not do so for a more fundamental contract of Object: equals() and hashCode().
And this is where Guava comes in.

Equivalence: a strategy for Object’s `equals()` and `hashCode()`

Guava’s Equivalence intends to fill this gap. In the same vein as a Comparator, using an Equivalence allows you to either define a different equals()/hashCode() strategy than the one already defined by your target class, or to define one for a target class which “has none at all” (meaning, in this case, that the class uses Object‘s equals()/hashCode() implementations).

Usage part 1: implementing an Equivalence

Equivalence is a parameterized class; for a class C, you will need to implement two methods:

  • doEquivalent(C a, C b): given two instances a and b of class C, are those two classes considered equivalent? This is really like writing an implementation of equals() for class C, except that you don’t have to handle nulls (it is guaranteed that both parameters are non-null) nor casts (it is guaranteed that both arguments are “at least” of type C).
  • doHash(C t): given an instance t of class C, compute a hash value. Of course, an implementation must mirror Object‘s hashCode()/equals() contract: if doEquivalent(a, b) is true, then doHash(a) == doHash(b).

Note that it is guaranteed that arguments to these methods will never be null.

Usage part 2: “out of Collections” usage

There is really only one method you will need here: .equivalent(). Provided you want to test equivalence between two instances of a given class C, you will do:

final C a = ...;
final C b = ...;
final Equivalence<C> eq = ...;
// Test equivalence of a and b against equivalence stragey eq
if (eq.equivalent(a, b)) {
    // Yup, they are
}

Usage part 3: Collection usage

Unlike the Comparable/Comparator relationship, equivalence between objects has to be “engraved” into collections. This unfortunately means that the syntax to initiate a collection is somewhat on the verbose side. Namely, if you want to initiate a Set of elements of class C wrapped into an Equivalence, you will have to initialize it as such:

// Java 7 and better...
Set<Equivalence.Wrapper<C>> set = new HashSet<>();
// Java 5 or 6; and since we use Guava...
Set<Equivalence.Wrapper<C>> set = Sets.newHashSet();

You will also need to rely on an Equivalence implementation in order to interact with this collection (of course, you also need to ensure that you use the same implementation all along!):

// Your Equivalence...
Equivalence<C> eq = ...;
// Inserting an element c into a Set<Equivalence.Wrapper<C>> set
set.add(eq.wrap(c));
// Removing, testing...
set.remove(eq.wrap(c));
set.contains(eq.wrap(c));
// Retrieve the original element
C element;
for (final Equivalence.Wrapper<C> wrapped: set) {
    // get the original element
    element = wrapped.get();
    // do something with element
}

Conclusion

Admittedly, having to redefine an equivalence strategy is far less common than having to redefine an ordering strategy. It is, however, a welcome tool to use when you have to deal with a “foreign” class which doesn’t meet your equals()/hashCode() expectations, either because there is no implementation at all for this class, or because the existing implementations don’t suit your needs.
Happy coding!