Until now, if I had a multi-terabyte database, my implementation choice would have been Oracle. I am in no way an Oracle fan; Oracle has its flaws, for example its crappy installer application, but compared with the other choices like Sybase and DB2 for Linux, it is light years ahead...
The only strategy for dealing with large amounts of data is divide and conquer.
With Oracle, the features that help you achieve that are Partitioning and Compression. Depending on your use case, RAC and Exadata might provide you with some decent choices. MapReduce is a popular algorithm that can be used to analyze your data, and you can do it with Oracle as well: http://blogs.oracle.com/datawarehousing/entry/in-database_map-reduce
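To make the divide-and-conquer idea concrete, here is a minimal in-memory sketch of the map-reduce pattern in Java. It is only an illustration and has nothing to do with how Oracle or Hadoop implement it; the class name PartitionedCount, its methods and the sample data are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: count keys by partitioning, mapping and reducing.
class PartitionedCount {
    // Divide: split the input into partitions of at most 'size' elements.
    static <T> List<List<T>> partition(List<T> data, int size) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < data.size(); i += size) {
            parts.add(data.subList(i, Math.min(i + size, data.size())));
        }
        return parts;
    }

    // Map: count occurrences within a single partition.
    static Map<String, Integer> mapPartition(List<String> part) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : part) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Reduce: merge the partial counts into one result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return total;
    }

    static Map<String, Integer> count(List<String> data, int partitionSize) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> part : partition(data, partitionSize)) {
            partials.add(mapPartition(part));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        // Made-up sample data, processed in partitions of two elements.
        System.out.println(count(Arrays.asList("a", "b", "a", "c", "b", "a"), 2));
    }
}
```

In a real system each partition would live on a different disk or node and the map step would run in parallel; the sketch only shows the shape of the computation.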
OK, enough about classic databases, let's look into the "new" technologies available out there that I could use instead.
Let's look into Cassandra + Hadoop.
Installation: after downloading Cassandra and setting up 4 Solaris containers with it, I have a running test cluster:
user@host:/zones/common/apache-cassandra-1.0.5/bin$ ./nodetool -host 192.168.1.50 ring
Address       DC           Rack   Status  State   Load      Owns    Token
                                                                    111092484197642270557254836842258840364
192.168.1.51  datacenter1  rack1  Up      Normal  20.45 KB  42.95%  14027083099678615413015982312665843724
192.168.1.50  datacenter1  rack1  Up      Normal  22.09 KB  1.64%   16817568390080366247986870584687685022
192.168.1.52  datacenter1  rack1  Up      Normal  11.1 KB   12.46%  38016701835136693969806387655967731276
192.168.1.53  datacenter1  rack1  Up      Normal  15.47 KB  42.95%  111092484197642270557254836842258840364
OK, installation was really easy. Let's download the code (no Maven-based build, disappointing) and inspect the code quality a bit...
Bad code case 1, calling equals on objects of different types (QueryProcessor.java):
if (rows.get(0).key.key.equals(startKey))
    rows.remove(0);
Here equals compares a ByteBuffer with a RowPosition, so the result will always be false. This one could be a potentially funny bug; there are 2 instances like this in the codebase.
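A minimal sketch of why such a comparison can never succeed. ByteBuffer is the real JDK class, but Position is a hypothetical stand-in I made up for this example, not Cassandra's RowPosition, and the "fixed" variant only assumes the intent was to compare the underlying keys.

```java
import java.nio.ByteBuffer;

class EqualsTypeMismatch {
    // Hypothetical stand-in for a row position that wraps a key; not Cassandra's class.
    static final class Position {
        final ByteBuffer key;

        Position(ByteBuffer key) {
            this.key = key;
        }
    }

    // Same shape as the bug: a ByteBuffer is compared with an object of another type.
    static boolean buggyMatch(ByteBuffer rowKey, Position start) {
        // ByteBuffer.equals returns false for anything that is not a ByteBuffer.
        return rowKey.equals(start);
    }

    // Compares like with like: ByteBuffer against ByteBuffer.
    static boolean fixedMatch(ByteBuffer rowKey, Position start) {
        return rowKey.equals(start.key);
    }

    public static void main(String[] args) {
        ByteBuffer rowKey = ByteBuffer.wrap(new byte[] {1, 2, 3});
        Position start = new Position(ByteBuffer.wrap(new byte[] {1, 2, 3}));
        System.out.println(buggyMatch(rowKey, start)); // false, even though the keys match
        System.out.println(fixedMatch(rowKey, start)); // true
    }
}
```

The compiler accepts both versions because equals takes an Object, which is exactly why a static analysis tool is needed to catch this.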
Bad code case 2, resource leak:
FileInputStream tsf = new FileInputStream(options.truststore);
FileInputStream ksf = new FileInputStream(options.keystore);
SSLContext ctx;
try
{
    ...
}
catch (Exception e)
{
    throw new IOException("Error creating the initializing the SSL Context", e);
}
finally
{
    FileUtils.closeQuietly(tsf);
    FileUtils.closeQuietly(ksf);
}
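What if options.keystore does not exist and an exception is thrown? Both streams are opened before the try block, so in that case the finally clause never runs and tsf is never closed. There are 12 instances like this in the codebase.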
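One way to avoid this kind of leak, assuming Java 7 or later is available, is try-with-resources. The sketch below is not the Cassandra code: Resource is a hypothetical stand-in for FileInputStream that records when it is closed, so the behaviour can be observed without touching the file system.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SafeOpen {
    // Records the names of closed resources, in closing order.
    static final List<String> closed = new ArrayList<>();

    // Hypothetical stand-in for FileInputStream: opening may fail, closing is recorded.
    static final class Resource implements AutoCloseable {
        final String name;

        Resource(String name, boolean exists) throws IOException {
            if (!exists) {
                throw new IOException(name + " not found");
            }
            this.name = name;
        }

        @Override
        public void close() {
            closed.add(name);
        }
    }

    // Both resources are opened in the try header, so if the second
    // constructor throws, the first one is still closed automatically.
    static boolean load(boolean keystoreExists) {
        try (Resource tsf = new Resource("truststore", true);
             Resource ksf = new Resource("keystore", keystoreExists)) {
            return true; // real code would build the SSLContext here
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulate a missing keystore: the truststore must still be closed.
        System.out.println(load(false) + " " + closed);
    }
}
```

On older Java versions the same effect needs the second open moved inside the try block, with null checks in the finally clause.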
Bad code case 3 (dead code):
public int compareTo(TimedOutException other) {
    if (!getClass().equals(other.getClass())) {
        return getClass().getName().compareTo(other.getClass().getName());
    }
    int lastComparison = 0;
    TimedOutException typedOther = (TimedOutException)other;
    return 0;
}
Both lastComparison and typedOther are assigned but never used. There are 27 instances of similar dead code in the codebase...
There are other issues in the code, such as inefficient iteration over Maps and potential null pointer dereferences...
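For the Map iteration case, the usual pattern is a loop over keySet() that calls get() for every key, where a loop over entrySet() would do. A minimal sketch of both variants; the class name, map contents and numbers are made up for illustration and are not taken from the Cassandra code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapIteration {
    // Inefficient: every iteration performs an extra lookup via get().
    static long sumViaKeySet(Map<String, Integer> loads) {
        long total = 0;
        for (String node : loads.keySet()) {
            total += loads.get(node);
        }
        return total;
    }

    // Preferred: entrySet() hands over key and value together.
    static long sumViaEntrySet(Map<String, Integer> loads) {
        long total = 0;
        for (Map.Entry<String, Integer> entry : loads.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample values keyed by node name.
        Map<String, Integer> loads = new LinkedHashMap<>();
        loads.put("node1", 20);
        loads.put("node2", 22);
        loads.put("node3", 11);
        System.out.println(sumViaKeySet(loads) + " " + sumViaEntrySet(loads)); // 53 53
    }
}
```

Both methods return the same result; the second one just avoids one lookup per element.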
The good thing is that all of the above issues are detected by FindBugs.
In my team we enforce FindBugs: the build fails on any finding, and we suppress false positives with the SuppressWarnings annotation. I am extremely happy with the result; enforcing FindBugs helped us improve the quality of our product.
If your team does not enforce FindBugs yet, it's never too late to start enforcing it...
In the next article I will look at the Hadoop integration...