"New Breed" Database Extensibility

At present, I can think of at least 5 "new breed" database vendors that you allow you to extend their SQL language in some form or another:

Greenplum (I think)

Of course, the old guard is well-represented in this category as well - Oracle, DB2, SQL Server, Teradata, etc. all allow language extension via custom functions, plug-ins, etc.

I don't know enough about it yet to know, but I think that Greenplum and AsterData might also belong on this list due to their support for Map/Reduce.

It wasn't all that long ago (3 or 4 years) that most of these vendors didn't even support the full SQL standard, never mind compiled-code extensions to their SQL language. Oh, the possibilities.

Confessions of an Old SQL Hacker: I Like ANSI Joins

For longer than I care to remember, I've been writing joins like so:

SELECT col1, col2 -- Boo on "SELECT *"!
WHERE T1.x = T2.x

Over the last six months or so, however, I've found myself forced to used outer joins (which are evil, but that's another story) and thus having to write queries like this:

SELECT col1, col2 -- Seriously, "SELECT *" is for rookies

Then, to keep things consistent (think self-Romanism) inner joins obviously get written like this:

SELECT col1, col2 -- Did you really think I was going to use "SELECT *"?
INNER JOIN T2 ON(T1.x = T2.x)

What's shocking about this is that I find myself liking this syntax. It's much clearer, at least in my opinion, especially when joining more than two tables or when there's more than one join criteria between two tables. Further, it separates the join criteria from the "filter" criteria, i.e. what's in the WHERE clause is only what belongs in the WHERE clause. Consider:

SELECT col1, col2 -- You get the point
FROM T1, T2, T3, T4
WHERE T1.x = T2.x
AND T2.x = T3.x
AND col3 = 'foo'
AND T3.x = T4.x
AND T2.y = T3.y


SELECT col1, col2 -- I hope you do anyway
INNER JOIN T2 ON(T1.x = T2.x)
INNER JOIN T3 ON(T2.x = T3.x AND T2.y = T3.y)
INNER JOIN T4 ON(T3.x = T4.x)
WHERE col3 = 'foo'

See the difference? Join criteria are forced to be grouped together logically, and the WHERE clause is cleaner. It's hard not to like it.

And it beats the hell out of the old Oracle outer join syntax. 'nuf said on that subject - you either know what I'm talking about or wouldn't believe that I was quoting real syntax. Anyway...

There you have it. I like the ANSI join syntax. I admit it.

I guess you can teach an old dog new tricks.

New Join Types Discovered

I spent a good chunk of my day today reworking SQL queries, and while doing so I discovered two join types I'd never encountered before.

• The INNERT JOIN is guaranteed to produce no rows. It is useful when you want to be sure your query does nothing, i.e. it's inert.

• The LEFT OVER JOIN returns any rows that would not have been returned by a LEFT OUTER join, i.e. rows that were... well, left over.

I think I need to take a break...

RAM Storage is not the Answer... At Least not for MySQL

There's an interesting article over at MySQL Performance Blog that talks about some MySQL benchmarks using RAM-based storage.

Here's the most interesting part (to me, anyway):
However even with MyISAM we got CPU bound before we could reach the system capacity - these 70K queries/sec generated just over 50K IOs/sec while capacity was over 100K IOs/sec

I think Peter hits the nail on the head when he theorizes about the cause of the performance problems he saw:

My guess is Innodb just was not really designed and tested in such condition - normal case is to allocate cache memory to buffer pool so IOs will mostly come from drives directly

We all know that I'm not a big fan of MySQL, so I'm prone to believe that other systems might handle this better. But the point still holds - RAM storage is not the answer. Having super-fast I/O won't solve your performance problems unless the DBMS you're using is designed to take advantage of it. And given that every DBMS I can think of is designed around the assumption that I/O is expensive and contortions to avoid it are cheap, odds are that the database you're using isn't designed to take advantage of it. So don't bet that super-fast storage - RAM-based or otherwise - will solve your performance problems.

In short: think smaller, not faster.

Stonebraker Dismisses Columns, Heralds New Data Friendly Systems

I recently had the opportunity to sit down with Michael Stonebraker, database pioneer and co-founder of everyone's favorite database start-up Vertica. I expected to hear some deep insights from Dr. Stonebraker, but I was completely unprepared for what he had to say.

"The current generation of column-oriented, analytics databases was designed back at the turn of the century to make reporting faster," he began, "without any thought to how the data felt about being turned on its head. Talk to any DBA at any company these days, and they'll tell you that their data is in pain, real pain. All that time spent crammed together with similar datums in uncomfortable positions was fine when data was silent and reporting was all-important, but that's not the case today. Today businesses know that their data is valuable, and it's becoming more and more important to treat that data sensitively and humanely."

Stunned, I fumbled around for an intelligent-sounding question. "What can be done?" I asked lamely.

"Data need to be provided with comfortable living conditions, for starters", he replied. "They must no longer be subjected to draconian rules about how to arrange themselves or judgmental labels such as 'row-oriented' or 'analytics optimized'. Each datum must be allowed to pursue its own destiny... each must be free to store itself in the manner of its own choosing, to reside near other datums it finds interesting, to decide for itself whether it's more important to modify itself quickly or respond to queries quickly. Data has rights, after all.

Ultimately, too, I think data should only be required to work an 8 hour day. There's some research being done in that direction - vacation-based compression, I think they're calling it - to see if reducing the length of the workday for each datum can effectively reduce storage requirements. The first work being done there is aimed at a new storage engine for MySQL, I believe."

It sure is an interesting time to be a database geek...