
Getting to Know ParAccel, Part II

In the first part of this series, I discussed what I'd learned about ParAccel the company. Here, at long last, are more thoughts on ParAccel the product.

General Architecture


As previously noted, ParAccel is an MPP, column-oriented database designed for analytics. That's not unique, so I'm not going to spend any time on it (there's a good website focused on that topic though - see http://www.fullcolumnscan.com/ :-P). Like most databases these days, it uses compression; some familiar methods were named (RLE, delta, etc.), but ParAccel was not as forthcoming about which methods are used as other vendors have been.

Deployment Modes


ParAccel provides two different deployment models - stand-alone MAVERICK mode and "stand-beside" AMIGO mode.

At the risk of sounding dismissive, the MAVERICK option results in ParAccel looking and acting like, umm... a database. You can connect to it directly, point BI tools at it, etc. It's a database. Powerful and useful, but a database as we're used to thinking about them nonetheless.

The AMIGO mode is more interesting. As previously mentioned, in AMIGO mode, ParAccel provides a query router that inspects all queries coming into the "real" database and decides whether they should be serviced by the original system or the ParAccel system. Selection of queries to be serviced by ParAccel is fully under the user's control - queries are matched using either literal comparisons or regular expressions. The control freak in me likes that, because what's best suited for each system is probably system and application dependent. The database administrator in me hates that, because it probably means continual maintenance. I'd be interested to hear what people using the system think.
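
To make the matching idea concrete, here is a minimal sketch of routing by literal comparison or regular expression. It is not ParAccel's router or its configuration format; the `Rule` class, the `route` function, the example rules and the "paraccel"/"primary" labels are all made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical routing rule: either an exact query text or a regex."""
    pattern: str
    is_regex: bool = False

    def matches(self, sql: str) -> bool:
        if self.is_regex:
            # Regex rules match anywhere in the query, ignoring case.
            return re.search(self.pattern, sql, re.IGNORECASE) is not None
        # Literal rules must match the whole (trimmed) query text.
        return sql.strip() == self.pattern


def route(sql: str, rules: list[Rule]) -> str:
    """Return "paraccel" if any rule matches, otherwise "primary"."""
    return "paraccel" if any(rule.matches(sql) for rule in rules) else "primary"


# Assumed example rules: one literal query, one regex for aggregate queries.
RULES = [
    Rule("SELECT COUNT(*) FROM orders"),
    Rule(r"\bGROUP\s+BY\b", is_regex=True),
]
```

Even in this toy form the maintenance worry is visible: every new query shape that should be offloaded needs a rule that someone has to write and keep current.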

It's also worth noting that if you want all the queries from a certain application to be serviced by ParAccel, you can always point that application at it directly, even in AMIGO mode. So the deployment modes aren't all-or-nothing.

Data Synchronization


How does the ParAccel database have the data necessary to satisfy the queries it intercepts, you ask? Triggers in the master database, at present. (Log shipping is coming in the future.) The triggers record changes in an audit table, and the ParAccel system periodically polls the master system for those records.

This means, of course, that there's an inherent delay between when data in the master system changes and when it is available in the ParAccel system. The polling interval is configurable though, so you can control this delay. Setting it lower than a couple of seconds will probably create overhead problems, so I question whether up-to-the-second synchronization is really feasible. While I seriously doubt that this delay would be an issue for most systems, it's something that must be considered nonetheless.
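
The trigger-plus-polling pattern can be sketched with an in-memory SQLite database standing in for the master system. Everything here is an assumption made for illustration: the `orders` table, the `orders_audit` layout, the trigger names and the `AuditPoller` class are invented, and the actual ParAccel mechanism may look quite different.

```python
import sqlite3


def make_master() -> sqlite3.Connection:
    """Build a toy master database whose triggers log changes to an audit table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
        CREATE TABLE orders_audit (
            seq   INTEGER PRIMARY KEY AUTOINCREMENT,
            op    TEXT,
            id    INTEGER,
            total REAL
        );
        CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
            INSERT INTO orders_audit (op, id, total) VALUES ('I', NEW.id, NEW.total);
        END;
        CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
            INSERT INTO orders_audit (op, id, total) VALUES ('U', NEW.id, NEW.total);
        END;
    """)
    return conn


class AuditPoller:
    """Stand-in for the analytic side: pulls new audit rows and applies them."""

    def __init__(self, master: sqlite3.Connection):
        self.master = master
        self.replica = {}   # id -> total, a stand-in for the replicated table
        self.last_seq = 0   # highest audit sequence number applied so far

    def poll_once(self) -> int:
        """Apply audit rows newer than last_seq; return how many were applied."""
        rows = self.master.execute(
            "SELECT seq, op, id, total FROM orders_audit WHERE seq > ? ORDER BY seq",
            (self.last_seq,),
        ).fetchall()
        for seq, _op, row_id, total in rows:
            self.replica[row_id] = total   # inserts and updates both upsert
            self.last_seq = seq
        return len(rows)
```

In a real deployment `poll_once` would run on a timer at the configured polling interval. Between polls the replica simply doesn't know about new changes, which is the staleness window discussed above.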

SQL Dialect


Most databases have their own dialect of SQL, usually ANSI plus some extensive additions. ParAccel actually has multiple dialects - its native dialect (Postgres + extensions) and SQL Server's T-SQL (intended for re-routed queries in AMIGO mode, but apparently not specifically limited to them). Support for Oracle's PL/SQL is coming in the future as well, I'm told.

This is how the "drop-in, no application changes necessary" effect is actually achieved - queries are intercepted, re-routed if necessary and then interpreted in their original form. Pretty cool, methinks.

Fault Tolerance


ParAccel's approach to failover is an interesting refinement of the method Netezza uses. Backup copies of each node's data are spread evenly amongst all other nodes (rather than a backup of an entire node existing solely on one other node). When a node is lost, that node's data is redistributed among the remaining nodes rather quickly ("in seconds"). As a result, if a node is lost in an n-node system, performance drops by only 1/n, as the work of managing the failed node's data is now distributed among all other nodes. This is a rather nice improvement over the Netezza-style approach, where performance is reduced by half when one node takes full ownership of another's data (as it must then do twice the work).

This does seem to have one flaw, however, though not a real serious one. If a second node fails before recovery of the first is complete, then... well, you're toast. In the Netezza model, you're fine as long as you don't lose the node on which the now-active backup resides. So either model is susceptible. ParAccel has simply chosen to trade some theoretical amount of reliability for better real-world performance after a failure. Given the unlikelihood of two nodes failing within seconds of each other I think that's a pretty smart bet.
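
The 1/n versus one-half figures fall out of some simple arithmetic. The sketch below assumes each node normally carries one unit of work and that a query finishes only when the most heavily loaded node finishes. Both are simplifying assumptions, and the function name is invented for illustration.

```python
from fractions import Fraction


def relative_throughput(n_nodes: int, spread_backups: bool) -> Fraction:
    """Throughput after one node fails, relative to a healthy cluster.

    Assumes one unit of work per node and that the slowest node
    determines how long a query takes.
    """
    if n_nodes < 2:
        raise ValueError("need at least two nodes to survive a failure")
    if spread_backups:
        # The failed node's unit is split evenly across the n - 1 survivors.
        heaviest = 1 + Fraction(1, n_nodes - 1)
    else:
        # A single partner node takes over the failed node's whole unit.
        heaviest = Fraction(2)
    return 1 / heaviest
```

For a 10-node system, the spread-backup model gives 9/10 of the original throughput (a drop of 1/10), while the single-partner model gives 1/2.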

Memory vs. Disk Storage


There's been a lot of discussion about ParAccel being an in-memory database, and about how their TPC-H results were done with all data in memory. I find this all very confusing, quite frankly, because it implies to me that things are done differently when data is known to be in RAM vs. on disk. This turns out to be dead wrong, however - ParAccel maintains all data on disk, always, and caches data in memory like most every other database. Plain and simple. A test performed with data all in memory is just that - a test where all the data fits in the memory cache. No magic, no less resiliency, no ACID violations, nothing. Data's on disk, data might be in memory, the end.

Data Load


Being able to query large amounts of data is of little use if one can't load the data in a reasonable amount of time. (That was the great lesson learned from my experiences with a very old and crusty MPP database... anyway...) ParAccel uses a two-phase load process: first get the raw data to the nodes, then have the nodes figure out where the data actually belongs. The second phase is automatic. There are two ways to accomplish the first phase:

  1. In standard mode, data is fed to the "leader" node, which distributes it uniformly to the rest of the nodes. Using this mode, data can be loaded at roughly 700GB/hr.

  2. In MPP mode, data is fed directly to the nodes in uniform chunks (via FTP, I believe). This is obviously a bit more of a hassle, but because there is no single bottleneck, this mode outperforms standard mode and scales according to the number of nodes in the system.

I'm not sure that the second approach is new - I believe that at least one other MPP DB player provides that ability, but I'll have to check. The key point remains the same, however - data can be loaded quickly in standard mode and can be ramped up with direct-to-node loads if necessary.
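
The two-phase idea can be sketched as follows. I don't know how ParAccel actually decides where a row "belongs", so the stable hash on the first column used here is purely an assumption, as are all of the function names.

```python
import zlib


def phase_one(rows, n_nodes):
    """Deal rows out round-robin so every node gets a uniform raw chunk."""
    staged = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        staged[i % n_nodes].append(row)
    return staged


def home_node(key, n_nodes):
    """Hypothetical placement rule: a stable hash of the row's key."""
    return zlib.crc32(str(key).encode()) % n_nodes


def phase_two(staged):
    """Each node forwards its staged rows to the node that owns them."""
    n_nodes = len(staged)
    final = [[] for _ in range(n_nodes)]
    for chunk in staged:
        for row in chunk:
            final[home_node(row[0], n_nodes)].append(row)
    return final
```

In this sketch the only difference between standard mode and MPP mode is who performs `phase_one`: a single leader node, or whoever splits the input and ships the chunks to the nodes directly. The second phase is identical either way.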

And no, not all MPP DB vendors can say that.

Operating System


This almost doesn't deserve mention anymore, but: ParAccel currently runs on Linux and is porting to Solaris. That seems to be about the standard these days. I will say that I'm a bit surprised that they don't support Windows, however, given their push for SQL Server compatibility. But then again people with SQL Server scalability problems are probably not all that hung up on keeping their database on Windows. :-)

Parting Thoughts


That, I hope, is some useful information on ParAccel at the technical level. It isn't everything I'd like to know or write about, but ya can't have everything. Hopefully I'll learn more as time goes on so that I can add more color in the future. For now, though, if there's anything blatantly missing or unclear (I'm sure there is), please let me know.

Next time: how I think ParAccel compares, where I think it fits in and what I think it means.

Comments
Concerning the "request router" in AMIGO mode, does that mean you can use ParAccel for a kind of "real time" reporting/analysis tool, as opposed to the batch type of reporting so often seen?

thanks
Hi Selim,

I think the short answer is yes.

The long answer is "it depends".

I think the goal of ParAccel's AMIGO mode is to reduce query response time for aggregate queries. That can get you part of the way to "real time" analytics; if a particular query takes an hour to run then you can't exactly get the results in real time, after all, whereas with a query that runs in seconds or even a couple of minutes you probably can.

The other half of the equation, however, is how quickly new or updated data gets into the database. Given that updates to the ParAccel database have to be sent by or retrieved from the primary database, there's some inherent delay in loading the data. So the data can't be loaded into the ParAccel database in "real time".

Ultimately, it's probably a matter of requirements. If your queries run fast enough, and you can tolerate some slight staleness (a few minutes at least, probably) in the data, then yes, I think the AMIGO mode would allow you to do "real time" analytics rather than having to batch things up on the primary database. Without knowing the exact requirements of a specific situation, though, I think it's hard to say definitively.