Question: Why is Row-based Processing Bad?
A reader writes:We have recently bought Netezza v4.5 and planning to move out EDW from Oracle 10g. We have a lot of PL/SQL ETL written along with OWB ETL. Do you have any examples why cursors ( Row Based Processing) are bad choice for both Netezza and Oracle for data warehousing application.
Unfortunately, I do not have any real-world examples of why this is a bad idea that I can share. At least not without getting myself in trouble and/or making some people angry at me, anyway.
When dealing with relational databases think sets, not rows.
Though we're used to thinking in terms of individual rows, SQL works on sets of rows. Relational databases process SQL, so... if you want your database to perform well, you've got to design things in terms of sets, not rows.
To a large degree people already think about sets without really realizing it. Here's a really contrived example: Let's say you wanted to find all the rows in a table where the Color field has a value of blue. You wouldn't write a stored procedure to loop over all the rows, inspecting the Color field of each, would you? No, you'd write a query that includes "WHERE Color = 'blue'" in it, and let the database figure out which set of rows within that table fits that criterion. See? You're already thinking about sets without even realizing it.
Where set-based thinking really becomes important, however, is scalability. Set-based operations are easy to break up into chunks and thus easy the parallelize. In a partitioned table, each partition is essentially its own set, so each partition can be operated on independently. In an MPP system like Netezza, where each table is distributed across all the SPUs, each SPU can operate on its set of data independently. Operating on independent sets in parallel makes things not only fast, but scalable - more independent sets equals higher possible scalability.
Cursors, on the other hand, well, they allow you to go one row at a time over all the rows you have to deal with. That's what they were designed for, after all. Wanna guess how (not) fast and (not) scalable that is?
The other factor to consider here is portability, or lack thereof. I'm not going to claim that all SQL is effortlessly portable between different database systems, but there's no question that queries are easier to move between systems than stored procedures. All the stored procedure languages I'm familiar with are proprietary, at any rate, and proprietary generally equals 'intentionally not portable'.
Why is portability important? Nobody likes to admit it, but for any given software system, the odds are good that you're going to have to rework everything at least once. Whether that's because you're forced to change database systems or because you need to generally overhaul things to get the functionality you want, it's highly likely to happen. Functionality built on a non-proprietary foundation makes both of those easier.
If you don't believe me, see the question that started this whole discussion.
Disclaimer: I detest stored procedures, and it has been a mercifully long time since I've worked extensively with Oracle. Thus I am unaware of any easy way to have multiple instances of a stored procedure running with each operating on different subsets of data, with Oracle or any other DB. If there is then I expect to be soundly and publicly redressed by those who know better. Regardless, I would think that set-based approaches would still make for more scalable solutions simply because sets are what relational databases are designed to work on, not procedural programs.


1. Fetch the 1B rows into an app server and count them using a java cursor
2. Open a cursor in PLSQL and count the same rows
3. select count(*) from table_1B (or avoid the FTS and count values in the PK col with an index FFS
there are orders of magnitude performance/scalability as you move from 1 to 2 to 3. Only do in PLSQL, JAVA or C what you absolutely HAVE to do. Use set based SQL for everything else.
- Concerned citizen