Oracle availability options

There are at least 5 options to consider:

We'll look at some of the pros and con's of each.
OPS (Oracle Parallel Server)
In this scenario we are using Oracle's ability to have 1 computer, more then 1-instance mount and open a single database. In the event a machine fails -- the other machines continue processing with little or no interruption. You might, for example, use a configuration with 2 nodes. If one node fails, the other node will recover any transactions the failed node had outstanding and rollback any in progress transactions from the failed node. The system will never go down (but users that were processing a transaction on the failed node will receive an error and will have to resubmit their transaction, your software could of course detect the failure and automatically resubmit as well).

The upside to this is:

- Increased scalability horizontally. If you need higher capacity, you can add another machine or node. You do not need to necessarily buy bigger machines -- you need more machines.

- Proven technology. An OPS has been around for a long time, we have many people using this -- it is well known and understood by many people. There are design considerations you have to be aware of -- but they are well known (which is half the battle). For example, partitioning will play a key role in designing an application with OPS.

- Fairly transparent. As opposed to replication for example -- replication will impact how the application interacts with data. OPS will not. At the application level, the fact you are using OPS is not as relevant as it would be with replication. OPS will impact the DBA but not necessarily the programmer. Replication will impact both.

Some of the downsides are:

- You have to design to work with OPS. You have to understand OPS. You need to be aware of the potential performance bottlenecks. In 8.1.6 (Oracle8i, release 2) there is a special mode of OPS in a 2 node cluster where 100% of the performance issues are a NON-issue. Both nodes are used to read data -- only one at a time to modify. The second node will take over for the first node in the event of a failure.

- Performing upgrades will most likely cause server downtime. Typically (there are exceptions) all nodes would be offline to upgrade. The time to perform the upgrade will vary but may be as long as 1 hour. These are upgrades to the Oracle software -- not to the OS.

- DBA must be fully aware of an OPS environment. Demands a DBA who knows OPS, not just an Oracle DBA.
 

Replication
Replication can be used to perform failover -- but it wasn't really designed for failover. It was designed to make data available in more then one location. The way replication works is that a transaction takes place at 'db1'. That transaction is stored in a local queue at 'db1'. When the user commits, the database will (in the background, not as part of the user transaction) forward the transaction stored in the queue to other databases interested in that data. This transaction is then performed at these other sites.

Replication is asynchronous in nature - you have to assume that at any point in time, the various databases can and will contain different data as they are all at different points in time. This fact can make application development somewhat harder. Let's say the site you are building is a site to sell stuff. You have an inventory. You have X widgets in inventory. At 2 different sites, people simultaneously order X widgets. As far as your software is concerned, both orders can be filled. It is only AFTER the transactions at both sites have been performed that you discover in the background that you now have -X widgets in inventory (you had an order for 2X widgets). This is known as an update conflict and it is up to you to 'fix it'. Oracle discovers it for you but will then invoke user-defined code to fix the problem. This process must be part of your application design.

You can decide to avoid conflicts by partitioning data (e.g.: West Coast data and East Coast data, you need to go to the west coast server to affect west coast data and vice versa). Many people do this however if you fail over from the west coast to the east coast -- you must remember that there probably are transactions queued in the failed west coast machine that will be propagated to the east coast when you fix the problem. You still have the potential for update conflicts when using replication for failover -- you still need to design for how you will handle these issues.

To answer your replication questions directly:

Replication vs. Parallel Server - which one would you recommend and for what reasons? What are the pros and cons of each (keep in mind scalability, redundancy, availability and backup issues).

If failover and availability are the key issues -- OPS wins out in my opinion.
If redundancy is the main issue -- replication is the answer (as OPS is not a redundant solution -- there is only one database).
They are equal in the backup area (same issues, same considerations). Replication may be better for uptime and backup purposes, but how is response?

Both replication and OPS have certain performance implications. Both will have to be designed for in many respects. With OPS and a 2-node configuration -- we can eliminate the performance considerations. With OPS and a single write node and many (more then 2) read nodes, we can eliminate most of the performance considerations. With replication, a client transaction will generate about 2x the IO for any given WRITE transaction and that will not affect read transactions at all.

In short -- both will have issues. Replication - Long distance (for disaster recovery purposes) - how would the replication work? Would you be able to update both DB.s, then the changes get written to both of them (during the update/replicate process) or is only one database updated then the changes from that database get written to the second/replicated database?

Replication works in a store and forward fashion. The client updates a row in db1, db1 will not only update that row but it will queue a transaction (locally). After the client commits, db1 will pick up the transaction from the queue and send it to db2, db3, ... and so on. You can update the same row in any number of databases -- but you have to be able to handle the resulting update conflicts this causes -- Oracle detect them, you fix them. OPS would allow at most 1 person at a time to modify a single row (just like a single database would since OPS is a single database).

Some of the upsides of replication are:

- Replication works over long distances and WANS.

- Replication is tolerant of short network outages between database servers (short is relative here and is a function of the amount of changes performed in a given period of time. You do not want the servers to get too far out of sync. It could affect the application as well if the data diverges too much).

- Replication imposes a 'fixed' overhead (2X the data modified for inserts/deletes, 3X for updates (before and after images are stored in the queue).

- Replication provides you with a copy of the data elsewhere. With OPS -- there is one database. An earthquake can take out the only copy.
 

Some of the downsides are:

- Conflict resolution and its impact must be carefully studied and designed for.
- You need a DBA familiar with this technology.
- Your developers need to be familiar with the impact of this technology.

(NOTE: you can use OPS and replication together. OPS for rapid failover, replication for a hot spare. Something else to consider).
 

Standby Database
The standby database is a nice solution for failover. In this mode, the logs generated by Oracle are forwarded to another computer in the network and applied to a database instance that is in constant recovery mode. This standby database can be opened for reporting and such (read only in 8I only) but generally is unavailable while in recovery mode. This standby database lags behind the real database by some small amount (governed by the size of your redo files) but can be brought on line rapidly after a failure and can be totally caught up to the master database if you can get to the file system on the down machine (e.g.: plug the disks that contain the redo into another machine). You should be aware the replication has the same sort of lag time as a standby database would.

Some of the upsides of standby databases are:

- They are extremely easy to setup.
- Used for disaster and/or failure tolerance
- They have no performance impact on the master machine.
- Real-time synchronization of the standby database
- Standby database may be used for read-only transactions (8i only)
- They impose no design considerations on the application.

Some of the downsides are:

- The DBA must be aware of the standby database. If you make structural changes to the master, they must be coordinate to the standby. This is minimal.
- Once you failover to the standby -- there is *no* failback. The standby will become the master. You will build a new standby to cover this new master.
- Only one standby database is supported
- Must have sufficient network bandwidth between master and standby databases to transfer the archived log files.
- Separate license is required

NOTE = Oracle 8I send the archived log files automatically to the destination, also automatically applies the archived redo logs to the standby DB as they arrive.

OS oriented solutions.
The concept here is that 2 machines have the same set of disks mounted. In the event one machine fails, the second machine takes over the IP address and starts a recovery script of some sort. This script would startup the Oracle instance, let it recover and you'd be good to go.
Talk to your OS vendor about these types of solutions.
 
Architectural Solutions.
This is what I call the E-Bay versus Yahoo solution. E-Bay is a giant single server running on an E-10000. When they have a failure -- everything is down (their failures have to date mostly been operator error. failure to install recommended patches, failure to follow procedures and such). A database failure at E-bay is visible to everyone immediately. It is all or nothing.

Yahoo on the other hand has lots of little machines with lots of little databases. You want stocks -- you go to quote.yahoo.com. You want some business news -- you goto that server. Tech news - else where. You want a random site, that is random.yahoo.com. And so on. Yahoo is a whole bunch of databases working together. If part of yahoo is down -- hardly anyone notices since each server is just doing a little part of the whole. Yahoo never goes down but at any point in time, part of it might be. This requires lots more design and a really good / redundant directory service (to find stuff) but is doable.

Whether this can work for you depend on the nature of your site.

For more information, you can read this article:  Implementing an Automated Standby Database
 
 

Comparison of Disaster Recovery Solutions

 
  Replication Standby Database Geographic Mirroring Geo-mirrored Standby
Overhead on Production System High, limit to max transactions per second Low Medium Low to medium
Potential Data and Transaction Loss Yes 
(transactions that have not been replicated to remote site when disaster occurs)
Yes 
(transactions in active logs of production site that have not been transferred to standby site)
No No
Concurrent Use of Failover System Yes Read-only Reporting Only No 
(except other applications until failover initiated)
Read-only reporting only
Fall Back or Recovery Technique Instance and media recovery with conflict resolution or rebuild Re-clone the database Storage re-synchronization Storage re-synchronization or re-cloan the database
Separation Distance Thousands of miles Thousands of miles Large performance impact for distances over 66 km hundres of miles
Application support Application must be replication aware any application should work any application should work any application should work
System Environment heterogeneous mix of systems vendors and Oracle versions Same systems vendor and Oracle version Same systems vendor  and Oracle version Same systems vendor and Oracle version
Cost Medium Low High Medium to High
Complexity High Medium Low Medium