The typical real-life ETL cycle consists of the following execution steps:
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
5. Stage (load into staging tables, if used)
6. Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
7. Publish (to target tables)
8. Clean up
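The steps above can be sketched as a simple driver script. This is a minimal illustration, not a reference implementation: the step names follow the list, but every function body is a hypothetical stub.

```python
# Hypothetical driver sequencing the ETL cycle steps above.
# Each step is a stub; a real job would call actual extract/transform/load code.

def run_cycle(sources):
    log = []                      # ordered record of completed steps (for auditing)

    def step(name, fn):
        result = fn()
        log.append(name)          # record completion so a rerun can see progress
        return result

    step("initiate", lambda: None)
    ref = step("build_reference_data", lambda: {"country_codes": {"US", "DE"}})
    raw = step("extract", lambda: [row for src in sources for row in src])
    clean = step("transform", lambda: [r.upper() for r in raw if r])  # clean + rules
    step("stage", lambda: list(clean))     # load into staging tables, if used
    step("audit", lambda: len(clean))      # e.g. row counts against business rules
    step("publish", lambda: list(clean))   # move staged rows to target tables
    step("clean_up", lambda: None)
    return log, clean

cycle_log, published = run_cycle([["a", "b"], ["", "c"]])
```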
Four-layered approach for ETL architecture design
- Functional layer: Core functional ETL processing (extract, transform, and load).
- Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
- Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.
- Utility layer: Common components supporting all other layers.
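One way to picture this separation of concerns is as cooperating components, one per layer. The class and method names below are illustrative assumptions, not part of any standard.

```python
# Illustrative skeleton of the four layers; all names are hypothetical.
import datetime

class UtilityLayer:
    """Common components supporting all other layers."""
    @staticmethod
    def now():
        return datetime.datetime.now().isoformat()

class AbcLayer:
    """Audit, balance and control: execution statistics, rejects, error handling."""
    def __init__(self):
        self.stats = {"rows_in": 0, "rows_out": 0, "rejects": 0}
    def count(self, key, n=1):
        self.stats[key] += n

class FunctionalLayer:
    """Core extract/transform/load logic."""
    def __init__(self, abc):
        self.abc = abc
    def run(self, rows):
        self.abc.count("rows_in", len(rows))
        out = [r for r in rows if r is not None]   # trivial stand-in "transform"
        self.abc.count("rejects", len(rows) - len(out))
        self.abc.count("rows_out", len(out))
        return out

class OperationalLayer:
    """Job-stream definition, scheduling, monitoring."""
    def __init__(self, functional):
        self.functional = functional
    def schedule(self, rows):
        started = UtilityLayer.now()
        return {"started": started, "result": self.functional.run(rows)}

abc = AbcLayer()
job = OperationalLayer(FunctionalLayer(abc))
outcome = job.schedule([1, None, 2])
```

The point of the split is that the functional layer never does its own scheduling or statistics; those concerns stay in their own layers and can be changed independently.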
Use file-based ETL processing where possible
- Storage is relatively inexpensive
- Intermediate files serve multiple purposes:
  - Used for testing and debugging
  - Used for restart and recovery processing
  - Used to calculate control statistics
- Helps to reduce dependencies and enables modular programming.
- Allows flexibility in job execution and scheduling
- Better performance if coded properly, and can take advantage of parallel-processing capabilities when the need arises.
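A minimal sketch of file-based processing, assuming each phase writes its output to an intermediate file (file names and the CSV layout are illustrative). Because each phase reads the previous phase's file, a failed run can restart from the last good file, and control statistics fall out of the files themselves.

```python
import csv, os, tempfile

def write_stage(path, rows):
    # Persist a phase's output so later phases (and reruns) can pick it up.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def read_stage(path):
    with open(path, newline="") as f:
        return [row for row in csv.reader(f)]

workdir = tempfile.mkdtemp()
extracted = os.path.join(workdir, "extracted.csv")
transformed = os.path.join(workdir, "transformed.csv")

# Extract phase output.
write_stage(extracted, [["1", "alice"], ["2", ""], ["3", "carol"]])

# Transform phase reads the extract file, so it can be rerun independently.
rows = read_stage(extracted)
good = [r for r in rows if r[1]]          # drop rows failing a simple rule
write_stage(transformed, good)

# Control statistics come straight from the intermediate files.
control = {"extracted": len(read_stage(extracted)),
           "loaded": len(read_stage(transformed))}
```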
Use data-driven methods and minimize custom ETL coding
- Parameter-driven jobs, functions, and job-control
- Code definitions and mapping in database
- Consider data-driven tables to support more complex code mappings and business-rule application.
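A small sketch of data-driven code mapping. Here a dict stands in for a mapping table that would normally live in the database; the domains and codes are invented for illustration. Adding or changing a code means updating the table, not the ETL code.

```python
# Hypothetical code-mapping table; in practice this would be read from a database.
code_map = {
    ("gender", "M"): "Male",
    ("gender", "F"): "Female",
    ("status", "A"): "Active",
    ("status", "I"): "Inactive",
}

def apply_mapping(domain, code, default="Unknown"):
    # One generic lookup replaces per-code branching in the ETL logic.
    return code_map.get((domain, code), default)

decoded = [apply_mapping("gender", c) for c in ["M", "F", "X"]]
```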
Qualities of a good ETL architecture design:
- Recoverable (run_id, ...)
- Operable (completion-codes for phases, re-running from checkpoints, etc.)
- Auditable (in two dimensions: business requirements and technical troubleshooting)
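The recoverability and operability qualities can be sketched with checkpoint logic: each run has a run_id, each phase records a completion code, and a rerun skips phases already completed. The completion-code values and the in-memory control table below are illustrative assumptions.

```python
# Hypothetical checkpoint table: (run_id, phase) -> completion code.
# In practice this would be a control table in the database.
completed = {}

def run_phase(run_id, phase, fn):
    if completed.get((run_id, phase)) == 0:
        return "skipped"                    # already succeeded in a prior attempt
    try:
        fn()
        completed[(run_id, phase)] = 0      # 0 = success (checkpoint)
        return "ran"
    except Exception:
        completed[(run_id, phase)] = 8      # nonzero = failure; a rerun retries it
        return "failed"

run_id = "2024-06-01-001"
attempt1 = [run_phase(run_id, "extract", lambda: None),
            run_phase(run_id, "transform", lambda: 1 / 0)]   # transform fails
attempt2 = [run_phase(run_id, "extract", lambda: None),      # checkpointed: skipped
            run_phase(run_id, "transform", lambda: None)]    # retried: succeeds
```

The completion codes double as audit evidence: the control table shows which phases of which run succeeded, failed, or were skipped.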