Components of a Data Warehouse Architecture - Part 1, ETL and the Staging Area

Decision support systems are usually based onprocess, often represents a major part of the
the development of Data Warehouseman effort to develop a DW (international
infrastructures. A data warehouse (hereafter DW)statistics estimate that it exceeds 70% of total
architecture has two major areas: the stagingeffort). The DW staging area, is often
area and the presentation area. In this article weimplemented in a separate physical server (staging
present the staging area. The sources from whichserver), thus adding complexity and cost.
data shall be systematically extracted, in order toHowever, this approach has certain advantages
be loaded in the DW, are determined. Thelike:
database schema documentation of these- isolation of raw data which are extracted from
sources, is reviewed in order to design the datasources, from processed data which are
extraction logic. Documentation quality of the dataaccessible by business analysts
structures of these sources, influences the- additional security and process quality, given that
degree of difficulty in designing the dataDW users have no access in this area
extraction logic. Data extracted are loaded in the- load sharing, given that 'data preparation' tasks
staging area, either as simple files or as updates inand DW querying tasks are handled by separate
database tables. The staging area may havesystems.
various stages. Extraction of data from sources,- development of a central metadata repository
transformation of data into new structures andwhich maintains documentation for all involved
data loading in the DW, a process known as ETL,systems: operational systems (data sources), ETL
takes places in the staging area.process, data warehouse, BI tools and predefined
The extraction process requires the determinationreports Various types of raw data processing,
of source relational tables - fields, from which datatake place at the staging area:
shall be extracted (as mentioned above,- Data standardization: data transformation to a
documentation of these structures is crucial forstandard format, if needed
design). The design of the extraction process- Sorting of records
determines:- Matching and merging records of the same
- the frequency of data extractionentity, which are derived from different sources
- the extraction method (e.g. changes only) and(e.g. order records of the same Customer from
technology (database partial replication)different order handling systems), after
- the database instance or the file in which datastandardization
are initially loaded, in the staging area Moreover,- Processing of calculated facts (facts derived
the volume of data to be extracted is estimated,from detailed data e.g. total monetary value of an
in order to plan for computational & storageorder)
capacity. Estimation sheets known as 'volumetric- Management of surrogate keys, which replace
sheets' are developed with the followingoperational systems keys
information per source field:- Enrichment of records with default values, if
- extraction frequencyrequired
- estimated volume- Production of aggregate data, if needed
- standardisation and transformation rules applied- Data conversion according to the technological
(if any)platform used by the DW (DBMS, operating
- DW database field to which data will be loadedsystem) The ETL process is automated by
In many cases, data quality assessment and datasoftware and executed periodically to update the
cleansing steps also take place in the staging area.DW.
Design and implementation of the automated ETL