Sunday, May 27, 2012

DW Basics


When we speak about a data warehouse two terms that we frequently come across are “Fact” and “Dimension”. What is a fact or a Dimension?
FACT : A fact table is the one that contains measures. So all the entities like counts, quantity, price etc are stored in fact table. Fact has basically two type of fields  measures and keys (foreign key) to connect to the dimension table.
Dimension : On the other hand  Dimension table is the one that is a subject against which we measure data.Dimension tables help us get answers to questions like when, what, where, who. For instance time dimension (Calendar Dimension) help us get answer for WHEN was product sold. Typically there are multiple dimensions and single fact in a simple data warehouse.
An interesting question was asked to one of my friend during an interview. The question was pretty simple “Why is fact table called a fact and Dimension table called as dimension, why not any other name?”.
My answer to this question is pretty simple. Whenever we look at things we look at it as a 2D or 3D (2nd or 3rd dimension) object.  Analysis of anything that is visible is done from the point of view of dimensions that is why the subject with respect to which we are viewing data is called a DIMENSION.
Coming to fact. “FACT” in literal terms means “truth”. Since we are displaying true values against dimensions that is way the table that contains measurable data (true data) is called a FACT table.
When speaking about the basics we often come across multiple terms related to Fact and Dimension, following are some commonly used terms and their meaning.
Junk Dimension:  While designing DW there are some columns which are not much important from the point of view of reporting but cannot be ignored (Ex Flags or indicators). Such things are stored in a separate table called junk dimension. We always see a store room in our house and things kept there are not actually used but we do not sell them, we keep them because they are important but are not currently being used. Junk Dimension contains these type of values, not used but cannot be ignored.
Role Playing Dimension : Most common example of role playing dimension is Date Dimension. Consider an environment where we are reporting for a large Product based organization which produces, ships and sells some products. Date in such scenario can be Manufacturing date or Shipped Date or Invoice date based on the process any product goes through. Date would come from same table  (Date Dimension) but will have different roles.
Conformed Dimension: Theoretically a conformed dimension is the one which behaves in same manner with respect to different Facts. To understand this we will consider a scenario where in we have multiple fact tables. Again let’s consider Date Dimension, Date and other attributes (day, week, month, quarter, year) would behave in a similar manner with respect to any of the two fact tables. It would convey same meaning across multiple facts.
Another important concept is about SCD (slowly changing dimensions), As the name suggests a Dimension which changes with time is called as slowly changing dimension. For instance address of a person can be dynamic; it can change with time also the age of person changes with time. All these are common examples of slowly changing dimensions. There are three basic techniques to deal with SCDs as follow.
SCD1: In this case new data replaces the old data. Historical data is not preserved.
SCD2: In this case new records are added in the Dimension table. The old records with old data are preserved and the new records contain new data.
SCD3:  New columns are added in the dimension table so that new field can hold new data and the old fields contain old data. Both values are preserved in this case.


Below is the basic architecture for DW dataflow



3 comments: