Friday, September 18, 2015

Data Warehouse / Data Marts – Fact Tables

Facts tables play a very dynamic role in the data warehouses or data marts because they consist of the measurements, metrics or facts of a business process e.g., sales revenue by month by product.

Example of Fact Table

In the schema above, we have fact table FactSales that has a grain which gives us a number of units sold by date, by store, by customer and by product.
All other tables such as DimDate, DimStore, DimCustomer and DimProduct are dimensions tables. This schema is known as star schema.

In this fact table, each entry represents the sale of a specific product on a specific day to a specific customer in accordance with a specific promotion at a specific store. The business measurements captured are the value of the sale, the cost to the store, and the quantity sold.

In a very simple way, we can say that a fact table is nothing but might be contained business sales events such as cash register transactions or the contributions and expenditures of a nonprofit organization. Fact tables usually contain large numbers of rows, sometimes in the hundreds of millions of records when they contain one or more years of history for a large organization.
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "Factless fact tables", or "Junction tables". The "Factless fact tables" can for example be used for modeling many-to-many relationships or capture events

Important Features of Fact Tables
  1. They placed at the center of a star schema or a snowflake schema surrounded by dimension tables. Where multiple fact tables are used, these are arranged as a fact constellation schema. A fact table typically has two types of columns: those that contain facts and those that are a foreign key to dimension tables.
  2. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures like additive, non-additive, and semi additive measures.
  3. Mostly fact tables contain numerical data (facts) that can be summarized to provide information about the history of the operation of the organization.
  4. Each fact table also includes a multipart index that contains as foreign keys the primary keys of related dimension tables, which contain the attributes of the fact records.
  5. Fact tables should not contain descriptive information or any data other than the numerical measurement fields and the index fields that relate the facts to corresponding entries in the dimension tables.
  6. Fact tables provide the (usually) additive measures that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined.
  7. Additive measures allow summary information to be obtained by adding various quantities of the measure, such as the sales of a specific item at a group of stores for a particular time period. Non-additive measures such as inventory quantity-on-hand values can also be used in fact tables, but different summarization techniques must then be used.
Measure types
Fact table can store different types of measures such as given below-
  • Additive Measure– As it name implied, additive measures are measures which can be added across all dimensions.
  • Non-additive Measure– different from additive measures, non-additive measures are measures that cannot be added across all dimensions.
  • Semi additive Measure– semi additive measures are measure that can be added across only some dimensions and not across other.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).
Special care must be taken when handling ratios and percentage. One good design rule  is to never store percentages or ratios in fact tables but only calculate these in the data access tool. Thus only store the numerator and denominator in the fact table, which then can be aggregated and the aggregated stored values can then be used for calculating the ratio or percentage in the data access tool.


Types of fact tables
All fact tables are categorized by four most basic measurement events:
Transactional – Transactional fact table is the most basic one that each grain associated with it indicated as “one row per line in a transaction”, e.g., every line item appears on an invoice. Transaction fact table stores data of the most detailed level therefore it has high number of dimensions associated with.
Periodic snapshots – A Periodic snapshots fact table stores data that is a snapshot in a period of time. The source data of periodic snapshots fact table is data from a transaction fact table where you choose period to get the output. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.
Accumulating snapshots – The accumulating snapshots fact table describes activity of a business process that has clear beginning and end. This type of fact table therefore has multiple date columns to represent milestones in the process. A good example of accumulating snapshots fact table is processing of a material. As steps towards handling the material are finished, the corresponding record in the accumulating snapshots fact table get updated.
Temporal snapshots - By applying Temporal Database theory and modelling techniques the Temporal Snapshot Fact Table allows to have the equivalent of daily snapshots without really having daily snapshots. It introduces the concept of Time Intervals into a fact table, allowing to save a lot of space, optimizing performances while allowing the end user to have the logical equivalent of the "picture of the moment" he is interested in.

Aggregation in Fact Tables
Aggregation is the process of calculating summary data from detail records. It is often tempting to reduce the size of fact tables by aggregating data into summary records when the fact table is created. However, when data is summarized in the fact table, detailed information is no longer directly available to the analyst. If detailed information is needed, the detail rows that were summarized will have to be identified and located, possibly in the source system that provided the data. Fact table data should be maintained at the finest granularity possible. Aggregating data in the fact table should only be done after considering the consequences.

Designing fact table steps
There are very simple four steps to design a fact table described by Kimball:
Identify business process to model – The first step is to decide what business process to model by gathering and understanding business needs and available data such as Identify a business process for analysis (like sales)
Identify measures of facts – by declaring a grain means describing exactly what a fact table record represents
Identify dimensions for facts once grain of fact table is stated clearly, it is time to determine dimensions for the fact table.

Sunday, September 13, 2015

SQL – Views

Views are the most amazing feature in the SQL because they are just pre-established query commands known as Virtual Tables which are kept in the database dictionary. In some NoSQL databases, views are the only way to query data. Views hide the database complexity and they are good for providing security, de-normalization and  data model abstraction. Views are commonly used to implement business logic. 

Views can centralise or consolidate data and they save a lot of repeated complex JOIN statements in your SQL scripts. This would sometimes be handy, straight forward and easier than writing out the join statements in every query. Views also provide an abstracting layer preventing direct access to the tables.
In another parallel with functions, database users can manipulate nested views, thus one view can aggregate data from other views. Without the use of views, the normalization of databases above second normal form would become much more difficult. Views can make it easier to create loss less join decomposition. 

Advantages of Views
Views can provide advantages over tables:
  • Views can represent a subset of the data contained in a table. Consequently, a view can limit the degree of exposure of the underlying tables to the outer world: a given user may have permission to query the view, while denied access to the rest of the base table.
  • Views can be used as security mechanisms by letting users access data through the view, without granting the users permissions to directly access the underlying base tables of the view. 
  • Views can join and simplify multiple tables into a single virtual table.
  • Views can act as aggregated tables, where the database engine aggregates data (sum, average, etc.) and presents the calculated results as part of the data.
  • Views can hide the complexity of data. For example, a view could appear as Sales2000 or Sales2001, transparently partitioning the actual underlying table.
  • Views take very little space to store; the database contains only the definition of a view, not a copy of all the data that it presents.
  • Depending on the SQL engine used, views can provide extra security.
  • Views can also be used when you copy data to and from SQL Server to improve performance and to partition data.
  • Views can be used to provide a backward compatible interface to emulate a table that used to exist but whose schema has changed. 
  • Views are a nice way of providing something simple to report writers. 
  • In a sense views deformalize, Denormalization is sometimes necessary to provide data in a more meaningful manner.
Due to having the virtual nature, a view has no physical existence in the database until it is invoked and this is the biggest reason that you cannot put constraints on a view.  Just like a base table name, the name of the view must be unique within the entire database schema. The view definition cannot reference itself, since it does not exist yet and they are generally used to focus, simplify, and customise the perception each user has of the database. 

Types of Views
Besides the standard role of basic user-defined views, SQL Server provides the following types of views that serve special purposes in a database.

Indexed Views
An indexed view is a view that has been materialized. This means the view definition has been computed and the resulting data stored just like a table. You index a view by creating a unique clustered index on it. Indexed views can dramatically improve the performance of some types of queries. Indexed views work best for queries that aggregate many rows. They are not well-suited for underlying data sets that are frequently updated.

Partitioned Views
A partitioned view joins horizontally partitioned data from a set of member tables across one or more servers. This makes the data appear as if from one table. A view that joins member tables on the same instance of SQL Server is a local partitioned view.

System Views
System views expose catalog metadata. You can use system views to return information about the instance of SQL Server or the objects defined in the instance. For example, you can query the sys.databases catalog view to return information about the user-defined databases available in the instance. Microsoft SQL Server provides the following collections of system views that expose metadata:
  1. Catalog ViewsCatalog views return information that is used by the SQL Server Database Engine and they are the most general interface to the catalog metadata and provide the most efficient way to obtain, transform, and present customized forms of this information. All user-available catalog metadata is exposed through catalog views.
  2. Information Schema Views: An information schema view is one of several methods SQL Server provides for obtaining metadata. Information schema views provide an internal, system table-independent view of the SQL Server metadata. Information schema views enable applications to work correctly although significant changes have been made to the underlying system tables. The information schema views included in SQL Server comply with the ISO standard definition for the INFORMATION_SCHEMA.
  3. Compatibility Views: Many of the system tables from earlier releases of SQL Server are now implemented as a set of views. These views are known as compatibility views, and they are meant for backward compatibility only. The compatibility views expose the same metadata that was available in SQL Server 2000. However, the compatibility views do not expose any of the metadata related to features that are introduced in SQL Server 2005 and later. Therefore, when you use new features, such as Service Broker or partitioning, you must switch to using the catalog views.
  4. Replication Views: These views contain information that is used by replication in Microsoft SQL Server. The views enable easier access to data in replication system tables. Views are created in a user database when that database is enabled as a publication or subscription database. All replication objects are removed from user databases when the database is removed from a replication topology. The preferred method for accessing replication metadata is by using Replication Stored Procedures. 
  5. Data-tier Application Views (Transact-SQL): The following views (dbo.sysdac_instances (Transact-SQL), sysdac_instances_internal (Transact-SQL)) display information about data-tier application (DAC) instances. These views are in the msdb database and are owned by the dbo schema.
  6. Dynamic Management Views and FunctionsDynamic management views and functions return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance. There are two types of dynamic management views and functions:
  • Server-scoped dynamic management views and functions. These require VIEW SERVER STATE permission on the server.
  • Database-scoped dynamic management views and functions. These require VIEW DATABASE STATE permission on the database.
When to Use a View
In our daily needs, there are a number of scenarios where we will like to create our own View:
  1. To hide the complexity of the underlying database schema, or customize the data and schema for a set of users.
  2. To control the access on rows and columns of data in the tables.
Conclusion
A view is a virtual table, through which a selective portion of the data from one or more tables can be seen and they do not contain data of their own. They are used to restrict access to the database or to hide data complexity. A view is stored as a SELECT statement in the database. DML operations on a view like INSERT, UPDATE, DELETE affects the data in the original table upon which the view is based. Views are a nice way of providing something simple to report writers. 
Learn more on another features of SQL as:
  1. SQL - Stored Procedure
  2. SQL - Create Stored Procedure
  3. SQL - Execute Stored Procedure
  4. SQL - Alter Stored Procedure
  5. SQL - Views
  6. SQL – Stored Procedure Vs Ad-Hoc (In-Line)
  7. SQL - Merge Statement
  8. SQL - Functions
  9. SQL - Cursors
Reference
https://msdn.microsoft.com/en-us/library/ms190174(v=sql.110).aspx

Friday, September 11, 2015

DW - OLAP Servers

OLAP is a powerful analysis tool for forecasting, statistical computations, aggregations and involves more than just the multidimensional display of information. OLAP tools also must be able to extract and summarise requested data according to the needs of an end user, and there are two approaches for this data extraction that need to be discussed.
As we know about that data in a data warehouse is organized to support analysis rather than to process real-time transactions as in online transaction processing systems (OLTP).  OLAP technology enables data warehouses to be used effectively for online analysis, providing rapid responses to iterative complex analytical queries.

OLAP is stand for Online Analytical Processing and OLTP Server is the chief component which stays between a client and a database management systems (DBMS).

In very simple words, OLAP servers present business users with multidimensional data from data warehouse or data marts, without concerns regarding how or where the data are stored. The OLAP servers are key points to understand that how data is organized in the database and has special functions for analyzing the data.

OLAP's multidimensional data model and data aggregation techniques organise and summarise large amounts of data so it can be evaluated quickly using online analysis and graphical tools. The answer to a query into historical data often leads to subsequent queries as the analyst searches for answers or explores possibilities. OLAP systems provide the speed and flexibility to support the analyst in real time.
Types of OLAP Servers
Cubes in a data warehouse are stored in three different modes and we can have four types of OLAP servers which are given below:
Relational OLAP (ROLAP) Servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middle-ware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. 
The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
Advantages of ROLAP
  1. ROLAP is considered to be more scalable in handling large data volumes, especially models with dimensions with very high cardinality (i.e., millions of members).
  2. With a variety of data loading tools available, and the ability to fine-tune the ETL code to the particular data model, load times are generally much shorter than with the automated MOLAP loads.
  3. The data are stored in a standard relational database and can be accessed by any SQL reporting tool (the tool does not have to be an OLAP tool).
  4. ROLAP tools are better at handling non-aggregable facts (e.g., textual descriptions). MOLAP tools tend to suffer from slow performance when querying these elements.
  5. By decoupling the data storage from the multi-dimensional model, it is possible to successfully model data that would not otherwise fit into a strict dimensional model.
  6. The ROLAP approach can leverage database authorization controls such as row-level security, whereby the query results are filtered depending on preset criteria applied, for example, to a given user or group of users (SQL WHERE clause).
Disadvantages of ROLAP
  1. There is a consensus in the industry that ROLAP tools have slower performance than MOLAP tools. However, see the discussion below about ROLAP performance.
  2. The loading of aggregate tables must be managed by custom ETL code. The ROLAP tools do not help with this task. This means additional development time and more code to support.
  3. When the step of creating aggregate tables is skipped, the query performance then suffers because the larger detailed tables must be queried. This can be partially remedied by adding additional aggregate tables, however it is still not practical to create aggregate tables for all combinations of dimensions/attributes.
  4. ROLAP relies on the general purpose database for querying and caching, and therefore several special techniques employed by MOLAP tools are not available (such as special hierarchical indexing). However, modern ROLAP tools take advantage of latest improvements in SQL language such as CUBE and ROLLUP operators, DB2 Cube Views, as well as other SQL OLAP extensions. These SQL improvements can mitigate the benefits of the MOLAP tools.
  5. Since ROLAP tools rely on SQL for all of the computations, they are not suitable when the model is heavy on calculations which don't translate well into SQL. Examples of such models include budgeting, allocations, financial reporting and other scenarios.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to pre-computed summarized data.
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. Most MOLAP solutions store these data in optimized multidimensional array storage, rather than in a relational database.

Advantages of MOLAP
  1. Fast query performance due to optimized storage, multidimensional indexing and caching.
  2. Smaller on-disk size of data compared to data stored in relational database due to compression techniques.
  3. Automated computation of higher level aggregates of the data.
  4. It is very compact for low dimension data sets.
  5. Array models provide natural indexing.
  6. Effective data extraction achieved through the pre-structuring of aggregated data.
  7. Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
Disadvantages of MOLAP
  1. Within some MOLAP Solutions the processing step (data load) can be quite lengthy, especially on large data volumes. This is usually remedied by doing only incremental processing, i.e., processing only the data which have changed (usually new data) instead of reprocessing the entire data set.
  2. Some MOLAP methodologies introduce data redundancy.
Hybrid OLAP Servers: They are combination of ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP) which are other possible implementations of OLAP. HOLAP allows storing part of the data in a MOLAP store and another part of the data in a ROLAP store, allowing a tradeoff of the advantages of each. The degree of control that the cube designer has over this partitioning varies from product to product.

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.


Specialized SQL Servers: Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

References:
http://social.technet.microsoft.com/wiki/contents/articles/19898.aspx

Tuesday, September 8, 2015

DW - Multidimensional Cube Types

As we know that, a data cube stores data in a summarised version which helps in a faster analysis of data. Based on the business requirement, you can have different types of the cubes such as given below:

Regular cubes- These are very common cubes which are based on real tables for their data source. They will have aggregations and will occupy physical storage space of some kind. If there are some changes in the contributed data source then these cubes must be reprocessed.

Virtual cubes -They are also known as logical cubes which are based on one or more regular cubes or linked cubes. Virtual cubes use the aggregations of their component regular cubes or linked cubes. This is the main reason that they not required any storage space.

Linked cubes - Linked cubes are based on regular cubes defined and stored on another Analysis Server. Linked cubes also use the aggregations and storage of the regular cube they reference. The linking in cube ensures that the data in the cubes remain consistent. Linking different data cubes reduces the possibility of sparse data. Linked cubes are the cubes that are linked in order to make the data remain constant.

Local cubes - Local cubes are entirely contained in portable files (tables) and can be browsed without a connection to an Analysis Server. They do not have aggregations. This is really like being in "disconnected" mode.

Real-time cubes - A real-time cube is a cube in which one or more Relational Online Analytical Processing Server (ROLAP) partitions or dimensions support real-time updates. Multiple dimensions or partitions can support real-time updates, and a real-time cube can have a mixture of dimensions or partitions that may or may not be enabled for real-time updates. Because of the complexity involved in managing such real-time cube data, the requirements for creating a real-time cube are more stringent than for a regular cube.

In Analysis Services, real-time Online Analytical Processing Server (OLAP) represents the capability to quickly retrieve, organise, aggregate and present multidimensional data for cubes and dimensions whenever the data changes in the underlying relational data source, without requiring the cube or dimension to be explicitly processed first.

Write-enabled cubes - The write-enabled cube is often of limited scope, receives smaller and less regular data updates, and is used by smaller numbers of more specialized (often expert) users.  These cubes are sometimes mapped to a data warehouse but more frequently are built against a specialized data mart.  On the topic of security, any end-users you wish to perform write back to the cube must be granted Read/Write permissions at the cube level.


Please visit to know more on -
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables.
  8. Data Warehouse - Fact tables.
  9. Data Warehouse - Conceptual Modeling.
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers 

Monday, September 7, 2015

DW - Multidimensional Cube

Cubes are known as multidimensional (composition of fact tables and dimensions) data processing units to conceptually straightforward operations to retrieve decision support information from the data warehouse or data marts. The edge of the any cube contains dimension members and the body of the cube contains data values.

They can have 2-dimensional, 3-dimensional or more up to 128 dimensions as per the business requirement which are mainly designed and used for analysis purpose.
In the cube, each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest.
Simply, a cube provides an easy-to-use mechanism for querying data with quick and uniform response times. End users use client applications to connect to an Analysis server and query the cubes on the server. In most client applications, end users issue a query on a cube by manipulating the user interface controls, which determine the contents of the query.

As for an example, you may have the following dimensions in a warehouse for keeping date, customer, store, product and sales data.
CustomerID
StoreId
ProductId
Date 
Sales
If you want to draw the dimension layout then it will support 4D "cube" layout which would allow queries to be easily performed for customers that may have different purchasing patterns throughout the year (and even at different times of the month). The above example has four dimensions (date, customer, store, and product) and one fact (sales) which could follow the star schema model look like as:
 

Conceiving data as a cube with hierarchical dimensions leads to conceptually straightforward operations to facilitate analysis. Aligning the data content with a familiar visualization enhances analyst learning and productivity. The user-initiated process of navigating by calling for page displays interactively, through the specification of slices via rotations and drill down/up is sometimes called "slice and dice". Common operations include slice and dice, drill down, roll up, and pivot.

If you try to view the data in the cube model then it would be display like as shown below:

A dimension (table) is a look-up table for properties of objects that rarely change. Product, customer and store may change some of their properties (attributes), but they rarely do. Dimension levels are a powerful data modeling tool because they allow end users to ask questions at a high level and then expand a dimension hierarchy to reveal more detail.

Fact table captures interactions between these objects. At the intersection of dimensions date, store, product and customer lies a measure SalesAmount. In addition, measures are the central values of a cube that are analyzed. That is, measures are the numeric data of primary interest to end users browsing a cube. 

A measure group has one or more partitions to manage the physical data. A measure group can have a default aggregation design. The default aggregation design can be used by all partitions in the measure group; also, each partition can have its own aggregation design.

Note how easy is to aggregate (sum) the sales amount by date, by year, by product, by brand, by city, by country, by age group, by whatever which was the idea in the first place.

Subordinate to the cubes
In the object hierarchy, cubes are immediately subordinate to the database. A database just behaves like a container for all related cubes and the objects they share. Before creating a cube, we must create or have a database. In the object hierarchy, the following objects are immediately subordinate to the cube:

Data sources
The most important point is that a cube can have a single data source. It can be selected from the data sources in the database or created during cube creation. A cube’s dimensions must have the same data source as the cube, but its partitions can have different data sources.

Measures
True fact about the measures that a cube's measures are not shared with other cubes. They are created when the cube is created. A cube can contain up to 128 dimensions, each with millions of members, and up to 1,024 measures.

Dimensions
A cube's dimensions are either shared with other cubes in the database or private to the cube. Shared dimensions or public dimension can be created before or during cube creation. Private dimensions are created when the cube is created. Although the term cube suggests three dimensions but a cube can have up to 128 dimensions. Dimension members should be textual and are used as criteria for queries and as row and column headers in query results.

Partitions
A single partition is automatically created for a cube when the cube is created. If you have installed Analysis Services for SQL Server 2000 Enterprise Edition, after creating a cube, you can create additional partitions in the cube.

Cube roles
Every cube must have at least one cube role in order to provide access to end users. Cube roles are derived from database roles, which can be created before or after cube creation. They are created after cube creation.

Commands
Commands are optional. Commands are created after cube creation.


Please visit to know more on -
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables.
  8. Data Warehouse - Fact tables.
  9. Data Warehouse - Conceptual Modeling.
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers 
References:
https://technet.microsoft.com/en-us/library/aa216365(v=sql.80).aspx