Terminology: DW, Data warehouse, ETL, Cube, Data Mart, Data Mining, Reporting, OLAP, OLTP, Metadata (Data Catalog, Data Directory, DWH Roadmap, Nerve Center), Data Drilling (Drill Up, Drill Through, Drill Across, Drill Down),

Data Source > [ ODS (Operational Data Source) ], Staging Area > Warehouse > Data Marts > Presentation

Reference: https://learndatamodeling.com/blog/time-dimension/,

https://www.1keydata.com/datawarehousing/junk-dimension.html

1. Data warehouse - Intro

Data comes or extract from multiple sources - clean up and Load the data - Generate the report from that - from the report management take decision
Supports Multidimensional Queries
Enable users to view the same data in different ways
Each aspect of information represents a different dimension

Abbrevations

Operational data store (ODS)

2. Table Types

2.1. Dimension Table

Primary Key and Master Data (non measurable data)

2.1.1. Types of Dimensions

2.1.1.1. Conform / Reusable Dim

A conformed dimension can exist as a single dimension table that relates to multiple fact tables within the same data warehouse, or as identical dimension tables in separate data marts.
Date is a common conformed dimension because its attributes (day, week, month, quarter, year, etc.) have the same meaning when joined to any fact table.

2.1.1.2. Junk Dim

The content in the junk dimension table is the combination of all possible values of the individual indicator fields.
https://www.1keydata.com/datawarehousing/junk-dimension.html

2.1.1.3. Degenerated Dim

Sometimes a dimension is deﬁned that has no content except for its primary key. For example, when an invoice has multiple line items, the line item fact rows inherit all the descriptive dimension foreign keys of the invoice, and the invoice is left with no unique content. But the invoice number remains a valid dimension key for fact tables at the line item level. This degenerate dimension is placed in the fact table with the explicit acknowledgment that there is no associated dimension table. Degenerate dimensions are most common with transaction and accumulating snapshot fact tables.

2.1.1.4. Slowly Changing Dim (SCD) (3 Types)

SCD Type 1 - Latest / current data

New record will be inserted - No history maintained
Updated data will be updated without maintaining history

SCD Type 2 - Historical Data

Earlier record’s UpdatedData as system date
Update will be inserted as New record with future updated date

SCD Type 3 - Partial History

Only one previous data based on table design available - need to define which column we need to maintain the partial history

2.2. Fact Table

PKs of DIMs / FKs and Measurable Columns/data

2.2.1. Additive Fact - Star Schema ???

can be summed up across all dimensions in the fact table, such as sales
All dim tables primary key connected with Fact Table

2.2.2. Semi Additive Fact - Snowflake Schema???

can be summed up for some, but not all, dimensions in the fact table, such as a headcount
Some of the Dim primary keys connected with Fact table and some connected with other DIM tables

2.2.3. Non Additive Fact

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

2.2.4. Fact Less Fact

3. Schema Types

3.1. Star Schema

Fact connected with all dimensions

3.2. Snowflake Schema

Fact table only few of the dim keys connected

4. Approach

4.1. Top Down

Data will flow from DW to Data Mart

4.2. Bottom Up

Data Mart to Data warehouse

5. Data Mart

Small set of subject data area

5.1. Types of Data Mart

5.1.1. Dependent Data Mart

5.1.2. Independent Data Mart

5.1.3. Hybrid Data Mart

6. Data Modeling

Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.
Data models are used to describe how the data is stored, accessed, and updated in a DBMS. A set of symbols and text is used to represent them so that all the members of an organization can understand how the data is organized.

6.1. Types of data modeling

6.1.1. Relational:

Although “older” in approach, the most common database model still in use today is relational, which stores the data in fixed-format records and arranges data in tables with rows and columns. The most basic type of data model has two elements: measures and dimensions. Measures are numeric values, such as quantities and revenue, used in mathematical calculations like sum or average. Dimensions can be text or numeric. They are not used in calculations and include descriptions or locations. The raw data is defined as a measure or a dimension. Other terminology used in relational database design includes “relations” (the table with rows and columns), “attributes” (columns), “tuples” (rows), and “domain” (set of values allowed in a column). While there are additional terms and structural requirements that define a relational database, the important factor is the relationships defined within that structure. Common data elements (or keys) link tables and data sets together. Tables can also be related explicitly, like parent and child relationships including one-to-one, one-to-many, or many-to-many.

6.1.2. Dimensional:

Less rigid and structured, the dimensional approach favors a contextual data structure that is more related to the business use or context. This database structure is optimized for online queries and data warehousing tools. Critical data elements, like a transaction quantity for example, are called “facts” and are accompanied by reference information called “dimensions,” be that product ID, unit price, or transaction date. A fact table is a primary table in a dimensional model. Retrieval can be quick and efficient – with data for a specific type of activity stored together – but the lack of relationship links can complicate analytical retrieval and use of the data. Since the data structure is tied to the business function that produces and uses the data, combining data produced by dissimilar systems (in a data warehouse, for instance) can be problematic.

6.1.3. Entity-Rich (E-R):

An E-R model represents a business data structure in graphical form containing boxes of various shapes to represent activities, functions, or “entities” and lines to represent associations, dependencies, or “relationships.” The E-R model is then used to create a relational database with each row representing an entity and the fields in that row contain attributes. As in all relational databases, “key” data elements are used to link tables together.

6.2. Three levels of data abstraction

6.2.1. Conceptual Model Design

6.2.2. Logical

6.2.3. Physical

6.3. Critical Relationship in a Data Model

6.3.1. Identifying

A relationship line normally connects parent and child tables. But if a child table’s reference column is part of the table’s primary key, the tables are connected by a thick line, signifying an identifying relationship.

6.3.2. Non-identifying

Non-identifying. If a child table’s reference column is NOT a part of the table’s primary key, the tables are connected by a dotted line, signifying a no-identifying relationship.

6.3.3. Self-recursive.

Self-recursive. A recursive relationship is a standalone column in a table connected to the primary key in the same table.

7. Comparison

7.1. OLTP Vs OLAP

OLTP	OLAP
Online Transaction Procession Operational System	Online Analytical Processing System
Focuses on small transactions - huge transactions	Low transactions - More data (historical also)
Uses Operational Data	Uses consolidated Data

7.2. Data Warehouse Vs Data Marts

Data Warehouse	Data Marts
Enterprise wide data	Department wide data
Multiple subject areas	Single subject area
Multiple data sources	Limited data sources
Occupies large memory	Occupies limited memory
Long time to implement	Shorter time to implement

Search This Blog

Thursday, March 14, 2024

DW