DWH INTERVIEW QUESTIONS
  
  
  Data Warehousing Concepts - Questions & Answers
  
  
  
  
  
    Data Warehousing Concepts - Detailed Q&A
  
  
    Interview Questions and Answers
    
      - What is Data Normalization and Denormalization? Normalization is the process of organizing data to eliminate redundancy and improve integrity. Denormalization introduces redundancy to improve read performance, often used in reporting systems for faster querying.
- Why would you like to Denormalize your design? Denormalization reduces the number of joins needed during query execution. It enhances performance in read-heavy operations like reporting, especially in analytical systems like data warehouses.
- What is Data Warehouse? A data warehouse is a centralized repository used to store integrated, historical data from multiple sources. It supports analytics and decision-making processes across the organization.
- Why do we need Data Warehouse? Data warehouses enable businesses to perform complex queries and analysis on large datasets, unify data from various systems, and generate valuable insights for strategic decisions.
- What is data Mart? A data mart is a smaller, subject-specific version of a data warehouse. It focuses on a particular business function like finance, marketing, or sales, and is faster and easier to implement.
- Difference between Bill Inmon and Ralph Kimball approaches? Inmon's approach is top-down, starting with a centralized, normalized enterprise data warehouse. Kimball's approach is bottom-up, using dimensional models and data marts that later combine to form a warehouse.
- What is Dimensional Model? A dimensional model organizes data into facts and dimensions to optimize querying and reporting. It's commonly used in data warehousing and supports star and snowflake schema structures.
- Difference between Dimensional and ER Model? Dimensional modeling focuses on ease of querying and reporting, while ER modeling is used for transaction processing. Dimensional models use star/snowflake schema; ER models use normalized relations.
- What is Fact table and types? Fact tables store numerical performance metrics (measures) and foreign keys to dimension tables. Types include Transactional, Snapshot, and Accumulating Fact Tables.
- What is FactLess Fact Table? A factless fact table stores only keys without measurable data. It is used to track events or conditions, such as student attendance or product availability.
- Steps in Designing a Fact Table? Identify the business process, define the grain, determine the facts (measures), and identify dimensions. Ensure consistency in the design for performance and accuracy.
- What is Measure and types? Measures are quantitative values used in analysis. Types include Additive (can be summed), Semi-additive (can be summed over some dimensions), and Non-additive (cannot be summed).
- What is Dimension table? A dimension table contains descriptive attributes (textual or categorical) to provide context to facts. Examples include customer, time, and product dimensions.
- Types of Dimension Tables? Types include Conformed, Junk, Role-playing, Degenerate, and Slowly Changing Dimensions (SCD), each designed to serve different analytical purposes.
- What is Slowly Changing Dimension? A Slowly Changing Dimension tracks changes to attribute values over time. This enables historical reporting, where changes in values are not lost.
- SCD1 vs SCD2? SCD1 overwrites existing data without preserving history. SCD2 creates new records to maintain historical changes, allowing accurate time-based reporting.
- Two columns in SCD2 for tracking? Common columns include EffectiveDate and ExpiryDate, or an IsCurrent flag to identify the latest version of the record.
- Preferred Dimension Type? SCD2 is widely preferred for tracking full history. SCD3 allows only limited history (previous and current), making it less versatile.
- SCD Type 0, 4, 6? Type 0 retains original data. Type 4 stores historical data in a separate table. Type 6 combines Types 1 and 2, showing current and historical info.
- What is Junk Dimension? A junk dimension combines unrelated flags or attributes with low cardinality into a single dimension to simplify schema design and improve performance.
- What is Conformed Dimension? A conformed dimension is shared across multiple fact tables or data marts, maintaining consistency in reporting across business processes.
- What is Date Dimension? A date dimension is a time-based lookup table with attributes like day, week, month, and fiscal periods. It's usually pre-loaded using scripts or tools.
- What is Star Schema? A star schema consists of a central fact table surrounded by denormalized dimension tables. It's simple and ideal for performance in BI tools.
- What is Snowflake Schema? A snowflake schema is a normalized version of a star schema, where dimension tables are split into sub-dimensions, making the model more complex but space efficient.
- Advantages/Disadvantages of Star Schema? Pros: Faster querying, simple joins, intuitive. Cons: Data redundancy and larger storage footprint.
- Advantages/Disadvantages of Snowflake? Pros: Less data redundancy, normalized. Cons: Complex joins, slower performance in queries compared to star schema.
- Steps to load DWH/Data Mart? Steps: Extract data → Cleanse → Transform → Stage → Load dimension tables → Load fact tables → Validate and audit.
- Which tables to load first? Load dimension tables first to resolve surrogate keys. Fact tables reference those keys, so loading them afterward ensures integrity.
- What is Early/Late Arriving Fact/Dimension? Early-arriving fact: fact arrives before dimension. Late-arriving dimension: dimension is delayed. Use dummy records or update processes to handle them.
- What is Natural Key? A natural key is a real-world identifier like Email, SSN, or CustomerID. It's used in source systems and often carries business meaning.
- What is Surrogate Key? A surrogate key is an artificial key, typically auto-incremented, used to uniquely identify rows in dimension tables and maintain consistency.
- What is Primary key? A primary key uniquely identifies a row in a table and ensures no duplicates. It enforces entity integrity.
- What is Foreign Key? A foreign key is a reference to the primary key in another table, establishing relationships between tables and enforcing referential integrity.
- Natural Key vs Surrogate Key? Surrogate keys are preferred in data warehouses to avoid issues from changes in natural keys and ensure consistency across systems.
- Approach for loading 300M records? Use staging tables, disable constraints and indexes temporarily, bulk load in batches, then rebuild indexes and update statistics.
- Best practices for big fact table? Yes, implement partitioning. Consider access patterns, update frequency, and data volume to improve query performance and manageability.
- Clustered/Non-clustered index on Surrogate Key? Use a clustered index on surrogate keys for faster joins. Consider non-clustered indexes on frequently queried attributes as needed.
- Indexes for frequently searched dimensions? Create non-clustered indexes on columns like name, category, or status that are used in WHERE or JOIN conditions frequently.
- Partitioning facts or dimensions? Partition large fact tables by date or region. Dimension tables are typically small and not partitioned unless extremely large.
- Ensure data integrity? Implement foreign keys, use constraints, regular audits, and ETL validation steps to ensure data accuracy and consistency.
- Recovery mode for 700M record load? Use SIMPLE or BULK_LOGGED mode for faster performance during initial load, and switch back to FULL for point-in-time recovery afterward.
- Deleting 100M wrong records? Use partition switching if supported, or delete in small batches using indexed filters. Ensure minimal logging and transaction logging overhead.
- Update statistics on fact tables? After major loads or weekly. Monitor query performance and execution plans to determine when to update statistics to maintain performance.
- What is Merge Statement? MERGE is a SQL operation that synchronizes two tables by performing INSERT, UPDATE, or DELETE based on a condition. It's used in upserts and delta processing.
 
 
 
          
      
 
  
 
 
 
 
 
 
 
 
 
 
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.