| Item | Details |
|---|---|
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions total |
Complete the comparison table for four analytics types:
| Analytics Type | Key Question | Example Technique | Business Example |
|---|---|---|---|
| Descriptive | ? | ? | ? |
| Diagnostic | ? | ? | ? |
| Predictive | ? | ? | ? |
| Prescriptive | ? | ? | ? |
💡 Click to View Answer & Solution
| Analytics Type | Key Question | Example Technique | Business Example |
|---|---|---|---|
| Descriptive | What happened? | Summary statistics, dashboards, reporting | Monthly sales report showing $2M revenue |
| Diagnostic | Why did it happen? | Drill-down analysis, correlation analysis | Sales drop due to competitor price cut |
| Predictive | What will happen? | Regression, ML models, forecasting | Customer churn prediction (70% likely to leave) |
| Prescriptive | What should we do? | Optimization, simulation, decision models | Optimal inventory levels to minimize costs |
Memory Trick: the four types answer, in order, "What happened? Why did it happen? What will happen? What should we do?" (look back, explain, look ahead, act).
Match each business problem to the correct analytics type and explain:
A: "Our sales dropped 15% last quarter. What caused this decline?"
B: "Which products should we recommend based on browsing patterns?"
C: "What is the optimal price point to maximize profit?"
D: "What were our top 5 selling products last month?"
💡 Click to View Answer & Solution
A: "Sales dropped 15%. What caused this?" → Diagnostic. It asks why something already happened (root-cause analysis).
B: "Which products to recommend based on browsing?" → Predictive. It uses past behavior patterns to forecast what a customer will respond to.
C: "Optimal price point to maximize profit?" → Prescriptive. It asks what action to take, typically answered with optimization.
D: "Top 5 selling products last month?" → Descriptive. It summarizes what already happened.
List and describe the SIX stages of the Data Analytics Lifecycle, in order.
💡 Click to View Answer & Solution
Stage 1: DISCOVERY - frame the business problem, assess available data and resources, and form initial hypotheses
Stage 2: DATA PREPARATION - acquire, clean, and transform the data into an analytic sandbox
Stage 3: MODEL PLANNING - select candidate techniques, variables, and the general modeling approach
Stage 4: MODEL BUILDING - create training and test sets, then fit and tune the chosen models
Stage 5: COMMUNICATE RESULTS - present findings, limitations, and business value to stakeholders
Stage 6: OPERATIONALIZE - deploy the model to production (often via a pilot) and monitor its performance
Memory Trick: D-D-M-M-C-O = "Data Doctors Make Models, Communicate, Operate"
Identify which lifecycle stage each scenario describes:
A: "The team is cleaning missing values and removing outliers."
B: "Management wants to understand why churn increased. Team is defining specific questions."
C: "The model is deployed in production. Team monitors accuracy weekly."
D: "Data scientists are testing Random Forest, SVM, and Logistic Regression."
💡 Click to View Answer & Solution
A: Cleaning missing values and outliers → Stage 2: Data Preparation
B: Defining specific questions to answer → Stage 1: Discovery
C: Model deployed, monitoring accuracy → Stage 6: Operationalize
D: Testing multiple algorithms → Stage 4: Model Building
Real estate price prediction model:
$$\text{Price} = 50000 + 200 \times \text{SquareFeet} + 15000 \times \text{Bedrooms} - 5000 \times \text{Age}$$
| Statistic | Value |
|---|---|
| R² | 0.82 |
| Adjusted R² | 0.80 |
| All p-values | < 0.01 |
| Sample size | 200 houses |
Interpret each coefficient in plain business language.
💡 Click to View Answer & Solution
Constant (50,000): the baseline predicted price when all predictors are zero; not meaningful on its own (no house has 0 sq ft), it simply anchors the regression line.
SquareFeet Coefficient (200): each additional square foot adds $200 to the predicted price, holding other variables constant.
Bedrooms Coefficient (15,000): each additional bedroom adds $15,000 to the predicted price, holding other variables constant.
Age Coefficient (-5,000): each additional year of age lowers the predicted price by $5,000, holding other variables constant.
Key Phrase: Always say "holding other variables constant"
Interpret R² = 0.82. Then calculate the predicted price for a house with: 2,000 sq ft, 3 bedrooms, 10 years old.
💡 Click to View Answer & Solution
R² Interpretation:
R² = 0.82 means 82% of the variation in house prices is explained by square feet, bedrooms, and age. The remaining 18% is due to other factors (location, condition, etc.). This is a good model for real estate.
Price Prediction: $\text{Price} = 50000 + 200(2000) + 15000(3) - 5000(10)$ $= 50000 + 400000 + 45000 - 50000$ $= 445000$
Predicted Price = $445,000
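As a quick sanity check, the equation can be evaluated in a few lines of Python (the coefficients come straight from the question; the function name is just for illustration):

```python
# Fitted model from the question: Price = 50000 + 200*sqft + 15000*beds - 5000*age
def predict_price(sqft, beds, age):
    """Predicted house price from the exam's regression equation."""
    return 50000 + 200 * sqft + 15000 * beds - 5000 * age

# The exam's example house: 2,000 sq ft, 3 bedrooms, 10 years old
print(predict_price(2000, 3, 10))  # 445000
```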
Compare two investment houses:
House A: 1,800 sq ft, 3 bedrooms, 5 years old
House B: 1,500 sq ft, 4 bedrooms, 2 years old
Which has higher predicted price? Calculate the difference.
💡 Click to View Answer & Solution
House A: $= 50000 + 200(1800) + 15000(3) - 5000(5)$ $= 50000 + 360000 + 45000 - 25000$ $= 430000$
House B: $= 50000 + 200(1500) + 15000(4) - 5000(2)$ $= 50000 + 300000 + 60000 - 10000$ $= 400000$
Results: House A = $430,000; House B = $400,000. House A has the higher predicted price, by $30,000.
Insight: Square footage has more impact than bedrooms. House A's extra 300 sq ft ($60,000 value) outweighs House B's extra bedroom ($15,000).
Complete the algorithm comparison table:
| Aspect | Decision Tree | KNN | Naive Bayes |
|---|---|---|---|
| Algorithm Type | ? | ? | ? |
| How it works | ? | ? | ? |
| Main Advantage | ? | ? | ? |
| Main Disadvantage | ? | ? | ? |
| Best Use Case | ? | ? | ? |
💡 Click to View Answer & Solution
| Aspect | Decision Tree | KNN | Naive Bayes |
|---|---|---|---|
| Type | Both (Classification & Regression) | Both | Classification |
| How it works | Splits data using if-then rules based on feature thresholds | Classifies based on k nearest neighbors' majority vote | Uses Bayes theorem with feature independence assumption |
| Advantage | Easy to interpret, visual | Simple, no training needed | Fast, works well with small data |
| Disadvantage | Prone to overfitting | Slow prediction (compares all points) | Assumes feature independence (often unrealistic) |
| Best Use Case | When explainability matters (credit decisions) | When similar items cluster together (recommendations) | Text classification (spam detection) |
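To make the KNN row concrete, here is a minimal pure-Python sketch of the "majority vote of the k nearest neighbors" idea (the toy data and function name are illustrative, not from the exam):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    Note there is no training step: KNN just stores the data, which is
    why prediction is slow (it scans every stored point).
    """
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated classes
train = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 1), 'A'),
         ((8, 8), 'B'), ((8, 9), 'B'), ((9, 8), 'B')]
print(knn_predict(train, (2, 2)))   # A
print(knn_predict(train, (9, 9)))   # B
```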
Scenario: Build an email spam classifier on high-dimensional text features (e.g., word counts).
Which algorithm would you choose? Why not the others?
💡 Click to View Answer & Solution
Best Choice: Naive Bayes
Reasons: Naive Bayes is fast to train and predict, handles many features gracefully, and is the standard baseline for text classification; it performs well even with modest training data.
Why NOT Decision Tree: a single tree over thousands of sparse word features is prone to overfitting, and its main advantage (a small, interpretable set of rules) is lost at that scale.
Why NOT KNN: it has no training step, so every prediction must compare the new email against all stored emails; that is slow, and distance measures degrade in high-dimensional text space.
Summary: the independence assumption is "naive" (words are not truly independent), yet in practice Naive Bayes works remarkably well for spam detection.
Explain the difference. Give an example where two variables are correlated but NOT causally related.
💡 Click to View Answer & Solution
Correlation: Two variables move together (positive or negative relationship)
Causation: One variable directly causes changes in another
Key Difference: Correlation ≠ Causation. Just because A and B move together doesn't mean A causes B (or B causes A).
Example: Ice cream sales and drowning deaths rise and fall together, but neither causes the other; hot weather (a confounding variable) drives both.
Other Examples: children's shoe size and reading ability (confounder: age); number of firefighters at a fire and the damage done (confounder: the size of the fire).
| Block | Topic | Points |
|---|---|---|
| Block 1 | Analytics Types | 25 |
| Block 2 | Lifecycle | 25 |
| Block 3 | Regression | 25 |
| Block 4 | Data Mining | 25 |
| Total | | 100 |
| Bonus | Correlation/Causation | +5 |
Analytics Types: Descriptive (what happened) → Diagnostic (why) → Predictive (what will happen) → Prescriptive (what should we do)
Lifecycle: Discovery → Data Prep → Model Plan → Model Build → Communicate → Operationalize
Regression: Coefficient = change in Y per 1-unit change in X
R²: % of variance explained by model
Show your work for partial credit. Good luck!
| Item | Details |
|---|---|
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions |
A hospital wants to reduce patient readmission rates. For each analytics approach, give a specific example of how it could help:
a) Descriptive Analytics
b) Diagnostic Analytics
c) Predictive Analytics
d) Prescriptive Analytics
💡 Click to View Answer & Solution
a) Descriptive Analytics: "Dashboard showing readmission rates by department, age group, and diagnosis. Example: 'Cardiology has 15% readmission rate vs. 8% hospital average.'"
b) Diagnostic Analytics: "Root cause analysis to understand WHY readmissions happen. Example: 'Patients discharged on Friday have 20% higher readmission - likely due to weekend pharmacy closures.'"
c) Predictive Analytics: "ML model predicting which patients are likely to be readmitted. Example: 'Patient John has 78% probability of readmission within 30 days based on his diagnosis, age, and prior history.'"
d) Prescriptive Analytics: "Recommending specific interventions. Example: 'For high-risk patients, schedule follow-up call within 48 hours, arrange home nurse visits, and ensure medication delivery.'"
Key Pattern: Descriptive reports the problem, Diagnostic explains it, Predictive anticipates it, Prescriptive tells you what to do about it.
Explain the concept of data-driven decision making vs intuition-based decision making. Give TWO advantages and TWO disadvantages of each approach.
💡 Click to View Answer & Solution
Data-Driven Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Objective - removes bias | 1. Requires quality data (garbage in = garbage out) |
| 2. Scalable - can analyze millions of records | 2. May miss context that humans understand |
| 3. Reproducible - same data → same decision | 3. Expensive to set up and maintain |
| 4. Measurable - can track outcomes | 4. Can lead to "analysis paralysis" |
Intuition-Based Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Fast - no data collection needed | 1. Subject to cognitive biases |
| 2. Works when data is unavailable | 2. Hard to explain or justify |
| 3. Can capture tacit knowledge | 3. Not scalable |
| 4. Good for unprecedented situations | 4. Inconsistent results |
Best Practice: Combine both - use data to inform decisions, but let human judgment handle context and ethics.
Explain FOUR common data quality issues and how to address each:
💡 Click to View Answer & Solution
1. Missing Values: gaps (NaN/blank) in the data. Fix: drop affected rows if they are few, or impute with the mean/median/mode (or a model-based estimate).
2. Outliers: extreme values that skew statistics and models. Fix: flag them with IQR or z-score rules, investigate whether they are errors or genuine, then cap, transform, or remove.
3. Inconsistent Formatting: the same value recorded different ways ("NY" vs "New York", mixed date formats, mixed case). Fix: standardize with mapping tables, trimming, case normalization, and canonical date parsing.
4. Duplicate Records: the same entity stored more than once, inflating counts. Fix: deduplicate on a unique key (or fuzzy matching), keeping the most complete record.
You receive a dataset where the Age column contains negative values, implausible values above 120, and missing entries, and the Gender column mixes encodings ('M', 'male', 'f', 'Female', ...).
a) Identify what's wrong with each column
b) Propose specific cleaning steps
💡 Click to View Answer & Solution
a) Age Column Issues: negative ages (impossible, likely data-entry errors), ages above 120 (implausible outliers), and missing values.
Gender Column Issues: one category is encoded several ways (abbreviations plus inconsistent case), so any grouping or counting would split a single category into many.
b) Cleaning Steps:
For Age:

```python
# Step 1: Replace invalid (negative) values with NaN
df.loc[df['Age'] < 0, 'Age'] = np.nan

# Step 2: Replace implausible outliers (> 120) with NaN
df.loc[df['Age'] > 120, 'Age'] = np.nan

# Step 3: Fill missing values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
```

For Gender:

```python
# Step 1: Convert to lowercase
df['Gender'] = df['Gender'].str.lower()

# Step 2: Standardize to a single format
df['Gender'] = df['Gender'].replace({
    'm': 'Male',
    'male': 'Male',
    'f': 'Female',
    'female': 'Female'
})
```

Result: Age is numeric with no impossible values and no gaps; Gender contains exactly two consistent categories, Male and Female.
Explain the following model evaluation concepts:
a) Accuracy, Precision, Recall
b) When is accuracy NOT a good metric?
c) What is the F1 Score and when to use it?
💡 Click to View Answer & Solution
a) Definitions:
Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\text{Precision} = \frac{TP}{TP + FP}$
Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$
b) When Accuracy Fails: Imbalanced Classes!
Example: Fraud detection. If only 1% of transactions are fraudulent, a model that always predicts "not fraud" scores 99% accuracy while catching zero fraud.
Rule: Use precision/recall for imbalanced datasets.
c) F1 Score:
$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
Use When: classes are imbalanced and you need a single number that balances precision and recall (e.g., spam or fraud detection).
A spam detection model has the following confusion matrix:
| | Predicted: Spam | Predicted: Not Spam |
|---|---|---|
| Actual: Spam | 85 | 15 |
| Actual: Not Spam | 10 | 890 |
Calculate:
a) Accuracy
b) Precision (for Spam)
c) Recall (for Spam)
d) F1 Score
e) Which is more important for spam detection: Precision or Recall? Why?
💡 Click to View Answer & Solution
Confusion Matrix Values: TP = 85 (spam correctly flagged), FN = 15 (spam missed), FP = 10 (legitimate email flagged as spam), TN = 890. Total = 1,000 emails.
a) Accuracy: $\text{Accuracy} = \frac{85 + 890}{1000} = \frac{975}{1000} = 97.5\%$
b) Precision: $\text{Precision} = \frac{85}{85 + 10} = \frac{85}{95} = 89.47\%$
c) Recall: $\text{Recall} = \frac{85}{85 + 15} = \frac{85}{100} = 85.0\%$
d) F1 Score: $F1 = 2 \times \frac{0.8947 \times 0.85}{0.8947 + 0.85} = 2 \times \frac{0.7605}{1.7447} = 87.18\%$
e) Precision is MORE important for spam detection
Reasoning: a false positive (a legitimate, possibly important email buried in the spam folder) costs the user far more than a false negative (one spam message reaching the inbox, a minor annoyance). High precision keeps false positives rare.
Trade-off: raising the decision threshold improves precision but lowers recall; for spam filtering, that trade is usually worth making.
Compare Supervised vs Unsupervised learning:
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | ? | ? |
| Data Requirements | ? | ? |
| Example Algorithms | ? | ? |
| Business Use Cases | ? | ? |
💡 Click to View Answer & Solution
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | Learning from labeled data (input → known output) | Finding patterns in unlabeled data |
| Data Requirements | Needs labeled training data (expensive to create) | Only needs input data (no labels needed) |
| Example Algorithms | Decision Tree, Random Forest, SVM, Linear Regression, Naive Bayes | K-Means Clustering, Hierarchical Clustering, PCA, Association Rules |
| Business Use Cases | Spam detection, price prediction, customer churn, loan approval | Customer segmentation, market basket analysis, anomaly detection |
Key Difference: Supervised has a "teacher" (labels), unsupervised discovers structure on its own.
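A tiny k-means sketch shows the "no teacher" idea in code: the algorithm is never told which group a point belongs to, yet it recovers the structure on its own (the data and function name are made up for illustration):

```python
def kmeans_1d(points, k=2, iters=10):
    """Minimal 1-D k-means: alternate assigning points to the nearest
    centroid and moving each centroid to the mean of its cluster.
    No labels are involved; structure is discovered from the data alone."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, with no labels attached
data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans_1d(data))  # [1.5, 10.5]
```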
Explain the concept of overfitting:
a) What is overfitting?
b) How can you detect it?
c) List FOUR techniques to prevent overfitting
💡 Click to View Answer & Solution
a) What is Overfitting?
Model learns training data TOO well, including noise and random fluctuations.
Analogy: Student who memorizes test answers but can't solve new problems.
b) How to Detect Overfitting: compare training and validation performance. High training accuracy with much lower validation/test accuracy (a large gap between the two) is the telltale sign.
c) Four Techniques to Prevent Overfitting:
1. Cross-validation: evaluate on held-out folds rather than the training data alone
2. Regularization (L1/L2): penalize large coefficients to keep the model simple
3. Simplify the model: prune decision trees, reduce features or parameters
4. Get more training data: noise is harder to memorize with more examples
Bonus techniques: early stopping, dropout (for neural networks), and ensembling (e.g., Random Forest).
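The train/test-gap check can be demonstrated with NumPy: fitting a needlessly flexible degree-9 polynomial to noisy linear data drives the training error toward zero while the test error does not follow (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.2, size=x.size)   # truly linear + noise
x_train, y_train = x[::2], y[::2]             # even points -> train
x_test,  y_test  = x[1::2], y[1::2]           # odd points  -> test

def mse(deg):
    """Train and test mean squared error for a degree-`deg` polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, deg)
    pred_tr = np.polyval(coeffs, x_train)
    pred_te = np.polyval(coeffs, x_test)
    return ((pred_tr - y_train) ** 2).mean(), ((pred_te - y_test) ** 2).mean()

for deg in (1, 9):
    tr, te = mse(deg)
    print(f"degree {deg}: train MSE {tr:.4f}, test MSE {te:.4f}")
# The degree-9 fit memorizes the 10 training points almost exactly;
# the gap between its train and test error is the overfitting signal.
```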
A company wants to use ML for hiring decisions. The model was trained on historical hiring data (who was hired and who succeeded).
What ethical concerns should be considered?
💡 Click to View Answer & Solution
Ethical Concerns:
1. Historical bias: if past hiring favored certain groups, the model learns and perpetuates that preference
2. Proxy discrimination: features like zip code, school, or hobbies can stand in for protected attributes even when those attributes are excluded
3. Lack of transparency: rejected candidates may receive no meaningful explanation of the decision
4. Feedback loops: biased hires generate biased "success" labels for future retraining, reinforcing the problem
Recommendations: audit model outputs for disparate impact across groups, remove protected attributes and obvious proxies, prefer explainable models, and keep a human in the loop for final hiring decisions.
| Block | Topic | Points |
|---|---|---|
| Block 1 | Analytics Strategy | 25 |
| Block 2 | Data Quality | 25 |
| Block 3 | Model Evaluation | 25 |
| Block 4 | Advanced Concepts | 25 |
| Total | | 100 |
| Bonus | Ethics | +5 |
| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / Total |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) |
| Specificity | TN / (TN + FP) |
Confusion Matrix:
| | Predicted: Pos | Predicted: Neg |
|---|---|---|
| Actual: Pos | TP | FN |
| Actual: Neg | FP | TN |
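The cheat-sheet formulas can be sanity-checked in a few lines of Python, here with the spam confusion matrix from Block 3 (TP=85, FN=15, FP=10, TN=890):

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the cheat-sheet metrics from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Spam example: TP=85, FN=15, FP=10, TN=890
acc, prec, rec, f1, spec = classification_metrics(85, 15, 10, 890)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
# accuracy=0.9750 precision=0.8947 recall=0.8500 f1=0.8718
```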
Show all working for partial credit. Good luck!