ABW501 Mock Exams (1+2): Analytics Edge (With Answers)
Exam Information
| Item | Details |
|---|---|
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions total |
Block 1: Analytics Types & Applications (25 points)
Q1.1 (12 points)
Complete the comparison table for four analytics types:
| Analytics Type | Key Question | Example Technique | Business Example |
|---|---|---|---|
| Descriptive | ? | ? | ? |
| Diagnostic | ? | ? | ? |
| Predictive | ? | ? | ? |
| Prescriptive | ? | ? | ? |
💡 Click to View Answer & Solution
| Analytics Type | Key Question | Example Technique | Business Example |
|---|---|---|---|
| Descriptive | What happened? | Summary statistics, dashboards, reporting | Monthly sales report showing $2M revenue |
| Diagnostic | Why did it happen? | Drill-down analysis, correlation analysis | Sales drop due to competitor price cut |
| Predictive | What will happen? | Regression, ML models, forecasting | Customer churn prediction (70% likely to leave) |
| Prescriptive | What should we do? | Optimization, simulation, decision models | Optimal inventory levels to minimize costs |
Memory Trick:
- Descriptive = Explain past
- Diagnostic = Investigate why
- Predictive = Estimate future
- Prescriptive = Execute action
Q1.2 (13 points)
Match each business problem to the correct analytics type and explain:
A: "Our sales dropped 15% last quarter. What caused this decline?"
B: "Which products should we recommend based on browsing patterns?"
C: "What is the optimal price point to maximize profit?"
D: "What were our top 5 selling products last month?"
💡 Click to View Answer & Solution
A: "Sales dropped 15%. What caused this?"
- Type: DIAGNOSTIC
- Reason: Investigating WHY something happened (root cause analysis)
B: "Which products to recommend based on browsing?"
- Type: PREDICTIVE
- Reason: Using patterns to predict what customers will want
C: "Optimal price point to maximize profit?"
- Type: PRESCRIPTIVE
- Reason: Optimization problem - determining best action
D: "Top 5 selling products last month?"
- Type: DESCRIPTIVE
- Reason: Simply summarizing historical data
Block 2: Analytics Lifecycle (25 points)
Q2.1 (15 points)
List and describe the SIX stages of Data Analytics Lifecycle in order.
💡 Click to View Answer & Solution
Stage 1: DISCOVERY
- Understand business problem and objectives
- Define key questions to answer
- Identify stakeholders and success criteria
Stage 2: DATA PREPARATION
- Collect data from various sources
- Clean data (handle missing values, outliers)
- Transform and integrate datasets
Stage 3: MODEL PLANNING
- Select appropriate techniques/algorithms
- Identify features (variables) to use
- Plan evaluation metrics
Stage 4: MODEL BUILDING
- Build and train models
- Test different algorithms
- Tune hyperparameters
Stage 5: COMMUNICATE RESULTS
- Present findings to stakeholders
- Create visualizations and reports
- Translate technical results to business insights
Stage 6: OPERATIONALIZE
- Deploy model to production
- Monitor performance over time
- Maintain and update as needed
Memory Trick: D-D-M-M-C-O = "Data Doctors Make Models, Communicate, Operate"
Q2.2 (10 points)
Identify which lifecycle stage each scenario describes:
A: "The team is cleaning missing values and removing outliers."
B: "Management wants to understand why churn increased. Team is defining specific questions."
C: "The model is deployed in production. Team monitors accuracy weekly."
D: "Data scientists are testing Random Forest, SVM, and Logistic Regression."
💡 Click to View Answer & Solution
A: Cleaning missing values and outliers
- Stage: DATA PREPARATION
- Reason: Data cleaning is core preparation activity
B: Defining specific questions to answer
- Stage: DISCOVERY
- Reason: Understanding problem and defining scope
C: Model deployed, monitoring accuracy
- Stage: OPERATIONALIZE
- Reason: Production deployment and monitoring
D: Testing multiple algorithms
- Stage: MODEL BUILDING
- Reason: Training and comparing different models
Block 3: Regression Analysis (25 points)
Scenario:
Real estate price prediction model:
$$\text{Price} = 50000 + 200 \times \text{SquareFeet} + 15000 \times \text{Bedrooms} - 5000 \times \text{Age}$$
| Statistic | Value |
|---|---|
| RΒ² | 0.82 |
| Adjusted RΒ² | 0.80 |
| All p-values | < 0.01 |
| Sample size | 200 houses |
Q3.1 (8 points)
Interpret each coefficient in plain business language.
💡 Click to View Answer & Solution
Constant (50,000):
- Base price when all other variables = 0
- Theoretical minimum house value
SquareFeet Coefficient (200):
- For each additional square foot, price increases by $200
- Holding bedrooms and age constant
Bedrooms Coefficient (15,000):
- Each additional bedroom adds $15,000 to price
- Holding square feet and age constant
Age Coefficient (-5,000):
- Each year older decreases price by $5,000
- Negative = older houses worth less
- Holding square feet and bedrooms constant
Key Phrase: Always say "holding other variables constant"
Q3.2 (8 points)
Interpret RΒ² = 0.82. Calculate predicted price for a house with:
- 2,000 sq ft
- 3 bedrooms
- 10 years old
💡 Click to View Answer & Solution
RΒ² Interpretation:
RΒ² = 0.82 means 82% of the variation in house prices is explained by square feet, bedrooms, and age. The remaining 18% is due to other factors (location, condition, etc.). This is a good model for real estate.
Price Prediction: $\text{Price} = 50000 + 200(2000) + 15000(3) - 5000(10)$ $= 50000 + 400000 + 45000 - 50000$ $= 445000$
Predicted Price = $445,000
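As a sanity check, the plug-in arithmetic can be reproduced in a few lines of Python (a quick sketch using the scenario's coefficients):

```python
def predicted_price(sqft, bedrooms, age):
    # Price model from the scenario: base + $200/sqft + $15,000/bedroom - $5,000/year
    return 50000 + 200 * sqft + 15000 * bedrooms - 5000 * age

print(predicted_price(2000, 3, 10))  # 445000
```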
Q3.3 (9 points)
Compare two investment houses:
House A: 1,800 sq ft, 3 bedrooms, 5 years old
House B: 1,500 sq ft, 4 bedrooms, 2 years old
Which has higher predicted price? Calculate the difference.
💡 Click to View Answer & Solution
House A: $= 50000 + 200(1800) + 15000(3) - 5000(5)$ $= 50000 + 360000 + 45000 - 25000$ $= 430000$
House B: $= 50000 + 200(1500) + 15000(4) - 5000(2)$ $= 50000 + 300000 + 60000 - 10000$ $= 400000$
Results:
- House A: $430,000
- House B: $400,000
- House A is higher by $30,000
Insight: Square footage drives the gap. House A's extra 300 sq ft is worth $60,000, which outweighs House B's extra bedroom (+$15,000) and its 3 fewer years of age (+$15,000), for a net $30,000 advantage.
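Because the model is linear, the gap between the two houses can also be read straight off the coefficients, one term per feature difference (a sketch of the same arithmetic):

```python
# Feature differences, House A minus House B
sqft_diff = 1800 - 1500   # +300 sq ft
bedroom_diff = 3 - 4      # one fewer bedroom
age_diff = 5 - 2          # 3 years older

# Multiply each difference by its coefficient and sum
price_gap = 200 * sqft_diff + 15000 * bedroom_diff - 5000 * age_diff
print(price_gap)  # 30000 -> House A is $30,000 higher
```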
Block 4: Data Mining Algorithms (25 points)
Q4.1 (15 points)
Complete the algorithm comparison table:
| Aspect | Decision Tree | KNN | Naive Bayes |
|---|---|---|---|
| Algorithm Type | ? | ? | ? |
| How it works | ? | ? | ? |
| Main Advantage | ? | ? | ? |
| Main Disadvantage | ? | ? | ? |
| Best Use Case | ? | ? | ? |
💡 Click to View Answer & Solution
| Aspect | Decision Tree | KNN | Naive Bayes |
|---|---|---|---|
| Type | Both (Classification & Regression) | Both | Classification |
| How it works | Splits data using if-then rules based on feature thresholds | Classifies based on k nearest neighbors' majority vote | Uses Bayes theorem with feature independence assumption |
| Advantage | Easy to interpret, visual | Simple, no training needed | Fast, works well with small data |
| Disadvantage | Prone to overfitting | Slow prediction (compares all points) | Assumes feature independence (often unrealistic) |
| Best Use Case | When explainability matters (credit decisions) | When similar items cluster together (recommendations) | Text classification (spam detection) |
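To make "majority vote of the k nearest neighbors" concrete, here is a from-scratch KNN on toy 2-D points (an illustrative sketch, not a tuned implementation; the coordinates and labels are made up):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # Sort training points by Euclidean distance to the query...
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # ...then let the k closest labels vote
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2)))  # A -- all three nearest neighbors are class A
```

Note how prediction has to scan every training point: this is exactly the "slow prediction" disadvantage in the table.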
Q4.2 (10 points)
Scenario: Build email spam classifier with:
- 1 million emails (large dataset)
- Need real-time predictions (<100ms)
- Binary: Spam or Not Spam
Which algorithm? Why not the others?
💡 Click to View Answer & Solution
Best Choice: Naive Bayes
Reasons:
- Very fast prediction - perfect for real-time (<100ms)
- Works great for text classification (spam detection is classic NB use case)
- Handles high-dimensional data well (many word features)
- Scales to large datasets efficiently
Why NOT Decision Tree:
- Can overfit with many text features
- Large tree = slower prediction
- Less suited for text data
Why NOT KNN:
- Way too slow for 1 million emails
- Must compare against ALL training examples
- Real-time requirement impossible to meet
- Memory intensive (stores all data)
Summary:
- Naive Bayes: ✅ Fast, good for text, scalable
- Decision Tree: ⚠️ Possible but not optimal for text
- KNN: ❌ Too slow for large data + real-time
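To see why Naive Bayes suits this problem, here is a from-scratch sketch on a tiny hypothetical corpus: training is just word counting, and prediction is a log prior plus one smoothed log-likelihood per word, which is why it stays fast at scale:

```python
import math
from collections import Counter

# Tiny made-up corpus -- a real filter would train on millions of emails
train = [("win free money now", "spam"),
         ("free offer claim prize", "spam"),
         ("meeting agenda for today", "ham"),
         ("project notes from the meeting", "ham")]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
vocab = set()
for text, label in train:
    words = text.split()
    class_counts[label] += 1
    word_counts[label].update(words)
    vocab.update(words)

def classify(text):
    # Pick the class maximizing log P(class) + sum of log P(word | class)
    total = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / total)  # log prior
        n_c = sum(word_counts[c].values())
        for w in text.split():
            # Laplace smoothing avoids log(0) for unseen words
            score += math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("claim your free prize"))  # spam
```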
Bonus: Correlation vs Causation (5 extra points)
Explain the difference. Give an example where two variables are correlated but NOT causally related.
💡 Click to View Answer & Solution
Correlation: Two variables move together (positive or negative relationship)
Causation: One variable directly causes changes in another
Key Difference: Correlation ≠ Causation. Just because A and B move together doesn't mean A causes B (or B causes A).
Example:
- Ice cream sales and drowning deaths are positively correlated
- But ice cream doesn't CAUSE drowning!
- Confounding variable: Hot weather
- Hot weather → More ice cream sales
- Hot weather → More swimming → More drownings
Other Examples:
- Shoe size correlates with reading ability (age is the confounding variable)
- Nicolas Cage movies correlate with swimming pool drownings (pure coincidence)
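The confounder pattern can be simulated: a hidden "temperature" series drives two otherwise unrelated variables, and they come out strongly correlated (a seeded toy simulation; every number here is made up):

```python
import random

random.seed(42)

# Hidden confounder: daily temperature for one year
temps = [random.uniform(10, 35) for _ in range(365)]

# Both series depend on temperature, not on each other
ice_cream = [10 * t + random.gauss(0, 20) for t in temps]
drownings = [0.5 * t + random.gauss(0, 3) for t in temps]

def pearson(x, y):
    # Pearson correlation coefficient, computed from scratch
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Strong positive correlation despite zero causal link between the two series
print(round(pearson(ice_cream, drownings), 2))
```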
End of Exam
| Block | Topic | Points |
|---|---|---|
| Block 1 | Analytics Types | 25 |
| Block 2 | Lifecycle | 25 |
| Block 3 | Regression | 25 |
| Block 4 | Data Mining | 25 |
| Total | | 100 |
| Bonus | Correlation/Causation | +5 |
Quick Reference
Analytics Types:
- Descriptive: What happened?
- Diagnostic: Why?
- Predictive: What will happen?
- Prescriptive: What should we do?
Lifecycle: Discovery → Data Prep → Model Plan → Model Build → Communicate → Operationalize
Regression: Coefficient = change in Y per 1-unit change in X
RΒ²: % of variance explained by model
Show your work for partial credit. Good luck!
ABW501 Mock Exam 2 - Analytics Edge
Exam Information
| Item | Details |
|---|---|
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions |
Block 1: Analytics Strategy & Applications (25 points)
Q1.1 (12 points)
A hospital wants to reduce patient readmission rates. For each analytics approach, give a specific example of how it could help:
a) Descriptive Analytics
b) Diagnostic Analytics
c) Predictive Analytics
d) Prescriptive Analytics
💡 Click to View Answer & Solution
a) Descriptive Analytics: "Dashboard showing readmission rates by department, age group, and diagnosis. Example: 'Cardiology has 15% readmission rate vs. 8% hospital average.'"
b) Diagnostic Analytics: "Root cause analysis to understand WHY readmissions happen. Example: 'Patients discharged on Friday have 20% higher readmission - likely due to weekend pharmacy closures.'"
c) Predictive Analytics: "ML model predicting which patients are likely to be readmitted. Example: 'Patient John has 78% probability of readmission within 30 days based on his diagnosis, age, and prior history.'"
d) Prescriptive Analytics: "Recommending specific interventions. Example: 'For high-risk patients, schedule follow-up call within 48 hours, arrange home nurse visits, and ensure medication delivery.'"
Key Pattern:
- Descriptive: Summarize what happened
- Diagnostic: Explain why
- Predictive: Forecast risk
- Prescriptive: Recommend actions
Q1.2 (13 points)
Explain the concept of data-driven decision making vs intuition-based decision making. Give TWO advantages and TWO disadvantages of each approach.
💡 Click to View Answer & Solution
Data-Driven Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Objective - removes bias | 1. Requires quality data (garbage in = garbage out) |
| 2. Scalable - can analyze millions of records | 2. May miss context that humans understand |
| 3. Reproducible - same data → same decision | 3. Expensive to set up and maintain |
| 4. Measurable - can track outcomes | 4. Can lead to "analysis paralysis" |
Intuition-Based Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Fast - no data collection needed | 1. Subject to cognitive biases |
| 2. Works when data is unavailable | 2. Hard to explain or justify |
| 3. Can capture tacit knowledge | 3. Not scalable |
| 4. Good for unprecedented situations | 4. Inconsistent results |
Best Practice: Combine both - use data to inform decisions, but let human judgment handle context and ethics.
Block 2: Data Quality & Preparation (25 points)
Q2.1 (12 points)
Explain FOUR common data quality issues and how to address each:
💡 Click to View Answer & Solution
1. Missing Values
- Problem: Empty cells in dataset
- Causes: Survey non-response, system errors, data not collected
- Solutions:
- Delete rows (if few missing)
- Impute with mean/median/mode
- Use predictive models to estimate
- Create "missing" category for categorical
2. Outliers
- Problem: Extreme values far from normal range
- Causes: Data entry errors, genuine rare events
- Solutions:
- Remove if clearly erroneous
- Cap at percentiles (winsorization)
- Transform data (log scale)
- Use robust algorithms
3. Inconsistent Formatting
- Problem: Same thing recorded differently
- Examples: "USA", "U.S.A.", "United States"
- Solutions:
- Standardize formats
- Create lookup tables
- Use data validation rules
- Regular expressions for cleaning
4. Duplicate Records
- Problem: Same entity recorded multiple times
- Causes: Multiple data sources, entry errors
- Solutions:
- Exact matching (same ID)
- Fuzzy matching (similar names)
- Deduplication algorithms
- Define business rules for merging
Q2.2 (13 points)
You receive a dataset with the following issues:
- Age column has values: 25, 30, -5, 150, 45, NULL, 35
- Gender column has: "M", "Male", "m", "F", "female", "Female"
a) Identify what's wrong with each column
b) Propose specific cleaning steps
💡 Click to View Answer & Solution
a) Age Column Issues:
- -5: Invalid (negative age impossible)
- 150: Outlier (likely data entry error - no one is 150)
- NULL: Missing value
Gender Column Issues:
- Inconsistent case: "M" vs "m", "Male" vs "male"
- Inconsistent format: "M" vs "Male" (abbreviation vs full word)
- Inconsistent capitalization: "female" vs "Female"
b) Cleaning Steps:
For Age:
```python
# Step 1: Replace invalid values (negative ages) with NaN
df.loc[df['Age'] < 0, 'Age'] = np.nan
# Step 2: Replace outliers (>120) with NaN
df.loc[df['Age'] > 120, 'Age'] = np.nan
# Step 3: Fill missing values with the median
# (assignment form avoids the chained-assignment pitfall of inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].median())
```
For Gender:
```python
# Step 1: Convert to lowercase
df['Gender'] = df['Gender'].str.lower()
# Step 2: Standardize to a single format
df['Gender'] = df['Gender'].replace({
    'm': 'Male',
    'male': 'Male',
    'f': 'Female',
    'female': 'Female'
})
```
Result:
- Age: [25, 30, 32.5, 32.5, 45, 32.5, 35] (median of the remaining valid values 25, 30, 35, 45 is 32.5)
- Gender: ['Male', 'Male', 'Male', 'Female', 'Female', 'Female']
Block 3: Model Evaluation & Interpretation (25 points)
Q3.1 (12 points)
Explain the following model evaluation concepts:
a) Accuracy, Precision, Recall
b) When is accuracy NOT a good metric?
c) What is the F1 Score and when to use it?
💡 Click to View Answer & Solution
a) Definitions:
Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- Overall correctness
- "Of all predictions, how many were right?"
Precision: $\text{Precision} = \frac{TP}{TP + FP}$
- Of positive predictions, how many were actually positive?
- "When I predict positive, how often am I right?"
Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$
- Of actual positives, how many did I catch?
- "Of all actual positives, how many did I find?"
b) When Accuracy Fails:
Imbalanced Classes!
Example: Fraud detection
- 99% legitimate, 1% fraud
- Model predicts "all legitimate" → 99% accuracy!
- But catches 0% of fraud (useless)
Rule: Use precision/recall for imbalanced datasets.
c) F1 Score:
$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
- Harmonic mean of precision and recall
- Balances both metrics
- Range: 0 to 1 (higher is better)
Use When:
- Class imbalance exists
- Both false positives AND false negatives matter
- Need single metric to compare models
Q3.2 (13 points)
A spam detection model has the following confusion matrix:
| | Predicted: Spam | Predicted: Not Spam |
|---|---|---|
| Actual: Spam | 85 | 15 |
| Actual: Not Spam | 10 | 890 |
Calculate:
a) Accuracy
b) Precision (for Spam)
c) Recall (for Spam)
d) F1 Score
e) Which is more important for spam detection: Precision or Recall? Why?
💡 Click to View Answer & Solution
Confusion Matrix Values:
- TP (Spam correctly identified) = 85
- FN (Spam missed) = 15
- FP (Not spam marked as spam) = 10
- TN (Not spam correctly identified) = 890
- Total = 1000
a) Accuracy: $\text{Accuracy} = \frac{85 + 890}{1000} = \frac{975}{1000} = 97.5\%$
b) Precision: $\text{Precision} = \frac{85}{85 + 10} = \frac{85}{95} = 89.47\%$
c) Recall: $\text{Recall} = \frac{85}{85 + 15} = \frac{85}{100} = 85.0\%$
d) F1 Score: $F1 = 2 \times \frac{0.8947 \times 0.85}{0.8947 + 0.85} = 2 \times \frac{0.7605}{1.7447} = 87.18\%$
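All four metrics follow mechanically from the four confusion-matrix cells; a short script to verify the arithmetic:

```python
# Confusion-matrix cells from the question
TP, FN, FP, TN = 85, 15, 10, 890
total = TP + FN + FP + TN

accuracy = (TP + TN) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2%}")   # 97.50%
print(f"Precision: {precision:.2%}")  # 89.47%
print(f"Recall:    {recall:.2%}")     # 85.00%
print(f"F1 Score:  {f1:.2%}")         # 87.18%
```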
e) Precision is MORE important for spam detection
Reasoning:
- High precision = Few false positives
- False positive = Important email marked as spam
- This is WORSE than missing some spam (user might miss critical email!)
Trade-off:
- Prefer: Some spam in inbox (annoying but visible)
- Avoid: Important email in spam folder (might never be seen)
Block 4: Advanced Analytics Concepts (25 points)
Q4.1 (12 points)
Compare Supervised vs Unsupervised learning:
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | ? | ? |
| Data Requirements | ? | ? |
| Example Algorithms | ? | ? |
| Business Use Cases | ? | ? |
💡 Click to View Answer & Solution
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | Learning from labeled data (input → known output) | Finding patterns in unlabeled data |
| Data Requirements | Needs labeled training data (expensive to create) | Only needs input data (no labels needed) |
| Example Algorithms | Decision Tree, Random Forest, SVM, Linear Regression, Naive Bayes | K-Means Clustering, Hierarchical Clustering, PCA, Association Rules |
| Business Use Cases | Spam detection, price prediction, customer churn, loan approval | Customer segmentation, market basket analysis, anomaly detection |
Key Difference: Supervised has a "teacher" (labels), unsupervised discovers structure on its own.
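A minimal unsupervised example: 1-D k-means recovers two groups from raw numbers with no labels supplied (a toy sketch; real work would typically call a library such as scikit-learn):

```python
def kmeans_1d(points, centroids, iters=10):
    # Alternate between assigning points to the nearest centroid
    # and moving each centroid to the mean of its assigned points
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) for c in clusters]
    return centroids

data = [1, 2, 3, 10, 11, 12]  # two obvious groups, but no labels given
print(kmeans_1d(data, centroids=[0.0, 5.0]))  # [2.0, 11.0]
```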
Q4.2 (13 points)
Explain the concept of overfitting:
a) What is overfitting?
b) How can you detect it?
c) List FOUR techniques to prevent overfitting
💡 Click to View Answer & Solution
a) What is Overfitting?
Model learns training data TOO well, including noise and random fluctuations.
- Performs excellently on training data
- Performs poorly on new/unseen data
- Model has memorized rather than learned general patterns
Analogy: Student who memorizes test answers but can't solve new problems.
b) How to Detect Overfitting
- Train-Test Gap:
- Training accuracy = 99%
- Test accuracy = 70%
- Large gap = overfitting
- Learning Curves:
- Training error keeps decreasing
- Validation error increases or plateaus
- Lines diverge
- Cross-Validation:
- High variance in scores across folds
- Some folds much worse than others
c) Four Techniques to Prevent Overfitting
- Cross-Validation
- Split data into k folds
- Train on k-1, test on 1, rotate
- More reliable performance estimate
- Regularization (L1/L2)
- Adds penalty for complex models
- L1 (Lasso): Can eliminate features
- L2 (Ridge): Shrinks coefficients
- Early Stopping
- Monitor validation error during training
- Stop when validation error starts increasing
- Prevents over-training
- Reduce Model Complexity
- Fewer features (feature selection)
- Simpler model (shallower tree)
- Fewer parameters
Bonus techniques:
- Dropout (neural networks)
- More training data
- Data augmentation
- Ensemble methods
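The k-fold rotation described above is just index bookkeeping; a minimal sketch (real projects would typically use scikit-learn's `KFold`):

```python
def kfold_indices(n, k):
    # Round-robin split of indices 0..n-1 into k folds;
    # each fold serves as the test set exactly once
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(n=10, k=5):
    print(f"train on {len(train_idx)} samples, test on {len(test_idx)}")
```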
Bonus: Ethics in Analytics (5 extra points)
A company wants to use ML for hiring decisions. The model was trained on historical hiring data (who was hired and who succeeded).
What ethical concerns should be considered?
💡 Click to View Answer & Solution
Ethical Concerns:
- Historical Bias Perpetuation
- If past hiring was biased (e.g., fewer women in tech), model learns this
- Model will discriminate against groups historically underrepresented
- "The model is only as fair as its training data"
- Proxy Discrimination
- Even without protected attributes (race, gender), proxies exist
- Zip code correlates with race
- Name style correlates with ethnicity
- Model may discriminate indirectly
- Lack of Transparency
- Candidates can't understand why they were rejected
- "Black box" decisions are hard to appeal
- Legal requirement for explainable decisions
- Feedback Loop
- Only hired people have success data
- Rejected candidates might have succeeded
- Model never learns from its mistakes
Recommendations:
- Audit model for bias regularly
- Use diverse training data
- Ensure human oversight
- Allow candidates to appeal/explain
- Be transparent about AI use
End of Exam
| Block | Topic | Points |
|---|---|---|
| Block 1 | Analytics Strategy | 25 |
| Block 2 | Data Quality | 25 |
| Block 3 | Model Evaluation | 25 |
| Block 4 | Advanced Concepts | 25 |
| Total | | 100 |
| Bonus | Ethics | +5 |
Key Formulas Reference
| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / Total |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 Score | 2 Γ (Precision Γ Recall) / (Precision + Recall) |
| Specificity | TN / (TN + FP) |
Confusion Matrix:
| | Predicted: Pos | Predicted: Neg |
|---|---|---|
| Actual: Pos | TP | FN |
| Actual: Neg | FP | TN |
Show all working for partial credit. Good luck!