| Item | Details |
|---|---|
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions total |
Complete the comparison table for four analytics types:
| Analytics Type | Key Question | Example Technique | Business Example |
|---|---|---|---|
| Descriptive | ? | ? | ? |
| Diagnostic | ? | ? | ? |
| Predictive | ? | ? | ? |
| Prescriptive | ? | ? | ? |
💡 Click to View Answer & Solution
| Analytics Type | Key Question | Example Technique | Business Example |
|---|---|---|---|
| Descriptive | What happened? | Summary statistics, dashboards, reporting | Monthly sales report showing $2M revenue |
| Diagnostic | Why did it happen? | Drill-down analysis, correlation analysis | Sales drop due to competitor price cut |
| Predictive | What will happen? | Regression, ML models, forecasting | Customer churn prediction (70% likely to leave) |
| Prescriptive | What should we do? | Optimization, simulation, decision models | Optimal inventory levels to minimize costs |
Memory Trick: the four types answer, in order, "What happened? Why did it happen? What will happen? What should we do?" (look back, explain, look ahead, act).
Match each business problem to the correct analytics type and explain:
A: "Our sales dropped 15% last quarter. What caused this decline?"
B: "Which products should we recommend based on browsing patterns?"
C: "What is the optimal price point to maximize profit?"
D: "What were our top 5 selling products last month?"
💡 Click to View Answer & Solution
A: "Sales dropped 15%. What caused this?" → Diagnostic. It asks why something already happened (root-cause analysis).
B: "Which products to recommend based on browsing?" → Predictive. It uses past behavior patterns to forecast what a customer will respond to.
C: "Optimal price point to maximize profit?" → Prescriptive. It asks what action to take, typically answered with optimization.
D: "Top 5 selling products last month?" → Descriptive. It summarizes what already happened.
List and describe the SIX stages of the Data Analytics Lifecycle, in order.
💡 Click to View Answer & Solution
Stage 1: DISCOVERY - frame the business problem, assess available data and resources, and form initial hypotheses
Stage 2: DATA PREPARATION - acquire, clean, and transform the data into an analytic sandbox
Stage 3: MODEL PLANNING - select candidate techniques, variables, and the general modeling approach
Stage 4: MODEL BUILDING - create training and test sets, then fit and tune the chosen models
Stage 5: COMMUNICATE RESULTS - present findings, limitations, and business value to stakeholders
Stage 6: OPERATIONALIZE - deploy the model to production (often via a pilot) and monitor its performance
Memory Trick: D-D-M-M-C-O = "Data Doctors Make Models, Communicate, Operate"
Identify which lifecycle stage each scenario describes:
A: "The team is cleaning missing values and removing outliers."
B: "Management wants to understand why churn increased. Team is defining specific questions."
C: "The model is deployed in production. Team monitors accuracy weekly."
D: "Data scientists are testing Random Forest, SVM, and Logistic Regression."
💡 Click to View Answer & Solution
A: Cleaning missing values and outliers → Stage 2: Data Preparation
B: Defining specific questions to answer → Stage 1: Discovery
C: Model deployed, monitoring accuracy → Stage 6: Operationalize
D: Testing multiple algorithms → Stage 4: Model Building
Real estate price prediction model:
$$\text{Price} = 50000 + 200 \times \text{SquareFeet} + 15000 \times \text{Bedrooms} - 5000 \times \text{Age}$$
| Statistic | Value |
|---|---|
| R² | 0.82 |
| Adjusted R² | 0.80 |
| All p-values | < 0.01 |
| Sample size | 200 houses |
Interpret each coefficient in plain business language.
💡 Click to View Answer & Solution
Constant (50,000): the baseline predicted price when all predictors are zero; not meaningful on its own (no house has 0 sq ft), it simply anchors the regression line.
SquareFeet Coefficient (200): each additional square foot adds $200 to the predicted price, holding other variables constant.
Bedrooms Coefficient (15,000): each additional bedroom adds $15,000 to the predicted price, holding other variables constant.
Age Coefficient (-5,000): each additional year of age lowers the predicted price by $5,000, holding other variables constant.
Key Phrase: Always say "holding other variables constant"
Interpret R² = 0.82. Then calculate the predicted price for a house with: 2,000 sq ft, 3 bedrooms, 10 years old.
💡 Click to View Answer & Solution
R² Interpretation:
R² = 0.82 means 82% of the variation in house prices is explained by square feet, bedrooms, and age. The remaining 18% is due to other factors (location, condition, etc.). This is a good model for real estate.
Price Prediction: $\text{Price} = 50000 + 200(2000) + 15000(3) - 5000(10)$ $= 50000 + 400000 + 45000 - 50000$ $= 445000$
Predicted Price = $445,000
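As a quick sanity check, the equation can be evaluated in a few lines of Python (the coefficients come straight from the question; the function name is just for illustration):

```python
# Fitted model from the question: Price = 50000 + 200*sqft + 15000*beds - 5000*age
def predict_price(sqft, beds, age):
    """Predicted house price from the exam's regression equation."""
    return 50000 + 200 * sqft + 15000 * beds - 5000 * age

# The exam's example house: 2,000 sq ft, 3 bedrooms, 10 years old
print(predict_price(2000, 3, 10))  # 445000
```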
Compare two investment houses:
House A: 1,800 sq ft, 3 bedrooms, 5 years old
House B: 1,500 sq ft, 4 bedrooms, 2 years old
Which has higher predicted price? Calculate the difference.
💡 Click to View Answer & Solution
House A: $= 50000 + 200(1800) + 15000(3) - 5000(5)$ $= 50000 + 360000 + 45000 - 25000$ $= 430000$
House B: $= 50000 + 200(1500) + 15000(4) - 5000(2)$ $= 50000 + 300000 + 60000 - 10000$ $= 400000$
Results: House A = $430,000; House B = $400,000. House A has the higher predicted price, by $30,000.
Insight: Square footage has more impact than bedrooms. House A's extra 300 sq ft ($60,000 value) outweighs House B's extra bedroom ($15,000).
Complete the algorithm comparison table:
| Aspect | Decision Tree | KNN | Naive Bayes |
|---|---|---|---|
| Algorithm Type | ? | ? | ? |
| How it works | ? | ? | ? |
| Main Advantage | ? | ? | ? |
| Main Disadvantage | ? | ? | ? |
| Best Use Case | ? | ? | ? |
💡 Click to View Answer & Solution
| Aspect | Decision Tree | KNN | Naive Bayes |
|---|---|---|---|
| Type | Both (Classification & Regression) | Both | Classification |
| How it works | Splits data using if-then rules based on feature thresholds | Classifies based on k nearest neighbors' majority vote | Uses Bayes theorem with feature independence assumption |
| Advantage | Easy to interpret, visual | Simple, no training needed | Fast, works well with small data |
| Disadvantage | Prone to overfitting | Slow prediction (compares all points) | Assumes feature independence (often unrealistic) |
| Best Use Case | When explainability matters (credit decisions) | When similar items cluster together (recommendations) | Text classification (spam detection) |
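To make the KNN row concrete, here is a minimal pure-Python sketch of the "majority vote of the k nearest neighbors" idea (the toy data and function name are illustrative, not from the exam):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    Note there is no training step: KNN just stores the data, which is
    why prediction is slow (it scans every stored point).
    """
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated classes
train = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 1), 'A'),
         ((8, 8), 'B'), ((8, 9), 'B'), ((9, 8), 'B')]
print(knn_predict(train, (2, 2)))   # A
print(knn_predict(train, (9, 9)))   # B
```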
Scenario: Build an email spam classifier on high-dimensional text features (e.g., word counts).
Which algorithm would you choose? Why not the others?
💡 Click to View Answer & Solution
Best Choice: Naive Bayes
Reasons: Naive Bayes is fast to train and predict, handles many features gracefully, and is the standard baseline for text classification; it performs well even with modest training data.
Why NOT Decision Tree: a single tree over thousands of sparse word features is prone to overfitting, and its main advantage (a small, interpretable set of rules) is lost at that scale.
Why NOT KNN: it has no training step, so every prediction must compare the new email against all stored emails; that is slow, and distance measures degrade in high-dimensional text space.
Summary: the independence assumption is "naive" (words are not truly independent), yet in practice Naive Bayes works remarkably well for spam detection.
Explain the difference. Give an example where two variables are correlated but NOT causally related.
💡 Click to View Answer & Solution
Correlation: Two variables move together (positive or negative relationship)
Causation: One variable directly causes changes in another
Key Difference: Correlation ≠ Causation. Just because A and B move together doesn't mean A causes B (or B causes A).
Example: Ice cream sales and drowning deaths rise and fall together, but neither causes the other; hot weather (a confounding variable) drives both.
Other Examples: children's shoe size and reading ability (confounder: age); number of firefighters at a fire and the damage done (confounder: the size of the fire).
| Block | Topic | Points |
|---|---|---|
| Block 1 | Analytics Types | 25 |
| Block 2 | Lifecycle | 25 |
| Block 3 | Regression | 25 |
| Block 4 | Data Mining | 25 |
| Total | | 100 |
| Bonus | Correlation/Causation | +5 |
Analytics Types: Descriptive (what happened) → Diagnostic (why) → Predictive (what will happen) → Prescriptive (what should we do)
Lifecycle: Discovery → Data Prep → Model Plan → Model Build → Communicate → Operationalize
Regression: Coefficient = change in Y per 1-unit change in X
R²: % of variance explained by model
Show your work for partial credit. Good luck!
| Item | Details |
|---|---|
| Total Points | 100 |
| Time Allowed | 90 minutes |
| Format | Closed book, calculator allowed |
| Structure | 4 Blocks, 8 Questions |
A hospital wants to reduce patient readmission rates. For each analytics approach, give a specific example of how it could help:
a) Descriptive Analytics
b) Diagnostic Analytics
c) Predictive Analytics
d) Prescriptive Analytics
💡 Click to View Answer & Solution
a) Descriptive Analytics: "Dashboard showing readmission rates by department, age group, and diagnosis. Example: 'Cardiology has 15% readmission rate vs. 8% hospital average.'"
b) Diagnostic Analytics: "Root cause analysis to understand WHY readmissions happen. Example: 'Patients discharged on Friday have 20% higher readmission - likely due to weekend pharmacy closures.'"
c) Predictive Analytics: "ML model predicting which patients are likely to be readmitted. Example: 'Patient John has 78% probability of readmission within 30 days based on his diagnosis, age, and prior history.'"
d) Prescriptive Analytics: "Recommending specific interventions. Example: 'For high-risk patients, schedule follow-up call within 48 hours, arrange home nurse visits, and ensure medication delivery.'"
Key Pattern: Descriptive reports the problem, Diagnostic explains it, Predictive anticipates it, Prescriptive tells you what to do about it.
Explain the concept of data-driven decision making vs intuition-based decision making. Give TWO advantages and TWO disadvantages of each approach.
💡 Click to View Answer & Solution
Data-Driven Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Objective - removes bias | 1. Requires quality data (garbage in = garbage out) |
| 2. Scalable - can analyze millions of records | 2. May miss context that humans understand |
| 3. Reproducible - same data → same decision | 3. Expensive to set up and maintain |
| 4. Measurable - can track outcomes | 4. Can lead to "analysis paralysis" |
Intuition-Based Decision Making:
| Advantages | Disadvantages |
|---|---|
| 1. Fast - no data collection needed | 1. Subject to cognitive biases |
| 2. Works when data is unavailable | 2. Hard to explain or justify |
| 3. Can capture tacit knowledge | 3. Not scalable |
| 4. Good for unprecedented situations | 4. Inconsistent results |
Best Practice: Combine both - use data to inform decisions, but let human judgment handle context and ethics.
Explain FOUR common data quality issues and how to address each:
💡 Click to View Answer & Solution
1. Missing Values: gaps (NaN/blank) in the data. Fix: drop affected rows if they are few, or impute with the mean/median/mode (or a model-based estimate).
2. Outliers: extreme values that skew statistics and models. Fix: flag them with IQR or z-score rules, investigate whether they are errors or genuine, then cap, transform, or remove.
3. Inconsistent Formatting: the same value recorded different ways ("NY" vs "New York", mixed date formats, mixed case). Fix: standardize with mapping tables, trimming, case normalization, and canonical date parsing.
4. Duplicate Records: the same entity stored more than once, inflating counts. Fix: deduplicate on a unique key (or fuzzy matching), keeping the most complete record.
You receive a dataset where the Age column contains negative values, implausible values above 120, and missing entries, and the Gender column mixes encodings ('M', 'male', 'f', 'Female', ...).
a) Identify what's wrong with each column
b) Propose specific cleaning steps
💡 Click to View Answer & Solution
a) Age Column Issues: negative ages (impossible, likely data-entry errors), ages above 120 (implausible outliers), and missing values.
Gender Column Issues: one category is encoded several ways (abbreviations plus inconsistent case), so any grouping or counting would split a single category into many.
b) Cleaning Steps:
For Age:

```python
# Step 1: Replace invalid (negative) values with NaN
df.loc[df['Age'] < 0, 'Age'] = np.nan

# Step 2: Replace implausible outliers (> 120) with NaN
df.loc[df['Age'] > 120, 'Age'] = np.nan

# Step 3: Fill missing values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
```

For Gender:

```python
# Step 1: Convert to lowercase
df['Gender'] = df['Gender'].str.lower()

# Step 2: Standardize to a single format
df['Gender'] = df['Gender'].replace({
    'm': 'Male',
    'male': 'Male',
    'f': 'Female',
    'female': 'Female'
})
```

Result: Age is numeric with no impossible values and no gaps; Gender contains exactly two consistent categories, Male and Female.
Explain the following model evaluation concepts:
a) Accuracy, Precision, Recall
b) When is accuracy NOT a good metric?
c) What is the F1 Score and when to use it?
💡 Click to View Answer & Solution
a) Definitions:
Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\text{Precision} = \frac{TP}{TP + FP}$
Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$
b) When Accuracy Fails: Imbalanced Classes!
Example: Fraud detection. If only 1% of transactions are fraudulent, a model that always predicts "not fraud" scores 99% accuracy while catching zero fraud.
Rule: Use precision/recall for imbalanced datasets.
c) F1 Score:
$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
Use When: classes are imbalanced and you need a single number that balances precision and recall (e.g., spam or fraud detection).
A spam detection model has the following confusion matrix:
| | Predicted: Spam | Predicted: Not Spam |
|---|---|---|
| Actual: Spam | 85 | 15 |
| Actual: Not Spam | 10 | 890 |
Calculate:
a) Accuracy
b) Precision (for Spam)
c) Recall (for Spam)
d) F1 Score
e) Which is more important for spam detection: Precision or Recall? Why?
💡 Click to View Answer & Solution
Confusion Matrix Values: TP = 85 (spam correctly flagged), FN = 15 (spam missed), FP = 10 (legitimate email flagged as spam), TN = 890. Total = 1,000 emails.
a) Accuracy: $\text{Accuracy} = \frac{85 + 890}{1000} = \frac{975}{1000} = 97.5\%$
b) Precision: $\text{Precision} = \frac{85}{85 + 10} = \frac{85}{95} = 89.47\%$
c) Recall: $\text{Recall} = \frac{85}{85 + 15} = \frac{85}{100} = 85.0\%$
d) F1 Score: $F1 = 2 \times \frac{0.8947 \times 0.85}{0.8947 + 0.85} = 2 \times \frac{0.7605}{1.7447} = 87.18\%$
e) Precision is MORE important for spam detection
Reasoning: a false positive (a legitimate, possibly important email buried in the spam folder) costs the user far more than a false negative (one spam message reaching the inbox, a minor annoyance). High precision keeps false positives rare.
Trade-off: raising the decision threshold improves precision but lowers recall; for spam filtering, that trade is usually worth making.
Compare Supervised vs Unsupervised learning:
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | ? | ? |
| Data Requirements | ? | ? |
| Example Algorithms | ? | ? |
| Business Use Cases | ? | ? |
💡 Click to View Answer & Solution
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Definition | Learning from labeled data (input → known output) | Finding patterns in unlabeled data |
| Data Requirements | Needs labeled training data (expensive to create) | Only needs input data (no labels needed) |
| Example Algorithms | Decision Tree, Random Forest, SVM, Linear Regression, Naive Bayes | K-Means Clustering, Hierarchical Clustering, PCA, Association Rules |
| Business Use Cases | Spam detection, price prediction, customer churn, loan approval | Customer segmentation, market basket analysis, anomaly detection |
Key Difference: Supervised has a "teacher" (labels), unsupervised discovers structure on its own.
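A tiny k-means sketch shows the "no teacher" idea in code: the algorithm is never told which group a point belongs to, yet it recovers the structure on its own (the data and function name are made up for illustration):

```python
def kmeans_1d(points, k=2, iters=10):
    """Minimal 1-D k-means: alternate assigning points to the nearest
    centroid and moving each centroid to the mean of its cluster.
    No labels are involved; structure is discovered from the data alone."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, with no labels attached
data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans_1d(data))  # [1.5, 10.5]
```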
Explain the concept of overfitting:
a) What is overfitting?
b) How can you detect it?
c) List FOUR techniques to prevent overfitting
💡 Click to View Answer & Solution
a) What is Overfitting?
Model learns training data TOO well, including noise and random fluctuations.
Analogy: Student who memorizes test answers but can't solve new problems.
b) How to Detect Overfitting: compare training and validation performance. High training accuracy with much lower validation/test accuracy (a large gap between the two) is the telltale sign.
c) Four Techniques to Prevent Overfitting:
1. Cross-validation: evaluate on held-out folds rather than the training data alone
2. Regularization (L1/L2): penalize large coefficients to keep the model simple
3. Simplify the model: prune decision trees, reduce features or parameters
4. Get more training data: noise is harder to memorize with more examples
Bonus techniques: early stopping, dropout (for neural networks), and ensembling (e.g., Random Forest).
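The train/test-gap check can be demonstrated with NumPy: fitting a needlessly flexible degree-9 polynomial to noisy linear data drives the training error toward zero while the test error does not follow (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.2, size=x.size)   # truly linear + noise
x_train, y_train = x[::2], y[::2]             # even points -> train
x_test,  y_test  = x[1::2], y[1::2]           # odd points  -> test

def mse(deg):
    """Train and test mean squared error for a degree-`deg` polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, deg)
    pred_tr = np.polyval(coeffs, x_train)
    pred_te = np.polyval(coeffs, x_test)
    return ((pred_tr - y_train) ** 2).mean(), ((pred_te - y_test) ** 2).mean()

for deg in (1, 9):
    tr, te = mse(deg)
    print(f"degree {deg}: train MSE {tr:.4f}, test MSE {te:.4f}")
# The degree-9 fit memorizes the 10 training points almost exactly;
# the gap between its train and test error is the overfitting signal.
```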
A company wants to use ML for hiring decisions. The model was trained on historical hiring data (who was hired and who succeeded).
What ethical concerns should be considered?
💡 Click to View Answer & Solution
Ethical Concerns:
1. Historical bias: if past hiring favored certain groups, the model learns and perpetuates that preference
2. Proxy discrimination: features like zip code, school, or hobbies can stand in for protected attributes even when those attributes are excluded
3. Lack of transparency: rejected candidates may receive no meaningful explanation of the decision
4. Feedback loops: biased hires generate biased "success" labels for future retraining, reinforcing the problem
Recommendations: audit model outputs for disparate impact across groups, remove protected attributes and obvious proxies, prefer explainable models, and keep a human in the loop for final hiring decisions.
| Block | Topic | Points |
|---|---|---|
| Block 1 | Analytics Strategy | 25 |
| Block 2 | Data Quality | 25 |
| Block 3 | Model Evaluation | 25 |
| Block 4 | Advanced Concepts | 25 |
| Total | | 100 |
| Bonus | Ethics | +5 |
| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / Total |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) |
| Specificity | TN / (TN + FP) |
Confusion Matrix:
| | Predicted: Pos | Predicted: Neg |
|---|---|---|
| Actual: Pos | TP | FN |
| Actual: Neg | FP | TN |
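The cheat-sheet formulas can be sanity-checked in a few lines of Python, here with the spam confusion matrix from Block 3 (TP=85, FN=15, FP=10, TN=890):

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the cheat-sheet metrics from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Spam example: TP=85, FN=15, FP=10, TN=890
acc, prec, rec, f1, spec = classification_metrics(85, 15, 10, 890)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
# accuracy=0.9750 precision=0.8947 recall=0.8500 f1=0.8718
```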
Show all working for partial credit. Good luck!