1. General Data Science Questions
Q1: What is Data Science? How is it different from Business Intelligence (BI) and Data Analytics?
Answer:
Data Science is an interdisciplinary field that combines statistics, programming, and domain knowledge to extract insights from data. It involves data cleaning, visualization, predictive modeling, and decision-making.
- BI focuses on historical data, producing reports and dashboards that describe what has already happened.
- Data Analytics covers descriptive, diagnostic, and predictive analysis of data to answer specific questions.
- Data Science goes further, using machine learning and AI to build predictive models and automate decisions.
Example:
Netflix uses Data Science to personalize movie recommendations using machine learning, whereas a BI tool generates a report showing the most-watched movies.
Q2: What is the CRISP-DM process in data science?
Answer:
CRISP-DM (Cross-Industry Standard Process for Data Mining) consists of six phases:
- Business Understanding: Define objectives
- Data Understanding: Collect and explore data
- Data Preparation: Clean and preprocess data
- Modeling: Apply machine learning algorithms
- Evaluation: Assess model performance
- Deployment: Implement the model
Example:
In credit card fraud detection, we first define the goal (for example, reducing fraud losses), then collect and explore transaction data, handle missing values, build a fraud detection model, evaluate it, and deploy it to monitor transactions in real time.
2. Statistics & Probability
Q3: What is the Central Limit Theorem (CLT)? Why is it important?
Answer:
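The CLT states that the distribution of the sample mean of independent, identically distributed observations approaches a normal distribution as the sample size increases, regardless of the shape of the population’s original distribution, provided the population has finite variance. It is important because it justifies confidence intervals and hypothesis tests on means even when the underlying data are not normal.
Example:
In A/B testing, even if individual user responses are skewed, the average conversion rate across many users is approximately normally distributed.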
Python Example:
import numpy as np
import matplotlib.pyplot as plt

# Draw 1000 samples of size 100 from a skewed (exponential) distribution
# and record the mean of each sample.
sample_means = []
for _ in range(1000):
    sample = np.random.exponential(scale=2, size=100)
    sample_means.append(np.mean(sample))

# The histogram of the sample means is approximately bell-shaped.
plt.hist(sample_means, bins=30, density=True)
plt.title("Demonstration of Central Limit Theorem")
plt.show()
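As a rough numerical check that needs no plotting, the sketch below uses only the standard library. It assumes the same exponential population as above (mean 2, and therefore standard deviation 2) and compares the spread of the sample means with the value the CLT predicts, sigma / sqrt(n). The function name `simulate_sample_means`, the number of repetitions, and the seed are illustrative choices.

```python
import math
import random
import statistics

def simulate_sample_means(n_samples=2000, sample_size=100, mean=2.0, seed=0):
    """Return the means of n_samples exponential samples (hypothetical helper)."""
    rng = random.Random(seed)
    # expovariate takes the rate lambda = 1 / mean
    return [
        statistics.fmean(rng.expovariate(1.0 / mean) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

sample_means = simulate_sample_means()
observed_center = statistics.fmean(sample_means)
observed_spread = statistics.stdev(sample_means)
# For an exponential distribution the standard deviation equals the mean,
# so the CLT predicts a standard error of mean / sqrt(sample_size).
predicted_spread = 2.0 / math.sqrt(100)

print(f"mean of sample means: {observed_center:.3f} (population mean 2.0)")
print(f"std of sample means:  {observed_spread:.3f} (predicted {predicted_spread:.3f})")
```

The observed spread should land close to the predicted 0.2, and it shrinks as the sample size grows.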
3. Machine Learning
Q4: Explain Bias-Variance Tradeoff. How do you balance it?
Answer:
- Bias: Error from overly simplistic assumptions that miss real patterns in the data (e.g., Linear Regression underfitting a nonlinear relationship).
- Variance: Error from sensitivity to the particular training set, typical of overly complex models (e.g., Deep Neural Networks overfitting).
- Solution: Use cross-validation to detect underfitting or overfitting, and use feature selection, regularization, or ensembling to move model complexity toward the point where total error is lowest.
Example:
- A deep Decision Tree may have low bias but high variance.
- A Random Forest can reduce variance by averaging multiple trees.
Python Example:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
dt = DecisionTreeRegressor(max_depth=20) # High variance
rf = RandomForestRegressor(n_estimators=100) # Reduced variance
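To see the tradeoff in numbers, the sketch below fits both kinds of model to a synthetic dataset and compares training and test R². The data (a noisy sine curve), the train/test split, and the seeds are assumptions made for illustration. A large gap between the training and test scores is the usual sign of high variance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a sine curve plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "deep tree": DecisionTreeRegressor(max_depth=20, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns R^2 for regressors
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.2f}, test R^2 = {scores[name][1]:.2f}")
```

On data like this, the deep tree typically scores near 1.0 on the training set and noticeably lower on the test set, while the forest tends to narrow that gap.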
Q5: Explain different evaluation metrics for classification models.
Answer:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (Sensitivity) = TP / (TP + FN)
- F1-score = 2 * (Precision * Recall) / (Precision + Recall)
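- AUC-ROC: Measures how well the model's scores separate the classes across all decision thresholds.
Example:
In fraud detection, accuracy can be misleading because of class imbalance: a model that labels every transaction as legitimate is highly accurate yet catches no fraud. AUC-ROC, together with precision and recall, gives a better picture of performance.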
Python Example:
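from sklearn.metrics import classification_report, roc_auc_score

y_true = [0, 1, 1, 0, 1, 1, 0]
# Illustrative predicted probabilities for class 1 (assumed values)
y_score = [0.1, 0.9, 0.8, 0.3, 0.4, 0.7, 0.6]
# Hard labels obtained by thresholding the scores at 0.5
y_pred = [int(s >= 0.5) for s in y_score]
print(classification_report(y_true, y_pred))
# AUC-ROC should be computed from scores, not from hard labels
print("AUC-ROC:", roc_auc_score(y_true, y_score))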
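The formulas above can also be checked by hand. The sketch below counts TP, TN, FP, and FN for the same true and predicted labels as the scikit-learn example and applies each formula directly. The helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [0, 1, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)          # 3 2 1 1
print(round(accuracy, 3))      # 0.714
print(precision, recall, f1)   # 0.75 0.75 0.75
```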
4. Deep Learning & Neural Networks
Q6: What is a Convolutional Neural Network (CNN)?
Answer:
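CNNs are neural networks designed for grid-like data such as images. Convolutional layers apply small learned filters across the input to detect local patterns (edges, textures), and stacking layers builds up a spatial hierarchy of increasingly complex features.
Example: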
In medical imaging, CNNs can detect tumors in X-ray scans.
Python Example:
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),               # 64x64 RGB images
    layers.Conv2D(32, (3, 3), activation='relu'),  # 32 learned 3x3 filters
    layers.MaxPooling2D(pool_size=(2, 2)),         # Halve the spatial resolution
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')          # Binary output (e.g., tumor vs. no tumor)
])
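To make the convolution step concrete, here is a dependency-free sketch of what a single filter computes: the sliding-window operation that deep learning libraries call convolution, here without padding and with stride 1. The tiny image, the hand-written vertical-edge kernel, and the function name `conv2d_valid` are all illustrative. In a real CNN the kernel values are learned during training.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1), summing elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x6 image: dark on the left (0), bright on the right (1)
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
# Hand-written vertical-edge filter
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)  # [0, 3, 3, 0] for each of the 2 rows
```

The output is nonzero only where the window straddles the dark-to-bright boundary, which is how a filter acts as a local pattern detector.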
5. Python & Coding Challenges
Q7: Reverse a Linked List in Python.
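class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def reverse_linked_list(head):
    prev, curr = None, head
    while curr:
        temp = curr.next   # Remember the rest of the list
        curr.next = prev   # Point the current node backwards
        prev = curr        # Advance prev
        curr = temp        # Advance curr
    return prev            # prev is the new head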
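A quick way to check the function is to round-trip a Python list. The sketch below repeats the definitions so it runs on its own; the helpers `from_list` and `to_list` are hypothetical conveniences, not part of the question. The reversal visits each node once, so it takes O(n) time and O(1) extra space.

```python
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def reverse_linked_list(head):
    prev, curr = None, head
    while curr:
        temp = curr.next
        curr.next = prev
        prev = curr
        curr = temp
    return prev

def from_list(values):
    """Build a linked list from a Python list (hypothetical helper)."""
    head = None
    for v in reversed(values):
        head = ListNode(v, head)
    return head

def to_list(head):
    """Collect node values into a Python list (hypothetical helper)."""
    out = []
    while head:
        out.append(head.val)
        head = head.next
    return out

print(to_list(reverse_linked_list(from_list([1, 2, 3, 4]))))  # [4, 3, 2, 1]
print(to_list(reverse_linked_list(from_list([]))))            # []
```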
6. SQL & Databases
Q8: Write an SQL query to find the second highest salary from the “employees” table.
SELECT MAX(salary) AS SecondHighestSalary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
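The query can be tried out against an in-memory SQLite database using Python's standard `sqlite3` module. The table layout and sample salaries below are made up for illustration, and `second_highest` is a hypothetical helper. Two edge cases are worth noting: duplicate top salaries are handled correctly, and when no second distinct salary exists the query returns NULL (None in Python).

```python
import sqlite3

QUERY = """
SELECT MAX(salary) AS SecondHighestSalary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
"""

def second_highest(salaries):
    """Load salaries into a throwaway table and run the query (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees (salary) VALUES (?)", [(s,) for s in salaries]
    )
    (result,) = conn.execute(QUERY).fetchone()
    conn.close()
    return result

print(second_highest([100, 300, 200, 300]))  # 200
print(second_highest([300, 300]))            # None
```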
7. Big Data & Cloud Computing
Q9: What is Apache Spark? Why is it faster than Hadoop?
Answer:
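Apache Spark is a distributed computing framework that keeps intermediate data in memory across processing steps. This makes it faster than Hadoop's disk-based MapReduce, which writes intermediate results to disk between stages, especially for iterative workloads such as machine learning.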
8. Business Case & Scenario-Based Questions
Q10: How would you improve Netflix's recommendation algorithm?
- Use deep learning models (Autoencoders) for collaborative filtering.
- Improve user embedding representations using Transformer models.
- Optimize content recommendations based on watch-time instead of clicks.