10 Most Common Data Science Interview Questions with Answers and Code Examples

1. General Data Science Questions

Q1: What is Data Science? How is it different from Business Intelligence (BI) and Data Analytics?

Answer:
Data Science is an interdisciplinary field that combines statistics, programming, and domain knowledge to extract insights from data. It involves data cleaning, visualization, predictive modeling, and decision-making.

  • BI focuses on past data to generate reports and dashboards for decision-making.
  • Data Analytics focuses on examining datasets to answer specific questions, spanning descriptive, diagnostic, and predictive analysis.
  • Data Science integrates machine learning and AI for automation and prediction.

📌 Example:
Netflix uses Data Science to personalize movie recommendations using machine learning, whereas a BI tool generates a report showing the most-watched movies.


Q2: What is the CRISP-DM process in data science?

Answer:
CRISP-DM (Cross Industry Standard Process for Data Mining) consists of:

  1. Business Understanding – Define objectives
  2. Data Understanding – Collect and explore data
  3. Data Preparation – Clean and preprocess data
  4. Modeling – Apply machine learning algorithms
  5. Evaluation – Assess model performance
  6. Deployment – Implement the model

📌 Example:
In fraud detection for credit cards, we first collect transaction data, clean missing values, build a fraud detection model, test it, and deploy it to monitor real-time transactions.
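The fraud-detection workflow above can be sketched end-to-end with scikit-learn. This is a minimal sketch on synthetic data — the features, labels, and model choice are illustrative only, not a real fraud system:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data Understanding: synthetic "transactions" (features are made up)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))               # e.g. amount, hour, merchant score, ...
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)    # synthetic "fraud" label

# Data Preparation + Modeling, chained so preprocessing is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# Evaluation (Deployment would wrap pipe.predict behind a service)
acc = accuracy_score(y_test, pipe.predict(X_test))
print(f"Test accuracy: {acc:.2f}")
```

A Pipeline keeps the Data Preparation and Modeling phases in one object, so the exact same preprocessing is applied at deployment time.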


2. Statistics & Probability

Q3: What is the Central Limit Theorem (CLT)? Why is it important?

Answer:
The CLT states that the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population’s original distribution (provided it has finite variance). This is what justifies using normal-based confidence intervals and hypothesis tests on sample means.

📌 Example:
In A/B testing, even if individual user responses are skewed, the average conversion rate across many users follows a normal distribution.

Python Example:

import numpy as np
import matplotlib.pyplot as plt

sample_means = []
for _ in range(1000):
    # Each iteration: the mean of 100 draws from a skewed (exponential) distribution
    sample = np.random.exponential(scale=2, size=100)
    sample_means.append(np.mean(sample))

plt.hist(sample_means, bins=30, density=True)
plt.title("Demonstration of Central Limit Theorem")
plt.show()

3. Machine Learning

Q4: Explain Bias-Variance Tradeoff. How do you balance it?

Answer:

  • Bias: Error due to overly simplistic models (e.g., Linear Regression underfitting).
  • Variance: Error due to overly complex models (e.g., Deep Neural Networks overfitting).
  • Solution: Use cross-validation, feature selection, and regularization to balance bias and variance.

📌 Example:

  • A Decision Tree with deep nodes may have low bias but high variance.
  • A Random Forest can reduce variance by averaging multiple trees.

Python Example:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

dt = DecisionTreeRegressor(max_depth=20)  # High variance
rf = RandomForestRegressor(n_estimators=100)  # Reduced variance
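One way to see the tradeoff empirically — a sketch assuming scikit-learn and a synthetic regression dataset — is to cross-validate both models and compare their scores:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with noise, so overfitting is penalized
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

dt = DecisionTreeRegressor(max_depth=20, random_state=0)      # low bias, high variance
rf = RandomForestRegressor(n_estimators=100, random_state=0)  # variance reduced by averaging

dt_scores = cross_val_score(dt, X, y, cv=5)   # R^2 per fold
rf_scores = cross_val_score(rf, X, y, cv=5)

print(f"Decision Tree R^2: {dt_scores.mean():.2f} +/- {dt_scores.std():.2f}")
print(f"Random Forest R^2: {rf_scores.mean():.2f} +/- {rf_scores.std():.2f}")
```

On noisy data like this, the averaged ensemble typically scores higher and more consistently across folds than the single deep tree.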

Q5: Explain different evaluation metrics for classification models.

Answer:

  1. Accuracy = (TP + TN) / (TP + TN + FP + FN)
  2. Precision = TP / (TP + FP)
  3. Recall (Sensitivity) = TP / (TP + FN)
  4. F1-score = 2 * (Precision * Recall) / (Precision + Recall)
  5. AUC-ROC – Measures separability between classes.

📌 Example:
In fraud detection, accuracy might be misleading due to class imbalance. AUC-ROC is better for evaluating performance.

Python Example:

from sklearn.metrics import classification_report, roc_auc_score

y_true = [0, 1, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 1]

print(classification_report(y_true, y_pred))
# Note: in practice, pass predicted probabilities (e.g. model.predict_proba)
# to roc_auc_score; hard 0/1 labels collapse the ROC curve to a single point.
print("AUC-ROC:", roc_auc_score(y_true, y_pred))

4. Deep Learning & Neural Networks

Q6: What is a Convolutional Neural Network (CNN)?

Answer:
CNNs are neural networks designed for grid-like data such as images. Stacked convolutional layers learn spatial hierarchies of features — early layers detect edges, deeper layers detect textures and object parts.

📌 Example:
In medical imaging, CNNs can detect tumors in X-ray scans.

Python Example:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # learn 32 spatial filters
    layers.MaxPooling2D(2, 2),              # downsample feature maps
    layers.Flatten(),                       # flatten for the dense layers
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')   # binary output (e.g. tumor / no tumor)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

5. Python & Coding Challenges

Q7: Reverse a Linked List in Python.

class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def reverse_linked_list(head):
    prev, curr = None, head
    while curr:
        temp = curr.next   # save the next node
        curr.next = prev   # reverse the pointer
        prev = curr        # advance prev
        curr = temp        # advance curr
    return prev            # prev is the new head

6. SQL & Databases

Q8: Write an SQL query to find the second highest salary from the “employees” table.

SELECT MAX(salary) AS SecondHighestSalary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
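This query also handles ties at the top correctly, since the subquery excludes every row tied at the maximum salary. A quick sanity check using Python's built-in sqlite3 module — the table contents here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 120000), ("C", 120000), ("D", 75000)])

# Two employees tie at 120000; the query still returns the next salary down
row = conn.execute("""
    SELECT MAX(salary) AS SecondHighestSalary
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()
print(row[0])  # 90000
```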

7. Big Data & Cloud Computing

Q9: What is Apache Spark? Why is it faster than Hadoop?

Answer:
Apache Spark is a distributed computing framework that keeps intermediate data in memory across processing stages, whereas Hadoop’s MapReduce writes intermediate results to disk between stages. This makes Spark significantly faster, especially for iterative workloads such as machine learning.


8. Business Case & Scenario-Based Questions

Q10: How would you improve Netflix’s recommendation algorithm?

  • Use deep learning models (Autoencoders) for collaborative filtering.
  • Improve user embedding representations using Transformer models.
  • Optimize content recommendations based on watch-time instead of clicks.
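As a toy illustration of the collaborative-filtering idea behind such recommenders, here is a minimal matrix-factorization sketch in NumPy. The ratings matrix, latent rank, and learning rate are all made up, and production systems are far more sophisticated:

```python
import numpy as np

# Tiny synthetic user x movie ratings matrix (0 = unrated)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

rng = np.random.default_rng(0)
k, lr, reg = 2, 0.01, 0.02                       # latent rank, step size, L2 penalty
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # movie factors

for _ in range(5000):
    E = mask * (R - U @ V.T)         # error on observed ratings only
    U += lr * (E @ V - reg * U)
    V += lr * (E.T @ U - reg * V)

E = mask * (R - U @ V.T)
rmse = np.sqrt((E[mask] ** 2).mean())
print(f"RMSE on observed ratings: {rmse:.3f}")
# U @ V.T now also scores the unrated (0) cells, i.e. the recommendations
```

The autoencoder and Transformer approaches in the bullets above generalize this same idea: learn dense user and item representations whose interaction predicts preference.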
