50 most common Data Science interview questions and answers

🧠 GENERAL DATA SCIENCE

What is Data Science?
It is an interdisciplinary field that combines statistics, computer science, and domain knowledge to extract insights from data.
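Difference between Data Science, Machine Learning, and AI?
AI is the broad goal of building systems that perform tasks normally requiring human intelligence; Machine Learning is a subset of AI in which models learn patterns from data instead of following hand-written rules; Data Science is the wider practice of extracting insights from data, using ML alongside statistics, data engineering, and visualization.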
What is the lifecycle of a data science project?
Problem understanding → Data acquisition → Data preparation → Modeling → Evaluation → Deployment → Monitoring.
What is EDA (Exploratory Data Analysis)?
It involves visual and quantitative techniques to understand data patterns, spot anomalies, and check assumptions.
What are features in a dataset?
Independent variables or attributes used to predict the target variable.
What is a target variable?
The variable we aim to predict (dependent variable).
What is feature engineering?
Creating, transforming, or selecting variables to improve model performance.
What is data wrangling?
The process of cleaning and unifying complex data sets for analysis.
What are outliers?
Data points that differ significantly from other observations.
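One common way to flag outliers is the interquartile range (IQR) rule. The sketch below is illustrative only: the function name find_outliers_iqr, the 1.5 multiplier, and the sample data are assumptions, not part of the answer above.

```python
import statistics

def find_outliers_iqr(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sample
outliers = find_outliers_iqr(data)
```

Whether a flagged point should be removed, capped, or kept depends on the domain; the rule only identifies candidates.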

What is the p-value?
The probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true.
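As a small worked case, the sketch below computes an exact two-sided p-value for a coin-flip experiment under the null hypothesis that the coin is fair. The function name binomial_two_sided_p and the 9-heads-in-10-flips scenario are made up for illustration.

```python
from math import comb

def binomial_two_sided_p(heads, flips):
    """Exact two-sided p-value for a fair-coin null hypothesis."""
    # Distance of the observed count from the count expected under H0.
    observed_dev = abs(heads - flips / 2)
    # Count every outcome at least that far from the expected count.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_dev
    )
    return extreme / 2 ** flips

p_value = binomial_two_sided_p(heads=9, flips=10)  # 22 / 1024
```

Here the extreme outcomes are 0, 1, 9, and 10 heads, so the p-value is 22/1024, roughly 0.021.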
What is a confidence interval?
A range of values used to estimate a population parameter with a certain confidence level (e.g., 95%).
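A minimal sketch of a confidence interval for a mean, assuming the normal approximation is acceptable (for small samples a t-distribution is usually preferred). The function name and the sample values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

sample = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]  # hypothetical data
low, high = mean_confidence_interval(sample)
```

Raising the confidence level widens the interval, since a larger critical value multiplies the same standard error.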
What is correlation?
Measures the strength and direction of the linear relationship between two variables; the Pearson coefficient ranges from -1 to +1.
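The Pearson coefficient can be computed directly from its definition. This is a plain-Python sketch for illustration; the function name pearson_r is an assumption, and in practice a library routine would normally be used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_up = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # perfectly increasing
r_down = pearson_r([1, 2, 3], [3, 2, 1])             # perfectly decreasing
```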
Difference between correlation and causation?
Correlation means two variables tend to move together; causation means a change in one variable produces a change in the other. Correlation alone does not establish causation.
What is variance?
The average squared deviation from the mean.
What is standard deviation?
Square root of variance; indicates data dispersion.
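Both quantities can be checked on a small example with the standard library. The data below is made up; note that pvariance and pstdev divide by n (population formulas), while statistics.variance and statistics.stdev divide by n - 1 (sample formulas).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

mean = statistics.fmean(data)
variance = statistics.pvariance(data)  # average squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

# The same variance computed by hand from the definition.
manual_variance = sum((x - mean) ** 2 for x in data) / len(data)
```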
What is the Central Limit Theorem (CLT)?
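The distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided it has finite variance).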
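A quick simulation can make the theorem concrete. The sketch below draws from a skewed exponential distribution with mean 1 and standard deviation 1; the sample size of 50, the 2000 repetitions, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(50) for _ in range(2000)]
mean_of_means = statistics.fmean(means)    # should sit near 1.0
spread_of_means = statistics.stdev(means)  # should sit near 1 / sqrt(50)
```

Although individual draws are heavily skewed, the sample means cluster around the population mean with a spread close to sigma / sqrt(n).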
What is a null hypothesis?
A general statement that there is no effect or difference.
What is hypothesis testing?
A statistical procedure that uses sample data to decide whether there is enough evidence to reject the null hypothesis about a population parameter.
What is A/B testing?
A statistical method to compare two versions (A & B) and determine which one performs better.
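One possible way to decide whether version B really outperforms version A is a two-proportion z-test on conversion rates. The sketch below assumes large samples and independent visitors; the function name and the conversion counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: A converts 200 of 2000, B converts 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
```

With these made-up numbers the p-value falls below 0.05, so the difference would usually be called statistically significant.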

What is overfitting?
Model performs well on training data but poorly on test data.
What is underfitting?
Model fails to capture patterns in training data.
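How to prevent overfitting?
Use more training data, simplify the model, apply regularization, validate with cross-validation, stop training early, or use techniques such as dropout or pruning where applicable.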
What is regularization?
Technique to penalize model complexity (e.g., L1/L2).
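The shrinking effect of an L2 penalty is easy to see in the simplest case: a single feature with no intercept, where ridge regression has the closed form w = sum(x*y) / (sum(x^2) + lambda). The function name and data below are assumptions made for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form L2-regularized slope for one feature and no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)

xs, ys = [1, 2, 3], [2, 4, 6]              # hypothetical data with y = 2x
unregularized = ridge_slope(xs, ys, lam=0)   # recovers the true slope
regularized = ridge_slope(xs, ys, lam=14)    # penalty pulls the slope toward 0
```

A larger lambda trades some fit on the training data for a smaller, more stable coefficient.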
What is bias-variance tradeoff?
The balance between error from overly simple assumptions (bias) and error from sensitivity to fluctuations in the training data (variance); reducing one typically increases the other.
What is cross-validation?
A method to evaluate model performance by partitioning the dataset into training and testing sets multiple times.
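The partitioning step of k-fold cross-validation can be sketched in a few lines. This version does not shuffle and is meant only to show how each sample lands in exactly one test fold; the function name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
```

A model would be trained on each train split and scored on the matching test split, and the k scores averaged.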
What is a confusion matrix?
A table to evaluate the performance of a classification model (TP, TN, FP, FN).
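The four cells can be counted directly from true and predicted labels, and the usual classification metrics follow from them. The labels below are invented, and the function name confusion_counts is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["TN" if actual == 0 else "FN"] += 1
    return counts

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predictions
c = confusion_counts(y_true, y_pred)

precision = c["TP"] / (c["TP"] + c["FP"])
recall = c["TP"] / (c["TP"] + c["FN"])
f1 = 2 * precision * recall / (precision + recall)
```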
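Explain precision, recall, and F1-score.
Precision = TP / (TP + FP), the share of predicted positives that are correct. Recall = TP / (TP + FN), the share of actual positives that are found. F1-score is the harmonic mean of the two: 2 * precision * recall / (precision + recall).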

What is logistic regression?
A linear model for binary classification that passes a weighted sum of the features through the sigmoid function to output a probability between 0 and 1.
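The prediction step of a logistic regression model can be sketched as follows. The weights and bias here are arbitrary numbers rather than fitted values, and the 0.5 decision threshold is a common default, not a requirement.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical parameters chosen so the linear score z is exactly 0.
prob = predict_proba([2.0, 1.0], weights=[0.5, -1.0], bias=0.0)
label = int(prob >= 0.5)
```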
What is a decision tree?
A flowchart-like model that splits data on feature values at each node and makes a prediction at each leaf; used for both classification and regression.
What is Random Forest?
An ensemble of decision trees for improved accuracy and reduced overfitting.
What is Gradient Boosting?
Sequential ensemble technique that builds models incrementally to correct errors of previous ones.
What is KNN (K-Nearest Neighbors)?
A non-parametric method that classifies a point based on majority vote of its neighbors.
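A minimal KNN classifier, assuming Euclidean distance and a plain majority vote. The points, labels, and function name are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # hypothetical data
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, query=(2, 2))
```

There is no training step: every prediction scans the stored data, which is why KNN can become slow on large datasets.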
What is PCA (Principal Component Analysis)?
A technique to reduce dimensionality by projecting data onto principal components, the orthogonal directions that capture the most variance.
What is clustering?
Unsupervised technique to group similar data points (e.g., K-Means).
What is K-Means?
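A clustering algorithm that partitions data into K clusters by assigning each point to its nearest centroid and then recomputing the centroids, repeating until the assignments stabilize.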
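The assign-then-update loop of K-Means (often called Lloyd's algorithm) can be sketched in plain Python. The starting centroids are passed in explicitly to keep the example deterministic; real implementations usually pick them randomly or with a scheme such as k-means++. All names and data are illustrative.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # hypothetical data
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```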
What is an ROC curve?
A graph that plots the true positive rate against the false positive rate of a classification model across all decision thresholds.
What is AUC – ROC?
Area under the ROC curve; it ranges from 0 to 1, where 0.5 corresponds to random guessing and higher values indicate better model performance.
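AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise interpretation, which is fine for small inputs; the function name and data are made up.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

auc = roc_auc(labels=[0, 0, 1, 1], scores=[0.1, 0.4, 0.35, 0.8])
```

Three of the four positive/negative pairs are ordered correctly here, so the AUC is 0.75.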

What is the difference between NumPy and Pandas?
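NumPy provides fast n-dimensional arrays of homogeneous numerical data; Pandas is built on NumPy and provides labeled tabular structures (Series and DataFrame) that can hold mixed column types.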
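What is the difference between .iloc[] and .loc[] in Pandas?
.loc[] selects by label (index and column names) and includes the end of a slice; .iloc[] selects by integer position and excludes the end of a slice.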
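What is the difference between map(), apply(), and applymap() in Pandas?
map() applies a function or mapping element-wise to a Series; apply() applies a function along an axis of a DataFrame (each row or column) or to each value of a Series; applymap() applies a function element-wise to every cell of a DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map()).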
What is the use of groupby() in Pandas?
To split data into groups, apply a function, and combine results.
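A small split-apply-combine example, assuming pandas is installed. The column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],   # hypothetical grouping key
    "score": [1, 3, 2, 4, 6],
})

# Split rows by team, apply mean() to each group's scores, combine into a Series.
mean_by_team = df.groupby("team")["score"].mean()
```

The result is a Series indexed by the group key, so mean_by_team["a"] gives the average score for team "a".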
What is serialization in Python?
Converting a data structure to a format that can be saved and loaded (e.g., using pickle or joblib).
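A round trip with pickle shows the idea. The dictionary below stands in for any Python object, such as fitted model parameters; it is hypothetical. Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.

```python
import pickle

model_params = {"weights": [0.5, -1.0], "bias": 0.1}  # hypothetical object

blob = pickle.dumps(model_params)  # serialize to bytes
restored = pickle.loads(blob)      # deserialize into a new, equal object
```

To persist to disk, pickle.dump(obj, file) and pickle.load(file) do the same with a file opened in binary mode.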
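How do you handle missing data in Pandas?
Detect it with isna() or isnull(), then either drop it with dropna(), fill it with fillna() (for example with a constant, the mean, or the median), propagate neighboring values with ffill() or bfill(), or estimate it with interpolate().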
What are lambda functions in Python?
Anonymous functions defined with the lambda keyword.
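Lambdas are most often passed as short throwaway arguments to other functions. Two typical uses, with made-up data:

```python
records = [("alice", 31), ("bob", 25), ("carol", 28)]  # hypothetical rows

# Sort by the second field (age) using a lambda as the key function.
by_age = sorted(records, key=lambda rec: rec[1])

# Apply a one-line transformation to every element.
squares = list(map(lambda x: x * x, [1, 2, 3]))
```

For anything longer than a single expression, a named function defined with def is usually clearer.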
What is vectorization in NumPy?
Applying operations to whole arrays at once instead of writing explicit Python loops; the work runs in optimized compiled code, which is usually much faster.
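A side-by-side comparison, assuming NumPy is installed. Both versions compute the same result; the array size and the arithmetic are arbitrary.

```python
import numpy as np

values = np.arange(1_000)

# Explicit Python loop: one interpreted operation per element.
looped = np.array([v * 2 + 1 for v in values])

# Vectorized: the same arithmetic expressed on the whole array at once.
vectorized = values * 2 + 1
```

On large arrays the vectorized form is typically far faster, because the per-element work happens inside NumPy rather than in the Python interpreter.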
What is the difference between shallow copy and deep copy in Python?