50 most common Data Science interview questions and answers
🧠 GENERAL DATA SCIENCE
What is Data Science? It is an interdisciplinary field that combines statistics, computer science, and domain knowledge to extract insights from data.
Difference between Data Science, Machine Learning, and AI?
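AI: The broad field of building systems that perform tasks normally requiring human intelligence.
ML: A subset of AI in which models learn patterns from data instead of following explicitly programmed rules.
DS: Uses statistics, ML, and domain knowledge to analyze data and extract insights.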
What are the types of Data?
Structured (e.g., relational tables)
Unstructured (e.g., text, images, audio)
Semi-structured (e.g., JSON, XML)
What is the lifecycle of a data science project? Problem understanding → Data acquisition → Data preparation → Modeling → Evaluation → Deployment → Monitoring.
What is EDA (Exploratory Data Analysis)? It involves visual and quantitative techniques to understand data patterns, spot anomalies, and check assumptions.
What are features in a dataset? Independent variables or attributes used to predict the target variable.
What is a target variable? The variable we aim to predict (dependent variable).
What is feature engineering? Creating, transforming, or selecting variables to improve model performance.
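As a rough illustration of transforming variables, the sketch below rescales a numeric column to the 0-1 range and one-hot encodes a categorical column. The helper names `min_max_scale` and `one_hot` are hypothetical and chosen only for this example.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector, one position per distinct category."""
    categories = sorted(set(values))  # sorted so the column order is deterministic
    return [[1 if v == c else 0 for c in categories] for v in values]


ages = [10, 20, 30]
colors = ["red", "blue", "red"]
print(min_max_scale(ages))
print(one_hot(colors))
```

Libraries such as scikit-learn provide production-grade versions of both transformations; the plain-Python version here only shows the idea.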
What is data wrangling? The process of cleaning and unifying complex data sets for analysis.
What are outliers? Data points that differ significantly from other observations.
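One common way to flag outliers is the interquartile range (IQR) rule, sketched below. The function name `iqr_outliers` and the multiplier `k=1.5` are assumptions for illustration; the multiplier is a convention, not a fixed rule.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(data))
```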
📊 STATISTICS & PROBABILITY
What is the p-value? The probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true.
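A small worked example may help. Assuming a fair coin as the null hypothesis, the sketch below computes the one-sided p-value for seeing 9 or more heads in 10 flips. The function name `binomial_p_value` is hypothetical.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Probability of 9 or 10 heads in 10 flips of a fair coin: 11 / 1024.
p = binomial_p_value(successes=9, trials=10)
print(round(p, 4))
```

A small p-value means the observed result would be unusual if the null hypothesis were true; it is not the probability that the null hypothesis is true.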
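What is a confidence interval? A range of values, computed from sample data, that is used to estimate a population parameter. At a 95% confidence level, about 95% of intervals built this way from repeated samples would contain the true parameter.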
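The sketch below builds an approximate 95% interval for a mean. It assumes the sampling distribution of the mean is roughly normal and uses the critical value 1.96; for small samples a t-distribution critical value would usually be more appropriate. The function name `mean_confidence_interval` is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean using a normal critical value."""
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * std_err, mean + z * std_err


sample = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = mean_confidence_interval(sample)
print(low, high)
```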
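What is correlation? Measures the strength and direction of the linear relationship between two variables, on a scale from -1 (perfect negative) to +1 (perfect positive).
Difference between correlation and causation? Correlation is a statistical association between two variables; causation means a change in one variable produces a change in the other. Correlation alone does not establish causation.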
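For reference, the Pearson correlation coefficient can be computed directly from its definition. This is a minimal sketch; the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly increasing together
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly opposite
```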
What is variance? The average squared deviation from the mean.
What is standard deviation? Square root of variance; indicates data dispersion.
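Both quantities are available in Python's standard `statistics` module. The sketch below uses a small made-up data set and also shows the sample variance, which divides by n - 1 instead of n.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values; the mean is 5

pop_var = statistics.pvariance(data)    # population variance: divides by n
pop_std = statistics.pstdev(data)       # square root of the population variance
sample_var = statistics.variance(data)  # sample variance: divides by n - 1

print(pop_var, pop_std, sample_var)
```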
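What is the Central Limit Theorem (CLT)? The distribution of the means of independent samples drawn from a population with finite variance approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.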
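A quick simulation can make this concrete. The sketch below draws repeated samples from a skewed (exponential) distribution and checks that the sample means cluster around the population mean with a spread close to sigma / sqrt(n). The function name `sample_means`, the seed, and the sample sizes are arbitrary choices for illustration.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, seed=0):
    """Return the mean of each of n_samples independent samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


# Exponential distribution with rate 1: mean 1, standard deviation 1, heavily skewed.
means = sample_means(lambda rng: rng.expovariate(1.0), sample_size=50, n_samples=2000)

print(statistics.fmean(means))  # close to the population mean of 1
print(statistics.stdev(means))  # close to 1 / sqrt(50)
```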
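What is a null hypothesis? The default assumption that there is no effect or difference; a test looks for evidence against it.
What is hypothesis testing? A statistical method that uses sample data to decide whether there is enough evidence to reject the null hypothesis about a population parameter.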
What is A/B testing? A randomized experiment that compares two versions (A and B) on a chosen metric and uses a statistical test to determine which one performs better.
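One common way to analyze an A/B test on conversion rates is a two-proportion z-test, sketched below. The function name `two_proportion_z_test` and the conversion counts are made up for illustration, and the normal approximation assumes reasonably large samples.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 100/1000 conversions for A, 130/1000 for B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(z, p_value)
```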
📈 MACHINE LEARNING
What is the difference between supervised and unsupervised learning?
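Supervised: Learns from labeled data to predict a known target (e.g., classification, regression).
Unsupervised: Finds structure in unlabeled data (e.g., clustering, dimensionality reduction).
What is overfitting? The model learns noise in the training data, so it performs well on training data but poorly on unseen test data.
What is underfitting? The model is too simple to capture the underlying patterns, so it performs poorly even on the training data.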
How to prevent overfitting?
Cross-validation (to detect it and tune model complexity)
Pruning (for decision trees)
Regularization
More data
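What is regularization? A technique that adds a penalty on model complexity to the loss function, such as L1 (absolute value of the weights) or L2 (squared weights), to reduce overfitting.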
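To see the effect of an L2 penalty, consider the simplest possible case: a one-feature linear model with no intercept. Minimizing sum((y - w*x)^2) + lambda * w^2 gives w = sum(x*y) / (sum(x^2) + lambda), so a larger penalty shrinks the slope toward zero. The function name `fit_slope` is hypothetical.

```python
def fit_slope(xs, ys, l2_penalty=0.0):
    """Slope w minimizing sum((y - w*x)**2) + l2_penalty * w**2 (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + l2_penalty)


xs, ys = [1, 2, 3], [2, 4, 6]
print(fit_slope(xs, ys))                   # unpenalized least-squares slope
print(fit_slope(xs, ys, l2_penalty=14.0))  # penalized slope, shrunk toward zero
```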
What is bias-variance tradeoff? The balance between error from overly simple model assumptions (bias) and error from sensitivity to fluctuations in the training data (variance). Reducing one typically increases the other.
What is cross-validation? A method to evaluate model performance by partitioning the dataset into training and testing sets multiple times.
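The partitioning step of k-fold cross-validation can be sketched as follows. The function name `k_fold_indices` is hypothetical, and the folds here are contiguous; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, and the model is trained and scored once per fold; the scores are then averaged.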
What is a confusion matrix? A table to evaluate the performance of a classification model (TP, TN, FP, FN).
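The four cells can be counted directly from true and predicted labels. This is a minimal sketch for binary classification; the function name `confusion_counts` and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
c = confusion_counts(y_true, y_pred)

precision = c["TP"] / (c["TP"] + c["FP"])  # share of predicted positives that are correct
recall = c["TP"] / (c["TP"] + c["FN"])     # share of actual positives that are found
print(c, precision, recall)
```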