🧠 General Data Science Questions
1. What is Data Science?
Answer:
Data Science is a multidisciplinary field that uses statistics, computer science, machine learning, and domain knowledge to extract insights and knowledge from structured and unstructured data.
2. What are the steps in a Data Science project?
Answer:
- Problem Definition
- Data Collection
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Building
- Model Evaluation
- Deployment
- Monitoring and Maintenance
3. What is the difference between Supervised and Unsupervised Learning?
Answer:
| Feature | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labeled Data | Yes | No |
| Output | Predicts outcomes | Finds patterns/groupings |
| Examples | Regression, Classification | Clustering, Dimensionality Reduction |
📊 Statistics & Probability
4. What is a p-value?
Answer:
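The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A small p-value (conventionally below 0.05) is treated as evidence against the null hypothesis. It is not the probability that the null hypothesis is true.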
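As a rough illustration, the sketch below computes an exact two-sided p-value for a coin-flip experiment using only the standard library. The function name `binomial_two_sided_p_value` and the 60-heads-in-100-flips scenario are hypothetical, chosen just for this example; the null hypothesis assumed here is a fair coin.

```python
from math import comb


def binomial_two_sided_p_value(heads, flips, p_null=0.5):
    """Exact two-sided p-value for a binomial test (illustrative helper).

    Adds up the probability, under the null hypothesis, of every outcome
    that is no more likely than the one actually observed.
    """
    def pmf(k):
        return comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # The small tolerance guards against floating-point ties.
    return sum(
        pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9)
    )


# Hypothetical experiment: 60 heads in 100 flips of a supposedly fair coin.
p_value = binomial_two_sided_p_value(heads=60, flips=100)
print(f"p-value = {p_value:.4f}")
```

With these numbers the p-value lands slightly above 0.05, so at the conventional threshold the fair-coin hypothesis would not be rejected, even though 60 heads may look lopsided.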
5. What is the Central Limit Theorem?
Answer:
It states that the distribution of the sample mean of independent, identically distributed observations with finite variance approaches a normal distribution as the sample size grows, regardless of the shape of the original distribution.
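A small simulation can make this concrete. The sketch below (standard library only; the helper name `sample_means`, the seed, and the choice of an exponential distribution are assumptions made for illustration) draws repeated samples from a skewed distribution and checks that the spread of the sample means shrinks roughly like 1/sqrt(n), as the theorem predicts.

```python
import random
import statistics


def sample_means(sample_size, n_samples, seed=0):
    """Draw n_samples samples from Exponential(1); return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


# Exponential(1) has mean 1 and standard deviation 1, so the standard
# deviation of the sample mean should be close to 1 / sqrt(n).
for n in (2, 30, 200):
    means = sample_means(sample_size=n, n_samples=2000)
    print(
        f"n={n:>3}  mean of means={statistics.fmean(means):.3f}  "
        f"std of means={statistics.stdev(means):.3f}  "
        f"theory={1 / n ** 0.5:.3f}"
    )
```

Plotting a histogram of `means` for each `n` would also show the shape becoming more symmetric and bell-like as `n` grows.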
📈 Machine Learning
6. What is overfitting and how to avoid it?
Answer:
Overfitting is when a model performs well on training data but poorly on unseen data. It can be avoided using:
- Cross-validation
- Pruning (in decision trees)
- Regularization (L1/L2)
- Reducing model complexity
- More training data
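The gap between training and held-out performance can be seen in a toy setting. The sketch below is not tied to any particular library: it uses a hand-written k-nearest-neighbours classifier (chosen here only for illustration) on synthetic, noisy labels. All names (`make_data`, `knn_predict`, `accuracy`) and the 20% label noise are hypothetical.

```python
import random


def make_data(n, rng, noise=0.2):
    """Points on [0, 1]; the true label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, data, k):
    """Fraction of points in `data` classified correctly by k-NN fitted on `train`."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


rng = random.Random(42)
train, test = make_data(200, rng), make_data(200, rng)

# k=1 memorises the training set (including its noisy labels);
# a larger k is a simpler, smoother model.
for k in (1, 15):
    print(
        f"k={k:>2}  train acc={accuracy(train, train, k):.2f}  "
        f"test acc={accuracy(train, test, k):.2f}"
    )
```

With k=1 the model scores perfectly on the training data but noticeably worse on the held-out set, which is the signature of overfitting. The smoother k=15 model typically gives up training accuracy in exchange for better held-out accuracy.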
7. What are precision, recall, and F1-score?
Answer:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1-score = 2 × (Precision × Recall) / (Precision + Recall)
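Here TP, FP, and FN are the counts of true positives, false positives, and false negatives. These metrics are used to evaluate classification models, especially on imbalanced datasets, where accuracy alone can be misleading.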
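The formulas above can be sketched in plain Python as follows. The function name `precision_recall_f1` and the sample labels are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not something the formulas themselves dictate.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3, recall is 1/2, and F1 is 4/7.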
🛠️ Technical Skills
8. What libraries do you use in Python for data science?
Answer:
- NumPy, Pandas: Data manipulation
- Matplotlib, Seaborn, Plotly: Visualization
- Scikit-learn: Machine learning
- TensorFlow, PyTorch: Deep learning
- NLTK, SpaCy: NLP
- Statsmodels: Statistical modeling
9. What is the difference between apply() and map() in Pandas?
Answer:
- map() is used for element-wise operations on Series.
- apply() is used for applying a function along an axis (row/column) in DataFrames or Series.
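A short sketch of the difference, assuming pandas is installed. The DataFrame and its column names (`price`, `qty`) are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Series.map: transform each element; accepts a function or a dict lookup.
doubled = df["price"].map(lambda p: p * 2)
size = df["qty"].map({1: "small", 2: "medium", 3: "large"})

# DataFrame.apply with axis=1: the function receives one row at a time.
total = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# DataFrame.apply with axis=0 (the default): the function receives one column at a time.
col_max = df.apply(lambda col: col.max())

print(doubled.tolist(), size.tolist(), total.tolist(), col_max.to_dict())
```

For simple arithmetic like this, vectorized expressions such as `df["price"] * df["qty"]` are usually faster than `apply` with `axis=1`; the row-wise form is shown only to illustrate the API.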
💾 SQL & Data Handling
10. How do you handle missing data?
Answer:
- Remove missing data (if minimal)
- Impute with mean/median/mode
- Use algorithms that support missing values (like XGBoost)
- Use interpolation or forward/backward fill techniques
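Several of these options can be sketched with pandas, assuming it is installed. The `temp` column and its values are made up for illustration; which strategy is appropriate depends on why the data is missing.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({"temp": [20.0, nan, 24.0, nan, 30.0]})

# Remove rows that contain missing values.
dropped = df.dropna()

# Impute with a summary statistic (the median of the observed values).
median_filled = df["temp"].fillna(df["temp"].median())

# Forward fill: carry the last observed value forward.
forward_filled = df["temp"].ffill()

# Linear interpolation between neighbouring observed values.
interpolated = df["temp"].interpolate()

print(len(dropped), median_filled.tolist())
print(forward_filled.tolist(), interpolated.tolist())
```

Forward fill and interpolation assume the rows are ordered meaningfully (for example, a time series); on unordered data, imputing with a statistic is usually the safer choice.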