Question Answering System

A production-ready, extensible question answering system with multiple retrieval strategies, a REST API, and comprehensive documentation.

Features

  • Multiple QA Modes
    • TF-IDF based retrieval (fast, lightweight)
    • Semantic search with sentence transformers
    • Transformer-based QA models (BERT, RoBERTa)
  • Document Processing
    • Support for TXT, JSON, PDF, CSV files
    • Intelligent text chunking with overlap
    • Metadata preservation
  • REST API
    • Ask questions via HTTP
    • Index documents dynamically
    • Search and retrieve documents
    • System health monitoring
  • Production Ready
    • Comprehensive test suite
    • Performance optimization
    • Docker support
    • Configurable settings
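To make the TF-IDF mode listed above more concrete, here is a minimal sketch of TF-IDF retrieval using only the standard library. It is an illustration of the general technique, not the code in qa_system.py (which presumably builds on scikit-learn, given the minimal installation requirements). The function names `tokenize`, `build_index`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build IDF weights and TF-IDF vectors for a dict of {doc_id: text}."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    # Smoothed IDF, so terms that occur in every document keep a positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, idf, vectors, top_k=3):
    """Rank documents by cosine similarity to the query, best first."""
    # Query terms that never occur in the corpus carry no information.
    counts = Counter(term for term in tokenize(query) if term in idf)
    query_vec = {term: tf * idf[term] for term, tf in counts.items()}
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Semantic and transformer modes replace the sparse vectors with learned embeddings or a reader model, which is why they need the heavier dependencies listed under Installation.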

Installation

Basic Installation

bash

# Install core dependencies
pip install -r requirements.txt

Minimal Installation (TF-IDF only)

bash

pip install numpy scikit-learn flask flask-cors

Full Installation (All features)

bash

# Install with semantic search
pip install sentence-transformers torch

# Install with transformer QA
pip install transformers torch

# Install PDF support
pip install PyPDF2

Quick Start

1. Run Basic Example

python

from qa_system import QuestionAnsweringSystem

# Initialize
qa = QuestionAnsweringSystem(mode="tfidf")

# Add documents
documents = [
    {
        "id": "doc1",
        "text": "Python is a programming language created by Guido van Rossum in 1991.",
        "metadata": {"category": "programming"}
    }
]

qa.add_documents(documents)
qa.index_documents()

# Ask questions
answers = qa.answer("Who created Python?")
print(answers[0].text)

2. Start API Server

bash

# Basic server
python api_server.py

# With pre-loaded documents
python api_server.py --docs sample_documents.json

# Semantic search mode
python api_server.py --mode semantic --docs sample_documents.json

3. Index Documents

bash

# Index files from directory
python document_indexer.py --input ./my_docs --output index.json

# Index single file
python document_indexer.py --input document.pdf --output index.json
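As a rough idea of what the indexing step has to do, the sketch below turns TXT, JSON, and CSV files into the `{"id", "text", "metadata"}` records used throughout this README. It is not the code in document_indexer.py: the function name `load_documents`, the one-document-per-CSV-row rule, and the `source`/`row` metadata keys are all assumptions made for illustration. PDF is left out because it depends on PyPDF2.

```python
import csv
import json
from pathlib import Path


def load_documents(path):
    """Return a list of {"id", "text", "metadata"} records for one file."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # One document per text file, identified by the file name.
        text = path.read_text(encoding="utf-8")
        return [{"id": path.stem, "text": text, "metadata": {"source": path.name}}]

    if suffix == ".json":
        # Assumes a JSON array of records that already carry "id" and "text".
        records = json.loads(path.read_text(encoding="utf-8"))
        return [
            {"id": str(r["id"]), "text": r["text"], "metadata": r.get("metadata", {})}
            for r in records
        ]

    if suffix == ".csv":
        # Assumption: each row becomes one document, cells joined by spaces.
        with path.open(newline="", encoding="utf-8") as fh:
            return [
                {
                    "id": f"{path.stem}-{i}",
                    "text": " ".join(v for v in row.values() if isinstance(v, str) and v),
                    "metadata": {"source": path.name, "row": i},
                }
                for i, row in enumerate(csv.DictReader(fh))
            ]

    raise ValueError(f"Unsupported file type: {suffix or path.name}")
```

Records in this shape can be passed straight to `qa.add_documents(...)` as in the Quick Start example.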

Usage Examples

Example 1: Custom Documents

python

from qa_system import QuestionAnsweringSystem

qa = QuestionAnsweringSystem(mode="tfidf")

# Your documents
docs = [
    {"id": "1", "text": "Company XYZ was founded in 2010..."},
    {"id": "2", "text": "Our products include..."}
]

qa.add_documents(docs)
qa.index_documents()

# Get answers (a ranked list; the first entry is the best match)
answers = qa.answer("When was the company founded?")
print(f"Answer: {answers[0].text}")
print(f"Confidence: {answers[0].confidence:.2%}")

Example 2: Multiple Answers

python

# Get ranked answer candidates
answers = qa.answer(
    question="What are the main products?",
    top_k=5,
    return_multiple=True
)

for i, ans in enumerate(answers, 1):
    print(f"{i}. {ans.text} (confidence: {ans.confidence:.2%})")

Example 3: Document Search

python

# Retrieve relevant documents without extracting answers
results = qa.retrieve_documents("machine learning", top_k=3)

for doc, score in results:
    print(f"Document: {doc.id}")
    print(f"Score: {score:.4f}")
    print(f"Text: {doc.text[:100]}...\n")

Example 4: Save and Load

python

# Save trained system
qa.save("my_qa_system.pkl")

# Load later
qa = QuestionAnsweringSystem.load("my_qa_system.pkl")

API Documentation

Endpoints

POST /ask

Ask a question and get answers.

Request:

json

{
  "question": "What is Python?",
  "top_k": 3,
  "return_multiple": false
}

Response:

json

{
  "question": "What is Python?",
  "answers": [
    {
      "text": "Python is a programming language...",
      "confidence": 0.92,
      "source": "doc1",
      "context": "..."
    }
  ]
}
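On the client side, the response above can be reduced to a single usable answer. The helper below is a small sketch that relies only on the `answers` and `confidence` fields shown in the response; the function name `best_answer` is hypothetical, and the 0.3 default simply mirrors `min_confidence` from the Configuration section.

```python
def best_answer(response_body, min_confidence=0.3):
    """Return the highest-confidence answer dict, or None if none qualifies."""
    answers = response_body.get("answers", [])
    # Drop candidates the caller would not want to show to a user.
    confident = [a for a in answers if a.get("confidence", 0.0) >= min_confidence]
    if not confident:
        return None
    return max(confident, key=lambda a: a["confidence"])
```

Returning `None` instead of a low-confidence guess lets the caller fall back to a "no answer found" message.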

POST /index

Index new documents.

Request:

json

{
  "documents": [
    {
      "id": "doc1",
      "text": "Document content...",
      "metadata": {}
    }
  ]
}

POST /search

Search for relevant documents.

Request:

json

{
  "query": "machine learning",
  "top_k": 5
}

GET /health

Check system health.

GET /stats

Get system statistics.

Example API Usage

python

import requests

# Ask question
response = requests.post(
    "http://localhost:5000/ask",
    json={"question": "What is Python?"}
)
print(response.json())

# Index documents
requests.post(
    "http://localhost:5000/index",
    json={"documents": [{"id": "1", "text": "..."}]}
)

Configuration

Edit config.py to customize:

python

QA_CONFIG = {
    'default_mode': 'tfidf',  # or 'semantic', 'transformer'
    'top_k_documents': 3,
    'min_confidence': 0.3,
    'chunk_size': 500,
    'chunk_overlap': 50,
}
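The `chunk_size` and `chunk_overlap` settings control how long documents are split before indexing. The sketch below shows one common way to do overlapping chunking. It assumes both values are measured in characters, which this README does not state, and `chunk_text` is a hypothetical name rather than a function from this project.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing duplicate.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults above, a 1,200-character document yields three chunks starting at offsets 0, 450, and 900. The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk.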

Or use environment variables:

bash

export QA_MODE=semantic
export TOP_K=5
export CHUNK_SIZE=300
python api_server.py
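One plausible way for environment variables to override the defaults in `QA_CONFIG` is sketched below. The mapping from `QA_MODE`, `TOP_K`, and `CHUNK_SIZE` to config keys is an assumption based on their names, and `load_config` is a hypothetical helper; check config.py for the actual behavior.

```python
import os

DEFAULTS = {
    "default_mode": "tfidf",
    "top_k_documents": 3,
    "min_confidence": 0.3,
    "chunk_size": 500,
    "chunk_overlap": 50,
}

# Assumed mapping: environment variable -> (config key, type converter).
ENV_OVERRIDES = {
    "QA_MODE": ("default_mode", str),
    "TOP_K": ("top_k_documents", int),
    "CHUNK_SIZE": ("chunk_size", int),
}


def load_config(environ=None):
    """Return the defaults with any recognized environment overrides applied."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)
    for env_name, (key, convert) in ENV_OVERRIDES.items():
        raw = environ.get(env_name)
        if raw is not None:
            # Environment values are always strings, so convert explicitly.
            config[key] = convert(raw)
    return config
```

Passing a plain dict as `environ` makes the override logic easy to test without touching the real process environment.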

Testing

Run the test suite:

bash

# Run all tests
pytest test_qa.py -v

# Run with coverage
pytest test_qa.py --cov=qa_system --cov=document_indexer

# Run specific test
pytest test_qa.py::TestQuestionAnsweringSystem::test_answer_question

Docker Deployment

Build Image

bash

docker build -t qa-system .

Run Container

bash

docker run -p 5000:5000 -v "$(pwd)/data:/data" qa-system

Docker Compose

yaml

version: '3.8'
services:
  qa-api:
    build: .
    ports:
      - "5000:5000"
    environment:
      - QA_MODE=semantic
      - TOP_K=3
    volumes:
      - ./data:/data

Performance

Benchmarks

Mode         Questions/sec  Latency (p95)  Memory  Accuracy
TF-IDF       20-30          500 ms         2 GB    Good
Semantic     10-15          1 s            4 GB    Better
Transformer  5-10           2 s            8 GB    Best
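Figures like these depend heavily on hardware and corpus size, so it is worth measuring on your own data. The sketch below times a sequence of questions against any callable and reports throughput and p95 latency. The `benchmark` helper is hypothetical, measures a single sequential client only, and says nothing about how the numbers in the table were obtained.

```python
import statistics
import time


def benchmark(answer_fn, questions):
    """Time answer_fn over questions; needs at least two questions."""
    latencies = []
    started = time.perf_counter()
    for question in questions:
        t0 = time.perf_counter()
        answer_fn(question)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies, n=100)[94]
    return {
        "questions_per_sec": len(questions) / elapsed,
        "p95_latency_s": p95,
    }
```

For example, `benchmark(qa.answer, sample_questions)` would time the in-process system, where `sample_questions` is a list of representative questions you supply.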

Optimization Tips

  1. For Speed:
    • Use TF-IDF mode
    • Reduce chunk size
    • Enable caching
    • Use GPU for transformers
  2. For Accuracy:
    • Use transformer mode
    • Increase chunk overlap
    • Get multiple answers
    • Fine-tune on domain data
  3. For Scale:
    • Use vector databases (FAISS)
    • Implement caching layer
    • Deploy multiple instances
    • Use async processing
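The "enable caching" tip can be as simple as memoizing answers for repeated questions. The wrapper below is a minimal in-process sketch built on `functools.lru_cache`; `make_cached` is a hypothetical helper, not part of this project, and a multi-instance deployment would need a shared cache instead.

```python
from functools import lru_cache


def make_cached(answer_fn, maxsize=1024):
    """Wrap answer_fn so that repeated questions are served from memory."""

    @lru_cache(maxsize=maxsize)
    def cached(normalized_question):
        return answer_fn(normalized_question)

    def ask(question):
        # Normalize case and whitespace so near-identical questions share an entry.
        return cached(" ".join(question.lower().split()))

    # Expose the cache controls so callers can inspect or reset the cache.
    ask.cache_info = cached.cache_info
    ask.cache_clear = cached.cache_clear
    return ask
```

Call `ask.cache_clear()` whenever documents are re-indexed, since cached answers would otherwise reflect the old index. Treat cached results as read-only, because every caller receives the same object.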

Project Structure

qa-system/
├── QA_SOLUTION_ARTICLE.md    # Comprehensive guide
├── QUICKSTART.md              # Quick start guide
├── README.md                  # This file
├── qa_system.py               # Core QA implementation
├── document_indexer.py        # Document processing
├── api_server.py              # REST API server
├── config.py                  # Configuration
├── requirements.txt           # Dependencies
├── test_qa.py                 # Test suite
├── example_usage.py           # Usage examples
├── sample_documents.json      # Sample data
└── Dockerfile                 # Docker config

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Ensure all tests pass
  5. Submit a pull request

License

This project is provided as-is for educational and commercial use.

Support

  • Documentation: See QA_SOLUTION_ARTICLE.md for the comprehensive guide
  • Examples: Run python example_usage.py
  • Tests: Check test_qa.py for usage patterns
  • Issues: Report bugs and request features

Use Cases

  • Customer Support: Automated FAQ systems
  • Knowledge Management: Enterprise knowledge bases
  • Document Search: Legal, medical, technical documents
  • Education: Interactive learning platforms
  • Research: Academic paper Q&A
  • E-commerce: Product information retrieval
