Portfolio

Here's an overview of some of my past projects, publications, and interests.

Projects

An Agentic Summarizer

Overview: This project consists of multiple agents that generate and summarize text based on a prompt.

Tech Stack: Python, LangChain, OpenAI APIs, GCP.

Role: AI Engineer

Highlights:

  • A practical demonstration of MCP architecture
  • Building agents to perform tasks instead of relying on traditional object-oriented programming

Link: https://github.com/frontEndDoctor/Multi-Agents


SPAM Detector

Overview:

Objective

The goal is to implement a spam detector based on pre-trained transformer models, using the Hugging Face platform and other language models.

1. Dataset

The dataset is the SMS Spam Collection from the Hugging Face datasets library. According to the dataset card and related research (e.g., the original paper by Almeida, Tiago & Hidalgo, Marcos & Silva, André), the data was collected from various sources, including:

  • Gratuitous SMS messages forwarded by users.

  • SMS messages collected from research projects.

  • Publicly available SMS message datasets.

The models were evaluated on the following metrics:

  • Accuracy: The overall percentage of correctly classified messages.

  • Precision (for Spam): Out of all messages predicted as spam, what proportion were actually spam?

  • Recall (for Spam): Out of all actual spam messages, what proportion were correctly identified as spam?

  • F1-Score (for Spam): The harmonic mean of precision and recall, providing a balanced measure of the model's performance on the positive class.
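
The four metrics above can be computed directly from true and predicted labels. The following is a minimal sketch in plain Python, not the evaluation code used in the project; the function name `spam_metrics` and the toy labels are illustrative assumptions.

```python
def spam_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive (spam) class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))

    # Confusion-matrix counts with spam as the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up for illustration): 0 = ham, 1 = spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0]
metrics = spam_metrics(y_true, y_pred)
```

On the toy labels there are 2 true positives, 1 false positive, and 1 false negative, so accuracy is 0.75 and precision, recall, and F1 are each 2/3.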

Although the dataset contained only a training split, it was further divided into training and test splits. The sizes of these splits and the class distribution within them are shown in the table below:

Split | Total Samples | Ham (0) | Spam (1)
------+---------------+---------+---------
Train | 4457          | 3858    | 599
Test  | 1115          | 965     | 150
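
The counts in the table are consistent with an 80/20 split that preserves the class ratio. The sketch below shows one way such a stratified split could be done in plain Python. The 0.2 test fraction, the per-class rounding, and the function name `stratified_split` are assumptions, since the exact splitting procedure is not stated here. The labels are synthetic and only reuse the class totals from the table.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class ratio in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with the class totals from the table:
# 3858 + 965 = 4823 ham (0) and 599 + 150 = 749 spam (1).
labels = [0] * 4823 + [1] * 749
train_idx, test_idx = stratified_split(labels)

train_spam = sum(labels[i] for i in train_idx)
test_spam = sum(labels[i] for i in test_idx)
```

Under these assumptions the sketch yields 4457 training and 1115 test samples with 599 and 150 spam messages respectively, so spam makes up about 13.4% of each split.
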
2. Fine-tuning the Models

For fine-tuning, I selected two Transformer-based language models from the Hugging Face Model Hub:

  • DistilBERT (distilbert-base-uncased): A smaller, faster, and lighter distilled version of BERT. It retains approximately 95% of BERT's performance while being 40% smaller and 60% faster, with around 66 million parameters. DistilBERT was pre-trained on the same corpus as BERT: the concatenation of English Wikipedia and the Toronto Book Corpus. The DistilBERT paper reports that pre-training ran on 8 16GB V100 GPUs for approximately 90 hours.

    To fine-tune DistilBERT, I used the TFDistilBertForSequenceClassification class from the transformers library for TensorFlow, which places a sequence classification head on top of the pre-trained layers. The SMS messages were tokenized with DistilBertTokenizer and encoded into input IDs and attention masks. I then trained the model using the Adam optimizer with a learning rate of 5e-5 for 3 epochs, with a batch size of 32 for training and 16 for validation.
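
The fine-tuning setup described above could look roughly like the sketch below. It is not the project's actual training script: the `FineTuneConfig` dataclass, the `fine_tune` function, the validation inputs, and the sparse categorical cross-entropy loss are assumptions, and only the model name, optimizer, learning rate, epochs, and batch sizes come from the description. It also assumes a transformers release that still ships the TensorFlow model classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters taken from the description; the class itself is hypothetical."""
    model_name: str = "distilbert-base-uncased"
    learning_rate: float = 5e-5
    epochs: int = 3
    train_batch_size: int = 32
    val_batch_size: int = 16
    num_labels: int = 2  # ham (0) and spam (1)


def fine_tune(train_texts, train_labels, val_texts, val_labels,
              config=FineTuneConfig()):
    """Fine-tune DistilBERT for binary SMS classification (sketch)."""
    # Imported lazily so the config can be inspected without TensorFlow installed.
    import tensorflow as tf
    from transformers import (DistilBertTokenizer,
                              TFDistilBertForSequenceClassification)

    tokenizer = DistilBertTokenizer.from_pretrained(config.model_name)
    model = TFDistilBertForSequenceClassification.from_pretrained(
        config.model_name, num_labels=config.num_labels)

    def make_dataset(texts, labels, batch_size, shuffle=False):
        # Tokenize into input IDs and attention masks.
        enc = tokenizer(list(texts), truncation=True, padding=True,
                        return_tensors="tf")
        ds = tf.data.Dataset.from_tensor_slices((dict(enc), list(labels)))
        if shuffle:
            ds = ds.shuffle(len(labels))
        return ds.batch(batch_size)

    train_ds = make_dataset(train_texts, train_labels,
                            config.train_batch_size, shuffle=True)
    val_ds = make_dataset(val_texts, val_labels, config.val_batch_size)

    # The model outputs raw logits, so the loss is told to expect logits.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=config.learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(train_ds, validation_data=val_ds, epochs=config.epochs)
    return model, tokenizer


config = FineTuneConfig()
```

Keeping the hyperparameters in one small config object makes it easy to rerun the same procedure with a different checkpoint or learning rate.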

3. Results

The performance of all the models on the test dataset is summarized in the table below:

Model                        | Accuracy | Precision (Spam) | Recall (Spam) | F1-Score (Spam)
-----------------------------+----------+------------------+---------------+----------------
Fine-tuned DistilBERT        | 0.993    | 0.0004           | 0.99          | 0.0039
Electra Small Discriminator  | 0.990    | 0.986            | 0.014         | 0.013
facebook/bart-large-mnli     | 0.8574   | 0.8004           | 0.6533        | 0.413
openai-community/gpt2        | 0.2305   | 0.6912           | 0.2305        | 0.2510
Bag-of-Words (Naive Bayes)   | 0.9767   | 0.9167           | 0.8574        | 0.8141
Random Prediction            | 0.502    | 0.14             | 0.45          | 0.21
Stratified Random Prediction | 0.7821   | 0.18             | 0.15          | 0.16
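
As a sanity check on the two random baselines, their approximate expected metrics follow from the class distribution alone. The sketch below assumes predictions are drawn independently of the message, with "spam" predicted at probability 0.5 for the uniform baseline and at the training spam rate for the stratified one. The function name `chance_baseline` is hypothetical, and the spam rates are taken from the split sizes reported earlier (599 of 4457 training samples, 150 of 1115 test samples).

```python
def chance_baseline(spam_rate, predict_spam_prob):
    """Approximate expected metrics when 'spam' is predicted at random."""
    p, q = spam_rate, predict_spam_prob
    # Correct when a spam is called spam or a ham is called ham.
    accuracy = p * q + (1 - p) * (1 - q)
    precision = p  # a random "spam" call is right as often as spam occurs
    recall = q     # each actual spam is flagged with probability q
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


test_spam_rate = 150 / 1115   # spam share of the test split
train_spam_rate = 599 / 4457  # spam share of the training split

uniform = chance_baseline(test_spam_rate, 0.5)
stratified = chance_baseline(test_spam_rate, train_spam_rate)
```

Under these assumptions the uniform baseline is expected to land near 0.50 accuracy, 0.13 precision, 0.50 recall, and 0.21 F1, and the stratified baseline near 0.77 accuracy with precision, recall, and F1 all around 0.13. A single random run will scatter around these values.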

Conclusion

For building a spam detection model on this type of data, fine-tuning a pre-trained Transformer model such as DistilBERT appears to be the most promising approach, offering the highest overall performance. However, the Bag-of-Words model with Naive Bayes is a strong, computationally cheaper alternative that achieves surprisingly good results. Zero-shot classification with larger language models is a viable option when labeled data is scarce or when rapid prototyping is needed, although its performance might not match that of fine-tuned models on specific tasks. The choice of modeling approach should weigh the trade-offs between performance, computational cost, and data availability. For resource-constrained environments, a well-tuned Bag-of-Words model might be sufficient. For applications requiring the highest possible accuracy, fine-tuning a state-of-the-art Transformer model is likely the best choice. Again, transformers win the day.
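
To illustrate why the Bag-of-Words baseline is so cheap, here is a minimal multinomial Naive Bayes over word counts in plain Python. It is a sketch under stated assumptions, not the model evaluated above: the class name `NaiveBayesSpamFilter`, the whitespace tokenization, the add-one smoothing, and the four toy messages are all illustrative.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        # One bag of words (word -> count) per class.
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Start from the log prior, then add per-word log likelihoods.
            score = math.log(self.class_counts[c] / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are ignored
                    count = self.word_counts[c][word]
                    score += math.log(
                        (count + 1) / (total_words + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)


# Toy messages (made up for illustration): 0 = ham, 1 = spam.
texts = ["win a free prize now", "free entry win cash",
         "are we meeting for lunch", "see you at home tonight"]
labels = [1, 1, 0, 0]
clf = NaiveBayesSpamFilter().fit(texts, labels)

spam_guess = clf.predict("free cash prize")
ham_guess = clf.predict("see you at lunch")
```

Training is a single pass that counts words, and prediction is a handful of additions per word, which is why this kind of model suits resource-constrained environments.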

Tech Stack: Python, Keras, Transformers APIs, GCP, etc.

Role: NLP Engineer

Highlights:

  • Technical contribution
  • Operational value delivered
  • Cross-functional collaboration impact

Link: https://github.com/frontEndDoctor/SPAM-Detection/tree/main


External Publications

Selected technical publications, articles, and thought leadership pieces focused on AI systems, responsible AI, engineering execution, and technical communication.

Harmony

Platform: Company Blog

Summary: The new setup that will turn your organization into an AI powerhouse. This playbook details how to set up AI automation for business processes while maintaining security, ethics and business policies.

Read Here: https://www.activepieces.com/ai-transformation/facts-and-future


What is HIPAA Compliance Workflow

Platform: Company Blog

Summary: This article explains how healthcare workflows can be automated while remaining HIPAA compliant.

Read Here: https://www.activepieces.com/blog/hipaa-compliant-workflow-automation


Research Interests / What I am Working On

I am currently exploring the intersection of AI systems, technical communication, engineering execution, and responsible technology deployment.

Current Interests

  • Responsible AI systems and AI governance
  • Technical documentation
  • Program Management
  • Human-centered AI system design
  • Developer Experience (Relations/Education)

More updates and research outputs coming soon.