Portfolio
Here's an overview of some of my past projects, publications, and interests.
Projects
An Agentic Summarizer
Overview: This project consists of multiple agents that generate and summarize text based on a prompt.
Tech Stack: Python, LangChain, OpenAI APIs, GCP.
Role: AI Engineer
Highlights:
- A practical demonstration of MCP architecture
- Building agents to perform tasks instead of relying on traditional object-oriented programming
Link: https://github.com/frontEndDoctor/Multi-Agents
SPAM Detector
Overview:
Objective
The goal is to implement a spam detector based on pre-trained transformer models from the Hugging Face platform and other language models.
1. Dataset
The dataset is the SMS Spam Collection from the Hugging Face datasets library. According to the dataset card and related research (e.g., the original paper by Tiago A. Almeida, José María Gómez Hidalgo, and Akebo Yamakami), the data was collected from various sources, including:
- Gratuitous SMS messages forwarded by users.
- SMS messages collected from research projects.
- Publicly available SMS message datasets.
The models were evaluated on the following metrics:
Accuracy: The overall percentage of correctly classified messages.
Precision (for Spam): Out of all messages predicted as spam, what proportion were actually spam?
Recall (for Spam): Out of all actual spam messages, what proportion were correctly identified as spam?
F1-Score (for Spam): The harmonic mean of precision and recall, providing a balanced measure of the model's performance on the positive class.
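The four definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's evaluation code; the names `SpamMetrics` and `spam_metrics` are made up for this example, and spam is assumed to be encoded as label 1, as in the split table.

```python
from dataclasses import dataclass


@dataclass
class SpamMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def spam_metrics(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for the positive (spam) class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a ratio is undefined (no predicted or no actual spam).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return SpamMetrics(accuracy, precision, recall, f1)


# Toy labels for illustration only: 3 spam and 5 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

Because ham dominates the dataset, accuracy alone can look strong even when spam recall is poor, which is why the spam-class metrics are reported alongside it.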
Although the dataset contained only a training split, it was further divided into training and test splits. The sizes of these splits and the class distribution within them are shown in the table below:
| Split | Total Samples | Ham (0) | Spam (1) |
|---|---|---|---|
| Train | 4457 | 3858 | 599 |
| Test | 1115 | 965 | 150 |
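The counts in the table are consistent with a stratified 80/20 split, although the split ratio and method are not stated here, so treat both as assumptions. The sketch below shows one way such a split could be produced with the standard library only; `stratified_split` is a hypothetical helper, not code from the project.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test split.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class totals implied by the table: 3858 + 965 ham and 599 + 150 spam.
labels = [0] * 4823 + [1] * 749
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 4457 1115
```

Splitting per class keeps the spam share of roughly 13% in both splits, so test-set metrics are not distorted by an unusually spam-heavy or spam-light sample.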
2. Fine-tuning the Models
For fine-tuning, I selected two Transformer-based language models from the Hugging Face Model Hub:
DistilBERT (distilbert-base-uncased): This is a smaller, faster, cheaper, and lighter version of BERT. It retains approximately 95% of BERT's performance while being 40% smaller and 60% faster. The model size is around 66 million parameters. DistilBERT was pre-trained on a large corpus of English text, consisting of the concatenation of the English Wikipedia and the Toronto Book Corpus. The pre-training compute requirements are not as extensively documented as BERT's but involved significant computational resources on GPUs for an extended period.
To fine-tune DistilBERT, I used the TFDistilBertForSequenceClassification class from the transformers library for TensorFlow, which adds a sequence classification head on top of the pre-trained layers. The fine-tuning process involved tokenizing the SMS messages with DistilBertTokenizer, encoding them into input IDs and attention masks, and training the model with the Adam optimizer at a learning rate of 5e-5 for 3 epochs. I used a batch size of 32 for training and 16 for validation.
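The fine-tuning recipe described above could look roughly like the sketch below. It uses the classes and hyperparameters named in the text, but the `FineTuneConfig` dataclass, the `fine_tune` function, the cross-entropy loss choice, and the truncation/padding settings are assumptions made for illustration. It also assumes a transformers release that still ships the TensorFlow model classes, alongside a compatible TensorFlow/Keras install.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    model_name: str = "distilbert-base-uncased"
    learning_rate: float = 5e-5
    epochs: int = 3
    train_batch_size: int = 32
    val_batch_size: int = 16
    num_labels: int = 2  # ham (0) and spam (1)


def fine_tune(train_texts, train_labels, val_texts, val_labels, config=FineTuneConfig()):
    """Fine-tune DistilBERT for binary SMS classification; returns (model, tokenizer)."""
    # Imported lazily so the configuration can be inspected without
    # TensorFlow or transformers installed.
    import tensorflow as tf
    from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

    tokenizer = DistilBertTokenizer.from_pretrained(config.model_name)

    def make_dataset(texts, labels, batch_size, shuffle):
        # The tokenizer produces input IDs and attention masks, padded to equal length.
        encodings = tokenizer(list(texts), truncation=True, padding=True)
        dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), list(labels)))
        if shuffle:
            dataset = dataset.shuffle(len(labels), seed=0)
        return dataset.batch(batch_size)

    train_ds = make_dataset(train_texts, train_labels, config.train_batch_size, shuffle=True)
    val_ds = make_dataset(val_texts, val_labels, config.val_batch_size, shuffle=False)

    # Loads the pre-trained encoder and attaches a new classification head.
    model = TFDistilBertForSequenceClassification.from_pretrained(
        config.model_name, num_labels=config.num_labels
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=config.learning_rate),
        # The model outputs raw logits, so the loss applies the softmax itself.
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(train_ds, validation_data=val_ds, epochs=config.epochs)
    return model, tokenizer
```

Calling `fine_tune(train_texts, train_labels, val_texts, val_labels)` with lists of messages and 0/1 labels downloads the pre-trained weights on first use, so it needs network access and benefits from a GPU.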
3. Results
The performance of all the models on the test dataset is summarized in the table below:
| Model | Accuracy | Precision (Spam) | Recall (Spam) | F1-Score (Spam) |
|---|---|---|---|---|
| Fine-tuned DistilBERT | 0.993 | 0.0004 | 0.99 | 0.0039 |
| Electra Small Discriminator | 0.990 | 0.986 | 0.014 | 0.013 |
| facebook/bart-large-mnli | 0.8574 | 0.8004 | 0.6533 | 0.413 |
| openai-community/gpt2 | 0.2305 | 0.6912 | 0.2305 | 0.2510 |
| Bag-of-Words (Naive Bayes) | 0.9767 | 0.9167 | 0.8574 | 0.8141 |
| Random Prediction | 0.502 | 0.14 | 0.45 | 0.21 |
| Stratified Random Prediction | 0.7821 | 0.18 | 0.15 | 0.16 |
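For context on the two random baselines, the sketch below computes what a guesser that ignores the message would be expected to score, given the test split's spam share of 150 out of 1115. These are large-sample approximations under an independence assumption, not a reproduction of the runs in the table; a single random run will scatter around them. The function name `expected_random_baseline` is hypothetical.

```python
def expected_random_baseline(spam_rate, predicted_spam_rate):
    """Approximate spam-class metrics when labels are guessed independently of the text.

    spam_rate: fraction of messages that really are spam.
    predicted_spam_rate: probability that the guesser outputs "spam".
    """
    p, q = spam_rate, predicted_spam_rate
    accuracy = p * q + (1 - p) * (1 - q)  # both say spam, or both say ham
    precision = p  # a random "spam" guess is right as often as spam occurs
    recall = q     # each real spam message is flagged with probability q
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


spam_rate = 150 / 1115  # spam share of the test split

# Uniform guessing: "spam" half of the time.
uniform = expected_random_baseline(spam_rate, 0.5)
# Stratified guessing: "spam" as often as spam actually occurs.
stratified = expected_random_baseline(spam_rate, spam_rate)

for name, scores in (("uniform", uniform), ("stratified", stratified)):
    print(name, [round(value, 3) for value in scores])
```

Under these assumptions, uniform guessing lands near 0.50 accuracy and stratified guessing near 0.77, with spam precision near 0.13 in both cases. That illustrates how much of a model's accuracy on this dataset comes from the ham majority alone.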
Conclusion
For building a spam detection model for this type of data, fine-tuning a pre-trained Transformer model such as DistilBERT appears to be the most promising approach, offering the highest overall performance. However, the Bag-of-Words model with Naive Bayes is a strong, computationally cheaper alternative that achieves surprisingly good results. Zero-shot classification with larger language models is a viable option when labeled data is scarce or when rapid prototyping is needed, although its performance might not match that of fine-tuned models on specific tasks. Remember to weigh the trade-offs between performance, computational cost, and data availability when choosing a modeling approach for spam detection. For resource-constrained environments, a well-tuned Bag-of-Words model might be sufficient. For applications requiring the highest possible accuracy, fine-tuning a state-of-the-art Transformer model is likely the best choice. Again, transformers win the day.
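To show why the Bag-of-Words baseline is so cheap, here is a dependency-free sketch of a multinomial Naive Bayes classifier over word counts. It is an illustration of the technique rather than the project's implementation: the class name `NaiveBayesSpamFilter`, the regex tokenizer, the add-one smoothing, and the toy messages are all assumptions.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(self._tokenize(text))
        self.vocabulary = set().union(*self.word_counts.values())
        self.total_words = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in self._tokenize(text) if t in self.vocabulary]
        n_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / n_docs)
            denominator = self.total_words[c] + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[c][token] + 1) / denominator)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    @staticmethod
    def _tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())


# Toy training data for illustration only (1 = spam, 0 = ham).
texts = [
    "win a free prize now",
    "free entry claim your prize",
    "are we still meeting for lunch",
    "call me when you get home",
]
labels = [1, 1, 0, 0]

spam_filter = NaiveBayesSpamFilter().fit(texts, labels)
print(spam_filter.predict("claim your free prize"))  # 1
print(spam_filter.predict("see you at lunch"))       # 0
```

Training is a single pass of counting and prediction is a handful of additions per word, which is the source of the cost advantage over fine-tuning a Transformer.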
Tech Stack: Python, Keras, Transformers APIs, GCP, etc.
Role: NLP Engineer
Highlights:
- Technical contribution
- Operational value delivered
- Cross-functional collaboration impact
Link: https://github.com/frontEndDoctor/SPAM-Detection/tree/main
External Publications
Selected technical publications, articles, and thought leadership pieces focused on AI systems, responsible AI, engineering execution, and technical communication.
Harmony
Platform: Company Blog
Summary: The new setup that will turn your organization into an AI powerhouse. This playbook details how to set up AI automation for business processes while maintaining security, ethics and business policies.
Read Here: https://www.activepieces.com/ai-transformation/facts-and-future
What is HIPAA Compliance Workflow
Platform: Company Blog
Summary: This article explains how healthcare workflows can be automated while remaining HIPAA compliant.
Read Here: https://www.activepieces.com/blog/hipaa-compliant-workflow-automation
Research Interests / What I am Working On
I am currently exploring the intersection of AI systems, technical communication, engineering execution, and responsible technology deployment.
Current Interests
- Responsible AI systems and AI governance
- Technical documentation
- Program Management
- Human-centered AI system design
- Developer Experience (Relations/Education)
More updates and research outputs coming soon.

