Year · 2024–present
By Ayush Niroula

A deep learning system that automatically generates descriptive Nepali captions for images. The project implements a CNN-Transformer architecture: InceptionV3 extracts image features, and a custom-built Transformer decoder generates the caption text in the Devanagari script.
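The attention step at the heart of such a decoder can be sketched in plain NumPy. This is a minimal single-head illustration, not the project's code: the function name `attention`, the model width of 16, and the random inputs are all assumptions made for the example. The only fixed detail is that InceptionV3's last convolutional feature map for a 299x299 input is an 8x8 grid, which gives 64 image regions to attend over.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention (illustrative sketch).

    q: (len_q, d), k and v: (len_k, d). Returns (output, weights).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Hide future positions so each token only sees itself and earlier tokens.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))    # hypothetical: 5 caption token embeddings
regions = rng.normal(size=(64, 16))  # hypothetical: 64 projected image regions

# Masked self-attention over the partial caption.
self_out, self_w = attention(tokens, tokens, tokens, causal=True)
# Cross-attention: caption tokens query the image regions.
cross_out, cross_w = attention(self_out, regions, regions)
```

A multi-head layer runs several such heads in parallel on learned projections of the inputs and concatenates the results; Keras provides this as `tf.keras.layers.MultiHeadAttention`.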
Highlights
- CNN-Transformer hybrid architecture: InceptionV3 image features feed a Transformer decoder that uses Multi-Head Attention to map visual content to text
- Custom Devanagari text processing engine for cleaning, tokenization, and handling of Nepali script nuances
- GPU training pipeline with XLA compilation and mixed-precision (float16) arithmetic for faster training
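As a rough idea of what the Devanagari cleaning and tokenization highlight involves, here is a small standard-library sketch. The helper names `clean_caption` and `tokenize`, the `<start>`/`<end>` markers, and the specific rules (drop anything outside the Devanagari block, strip danda punctuation, split on whitespace) are assumptions for illustration; the project's own rules may differ.

```python
import re
import unicodedata
from typing import List

START, END = "<start>", "<end>"  # hypothetical sequence markers

# Keep the Devanagari block (U+0900-U+097F), whitespace, and the zero-width
# non-joiner/joiner (U+200C, U+200D), which can affect how conjuncts render.
_NOT_ALLOWED = re.compile(r"[^\u0900-\u097F\u200C\u200D\s]")
# Danda and double danda sentence terminators.
_DANDA = re.compile(r"[\u0964\u0965]")
_WHITESPACE = re.compile(r"\s+")

def clean_caption(text: str) -> str:
    """Normalize a raw caption to space-separated Devanagari words."""
    text = unicodedata.normalize("NFC", text)
    text = _DANDA.sub(" ", text)
    text = _NOT_ALLOWED.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()

def tokenize(text: str) -> List[str]:
    """Whitespace-tokenize a cleaned caption and wrap it in markers."""
    return [START, *clean_caption(text).split(), END]

print(tokenize("एउटा कुकुर  दौडिरहेको छ। (dog)"))
```

Under these assumptions Latin text and ASCII punctuation are discarded, while Devanagari digits stay because they sit inside the same Unicode block.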
Tech stack
TensorFlow
Keras
Python
Transformers
InceptionV3
NLTK
NumPy