Language Learning “Shazam” for Kirtan (ASR Model)

IN PROGRESS [updates below]

Description: I'm working on an Automatic Speech Recognition (ASR) model that can accurately transcribe Kirtan (sung Sikhi hymns). By combining speech recognition with music transcription techniques, I aim to create a machine learning model (or models) that can process sung Gurbani, filter out instrumental sounds, and transcribe the vocals into Gurmukhi (the writing system of Punjabi).

Motivation & Overview

Kirtan is a form of devotional music in Sikhi where hymns are sung in Raag (melodic structures). Unlike spoken Punjabi, Kirtan is sung with elongated syllables, varying rhythms, and melodic improvisation, and is usually accompanied by instruments such as the harmonium, tabla, and sitar.

Standard ASR models like OpenAI’s Whisper can transcribe conversational Punjabi but fail to recognize Kirtan. In my personal test, Whisper failed to transcribe a single word of Kirtan, yet transcribed my spoken Punjabi fully and perfectly. That’s because it’s trained on speech, not on words sung over musical instruments.

Currently, platforms like iGurbani and SikhiToTheMax allow users to search for Gurbani if they know the words. However, they cannot transcribe audio, meaning:

  • If someone hears a Shabad and wants to look it up, they must already know the words.

  • Non-fluent speakers or new learners struggle to identify Shabads from melodic Kirtan alone.

This makes it challenging for researchers and learners to search, study, or follow along with Kirtan lyrics.

This project aims to fill that gap, allowing users to record or upload a Kirtan clip and get an instant transcription. Paired with a database like iGurbani or SikhiToTheMax, that transcription can then be used to look up the Shabad being sung, follow along, and study its meaning. This could make Gurbani more searchable, accessible, and easier to learn for Sikhs worldwide.

Project Overview

Current Work (Phase 1)

Before training a model, I need a properly structured dataset of Kirtan audio and transcriptions. This involves:

  1. Extracting short clips (5-10 seconds) from longer Kirtan recordings, saved in WAV format for broad compatibility across tools.

  2. Manually transcribing each clip to ensure accuracy.

  3. Organizing data into a structured format (TSV file) where:

  • Column 1: File name (e.g., shabad1_part1.wav)

  • Column 2: Transcription (Gurmukhi text of the sung words)

This dataset will be the foundation for training and fine-tuning a model. While I'll continue to slowly and consistently expand the dataset as I work on the model, I want to start with a small but high-quality and accurately transcribed dataset. This will allow me to test different ASR models and methods to refine the approach and confirm which method to pursue before scaling up.
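
To make this step concrete, here’s a minimal sketch of the segmentation and manifest-writing pass, assuming pydub (which needs ffmpeg installed). The file names and the 8-second clip length are illustrative; transcriptions are left blank for manual entry later.

```python
# Sketch: slice a long Kirtan recording into short WAV clips and start
# the TSV manifest described above. File names are illustrative.
import csv
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

CLIP_MS = 8_000  # ~8-second clips, within the 5-10 s target range

recording = AudioSegment.from_file("shabad1_full.mp3")

rows = []
# step through the recording; any tail shorter than CLIP_MS is dropped
for i, start in enumerate(range(0, len(recording) - CLIP_MS, CLIP_MS), start=1):
    clip = recording[start:start + CLIP_MS]
    name = f"shabad1_part{i}.wav"
    clip.export(name, format="wav")   # WAV for broad compatibility
    rows.append((name, ""))           # transcription filled in by hand later

with open("kirtan_manifest.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)            # column 1: file, column 2: Gurmukhi text
```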

Why Start Small?

  • Check Model Performance Early – Fine-tune a pre-built ASR model (such as Wav2Vec 2.0) or test hybrid approaches before investing in a massive dataset.

  • Fix Any Issues in the Dataset – Ensure transcription format, audio quality, and segmentation are correct.

  • Expand Strategically – Once I confirm that the approach works, I can scale up data collection efficiently.

Clip Selection Strategy:

  1. Diverse Raags & Kirtan Styles – Different singing speeds, clarity, and background noise levels.

  2. Variation in Granthi/Vocalists – Helps prevent model bias toward a single voice.

  3. Clear Pronunciation – Avoid overly reverb-heavy or distorted recordings.

📌 For now, I’m focusing on getting at least 500 clips manually transcribed before experimenting with models. At 5-10 seconds per clip, 500-1000 clips works out to roughly 1-2 hours of transcribed audio.

Future Work (Phases 2+)

I'll take the initial dataset (1-2 hours of transcribed data) and experiment with different models and methods to determine which approach is most promising. While I have several methods in mind, I'm keeping my options open. This flexibility is important because machine learning is advancing exponentially, and better methods may emerge in the coming months. I love projects where I have a clear end goal (creating an accurate Kirtan transcription model) but maintain flexibility in how to get there. It makes the journey exciting, especially in a field that's evolving as I'm learning about it.

1. Fine-Tuning an ASR Model

Since ASR models (such as Wav2Vec 2.0 by Meta, or NeMo ASR by NVIDIA) are typically pre-trained on large general datasets, fine-tuning adapts them to specialized speech styles, accents, dialects, or unique musical forms like Kirtan, which blends Punjabi with melodic Raag structures.

Pros:

  • Already trained for Punjabi speech (can recognize conversational Punjabi).

  • Can be fine-tuned on my custom dataset for better recognition.

Cons:

  • Trained for speech, not music, so it may struggle with elongated syllables.
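
To ground what fine-tuning would actually involve, here’s a minimal sketch of a single CTC training step with Hugging Face transformers. The model ID is a placeholder for whichever Punjabi Wav2Vec 2.0 checkpoint I end up choosing, the manifest is the TSV from Phase 1, and a real run would wrap this in a proper training loop with batching and an optimizer.

```python
# Sketch: one CTC fine-tuning step on a single manifest clip.
# "some-org/wav2vec2-punjabi" is a PLACEHOLDER model ID, not a real checkpoint.
import pandas as pd
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("some-org/wav2vec2-punjabi")
model = Wav2Vec2ForCTC.from_pretrained("some-org/wav2vec2-punjabi")
model.train()

manifest = pd.read_csv("kirtan_manifest.tsv", sep="\t", names=["file", "text"])
row = manifest.iloc[0]

waveform, sr = torchaudio.load(row["file"])
mono = waveform.mean(dim=0)                               # downmix to mono
mono = torchaudio.functional.resample(mono, sr, 16_000)   # Wav2Vec 2.0 expects 16 kHz

inputs = processor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer(row["text"], return_tensors="pt").input_ids  # Gurmukhi targets

loss = model(input_values=inputs.input_values, labels=labels).loss  # CTC loss
loss.backward()  # in practice: optimizer step, batching, a Trainer loop, etc.
```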

2. Music Transcription Model

MT3 (Multi-Track Music Transcription): A Transformer-based model designed to transcribe polyphonic music, recognizing multiple instruments and notes from raw audio. It performs well on structured music but may struggle with highly improvisational or microtonal styles like Raag-based Kirtan.

Open-Unmix (UMX): A deep learning model for music source separation, used to isolate vocals, drums, bass, and other elements from a mixed audio track. It can help extract vocals from Kirtan recordings before applying an ASR model for transcription.

Pros:

  • Designed for polyphonic music, meaning it can separate vocals from instruments.

  • Could help pre-process audio before feeding it to an ASR model.

Cons:

  • Music transcription models output notes, not words—so I’d need extra training steps.
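
Since Open-Unmix ships as a pip package with a small prediction API, the vocal-isolation pre-processing pass could look roughly like this. I haven’t yet verified how well the pretrained vocal model handles tabla and harmonium, so this is a starting point, not a finished pipeline; file names are illustrative.

```python
# Sketch: isolate the vocal stem from a Kirtan clip with Open-Unmix
# (pip install openunmix), then save it for the ASR stage.
import torchaudio
from openunmix import predict

audio, rate = torchaudio.load("shabad1_part1.wav")

# separate() returns a dict of stems: "vocals", "drums", "bass", "other".
estimates = predict.separate(audio, rate=rate)

vocals = estimates["vocals"][0].detach()  # drop batch dim -> (channels, samples)
torchaudio.save("shabad1_part1_vocals.wav", vocals, 44_100)  # umx models run at 44.1 kHz
```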

📌 3. Hybrid Model (Best of Both Worlds)

The Hybrid Model combines music source separation with ASR transcription, followed by language model correction. First, the music model isolates vocals from Kirtan, removing tabla/harmonium interference. Then, a fine-tuned ASR model transcribes the cleaned vocals, and a Gurbani-trained language model corrects errors for better accuracy.

Pros:

  • Can handle both speech and singing, making it more accurate for Kirtan.

  • Can remove tabla/harmonium sounds, reducing interference.

Cons:

  • More complex and requires training multiple models.
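
Chained together, the hybrid approach is three stages in sequence. The sketch below reuses the pieces above: separate_vocals wraps the Open-Unmix call, the ASR checkpoint is the same placeholder ID as in the fine-tuning sketch, and correct_with_lm is a stub for the Gurbani-trained language model that doesn’t exist yet.

```python
# Sketch: the three-stage hybrid pipeline. correct_with_lm is a stub for a
# future Gurbani-trained language model; everything else reuses earlier sketches.
import torch
import torchaudio
from openunmix import predict
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("some-org/wav2vec2-punjabi")  # placeholder ID
model = Wav2Vec2ForCTC.from_pretrained("some-org/wav2vec2-punjabi")
model.eval()

def separate_vocals(path: str) -> torch.Tensor:
    """Stage 1: strip tabla/harmonium, keep the vocal stem (mono, 16 kHz)."""
    audio, rate = torchaudio.load(path)
    vocals = predict.separate(audio, rate=rate)["vocals"][0].detach()  # 44.1 kHz stem
    mono = vocals.mean(dim=0)
    return torchaudio.functional.resample(mono, 44_100, 16_000)

def transcribe(vocals_16k: torch.Tensor) -> str:
    """Stage 2: CTC decoding of the cleaned vocals into Gurmukhi text."""
    inputs = processor(vocals_16k.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_values=inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

def correct_with_lm(raw_text: str) -> str:
    """Stage 3 (stub): a Gurbani-trained LM would rescore/correct here."""
    return raw_text

print(correct_with_lm(transcribe(separate_vocals("shabad1_part1.wav"))))
```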

End Goal: To create a Kirtan transcription tool that converts sung Gurbani into text, enabling users to search, study, and follow along with Shabad lyrics by linking transcriptions to databases like iGurbani and SikhiToTheMax, making Gurbani more accessible.
