In this tutorial, we will build a complete spoken-digit classifier with Talamo. The idea is straightforward:
- start with short audio clips of spoken digits
- convert each clip into MFCC features
- encode those features into spikes
- train a small SNN to classify the digit
- prepare the model for hardware-oriented deployment
By the end, you should have a clear mental model of what an SNN pipeline looks like in practice and how Talamo helps connect training and deployment.
Why use an SNN for audio?
Audio is naturally temporal. A spoken word is not just a bag of numbers; it unfolds over time. That makes it a good fit for spiking models, where timing and event activity are part of the representation.
SNNs are attractive for edge systems because they can support sparse, event-driven computation. If the model only needs to respond when meaningful activity appears, it can often be more efficient than a conventional always-on dense network.
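As a back-of-envelope illustration of that efficiency argument (the numbers here are hypothetical, not measured on any device), compare the multiply-accumulate work of a dense layer against an event-driven layer that only does work for active inputs:

```python
import numpy as np

# Hypothetical layer: 32 inputs fully connected to 128 neurons.
num_inputs, num_neurons = 32, 128

# Dense network: every input contributes on every timestep.
dense_macs_per_step = num_inputs * num_neurons  # 4096

# Event-driven network: only spiking inputs trigger work.
rng = np.random.default_rng(0)
spikes = rng.random(num_inputs) < 0.1  # roughly 10% of inputs active this step
event_macs_per_step = int(spikes.sum()) * num_neurons

print(dense_macs_per_step, event_macs_per_step)
```

The gap widens as activity gets sparser, which is exactly the regime event-driven hardware is designed to exploit.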
For a first project, spoken-digit classification is a great choice:
- the task is simple and easy to understand
- audio clips are short
- labels are clean
- errors are easy to inspect
- the full pipeline is small enough to reason about
The task
We will classify recordings of spoken digits from a small subset of classes:
0, 4, 7, and 9
The full system will look like this:
waveform -> MFCC -> IFEncoder -> SNN -> MaxRateDecoder
Each stage solves a specific problem.
Step 1: Standardize the audio input
Before we train anything, we want every audio sample to have the same shape. That keeps batching simple and makes the downstream pipeline stable.
A practical choice is to force every waveform to length 8000:
- if the clip is longer, trim it
- if the clip is shorter, pad it
- center the speech within the fixed window
```python
import numpy as np

def pad_and_center_waveform_8K(waveform: np.ndarray) -> np.ndarray:
    new_waveform = np.zeros(8000, dtype=np.float32)
    if len(waveform) > 8000:
        waveform = waveform[:8000]  # trim long clips
    start = (8000 - len(waveform)) // 2  # center shorter clips in the window
    new_waveform[start:start + len(waveform)] = waveform.astype(np.float32)
    return new_waveform
```
This step may look simple, but it matters a lot. Many training problems that get blamed on the network are really input-consistency problems.
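A quick sanity check makes the contract concrete. The snippet below restates the helper above in minimal form and verifies that short, exact, and long clips all come out at 8000 samples, with short clips centered between zero padding:

```python
import numpy as np

def pad_and_center_8000(waveform: np.ndarray) -> np.ndarray:
    # Minimal restatement of pad_and_center_waveform_8K above.
    out = np.zeros(8000, dtype=np.float32)
    if len(waveform) > 8000:
        waveform = waveform[:8000]
    start = (8000 - len(waveform)) // 2
    out[start:start + len(waveform)] = waveform.astype(np.float32)
    return out

short = pad_and_center_8000(np.ones(4000))
exact = pad_and_center_8000(np.ones(8000))
long_ = pad_and_center_8000(np.ones(12000))
print(short.shape, exact.shape, long_.shape)  # all (8000,)
print(short[:2000].sum(), short[2000:6000].sum())  # leading pad is zero, speech is centered
```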
Step 2: Turn raw audio into MFCC features
Raw waveforms can be fed to the model directly, but MFCCs (mel-frequency cepstral coefficients) are a much better starting point for compact speech classification. They summarize time-frequency structure in a form that is easier for the model to learn from.
With Talamo, we can define an MFCC front-end directly:
```python
import talamo

feature_extractor = talamo.pipeline.elements.MFCC(
    n_mfcc=32,
    n_fft=512,
    hop_length=512,
    n_mels=128,
    sample_rate=8000,
)
```
This stage converts each fixed-length waveform into a smaller feature representation that preserves useful speech cues while reducing the burden on the SNN.
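The size of that reduction is worth checking explicitly. The arithmetic below is a rough sketch (exact frame counts depend on the library's padding mode; this assumes the common "centered" framing of one frame per hop, plus one):

```python
# Rough dimensionality check for the MFCC front-end above.
sample_len = 8000
hop_length = 512
n_mfcc = 32

num_frames = 1 + sample_len // hop_length  # 16 frames with centered framing
raw_values = sample_len                    # 8000 numbers per raw clip
mfcc_values = n_mfcc * num_frames          # 512 numbers after the front-end

print(num_frames, raw_values, mfcc_values)
```

Going from 8000 raw samples to roughly 512 feature values is what makes the downstream SNN small enough to train and deploy comfortably.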
Step 3: Encode dense features into spikes
MFCC values are still dense numbers, not spike trains. To feed them into an SNN, we need an encoder.
A good hardware-oriented option in Talamo is the IFEncoder:
```python
encoder = talamo.encoders.c1.IFEncoder(num_encoder_channels=32)
```
Conceptually, this stage converts continuous feature values into spike activity over time. That gives us the bridge from conventional signal processing into spiking computation.
If you are new to SNNs, this is the key transition to understand:
- feature extractor: produces informative numbers
- encoder: turns those numbers into spike-based signals
- SNN: learns from the spikes
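To make the encoding step concrete, here is a toy integrate-and-fire encoder in NumPy. This illustrates the principle only; it is not Talamo's IFEncoder. Each channel accumulates its input value every timestep and emits a spike when the accumulator crosses a threshold:

```python
import numpy as np

def toy_if_encode(values: np.ndarray, num_steps: int, threshold: float = 1.0) -> np.ndarray:
    """Rate-code each value as a spike train: larger values spike more often."""
    acc = np.zeros_like(values, dtype=np.float64)
    spikes = np.zeros((num_steps, len(values)), dtype=np.int8)
    for t in range(num_steps):
        acc += values            # integrate the input each step
        fired = acc >= threshold
        spikes[t] = fired
        acc[fired] -= threshold  # reset by subtraction after a spike
    return spikes

trains = toy_if_encode(np.array([0.0, 0.25, 0.5]), num_steps=8)
print(trains.sum(axis=0))  # spike counts scale with input value: [0 2 4]
```

The key property to notice: the dense value is now expressed as activity over time, which is the representation the SNN consumes.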
Step 4: Build a small spiking network
A strong default is to start with a small network and make sure the full pipeline works before trying anything deeper.
Here is a compact SNN for four-way classification:
```python
class SNN(talamo.snn.TimeModule):
    def __init__(self, num_inputs: int = 32, num_hidden_neurons: int = 128, num_outputs: int = 4):
        super(SNN, self).__init__()
        self.neuron_model = talamo.snn.neuron.c1.digital_neuron()
        self.synapse_model = talamo.snn.synapse.c1.digital_synapse()
        self.fc1 = talamo.snn.layer.Dense(
            num_inputs,
            num_hidden_neurons,
            self.synapse_model,
            self.neuron_model,
            weight_initializer=talamo.initializers.RandomUniform(-6, 10),
        )
        self.fc2 = talamo.snn.layer.Dense(
            num_hidden_neurons,
            num_outputs,
            self.synapse_model,
            self.neuron_model,
            weight_initializer=talamo.initializers.RandomUniform(-6, 10),
            probe=True,
        )

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

snn_step = talamo.snn.containers.Snn(SNN())
```
A few practical notes:
- TimeModule means the model processes temporal data explicitly
- digital_neuron() and digital_synapse() keep the design aligned with deployment on C1-style hardware
- 32 -> 128 -> 4 is a sensible first architecture for this task
For SNN work, simple is usually better at the start. You want to debug the flow of spikes, not fight unnecessary model complexity.
Step 5: Assemble the Talamo pipeline
Now we can connect everything together:
```python
pipe = talamo.pipeline.elements.Pipeline([
    feature_extractor,
    encoder,
    snn_step,
    talamo.decoders.MaxRateDecoder(),
])
```
This final stage, MaxRateDecoder, converts output spike activity into a class prediction by selecting the output neuron with the strongest response.
At this point we have a complete end-to-end classifier.
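The decoding idea is simple enough to sketch in a few lines. This is an illustration of the max-rate principle, not Talamo's MaxRateDecoder implementation:

```python
import numpy as np

def max_rate_decode(output_spikes: np.ndarray) -> int:
    """Pick the class whose output neuron fired most over the time window.

    output_spikes: array of shape (num_steps, num_classes) with 0/1 entries.
    """
    rates = output_spikes.sum(axis=0)  # total spikes per output neuron
    return int(np.argmax(rates))

# 4 output neurons over 6 timesteps; neuron 2 is clearly the most active.
spikes = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
])
print(max_rate_decode(spikes))  # -> 2
```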
Step 6: Build the dataset
Talamo works cleanly with DenseDataset, so once the waveforms and labels are ready, we can package everything for training.
```python
import torch

dataset = talamo.utils.data.DenseDataset(
    labels=torch.tensor(class_ids),
    data=torch.tensor(audio_data, dtype=torch.float32),
)

torch.manual_seed(12345)
train_dataset, test_dataset = talamo.pipeline.util.stratified_split(dataset, [0.8, 0.2])
```
A stratified split is important here because it keeps all digit classes represented in both training and test data.
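The mechanics of stratification are easy to sketch. The helper below is a minimal NumPy version (not Talamo's stratified_split) that splits indices per class, so every digit is guaranteed to appear on both sides:

```python
import numpy as np

def stratified_indices(labels: np.ndarray, train_frac: float = 0.8, seed: int = 0):
    """Split indices class by class so every class appears in both partitions."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return np.array(train), np.array(test)

labels = np.repeat([0, 4, 7, 9], 50)  # 50 clips per digit class
train_idx, test_idx = stratified_indices(labels)
print(sorted(np.unique(labels[train_idx])), sorted(np.unique(labels[test_idx])))
```

A plain random split on a small dataset can easily leave one class underrepresented in the test set, which makes accuracy numbers misleading.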
Step 7: Train the SNN
Training looks familiar if you have used PyTorch before. The difference is that Talamo manages the full pipeline, not just the network core.
```python
lr = 0.2
batch_size = 128
num_epochs = 50

snn_params = pipe.query_torch_params(
    "snn_step*synapses.weight*",
    talamo.utils.ParameterType.PARAMETER,
    references=False,
)

optimizer = torch.optim.Adam(params=snn_params, lr=lr, betas=(0.9, 0.999))

learning_rate_scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,
    gamma=0.9,
)

loss_fn = torch.nn.CrossEntropyLoss()

pipe.fit(
    dataset=train_dataset,
    epochs=num_epochs,
    dataloader_type=torch.utils.data.DataLoader,
    dataloader_args={"batch_size": batch_size, "shuffle": True},
    optimizer=optimizer,
    loss_function=loss_fn,
    learning_rate_scheduler=learning_rate_scheduler,
    verbose=2,
)
```
There is a deeper lesson here: training an SNN is not only about loss curves. You also care about whether spike activity is healthy.
If the model barely spikes, it may not learn. If everything spikes all the time, you lose the benefits of sparse computation and often get unstable behavior.
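One simple way to monitor this is a firing-density check on recorded spike tensors. The thresholds below are illustrative assumptions, not Talamo defaults:

```python
import numpy as np

def spike_health(spikes: np.ndarray, low: float = 0.01, high: float = 0.5) -> str:
    """Classify overall firing density of a (steps, neurons) 0/1 spike tensor."""
    density = spikes.mean()  # fraction of (step, neuron) slots that fired
    if density < low:
        return "too quiet: the network may not be learning"
    if density > high:
        return "saturated: sparsity benefits are lost"
    return "healthy"

rng = np.random.default_rng(0)
print(spike_health((rng.random((100, 128)) < 0.1).astype(np.int8)))  # ~10% density
```

Checking this periodically during training catches dead or saturated networks long before the loss curve makes the problem obvious.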
Step 8: Evaluate properly
After training, evaluate on held-out data:
```python
metrics = talamo.utils.recording.ModelMetrics(num_classes=4)
recorder = talamo.utils.recording.Recorder(model_performance_metrics=metrics)

pipe.to(torch.device("cpu"))
pipe.evaluate(
    dataset=test_dataset,
    recorder=recorder,
    verbose=3,
)
```
For spoken-digit classification, overall accuracy is useful, but it should not be the only thing you inspect. A confusion matrix is often more informative. If 7 and 9 are repeatedly confused, for example, that may point to a feature or data issue rather than a network issue.
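A confusion matrix takes only a few lines to compute by hand. In this sketch, the predictions are made-up example data chosen to show what a 7/9 confusion would look like:

```python
import numpy as np

def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Class indices 0..3 stand for the digits 0, 4, 7, 9.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 1, 2, 3, 3, 2])  # digits 7 and 9 get swapped sometimes
cm = confusion_matrix(y_true, y_pred, num_classes=4)
print(cm)
```

Off-diagonal mass concentrated in one symmetric pair, as in the 7/9 cells here, usually points at feature overlap rather than a broken network.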
Step 9: Inspect thresholds and weights before deployment
This is where SNN engineering becomes more practical than theoretical. A floating-point model can look good in training and still fail when moved toward embedded execution.
That is why it is useful to inspect parameter ranges before quantization:
```python
from talamo.utils import ParameterType

thr_tensors = pipe.query_torch_params(
    "*.threshold",
    ParameterType.PARAMETER,
    references=False,
)

w_tensors = pipe.query_torch_params(
    "*.weight",
    ParameterType.PARAMETER,
    references=False,
)
```
When you inspect these values, you are looking for questions like:
- are thresholds in a sensible range?
- are weights concentrated or extreme?
- will these values survive rounding and clamping?
- does the trained model still make sense under hardware constraints?
That step is easy to skip and expensive to ignore.
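One generic way to answer the "will these values survive rounding and clamping?" question is to simulate quantization offline. The sketch below is not Talamo's quantization flow; it assumes a hypothetical uniform signed grid with 256 levels over [-1, 1) and reports how much damage rounding and clipping would do:

```python
import numpy as np

def quantization_report(w: np.ndarray, num_levels: int = 256, clip: float = 1.0):
    """Simulate uniform quantization to a signed grid and report the damage."""
    step = 2 * clip / num_levels
    clipped = np.clip(w, -clip, clip - step)       # clamp to the representable range
    q = np.round(clipped / step) * step            # snap to the nearest grid point
    return {
        "clipped_fraction": float(np.mean(np.abs(w) > clip)),
        "max_rounding_error": float(np.max(np.abs(clipped - q))),
    }

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.4, size=1000)  # stand-in for a trained weight tensor
report = quantization_report(weights)
print(report)
```

A large clipped fraction or extreme outliers in this kind of report is an early warning that accuracy will drop after deployment, while everything still looks fine in floating point.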
Step 10: Move toward hardware execution
Talamo makes it possible to test the SNN stage on an Innatera SoC simulator:
```python
talamo.pipeline.elements.Pipeline.disable_timeouts = True

innatera_soc = talamo.device.c1.Soc(simulate=True)
snn_step.to(innatera_soc)

deploy_result = pipe.evaluate(test_dataset, verbose=3)
print(deploy_result)
```
This is where the workflow becomes especially powerful. You are no longer building a model only for offline experimentation. You are building something that can move toward a deployment target with realistic constraints.
Step 11: Run inference on a new audio file
For real usage, you want to take a fresh .wav file and preprocess it in exactly the same way as training.
```python
import librosa
import numpy as np

TARGET_SR = 8000
TARGET_LEN = 8000

def pad_and_center_8k(x: np.ndarray, target_len: int = 8000) -> np.ndarray:
    x = np.asarray(x)
    if x.shape[0] > target_len:
        x = x[:target_len]
    out = np.zeros(target_len, dtype=np.float32)
    start = (target_len - x.shape[0]) // 2
    out[start:start + x.shape[0]] = x.astype(np.float32)
    return out

def load_like_training(wav_path: str) -> np.ndarray:
    x, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    x = x * 32768.0  # scale librosa's [-1, 1] floats to int16-range amplitudes, matching the training data
    x = pad_and_center_8k(x, TARGET_LEN)
    return x.astype(np.float32)
```
That consistency is essential. If your training preprocessing and inference preprocessing differ, the model can degrade quickly even when the code seems correct.
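One way to guard against that kind of train/serve skew is an explicit parity test. The sketch below uses simplified stand-ins for the two helpers defined earlier and asserts that the training-time and inference-time paths produce bit-identical arrays:

```python
import numpy as np

def preprocess_train(x: np.ndarray) -> np.ndarray:
    # Simplified stand-in for the training-time pad_and_center_waveform_8K.
    out = np.zeros(8000, dtype=np.float32)
    x = x[:8000]
    start = (8000 - len(x)) // 2
    out[start:start + len(x)] = x.astype(np.float32)
    return out

def preprocess_infer(x: np.ndarray) -> np.ndarray:
    # Simplified stand-in for the inference-time pad_and_center_8k.
    out = np.zeros(8000, dtype=np.float32)
    x = np.asarray(x)[:8000]
    start = (8000 - len(x)) // 2
    out[start:start + len(x)] = x.astype(np.float32)
    return out

rng = np.random.default_rng(0)
clip = rng.normal(size=5000).astype(np.float32)
assert np.array_equal(preprocess_train(clip), preprocess_infer(clip))
print("train/inference preprocessing match")
```

Running a check like this on a handful of real clips whenever either path changes is cheap insurance against silent accuracy loss.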
Step 12: Extend to live microphone input
Once the offline pipeline works, you can adapt it to a live setup:
- capture audio from the microphone
- save or reshape the captured buffer
- apply the exact same preprocessing path
- run the Talamo pipeline
- decode the predicted class
- optionally display the result on device
That turns a toy classification experiment into the start of a usable embedded application.
What this project teaches about SNNs
A lot of introductions make SNNs sound mysterious. In practice, the workflow is quite concrete.
A successful SNN pipeline usually depends on getting these pieces right:
- Stable input formatting.
- A useful front-end feature representation.
- A sensible spike encoder.
- A compact, trainable spiking architecture.
- Evaluation on unseen data.
- Quantization-aware inspection.
- A realistic deployment path.
That is the real value of Talamo. It helps treat SNNs not as an isolated research toy, but as a system you can build, train, inspect, and move toward hardware.
Final thoughts
If you are just starting with SNNs, this kind of spoken-digit classifier is an excellent first project. It is simple enough to understand end to end, but rich enough to teach the important lessons:
- time matters
- encoding matters
- spike activity matters
- deployment constraints matter
The central mindset shift is this: an SNN is not just a normal neural network with spikes added on top. It is a temporal pipeline, and each stage has to be designed with care.
Talamo makes that process much more approachable by giving you a clean path from audio preprocessing to spike encoding to SNN training and, eventually, deployment-oriented execution.
