In this tutorial, we will build a complete spoken-digit classifier with Talamo. The idea is straightforward:
- start with short audio clips of spoken digits
- convert each clip into MFCC features
- encode those features into spikes
- train a small SNN to classify the digit
- prepare the model for hardware-oriented deployment
By the end, you should have a clear mental model of what an SNN pipeline looks like in practice and how Talamo helps connect training and deployment.
Why use an SNN for audio?
Audio is naturally temporal. A spoken word is not just a bag of numbers; it unfolds over time. That makes it a good fit for spiking models, where timing and event activity are part of the representation.
SNNs are attractive for edge systems because they can support sparse, event-driven computation. If the model only needs to respond when meaningful activity appears, it can often be more efficient than a conventional always-on dense network.
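As a back-of-envelope illustration of that efficiency argument (the numbers here are hypothetical, not measured on any device), compare the multiply-accumulate work of a dense layer against an event-driven layer that only does work for active inputs:

```python
import numpy as np

# Hypothetical layer: 32 inputs fully connected to 128 neurons.
num_inputs, num_neurons = 32, 128

# Dense network: every input contributes on every timestep.
dense_macs_per_step = num_inputs * num_neurons  # 4096

# Event-driven network: only spiking inputs trigger work.
rng = np.random.default_rng(0)
spikes = rng.random(num_inputs) < 0.1  # roughly 10% of inputs active this step
event_macs_per_step = int(spikes.sum()) * num_neurons

print(dense_macs_per_step, event_macs_per_step)
```

The gap widens as activity gets sparser, which is exactly the regime event-driven hardware is designed to exploit.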
For a first project, spoken-digit classification is a great choice:
- the task is simple and easy to understand
- audio clips are short
- labels are clean
- errors are easy to inspect
- the full pipeline is small enough to reason about
The task
We will classify recordings of spoken digits from a small subset of classes:
0, 4, 7, and 9
The full system will look like this:
waveform -> MFCC -> IFEncoder -> SNN -> MaxRateDecoder
Each stage solves a specific problem.
Step 1: Standardize the audio input
Before we train anything, we want every audio sample to have the same shape. That keeps batching simple and makes the downstream pipeline stable.
A practical choice is to force every waveform to length 8000:
- if the clip is longer, trim it
- if the clip is shorter, pad it
- center the speech within the fixed window
```python
import numpy as np

def pad_and_center_waveform_8K(waveform: np.ndarray) -> np.ndarray:
    new_waveform = np.zeros(8000, dtype=np.float32)
    if len(waveform) > 8000:
        waveform = waveform[:8000]  # trim long clips
    start = (8000 - len(waveform)) // 2  # center shorter clips in the window
    new_waveform[start:start + len(waveform)] = waveform.astype(np.float32)
    return new_waveform
```
This step may look simple, but it matters a lot. Many training problems that get blamed on the network are really input-consistency problems.
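A quick sanity check makes the contract concrete. The snippet below restates the helper above in minimal form and verifies that short, exact, and long clips all come out at 8000 samples, with short clips centered between zero padding:

```python
import numpy as np

def pad_and_center_8000(waveform: np.ndarray) -> np.ndarray:
    # Minimal restatement of pad_and_center_waveform_8K above.
    out = np.zeros(8000, dtype=np.float32)
    if len(waveform) > 8000:
        waveform = waveform[:8000]
    start = (8000 - len(waveform)) // 2
    out[start:start + len(waveform)] = waveform.astype(np.float32)
    return out

short = pad_and_center_8000(np.ones(4000))
exact = pad_and_center_8000(np.ones(8000))
long_ = pad_and_center_8000(np.ones(12000))
print(short.shape, exact.shape, long_.shape)  # all (8000,)
print(short[:2000].sum(), short[2000:6000].sum())  # leading pad is zero, speech is centered
```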
Step 2: Turn raw audio into MFCC features
Raw waveforms can be fed to the model directly, but MFCCs (mel-frequency cepstral coefficients) are a much better starting point for compact speech classification. They summarize time-frequency structure in a form that is easier for the model to learn from.
With Talamo, we can define an MFCC front-end directly:
```python
import talamo

feature_extractor = talamo.pipeline.elements.MFCC(
    n_mfcc=32,
    n_fft=512,
    hop_length=512,
    n_mels=128,
    sample_rate=8000,
)
```
This stage converts each fixed-length waveform into a smaller feature representation that preserves useful speech cues while reducing the burden on the SNN.
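The size of that reduction is worth checking explicitly. The arithmetic below is a rough sketch (exact frame counts depend on the library's padding mode; this assumes the common "centered" framing of one frame per hop, plus one):

```python
# Rough dimensionality check for the MFCC front-end above.
sample_len = 8000
hop_length = 512
n_mfcc = 32

num_frames = 1 + sample_len // hop_length  # 16 frames with centered framing
raw_values = sample_len                    # 8000 numbers per raw clip
mfcc_values = n_mfcc * num_frames          # 512 numbers after the front-end

print(num_frames, raw_values, mfcc_values)
```

Going from 8000 raw samples to roughly 512 feature values is what makes the downstream SNN small enough to train and deploy comfortably.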
Step 3: Encode dense features into spikes
MFCC values are still dense numbers, not spike trains. To feed them into an SNN, we need an encoder.
A good hardware-oriented option in Talamo is the IFEncoder:
```python
encoder = talamo.encoders.c1.IFEncoder(num_encoder_channels=32)
```
Conceptually, this stage converts continuous feature values into spike activity over time. That gives us the bridge from conventional signal processing into spiking computation.
If you are new to SNNs, this is the key transition to understand:
- feature extractor: produces informative numbers
- encoder: turns those numbers into spike-based signals
- SNN: learns from the spikes
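To make the encoding step concrete, here is a toy integrate-and-fire encoder in NumPy. This illustrates the principle only; it is not Talamo's IFEncoder. Each channel accumulates its input value every timestep and emits a spike when the accumulator crosses a threshold:

```python
import numpy as np

def toy_if_encode(values: np.ndarray, num_steps: int, threshold: float = 1.0) -> np.ndarray:
    """Rate-code each value as a spike train: larger values spike more often."""
    acc = np.zeros_like(values, dtype=np.float64)
    spikes = np.zeros((num_steps, len(values)), dtype=np.int8)
    for t in range(num_steps):
        acc += values            # integrate the input each step
        fired = acc >= threshold
        spikes[t] = fired
        acc[fired] -= threshold  # reset by subtraction after a spike
    return spikes

trains = toy_if_encode(np.array([0.0, 0.25, 0.5]), num_steps=8)
print(trains.sum(axis=0))  # spike counts scale with input value: [0 2 4]
```

The key property to notice: the dense value is now expressed as activity over time, which is the representation the SNN consumes.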
Step 4: Build a small spiking network
A strong default is to start with a small network and make sure the full pipeline works before trying anything deeper.
Here is a compact SNN for four-way classification:
```python
class SNN(talamo.snn.TimeModule):
    def __init__(self, num_inputs: int = 32, num_hidden_neurons: int = 128, num_outputs: int = 4):
        super(SNN, self).__init__()
        self.neuron_model = talamo.snn.neuron.c1.digital_neuron()
        self.synapse_model = talamo.snn.synapse.c1.digital_synapse()
        self.fc1 = talamo.snn.layer.Dense(
            num_inputs,
            num_hidden_neurons,
            self.synapse_model,
            self.neuron_model,
            weight_initializer=talamo.initializers.RandomUniform(-6, 10),
        )
        self.fc2 = talamo.snn.layer.Dense(
            num_hidden_neurons,
            num_outputs,
            self.synapse_model,
            self.neuron_model,
            weight_initializer=talamo.initializers.RandomUniform(-6, 10),
            probe=True,
        )

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

snn_step = talamo.snn.containers.Snn(SNN())
```
A few practical notes:
- TimeModule means the model processes temporal data explicitly
- digital_neuron() and digital_synapse() keep the design aligned with deployment on C1-style hardware
- 32 -> 128 -> 4 is a sensible first architecture for this task
For SNN work, simple is usually better at the start. You want to debug the flow of spikes, not fight unnecessary model complexity.
Step 5: Assemble the Talamo pipeline
Now we can connect everything together:
```python
pipe = talamo.pipeline.elements.Pipeline([
    feature_extractor,
    encoder,
    snn_step,
    talamo.decoders.MaxRateDecoder(),
])
```
This final stage, MaxRateDecoder, converts output spike activity into a class prediction by selecting the output neuron with the strongest response.
At this point we have a complete end-to-end classifier.
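The decoding idea is simple enough to sketch in a few lines. This is an illustration of the max-rate principle, not Talamo's MaxRateDecoder implementation:

```python
import numpy as np

def max_rate_decode(output_spikes: np.ndarray) -> int:
    """Pick the class whose output neuron fired most over the time window.

    output_spikes: array of shape (num_steps, num_classes) with 0/1 entries.
    """
    rates = output_spikes.sum(axis=0)  # total spikes per output neuron
    return int(np.argmax(rates))

# 4 output neurons over 6 timesteps; neuron 2 is clearly the most active.
spikes = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
])
print(max_rate_decode(spikes))  # -> 2
```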
Step 6: Build the dataset
Talamo works cleanly with DenseDataset, so once the waveforms and labels are ready, we can package everything for training.
```python
import torch

dataset = talamo.utils.data.DenseDataset(
    labels=torch.tensor(class_ids),
    data=torch.tensor(audio_data, dtype=torch.float32),
)

torch.manual_seed(12345)
train_dataset, test_dataset = talamo.pipeline.util.stratified_split(dataset, [0.8, 0.2])
```
A stratified split is important here because it keeps all digit classes represented in both training and test data.
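The mechanics of stratification are easy to sketch. The helper below is a minimal NumPy version (not Talamo's stratified_split) that splits indices per class, so every digit is guaranteed to appear on both sides:

```python
import numpy as np

def stratified_indices(labels: np.ndarray, train_frac: float = 0.8, seed: int = 0):
    """Split indices class by class so every class appears in both partitions."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return np.array(train), np.array(test)

labels = np.repeat([0, 4, 7, 9], 50)  # 50 clips per digit class
train_idx, test_idx = stratified_indices(labels)
print(sorted(np.unique(labels[train_idx])), sorted(np.unique(labels[test_idx])))
```

A plain random split on a small dataset can easily leave one class underrepresented in the test set, which makes accuracy numbers misleading.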
Step 7: Train the SNN
Training looks familiar if you have used PyTorch before. The difference is that Talamo manages the full pipeline, not just the network core.
```python
lr = 0.2
batch_size = 128
num_epochs = 50

snn_params = pipe.query_torch_params(
    "snn_step*synapses.weight*",
    talamo.utils.ParameterType.PARAMETER,
    references=False,
)

optimizer = torch.optim.Adam(params=snn_params, lr=lr, betas=(0.9, 0.999))

learning_rate_scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,
    gamma=0.9,
)

loss_fn = torch.nn.CrossEntropyLoss()

pipe.fit(
    dataset=train_dataset,
    epochs=num_epochs,
    dataloader_type=torch.utils.data.DataLoader,
    dataloader_args={"batch_size": batch_size, "shuffle": True},
    optimizer=optimizer,
    loss_function=loss_fn,
    learning_rate_scheduler=learning_rate_scheduler,
    verbose=2,
)
```
There is a deeper lesson here: training an SNN is not only about loss curves. You also care about whether spike activity is healthy.
If the model barely spikes, it may not learn. If everything spikes all the time, you lose the benefits of sparse computation and often get unstable behavior.
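One simple way to monitor this is a firing-density check on recorded spike tensors. The thresholds below are illustrative assumptions, not Talamo defaults:

```python
import numpy as np

def spike_health(spikes: np.ndarray, low: float = 0.01, high: float = 0.5) -> str:
    """Classify overall firing density of a (steps, neurons) 0/1 spike tensor."""
    density = spikes.mean()  # fraction of (step, neuron) slots that fired
    if density < low:
        return "too quiet: the network may not be learning"
    if density > high:
        return "saturated: sparsity benefits are lost"
    return "healthy"

rng = np.random.default_rng(0)
print(spike_health((rng.random((100, 128)) < 0.1).astype(np.int8)))  # ~10% density
```

Checking this periodically during training catches dead or saturated networks long before the loss curve makes the problem obvious.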
Step 8: Evaluate properly
After training, evaluate on held-out data:
```python
metrics = talamo.utils.recording.ModelMetrics(num_classes=4)
recorder = talamo.utils.recording.Recorder(model_performance_metrics=metrics)

pipe.to(torch.device("cpu"))
pipe.evaluate(
    dataset=test_dataset,
    recorder=recorder,
    verbose=3,
)
```
For spoken-digit classification, overall accuracy is useful, but it should not be the only thing you inspect. A confusion matrix is often more informative. If 7 and 9 are repeatedly confused, for example, that may point to a feature or data issue rather than a network issue.
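A confusion matrix takes only a few lines to compute by hand. In this sketch, the predictions are made-up example data chosen to show what a 7/9 confusion would look like:

```python
import numpy as np

def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Class indices 0..3 stand for the digits 0, 4, 7, 9.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 1, 2, 3, 3, 2])  # digits 7 and 9 get swapped sometimes
cm = confusion_matrix(y_true, y_pred, num_classes=4)
print(cm)
```

Off-diagonal mass concentrated in one symmetric pair, as in the 7/9 cells here, usually points at feature overlap rather than a broken network.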
Step 9: Inspect thresholds and weights before deployment
This is where SNN engineering becomes more practical than theoretical. A floating-point model can look good in training and still fail when moved toward embedded execution.
That is why it is useful to inspect parameter ranges before quantization:
```python
from talamo.utils import ParameterType

thr_tensors = pipe.query_torch_params(
    "*.threshold",
    ParameterType.PARAMETER,
    references=False,
)

w_tensors = pipe.query_torch_params(
    "*.weight",
    ParameterType.PARAMETER,
    references=False,
)
```
When you inspect these values, you are looking for questions like:
- are thresholds in a sensible range?
- are weights concentrated or extreme?
- will these values survive rounding and clamping?
- does the trained model still make sense under hardware constraints?
That step is easy to skip and expensive to ignore.
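One generic way to answer the "will these values survive rounding and clamping?" question is to simulate quantization offline. The sketch below is not Talamo's quantization flow; it assumes a hypothetical uniform signed grid with 256 levels over [-1, 1) and reports how much damage rounding and clipping would do:

```python
import numpy as np

def quantization_report(w: np.ndarray, num_levels: int = 256, clip: float = 1.0):
    """Simulate uniform quantization to a signed grid and report the damage."""
    step = 2 * clip / num_levels
    clipped = np.clip(w, -clip, clip - step)       # clamp to the representable range
    q = np.round(clipped / step) * step            # snap to the nearest grid point
    return {
        "clipped_fraction": float(np.mean(np.abs(w) > clip)),
        "max_rounding_error": float(np.max(np.abs(clipped - q))),
    }

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.4, size=1000)  # stand-in for a trained weight tensor
report = quantization_report(weights)
print(report)
```

A large clipped fraction or extreme outliers in this kind of report is an early warning that accuracy will drop after deployment, while everything still looks fine in floating point.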
Step 10: Move toward hardware execution
Talamo makes it possible to test the SNN stage on an Innatera SoC simulator:
```python
talamo.pipeline.elements.Pipeline.disable_timeouts = True

innatera_soc = talamo.device.c1.Soc(simulate=True)
snn_step.to(innatera_soc)

deploy_result = pipe.evaluate(test_dataset, verbose=3)
print(deploy_result)
```
This is where the workflow becomes especially powerful. You are no longer building a model only for offline experimentation. You are building something that can move toward a deployment target with realistic constraints.
Step 11: Run inference on a new audio file
For real usage, you want to take a fresh .wav file and preprocess it in exactly the same way as training.
```python
import librosa
import numpy as np

TARGET_SR = 8000
TARGET_LEN = 8000

def pad_and_center_8k(x: np.ndarray, target_len: int = 8000) -> np.ndarray:
    x = np.asarray(x)
    if x.shape[0] > target_len:
        x = x[:target_len]
    out = np.zeros(target_len, dtype=np.float32)
    start = (target_len - x.shape[0]) // 2
    out[start:start + x.shape[0]] = x.astype(np.float32)
    return out

def load_like_training(wav_path: str) -> np.ndarray:
    x, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    x = x * 32768.0  # scale librosa's [-1, 1] floats to int16-range amplitudes, matching the training data
    x = pad_and_center_8k(x, TARGET_LEN)
    return x.astype(np.float32)
```
That consistency is essential. If your training preprocessing and inference preprocessing differ, the model can degrade quickly even when the code seems correct.
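One way to guard against that kind of train/serve skew is an explicit parity test. The sketch below uses simplified stand-ins for the two helpers defined earlier and asserts that the training-time and inference-time paths produce bit-identical arrays:

```python
import numpy as np

def preprocess_train(x: np.ndarray) -> np.ndarray:
    # Simplified stand-in for the training-time pad_and_center_waveform_8K.
    out = np.zeros(8000, dtype=np.float32)
    x = x[:8000]
    start = (8000 - len(x)) // 2
    out[start:start + len(x)] = x.astype(np.float32)
    return out

def preprocess_infer(x: np.ndarray) -> np.ndarray:
    # Simplified stand-in for the inference-time pad_and_center_8k.
    out = np.zeros(8000, dtype=np.float32)
    x = np.asarray(x)[:8000]
    start = (8000 - len(x)) // 2
    out[start:start + len(x)] = x.astype(np.float32)
    return out

rng = np.random.default_rng(0)
clip = rng.normal(size=5000).astype(np.float32)
assert np.array_equal(preprocess_train(clip), preprocess_infer(clip))
print("train/inference preprocessing match")
```

Running a check like this on a handful of real clips whenever either path changes is cheap insurance against silent accuracy loss.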
Step 12: Extend to live microphone input
Once the offline pipeline works, you can adapt it to a live setup:
- capture audio from the microphone
- save or reshape the captured buffer
- apply the exact same preprocessing path
- run the Talamo pipeline
- decode the predicted class
- optionally display the result on device
That turns a toy classification experiment into the start of a usable embedded application.
What this project teaches about SNNs
A lot of introductions make SNNs sound mysterious. In practice, the workflow is quite concrete.
A successful SNN pipeline usually depends on getting these pieces right:
- Stable input formatting.
- A useful front-end feature representation.
- A sensible spike encoder.
- A compact, trainable spiking architecture.
- Evaluation on unseen data.
- Quantization-aware inspection.
- A realistic deployment path.
That is the real value of Talamo. It helps treat SNNs not as an isolated research toy, but as a system you can build, train, inspect, and move toward hardware.
Final thoughts
If you are just starting with SNNs, this kind of spoken-digit classifier is an excellent first project. It is simple enough to understand end to end, but rich enough to teach the important lessons:
- time matters
- encoding matters
- spike activity matters
- deployment constraints matter
The central mindset shift is this: an SNN is not just a normal neural network with spikes added on top. It is a temporal pipeline, and each stage has to be designed with care.
Talamo makes that process much more approachable by giving you a clean path from audio preprocessing to spike encoding to SNN training and, eventually, deployment-oriented execution.
