{"id":614,"date":"2026-04-21T16:57:45","date_gmt":"2026-04-21T11:27:45","guid":{"rendered":"https:\/\/swstech.sws-international.com\/?p=614"},"modified":"2026-04-21T16:57:45","modified_gmt":"2026-04-21T11:27:45","slug":"how-to-build-a-spoken-digit-classifier-with-a-spiking-neural-network-talamo-tutorial","status":"publish","type":"post","link":"https:\/\/swstech.sws-international.com\/?p=614","title":{"rendered":"How to Build a Spoken-Digit Classifier with a Spiking Neural Network (Talamo Tutorial)"},"content":{"rendered":"<style>\n  .snn-post {\n    max-width: 820px;\n    margin: 0 auto;\n    line-height: 1.75;\n    font-size: 17px;\n    color: #222;\n  }\n  .snn-post h1, .snn-post h2 {\n    line-height: 1.3;\n    margin-top: 1.8em;\n    margin-bottom: 0.6em;\n  }\n  .snn-post pre,\n  .snn-post code {\n    font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, \"Liberation Mono\", \"Courier New\", monospace;\n  }\n  .snn-post pre {\n    background: #111 !important;\n    color: #c7c7c7 !important;\n    border: 1px solid #2a2a2a;\n    border-radius: 10px;\n    padding: 14px 16px;\n    overflow-x: auto;\n    font-size: 14px;\n    line-height: 1.6;\n  }\n  .snn-post pre code {\n    background: transparent !important;\n    color: inherit !important;\n    padding: 0 !important;\n  }\n  .snn-post p code,\n  .snn-post li code {\n    background: #f0f1f3;\n    color: #222;\n    padding: 0.1em 0.35em;\n    border-radius: 5px;\n    font-size: 0.92em;\n  }\n<\/style>\n<article class=\"snn-post\">Spiking neural networks, or SNNs, are a different way to think about machine learning on time-based data. Instead of passing around only dense activations, they process streams of spikes over time. That makes them especially interesting for audio, sensors, and edge AI systems where efficiency matters as much as accuracy.<\/p>\n<p>In this tutorial, we will build a complete spoken-digit classifier with Talamo. 
The idea is straightforward:<\/p>\n<ul>\n<li>start with short audio clips of spoken digits<\/li>\n<li>convert each clip into MFCC features<\/li>\n<li>encode those features into spikes<\/li>\n<li>train a small SNN to classify the digit<\/li>\n<li>prepare the model for hardware-oriented deployment<\/li>\n<\/ul>\n<p>By the end, you should have a clear mental model of what an SNN pipeline looks like in practice and how Talamo helps connect training and deployment.<\/p>\n<h2>Why use an SNN for audio?<\/h2>\n<p>Audio is naturally temporal. A spoken word is not just a bag of numbers; it unfolds over time. That makes it a good fit for spiking models, where timing and event activity are part of the representation.<\/p>\n<p>SNNs are attractive for edge systems because they can support sparse, event-driven computation. If the model only needs to respond when meaningful activity appears, it can often be more efficient than a conventional always-on dense network.<\/p>\n<p>For a first project, spoken-digit classification is a great choice:<\/p>\n<ul>\n<li>the task is simple and easy to understand<\/li>\n<li>audio clips are short<\/li>\n<li>labels are clean<\/li>\n<li>errors are easy to inspect<\/li>\n<li>the full pipeline is small enough to reason about<\/li>\n<\/ul>\n<h2>The task<\/h2>\n<p>We will classify recordings of spoken digits from a small subset of classes:<\/p>\n<ul>\n<li><code>0<\/code><\/li>\n<li><code>4<\/code><\/li>\n<li><code>7<\/code><\/li>\n<li><code>9<\/code><\/li>\n<\/ul>\n<p>The full system will look like this:<\/p>\n<p><code>waveform -&gt; MFCC -&gt; IFEncoder -&gt; SNN -&gt; MaxRateDecoder<\/code><\/p>\n<p>Each stage solves a specific problem.<\/p>\n<h2>Step 1: Standardize the audio input<\/h2>\n<p>Before we train anything, we want every audio sample to have the same shape. 
That keeps batching simple and makes the downstream pipeline stable.<\/p>\n<p>A practical choice is to force every waveform to length <code>8000<\/code>:<\/p>\n<ul>\n<li>if the clip is longer, trim it<\/li>\n<li>if the clip is shorter, pad it<\/li>\n<li>center the speech within the fixed window<\/li>\n<\/ul>\n<pre><code>import numpy as np\n\n\ndef pad_and_center_waveform_8K(waveform: np.ndarray) -&gt; np.ndarray:\n    new_waveform = np.zeros(8000, dtype=np.float32)\n    if len(waveform) &gt; 8000:\n        waveform = waveform[:8000]\n    start = (8000 - len(waveform)) \/\/ 2\n    new_waveform[start:start + len(waveform)] = waveform.astype(np.float32)\n    return new_waveform\n<\/code><\/pre>\n<p>This step may look simple, but it matters a lot. Most training problems blamed on the network are really input consistency problems.<\/p>\n<h2>Step 2: Turn raw audio into MFCC features<\/h2>\n<p>Raw waveforms are possible inputs, but MFCCs are a much better starting point for compact speech classification. They summarize time-frequency structure in a form that is easier for the model to learn from.<\/p>\n<p>With Talamo, we can define an MFCC front-end directly:<\/p>\n<pre><code>import talamo\n\nfeature_extractor = talamo.pipeline.elements.MFCC(\n    n_mfcc=32,\n    n_fft=512,\n    hop_length=512,\n    n_mels=128,\n    sample_rate=8000,\n)\n<\/code><\/pre>\n<p>This stage converts each fixed-length waveform into a smaller feature representation that preserves useful speech cues while reducing the burden on the SNN.<\/p>\n<h2>Step 3: Encode dense features into spikes<\/h2>\n<p>MFCC values are still dense numbers, not spike trains. To feed them into an SNN, we need an encoder.<\/p>\n<p>A good hardware-oriented option in Talamo is the <code>IFEncoder<\/code>:<\/p>\n<pre><code>encoder = talamo.encoders.c1.IFEncoder(num_encoder_channels=32)\n<\/code><\/pre>\n<p>Conceptually, this stage converts continuous feature values into spike activity over time. 
That gives us the bridge from conventional signal processing into spiking computation.<\/p>\n<p>If you are new to SNNs, this is the key transition to understand:<\/p>\n<ul>\n<li>feature extractor: produces informative numbers<\/li>\n<li>encoder: turns those numbers into spike-based signals<\/li>\n<li>SNN: learns from the spikes<\/li>\n<\/ul>\n<h2>Step 4: Build a small spiking network<\/h2>\n<p>A strong default is to start with a small network and make sure the full pipeline works before trying anything deeper.<\/p>\n<p>Here is a compact SNN for four-way classification:<\/p>\n<pre><code>class SNN(talamo.snn.TimeModule):\n    def __init__(self, num_inputs: int = 32, num_hidden_neurons: int = 128, num_outputs: int = 4):\n        super(SNN, self).__init__()\n\n        self.neuron_model = talamo.snn.neuron.c1.digital_neuron()\n        self.synapse_model = talamo.snn.synapse.c1.digital_synapse()\n\n        self.fc1 = talamo.snn.layer.Dense(\n            num_inputs,\n            num_hidden_neurons,\n            self.synapse_model,\n            self.neuron_model,\n            weight_initializer=talamo.initializers.RandomUniform(-6, 10),\n        )\n        self.fc2 = talamo.snn.layer.Dense(\n            num_hidden_neurons,\n            num_outputs,\n            self.synapse_model,\n            self.neuron_model,\n            weight_initializer=talamo.initializers.RandomUniform(-6, 10),\n            probe=True,\n        )\n\n    def forward(self, x):\n        x = self.fc1(x)\n        x = self.fc2(x)\n        return x\n\n\nsnn_step = talamo.snn.containers.Snn(SNN())\n<\/code><\/pre>\n<p>A few practical notes:<\/p>\n<ul>\n<li><code>TimeModule<\/code> means the model processes temporal data explicitly<\/li>\n<li><code>digital_neuron()<\/code> and <code>digital_synapse()<\/code> keep the design aligned with deployment on C1-style hardware<\/li>\n<li><code>32 -&gt; 128 -&gt; 4<\/code> is a sensible first architecture for this task<\/li>\n<\/ul>\n<p>For SNN work, simple is 
usually better at the start. You want to debug the flow of spikes, not fight unnecessary model complexity.<\/p>\n<h2>Step 5: Assemble the Talamo pipeline<\/h2>\n<p>Now we can connect everything together:<\/p>\n<pre><code>pipe = talamo.pipeline.elements.Pipeline([\n    feature_extractor,\n    encoder,\n    snn_step,\n    talamo.decoders.MaxRateDecoder(),\n])\n<\/code><\/pre>\n<p>This final stage, <code>MaxRateDecoder<\/code>, converts output spike activity into a class prediction by selecting the output neuron with the strongest response.<\/p>\n<p>At this point we have a complete end-to-end classifier.<\/p>\n<h2>Step 6: Build the dataset<\/h2>\n<p>Talamo works cleanly with <code>DenseDataset<\/code>, so once the waveforms and labels are ready, we can package everything for training.<\/p>\n<pre><code>import torch\n\n\ndataset = talamo.utils.data.DenseDataset(\n    labels=torch.tensor(class_ids),\n    data=torch.tensor(audio_data, dtype=torch.float32),\n)\n\ntorch.manual_seed(12345)\ntrain_dataset, test_dataset = talamo.pipeline.util.stratified_split(dataset, [0.8, 0.2])\n<\/code><\/pre>\n<p>A stratified split is important here because it keeps all digit classes represented in both training and test data.<\/p>\n<h2>Step 7: Train the SNN<\/h2>\n<p>Training looks familiar if you have used PyTorch before. 
The difference is that Talamo manages the full pipeline, not just the network core.<\/p>\n<pre><code>lr = 0.2\nbatch_size = 128\nnum_epochs = 50\n\nsnn_params = pipe.query_torch_params(\n    \"snn_step*synapses.weight*\",\n    talamo.utils.ParameterType.PARAMETER,\n    references=False,\n)\n\noptimizer = torch.optim.Adam(params=snn_params, lr=lr, betas=(0.9, 0.999))\nlearning_rate_scheduler = torch.optim.lr_scheduler.StepLR(\n    optimizer,\n    step_size=10,\n    gamma=0.9,\n)\nloss_fn = torch.nn.CrossEntropyLoss()\n\npipe.fit(\n    dataset=train_dataset,\n    epochs=num_epochs,\n    dataloader_type=torch.utils.data.DataLoader,\n    dataloader_args={\"batch_size\": batch_size, \"shuffle\": True},\n    optimizer=optimizer,\n    loss_function=loss_fn,\n    learning_rate_scheduler=learning_rate_scheduler,\n    verbose=2,\n)\n<\/code><\/pre>\n<p>There is a deeper lesson here: training an SNN is not only about loss curves. You also care about whether spike activity is healthy.<\/p>\n<p>If the model barely spikes, it may not learn. If everything spikes all the time, you lose the benefits of sparse computation and often get unstable behavior.<\/p>\n<h2>Step 8: Evaluate properly<\/h2>\n<p>After training, evaluate on held-out data:<\/p>\n<pre><code>metrics = talamo.utils.recording.ModelMetrics(num_classes=4)\nrecorder = talamo.utils.recording.Recorder(model_performance_metrics=metrics)\n\npipe.to(torch.device(\"cpu\"))\npipe.evaluate(\n    dataset=test_dataset,\n    recorder=recorder,\n    verbose=3,\n)\n<\/code><\/pre>\n<p>For spoken-digit classification, overall accuracy is useful, but it should not be the only thing you inspect. A confusion matrix is often more informative. If <code>7<\/code> and <code>9<\/code> are repeatedly confused, for example, that may point to a feature or data issue rather than a network issue.<\/p>\n<h2>Step 9: Inspect thresholds and weights before deployment<\/h2>\n<p>This is where SNN engineering becomes more practical than theoretical. 
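<\/p>
<p>Moving toward fixed-point hardware means trained parameters get rounded and clamped onto a small integer grid. Here is a toy numpy sketch of uniform quantization that shows what that does to a set of weights. The 4-bit signed range below is purely illustrative and is not the actual C1 quantization scheme.<\/p>

```python
import numpy as np

# Hypothetical trained weights, centered near zero as is typical after training.
rng = np.random.default_rng(0)
weights = rng.normal(loc=0.0, scale=0.05, size=1000).astype(np.float32)

# Map the observed range onto signed 4-bit codes in [-8, 7].
scale = np.abs(weights).max() / 7
codes = np.clip(np.round(weights / scale), -8, 7)
dequantized = codes * scale

error = np.abs(weights - dequantized)
zeroed = np.mean(codes == 0)  # fraction of weights that collapse to exactly zero

print(f"quantization step: {scale:.5f}")
print(f"max abs error:     {error.max():.5f}")
print(f"weights rounded to zero: {zeroed:.1%}")
```

<p>Small weights below half a quantization step collapse to zero, and every weight moves by up to half a step. If the network's behavior depends on those fine distinctions, accuracy can drop even though the floating-point metrics looked fine.<\/p>
<p>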
A floating-point model can look good in training and still fail when moved toward embedded execution.<\/p>\n<p>That is why it is useful to inspect parameter ranges before quantization:<\/p>\n<pre><code>from talamo.utils import ParameterType\n\nthr_tensors = pipe.query_torch_params(\n    \"*.threshold\",\n    ParameterType.PARAMETER,\n    references=False,\n)\n\nw_tensors = pipe.query_torch_params(\n    \"*.weight\",\n    ParameterType.PARAMETER,\n    references=False,\n)\n\n# Quick range summary; assumes the query returns an iterable of tensors.\nfor t in list(thr_tensors) + list(w_tensors):\n    print(float(t.min()), float(t.max()))\n<\/code><\/pre>\n<p>When you inspect these values, you are asking questions like:<\/p>\n<ul>\n<li>are thresholds in a sensible range?<\/li>\n<li>are weights concentrated or extreme?<\/li>\n<li>will these values survive rounding and clamping?<\/li>\n<li>does the trained model still make sense under hardware constraints?<\/li>\n<\/ul>\n<p>That step is easy to skip and expensive to ignore.<\/p>\n<h2>Step 10: Move toward hardware execution<\/h2>\n<p>Talamo makes it possible to test the SNN stage on an Innatera SoC simulator:<\/p>\n<pre><code>talamo.pipeline.elements.Pipeline.disable_timeouts = True\ninnatera_soc = talamo.device.c1.Soc(simulate=True)\n\nsnn_step.to(innatera_soc)\ndeploy_result = pipe.evaluate(test_dataset, verbose=3)\nprint(deploy_result)\n<\/code><\/pre>\n<p>This is where the workflow becomes especially powerful. You are no longer building a model only for offline experimentation. 
You are building something that can move toward a deployment target with realistic constraints.<\/p>\n<h2>Step 11: Run inference on a new audio file<\/h2>\n<p>For real usage, you want to take a fresh <code>.wav<\/code> file and preprocess it in exactly the same way as training.<\/p>\n<pre><code>import librosa\n\nTARGET_SR = 8000\nTARGET_LEN = 8000\n\n\ndef pad_and_center_8k(x: np.ndarray, target_len: int = 8000) -&gt; np.ndarray:\n    x = np.asarray(x)\n    if x.shape[0] &gt; target_len:\n        x = x[:target_len]\n    out = np.zeros(target_len, dtype=np.float32)\n    start = (target_len - x.shape[0]) \/\/ 2\n    out[start:start + x.shape[0]] = x.astype(np.float32)\n    return out\n\n\ndef load_like_training(wav_path: str) -&gt; np.ndarray:\n    x, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)\n    x = x * 32768.0\n    x = pad_and_center_8k(x, TARGET_LEN)\n    return x.astype(np.float32)\n<\/code><\/pre>\n<p>That consistency is essential. If your training preprocessing and inference preprocessing differ, the model can degrade quickly even when the code seems correct.<\/p>\n<h2>Step 12: Extend to live microphone input<\/h2>\n<p>Once the offline pipeline works, you can adapt it to a live setup:<\/p>\n<ul>\n<li>capture audio from the microphone<\/li>\n<li>save or reshape the captured buffer<\/li>\n<li>apply the exact same preprocessing path<\/li>\n<li>run the Talamo pipeline<\/li>\n<li>decode the predicted class<\/li>\n<li>optionally display the result on device<\/li>\n<\/ul>\n<p>That turns a toy classification experiment into the start of a usable embedded application.<\/p>\n<h2>What this project teaches about SNNs<\/h2>\n<p>A lot of introductions make SNNs sound mysterious. 
In practice, the workflow is quite concrete.<\/p>\n<p>A successful SNN pipeline usually depends on getting these pieces right:<\/p>\n<ol>\n<li>Stable input formatting.<\/li>\n<li>A useful front-end feature representation.<\/li>\n<li>A sensible spike encoder.<\/li>\n<li>A compact, trainable spiking architecture.<\/li>\n<li>Evaluation on unseen data.<\/li>\n<li>Quantization-aware inspection.<\/li>\n<li>A realistic deployment path.<\/li>\n<\/ol>\n<p>That is the real value of Talamo. It helps treat SNNs not as an isolated research toy, but as a system you can build, train, inspect, and move toward hardware.<\/p>\n<h2>Final thoughts<\/h2>\n<p>If you are just starting with SNNs, this kind of spoken-digit classifier is an excellent first project. It is simple enough to understand end to end, but rich enough to teach the important lessons:<\/p>\n<ul>\n<li>time matters<\/li>\n<li>encoding matters<\/li>\n<li>spike activity matters<\/li>\n<li>deployment constraints matter<\/li>\n<\/ul>\n<p>The central mindset shift is this: an SNN is not just a normal neural network with spikes added on top. It is a temporal pipeline, and each stage has to be designed with care.<\/p>\n<p>Talamo makes that process much more approachable by giving you a clean path from audio preprocessing to spike encoding to SNN training and, eventually, deployment-oriented execution.<\/p>\n<\/article>\n","protected":false},"excerpt":{"rendered":"<p>Spiking neural networks, or SNNs, are a different way to think about machine learning on time-based data. 
Instead of passing [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"normal-width-container","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[32],"tags":[],"class_list":["post-614","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=\/wp\/v2\/posts\/614","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=614"}],"version-history":[{"count":9,"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=\/wp\/v2\/posts\/614\/revisions"}],"predecessor-version":[{"id":625,"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=\/wp\/v2\/posts\/614\/revisions\/625"}],"wp:attachment":[{"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=614"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=614"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/swstech.sws-international.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=614"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}