Detecting acoustic events in real-world environments is a much more complex challenge than “recognizing sounds.” In industrial, urban, or environmental settings, events overlap, noise is persistent, and conditions are constantly changing. Solving this problem requires artificial intelligence systems capable of simultaneously modeling time, frequency, and context, and doing so reliably outside the laboratory.
This article traces the design and evolution of a deep learning-based acoustic event detection system, from the definition of the problem to its implementation in a real electronic device. Previous experience with acoustic bird detection systems, deployed in remote and energy-constrained environments, had naturally led to a preference for small models and low-power architectures. In that context, autonomy was the dominant constraint, and model complexity had to be subordinated to it.
In this project, the scenario changes. In industrial applications, energy consumption is no longer a critical constraint and acoustic complexity increases significantly: more sources, more overlaps, and greater operational variability. This change in context enables—and demands—a different approach: large models, real computing power, and non-trivial engineering decisions, implemented on dedicated, autonomous, low-cost electronic equipment, without resorting to a general-purpose computer as a hardware platform.
Three central themes are explored along the way: the choice of deep architectures inspired by DCASE—combining convolutional networks and Transformers—the practical limits of training acoustic AI for real-world scenarios, and, above all, the real bottleneck of the system: the dataset. When real data is insufficient, carefully designed synthetic generation ceases to be an alternative and becomes a structural part of the training process.
Rather than presenting a model, the article shows how to build a complete and reusable pipeline, where data, labels, training, and hardware deployment form a coherent system. This approach capitalizes on previous experience and allows acoustic AI to be taken from prototype to device, and from the laboratory to the real world.
Automatic acoustic event detection is a much more complex problem than it seems at first glance. It is not just a matter of “recognizing sounds,” but of identifying temporal and spectral patterns in real scenarios, with noise, overlaps, and constant variations in the environment.
This project was born with a clear objective: to detect and classify complex sound events in a specific environment, executing all processing locally on a Raspberry Pi-type platform. From the outset, it was clear that this was not a low-power device, nor a typical microcontroller application: the complexity of the problem required more robust models and real computing power, while maintaining dedicated, low-cost, purpose-built electronics, without resorting to a general-purpose computer as the hardware platform.
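To give a concrete idea of what running everything locally on this class of hardware can look like, the sketch below exports a trained PyTorch model to ONNX and runs it with onnxruntime, which supports ARM CPUs such as the Raspberry Pi's. This is only an illustration: the model, file names, and input shape are placeholders, not the project's actual configuration.

```python
import numpy as np
import onnxruntime as ort
import torch

# 1) On the development machine: export the trained model to ONNX.
#    `model` and the input shape (1 clip, 1 channel, 64 mel bins, 500 frames)
#    are placeholders for whatever was actually trained.
def export_model(model, path="sed_model.onnx"):
    model.eval()
    dummy = torch.randn(1, 1, 64, 500)
    torch.onnx.export(model, dummy, path,
                      input_names=["logmel"], output_names=["events"])

# 2) On the device: load the ONNX file and run inference on the CPU.
def run_on_device(spectrogram, path="sed_model.onnx"):
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    x = spectrogram.astype(np.float32)[np.newaxis, np.newaxis]  # (1, 1, mels, frames)
    (events,) = session.run(None, {"logmel": x})
    return events  # per-frame event probabilities
```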

The chosen setting was an industrial environment, characterized by multiple simultaneous sound sources, persistent background noise, frequent event overlaps, and constantly changing operating conditions.
In this context, simple audio classification techniques are insufficient. It is not enough to detect energy or static patterns: the system must model how events evolve in time and frequency, and how they coexist.
From the outset, lightweight architectures were ruled out, as the problem required an approach comparable to that used in academic research and benchmark competitions. The conceptual framework of the project relied heavily on DCASE (Detection and Classification of Acoustic Scenes and Events), an international initiative that defines benchmarks, datasets, and representative tasks for acoustic analysis in real-world conditions.
The experience accumulated in DCASE shows that simple models are not sufficient for detecting realistic acoustic events. Deep convolutional networks (CNNs) are necessary to capture complex spectral structures in spectrograms; temporal modeling is key to understanding how these patterns evolve over time; and attention mechanisms, together with Transformer-based architectures, allow the model to focus on relevant audio fragments, significantly improving performance in the face of events of varying duration, overlapping events, or events immersed in background noise.
Chosen architecture
The central model of the project combines a deep convolutional front end that extracts spectral features from log-mel spectrograms, temporal modeling of how those features evolve over time, and attention mechanisms based on Transformer blocks that let the model focus on the relevant audio segments.
This type of architecture is computationally expensive, but necessary when seeking true robustness.
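As a rough illustration of this combination (not the exact production model), the following sketch shows what such an architecture can look like in PyTorch: a convolutional front end that compresses the log-mel spectrogram, a Transformer encoder over the time axis, and a per-frame multi-label head for sound event detection. The layer sizes, number of classes, and input dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CNNTransformerSED(nn.Module):
    """Sketch of a CNN + Transformer sound event detection model.

    Input:  log-mel spectrogram, shape (batch, 1, n_mels, n_frames)
    Output: per-frame class probabilities, shape (batch, n_frames, n_classes)
    """

    def __init__(self, n_mels=64, n_classes=10, d_model=128):
        super().__init__()
        # Convolutional front end: captures local spectral structure
        # and reduces the frequency axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                       # pool frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        cnn_out = 64 * (n_mels // 4)                    # channels * reduced mel bins
        self.proj = nn.Linear(cnn_out, d_model)
        # Transformer encoder: models how spectral patterns evolve over time
        # and lets attention focus on the relevant frames.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Frame-level head: sigmoid allows overlapping (multi-label) events.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        z = self.cnn(x)                                 # (B, C, n_mels/4, T)
        z = z.permute(0, 3, 1, 2).flatten(2)            # (B, T, C * n_mels/4)
        z = self.proj(z)
        z = self.encoder(z)
        return torch.sigmoid(self.head(z))              # per-frame event activity


if __name__ == "__main__":
    model = CNNTransformerSED()
    dummy = torch.randn(2, 1, 64, 500)                  # 2 clips, 64 mel bins, 500 frames
    print(model(dummy).shape)                           # torch.Size([2, 500, 10])
```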
The real bottleneck was the dataset; the model wasn't the biggest problem.
Initial difficulties
Training a large model with a small dataset is not only inefficient, it is counterproductive: the model memorizes the few available examples instead of learning patterns that generalize.
The solution was to design a controlled and reproducible synthetic audio generation pipeline.
Adopted approach
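A minimal sketch of what one iteration of such a pipeline could look like is shown below, assuming a bank of isolated event recordings and longer background recordings. The sample rate, clip length, gain range, file names, and CSV columns (filename, onset, offset, event_label) are illustrative choices, not the project's actual parameters.

```python
import csv
import random
from pathlib import Path

import numpy as np
import soundfile as sf

SR = 16000          # sample rate (assumed)
CLIP_SECONDS = 10   # length of each synthetic clip (assumed)

def mix_clip(background, events, rng):
    """Place isolated event recordings on top of a background at random
    offsets and gains, returning the mixture plus its label rows.
    `background` is a 1-D float array at least CLIP_SECONDS long;
    `events` is a list of (label, 1-D float array) tuples shorter than the clip."""
    mixture = background[: SR * CLIP_SECONDS].copy()
    labels = []
    for name, audio in events:
        onset = rng.uniform(0, CLIP_SECONDS - len(audio) / SR)
        start = int(onset * SR)
        gain = 10 ** (rng.uniform(-12, 0) / 20)          # random level, 0 to -12 dB
        mixture[start:start + len(audio)] += gain * audio
        labels.append((round(onset, 3), round(onset + len(audio) / SR, 3), name))
    peak = float(np.max(np.abs(mixture)))
    if peak > 1.0:                                       # avoid clipping on write
        mixture = mixture / peak
    return mixture, labels

def generate(out_dir, backgrounds, event_bank, n_clips=100, seed=0):
    """Write n_clips synthetic mixtures plus one CSV with all labels."""
    rng = random.Random(seed)                            # fixed seed => reproducible
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "labels.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "onset", "offset", "event_label"])
        for i in range(n_clips):
            bg = rng.choice(backgrounds)
            events = rng.sample(event_bank, k=rng.randint(1, min(3, len(event_bank))))
            mixture, labels = mix_clip(bg, events, rng)
            name = f"synthetic_{i:05d}.wav"
            sf.write(out_dir / name, mixture, SR)
            for onset, offset, label in labels:
                writer.writerow([name, onset, offset, label])
```

Fixing the random seed is what makes the generation reproducible: the same seed always produces the same mixtures and the same labels.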
Each generated audio file was associated with a CSV file of labels, which makes it possible to know exactly which events occur in each clip, which class they belong to, and when they start and end.
The CSV became the core of the training system, clearly separating data, labels, and configuration.
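To show how the CSV can drive training, here is a small sketch (assuming the columns above and a fixed frame rate, e.g. 16 kHz audio with a 512-sample hop, roughly 31.25 frames per second) that converts the label rows into binary frame-by-class target matrices, the form consumed by a frame-level loss such as binary cross-entropy.

```python
import csv
from collections import defaultdict

import numpy as np

def csv_to_frame_targets(csv_path, classes, clip_seconds=10.0, frames_per_second=31.25):
    """Turn (filename, onset, offset, event_label) rows into one binary
    frame-by-class target matrix per audio file."""
    n_frames = int(round(clip_seconds * frames_per_second))
    class_index = {c: i for i, c in enumerate(classes)}
    targets = defaultdict(lambda: np.zeros((n_frames, len(classes)), dtype=np.float32))
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            start = int(float(row["onset"]) * frames_per_second)
            end = int(np.ceil(float(row["offset"]) * frames_per_second))
            # Mark the event as active on every frame it covers.
            targets[row["filename"]][start:min(end, n_frames),
                                     class_index[row["event_label"]]] = 1.0
    return dict(targets)
```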
The audio files are transformed into log-mel spectrograms, following standard practice in DCASE and the academic literature. This representation keeps both the temporal and the spectral structure of the signal, on a mel scale that concentrates resolution where it is perceptually most useful, and it is the natural input for the convolutional front end of the model.
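For reference, this is roughly how such a representation can be computed with librosa; the sample rate, FFT size, hop length, and number of mel bands below are typical values, not necessarily the ones used in the project.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    """Load an audio file and return a log-mel spectrogram (n_mels x n_frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)          # resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Convert power to decibels; ref=np.max places the peak at 0 dB.
    return librosa.power_to_db(mel, ref=np.max)
```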
One of the most important lessons learned was understanding that certain errors are not bugs, but physical limitations: spectral masking also exists for models. When a loud source occupies the same time-frequency region as a quieter event, the information needed to separate them is simply no longer there.
The following figures show specific examples of the type of information processed and generated by the system. The upper panel shows the complete log-mel spectrogram of the audio, where persistent spectral patterns and variations over time can be seen, typical of an environment with multiple active sound sources. Below it is the RMS energy, which reflects the overall evolution of the audio power but which, on its own, does not allow us to discriminate which events are occurring. Finally, the bottom timeline summarizes the result of the model's inference, indicating which acoustic events were detected in each time interval. This visualization highlights one of the central motivations of the project: while simple metrics such as energy are not sufficient to separate overlapping sources or events, the model based on CNNs, attention, and Transformers manages to identify and segment different acoustic events even when they coexist in the same audio.
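A figure with that three-panel layout can be reproduced with a short script like the one below, assuming the model's detections are available as (onset, offset, label) tuples; the function name and parameters are illustrative.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_overview(y, sr, detections, hop_length=512, n_mels=64):
    """Three stacked panels: log-mel spectrogram, RMS energy, detected events.
    `detections` is a list of (onset_s, offset_s, label) tuples."""
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length, n_mels=n_mels),
        ref=np.max,
    )
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    t = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop_length)

    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 8), sharex=True)
    librosa.display.specshow(mel, sr=sr, hop_length=hop_length,
                             x_axis="time", y_axis="mel", ax=ax1)
    ax1.set_title("Log-mel spectrogram")
    ax2.plot(t, rms)
    ax2.set_title("RMS energy")
    # One horizontal bar per detected event, grouped by class.
    labels = sorted({d[2] for d in detections})
    for onset, offset, label in detections:
        ax3.barh(labels.index(label), offset - onset, left=onset, height=0.6)
    ax3.set_yticks(range(len(labels)))
    ax3.set_yticklabels(labels)
    ax3.set_title("Detected events")
    ax3.set_xlabel("Time (s)")
    fig.tight_layout()
    plt.show()
```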


This project yielded several clear conclusions: the real bottleneck is the dataset rather than the model; carefully designed synthetic data is a structural part of training, not a fallback; architectures combining CNNs, attention, and Transformers are necessary for realistic acoustic scenes; and some errors are physical limitations of the signal, not bugs.
Although the project was born in a specific environment, the end result is a generic acoustic event detection pipeline, adaptable to other industrial settings, urban sound monitoring, and environmental or wildlife acoustics.
By changing the domain of the dataset, the same approach can be reused without touching the base architecture.
Written by Alejandro Casanova.
For further inquiries, contact us: info@emtech.com.ar