Forgive us, John Connor, or How We Taught a Neural Network to Accurately Recognize Gunshots

Doubletapp
9 min read · Jul 17, 2023

My name is Anton Ryabykh, and I work at Doubletapp. In this article, I will talk about the technical details of applying machine learning in the HitFactor project.

What is hit factor? In practical shooting competitions, athletes quickly move, change magazines, and shoot at different targets, including moving ones. Hit factor is the competition result, which is the number of points divided by the time taken.

We learned about the sport from world practical shooting champions Alena Karelina and Roman Khalitov, who needed a mobile app to assist in their training: analyzing training recordings would help them understand how to move more efficiently, shoot faster, and reduce the time spent on exercises.

The project required accurately determining the start time of each gunshot and the time of the start signal. There were no ready-made solutions at the time of product development (2019). In this article, you will learn:

  • how we tried to solve the problem without machine learning
  • what approaches we used with machine learning
  • how we annotated the data
  • how we used an intermediate model to assist with annotation
  • how we deployed the final model on iOS devices.

Description of the final product

Our clients asked us to develop an app for practical shooting that would simultaneously display 2 videos, synchronize them based on the start signal, and annotate the shots from each video on the timeline. This would allow athletes to see where they make unnecessary movements between shots and waste time. The time of each shot would also be displayed near the video. The app interface looked like this:

Here, on the timeline, the white vertical line indicates the start signal, and the red lines indicate the shots.
And here are two videos side by side: they are synchronized based on the start signal. The red lines below the timeline indicate the shots from the left video, and the red lines above the timeline indicate the shots from the right video. Additionally, there is a table on the side with the moments of the shots. This allows athletes to compare their performance in shooting exercises with other attempts, whether their own or someone else's.

Recognition requirements: it had to work offline on iOS devices, and shot recognition had to be very accurate (within a 50 ms margin of error).

Solution development

The problem was solved solely through sound processing. Solving it through video processing was not feasible as the shooter might not even be visible (obscured by an obstacle). Additionally, the start signal is not visible in the video, so sound processing was required in any case.

First, let’s understand what constitutes a shot and a start signal.

  • The start signal is a uniform tone at a frequency of approximately 2 kHz. However, different devices produce the signal, so the frequency can vary. The duration of the signal is approximately one second. There are almost no sounds similar to the start signal at the shooting range.
  • A gunshot is a loud sound with some duration, which can last a second or longer, gradually fading out. There can be multiple consecutive gunshots. It is important for us to detect the start of each gunshot, i.e. the moment when the shot occurred, including the start of each shot in a consecutive series. There can be many sounds similar to gunshots at the range (such as any impact or weapon handling noise).

Existing solutions

As I mentioned before, in 2019, when we worked on the project, we couldn’t find any existing solutions for gunshot recognition or accurately detecting the start of a specific sound in a long audio track.

Data, part 1

At the beginning of the project, we had limited data. We obtained a few video clips from the clients and downloaded some from YouTube. We also found datasets containing gunshot sounds, but they were limited, and the sounds were not isolated. In one track, there could be silence, followed by a gunshot sound, then silence again, and sometimes another gunshot or a burst.

Baseline: Solution without machine learning

As a baseline solution, we attempted to detect gunshots by volume: set a loudness threshold and rely on the fadeout (a rough sketch of this baseline follows the list below). However, this solution proved to be unreliable for several reasons:

  • Setting a threshold was challenging. Sometimes, if a person moved away from the camera, the sounds became much quieter. Different firearms produce different volumes, and the sound varies indoors and outdoors. It was not possible to determine a universal threshold in advance.
  • There were often many unrelated loud sounds, such as people talking near the camera, weapon handling noise, or cartridge ejection, resulting in many false positives.
  • This approach did not help in detecting the start signal.
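
Here is a minimal sketch of what such a volume-threshold baseline can look like, assuming a mono audio track extracted from the video; the frame parameters, file name, and relative threshold are illustrative, not the values we actually used.

```python
# A rough sketch of the volume-threshold baseline, not the exact code we used.
import librosa
import numpy as np

audio, sr = librosa.load("training_video_audio.wav", sr=22050, mono=True)

# Short-time RMS energy: one value per ~23 ms frame at hop_length=512.
rms = librosa.feature.rms(y=audio, hop_length=512)[0]

threshold = 0.3 * rms.max()          # relative threshold; hard to pick universally
candidates = np.flatnonzero(rms > threshold)

# Keep only rising edges: frames where the energy first crosses the threshold.
onsets = candidates[np.insert(np.diff(candidates) > 1, 0, True)]
onset_times = librosa.frames_to_time(onsets, sr=sr, hop_length=512)
print(onset_times)  # candidate "shot" times, many of which are false positives
```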

Machine learning solution

Nearly all audio recognition solutions work on spectrograms, so our audio processing quickly transitioned to them as well. A spectrogram is effectively a 2D array that shows how loud each frequency is at each moment in time.

Example of a spectrogram: Blue areas represent silence, and red areas represent loud sounds, in this case, gunshots. It can be observed that the gunshots fade out over time.

We generated spectrograms using the librosa library.
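
For illustration, here is a minimal sketch of building such a spectrogram with librosa; the FFT and hop parameters are assumptions (n_fft=2048 happens to give the 1025 frequency bins mentioned below).

```python
# A minimal sketch of spectrogram generation with librosa; parameters are illustrative.
import librosa
import numpy as np

audio, sr = librosa.load("clip.wav", sr=22050, mono=True)

# STFT with n_fft=2048 yields 1025 frequency bins per time bin.
stft = librosa.stft(audio, n_fft=2048, hop_length=512)
spectrogram = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

print(spectrogram.shape)  # (1025, number_of_time_bins)
```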

There were only a few videos with gunshots, but each video contained multiple gunshots. Additionally, we had audio tracks with gunshots within silence.

We decided to extract gunshot sounds using Audacity, creating synthetic tracks with gunshots or start signals overlaid on, for example, a person speaking or running, or over music. Then, we trained the network using this data.
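
A sketch of the general idea of such mixing, assuming the gunshot has already been cut out as a separate file; the file names and gain are illustrative.

```python
# A sketch of assembling a synthetic training track: a cut-out gunshot is mixed
# into a background recording at a random offset. Names and gain are illustrative.
import librosa
import numpy as np
import soundfile as sf

background, sr = librosa.load("podcast_background.wav", sr=22050, mono=True)
gunshot, _ = librosa.load("gunshot_cut.wav", sr=22050, mono=True)

offset = np.random.randint(0, len(background) - len(gunshot))
mix = background.copy()
mix[offset:offset + len(gunshot)] += 0.8 * gunshot   # overlay with some gain
mix = np.clip(mix, -1.0, 1.0)

sf.write("synthetic_track.wav", mix, sr)
# The label is the exact moment where the gunshot starts: offset / sr seconds.
```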

First version of the network

Initially, we experimented with the following architecture: The network received a 2D array input of size 9x1025, where 9 represented the time bins (horizontal elements in the spectrogram graph; one bin represented, for example, 0.2 seconds depending on the spectrogram conversion parameters), and 1025 represented the frequency bins (vertical ones on the graph). The network’s task was to predict whether the central bin was the start of a gunshot, start signal, or background noise.

Example of a gunshot: a short plot with 15 time bins, which is more illustrative than 9. We tested different network parameters, and an input of 9 bins yielded the most accurate results.

The network architecture was convolutional, with a gradual reduction in the vertical dimension using MaxPooling (the time dimension was not reduced).

At the end, the output was split into three classes: background noise, gunshot start, and start signal.
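
To make this concrete, here is a rough PyTorch sketch of a network of this kind: 2D convolutions with MaxPooling applied only along the frequency axis, classifying the central time bin into three classes. The layer sizes are assumptions, not the exact configuration we used.

```python
# A hedged sketch of this kind of architecture; layer sizes are assumptions.
import torch
import torch.nn as nn

class ShotStartCNN(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        # Input: (batch, 1, freq=1025, time=9).
        # MaxPool only along the frequency axis, so the 9 time bins are preserved.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, 3), padding=(2, 1)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1)),
            nn.Conv2d(16, 32, kernel_size=(5, 3), padding=(2, 1)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1)),
            nn.Conv2d(32, 64, kernel_size=(5, 3), padding=(2, 1)), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.classifier = nn.Linear(64, n_classes)  # background / shot start / start signal

    def forward(self, x):
        x = self.features(x)                 # (batch, 64, 1, 1)
        return self.classifier(x.flatten(1))

logits = ShotStartCNN()(torch.randn(2, 1, 1025, 9))
print(logits.shape)  # torch.Size([2, 3])
```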

Metrics

For the first architecture, the metrics were calculated as for a classification task at a specific moment in time (i.e. whether the central bin is the beginning of a gunshot), and they looked good on paper: the F-score for gunshots was approximately 92%.

Results of the first architecture

It was integrated into an iOS mobile app and worked offline through CoreML. There were no difficulties in integrating this architecture.

Issues

  • There were many false positives in the real videos: the training was done on synthetic data, and the metrics were calculated on the same data. There was no annotation for real shooting videos. As a result, sounds similar to gunshots, such as weapon handling noise, were classified as gunshots, producing false positives.
  • Low accuracy in identifying the start of gunshots: although the gunshots themselves were detected well, the start time was not determined accurately. The reason for this was most likely that when cutting the gunshots in Audacity, their beginnings were not precisely trimmed as the tool was inconvenient for this task. As a result, the network learned to detect something close to the start of a gunshot but not exactly.
  • Long processing time on devices: since the network had to be run for each moment in time, the processing was time-consuming, taking about a minute for a 20-second video.

Data, part 2: annotation tool

To address the first two issues, we decided to get more videos from the client and annotate them precisely. The goal was to know where the gunshots and start signals were located in the real audio tracks. This was expected to reduce the number of false positives as real sounds similar to gunshots would be present in the audio and annotated as background noises, while the accuracy would increase due to the precise annotation of gunshot starts. We also decided to annotate not only the gunshot start but also the subsequent fadeout.

At this point, the client gave us a hard drive with approximately a thousand training videos — quite a lot.

To perform this annotation, we had to develop a custom annotation tool. It ran in the browser, and the server was built using Flask.

In the tool, one video was displayed in two players, along with a timeline showing the annotated gunshots, an audio waveform also annotated with gunshots, and a spectrogram. It was possible to use the mouse to select an area on the spectrogram or waveform and press a specific key on the keyboard to save it as a gunshot, start signal, or gunshot-like sound. One video player simply showed the video, synchronized with the timelines, while the other only displayed the selected regions, which allowed for viewing without resetting the main timeline.

The result of the annotation was a file containing the audio track with specific annotations (e.g. gunshot, start signal) and separately cut sounds that could be used later for generating tracks.

The tool was convenient and allowed for quick annotation of multiple videos.
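
As an illustration of the general setup, here is a heavily simplified sketch of a Flask endpoint such a tool could use to store annotated regions; the route and the JSON storage format are assumptions, not our actual implementation.

```python
# A heavily simplified sketch of an annotation-saving endpoint; not our real code.
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
ANNOTATIONS_FILE = "annotations.json"

@app.route("/annotate", methods=["POST"])
def annotate():
    # Expected payload: {"video": "...", "label": "gunshot", "start": 12.34, "end": 13.10}
    region = request.get_json()
    try:
        with open(ANNOTATIONS_FILE) as f:
            annotations = json.load(f)
    except FileNotFoundError:
        annotations = []
    annotations.append(region)
    with open(ANNOTATIONS_FILE, "w") as f:
        json.dump(annotations, f, indent=2)
    return jsonify({"saved": len(annotations)})

if __name__ == "__main__":
    app.run(debug=True)
```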

We annotated 30 videos and came up with a modification to the architecture to make the network’s operation more similar to manual annotation work. The idea for the architecture emerged while annotating regions on spectrograms.

Second version of the network

The new architecture consisted of 1D convolutions followed by BiLSTM. One of the architecture variations is shown in the screenshot:

1D convolutions reduced the frequency dimension by considering neighboring time bins. The time dimension remained unchanged until the end.

At the end, the output was split into 4 classes (a rough sketch of this kind of architecture follows the list):

  • background noise
  • start signal
  • gunshot start
  • remaining part of the gunshot (fadeout)
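
Here is a rough PyTorch sketch of this kind of architecture: 1D convolutions that treat frequency bins as channels (so the time dimension is preserved), a BiLSTM over the time axis, and a per-time-bin classifier into the four classes. Channel counts and hidden sizes are assumptions, not the production configuration.

```python
# A hedged sketch of a Conv1d + BiLSTM model of this kind; sizes are assumptions.
import torch
import torch.nn as nn

class ShotSequenceNet(nn.Module):
    def __init__(self, n_freq_bins=1025, n_classes=4):
        super().__init__()
        # Conv1d treats frequency bins as channels, so each layer mixes
        # neighbouring time bins while the time length stays unchanged.
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq_bins, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 64, n_classes)  # background / signal / shot start / fadeout

    def forward(self, spec):                 # spec: (batch, freq=1025, time)
        x = self.conv(spec)                  # (batch, 128, time)
        x, _ = self.rnn(x.transpose(1, 2))   # (batch, time, 128)
        return self.head(x)                  # per-time-bin class logits

logits = ShotSequenceNet()(torch.randn(1, 1025, 400))
print(logits.shape)  # torch.Size([1, 400, 4])
```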

The start of the gunshot was annotated with five consecutive elements instead of one. This was done to mitigate the strong class imbalance for the network without sacrificing the accuracy of detecting the start. It also helped the network understand that if something resembled the start, its neighboring elements should also be considered part of the start.

The remaining part of the gunshot was not functionally necessary, but the logic behind it was that it would be easier for the network to identify the start of the gunshot when it explicitly knew its continuation. The duration of the gunshot was long, and there was no strong class imbalance for it.
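
For illustration, a small sketch of how per-time-bin labels can be built from an annotated gunshot region under this scheme; the helper function and the region format are hypothetical.

```python
# An illustrative sketch of producing per-time-bin labels from annotated regions.
import numpy as np

BACKGROUND, SIGNAL, SHOT_START, FADEOUT = 0, 1, 2, 3

def make_labels(n_time_bins, shot_regions):
    """shot_regions: list of (start_bin, end_bin) pairs for annotated gunshots."""
    labels = np.full(n_time_bins, BACKGROUND, dtype=np.int64)
    for start, end in shot_regions:
        labels[start:end] = FADEOUT            # the rest of the shot
        labels[start:start + 5] = SHOT_START   # smear the start over 5 bins
    return labels

print(make_labels(30, [(10, 25)]))
```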

The first version of this network was trained on a small portion of annotated data, approximately 30 videos. After that, it was integrated into the annotation tool to provide suggestions and visual indications of what it considered a gunshot. This allowed us to:

  • understand how well the network performed visually by seeing what it annotated as a gunshot, start signal, or gunshot start.
  • speed up the annotation process. If the network correctly identified a gunshot, we only needed to click on it and press a key to save it. If the network made a mistake, its response could be deleted.

As a result, approximately 120 videos were annotated using this approach.

An additional 120 synthetic tracks were generated: cut gunshot and start-signal sounds were overlaid on background tracks (running, music, podcasts).

The final network was trained on this dataset.

With this approach, we addressed the problems of the previous version:

  • The number of false positives decreased significantly since complex sounds from the original tracks were preserved and shown to the network.
  • The accuracy of detecting the start of the gunshot improved because the tool allowed precise annotation of the gunshot start on the spectrogram, making it easier to identify.
  • After porting the network to iOS, it worked much faster and more accurately than the previous version.

Final metrics

The gunshot recognition metrics were calculated so that a detection counted as correct only if the gunshot start determined by the network differed from the actual start by no more than 50 milliseconds.

  • The recognition accuracy was 99.1%.
  • The recognition sensitivity was 97.8%.

Accuracy here refers to the probability that a detected gunshot is indeed a gunshot (i.e. precision).

Sensitivity refers to the probability that an existing gunshot will be detected (i.e. recall).
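
As a sketch of how such metrics can be computed, the snippet below greedily matches predicted shot times to annotated ones within the 50 ms tolerance and derives the two numbers; the matching procedure is an assumption about the evaluation, not the exact script we used.

```python
# A sketch of precision/recall with a 50 ms tolerance; matching is an assumption.
def shot_metrics(predicted, ground_truth, tolerance=0.05):
    predicted, ground_truth = sorted(predicted), sorted(ground_truth)
    matched_gt = set()
    true_positives = 0
    for p in predicted:
        # Match each prediction to the closest unmatched annotated shot.
        candidates = [(abs(p - g), i) for i, g in enumerate(ground_truth)
                      if i not in matched_gt and abs(p - g) <= tolerance]
        if candidates:
            matched_gt.add(min(candidates)[1])
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0      # "accuracy" above
    recall = true_positives / len(ground_truth) if ground_truth else 0.0   # "sensitivity" above
    return precision, recall

print(shot_metrics([1.02, 2.51, 4.00], [1.00, 2.50, 3.30]))  # approximately (0.67, 0.67)
```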

Porting to iOS

The network was ported to iOS and it worked offline. Unlike the previous version, this version worked much faster and more accurately. There was a challenge with CoreML, as it had a limitation on the length of the LSTM sequence. If the recording was approximately one minute long, an error occurred. This was resolved by simply dividing the audio recording into multiple parts and combining the results.
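
The splitting idea, sketched here in Python for readability (on the device the equivalent logic runs in Swift against the CoreML model); the chunk length is an illustrative assumption.

```python
# A sketch of chunked inference to stay under a sequence-length limit.
import numpy as np

def predict_in_chunks(spectrogram, model_fn, chunk_bins=2048):
    """spectrogram: (freq, time) array; model_fn returns per-time-bin class ids."""
    outputs = []
    for start in range(0, spectrogram.shape[1], chunk_bins):
        chunk = spectrogram[:, start:start + chunk_bins]
        outputs.append(model_fn(chunk))
    return np.concatenate(outputs)  # stitched predictions for the whole recording
```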

Additionally, after porting it was discovered that the model performed poorly on data with a different sample rate than the one it was trained on (44 kHz vs. 22 kHz), despite the fact that there should theoretically be no difference after obtaining the spectrogram. Therefore, the final model was trained on different sample rates (22 kHz, 44 kHz, 11 kHz) to generalize well and accurately process data from various devices.
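
A sketch of the sample-rate augmentation idea: each training clip is loaded at one of several rates before the spectrogram is computed. The specific rates mirror the ones mentioned above; the helper itself is hypothetical.

```python
# An illustrative helper: load each training clip at a randomly chosen sample rate.
import random
import librosa

def load_with_random_rate(path, rates=(11025, 22050, 44100)):
    target_sr = random.choice(rates)
    audio, sr = librosa.load(path, sr=target_sr, mono=True)  # librosa resamples on load
    return audio, sr
```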

Conclusion

We have created an app that accurately detects start signals and gunshot timings. Additionally, we have developed a convenient tool for annotating sounds in combination with videos.
