EM GUIDE – Interview by Peter Bokor w/ Alex Radzishevsky

AudioTag: Challenging Shazam’s fingerprinting

Alex Radzishevsky

Peter Bokor of MMN talked to Alex Radzishevsky, the creator of AudioTag.info for automated track ID detection. The Ukraine-born, Haifa-based Israeli sound engineer is the author of various patented audio DSP algorithms for not only music “fingerprinting” – the underlying approach behind services like Shazam – but also for “audio watermarking”, which means hiding digital audio signatures in audio recordings (e.g., music files) for copyright protection and anti-piracy (AudioWatermarking.com). Alex explains the inherent challenges of automated track ID detection, describes his approach with AudioTag, and sheds light on Shazam’s operation. We also ask Alex about his background in music activism, his involvement in the demoscene, and why he chose to support Lahmacun radio with a free AudioTag license.

Let’s start with AudioTag, a music recognition service and audio fingerprinting tool similar to Shazam. AudioTag can be used “in the background” to automatically create playlists of a stream or a longer mix. What does the competition look like in the field of audio fingerprinting?

Music recognition is a niche field. There are some competitors like ACR Cloud, but in general, big companies aren’t interested in offering such a service unless it generates significant revenue. Shazam, for example, was acquired by Apple and is mainly used to generate sales for Apple Music. Offering programmable interfaces (also called APIs) for audio fingerprinting, like AudioTag or ACR Cloud, is too small of a market for the big tech companies, despite the wide range of applications beyond Shazam-like track ID detection.

How does audio fingerprinting conceptually work?

The main idea is similar across the various technology providers. A fingerprint is basically a compact digital representation of acoustic content (technically speaking, an “acoustic hash”). For example, I take a 3-minute music track, pass it through the fingerprinting algorithm, and it creates a very compact representation of it. You can’t listen to this representation, but it uniquely identifies that specific track in a very small file, say under 10 kilobytes. It’s important to note that it’s not a numeric hash of the digital music file, but a hash of its acoustic, audible content. So, even if the music is converted, say from WAV to MP3, the fingerprint will still match the original one, allowing the track to be identified.
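The distinction between a file hash and an acoustic hash can be illustrated with a toy sketch. This is not AudioTag’s algorithm — just a minimal, assumed example in which the “acoustic hash” is one bit per frequency band (is the band louder than average?), which survives a simulated re-encoding while the byte-level hash does not:

```python
import hashlib
import numpy as np

def file_hash(raw_bytes):
    # A cryptographic hash of the file bytes: any re-encoding changes it completely.
    return hashlib.md5(raw_bytes).hexdigest()

def acoustic_hash(samples, bands=8):
    # Toy "acoustic hash": one bit per frequency band, set when the band
    # is louder than the average band. Robust to mild level/noise changes.
    spectrum = np.abs(np.fft.rfft(samples))
    energies = np.array([c.mean() for c in np.array_split(spectrum, bands)])
    return tuple((energies > energies.mean()).astype(int))

sr = 8000
t = np.arange(sr) / sr
original = np.sin(2 * np.pi * 440 * t)
# Simulate lossy re-encoding: slight amplitude change plus low-level noise.
reencoded = 0.9 * original + 0.001 * np.random.default_rng(0).normal(size=sr)

print(file_hash(original.tobytes()) == file_hash(reencoded.tobytes()))  # False
print(acoustic_hash(original) == acoustic_hash(reencoded))              # True
```

Real fingerprints are far more discriminative than eight bits, but the principle is the same: hash what the listener would hear, not the bytes on disk.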

But that’s not the whole story. The fingerprinting algorithm only creates a compact representation of a track. You still need a specifically designed database of music fingerprints and a special algorithm to perform quick searches within that database. The search problem is this: “Here’s an audio fingerprint, return the track from the database that matches this fingerprint.” The database structure and efficient search algorithm are part of the challenges. For example, a standard SQL database wouldn’t be efficient for this purpose with a huge amount of music. The fingerprinting algorithm, the database format, and the search method are three tightly-related components of the whole solution. They can’t be designed independently but must be part of a single, highly integrated algorithmic system.
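One common way to make this kind of search fast — hedged here as a generic textbook approach, not AudioTag’s proprietary design — is an inverted index: each track contributes many small sub-fingerprint hashes, and a query votes for the track it shares the most hashes with, so a few corrupted hashes don’t break the match:

```python
from collections import defaultdict, Counter

class FingerprintIndex:
    """Hypothetical sub-fingerprint index: hash -> set of track IDs."""

    def __init__(self):
        self.index = defaultdict(set)

    def add_track(self, track_id, hashes):
        for h in hashes:
            self.index[h].add(track_id)

    def query(self, hashes):
        # Each matching hash casts a vote; the track with the most votes wins.
        votes = Counter()
        for h in hashes:
            for track_id in self.index.get(h, ()):
                votes[track_id] += 1
        return votes.most_common(1)[0][0] if votes else None

idx = FingerprintIndex()
idx.add_track("track_a", [101, 102, 103, 104])
idx.add_track("track_b", [201, 102, 203, 204])

# A noisy query: most hashes survive, one (999) is corrupted.
print(idx.query([101, 103, 104, 999]))  # track_a
```

This also shows why the fingerprint format and the search structure must be co-designed: the hashes have to be small and repeatable enough to serve as index keys in the first place.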

An example use case is analyzing a stream of music to create a playlist. The algorithm needs to continuously scan the stream, create hashes periodically, and search the database for a matching fingerprint almost instantaneously.

What are the sources of your database? In other words, where do you find music?

Basically, we crawl the Internet to collect music for our own database. Some people also send their large music collections to add to the database. But the main source is the continuous crawling of the Internet for new tracks. Of course, I don’t store the audio itself — that wouldn’t be legal or feasible since it’s a vast amount of data.

The sources include YouTube, public audio libraries, some closed communities and websites, social networks for track shares, and many others. We also crawl Telegram, for example, public channels where people share their tracks. However, we have to be careful because we can’t always trust the metadata information (artist name, track title, etc.). Additionally, if there are different versions of the same track, we prefer the best quality one, as its hash is likely to match with the hash of a lower quality version.

How many tracks do you have? Do you also crawl Bandcamp?

It’s a multimillion-track database already. There are an estimated 90 to 100 million songs worldwide, and we’ve hashed a little more than 20 million tracks as of today. It’s a large enough database to serve even commercial applications, but of course, we can’t compete with Shazam. Those guys, I believe, have fingerprints of everything that has ever been released.

We don’t crawl Bandcamp yet. In general, crawling is a very complicated part of the project. The interfaces of music hosting services are different and can change arbitrarily over time without any prior announcement.

What is your business model? How do you protect your IP? Is it true that you write patents yourself?

You can use AudioTag for free in various ways — whether it’s for single tracks, streams, via the website, or through the API. For analyzing streams, we do put a cap on the number of searches, and you can sign up for a paid subscription to unlock more searches.

The acoustic fingerprinting algorithm behind AudioTag is protected by patents. Even though AudioTag is a small project — pretty much a one-man band — the algorithmic design is critical: it can significantly impact the acoustic performance and efficiency of search, and competitors might try to copy your solution. Also, commercial customers sometimes specifically request patented technology to ensure it’s protected.

And yes, I write the patents myself, which is another tricky task. It involves abstract thinking and using formal language. Projects like this develop you into a bit of a renaissance man, which is a lot of work but also fun. Plus, hiring a patent attorney would be an immense expense, which I probably couldn’t afford.

Can you explain how the fingerprinting algorithm that powers AudioTag works?

It’s an academic field with a plethora of publications on the topic. All these algorithms are based on extracting so-called “acoustic features”. An acoustic feature is an event occurring in the music that is prominent enough to be noticeable to the listener. Most of these algorithms work in the spectral-time domain. So, the sound is decomposed into spectral components along the frequency and time axes, which are used to find and extract acoustic features, like a pitch or a beat in the sound. There are dozens of different ways to do that and various features you can extract.

The job is also to represent such a feature in a compact form. In practice, we’re talking about a sequence of features in the track that spread over time and frequency. Think of the track as a map of events related to time and frequency, and you look for prominent events on this map. You take those that match your idea of feature extraction, and then you store the information about these features in an acoustic hash.
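The “map of events over time and frequency” can be sketched in a few lines. This is a deliberately simplified illustration — not Alex’s patented feature extraction — that keeps only the loudest frequency bins of each time frame as the “prominent events”:

```python
import numpy as np

def spectral_peaks(samples, sr=8000, frame=1024, hop=512, peaks_per_frame=3):
    """Toy feature extractor: the loudest frequency bins per time frame,
    a simplified version of a time-frequency map of prominent events."""
    window = np.hanning(frame)
    features = []
    for start in range(0, len(samples) - frame, hop):
        spectrum = np.abs(np.fft.rfft(samples[start:start + frame] * window))
        top_bins = np.argsort(spectrum)[-peaks_per_frame:]
        t = start / sr
        for b in sorted(top_bins):
            features.append((round(t, 3), int(b * sr / frame)))  # (time s, freq Hz)
    return features

sr = 8000
t = np.arange(sr) / sr
# One second of a 440 Hz tone followed by one second of 880 Hz.
signal = np.concatenate([np.sin(2 * np.pi * 440 * t),
                         np.sin(2 * np.pi * 880 * t)])
feats = spectral_peaks(signal, sr)
```

For this signal, the early features cluster around 440 Hz and the late ones around 880 Hz. A production system would then encode relations between such peaks (e.g., time and frequency deltas between pairs) into the compact hash described above.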

Does Shazam work similarly?

Yes, it’s a brilliant and simple idea with a well-understood algorithm. It was published as a scientific paper and was also patented in the early 2000s. I do fingerprinting my own way — the feature extraction idea is different, and the algorithm is younger than Shazam’s. My patent is now 6 years old. In general, DSP is a crowded field. There are many different approaches to feature extraction, combining features, and storing and searching the information. As mentioned earlier, the operation needs to be quality-agnostic as well, which adds to the complexity of the problem.

You seem to be supportive when it comes to using your software. Budapest’s Lahmacun radio, for example, is using AudioTag to produce quarterly playlists for the local copyright authority. Being a non-profit project with very limited resources, the radio wouldn’t be able to produce such data otherwise, so it’s an “existential” game changer for them. Can you imagine making AudioTag open-source?

I develop AudioTag after-hours, besides my daytime job, and I have the freedom to support projects like Lahmacun radio. I come from a music background — I was into the demoscene and part of the tracker community, and I authored a music web magazine for 15 years — so I really appreciate non-profit music projects. My goal is to find a balance between earning some money with AudioTag and still doing work for “charity.” AudioTag can be used for free with some limitations on the amount of usage, which I see as a voluntary contribution. I wouldn’t benefit from going open-source. My competitors, who are often big commercial companies, probably would! 😉

You work at Alango Technologies, a digital sound processing company. How do you balance your work?

Yeah, I’ve been at Alango for 18 years now. I’m currently the director of product development, managing engineering and algorithm R&D. The company develops all kinds of algorithms, like echo cancellation and noise reduction, which are related to sound and audio but not to music. So, there’s an overlap in the underlying knowledge, but no conflict because Alango isn’t involved in music fingerprinting and watermarking.

Sometimes, a hobby and professional work can support each other (laughs). For example, the best video game programmers often come from the demoscene.

Haifa view – The home of Alex Radzishevsky, the author of AudioTag.info and AudioWatermarking.com


Tell me about your other technology, AudioWatermarking.com. How does it relate to audio fingerprinting?

Audio watermarking is a method of hiding signatures (or personal identifiers) inside acoustic content. It’s a different audio processing problem and requires completely different algorithms than fingerprinting. The main challenge is to embed a digital identifier in the actual audible content of a music track without changing how the song sounds to the listener. So, it involves modifying the song, but in a way that is imperceptible to the human ear.

Here’s an example: you’re a musician and you want to share one of your songs with Sarah. You watermark the song with a personal identifier for Sarah and send her the watermarked song. The song sounds the same as before watermarking. Then, you send another copy to John by embedding John’s personal identifier into the song. Again, the original song, the one watermarked for Sarah, and the one watermarked for John all sound exactly the same to the human listener. If I later find the song on a file-sharing platform like Soulseek and see that it was watermarked for John, I know that John was the one who leaked it.

Wow. How does that work? How can you modify music without changing how it is perceived?

It’s another patent (laughs). Actually, two — the second one is very new. Watermarking falls into the field of psychoacoustics. In simple terms, the challenge is to find sounds that are not noticeable to the human ear and hide information in them. For example, so-called fricatives in speech, non-pitched signals, or other high-energy non-harmonic sounds are ideal because the human auditory system is not very sensitive to their nuances. If we make some changes in these sounds, the listener likely won’t notice.

Another effect is called “time masking,” known from psychoacoustics. If you place a low-level (quiet) sound just a few milliseconds after a high-level (loud) sound, the brain won’t “hear” it. This phenomenon allows us to introduce unnoticeable modifications in the acoustic content if they are hidden (masked) by preceding high-level sounds. There are various techniques and methods to introduce unnoticeable changes into sound recordings.
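Temporal masking can be captured in a toy decision rule. The threshold values and linear decay below are purely illustrative assumptions for the sketch — real psychoacoustic models are far more nuanced than this:

```python
def is_masked(quiet_db, loud_db, gap_ms, window_ms=50.0, slope_db_per_ms=1.0):
    """Toy model of temporal (post-)masking: a quiet sound is inaudible if it
    falls within `window_ms` of a loud sound and sits below a threshold that
    decays linearly after the loud sound. All parameters are illustrative."""
    if gap_ms > window_ms:
        return False
    threshold_db = loud_db - slope_db_per_ms * gap_ms
    return quiet_db < threshold_db

# A -40 dB modification 10 ms after a -6 dB drum hit: hidden by the hit.
print(is_masked(quiet_db=-40, loud_db=-6, gap_ms=10))   # True
# The same modification 200 ms later: no longer masked.
print(is_masked(quiet_db=-40, loud_db=-6, gap_ms=200))  # False
```

A watermarking encoder would use a model like this (in a much more refined form) to decide where in the track it can safely hide the identifier bits.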

I suppose that your algorithms (both tagging and watermarking) are deterministic, in the sense that they are not “trained”, so it’s not any form of machine learning, right? Can you imagine an AI-augmented approach?

All of my DSP technologies so far have been based on classical algorithms, without involving AI or machine learning. I believe AI is indeed the future, and hybrid solutions combining AI with classical algorithms will be the next step. In some audio areas, such as noise reduction in conferencing applications, this is already happening. However, the term “AI” is often overused as a marketing buzzword and doesn’t inherently guarantee better features or quality. There are many different machine learning approaches, not just neural networks, and incorporating AI doesn’t automatically mean superiority. That said, AI can open new possibilities in areas where classical methods have reached their limits. Interestingly, my technologies, especially watermarking, become even more relevant in the modern world of AI and deepfakes. I recently posted an article discussing exactly this topic.

Alex, thanks a lot for the conversation!


EM Guide – “This article is brought to you by MMN Mag as part of the EM GUIDE project – an initiative dedicated to empowering independent music magazines and strengthening the underground music scene in Europe. Read more about the project at emgui.de”

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Education and Culture Executive Agency (EACEA). Neither the European Union nor EACEA can be held responsible for them.

Kaput - Magazin für Insolvenz & Pop | Aquinostrasse 1 | Zweites Hinterhaus, 50670 Köln | Germany
Herausgeber & Chefredaktion:
Thomas Venker & Linus Volkmann