audio – Andrew E. Scott

Mathematical, musical curiosity

I’ve recently been writing an app that uses the autocorrelation approach to detect the pitch of musical notes. This approach basically tries to see if a given musical note is in a digital audio signal by comparing each sample with the later sample in the signal that ought to be the same (since a given note should repeat periodically as per its frequency). In exploring how best to do this in my app, I’ve come across something I found curious.
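To make the idea concrete, here is a minimal sketch of autocorrelation on a synthetic tone. It is not the code from my app: the function names, the 2,048-sample window and the pure 440Hz sine wave are all assumptions made for illustration. Each sample is multiplied by the sample a candidate number of positions later, and the candidate with the highest total is taken as the period.

```python
import math

def correlation_at_lag(samples, lag, window):
    """Sum of products of each sample with the sample `lag` positions later."""
    return sum(samples[i] * samples[i + lag] for i in range(window))

def best_period(samples, min_lag, max_lag, window):
    """Return the lag (in samples) with the highest correlation score."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: correlation_at_lag(samples, lag, window))

SAMPLE_RATE = 44100          # CD sample frequency in Hz
WINDOW = 2048                # assumed analysis window, in samples
MIN_LAG, MAX_LAG = 16, 179   # roughly the periods of F7 and B3 at 44.1kHz

# Synthetic test signal (an assumption): a pure 440Hz sine wave.
tone = [math.sin(2 * math.pi * 440.0 * n / SAMPLE_RATE)
        for n in range(WINDOW + MAX_LAG)]

period = best_period(tone, MIN_LAG, MAX_LAG, WINDOW)
print(period, SAMPLE_RATE / period)
```

The true period of a 440Hz tone at 44.1kHz is about 100.2 samples, so the sketch settles on a whole-sample period of 100, which corresponds to 441Hz rather than 440Hz.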

Before I get to that, I need to explain a couple of things. Firstly, I am doing this pitch detection for a particular instrument: the flute. The flute is ordinarily considered to be able to play notes from B3 (the B immediately below middle C, but only if a flute has a “B foot”, otherwise from middle C) to C7 (the C three octaves above middle C). However, very skilled players might be able to get a few notes higher, to F7. Also, the piccolo flute can go up to C8, but we’ll ignore that for now. Given that the frequency of B3 is 246.9Hz and that of F7 is 2,793.8Hz, the 43 notes are spread across about 2,550Hz of frequencies.
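These frequencies can be reproduced with a short sketch, assuming twelve-tone equal temperament with A4 tuned to 440Hz (the tuning reference and the helper names are my assumptions here, and notes are numbered the way MIDI does, with middle C as 60).

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_number(note):
    """Convert a name like 'B3' or 'G#6' to a MIDI note number (C4 = 60)."""
    name, octave = note[:-1], int(note[-1])
    return 12 * (octave + 1) + NOTE_NAMES.index(name)

def note_frequency(note, a4_hz=440.0):
    """Equal-temperament frequency in Hz, assuming A4 = 440Hz by default."""
    return a4_hz * 2 ** ((midi_number(note) - 69) / 12)

print(round(note_frequency("B3"), 1))             # 246.9
print(round(note_frequency("F7"), 1))             # 2793.8
print(midi_number("F7") - midi_number("B3") + 1)  # 43 notes in the range
```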

The other thing to explain is that CDs (and many electronic devices) use a sample frequency of 44,100Hz. This is considered to be sufficiently high to record and reproduce audio signals up to 20,000Hz, which is the general limit of human hearing. However, a higher sample frequency, of 48,000Hz, is being increasingly used, such as in DAT tapes or DVDs.

These two things come together in autocorrelation because it requires knowing the period of each note, measured in numbers of samples. For example, the audio signal for a pure B3 tone should repeat every 178.6 samples if sampled at 44.1kHz or every 194.4 samples if sampled at 48kHz. Similarly, F7 should repeat every 15.8 samples at 44.1kHz or every 17.2 samples at 48kHz. Except there’s no such thing as a fraction of a sample, so for my autocorrelation calculations, I would round to the nearest sample.

Rounding introduces error, so using a period of 16 samples (at 44.1kHz) or 17 samples (at 48kHz) for F7 is not ideal. In fact, these periods correspond to different frequencies – 2,756.3Hz and 2,823.5Hz respectively. The intervals between musical notes are measured in cents, and there are 100 evenly-spaced cents to a semitone. The frequency corresponding to a period of 16 samples at 44.1kHz is 23 cents below the real F7, and the frequency corresponding to a period of 17 samples at 48kHz is 18 cents above F7. Higher notes are more error-prone: the corresponding errors for a low note like B3 are only 4 cents below (for 44.1kHz) and 3 cents above (for 48kHz).
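The arithmetic here can be sketched as follows. The function name is made up for illustration, and the note frequencies are typed in with one more digit than quoted above (246.94Hz and 2,793.83Hz, assuming A4 = 440Hz). The error in cents is 1,200 times the base-2 logarithm of the ratio between the frequency implied by the rounded period and the true frequency.

```python
import math

def rounded_period_error(note_hz, sample_rate_hz):
    """Return (exact period, rounded period, implied Hz, error in cents)."""
    exact_period = sample_rate_hz / note_hz      # e.g. 15.8 samples for F7
    whole_period = round(exact_period)           # no fractions of a sample
    implied_hz = sample_rate_hz / whole_period   # pitch the rounded period detects
    error_cents = 1200 * math.log2(implied_hz / note_hz)
    return exact_period, whole_period, implied_hz, error_cents

for name, hz in (("B3", 246.94), ("F7", 2793.83)):
    for rate in (44100, 48000):
        exact, whole, implied, cents = rounded_period_error(hz, rate)
        print(f"{name} at {rate}Hz: {exact:.1f} -> {whole} samples, "
              f"{implied:.1f}Hz, {cents:+.1f} cents")
```

A negative result means the rounded period is flat of the real note, and a positive one means it is sharp.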

For my autocorrelation algorithm, some error in detecting pitch is okay: as long as the flute is playing in tune, an algorithm that is less than 50 cents out will always get the right note. So, I wrote some code to look at the maximum error in cents from following this approach, considering a range of sample frequencies from 2,000Hz to 60,000Hz, and got a curious graph:
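The code behind the graph isn’t reproduced here, but the calculation can be sketched along these lines. The names are invented for this sketch, notes are identified by MIDI number (B3 = 59, F7 = 101), and A4 = 440Hz equal temperament is assumed.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(midi):
    """Name of a MIDI note number, e.g. 59 -> 'B3'."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def error_cents(midi, sample_rate_hz):
    """Signed error in cents from rounding the note's period to whole samples."""
    note_hz = 440.0 * 2 ** ((midi - 69) / 12)
    period = max(1, round(sample_rate_hz / note_hz))  # guard against a zero period
    return 1200 * math.log2((sample_rate_hz / period) / note_hz)

def worst_note(sample_rate_hz, low_midi=59, high_midi=101):
    """Return (note name, absolute error in cents) for the worst note in range."""
    worst = max(range(low_midi, high_midi + 1),
                key=lambda m: abs(error_cents(m, sample_rate_hz)))
    return note_name(worst), abs(error_cents(worst, sample_rate_hz))

for rate in (44100, 48000):
    name, cents = worst_note(rate)
    print(f"{rate}Hz: worst note {name}, {cents:.1f} cents")
```

Calling `worst_note` for each sample frequency between 2,000Hz and 60,000Hz, at whatever step size is convenient, gives the data for a graph of this kind.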

You might be able to see small red dots at the points for 44.1kHz and 48kHz (or you can click through to see a bigger version of the photo). This graph shows the maximum error in cents across all notes in the range between A3 and F7, and it is less than 40 cents for both 44.1kHz and 48kHz. In fact, the maximum error for 44.1kHz (29.3 cents, relating to the note G#6) is less than that for 48kHz (37.0 cents, relating to the note D7), and the error at 44.1kHz is close to the minimum for all sample frequencies up until about 57.8kHz.

There is a general trend that higher sample rates result in lower errors, so I wasn’t expecting the sample rate of 44.1kHz to have a lower maximum error than 48kHz. I wondered if this was due to the specific range I was examining, so I wrote some more code to examine how the maximum error for these two sample frequencies changed if I used ranges of notes starting at A3 and finishing anywhere between C7 and C8. Here’s the resulting graph:

As before, for a note range going up to F7, 44.1kHz has a lower maximum error in cents than 48kHz. However, if the note range had stopped at C7, 48kHz would have a lower maximum error. Also, if we’d gone above A7, 48kHz would again be more accurate than 44.1kHz, but at that point the error would be above 50 cents, i.e. not accurate enough to be useful.
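A rough sketch of this second comparison, again assuming A4 = 440Hz and using MIDI note numbers (A3 = 57, C7 = 96, C8 = 108), with a function name made up for the purpose:

```python
import math

def max_error_cents(sample_rate_hz, low_midi, high_midi):
    """Largest absolute rounding error in cents over a range of MIDI notes."""
    worst = 0.0
    for midi in range(low_midi, high_midi + 1):
        note_hz = 440.0 * 2 ** ((midi - 69) / 12)
        period = max(1, round(sample_rate_hz / note_hz))
        cents = abs(1200 * math.log2(sample_rate_hz / (period * note_hz)))
        worst = max(worst, cents)
    return worst

A3, C7, C8 = 57, 96, 108

# Move the top of the range from C7 up to C8, one semitone at a time.
for top in range(C7, C8 + 1):
    at_441 = max_error_cents(44100, A3, top)
    at_48 = max_error_cents(48000, A3, top)
    better = "44.1kHz" if at_441 < at_48 else "48kHz"
    print(f"top note MIDI {top}: {at_441:.1f} vs {at_48:.1f} cents, {better} is lower")
```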

So, curiously 44.1kHz happens to be well-suited to autocorrelation of notes in the flute range. I’m sure this wasn’t a consideration when that was selected as a common sample frequency for audio recordings, but it happens to benefit me now.

Personal and environmental audio – hear hear!

Just before Christmas, a friend brought me a new pair of headphones back from the US. I still haven’t quite decided yet whether they are the future of personal audio or just a step in the right direction, but I am finding them a bit of a revelation.

The headphones are the AfterShokz Sportz M2, which are relatively cheap, bone conduction headphones. Bone conduction means that instead of the headphones sending sound into your ear canal (like in-ear or full size headphones), they sit against the bones of your skull and send vibrations along them to your inner ear. The main advantage is that while listening to audio from these headphones, you can still hear all the environmental sound around you. The main disadvantage is that, of course, you can still hear all the environmental sound around you.

Clearly, this is not desirable for an audiophile. Obviously, you don’t get these sorts of headphones for their audio quality, and while I find them perfectly decent for listening to music or podcasts, the bass is not as good as that of typical headphones. That said, if I want to hear the sound better, I can pop a finger in my ear to block out external noise. Sometimes I use the headphones for telephone calls on my mobile when traveling on the tram, and it probably looks a little odd to the other travelers that I am wearing headphones and putting my finger to my ear, but it is very effective.

For the first week or so that I was wearing them, I had strange sensations in my head, very much like when I first get new frames for my glasses. They push on my head in a way that I’m not used to, and it takes a little bit to get used to. The fact that I can hear music playing in my “ears” and yet hear everything around me was also initially a bit surreal – a bit like I was in a movie with a soundtrack – but the strangeness here diminished very quickly and now it is just a delight.

While they are marketed to cyclists or people who need to be able to hear environmental sound for safety reasons (like, well, pedestrians crossing roads, so almost everyone I guess), it’s not the safety angle that really enthuses me. I am delighted by being able to fully participate in the world around me while concurrently having access to digital audio. When the announcer at a train station explains that a train is going to be cancelled, I still hear it. When a barista calls out that my coffee is ready, I still hear it. When my wife asks me a question while I’m doing something on the computer, I still hear it.

A couple of years ago, I yearned for this sort of experience:

For example, if I want to watch a TV program on my laptop, while my wife watches some video on the iPod on the couch next to me, we are going to interfere with each other, making it difficult for either of us to listen to our shows.

Being able to engage with people in my physical environment and yet access audio content without interfering with others is very liberating. I had hoped that highly directional speakers were the solution, but bone conduction headphones are a possible alternative.

Initially I had tried headphones that sat in only one ear, leaving the other one free. They were also very light and comfortable. One issue was that these were Bluetooth headphones and had trouble staying paired with several of the devices I had. However, and more importantly, I looked a bit like a real estate agent when I wore them, and was extremely self-conscious. Even trying to go overboard and wear them constantly for a month wasn’t enough to rid me of the sense of embarrassment I felt. Additionally, others would make a similar association and always seemed to assume that I must be on a phone call. If I did interact with others, I always had to explain first that I wasn’t on a call. What should’ve been a highly convenient solution turned out to be quite inconvenient.

The AfterShokz have none of these issues. I did try coupling them with a Bluetooth adaptor, but it had similar Bluetooth pairing issues. I see that AfterShokz have since released headphones with Bluetooth built in, but I haven’t tested these.

One potential new issue with the AfterShokz that I should discuss relates to the ability for others to hear what I’m listening to – this had been mentioned by some other online reviewers. While at higher volumes, others can hear sounds coming from the headphones (although this is not unique to AfterShokz’ headphones), at lower volumes it is actually very private. In any case, I’ve got a niggling sense of a higher risk of damage to my inner ear from listening to music at higher volumes: bone conduction headphones surely need to send sound-waves at higher energy levels than normal headphones because the signal probably attenuates more through bone than through air, and this is coupled with the fact that it needs to be operated at higher levels in order to be heard over background noise that would be otherwise blocked out by normal headphones. So, I try to set it at as low a volume as I can get away with, and block my ear with my finger if I need to hear better. In quiet environments, it’s not an issue.

Perhaps I am worrying about something that isn’t a problem, since I note that some medical professionals who specialise in hearing loss are advocating them. For that matter, the local group that specialises in vision loss is also promoting them. Although, I guess the long term effects of this technology are still unclear.

In any case, I find using this technology to be quite wonderful. I feel that I’ve finally found stereo headphones that aren’t anti-social. I hope if you have the chance to try it, you will also agree.

People prefer the personal

Following up on my last post, the reason that the big TV on the living room wall is going to become less relevant is that it’s a shared device. The way of the future is personal devices.

It’s sad but true – we prefer to have our own personal versions of things rather than share them with others. Maybe this is a particularly Western trait, but I suspect not. For example, despite the additional cost, most people prefer to travel in their own car rather than use a taxi or use public transport. Car sales are booming in China, showing it’s not just something that happens here.

When it comes to video devices like TVs, pretty much all actors in the economy are benefiting from the move from selling household video devices to selling individual video devices: the screen manufacturers, content providers, telcos, and most of all, the viewers. It’s part of a larger trend. Initially, all households in a city got pretty much the same video content at the same time, broadcast from TV stations. Then, with the uptake of VCRs, DVDs, PVRs, and so on, different households were able to get different video content at the same time. Now, with PCs and iPods, individuals within the households are getting different content at the same time.

We saw the same thing happen with audio devices. The Consumer Electronics Association in America published this year in their Digital America 2008 report that

U.S. factory-level dollar sales of portable audio products, consisting overwhelmingly of MP3/portable media players (PMPs), exceeded the combined sales of the home audio and aftermarket car audio industries for the first time in history in 2005, and again in 2006 and 2007, according to CEA statistics.

Another aspect to consider is that portable media players and PCs are increasingly becoming connected to the Internet, and support communication as well as media consumption. There will be growth in triggers to watch video content, received over those communication channels (such as friends sending you email, IM, or messages from Twitter or Facebook), and given a desire for immediate gratification, people will not want to wait for a shared device to become free, so will watch the video content on their personal devices, even if the quality of experience is less.

I don’t think shared video devices, like the expensive LCD or Plasma set that takes pride of place on the wall, will ever become completely redundant. They will simply evolve to niche uses where it is more convenient or appropriate to use a shared device, such as when hosting a video / games party with friends, or displaying a loop of video in the background.