¡digital audio extravaganza!
I called Napster a "bad brand" a few days ago, but I've got to admit that there seems to be a certain magic to it. In the past couple of days I've had a lot of friends IMing and emailing me about the various ways of turning Napster's DRM'ed WMA files into other, unprotected formats.
Well, yes, you can do that. As I noted in the original post, you can use Winamp's out_lame plugin to encode to MP3. The Napster trick making the rounds uses the Output Stacker plugin (which has since been pulled from AOL-owned Winamp's website), but the principle is the same. I haven't tried it, and I imagine Output Stacker might let you transfer ID3 information so you don't have to retag your music, but technically speaking there is very little difference from the out_lame solution.
Thing is, this is nothing new.
You might have heard of the exciting-sounding "analog hole" -- this term refers to the unfortunate fact that our ears and eyes don't work digitally, so the media companies have to allow their product to be decoded at *some* point in order for it to be viewed or listened to. No matter how many fancy software hindrances they introduce, someone can always replace their headphone plug with a line running to a recorder, or point a videocamera at a video screen.
This Napster "trick" is similar. To play sounds your computer must convert audio files into what's called pulse code modulation format, or PCM. But let's back up a bit: how do you record audio digitally? Well, have a look at this page. In a nutshell: sound waves are made up of variations in air pressure. A microphone converts these slight pressure changes into an electrical signal. A digital recording is taken by measuring the strength of this signal very, very quickly -- for CD audio, it happens 44,100 times per second. Each measurement is called a sample. PCM data can be used to tell a speaker cone where it ought to be at each 44,100th of a second, recreating the original pressure waves -- voila, you've got sound. Speakers aren't digital, of course, but this is the general idea.
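If it helps to see the sampling idea as code, here is a minimal sketch in Python. It assumes a pure 440 Hz sine wave stands in for the microphone signal, and the function name `sample_sine` is made up for this illustration. It measures the wave 44,100 times per second and snaps each measurement to a 16-bit integer.

```python
import math

SAMPLE_RATE = 44_100                      # CD audio: measurements per second
BIT_DEPTH = 16                            # CD audio: bits per sample
MAX_AMPLITUDE = 2 ** (BIT_DEPTH - 1) - 1  # 32767, the largest signed 16-bit value


def sample_sine(frequency_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return a list of signed 16-bit PCM samples of a pure sine tone."""
    count = round(duration_s * sample_rate)
    samples = []
    for n in range(count):
        t = n / sample_rate                                   # time of the n-th measurement
        pressure = math.sin(2 * math.pi * frequency_hz * t)   # between -1.0 and 1.0
        samples.append(round(pressure * MAX_AMPLITUDE))       # snap to an integer level
    return samples


pcm = sample_sine(440.0, 0.01)  # 10 ms of a 440 Hz tone
print(len(pcm), pcm[:5])
```

The resulting list of integers is all that PCM is: one number per channel per tick of the sample clock.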
CD Audio is just PCM data with a particular wrapper of information put around it to help devices recognize and play it -- the same can be said for the WAV format and Apple's lossless AIFF standard, and you can convert from one to another without modifying the PCM data at all. So for the purposes of this discussion, CD audio is the holy grail. Other standards like SACD and DVD-Audio (not the same as the audio on your DVD movies) sample at faster rates than CD's 44.1 kHz. Some also sample at a higher resolution -- CD Audio is 16-bit, meaning a given sample represents one of 2^16 (65,536) possible air pressure levels. Increasing either the sample rate or resolution allows for a more accurate reproduction of the original sound wave, but 44.1 kHz/16-bit is pretty good, and it's the de facto standard for consumer digital audio.
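The "wrapper around PCM" point can be checked with Python's standard `wave` module. This is a rough sketch rather than anything a real ripper does: the four sample values are invented, and the file is written to memory instead of disk. It also works out the 16-bit level count and the raw CD data rate.

```python
import io
import struct
import wave

# Four made-up 16-bit samples for a single (mono) channel.
samples = [0, 1000, -1000, 32767]
raw_pcm = struct.pack("<4h", *samples)  # little-endian signed 16-bit

# Wrap the raw PCM in a WAV header, writing to memory instead of disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)       # mono
    wav_out.setsampwidth(2)       # 2 bytes = 16 bits per sample
    wav_out.setframerate(44_100)  # CD sample rate
    wav_out.writeframes(raw_pcm)

# Unwrap it again and compare the payload with what went in.
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    recovered = wav_in.readframes(wav_in.getnframes())

levels = 2 ** 16                      # distinct values a 16-bit sample can take
cd_bits_per_second = 44_100 * 16 * 2  # sample rate x bit depth x two stereo channels
print(recovered == raw_pcm, levels, cd_bits_per_second)
```

The header only describes the data (channels, sample width, rate); the samples themselves pass through untouched, which is why moving between these containers costs nothing in quality.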
So what do so-called "lossy" encoding formats like MP3, Ogg Vorbis, AAC and WMA do? Here's where cognitive science comes into the picture. It turns out that what we perceive is only loosely connected to the sounds surrounding us. In the same way that optical illusions reveal the weird pre-processing that our brains perform before sensation becomes conscious, there are audio illusions that can show us just how imprecise our hearing is. The simplest example is probably relative loudness -- sit in a silent room for a while and even a slight noise will seem very loud. A slightly weirder example is the Shepard tone, or barber-pole tone: through some sneaky math, a series of notes can be generated that seems to go up or down in pitch perpetually. Click here or here for an example.
Audio codec engineers can take advantage of our human frailty in various ways. For example, we're worse at detecting the location of low-frequency sounds than high-frequency ones, so they can throw away some low-frequency stereo information. Loud sounds at a particular frequency tend to mask the presence of quieter sounds at the same or nearby frequencies, so the quieter information can be discarded as well. A lot of this is beyond me, but this is the general idea behind perceptual audio formats like MP3: discard stuff that we couldn't perceive anyway.
You can see a visual representation of this below. These are audio spectrograms of the first half of the first chorus of Jump, Little Children's "My Guitar". The y-axis is frequency; the x-axis is time; and I'm only showing the left channel of the stereo signal. A darker dot means a stronger relative intensity at a given frequency. The left spectrogram is the CD audio; the right is the same audio after running it through the LAME MP3 codec (and chopping off the extra silence added at the start by the conversion process).
[Spectrograms: left, lossless CD audio; right, 128 kbps MP3]
They look pretty similar, right? Well, they are. But what happens when we subtract the compressed spectrogram from the uncompressed one? We'll see everything that the MP3 compression process threw out. The changes are often slight, so I've used Photoshop's Auto Contrast function ("the poor man's normalization") to make it more visible to our feeble human eyes:
[Spectrogram: information discarded by the MP3 conversion process]
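The subtraction itself is simple enough to sketch in Python. To be clear about the assumptions: this is not how I made the images, and there is no MP3 encoder involved. A loud tone plus a quiet tone stands in for the original audio, the "lossy" copy simply drops the quiet tone, and a naive DFT over one 64-sample frame stands in for a spectrogram column. All names and numbers are made up for the illustration.

```python
import cmath
import math

N = 64  # samples in one analysis frame (illustrative)


def magnitude_spectrum(samples):
    """Naive DFT; returns the magnitude of bins 0 through len(samples) / 2."""
    size = len(samples)
    spectrum = []
    for k in range(size // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * n / size)
                    for n, x in enumerate(samples))
        spectrum.append(abs(total))
    return spectrum


loud = [math.sin(2 * math.pi * 4 * n / N) for n in range(N)]          # strong tone, bin 4
quiet = [0.1 * math.sin(2 * math.pi * 9 * n / N) for n in range(N)]   # weak tone, bin 9
original = [a + b for a, b in zip(loud, quiet)]
lossy = loud  # pretend the encoder threw the quiet tone away

# Subtract the "compressed" spectrum from the "uncompressed" one.
difference = [a - b for a, b in zip(magnitude_spectrum(original),
                                    magnitude_spectrum(lossy))]
loudest_missing_bin = max(range(len(difference)), key=lambda k: difference[k])
print(loudest_missing_bin)
```

Everything that survives the subtraction is material the pretend encoder discarded; here it shows up at the quiet tone's frequency bin and nowhere else.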
Those bits of sound aren't all that important, but what is important is to realize that you can never get them back. Encoding is a one-way process; once discarded, that information is lost forever. Even if you know which MP3 codec you used and what bitrate, you can't regenerate it. That's why this type of encoding is called "lossy".
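Here is a toy Python illustration of why it's one-way. It is a crude stand-in, not a perceptual codec: the hypothetical `crush` function just zeroes the low-order bits of each 16-bit sample, and the sample values are invented. The point is only that different inputs collapse onto the same output, so nothing can tell them apart afterwards.

```python
def crush(sample, keep_bits=8, total_bits=16):
    """Crude stand-in for lossy coding: zero out the low-order bits."""
    drop = total_bits - keep_bits
    return (sample >> drop) << drop


original = [12345, 12400, -300, 7]
encoded = [crush(s) for s in original]
print(encoded)

# Two different originals now share one encoded value, so no decoder
# can know which of them it started from.
print(encoded[0] == encoded[1])
```

Running `crush` a second time changes nothing further, but it doesn't bring anything back either.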
This is important because different codecs operate in slightly different ways. They throw away different parts of the signal, so every time you re-encode from one to another, the result gets a little worse. The difference is often subtle, but it's there. Here: have a listen to this. It's the intro to the Pixies' "Where Is My Mind" -- first, the CD audio version. Second, the same clip after being put through the audio compression wringer -- six different transitions between MP3, Ogg Vorbis and WMA (plus one more encoding to very high quality MP3 to save our bandwidth -- that transition applies to the entire clip and should be negligible). Notice how the second version is breathier? How it's harder to figure out where the "stop!" is coming from?
That's the problem with reencoding an already lossy file -- and that's what's happening with the Napster solution. Napster's audio comes in WMA format, and the hack reencodes it to MP3. The results will be a lot better than the example above, but you'll still inevitably lose some quality. Some songs will suffer more than others.
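A toy Python model of that, with loud caveats: the two "codecs" are just rounding to multiples of 300 and 700, which are hypothetical stand-ins for WMA and MP3, and the four sample values are invented. It compares encoding straight from the original against encoding a copy that has already been through the other codec.

```python
def quantize(sample, step):
    """Toy 'codec': round a sample to the nearest multiple of step."""
    return ((sample + step // 2) // step) * step


def total_error(reference, candidate):
    """Sum of absolute differences between two sample lists."""
    return sum(abs(r - c) for r, c in zip(reference, candidate))


original = [1000, 2500, -4100, 9999]
STEP_A, STEP_B = 300, 700  # hypothetical stand-ins for two different codecs

direct = [quantize(s, STEP_B) for s in original]              # original -> B
first_generation = [quantize(s, STEP_A) for s in original]    # original -> A
transcoded = [quantize(s, STEP_B) for s in first_generation]  # A -> B

print(total_error(original, direct), total_error(original, transcoded))
```

With these made-up numbers the transcoded copy ends up further from the original than the direct encode does. Real codecs are far smarter than rounding, so the gap in practice is smaller and varies from song to song, but the direction is the same.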
So that's the reason why I'm not super-excited about this new exploit. Barring some exotic new oppression from Microsoft, you will always be able to do this with the digital audio vendor of your choice -- or hell, your favorite internet radio station. On *nix systems that expose an OSS-style audio device, all you have to do is record the data going through /dev/dsp. The reason the Hymn Project (for Apple's iTunes Music Store) is different -- and so much cooler -- is that it removes the copy protection without touching the audio at all, so there's no quality loss.
I'm sure Napster's tenuous allies in the music industry are going to be really pissed off about all of this, but they'll eventually have to realize that all digital music systems will inevitably suffer from this vulnerability. It'll be nice to see their license-model wet dream of universal digital serfdom blow up in their faces, but nothing really amazing has happened: just another few executives who should've listened to their nerds a bit more. Move along, nothing to see here.
UPDATE: Above, I incorrectly implied that very high quality audio formats like SACD and DVD-Audio sample at a higher rate first, and that only some also improve their samples' resolution from CD Audio's 16-bit standard. That's backward -- the discrete sampling rate necessary to reproduce a signal of a given bandwidth is a known quantity determined by the Nyquist-Shannon Theorem: you need to sample at a little more than twice the highest frequency you want to capture. We know the bandwidth of human hearing -- it runs from around 20 Hz to 20 kHz, and CD's 44.1 kHz already covers that. So there isn't much benefit to a higher sampling rate. Better sample resolution does carry a payoff, however. So getting past the 16-bit limit is the first thing these formats do; many also jack up the sampling rate, but this isn't the first thing audio engineers would pursue.
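The arithmetic behind that is short enough to sketch in Python. The function names are mine, and the dynamic range figure is the textbook theoretical value for an ideal quantizer (about 6 dB per bit), not a measurement of any real player.

```python
import math


def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def dynamic_range_db(bits):
    """Theoretical ratio between the loudest and quietest levels, in decibels."""
    return 20 * math.log10(2 ** bits)


print(nyquist_limit_hz(44_100))  # compare with the roughly 20 kHz ceiling of hearing
print(round(dynamic_range_db(16), 1), round(dynamic_range_db(24), 1))
```

CD's sample rate already clears the top of human hearing, while going from 16 to 24 bits adds roughly 48 dB of theoretical dynamic range, which is why resolution is the more rewarding knob to turn.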