(An ongoing collaboration with Nicholas Thayer) Principal component analysis is a powerful technique for recognizing the underlying features of large, high-dimensional datasets. The general idea is to find just a few key components that can be judiciously combined to accurately reproduce the majority of the observed data. In practice, the principal components are generated by computing the eigenvectors of a data covariance matrix.
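To make the idea concrete, here is a minimal sketch of principal component analysis via the eigenvectors of a covariance matrix. It is written in Python with NumPy rather than the MATLAB we used, the function names (principal_components, reconstruct) are made up for illustration, and the data is synthetic: 500 samples in 10 dimensions that really only vary along 2 directions, plus a little noise.

```python
import numpy as np

def principal_components(data, k):
    """Return the mean, the top-k principal components, and their variances.

    data has shape (n_samples, n_features); components are returned as rows.
    """
    mean = data.mean(axis=0)
    centered = data - mean
    # Sample covariance matrix of the features, shape (n_features, n_features)
    cov = centered.T @ centered / (data.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest eigenvalues
    return mean, eigvecs[:, order].T, eigvals[order]

def reconstruct(data, mean, components):
    """Approximate each sample as the mean plus a weighted sum of components."""
    weights = (data - mean) @ components.T
    return mean + weights @ components

# Synthetic stand-in data (an assumption for this sketch, not real measurements)
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
data = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

mean, components, variances = principal_components(data, k=2)
approx = reconstruct(data, mean, components)
relative_error = np.linalg.norm(data - approx) / np.linalg.norm(data - mean)
print(f"relative reconstruction error with 2 components: {relative_error:.4f}")
```

With only two components kept, the reconstruction error should be small compared to the overall spread of the data, which is the sense in which a few components "reproduce the majority" of it.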
A popular application of principal component analysis is facial recognition software. From a large training dataset of photos of people's faces, the principal components, known as eigenfaces, are generated. The eigenfaces can then be used to classify new photos of unknown people, or to synthesize faces with specific qualities and characteristics.
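A rough sketch of how eigenfaces might be used for both tasks is given here, again in Python with NumPy. Several things are assumptions made for illustration: the helper names are hypothetical, classification is done by nearest neighbour in eigenface space (one common choice, not necessarily what any particular software does), and small random arrays stand in for real face photos.

```python
import numpy as np

def fit_eigenfaces(images, k):
    """images has shape (n_images, height, width). Returns the mean face and
    the top-k eigenfaces, each flattened to a vector of height * width pixels."""
    flat = images.reshape(len(images), -1).astype(float)
    mean_face = flat.mean(axis=0)
    # The rows of vt are eigenvectors of the pixel covariance matrix,
    # ordered from largest to smallest variance
    _, _, vt = np.linalg.svd(flat - mean_face, full_matrices=False)
    return mean_face, vt[:k]

def project(images, mean_face, eigenfaces):
    """Describe each image by its weights on the eigenfaces."""
    flat = images.reshape(len(images), -1).astype(float)
    return (flat - mean_face) @ eigenfaces.T

def classify(new_images, train_weights, train_labels, mean_face, eigenfaces):
    """Assign each new image the label of its nearest training image,
    with distance measured in eigenface space (an assumed choice)."""
    w = project(new_images, mean_face, eigenfaces)
    dists = np.linalg.norm(w[:, None, :] - train_weights[None, :, :], axis=2)
    return train_labels[np.argmin(dists, axis=1)]

def synthesize(weights, mean_face, eigenfaces, shape):
    """Generate an image from a chosen set of eigenface weights."""
    return (mean_face + weights @ eigenfaces).reshape(shape)

# Stand-in data: 3 random 16 x 16 "identities", 10 noisy copies of each
rng = np.random.default_rng(1)
templates = rng.random((3, 16, 16))
train_labels = np.repeat(np.arange(3), 10)
train_images = templates[train_labels] + 0.05 * rng.normal(size=(30, 16, 16))

mean_face, eigenfaces = fit_eigenfaces(train_images, k=5)
train_weights = project(train_images, mean_face, eigenfaces)

test_labels = np.array([0, 1, 2])
test_images = templates[test_labels] + 0.05 * rng.normal(size=(3, 16, 16))
predicted = classify(test_images, train_weights, train_labels, mean_face, eigenfaces)
print("predicted labels:", predicted)

# A new face built from the mean face plus the first training image's weights
generated = synthesize(train_weights[0], mean_face, eigenfaces, (16, 16))
```

Each photo is reduced to a handful of weights, so comparing or generating faces becomes arithmetic on short vectors instead of on full images.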
A difficulty with the eigenface approach is that the training dataset must be composed of carefully cropped photos taken under the same lighting conditions and from the same camera angle. This makes it difficult to use the majority of the photos available on Facebook, MySpace, Flickr, etc. for this type of analysis. To overcome this challenge, we decided to use mugshot photos, as these are taken under (somewhat) standardized conditions for each subject. Furthermore, a number of police departments make arrest photos available online, providing an excellent trove of data.
To assemble a dataset, we used Google's automated search API to download a couple thousand faces, using the search keyword "mugshots". Unfortunately, Google seems to cut you off after relatively few search results unless you are willing to pay for a premium version of their API. All images were resized to 128 × 128 pixels, and the eigenfaces were computed in MATLAB. Below are some of our favorite results. Some look like faces, while others don't look like anything at all. It is interesting to think about the raw material that was used to make them.