Even a cursory introduction to “Big Data” will likely mention MapReduce. It provides the framework for Hadoop, which in turn has been used (with some modifications) by many companies such as Facebook, Google, Twitter, and Yahoo. MapReduce is a programming model that abstracts away the complexities of parallel computing and lets users tap into that power without having to get into the low-level architecture. I covered this in my Data Science class and was surprised at how easy it was to pick up. So I copied what they did in Python and created a MapReduce emulator in R.
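To give a feel for the model, here is a minimal single-process sketch in Python. It is not the emulator from the class or my R version; the names `map_reduce`, `word_mapper`, and `count_reducer` are made up for illustration, and the word-count task is just the usual toy example. The idea is that you only write a mapper and a reducer, and the framework handles the grouping in between.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one MapReduce pass over `records` in a single process."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect every value under its key.
            groups[key].append(value)
    # Reduce phase: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    """Sum the counts emitted for a single word."""
    return sum(values)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(lines, word_mapper, count_reducer)
print(counts)
```

In a real cluster the map and reduce calls would run on many machines at once, but the code you write keeps the same two-function shape, which is why the model is so easy to pick up.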
I just turned in Assignment 2 for my Data Analysis course, so I can now share it on here (Unfortunately, I’ve been warned that people have been plagiarizing, so I’ve removed my files to prevent cheating… which ironically I did not list as a challenge for a MOOC below, but should be added). In this assignment, we were given sensor data from the Samsung Galaxy SII recorded while users performed specific activities. The goal was to develop a model on some training data to predict what activity the test subjects are performing. As usual, I wish I had more time to spend on it because I always feel like there is more I can add. Using random forests, I got a misclassification error rate of about 5% on the test subjects. Not too shabby, but at some point I would like to compare it to other models such as SVMs or neural networks. Continue reading
47 years. That is the life expectancy of Sierra Leone, the lowest of all countries. Compare that with the highest life expectancy of 83 years in San Marino. It’s easy to brush these numbers aside as another dull statistic, but that is a 36-year difference. I have yet to even experience that length of time! Imagine being told that your life will end 36 years earlier than expected. It’s not very easy for most of us to comprehend. Continue reading
I thought I’d share my first assignment from the Coursera class, Data Analysis. It’s a very simple analysis, but it did get me back into the feel of writing research papers (as opposed to the terse sentence fragments I email at work). I’m posting it here to demonstrate what level of work you can expect from such a course… and because charts in R blow Excel out of the water. Continue reading
A week or two ago, Nikhil Kumar showed me this awesome real-time cartogram of Twitter feeds. Since it was something I had no idea how to do, I thought it was a good time to learn. I decided to play around with geographical data visualizations and see what kind of graphics could be easily produced. Continue reading
What can you see in the image above? A seemingly innocuous picture of Tower Bridge and the London skyline? But there are also hidden instructions for a devious, top-secret scheme. And they can’t be detected by just looking at the picture. Continue reading
After being inspired by this link, I planned to implement a similar genetic algorithm. The algorithm uses a sequence of mutating polygons to recreate an image. Since I was taking a Coursera refresher in R, I decided to give it a shot in that language. Continue reading
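The mutate-and-keep-the-better-one loop can be sketched in a few lines of Python. This is a heavily simplified toy, not the linked implementation or my R code: it assumes axis-aligned rectangles instead of general polygons, a tiny 8x8 grayscale grid instead of a real image, and a single parent with one mutated child per generation rather than a full population. All names (`render`, `mutate`, `evolve`, and so on) are hypothetical.

```python
import random

WIDTH, HEIGHT = 8, 8  # tiny grayscale "image" for illustration


def render(rects):
    """Paint rectangles onto a blank canvas; later ones cover earlier ones."""
    canvas = [[0.0] * WIDTH for _ in range(HEIGHT)]
    for x0, y0, x1, y1, shade in rects:
        for y in range(y0, y1):
            for x in range(x0, x1):
                canvas[y][x] = shade
    return canvas


def error(canvas, target):
    """Sum of squared pixel differences between canvas and target."""
    return sum((canvas[y][x] - target[y][x]) ** 2
               for y in range(HEIGHT) for x in range(WIDTH))


def random_rect(rng):
    """Pick a non-empty rectangle with a random gray shade."""
    x0, x1 = sorted(rng.sample(range(WIDTH + 1), 2))
    y0, y1 = sorted(rng.sample(range(HEIGHT + 1), 2))
    return (x0, y0, x1, y1, rng.random())


def mutate(rects, rng):
    """Copy the parent and replace one rectangle with a fresh random one."""
    child = list(rects)
    child[rng.randrange(len(child))] = random_rect(rng)
    return child


def evolve(target, n_rects=6, generations=2000, seed=0):
    """Keep a mutated child only when it matches the target at least as well."""
    rng = random.Random(seed)
    best = [random_rect(rng) for _ in range(n_rects)]
    start_err = best_err = error(render(best), target)
    for _ in range(generations):
        child = mutate(best, rng)
        child_err = error(render(child), target)
        if child_err <= best_err:
            best, best_err = child, child_err
    return best, start_err, best_err


# Target: a white square centered on a black background.
target = [[1.0 if 2 <= x < 6 and 2 <= y < 6 else 0.0 for x in range(WIDTH)]
          for y in range(HEIGHT)]
best, start_err, final_err = evolve(target)
print(start_err, final_err)
```

Because a child is only accepted when its error is no worse than the parent's, the error can never go up from one generation to the next; a full genetic algorithm would add a population and crossover on top of this.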