Data Analysis
Spiders, Crawlers, Customized Analysis, Statistics

If there is one thing I enjoy, it's data. Since "data" is such a broad term, I'll list specific examples that come to mind when I use that term.

  • Access log data
  • eBay listings
  • Sports competition results
  • Data feeds
  • Consumer purchase data
  • Craigslist listings
  • Form data

Access log data

When you want to learn how people are using your website, the first place to start is with access log data. Your website is hosted on a web server - which is just a computer, either dedicated to your site or shared with other website owners - and on that web server, there is software running 24x7 waiting to accept page requests and to fulfill them. That's web server software, and one of the most popular (and fast and configurable) is Apache. So, every time a page request arrives for your site, Apache writes information to an access log file.

Here is what a typical set of Apache access log lines looks like in the process of handling the request for this page:

65.39.107.4 - - [05/Jun/2013:18:27:05 -0400] "GET /service/data-analysis.html HTTP/1.1" 200 3251 "http://davidalyea.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:06 -0400] "GET /css/base.001.css HTTP/1.1" 200 2129 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:06 -0400] "GET /css/font-awesome.css HTTP/1.1" 200 2912 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:06 -0400] "GET /css/bootstrap-responsive.min.css HTTP/1.1" 200 3998 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:05 -0400] "GET /css/bootstrap.min.css HTTP/1.1" 200 17075 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:06 -0400] "GET /js/bootstrap-popover.js HTTP/1.1" 200 1164 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:06 -0400] "GET /js/bootstrap-tooltip.js HTTP/1.1" 200 2386 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:06 -0400] "GET /js/base.002.js HTTP/1.1" 200 661 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:06 -0400] "GET /images/banner/round/facebook.png HTTP/1.1" 200 10010 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:07 -0400] "GET /images/banner/round/linkedin.png HTTP/1.1" 200 9503 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:07 -0400] "GET /images/banner/round/twitter.png HTTP/1.1" 200 11927 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:08 -0400] "GET /images/banner/round/youtube.png HTTP/1.1" 200 12583 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:08 -0400] "GET /images/b/img64x64.png HTTP/1.1" 200 9645 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:09 -0400] "GET /images/banner/services.png HTTP/1.1" 200 102938 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"
65.39.107.4 - - [05/Jun/2013:18:27:09 -0400] "GET /images/data-analysis-1-590.jpg HTTP/1.1" 200 48094 "http://davidalyea.com/service/data-analysis.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:22.0) Gecko/20100101 Firefox/22.0"

That's 15 requests to the Apache server to get all the parts to make up one page. It starts with the request for the page itself, which returns HTML. Then the browser, which we can see in this case is Firefox, parses through the HTML and determines what additional files it needs. That's where the other 14 file requests are coming from, starting with the CSS files, then the JS files, and then the image files. We can see by the timestamps that the whole process, from start to finish, took 4 seconds, which isn't particularly fast. However, the browser will now have cached most if not all of these files, likely for 1 year based on HTTP headers, so subsequent pages loaded from my site will render much faster.
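
That timestamp arithmetic is easy to automate. Here's a minimal Python sketch (my own illustration, not production code) that pulls the bracketed timestamps out of a group of log lines and computes the span; fed the 15 lines above, it returns 4.0.

import re
from datetime import datetime

# Pull the bracketed [05/Jun/2013:18:27:05 -0400] timestamp from each line.
TS_RE = re.compile(r'\[([^\]]+)\]')

def request_span_seconds(log_lines):
    """Elapsed seconds between the first and last request in the group."""
    times = [datetime.strptime(m.group(1), '%d/%b/%Y:%H:%M:%S %z')
             for m in map(TS_RE.search, log_lines) if m]
    return (max(times) - min(times)).total_seconds()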

Now, what can we learn from this data? Well, what really matters, in terms of analyzing site usage, is the very first line: this person wanted to see /service/data-analysis.html. We know this "person" as 65.39.107.4 - that's the IP address. Though not 100% accurate, it is reasonable to associate an IP address with a unique individual. So in terms of site usage, we now have a unique visitor, and we can follow their access path through the site, starting with this line and then looking for later lines with the same IP address. We can delta the timestamps to see how long the user stays on each page. We can find the last page the user looked at before their IP address disappears from the log. All of these things are exactly what Google Analytics does! Count visitors, count page views, measure time on page, and identify the exit page. If a user arrives, looks at one page, then leaves, that defines a bounce. And so on - with just these page lines from the Apache access logs, there is a lot of analysis to be done, especially when you are poring over 200,000 lines of log data.
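
To make that concrete, here is a simplified Python sketch of how those page lines might be grouped into visitors and time-on-page figures. Real log analysis also has to contend with bots, proxies, and shared IP addresses, which this ignores.

import re
from collections import defaultdict
from datetime import datetime

# Combined log format: IP ident user [timestamp] "METHOD path HTTP/x" ...
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)')

def page_views_by_visitor(log_lines):
    """Group HTML page requests by IP address (a rough stand-in for a
    unique visitor), each as a time-ordered list of (timestamp, path)."""
    visitors = defaultdict(list)
    for line in log_lines:
        m = LINE_RE.match(line)
        # Keeping only .html requests is a simplification; asset requests
        # (CSS, JS, images) would otherwise inflate the page-view counts.
        if m and m.group(3).endswith('.html'):
            when = datetime.strptime(m.group(2), '%d/%b/%Y:%H:%M:%S %z')
            visitors[m.group(1)].append((when, m.group(3)))
    for hits in visitors.values():
        hits.sort()
    return visitors

def time_on_page(visitors):
    """Delta consecutive timestamps per visitor. The last page a visitor
    requested (the exit page) has no following request to delta against."""
    for ip, hits in visitors.items():
        for (t1, page), (t2, _) in zip(hits, hits[1:]):
            yield ip, page, (t2 - t1).total_seconds()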

Other things you can learn from Apache access log files include, for your site, what percentage of visitors are on desktops, tablets, or smartphones, what operating systems they are running, and what browsers they are using. This matters to developers, who need to make sure a website renders properly for almost everyone (say, 99% of all users). There are times when people access a site using an old browser for which it's hard to justify the time and effort to test and fix website code. We can also begin to calculate page speed metrics by looking at the time deltas and finding inefficiencies. You can also chart hour-by-hour usage to determine the peak usage hours on your site. And yet one more use of access log data: you can geolocate your users by their IP addresses and begin to learn where in the world your users are arriving from.
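
A rough sketch of that kind of tally in Python, with the caveat that the device buckets below are crude substring matches of my own choosing, not a real user-agent parser:

import re
from collections import Counter

UA_RE = re.compile(r'"([^"]*)"$')               # user agent is the last quoted field
HOUR_RE = re.compile(r':(\d{2}):\d{2}:\d{2} ')  # hour inside the timestamp

def device_and_hour_counts(log_lines):
    """Tally rough device buckets and requests per hour of the day."""
    devices, hours = Counter(), Counter()
    for line in log_lines:
        ua = UA_RE.search(line)
        if ua:
            agent = ua.group(1)
            # Crude buckets; real user-agent parsing is far messier than this
            # (Android tablets, for one, would be misfiled by this ordering).
            if 'iPad' in agent or 'Tablet' in agent:
                devices['tablet'] += 1
            elif 'Mobile' in agent or 'iPhone' in agent:
                devices['smartphone'] += 1
            else:
                devices['desktop'] += 1
        hour = HOUR_RE.search(line)
        if hour:
            hours[int(hour.group(1))] += 1
    return devices, hours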

Suffice it to say, Apache access log data is an extremely powerful data set. It is the raw material behind the same metrics Google Analytics reports, and if you look at your website's backend reporting, it is almost certainly derived from your web server's access logs. I've been using access log data in what I'd call very creative ways to make the most of my users' experiences on my websites.

eBay Listings Data

I use the eBay Developers API to manage listings on my sporting goods and other websites. Since this data can be aggregated day by day, it is possible to do time series analysis of eBay product listings by keyword, by category, by price, etc. In the early days of eBay, I wanted to get the domain debay.com, but I missed it by just a few months! My idea was to "deconstruct eBay" by analyzing the product listing data and giving sellers and buyers the information they'd need to make the best decisions when listing or buying. I didn't get the debay.com domain, but I went ahead and built an interface centered on aggregated eBay data. I started with a category I knew nothing about: jewelry. I quickly knew exactly which keywords were tracking to higher sales figures, getting the most bids, etc. Conversely, I also knew which types of jewelry listings were duds with zero bids. I continue to use eBay data now on QBike.com, where I list road bikes and mountain bikes for sale as well as all types of bike parts.
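
As an illustration of the aggregation step (the field names here are hypothetical stand-ins for whatever your eBay API client returns; the actual fetch is done through the eBay Developers API), a daily keyword roll-up might look like this in Python:

from collections import defaultdict
from statistics import mean

def daily_keyword_stats(listings):
    """Aggregate one day's pull of listings into per-(date, keyword) stats.
    The 'date', 'title', 'price', and 'bids' fields are assumed names."""
    stats = defaultdict(lambda: {'prices': [], 'bids': 0})
    for item in listings:
        for word in set(item['title'].lower().split()):
            bucket = stats[(item['date'], word)]
            bucket['prices'].append(item['price'])
            bucket['bids'] += item['bids']
    return {key: {'listings': len(b['prices']),
                  'avg_price': round(mean(b['prices']), 2),
                  'total_bids': b['bids']}
            for key, b in stats.items()}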

Sports Competition Results

I always found it fascinating that in the world of triathlon, no two courses are alike, even if the stated distances are the same. Furthermore, the race conditions can vary from year to year. And lastly, though it's easy to look at individual times for the swim, bike and run (and T1 and T2), and ranks can be set for each of the legs, there is no metric to describe consistency. So I've always taken an interest in analyzing sports competition, particularly for triathlon.

I created an interface where anyone in the world could upload a plain-text result file from a triathlon, and I could present statistics and comparative results for each athlete. I developed a consistency metric, which was a bit trickier than I first thought it would be. I imagined that the "winner" of the most-consistent-athlete title would be a middle-of-the-pack athlete who, for example, swam in 289th place, biked in 270th place, and ran in 281st place. From my own experience, I knew that I often posted very level times, so even I might be a candidate for highly consistent athlete. (Consistent here meaning from leg to leg in a single race, not over the course of many races, which is a different measure and interesting in its own right.) It turns out the most consistent triathletes tended to emerge from the top 10 finishers of the race! Frequently the race winner would be 3rd on the swim, 1st on the bike, and 2nd on the run, for instance. That's tough to beat. However, my thought on that matter was: competition. Who within reason (meaning within a statistically determined time gap) could compete with that top triathlete and change their ordinal ranking? That was the main adjustment to my model: accounting for the fact that a mid-of-pack (MOP) triathlete has a whole lot more competition before and after him to alter his rankings on each leg. So I eventually finished the data model, and it's always been fun to run a race's results through it to uncover who was most consistent by my measure.
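
To give a flavor of the approach, here is a simplified Python sketch. The raw score is the spread of an athlete's three leg ranks; the density weight is a hypothetical stand-in for my actual competition adjustment, which is more involved.

from statistics import pstdev

LEGS = ('swim', 'bike', 'run')

def leg_ranks(results, leg):
    """Rank every athlete 1..N on one leg by its split time."""
    order = sorted(results, key=lambda r: r[leg])
    return {r['name']: place for place, r in enumerate(order, start=1)}

def consistency_scores(results):
    """Lower score = more consistent across legs within one race.
    The density weight is an illustrative assumption, not my real model."""
    ranks = {leg: leg_ranks(results, leg) for leg in LEGS}
    n = len(results)
    scores = {}
    for r in results:
        rs = [ranks[leg][r['name']] for leg in LEGS]
        spread = pstdev(rs)            # raw consistency: spread of the 3 ranks
        mid = sum(rs) / len(rs)
        # Mid-pack athletes have far more competitors able to shuffle their
        # ordinal ranks, so discount their spread accordingly.
        density = 1 - abs(mid - n / 2) / (n / 2)   # 1.0 mid-pack, ~0 at the ends
        scores[r['name']] = spread * (1 - 0.5 * density)
    return sorted(scores.items(), key=lambda kv: kv[1])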

I competed in a swim series in Miami in 2012, and the rankings were intended to cover the whole series. I quickly conjured up a scoring system for the 3-event series and emailed the race director to suggest it. I ran simulated results data through it to prove to myself that my scoring model worked, and it did; it was then used to score the swimmers for that series.
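
For illustration, a series scoring system of this shape can be only a few lines of Python. The points-per-place table below is an assumption for the sketch, not the table actually used in Miami.

def series_standings(event_results, points=(10, 8, 6, 5, 4, 3, 2, 1)):
    """event_results: one finish-order list of swimmer names per event.
    Swimmers outside the points table score zero for that event."""
    totals = {}
    for finish_order in event_results:
        for place, name in enumerate(finish_order):
            pts = points[place] if place < len(points) else 0
            totals[name] = totals.get(name, 0) + pts
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))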

Now that I've taken an interest in MMA and the UFC, I am thinking of different ways to analyze fighter results. I find it intriguing when the UFC announcers on TV describe some fact about a fighter or summarize a fighter's numbers for a fight; often I get ideas on different ways to look at fights and pencil in notes. So this is something I may jump into some day - not the fighting part, the data analysis!
