Dream Job

I recently learned about a dream job. It’s too bad that I am not currently a post-doc.

Victoria University (Australia’s Sport University) in collaboration with Tennis Australia are seeking a skilled and passionate professional to progress the sport, through statistics and/or data science. The appointment will focus on the deployment of suitable statistical and/or machine learning techniques to derive game changing insights in the following areas: on-court strategy (using HawkEye and IBM data), modelling interactions between workload-injury-performance of athletes, and data mining of business intelligence (tournament, event, ranking, sensor and performance) data.

First Electronic Undergraduate Statistics Research Conference

There is a free e-conference next week (Friday, October 2) featuring presentations by USPROC student award winners and virtual poster presentations by undergraduate statistics students.

Here’s information about the meeting.

There are two special addresses on data science by Professor Ben Baumer and Dr. Rachel Schutt. Baumer’s talk is from 1:30 to 2, and Schutt’s talk is from 3:15 to 3:45.

We will be showing both talks in Math Science 240, but everyone is welcome to register (remember it is free) and participate.

Creating Data Files

My class just got some experience creating data files in CSV format. Here is some advice based on my observations from looking at these files.

  1. Don’t forget to start with a header line with the variable names separated with commas.
  2. It is helpful to use relatively short names for variables.
  3. Don’t put commas inside data values, even though we’re used to writing large numbers (say population sizes) with commas.
  4. Don’t forget to document your data — where you found the data and a clear description of the meaning of each variable.
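As a rough illustration of the advice above, here is a minimal R sketch that writes a small CSV file and reads it back. The data values, the variable names `state` and `pop`, and the file names are all invented for this example; they are not the data my class used.

```r
# Made-up toy data: short variable names, numbers stored without commas.
# Integer values (the L suffix) keep R from writing scientific notation.
states <- data.frame(
  state = c("A", "B", "C"),
  pop = c(1200000L, 350000L, 98000L),
  stringsAsFactors = FALSE
)

# Write the file with a header line and no row names or quote marks
csv_file <- tempfile(fileext = ".csv")
write.csv(states, csv_file, row.names = FALSE, quote = FALSE)

# Keep the documentation in a companion text file (one possible convention)
doc_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Source: invented values for illustration",
  "state: label for the state",
  "pop: population (count of people)"
), doc_file)

# Check the header line, then read the file back in
header <- readLines(csv_file)[1]
check <- read.csv(csv_file)
str(check)
```

Because the values contain no commas, `read.csv` should bring `pop` in as a numeric column with no further cleaning.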

Here are some file editing tips.

  1. Suppose you have population data with commas. How do you get rid of them? Use the gsub function.

         us_states$Population <- with(us_states, as.numeric(gsub(",", "", Population)))

  2. Generally, make sure the quantitative variables are numeric. Use the as.numeric function to convert character data to numeric.
  3. Actually, non-numeric data may be read into R as factors (this was the default behavior before R 4.0.0). To change a factor to numeric, first convert it to character (using as.character) and then convert it to numeric.
  4. Last, you will only need the first 50 rows of the data (the remaining lines had documentation information). You can do something like this to extract the first 50 rows.

         us_states <- us_states[1:50, ]
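Here is a small sketch of these cleaning steps on a toy data frame. The data frame `raw` and its values are made up for illustration, and the column is deliberately stored as a factor to show why the conversion to character matters.

```r
# Made-up raw data: populations typed with commas and stored as a factor
raw <- data.frame(
  State = c("A", "B", "C"),
  Population = factor(c("1,200,000", "350,000", "98,000"))
)

# Pitfall: as.numeric() applied directly to a factor returns the
# internal level codes, not the numbers you see printed
codes <- as.numeric(raw$Population)

# Convert to character first, strip the commas, then convert to numeric
raw$Population <- as.numeric(gsub(",", "", as.character(raw$Population)))

# Keep only the first two rows, in the same way as us_states[1:50, ]
first_two <- raw[1:2, ]
first_two
```

After the conversion, `raw$Population` holds ordinary numbers, so summaries and plots should treat it as a quantitative variable.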
    

Nate Silver and Skills for Data Science

Nate Silver is one of the stars in the data science world. I have known about Silver for a while, since his early fame was in the area of sabermetrics, the science of learning about baseball through data.

Here is a short video where Silver talks about his work in statistics and sports.

In searching for Nate Silver and data science, I found an interesting site that describes the “must-have skills you need to become a data scientist.”

Welcome to Computing with Data

We are starting a new Data Science program in the Department of Mathematics and Statistics at Bowling Green State University.  I’ll use this blog to talk about interesting issues related to Data Science and give links to interesting graphs and data science articles.

One topic in my current “Computing with Data” course is statistical graphics, and I’ll point out good and poor graphs in the news.

A Good Graph

The New York Times often provides interesting graphical displays. This article shows an interesting plot relating the body fat and BMI for 5000 adults. What I especially like is the use of guiding lines to highlight discrepancies between the two measures, which pick out the “healthy obese” and “skinny fat” groups.

Improving a Bad Graph

For the last couple of years, I’ve been working on a book that describes the use of R to learn about baseball. In today’s post, I show a famous baseball graph by Stephen Jay Gould and an improved graphical display that gives a clearer message about how the standard deviations of batting averages have been changing over time.