Aaron C. Smith Ph. D. Statistics and Data Consulting

Data visualization portfolio

This page shows some visualizations and the insights drawn from them.

Bar plots

An interesting insight is that, within this data set, midsize cars and SUVs have a high proportion of automatic transmissions.

Box plots

Interesting insights from this data set:

  • In the ordering of classes, compact has a slightly better average mpg than subcompact.

  • Subcompact mpg is skewed right: a few subcompact vehicles have exceptionally high mpg.

  • Overall, midsize, subcompact, and compact have similar mpg.
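
The right skew noted above can also be checked numerically. The following is a minimal sketch, not the code behind the plots: the function name box_plot_summary and the toy mpg values are invented for illustration. It computes the quartiles a box plot draws, flags points beyond the usual 1.5 × IQR fences, and compares the mean with the median (a mean above the median is a common sign of right skew).

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a box plot draws, plus points beyond the Tukey fences."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(ordered),
        "low_outliers": [v for v in ordered if v < lower_fence],
        "high_outliers": [v for v in ordered if v > upper_fence],
    }


# Hypothetical mpg values, invented for illustration only (not the plotted data).
toy_mpg = [24, 25, 26, 26, 27, 28, 29, 30, 41, 44]
summary = box_plot_summary(toy_mpg)
print(summary)
```

In this toy sample the two largest values fall above the upper fence and pull the mean above the median, which is the pattern a right-skewed box plot shows.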

Violin plots

Violin plots give a more descriptive presentation of the data than box plots. This visualization gives the same insights as the box plots above.

Scatterplot with model lines

I conclude that each drive train should have its own model if we want to model mpg by displacement. I feel that rear-wheel drive should not be modeled in this manner.

Scatterplot with contour lines

The contours show that there are two distinct peaks in the distribution of points. I would consider data segmentation.

Scatterplot with density plot

The density plot shows that there are two distinct regions of points. I would consider data segmentation.

Bubble plot

Bubble plots are a good way to present tables of numbers. The size of the bubble corresponds to the number. Notice how the lower right bubbles are larger.

Histogram with another variable coloring

This histogram shows that the diamond prices in the data set are skewed right. Most diamonds have a low price, and a few have an extremely high price.

The coloring by carat shows how important size is in determining price.

Normal QQ-plot

This plot evaluates how closely the data fit a probability distribution, here the normal distribution. This is important in a lot of statistical analyses.

This plot shows that diamond price is not normally distributed. Many statistical techniques taught in intro statistics are inappropriate for this data.
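
For readers who want to see what a normal QQ-plot actually plots, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the code used for the figure: the function name normal_qq_points, the plotting positions (i + 0.5) / n, and the toy prices are all choices made for this example. Each sorted observation is paired with the quantile of a fitted normal distribution at the same rank; points far from the line y = x indicate a poor fit.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each sorted observation with the normal quantile at the same rank."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical right-skewed prices, invented for illustration only.
toy_prices = [350, 400, 450, 500, 600, 750, 900, 1500, 4000, 12000]
points = normal_qq_points(toy_prices)
for theoretical, observed in points:
    print(f"{theoretical:10.1f} {observed:8d}")
```

With right-skewed data like this toy sample, both ends of the point cloud sit above the reference line while the middle sits below it, giving the curved shape that signals non-normality.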

Scatterplot with linear models by another variable

Notice how spread out the points are vertically. The amount of spread shows that the difference between the regression models is not operationally significant. If you run a hypothesis test, the difference between the models is statistically significant, but operationally, carat is so much more important than cut that we could model price with carat alone.

Time series plot

Bar plot showing the count and percentage of survived/died by sex.

Diagram of a decision tree modeling survived/died by sex and age.

Eigenvalues from a path of circulant unistochastic matrices.

The points on the plot lie within the unit circle in the complex plane. The blue points are conjugate to the green points. The points were generated by creating a path of 3x3 unitary matrices, mapping the matrices to unistochastic matrices, then taking their eigenvalues.
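
The construction can be sketched in a few lines of Python. This is one possible reading, not the original code: it assumes the unitary matrices themselves are circulant, builds each one from three eigenphases, and uses an invented path that sweeps a single phase. The function names are hypothetical. A unistochastic matrix B is obtained from a unitary U by B[a][b] = |U[a][b]|², and the eigenvalues of a circulant matrix are the discrete Fourier transform of its first row.

```python
import cmath
import math

OMEGA = cmath.exp(2j * math.pi / 3)  # primitive cube root of unity


def circulant_unitary_first_row(phases):
    """First row of the 3x3 circulant unitary with eigenvalues exp(i * phase)."""
    return [
        sum(cmath.exp(1j * phases[k]) * OMEGA ** (-j * k) for k in range(3)) / 3
        for j in range(3)
    ]


def unistochastic_eigenvalues(phases):
    """Eigenvalues of the circulant matrix B with B[a][b] = |U[a][b]|**2."""
    row = [abs(u) ** 2 for u in circulant_unitary_first_row(phases)]
    # A circulant matrix is diagonalized by the Fourier basis, so its
    # eigenvalues are the discrete Fourier transform of its first row.
    return [sum(row[j] * OMEGA ** (j * k) for j in range(3)) for k in range(3)]


# Assumed path (illustration only): sweep one eigenphase over a full turn
# while holding the other two fixed.
path = [
    unistochastic_eigenvalues([0.0, 2 * math.pi * step / 50, 1.0])
    for step in range(51)
]
```

Under these assumptions every matrix on the path has the eigenvalue 1 (its rows sum to 1), and the other two eigenvalues form a complex-conjugate pair inside the unit circle, which is consistent with the blue and green points mirroring each other across the real axis.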

Time series plot of baby girls named Kaitlin.

For coworkers and friends, Aaron made time series plots of the number of babies given each person's name.

Scatterplots showing the importance of categorical variables