In this post, I will walk through a simple example of using R to find insights in the diamonds dataset that ships with the ggplot2 package.
The Dataset:
There are three columns in this modified dataset: carat (weight), clarity, and price. Here is a sample of the dataset and the clarity value explanations:
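A minimal sketch for viewing such a sample yourself. It assumes the modified dataset is simply the diamonds data frame from ggplot2 reduced to the three columns named above; the name `diamonds_small` is hypothetical and used only for illustration.

```r
library(ggplot2)  # provides the built-in diamonds data frame

# Assumption: the modified dataset is the built-in data cut down to three columns.
diamonds_small <- diamonds[, c("carat", "clarity", "price")]

# Show the first six rows as a sample.
print(head(diamonds_small))

# Clarity is an ordered factor; its levels run from worst (I1) to best (IF).
print(levels(diamonds_small$clarity))
```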
There are two things we need to do before we can begin charting this data in R:
- Import the diamonds dataset.
- Install a charting package. I am installing ggplot2 in this example.
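The two setup steps can be sketched as follows. This assumes the data comes from the diamonds data frame bundled with ggplot2 rather than from an external file, and the CRAN mirror URL and the name `diamonds_small` are illustrative choices.

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)

# Assumption: "importing" here means taking the bundled diamonds data
# and keeping only the three columns used in this post.
diamonds_small <- diamonds[, c("carat", "clarity", "price")]

# Inspect the column types and the number of rows.
str(diamonds_small)
```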
Building the Scatter Plot:
Now that we have our data imported and our charting package installed, let's begin by building a simple scatter plot.
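A simple scatter plot of price against carat might look like this. It is a sketch under the assumption that the data is the three-column subset of the ggplot2 diamonds data; `diamonds_small` and `p_basic` are hypothetical names.

```r
library(ggplot2)

# Assumption: three-column subset of the bundled diamonds data.
diamonds_small <- diamonds[, c("carat", "clarity", "price")]

# Map carat to the x axis and price to the y axis, one point per diamond.
p_basic <- ggplot(diamonds_small, aes(x = carat, y = price)) +
  geom_point()

print(p_basic)
```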
All the data points are now plotted, and the result is interesting but ugly. I will add some color, both for aesthetics and for further insight.
Adding color enables us to see the clarity of the diamonds in relation to price and weight.
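One way to add the color is to map clarity to the color aesthetic, as in this sketch. The data subset and the names `diamonds_small` and `p_color` are assumptions made for illustration.

```r
library(ggplot2)

# Assumption: three-column subset of the bundled diamonds data.
diamonds_small <- diamonds[, c("carat", "clarity", "price")]

# Mapping clarity to color gives each clarity grade its own color and a legend.
p_color <- ggplot(diamonds_small, aes(x = carat, y = price, color = clarity)) +
  geom_point()

print(p_color)
```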
Let us take a couple of steps to make this plot clearer:
- Remove the outliers. The data points become sparse at around 2.5 carats, so I will add a filter that removes diamonds heavier than 2.5 carats.
- Reduce overplotting, the biggest problem with scatter plots. Overplotting happens whenever there are more than a few points and some of them are drawn on top of one another, which can distort the visual appearance of the plot. I will alleviate this issue by using alpha to make the data points transparent.
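Both steps can be sketched as below. The 2.5 carat cutoff comes from the text above, while the alpha value of 0.1 is an assumption, since the post does not state one. The names `diamonds_trim` and `p_alpha` are hypothetical.

```r
library(ggplot2)

# Assumption: three-column subset of the bundled diamonds data.
diamonds_small <- diamonds[, c("carat", "clarity", "price")]

# Step 1: keep only diamonds of 2.5 carats or less.
diamonds_trim <- subset(diamonds_small, carat <= 2.5)

# Step 2: draw semi-transparent points so dense regions appear darker.
# alpha = 0.1 is an illustrative value; tune it to taste.
p_alpha <- ggplot(diamonds_trim, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.1)

print(p_alpha)
```

A lower alpha makes individual points fainter, so only areas where many points overlap stand out.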
The new scatter plot looks better, but it can be difficult for the human eye to find a pattern in scatter plot data. I will use geom_smooth, which helps the eye see patterns in the presence of overplotting.
geom_smooth adds a trend line for each diamond clarity, as well as a confidence interval (the gray shading).
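A sketch of the smoothed plot, assuming the same filter and an illustrative alpha of 0.1. Here geom_smooth is left at its defaults, so ggplot2 picks the smoothing method automatically; the post does not say which settings were used.

```r
library(ggplot2)

# Assumption: three-column subset of the bundled diamonds data, trimmed at 2.5 carats.
diamonds_small <- diamonds[, c("carat", "clarity", "price")]
diamonds_trim  <- subset(diamonds_small, carat <= 2.5)

# Because color is mapped to clarity, geom_smooth fits one trend line per grade.
# se = TRUE is the default and draws the gray confidence band around each line.
p_smooth <- ggplot(diamonds_trim, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.1) +
  geom_smooth()

print(p_smooth)
```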
Insights & Conclusion:
Now that we have built the ideal scatter plot, let's find some insights:
- For diamonds of the same carat weight, those with higher clarity are generally priced higher than those with lower clarity.
- Diamonds tend to be purchased at specific carat weights (1.0, 1.5, 2.0).
- The higher the carat value, the less confidence we have in predicting price, because we have fewer data points. Notice that the gray area gets wider as the carat size increases.
- The plot suggests there are mispricings in the diamond market. Where the trend lines cross one another, a clarity grade is priced out of its expected order, which points to a mispricing. There are two areas on the plot where mispricings are most likely; I have circled them in red below:
If I were a purchaser of diamonds, I might want to buy a diamond with a carat value in one of the areas circled in red. Buying at these carat values gives me a chance of finding a diamond that is mispriced on the low side.
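Two of the insights above, the clustering at round carat weights and the thinning of data at high carat values, can also be checked with simple counts. This is a rough sketch that assumes the ggplot2 diamonds data; the helper `count_at` and the bin edges are hypothetical choices made for illustration.

```r
library(ggplot2)

# Assumption: three-column subset of the bundled diamonds data.
diamonds_small <- diamonds[, c("carat", "clarity", "price")]

# Hypothetical helper: number of diamonds at a given carat weight
# (a small tolerance avoids exact floating-point comparison).
count_at <- function(w) sum(abs(diamonds_small$carat - w) < 0.001)

# Compare each round weight with the weight just below it.
weights <- c(0.99, 1.00, 1.49, 1.50, 1.99, 2.00)
round_counts <- sapply(weights, count_at)
names(round_counts) <- format(weights, nsmall = 2)
print(round_counts)

# Count diamonds per carat range to see how the data thins out.
carat_bins <- table(cut(diamonds_small$carat, breaks = c(0, 1, 2, 3, Inf)))
print(carat_bins)
```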
This is just a quick example of how you can use R scatter plots to find insights in data.
If you have any questions about R or this scatter plot, please contact me at:
Kiel Briggs