Median vs Mean

R day 2:

I was working with a Kaggle dataset of Airbnb listings in New York City. When I ran the summary function on the price variable in R, I noticed a large gap between the mean and the median of the variable.

summary(ab$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    69.0   106.0   152.7   175.0 10000.0

In this case, which statistic describes the typical price better: the mean or the median?

To answer this question, we first plot the density distribution of the price variable.
As the graph shows, the price distribution is extremely skewed to the right: most listings sit at the low end, and a long tail stretches toward the maximum of 10000.

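Can you guess which one makes more sense?
Yes, it is the median that tells a better story about Airbnb prices in NYC. A handful of very expensive listings pull the mean (152.7) well above the median (106.0), while the median still marks the middle of the listings.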

library(ggplot2)

# Density of listing prices; ab is the Airbnb data frame
d1 <- ggplot(ab, aes(price)) + geom_density(alpha = 0.2)
d1
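To see why a single extreme value matters so much, here is a minimal sketch with made-up nightly prices (these numbers are an assumption for illustration, not taken from the Airbnb dataset). It compares the mean and the median before and after adding one very expensive listing.

```r
# Made-up nightly prices, NOT from the Airbnb dataset
prices <- c(60, 75, 90, 100, 110, 130, 150, 180)

mean_before   <- mean(prices)    # 111.875
median_before <- median(prices)  # 105

# Add one extreme listing, similar in spirit to the 10000 maximum above
prices_outlier <- c(prices, 10000)

mean_after   <- mean(prices_outlier)    # jumps above 1200
median_after <- median(prices_outlier)  # 110

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

One outlier multiplies the mean by more than ten, while the median only moves from 105 to 110.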

 

What if the data is not skewed or just slightly skewed?

In this case, the mean is a reliable description of the central tendency of the data.

 

library(ggplot2)

# Simulate two roughly normal samples: carrot and cucumber lengths
carrots <- data.frame(length = rnorm(100000, 6, 2))
cukes <- data.frame(length = rnorm(50000, 7, 2.5))

# Now, combine your two data frames into one. First make a new column in each.
carrots$veg <- 'carrot'
cukes$veg <- 'cuke'

# And combine into your new data frame vegLengths
vegLengths <- rbind(carrots, cukes)

# Now make your lovely plot
p <- ggplot(vegLengths, aes(length, fill = veg)) + geom_density(alpha = 0.2)

p
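The plot can be backed up with a quick numeric check. The sketch below draws a fresh normal sample with the same parameters as the carrots and measures how far apart the mean and the median are. The seed and the variable names are my own choices, used only for reproducibility.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# Same distribution as the carrot lengths: normal, mean 6, sd 2
lengths <- rnorm(100000, mean = 6, sd = 2)

center_mean   <- mean(lengths)
center_median <- median(lengths)

# For symmetric data the two statistics should nearly coincide
gap <- abs(center_mean - center_median)

print(c(mean = center_mean, median = center_median, gap = gap))
```

With a sample this large and symmetric, the gap should be a tiny fraction of the standard deviation, so either statistic describes the center well.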

By examining the density distributions of the data, we can now draw a conclusion.

 

Conclusion:

If a distribution is normal or only slightly skewed, the mean describes the central tendency of the dataset well. If the data is strongly skewed, the median is the more representative measure, because extreme values barely move it.
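As a rough way to apply this rule without looking at a plot, the sketch below reports the mean, the median, and a simple moment-based skewness estimate side by side. The helper name describe_center and the two toy vectors are hypothetical, written for this post only. A skewness near 0 suggests the mean is fine, while a large positive or negative value suggests leaning on the median.

```r
# Hypothetical helper: mean, median, and moment-based sample skewness
describe_center <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))      # standard deviation with divisor n
  skew <- mean((x - m)^3) / s^3   # third standardized moment
  c(mean = m, median = median(x), skewness = skew)
}

# Toy data, made up for illustration
symmetric  <- c(2, 4, 6, 8, 10)
right_tail <- c(1, 2, 2, 3, 3, 4, 50)

sym_stats  <- describe_center(symmetric)   # mean 6, median 6, skewness 0
tail_stats <- describe_center(right_tail)  # mean well above median 3

print(sym_stats)
print(tail_stats)
```

How large the skewness must be before switching to the median is a judgment call; I have not set a threshold here.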

 

Thanks to Jun.z, who generously shared all these stats tricks with me.

REF:

https://plot.ly/ggplot2/geom_density/
