R day 2:
I was working on a Kaggle dataset of Airbnb listings in New York City. When I ran the summary() function on the price variable in R, I noticed a large gap between the mean and the median of the variable.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    69.0   106.0   152.7   175.0 10000.0
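In this case, which statistic is more informative, the mean or the median?
To answer this question, we first plot the density of the price variable.
As the graph shows, the price distribution is extremely skewed to the right: most listings sit at the low end, while a long tail stretches out to the maximum of 10,000. That tail is what pulls the mean (152.7) well above the median (106).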
Can you guess which one would make more sense?
Yes, it is the median that tells a better story about Airbnb prices in NYC! A handful of very expensive listings inflates the mean, but they barely move the median.
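library(ggplot2)
# ab is the data frame holding the Airbnb listings (assumed to be loaded already)
d1 <- ggplot(ab, aes(price)) + geom_density(alpha = 0.2)
d1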
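To see why the mean reacts so strongly to a long tail, here is a minimal sketch with made-up prices (invented for illustration, not taken from the Kaggle data). Adding a single 10,000 listing moves the mean a lot, while the median shifts only a little.

```r
# Hypothetical nightly prices, invented for illustration only
prices <- c(69, 90, 106, 150, 175)

mean_before   <- mean(prices)    # 118
median_before <- median(prices)  # 106

# Add one extreme listing, similar to the 10000 maximum in the summary output
prices_tail <- c(prices, 10000)

mean_after   <- mean(prices_tail)    # 1765
median_after <- median(prices_tail)  # 128

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

The mean jumps from 118 to 1765 because of one value, whereas the median only moves from 106 to 128.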
What if the data is not skewed or just slightly skewed?
In this case, the mean is a reliable description of the central tendency of the data.
carrots <- data.frame(length = rnorm(100000, 6, 2))
cukes <- data.frame(length = rnorm(50000, 7, 2.5))

# Now, combine your two data frames into one. First make a new column in each.
carrots$veg <- 'carrot'
cukes$veg <- 'cuke'

# and combine into your new data frame vegLengths
vegLengths <- rbind(carrots, cukes)

# now make your lovely plot
p <- ggplot(vegLengths, aes(length, fill = veg)) + geom_density(alpha = 0.2)
p
By examining the density distributions of the data, we can now draw a conclusion.
If a distribution is normal or only slightly skewed, the mean describes the central tendency of the dataset well. If the data is strongly skewed, the median is the more intuitive measure.
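One rough way to apply this rule without eyeballing a plot is to compute the sample skewness. The sketch below is only an illustration: `skewness_simple` and `central_value` are hypothetical helper names, the formula is the basic moment-based one, and the cutoff of 1 is an assumed rule of thumb rather than anything derived from the Airbnb data.

```r
# Moment-based sample skewness:
# third central moment divided by (second central moment)^1.5
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Return the median for strongly skewed data, otherwise the mean.
# The threshold of 1 is an assumption, not a fixed rule.
central_value <- function(x, threshold = 1) {
  if (abs(skewness_simple(x)) > threshold) {
    median(x, na.rm = TRUE)
  } else {
    mean(x, na.rm = TRUE)
  }
}

symmetric <- c(2, 4, 6, 8, 10)   # perfectly symmetric toy sample
skewed    <- c(1, 2, 3, 4, 100)  # one large value creates a right tail

print(skewness_simple(symmetric))  # zero skewness
print(central_value(symmetric))    # uses the mean
print(central_value(skewed))       # uses the median
```

On the real data you could call `central_value(ab$price)`, assuming `ab` is the Airbnb data frame used earlier.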
Thanks to Jun.z, who is always willing to share all the stats tricks with me.