Data Science Day 20: When we watch a soccer game, the screen shows basic information about each team at the start of the match. Suppose we want to know whether the average age of Real Madrid players differs from that of Barcelona players. What statistical test should we use? Answer: We can use a t-test to determine whether there…
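As a rough illustration of the comparison described above, the sketch below runs a two-sample t-test with `scipy.stats.ttest_ind`. The ages are invented for the example (they are not real squad data), and the choice of Welch's variant (`equal_var=False`) is an assumption, since the excerpt does not say which form of the t-test the post uses.

```python
from scipy import stats

# Hypothetical player ages, invented purely for illustration.
real_madrid_ages = [22, 24, 25, 27, 28, 29, 30, 31, 33, 35]
barcelona_ages = [18, 19, 21, 22, 23, 24, 25, 26, 28, 34]

# Welch's t-test compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(real_madrid_ages, barcelona_ages, equal_var=False)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: the average ages differ at the 5% level.")
else:
    print("Fail to reject H0: no significant difference in average age.")
```

The null hypothesis here is that both teams have the same mean age; a small p-value is evidence against it.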

# Category: Python

## Clustering Analysis - Iris Dataset

Data Science Day 19: In Supervised Learning, we specify the possible categorical values and train models to recognize patterns. However, what if we don’t have existing classified data for the model to learn from? When we model the data to discover how it clusters based on certain attributes, we are doing Unsupervised Learning. Clustering Analysis is one of…
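A minimal sketch of clustering the Iris measurements without using the species labels might look like the following. K-means and `n_clusters=3` are assumptions made for this example; the truncated excerpt does not say which clustering method the post goes on to use.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # 150 samples x 4 measurements; the species labels are ignored

# n_clusters=3 is an assumption (it happens to match the three iris species).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One centroid per cluster, in the original 4-dimensional feature space.
print(kmeans.cluster_centers_)
```

Because the algorithm never sees the species column, any agreement between `labels` and the true species comes purely from structure in the measurements.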

## Risk Ratio

Data Science Day 15: Risk Ratio. Last time, we gave a SAS example of Risk Difference to test whether two groups experience the same proportion of a certain event. To understand the topic better, we will now go over Risk Ratio. Definition: Risk Ratio or Relative Risk (RR) is the probability that an event occurs in group 1…
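The standard definition of relative risk (risk in group 1 divided by risk in group 2) can be sketched in a few lines of Python. The function name and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def risk_ratio(events_1, total_1, events_2, total_2):
    """Return the risk in group 1 divided by the risk in group 2."""
    risk_1 = events_1 / total_1  # proportion of group 1 with the event
    risk_2 = events_2 / total_2  # proportion of group 2 with the event
    return risk_1 / risk_2


# Hypothetical counts: 30 of 100 in group 1 and 15 of 100 in group 2 had the event.
rr = risk_ratio(30, 100, 15, 100)
print(f"Risk ratio = {rr:.2f}")
```

A value above 1 means the event is more likely in group 1; a value of 1 means both groups carry the same risk.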

## Python Network Graph

Python Day 1: Neuron Network Graph. Suppose we would like to build a basic network graph showing that a student’s grade is affected by IQ and Study, and that Interest and Method in turn affect the result of Study. # libraries import pandas as pd import numpy as np import networkx as nx import matplotlib.pyplot as plt #build dataframe with connections: df…
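Since the original dataframe is cut off in the excerpt, the sketch below builds the same relationships directly from an edge list. The edge list and the use of a directed graph are assumptions based on the description above, not the post's own code.

```python
import networkx as nx

# Edges point from cause to effect. Node names follow the description above;
# the exact edge list is an assumption because the original dataframe is truncated.
edges = [
    ("IQ", "Grade"),
    ("Study", "Grade"),
    ("Interest", "Study"),
    ("Method", "Study"),
]

graph = nx.DiGraph()
graph.add_edges_from(edges)

# List the direct influences on each outcome node.
print(sorted(graph.predecessors("Grade")))
print(sorted(graph.predecessors("Study")))
```

To render the graph, `nx.draw(graph, with_labels=True)` followed by `plt.show()` from `matplotlib.pyplot` should be enough for a quick look.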

## Odds Ratio

Data Science Day 12: Odds Ratio

Learning Objective: Probability vs. Odds vs. Odds Ratio

1. Probability = Event / Sample Space
2. Odds = Prob(Event) / Prob(Non-Event)
3. Odds Ratio = Odds(Group 1) / Odds(Group 2)

Interpretation: The Odds Ratio is a measure of association between exposure and outcome. OR = Odds(Group 1) / Odds(Group 2) > 1 indicates an increased occurrence of an event in Group 1 compared to Group…
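The three definitions above can be checked with a small worked example. The helper names and the counts are hypothetical; dividing event counts by non-event counts gives the same odds as dividing the corresponding probabilities.

```python
def odds(events, non_events):
    """Odds = Prob(Event) / Prob(Non-Event), which equals events / non_events."""
    return events / non_events


def odds_ratio(events_1, non_events_1, events_2, non_events_2):
    """Odds Ratio = Odds(Group 1) / Odds(Group 2)."""
    return odds(events_1, non_events_1) / odds(events_2, non_events_2)


# Hypothetical counts: group 1 has 20 events and 80 non-events,
# group 2 has 10 events and 90 non-events.
probability_1 = 20 / (20 + 80)         # event count over sample space
odds_1 = odds(20, 80)                  # odds of the event in group 1
or_value = odds_ratio(20, 80, 10, 90)  # group 1 odds relative to group 2

print(f"P = {probability_1:.2f}, odds = {odds_1:.2f}, OR = {or_value:.2f}")
```

Note how the probability (0.20) and the odds (0.25) differ for the same group, which is the distinction the learning objective is drawing.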

## Normalization

Data Science Day 8: Data transformation is one of the critical steps in Data Mining. Among the many data transformation methods, normalization is one of the most frequently used techniques. For example, we can use Z-score normalization to reduce possible noise in sound frequency. We will introduce three common normalization methods: Max-Min Normalization, Z-Score Normalization, and Scale Multiplication. Max-Min Normalization: x_normal = (x - min(x)) / (max(x) - min(x))…
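A short sketch of the first two methods follows. The Max-Min formula is the one given above; the Z-score formula, (x - mean) / standard deviation, is the standard definition and is filled in here as an assumption because the excerpt is cut off before reaching it. Scale multiplication is left out for the same reason.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def z_score_normalize(values):
    """Center values on 0 and scale them to unit standard deviation."""
    mean = statistics.mean(values)
    # Population standard deviation; using it rather than the sample version is an assumption.
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


data = [2, 4, 6, 8, 10]
print(min_max_normalize(data))
print(z_score_normalize(data))
```

Both helpers divide by a spread measure, so a constant input (where max equals min) would raise a `ZeroDivisionError` and needs separate handling.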

## Chi-Square 5

Data Science Day 7: Where does the name of the Chi-square distribution come from? From the first day, we know that the Chi-square distribution is the distribution of a sum of squared standard normal deviates, which ties it closely to the variance. If we investigate the standard deviates, we find an interesting relation between the Chi-square and Normal distributions: if a variable follows the standard normal distribution, then its square follows the…
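The relation between the two distributions can be checked numerically. The simulation below relies on the well-known result that the square of a standard normal variable follows a Chi-square distribution with 1 degree of freedom; the sample size and seed are arbitrary choices for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Draw standard normal values and square them.
z = rng.standard_normal(100_000)
squared = z ** 2

# A Chi-square distribution with 1 degree of freedom has mean 1 and variance 2.
print(squared.mean(), squared.var())

# Compare the empirical 95th percentile with the theoretical Chi-square(1) value.
empirical_q95 = np.quantile(squared, 0.95)
theoretical_q95 = stats.chi2.ppf(0.95, df=1)
print(empirical_q95, theoretical_q95)
```

Summing k independent squared standard normal variables instead of one gives a Chi-square distribution with k degrees of freedom.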

## Chi-Square 4

Data Science Day 6: Chi-square application 3: Test for Homogeneity of One Categorical Variable across several sample spaces. We use the Chi-square test for Homogeneity to evaluate whether a single categorical variable has a similar distribution (or frequency proportion) across two or more sample spaces (or populations). Example: A couple of make-up companies wish to determine if there are differences in the sales market for…
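To make the mechanics of the homogeneity test concrete, the sketch below computes the statistic by hand from a table of counts, with one row per population and one column per category. The function name and the counts are hypothetical and unrelated to the make-up example, whose details are cut off in the excerpt.

```python
def chi_square_statistic(table):
    """Return the Chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if every population shared the same category proportions.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: two populations (rows) across three categories (columns).
table = [
    [40, 35, 25],
    [30, 45, 25],
]
stat, dof = chi_square_statistic(table)
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

The statistic is then compared against the Chi-square critical value for the given degrees of freedom (about 5.99 at the 5% level when there are 2 degrees of freedom).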

## Chi-Square 3

Data Science Day 5: Chi-Square Application 2: Test of Independence of Two Categorical Variables, also known as the Contingency Table test. We use the Chi-Square test of Independence to check whether two categorical variables are independent or strongly associated. Example 1: Ice-cream Flavor vs. Buyer’s Gender. We want to see if there is a preference for ice-cream flavor based on the gender of…
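One way to run such a test in Python is `scipy.stats.chi2_contingency`, which takes the contingency table of observed counts. The flavors and counts below are invented for illustration and do not come from the original post.

```python
from scipy.stats import chi2_contingency

# Rows: buyer's gender; columns: flavor (chocolate, vanilla, strawberry).
# All counts are hypothetical.
observed = [
    [30, 20, 10],  # men
    [15, 25, 20],  # women
]

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: flavor preference and gender appear to be associated.")
else:
    print("Fail to reject H0: no evidence that flavor preference depends on gender.")
```

The returned `expected` array holds the counts that would be expected if the two variables were independent, which is useful for seeing which cells drive the result.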

## Chi-Square 2

Data Science Day 4: Chi-Square test application 1: Goodness of Fit. We use the goodness-of-fit test to check whether observed categorical data follow the hypothesized or expected distribution. Example 1: P-value Interpretation. Suppose f_exp are the expected numbers of boys in the different grade 1 classes, and f_obs are the observed numbers of boys in grade 1. We want…
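A goodness-of-fit test along these lines can be sketched with `scipy.stats.chisquare`, whose parameters happen to share the names `f_obs` and `f_exp`. The class counts are hypothetical, since the original example's numbers are cut off in the excerpt.

```python
from scipy.stats import chisquare

# Hypothetical counts of boys in four grade 1 classes.
f_exp = [10, 10, 10, 10]  # expected under the hypothesized distribution
f_obs = [8, 12, 9, 11]    # actually observed (same total as f_exp)

statistic, p_value = chisquare(f_obs=f_obs, f_exp=f_exp)

print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: the observed counts do not fit the expected distribution.")
else:
    print("Fail to reject H0: the observed counts are consistent with the expected distribution.")
```

The observed and expected totals should match; the degrees of freedom default to the number of categories minus one.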