Numerical Summaries

Summarizing data in R 1/2

  • Have seen summary (five-number summary, plus mean, of each numeric column). But what if we want:
    • a summary or two of just one column
    • a count of observations in each category of a categorical variable
    • summaries by group
    • a different summary of all columns (e.g. SD)
  • To do this, meet the pipe operator %>%. This takes an input data frame, does something to it, and outputs the result. (Keyboard shortcut in RStudio: Ctrl-Shift-M.)
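
A minimal sketch of what the pipe does, using a small made-up data frame (toy and its columns are hypothetical, not the athletes data). The pipe feeds whatever is on its left into the function on its right as the first argument, so the two calls below should give the same result:

```r
library(dplyr)

# Hypothetical toy data frame, for illustration only
toy <- tibble(group = c("a", "a", "b"), x = c(1, 2, 6))

# Without the pipe: the data frame is written as the first argument
res1 <- summarize(toy, m = mean(x))

# With the pipe: the data frame on the left becomes the first argument
res2 <- toy %>% summarize(m = mean(x))

identical(res1, res2)
```

Read the pipe as "and then": take toy, and then summarize it.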

Summarizing data in R 2/2

  • Output from a pipe can be used as input to something else, so we can have a sequence of pipes.
  • Summaries include: mean, median, min, max, sd, IQR, quantile (for obtaining quartiles or any percentile), n (for counting observations).
  • Use our Australian athletes data again.
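
Each of the summaries listed above takes a column and returns a single number (n() takes no input and only works inside summarize and similar). A small sketch on made-up numbers, chosen so the answers can be checked by hand; toy and x are hypothetical names:

```r
library(dplyr)

# Hypothetical data: five made-up values
toy <- tibble(x = c(2, 4, 4, 5, 10))

toy %>%
  summarize(
    n      = n(),               # number of observations
    mean   = mean(x),
    median = median(x),
    min    = min(x),
    max    = max(x),
    sd     = sd(x),
    iqr    = IQR(x),            # third quartile minus first quartile
    q90    = quantile(x, 0.90)  # 90th percentile
  ) -> stats
stats
```

The result is a data frame with one row and one column per summary.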

Packages for this section

library(tidyverse)

The summary we have seen before:

summary(athletes)
     Sex               Sport                RCC             WCC        
 Length:202         Length:202         Min.   :3.800   Min.   : 3.300  
 Class :character   Class :character   1st Qu.:4.372   1st Qu.: 5.900  
 Mode  :character   Mode  :character   Median :4.755   Median : 6.850  
                                       Mean   :4.719   Mean   : 7.109  
                                       3rd Qu.:5.030   3rd Qu.: 8.275  
                                       Max.   :6.720   Max.   :14.300  
       Hc              Hg             Ferr             BMI       
 Min.   :35.90   Min.   :11.60   Min.   :  8.00   Min.   :16.75  
 1st Qu.:40.60   1st Qu.:13.50   1st Qu.: 41.25   1st Qu.:21.08  
 Median :43.50   Median :14.70   Median : 65.50   Median :22.72  
 Mean   :43.09   Mean   :14.57   Mean   : 76.88   Mean   :22.96  
 3rd Qu.:45.58   3rd Qu.:15.57   3rd Qu.: 97.00   3rd Qu.:24.46  
 Max.   :59.70   Max.   :19.20   Max.   :234.00   Max.   :34.42  
      SSF             %Bfat             LBM               Ht       
 Min.   : 28.00   Min.   : 5.630   Min.   : 34.36   Min.   :148.9  
 1st Qu.: 43.85   1st Qu.: 8.545   1st Qu.: 54.67   1st Qu.:174.0  
 Median : 58.60   Median :11.650   Median : 63.03   Median :179.7  
 Mean   : 69.02   Mean   :13.507   Mean   : 64.87   Mean   :180.1  
 3rd Qu.: 90.35   3rd Qu.:18.080   3rd Qu.: 74.75   3rd Qu.:186.2  
 Max.   :200.80   Max.   :35.520   Max.   :106.00   Max.   :209.4  
       Wt        
 Min.   : 37.80  
 1st Qu.: 66.53  
 Median : 74.40  
 Mean   : 75.01  
 3rd Qu.: 84.12  
 Max.   :123.20  

Summarizing one column

  • Mean height:
athletes %>% summarize(m=mean(Ht))

or to get mean and SD of BMI:

athletes %>% summarize(m = mean(BMI), s = sd(BMI)) -> d
d

This doesn’t work, because BMI is a column of athletes and not an object in its own right:

mean(BMI)
Error: object 'BMI' not found
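
A sketch of some ways round this, on a made-up stand-in for the data (toy is hypothetical). Each working version says which data frame the column lives in; the last two return a plain number instead of a data frame:

```r
library(dplyr)

# Hypothetical stand-in for the athletes data frame
toy <- tibble(BMI = c(20, 22, 24))

# mean(BMI) on its own fails: BMI exists only as a column of toy
failed <- inherits(try(mean(BMI), silent = TRUE), "try-error")

# These work, because each one names the data frame
m1 <- toy %>% summarize(m = mean(BMI))  # a one-row data frame
m2 <- mean(toy$BMI)                     # a plain number
m3 <- toy %>% pull(BMI) %>% mean()      # also a plain number
```

Inside summarize, column names are looked up in the data frame coming down the pipe, which is why mean(BMI) is fine there.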

Quartiles

  • quantile calculates percentiles (“fractiles”). The quartiles are the 25th and 75th percentiles, so ask for those:
athletes %>% summarize( Q1=quantile(Wt, 0.25),
                        Q3=quantile(Wt, 0.75))
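
To see how the quartiles relate to the other summaries, here is a sketch on five made-up weights (toy is hypothetical). With R's default quantile method, the median is the 50th percentile and the IQR is Q3 minus Q1:

```r
library(dplyr)

# Hypothetical weights, evenly spaced so the answers are easy to check
toy <- tibble(Wt = c(50, 60, 70, 80, 90))

toy %>%
  summarize(Q1  = quantile(Wt, 0.25),
            med = quantile(Wt, 0.50),
            Q3  = quantile(Wt, 0.75),
            iqr = IQR(Wt)) -> q
q
```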

Creating new columns

  • These weights are in kilograms. Maybe we want to summarize the weights in pounds.
  • Convert kg to lb by multiplying by 2.2 (approximately).
  • Create new column and summarize that:
athletes %>% mutate(wt_lb=Wt*2.2) %>%
  summarize(Q1_lb=quantile(wt_lb, 0.25),
            Q3_lb=quantile(wt_lb, 0.75)) 
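
A sketch of the same two steps on made-up weights (toy is hypothetical), stopping after mutate to look at what it produced. mutate adds the new column alongside the old one; it does not replace Wt:

```r
library(dplyr)

# Hypothetical weights in kilograms
toy <- tibble(Wt = c(50, 60, 70, 80, 90))

# Step 1: add a column of weights in pounds (original column is kept)
with_lb <- toy %>% mutate(wt_lb = Wt * 2.2)
with_lb

# Step 2: summarize the new column
with_lb %>%
  summarize(Q1_lb = quantile(wt_lb, 0.25),
            Q3_lb = quantile(wt_lb, 0.75)) -> q_lb
q_lb
```

The new column exists only in the output of the pipeline; toy itself is unchanged unless the result is saved back.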

Counting how many

For example, the number of athletes in each sport:

athletes %>% count(Sport)

Counting how many, variation 2:

Another way (which will make sense in a moment):

athletes %>% group_by(Sport) %>%
  summarize(count=n())
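
A sketch comparing the two ways of counting on made-up data (toy and its sports are hypothetical). The counts agree; only the name of the count column differs (n from count, whatever you choose from summarize). count also has a sort option for putting the biggest group first:

```r
library(dplyr)

# Hypothetical data: one row per athlete
toy <- tibble(Sport = c("Tennis", "Row", "Row", "Swim", "Row", "Swim"))

# Way 1: count() names its count column n
by_count <- toy %>% count(Sport)

# Way 2: group_by() then summarize() with n(); we name the column ourselves
by_n <- toy %>% group_by(Sport) %>% summarize(count = n())

# Largest group first
most_first <- toy %>% count(Sport, sort = TRUE)
```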

Summaries by group

  • Might want separate summaries for each “group”, e.g. mean and SD of height for males and females. Strategy is group_by (to define the groups) and then summarize:
athletes %>% group_by(Sex) %>% 
  summarize(mean_Ht = mean(Ht), sd_Ht = sd(Ht))

Count plus stats

  • If you want number of observations per group plus some stats, you need to go the n() way:
athletes %>% group_by(Sex) %>%
  summarize(n = n(), mean_Ht = mean(Ht), sd_Ht = sd(Ht))
  • This explains the second variation on counting: n() after group_by answers “within each sport (or Sex), how many athletes were there?”
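
A sketch of the shape of a grouped summary, on made-up heights (toy is hypothetical). The result has one row per group, with the grouping column first and then one column per summary:

```r
library(dplyr)

# Hypothetical data: two athletes of each sex
toy <- tibble(Sex = c("female", "female", "male", "male"),
              Ht  = c(160, 170, 175, 185))

toy %>%
  group_by(Sex) %>%
  summarize(n = n(), mean_Ht = mean(Ht), sd_Ht = sd(Ht)) -> by_sex
by_sex
```

Without the group_by, the same summarize would give a single row for all four athletes together.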

Summarizing several columns

  • Standard deviation of each (numeric) column:
athletes %>% summarize(across(where(is.numeric), \(x) sd(x))) 
  • Median and IQR of all columns whose name starts with H:
athletes %>% summarize(across(starts_with("H"),
                       list(med = \(x) median(x), 
                            iqr = \(x) IQR(x))))
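
A sketch of how across names its output, on made-up data (toy is hypothetical). With a single function the output columns keep their original names; with a named list of functions, each output column is named as column, underscore, function name:

```r
library(dplyr)

# Hypothetical data with one text column and two numeric columns
toy <- tibble(Sex = c("female", "female", "male", "male"),
              Hc  = c(40, 42, 45, 47),
              Ht  = c(160, 170, 175, 185))

# One function: Sex is skipped because it is not numeric
sds <- toy %>% summarize(across(where(is.numeric), \(x) sd(x)))
names(sds)

# Named list of functions: names are built from column and function
res <- toy %>%
  summarize(across(starts_with("H"),
                   list(med = \(x) median(x),
                        iqr = \(x) IQR(x))))
names(res)
```

The \(x) shorthand for an anonymous function needs R 4.1 or later.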

Same thing by group

athletes %>% 
  group_by(Sex) %>% 
  summarize(across(starts_with("H"), 
                   list(med = \(h) median(h), 
                        iqr = \(h) IQR(h))))
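
Likewise for columns whose names end with C (matching ignores case by default, so Hc is included as well as RCC and WCC):

athletes %>% 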
  group_by(Sex) %>% 
  summarize(across(ends_with("C"), 
                   list(med = \(h) median(h), 
                        iqr = \(h) IQR(h))))