STAC32 Assignment 2

Packages

library(tidyverse)

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Childbirth and smoking

Data were obtained on a random sample of 150 births in North Carolina, in https://ritsokiguess.site/datafiles/births_smoking.csv. Interest was in whether the mother being a smoker had any impact on the baby. Variables of interest to us are:

  • f_age, m_age: age of the father and mother of the baby (years)
  • weeks: the length of the pregnancy in weeks
  • premature: whether or not the baby was born prematurely (premie) or whether the pregnancy was the usual length (full term)
  • weight: weight of the baby at birth, in decimal pounds
  • sex_baby: whether the baby was male or female
  • smoke: whether the mother was a smoker or a nonsmoker.

  1. (2 points) Read in and display (some of) the data. Confirm that the data you read in is as I described above.

You’ll get very accustomed to this procedure: put the data file URL in a variable, note that it is a .csv file, use read_csv to read it in directly, save it under a suitable name, and then put its name on a line by itself to display it:

my_url <- "https://ritsokiguess.site/datafiles/births_smoking.csv"
births <- read_csv(my_url)
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): premature, sex_baby, smoke
dbl (6): f_age, m_age, weeks, visits, gained, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
births

My description above said there were 150 births, so there should be 150 rows (and are). I also gave you the names of some columns; check that those columns actually are in the dataframe (they are; you can assert this once you have convinced yourself that it is true). There are some other columns that we do not use (the clue is in “variables of interest to us”, which implies that there might be some others), but it is enough to say that the columns we want are actually in there.

Extra: it is good practice to check any data you read in for sanity. If you know how many rows there should be, check that. Check that you have the variables of interest (in columns), and check that they have sensible values (e.g. by looking at the first few rows). In this case, three of the variables are categorical and you know what levels they should have (the values you see in the data match what they were supposed to be). The others are quantitative, and for two of them you can check that a pregnancy is about 9 months long (not quite 40 weeks), and that the baby weights are somewhere around 7 or 8 pounds, which is typical (you’ll have to do a quick conversion if you are accustomed to baby weights in kilograms).
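These checks can also be written as code. What follows is only a sketch, assuming a small made-up dataframe toy_births (hypothetical values, not the real data) that has the same column names as the ones described; with the real data you would use births instead.

```r
library(tidyverse)

# Hypothetical stand-in for births: same column names, made-up values
toy_births <- tibble(
  f_age = c(31, NA, 25),
  m_age = c(30, 22, 24),
  weeks = c(39, 36, 40),
  premature = c("full term", "premie", "full term"),
  weight = c(7.4, 5.1, 8.0),
  sex_baby = c("male", "female", "female"),
  smoke = c("nonsmoker", "smoker", "nonsmoker")
)

# Are all the columns we were told about actually there?
wanted <- c("f_age", "m_age", "weeks", "premature", "weight", "sex_baby", "smoke")
all_present <- all(wanted %in% names(toy_births))
all_present

# How many rows are there? (This should be 150 for the real data.)
nrow(toy_births)

# What values do the categorical columns take?
toy_births %>% count(premature)
toy_births %>% count(smoke)

# Are the quantitative values in a plausible range?
toy_births %>%
  summarize(min_weeks = min(weeks), max_weeks = max(weeks),
            min_weight = min(weight), max_weight = max(weight))
```

None of this is needed for the marks; looking at the displayed dataframe and saying what you see is enough.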

This is the births dataset from the openintro package. There are some other similar datasets around; I use one in PASIAS which has some extra complications with variable names. This one is nice and well-behaved.

  1. (2 points) Work out the number of observations, mean birth weight, and SD of birth weights, for all the babies taken together.

This one is just a summarize (no group-by; that’s coming up):

births %>% 
  summarize(n = n(), mean_bwt = mean(weight), sd_bwt = sd(weight))

There are 150 observations (as we already know). The mean birthweight is 7.0 pounds and the standard deviation of birthweights is 1.5 pounds.

  1. (3 points) How do the mean birthweights compare between smoking and non-smoking mothers?

First, work out the mean birthweights for each group of mothers. The summarize part of the code is like what you just did, but before that put a group_by to get results for each group of mothers separately:

births %>% 
  group_by(smoke) %>% 
  summarize(mean_bwt = mean(weight))

The mean birth weight of babies born to smoking mothers is less than that of babies born to non-smoking mothers, by about 0.4 pounds on average.

Alternatively, you can leave either or both of the sample size and the SD in, which gives you this:

births %>% 
  group_by(smoke) %>% 
  summarize(n = n(), mean_bwt = mean(weight), sd_bwt = sd(weight))

and the conclusion about means is the same.

Extra: you might be curious about whether this difference is chance or indicative of something real. Under the assumption that these births are a random sample of “all possible” births, this is a two-sample \(t\)-test (that we will see in the course shortly):

t.test(weight ~ smoke, data = births)

    Welch Two Sample t-test

data:  weight by smoke
t = 1.4967, df = 89.277, p-value = 0.138
alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
95 percent confidence interval:
 -0.1311663  0.9321663
sample estimates:
mean in group nonsmoker    mean in group smoker 
                 7.1795                  6.7790 

This is not significant (P-value 0.138), so, on this view, there is no evidence that the difference is anything other than chance. There is, however, another issue at play, which is coming up.

  1. (2 points) Classify the births by whether or not the mother was a smoker, and by whether or not the baby was born prematurely. How many births fall into each combination of categories?

This one is most easily done as a count:

births %>% count(smoke, premature)

I think it’s better to put smoke first, because we want to see whether this has any impact on prematureness. Alternatively, you can do it like this:

births %>% group_by(smoke, premature) %>% 
  summarize(n = n())
`summarise()` has grouped output by 'smoke'. You can override using the
`.groups` argument.

Extra: There were 100 nonsmoking mothers and only 50 smoking ones, so it’s a bit difficult to compare these numbers. There are a couple of ways around this. One is to turn the counts into percentages, which goes like this. This requires the n() approach, and there is some extra subtlety that I will explain:

births %>% group_by(smoke, premature) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n / sum(n) * 100)
`summarise()` has grouped output by 'smoke'. You can override using the
`.groups` argument.

The percentage of premature births is almost the same between the smoking and non-smoking mothers.

The group-by and summarize produces the column n, and the new column pct is calculated from n. But what does sum(n) mean? Specifically, what is it adding up? The answer is that it sums over the last thing in the group_by: summarize removes the last grouping variable (premature) but leaves the result grouped by smoke, so sum(n) adds up the counts within each smoke group. This may seem like an odd way for it to work, but it was designed this way deliberately. So the 84 was calculated as \(42 / (42 + 8) \times 100\). This is the right way around to do the percentages, because the questions you are asking are “out of the smokers, what percentage of births were premature?” and then “out of the non-smokers, what percentage of births were premature?”.
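If you want to convince yourself of what sum(n) is adding up, here is a sketch that does not need the data file. It assumes only the four counts from the smoke-by-premature table in this document; uncount (from tidyr) is just a device to get back to one row per birth, and the names counts, one_per_birth, summary_tab and with_pct are mine.

```r
library(tidyverse)

# Counts taken from the smoke-by-premature table in this document
counts <- tribble(
  ~smoke,      ~premature,  ~n,
  "nonsmoker", "full term", 87,
  "nonsmoker", "premie",    13,
  "smoker",    "full term", 42,
  "smoker",    "premie",     8
)

# Rebuild one row per birth so that there is something to count
one_per_birth <- counts %>% uncount(n)

# Asking for "drop_last" explicitly is what summarize does by default here,
# and it also quietens the message about grouped output
summary_tab <- one_per_birth %>%
  group_by(smoke, premature) %>%
  summarize(n = n(), .groups = "drop_last")

# The result is still grouped by smoke ...
group_vars(summary_tab)

# ... so sum(n) adds up the counts within each smoke group
with_pct <- summary_tab %>% mutate(pct = n / sum(n) * 100)
with_pct
```

The pct values for the two smoker rows come from 42 and 8 out of 50, which is where the 84 comes from.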

Compare that with this, where I switched the order of the group-by around:

births %>% group_by(premature, smoke) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n / sum(n) * 100)
`summarise()` has grouped output by 'premature'. You can override using the
`.groups` argument.

The counts are the same, but the percentages are different. That is because, now, we are asking “out of the full-term babies, what percentage had a mother who smoked?” and getting an answer “32.6%”. This, though, is logically backwards, because smoking might affect prematureness, not the other way around.

You might be getting some vague echoes of row and column percentages in contingency tables (from, probably, your first course). The base R table makes those:

with(births, table(smoke, premature))
           premature
smoke       full term premie
  nonsmoker        87     13
  smoker           42      8

Work out row and column percentages from that, and see how it compares to what I did above.
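One possible way to do that, as a sketch: I retype the four counts from the table above into a matrix (the name tab is mine), and then base R’s prop.table works out proportions within rows (margin = 1) or within columns (margin = 2).

```r
# Counts copied from the table above; matrix() fills column by column
tab <- matrix(
  c(87, 42, 13, 8),
  nrow = 2,
  dimnames = list(
    smoke = c("nonsmoker", "smoker"),
    premature = c("full term", "premie")
  )
)
tab

# Row percentages: out of each kind of mother, what percent of births were premature?
row_pct <- prop.table(tab, margin = 1) * 100
row_pct

# Column percentages: out of each kind of birth, what percent of mothers smoked?
col_pct <- prop.table(tab, margin = 2) * 100
col_pct
```

The row percentages correspond to grouping by smoke first, and the column percentages to grouping by premature first.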

I said there was a second way around the issue of comparing prematureness rates. That is to make a graph. The starting point for this, with two categorical variables, is a grouped bar chart. We said that prematureness was the outcome (or that smoking was explanatory), so we’ll use premature as fill:

ggplot(births, aes(x = smoke, fill = premature)) + geom_bar(position = "dodge")

Most of the births are full term, whether the mother smoked or not, but there are fewer smoking mothers than non-smoking ones, so it is difficult to compare. I said in lecture that I was not a fan of stacking the bars. This is what happens if you do:

ggplot(births, aes(x = smoke, fill = premature)) + geom_bar(position = "stack")

The blue piece on the right is smaller than the blue piece on the left, but so is the whole bar. What we care about is what fraction of the whole bar is blue in each case, and a variation on stacking is useful here:

ggplot(births, aes(x = smoke, fill = premature)) + geom_bar(position = "fill")

What this does is to scale the two bars to have the same height, so now you can compare how much of each bar is red and how much is blue. This shows that the fraction of premature births among mothers who smoke is very slightly larger than among non-smoking mothers. That said, we can now also say that there is very little difference.

  1. (3 points) How is a premature birth defined in terms of the number of weeks that a pregnancy lasted? Calculate one or more numerical summaries that will enable you to figure this out, and describe what you find. (Hint: min and max do what you would expect.)

Taking the hint, let’s work out the largest and smallest number of weeks that go with premature and full term babies:

births %>% 
  group_by(premature) %>% 
  summarize(min_weeks = min(weeks), max_weeks = max(weeks))

The full-term pregnancies are all between 37 and 44 weeks, and the pregnancies of the premature births are all less than that. So a “full-term pregnancy” has been defined as 37 or more weeks; otherwise, it is a premature birth.
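If you want to check a rule like this rather than just read it off, one way (a sketch, using a made-up dataframe toy with hypothetical values; with the real data you would use births) is to apply the rule yourself and cross-classify the result with the recorded premature column:

```r
library(tidyverse)

# Hypothetical pregnancies, made up for illustration
toy <- tibble(
  weeks = c(32, 36, 37, 40, 44),
  premature = c("premie", "premie", "full term", "full term", "full term")
)

# Apply the proposed rule, then count it against the recorded label
check <- toy %>%
  mutate(rule = if_else(weeks >= 37, "full term", "premie")) %>%
  count(premature, rule)
check
```

If the rule is right, the only combinations that appear are the ones where premature and rule agree.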

Extra: if you didn’t think of that, you could try drawing a graph. This is not a numerical summary, so you won’t get full marks for it, but you will get something if you follow it through. The relevant variables are premature (categorical) and weeks (quantitative), so a boxplot:

ggplot(births, aes(x = premature, y = weeks)) + geom_boxplot()

The boxplots don’t overlap, so read the scales to see where the dividing line is. The tick mark between 35 and 40 is at 37.5, so 37 or more weeks is full term and 36 or fewer weeks is premature.

Another route to full marks is to do this for yourself first, and then figure out how what you see here translates into min and max: the minimum of the full-term births is 37 weeks, and the maximum of the premature births is 36 weeks. Hence, if you do a numerical summary using min and max for the full-term and premature babies (that is, grouped by premature) you will get the same thing as you see here. So, do that, and hand it in.

  1. (3 points) Work out the mean and SD of birth weight for all the combinations of whether or not the mother smoked, and whether or not the birth was premature. What seems to be the effect of smoking during pregnancy?

There are now two categorical variables, smoke and premature, and one quantitative one, weight. The idea with something like this is that you put all the categorical variables into group_by, and then calculate whatever you want to in the summarize. If you are not sure about this, experiment:

births %>% 
  group_by(smoke, premature) %>% 
  summarize(mean_bwt = mean(weight), sd_bwt = sd(weight))
`summarise()` has grouped output by 'smoke'. You can override using the
`.groups` argument.

To assess the effect of smoking, fix the value of premature and compare the mean birthweight between smoker and nonsmoker:

  • for full term babies, the mean birthweight for babies born to nonsmoking mothers is a little higher (7.50 pounds vs. 7.27)
  • for premature babies, the mean birthweight for babies born to nonsmoking mothers is also a little higher (5.03 pounds vs. 4.20).

Thus the effect of smoking appears to be to reduce the mean birthweight overall.
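A piece of arithmetic ties these four means back to the two overall means seen earlier: the mean for all the nonsmoking mothers is a weighted average of their full-term and premature means, weighted by how many births of each kind there were, and likewise for the smoking mothers. This sketch uses only numbers quoted in this document (the counts from the smoke-by-premature table and the means rounded to two decimals), so the agreement is only approximate; the variable names are mine.

```r
# Numbers of births and mean birthweights copied from this document
n_nonsmoker <- c(full_term = 87, premie = 13)
mean_nonsmoker <- c(full_term = 7.50, premie = 5.03)
n_smoker <- c(full_term = 42, premie = 8)
mean_smoker <- c(full_term = 7.27, premie = 4.20)

# Weighted averages of the subgroup means, weighted by subgroup size
overall_nonsmoker <- weighted.mean(mean_nonsmoker, n_nonsmoker)
overall_smoker <- weighted.mean(mean_smoker, n_smoker)

overall_nonsmoker  # about 7.18
overall_smoker     # about 6.78
```

These agree, up to rounding, with the group means of 7.1795 and 6.7790 in the t-test output.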

Extra 1: we should be cautious about cause and effect here, because neither the smoking nor the prematureness were, or could be, randomized. For example, the smoking mothers might have tended to also have other health conditions or diet differences that were really the cause of the lower birthweights. (Or it might just be chance.)

Extra 2: there is no effect of smoking on the standard deviations, but the birthweights of premature babies are noticeably more variable than those of full-term babies. This is because full-term babies have a relatively predictable birth weight (they are born when they are the right size to be born), but premature babies can be very small indeed (their weights vary from almost the same weight as a full-term baby to a lot smaller):

ggplot(births, aes(x = smoke, y = weight, fill = premature)) + geom_boxplot()

Sometimes it really takes a graph to show what is going on.

  1. (3 points) For each of the smoking and nonsmoking mothers, work out the mean age of the father and mother, without naming (or numbering) those columns explicitly. Hint: some of the fathers’ ages are not known.

Think before you code:

  • The thing about each of smoking and nonsmoking mothers is meant to suggest group_by(smoke).
  • To do something without naming columns explicitly means to figure out what those columns have in common: in this case, their names end in age, and they are the only ones that do.
  • Finally, remember the na.rm from worksheet 3 to work out the mean of something without getting tripped up by missing values.

Hence:

births %>% 
  group_by(smoke) %>% 
  summarize(across(ends_with("age"), \(x) mean(x, na.rm = TRUE)))

The way you do something with several columns not explicitly named is to use across. Inside the across, two things: (i) something that will pick out the columns you want and only those, (ii) an “anonymous function” that says what to do with each of those columns, in this case work out the mean of it. The way to read the third line in English is “for each of the columns whose name ends with age, work out the mean of it, dropping any missing values.”
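To see what the na.rm is doing for you, here is a sketch on a made-up dataframe toy (hypothetical values, with one missing father’s age), run once without na.rm and once with it:

```r
library(tidyverse)

# Hypothetical data: one father's age is missing
toy <- tibble(
  smoke = c("nonsmoker", "nonsmoker", "smoker", "smoker"),
  f_age = c(30, NA, 28, 32),
  m_age = c(28, 26, 25, 31),
  weeks = c(39, 40, 38, 37)
)

# Without na.rm, one missing value makes that group's mean missing
no_narm <- toy %>%
  group_by(smoke) %>%
  summarize(across(ends_with("age"), \(x) mean(x)))
no_narm

# With na.rm = TRUE, the mean is of the values that are known
with_narm <- toy %>%
  group_by(smoke) %>%
  summarize(across(ends_with("age"), \(x) mean(x, na.rm = TRUE)))
with_narm
```

Note that weeks does not appear in either result, because its name does not end in age.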

Possible variations:

  • another way of selecting those two columns is good if it works
  • using anything else as the input to the anonymous function is good as long as you use that same name inside mean, for example:

births %>% 
  group_by(smoke) %>% 
  summarize(across(contains("age"), \(age) mean(age, na.rm = TRUE)))

As long as you get to that table without explicitly naming the columns f_age and m_age (or using the fact that they are columns number 1 and 2 in the dataframe), I don’t much mind precisely how you do it.

Extra: smoking mothers are almost a year younger on average, but the fathers have about the same average age whether the mother smokes or not.

Countries of the world

Data were collected on 77 countries of the world in 2008, with variables as follows:

  • Country: Name of the country
  • Code: Three letter country code
  • LandArea: Size in sq. kilometers
  • Population: Population in millions
  • Energy: Energy usage (kilotons of oil)
  • Rural: Percentage of population living in rural areas
  • Military: Percentage of government expenditures directed toward the military
  • Health: Percentage of government expenditures directed towards healthcare
  • HIV: Percentage of the population with HIV
  • Internet: Percentage of the population with access to the internet
  • BirthRate: Births per 1000 people
  • ElderlyPop: Percentage of the population at least 65 years old
  • LifeExpectancy: Average life expectancy (years)
  • CO2: CO2 emissions (metric tons per capita)
  • GDP: Gross Domestic Product (per capita)
  • Cell: Cell phone subscriptions (per 100 people)
  • Electricity: Electric power consumption (kWh per capita)
  • Electric_use: Electricity use, classified as low, moderate, or high

The data are in http://ritsokiguess.site/datafiles/countries.csv. Note that most (but not all) of the variables are measured per person or as a percentage, so that these variables are not dependent on how big the country is.

In the questions below, unless stated otherwise, if you are asked to display some of the columns, your code may display all of the rows; if you are asked to display some of the rows, your code may display all of the columns. In the output you hand in, make sure that only 10 rows or as many columns as will display on the screen are actually shown. There are a lot of questions below, but each one is meant to be quick, except perhaps for the last one of them.

  1. (1 point) Read in and display (some of) the data.

As you would expect:

my_url <- "http://ritsokiguess.site/datafiles/countries.csv"
countries <- read_csv(my_url)
Rows: 77 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Country, Code, Electric_use
dbl (15): LandArea, Population, Energy, Rural, Military, Health, HIV, Intern...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
countries

There are indeed 77 countries (rows), and, if you scroll across, all the variables listed.

Extra: this is dataset AllCountries1e from the package Lock5Data, but I had to do a bit of reorganization first.

The first thing to observe is that this dataset has a lot more countries than ours, but also there are a lot of missing values:

library(Lock5Data)
data("AllCountries1e")
AllCountries1e

To see just how many missing values, you can run a summary of the entire dataframe:

AllCountries1e %>% summary()
           Country         Code        LandArea          Population       
 Afghanistan   :  1          :  3   Min.   :       2   Min.   :   0.0200  
 Albania       :  1   AFG    :  1   1st Qu.:   10830   1st Qu.:   0.7728  
 Algeria       :  1   ALB    :  1   Median :   94080   Median :   5.6135  
 American Samoa:  1   ALG    :  1   Mean   :  608120   Mean   :  31.4849  
 Andorra       :  1   AND    :  1   3rd Qu.:  446300   3rd Qu.:  20.5835  
 Angola        :  1   ANG    :  1   Max.   :16376870   Max.   :1324.6550  
 (Other)       :207   (Other):205                      NA's   :1          
     Energy            Rural          Military          Health     
 Min.   :    159   Min.   : 0.00   Min.   : 0.000   Min.   : 0.70  
 1st Qu.:   5252   1st Qu.:22.90   1st Qu.: 3.800   1st Qu.: 8.00  
 Median :  17478   Median :40.40   Median : 5.850   Median :11.30  
 Mean   :  86312   Mean   :42.13   Mean   : 8.277   Mean   :11.22  
 3rd Qu.:  52486   3rd Qu.:63.20   3rd Qu.:12.175   3rd Qu.:14.45  
 Max.   :2283722   Max.   :89.60   Max.   :29.300   Max.   :26.10  
 NA's   :77                        NA's   :115      NA's   :26     
      HIV            Internet       Developed       BirthRate    
 Min.   : 0.100   Min.   : 0.20   Min.   :1.000   Min.   : 8.20  
 1st Qu.: 0.100   1st Qu.: 5.65   1st Qu.:1.000   1st Qu.:12.10  
 Median : 0.400   Median :22.80   Median :1.000   Median :19.40  
 Mean   : 1.977   Mean   :28.96   Mean   :1.763   Mean   :22.02  
 3rd Qu.: 1.300   3rd Qu.:48.15   3rd Qu.:3.000   3rd Qu.:28.90  
 Max.   :25.900   Max.   :90.50   Max.   :3.000   Max.   :53.50  
 NA's   :68       NA's   :14      NA's   :78      NA's   :16     
   ElderlyPop     LifeExpectancy       CO2                GDP          
 Min.   : 1.000   Min.   :43.90   Min.   : 0.02262   Min.   :   192.1  
 1st Qu.: 3.400   1st Qu.:62.80   1st Qu.: 0.61765   1st Qu.:  1252.7  
 Median : 5.400   Median :71.90   Median : 2.73694   Median :  4408.8  
 Mean   : 7.473   Mean   :68.94   Mean   : 5.08557   Mean   : 11298.4  
 3rd Qu.:11.600   3rd Qu.:76.03   3rd Qu.: 7.01656   3rd Qu.: 12431.0  
 Max.   :21.400   Max.   :82.80   Max.   :49.05058   Max.   :105437.7  
 NA's   :22       NA's   :17      NA's   :15         NA's   :40        
      Cell          Electricity      
 Min.   :  1.238   Min.   :   35.68  
 1st Qu.: 59.206   1st Qu.:  800.32  
 Median : 93.696   Median : 2237.51  
 Mean   : 91.093   Mean   : 4109.13  
 3rd Qu.:121.160   3rd Qu.: 5824.24  
 Max.   :206.429   Max.   :51259.19  
 NA's   :12        NA's   :78        

Some of the variables have a lot of missing values. There are sophisticated methods for estimating the values of variables that are missing (these fall under the umbrella of “imputation”), for example running a multiple regression to predict the values of variables that were missing from values of variables that were observed. Some of these variables you would expect to be correlated: for example, an industrialized country would be expected to have high energy use generally and electricity use in particular, along with high CO2 emissions and maybe a large amount of cellphone use. Having said all of that, we are going to be a lot less sophisticated: we are just going to throw away the data for any country that has any missing values anywhere, which is what drop_na does:

AllCountries1e %>% drop_na() -> countries0
countries0

I am using a “disposable” name countries0 here, so that we don’t get confused with the dataframe read in from the file.

These are the 77 countries in our dataset, for which you can check there are no missing values remaining:

countries0 %>% summary()
       Country        Code       LandArea          Population      
 Algeria   : 1   ALG    : 1   Min.   :     320   Min.   :   0.317  
 Armenia   : 1   ARM    : 1   1st Qu.:   62670   1st Qu.:   5.494  
 Austria   : 1   AUT    : 1   Median :  155360   Median :  10.708  
 Azerbaijan: 1   AZE    : 1   Mean   :  815544   Mean   :  48.821  
 Bangladesh: 1   BAN    : 1   3rd Qu.:  499110   3rd Qu.:  45.012  
 Belarus   : 1   BEL    : 1   Max.   :16376870   Max.   :1139.965  
 (Other)   :71   (Other):71                                        
     Energy            Rural          Military          Health     
 Min.   :    819   Min.   : 0.00   Min.   : 0.000   Min.   : 2.50  
 1st Qu.:   7735   1st Qu.:26.50   1st Qu.: 4.100   1st Qu.: 8.20  
 Median :  22009   Median :36.10   Median : 5.800   Median :11.90  
 Mean   :  95623   Mean   :37.26   Mean   : 8.166   Mean   :11.55  
 3rd Qu.:  72748   3rd Qu.:48.10   3rd Qu.:10.800   3rd Qu.:15.20  
 Max.   :2283722   Max.   :84.90   Max.   :29.300   Max.   :19.90  
                                                                   
      HIV             Internet       Developed       BirthRate    
 Min.   : 0.1000   Min.   : 0.30   Min.   :1.000   Min.   : 8.30  
 1st Qu.: 0.1000   1st Qu.:11.10   1st Qu.:1.000   1st Qu.:11.00  
 Median : 0.2000   Median :32.60   Median :2.000   Median :14.90  
 Mean   : 0.8519   Mean   :38.43   Mean   :1.857   Mean   :17.64  
 3rd Qu.: 0.6000   3rd Qu.:62.30   3rd Qu.:3.000   3rd Qu.:22.00  
 Max.   :17.9000   Max.   :90.50   Max.   :3.000   Max.   :39.80  
                                                                  
   ElderlyPop    LifeExpectancy       CO2               GDP         
 Min.   : 2.70   Min.   :47.90   Min.   : 0.2457   Min.   :  523.1  
 1st Qu.: 5.00   1st Qu.:70.20   1st Qu.: 1.3373   1st Qu.: 2795.5  
 Median :10.40   Median :73.00   Median : 4.5414   Median : 7537.7  
 Mean   :10.35   Mean   :72.43   Mean   : 5.2060   Mean   :15984.9  
 3rd Qu.:15.90   3rd Qu.:78.80   3rd Qu.: 8.1236   3rd Qu.:22850.7  
 Max.   :20.10   Max.   :82.00   Max.   :17.9417   Max.   :84538.2  
                                                                    
      Cell         Electricity      
 Min.   : 40.69   Min.   :   91.26  
 1st Qu.: 88.85   1st Qu.:  970.98  
 Median :108.60   Median : 3200.47  
 Mean   :104.52   Mean   : 4494.89  
 3rd Qu.:124.34   3rd Qu.: 6006.35  
 Max.   :167.68   Max.   :51259.19  
                                    

I did one more thing: the variable that’s called Developed here, though a numeric 1, 2, or 3, is really a categorical “low”, “moderate”, “high”, so I decided to turn it into that. It is actually related to electricity use, so I want it to be called Electric_use. There are several ways you might do this. One is lvls_revalue from the forcats package (loaded with the tidyverse; this is where fct_inorder comes from), but I decided to use an idea like the Canadian Tire nails from lecture and make a little lookup table:

conversion <- tribble(
  ~Developed, ~Electric_use,
  1, "low",
  2, "moderate",
  3, "high"
)
conversion

and now we can left-join this onto our countries0:

countries0 %>% 
  left_join(conversion, join_by(Developed)) -> countries0
countries0

You can check that Developed and the new column Electric_use (on the right-hand end) do actually correspond as they should. If you want to be sophisticated about that:

countries0 %>% count(Developed, Electric_use)

and you see that the only combinations of these two variables are the ones that match, so we have not introduced any errors.
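For comparison, the lvls_revalue route that I mentioned but did not take might look something like this. This is a sketch on a short made-up vector developed (hypothetical values), not what I actually ran:

```r
library(tidyverse)

# Hypothetical values of a Developed-like variable
developed <- c(1, 3, 2, 1)

# Make a factor with levels in the order 1, 2, 3, then rename the levels in that order
electric_use <- developed %>%
  factor(levels = c(1, 2, 3)) %>%
  lvls_revalue(c("low", "moderate", "high"))
electric_use
```

The result here is a factor rather than text, with its levels in the order low, moderate, high, which can be handy if you later want graphs or tables to come out in that order.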

The final step is to get rid of the no-longer-needed Developed column, and then I saved that result for you.
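That last step is not shown, but it would be something like the following sketch. It assumes a tiny made-up dataframe toy0 standing in for countries0 (hypothetical values), and it writes to a temporary file rather than to wherever the real file was saved:

```r
library(tidyverse)

# Hypothetical stand-in for countries0 after the left-join
toy0 <- tibble(
  Country = c("A", "B", "C"),
  Developed = c(1, 3, 2),
  Electric_use = c("low", "high", "moderate")
)

# Drop the column that is no longer needed
toy_final <- toy0 %>% select(-Developed)

# Save as a .csv (here to a temporary file), and read it back in as a check
out_file <- tempfile(fileext = ".csv")
write_csv(toy_final, out_file)
read_back <- read_csv(out_file, show_col_types = FALSE)
read_back
```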

  1. (2 points) Display only the country names and the percentage of population living in rural areas.

This is choosing columns (variables), so is a select, with the names of the columns you want:

countries %>% select(Country, Rural)

  1. (3 points) Display all the columns whose names have E as their first letter, uppercase or lowercase.

A select-helper, namely starts_with. The E itself can be uppercase or lowercase, since the select-helpers ignore case by default and so will select both:

countries %>% select(starts_with("E"))

or

countries %>% select(starts_with("e"))

Either is good.

  1. (3 points) Display all the columns whose names have a lowercase o in them somewhere.

“In them somewhere” translates to contains. To make sure that columns with an uppercase O don’t get chosen (that is, to pay attention to case), you need the double negative “don’t ignore case”, that is:

countries %>% select(contains("o", ignore.case = FALSE))

Make sure you put the ignore.case inside the contains, not inside the select (or else you will get an error message that doesn’t give you much of a clue about what has gone wrong).

If you omit the ignore.case, you’ll get too many columns:

countries %>% select(contains("o"))

because the CO2 column’s name has an uppercase O in it.
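You can see the difference on a one-row made-up dataframe toy (hypothetical values) that has a few of the same column names:

```r
library(tidyverse)

# Hypothetical one-row dataframe with some of the same column names
toy <- tibble(Country = "X", CO2 = 1.2, Population = 3.4, GDP = 5.6)

# Paying attention to case: only names with a lowercase o
case_sensitive <- toy %>%
  select(contains("o", ignore.case = FALSE)) %>%
  names()
case_sensitive

# Default behaviour: uppercase O counts as well, so CO2 comes along
case_insensitive <- toy %>%
  select(contains("o")) %>%
  names()
case_insensitive
```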

  1. (3 points) Display only the columns that are text.

This uses where. Inside where goes something that will be TRUE for the columns that you want. is.character is the thing:

countries %>% select(where(is.character))

If you eyeball your dataframe, you’ll see that all the columns are either text or numbers, so an alternative way to do this is to note that is.numeric will be FALSE for the columns you want, so “is not numeric” will also get them, but you have to be careful:

countries %>% select(where( \(x) !is.numeric(x)))

Inside where, you need the name of a function (like is.character), or an anonymous function such as you would use inside across.
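Here are those two ways side by side on a small made-up dataframe toy (hypothetical values), along with a third that I did not show: putting the “not” outside the where. What does not work is where(!is.numeric), because is.numeric there is a function, not something that is true or false.

```r
library(tidyverse)

# Hypothetical dataframe with two text columns and one numeric one
toy <- tibble(
  Country = c("X", "Y"),
  Code = c("XXX", "YYY"),
  Rural = c(40.1, 20.5)
)

# The name of a function that is TRUE for the columns you want
by_character <- toy %>% select(where(is.character)) %>% names()

# An anonymous function that is TRUE for the columns you want
by_not_numeric <- toy %>% select(where(\(x) !is.numeric(x))) %>% names()

# Negating the whole selection instead
by_negation <- toy %>% select(!where(is.numeric)) %>% names()

by_character
by_not_numeric
by_negation
```

All three pick out the same columns here, because every column is either text or a number.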

  1. (2 points) Display the countries that have high electricity use.

This is displaying rows (only the observations that satisfy a condition), so filter. Don’t forget the double equals sign for testing whether something is true:

countries %>% filter(Electric_use == "high")

  1. (2 points) Display the five countries with the largest populations.

You have a choice here: the easier is to use slice_max:

countries %>% slice_max(Population, n = 5)

or, if you don’t think of that, sort all the countries by population (in descending order) and then grab just the top five:

countries %>% arrange(desc(Population)) %>% 
  slice(1:5)
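These two approaches give the same rows when there are no ties at the cut-off. If there is a tie, they can differ, because slice_max keeps ties by default. A sketch with a made-up dataframe toy (hypothetical values) in which two countries tie for the largest population:

```r
library(tidyverse)

# Hypothetical data: B and C are tied for the largest population
toy <- tibble(
  Country = c("A", "B", "C", "D"),
  Population = c(10, 50, 50, 20)
)

# slice_max keeps both tied rows, so "the top one" comes back as two rows
top_slice_max <- toy %>% slice_max(Population, n = 1)
top_slice_max

# arrange and slice returns exactly one row
top_arrange <- toy %>% arrange(desc(Population)) %>% slice(1)
top_arrange

# Asking slice_max not to keep ties also gives exactly one row
top_no_ties <- toy %>% slice_max(Population, n = 1, with_ties = FALSE)
top_no_ties
```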

Extra: You are probably wondering where China went:

countries0 %>% filter(Country == "China")

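China is not in countries0, which is what was left after we removed the countries with missing values, so it is not in our dataset either. The largest population in the summary of AllCountries1e above (about 1325 million) is bigger than the largest one in countries0 (about 1140 million), which suggests that China was in the original dataset and was dropped because it had a missing value somewhere.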

  1. (3 points) Are there any countries with population less than 5 million that have more than 60% of their population in rural areas? How do you know?

Try to find all of them. The countries you want are ones that satisfy both conditions, so a logical “and”:

countries %>%  filter(Population < 5, Rural > 60)

or, equivalently, two filters one after the other, in either order:

countries %>%  filter(Population < 5) %>%  
  filter(Rural > 60)

There are no rows in the answer, so there are no countries that satisfy both conditions: that is, there are no countries with a population less than 5 million that have a rural population greater than 60%. (The answer to the question is “no”, but make sure you actually do answer it somewhere.)

  1. (3 points) Find the median of all the variables that are quantitative.

Doing something with multiple columns that you are not naming one by one is across. Inside the across goes something that will pick out the columns you want (where(is.numeric)), and an anonymous function that will work out whatever you want to work out for each of those columns. The input to the anonymous function can be called anything, as long as you use the same “anything” inside the function:

countries %>% summarize(across(where(is.numeric), \(x) median(x)))

Another approach you can try is to select the quantitative columns first, then work out the median of each of them. This perhaps does not actually make things much easier, because you still have to remember how to do something for all the columns. It goes like this:

countries %>% 
  select(where(is.numeric)) %>% 
  summarize(across(everything(), \(x) median(x)))

The key select-helper is everything(). Maybe this appeals to you if you like to break things down into small parts: “grab the quantitative columns, and then for each of the columns I have left, work out the median of it”.
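When the function needs no extra inputs, as here, you can also give across the bare name of the function instead of an anonymous function. A sketch on a made-up dataframe toy (hypothetical values):

```r
library(tidyverse)

# Hypothetical data with one text column and two numeric ones
toy <- tibble(
  Country = c("A", "B", "C"),
  Rural = c(10, 30, 80),
  Internet = c(90, 40, 5)
)

# Anonymous function, as in the solution
with_lambda <- toy %>% summarize(across(where(is.numeric), \(x) median(x)))
with_lambda

# Bare function name: the same thing, with less typing
with_name <- toy %>% summarize(across(where(is.numeric), median))
with_name
```

The anonymous-function version is the one to reach for if there are missing values, because it gives you somewhere to put na.rm = TRUE.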

  1. (2 points) Do countries that have an above-average rural population also tend to have an above-average percentage of the population with access to the Internet? Explain briefly.

Find a way to answer this (there is probably not one best way). I think the easiest way is to make a graph: these two are quantitative variables, so a scatterplot is called for:

ggplot(countries, aes(x = Rural, y = Internet)) + geom_point()

This is a downward trend, so countries with above-average rural populations are in fact below average in terms of Internet access.

Another way is to count the number of countries that are above or below average on these two variables combined. You just worked out the medians, so you can use those as averages. You may or may not realize that this does in fact work:

countries %>% count(Rural > 36.1, Internet > 32.6)

In the likely event that you didn’t realize you could do that, create new columns that are TRUE or FALSE according to those, and then count the new columns:

countries %>% 
  mutate(rural_above = (Rural > 36.1),
         internet_above = (Internet > 32.6)) %>% 
  count(rural_above, internet_above)

This tells the same story: most of the countries that are above-average rural are below-average on Internet access.
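If you would rather not type the medians in by hand, you can have them worked out inside the mutate. This is a sketch on a made-up dataframe toy (hypothetical values); with the real data you would use countries:

```r
library(tidyverse)

# Hypothetical data with a downward trend between Rural and Internet
toy <- tibble(
  Country = c("A", "B", "C", "D", "E"),
  Rural = c(10, 20, 30, 60, 80),
  Internet = c(90, 70, 40, 20, 5)
)

# Compare each value with the median of its own column, then count
above_tab <- toy %>%
  mutate(rural_above = Rural > median(Rural),
         internet_above = Internet > median(Internet)) %>%
  count(rural_above, internet_above)
above_tab
```

In this made-up example, no country is above the median on both variables, which is the kind of pattern a downward trend produces.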

Whichever way you do it, the answer to the original question is “no” plus this kind of sentence of explanation.

You don’t have to do it either of these ways, but you need to come to a conclusion somehow, via a graph or numerical summary, that highly rural countries are likely to have lower internet access.

Extra: One of the reasons for this is that urbanization is a sign of development (or industrialization if you prefer), so countries that have less of their population living in rural areas are more likely to show signs of being developed, and internet access is (or was, in 2008) one of those signs.