STAC32 Assignment 2
Packages
library(tidyverse)
You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
Childbirth and smoking
Data were obtained on a random sample of 150 births in North Carolina, in https://ritsokiguess.site/datafiles/births_smoking.csv. Interest was in whether the mother being a smoker had any impact on the baby. Variables of interest to us are:
- f_age, m_age: age of the father and mother of the baby (years)
- weeks: the length of the pregnancy in weeks
- premature: whether or not the baby was born prematurely (premie) or whether the pregnancy was the usual length (full term)
- weight of the baby at birth, in decimal pounds
- sex_baby: whether the baby was male or female
- smoke: whether the mother was a smoker or a nonsmoker.
- (2 points) Read in and display (some of) the data. Confirm that the data you read in is as I described above.
You’ll get very accustomed to this procedure: put the data file URL in a variable, note that it is a .csv file, use read_csv to read it in directly, save it under a suitable name, and then put its name on a line by itself to display it:
my_url <- "https://ritsokiguess.site/datafiles/births_smoking.csv"
births <- read_csv(my_url)
Rows: 150 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): premature, sex_baby, smoke
dbl (6): f_age, m_age, weeks, visits, gained, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
births
My description above said there were 150 births, so there should be 150 rows (and are). I also gave you the names of some columns; check that those columns actually are in the dataframe (they are; you can assert this once you have convinced yourself that it is true). There are some other columns that we do not use (the clue is in “variables of interest to us”, which implies that there might be some others), but it is enough to say that the columns we want are actually in there.
Extra: it is good practice to check any data you read in for sanity. If you know how many rows there should be, check that. Check that you have the variables of interest (in columns), and check that they have sensible values (eg. by looking at the first few rows). In this case, some of the variables are categorical and you know what levels they have (three of them, and the values you see in the data match what they were supposed to be). The other two variables are quantitative, and you can check that a pregnancy is about 9 months long (not quite 40 weeks), and that the baby weights are somewhere around 7 or 8 pounds, which is typical (you’ll have to do a quick conversion if you are accustomed to baby weights in kilograms).
This is the births dataset from the openintro package. There are some other similar datasets around; I use one in PASIAS which has some extra complications with variable names. This one is nice and well-behaved.
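The sanity checks described in the Extra above can be sketched in code. This is only an illustration: births_toy is a hypothetical, made-up miniature dataframe (not the real data), used so that the code runs without downloading anything. With the real data you would use births in its place.

```r
library(tidyverse)

# Hypothetical miniature stand-in for the births data (made-up values)
births_toy <- tribble(
  ~weeks, ~premature,  ~weight, ~smoke,
  39,     "full term", 7.4,     "nonsmoker",
  35,     "premie",    5.1,     "smoker",
  40,     "full term", 8.0,     "smoker",
  38,     "full term", 6.9,     "nonsmoker"
)

# How many rows? Compare with the number of births you were told to expect.
n_rows <- nrow(births_toy)

# Which levels does a categorical variable have, and how often does each occur?
smoke_levels <- births_toy %>% count(smoke)

# Are the quantitative variables in a sensible range?
ranges <- births_toy %>%
  summarize(min_weeks = min(weeks), max_weeks = max(weeks),
            min_weight = min(weight), max_weight = max(weight))

n_rows
smoke_levels
ranges
```

If a level is misspelled, or a weight is wildly out of range, this kind of summary tends to show it straight away.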
- (2 points) Work out the number of observations, mean birth weight, and SD of birth weights, for all the babies taken together.
This one is just a summarize (no group-by; that’s coming up):
births %>%
  summarize(n = n(), mean_bwt = mean(weight), sd_bwt = sd(weight))
There are 150 observations (as we already know). The mean birthweight is 7.0 pounds and the standard deviation of birthweights is 1.5 pounds.
- (3 points) How do the mean birthweights compare between smoking and non-smoking mothers?
First, work out the mean birthweights for each group of mothers. The summarize part of the code is like what you just did, but before that put a group_by to get results for each group of mothers separately:
births %>%
  group_by(smoke) %>%
  summarize(mean_bwt = mean(weight))
The mean birth weight of babies born to smoking mothers is less than that of babies born to non-smoking mothers, by about 0.4 pounds on average.
Alternatively, you can leave either or both of the sample size and the SD in, which gives you this:
births %>%
  group_by(smoke) %>%
  summarize(n = n(), mean_bwt = mean(weight), sd_bwt = sd(weight))
and the conclusion about means is the same.
Extra: you might be curious about whether this difference is chance or indicative of something real. Under the assumption that these births are a random sample of “all possible” births, this is a two-sample \(t\)-test (that we will see in the course shortly):
t.test(weight ~ smoke, data = births)
Welch Two Sample t-test
data: weight by smoke
t = 1.4967, df = 89.277, p-value = 0.138
alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
95 percent confidence interval:
-0.1311663 0.9321663
sample estimates:
mean in group nonsmoker mean in group smoker
7.1795 6.7790
This is not significant (P-value 0.138), so, on this view, the difference is just chance. There is, however, another issue at play, which is coming up.
- (2 points) Classify the births by whether or not the mother was a smoker, and by whether or not the baby was born prematurely. How many births fall into each combination of categories?
This one is most easily done as a count:
births %>% count(smoke, premature)
I think it’s better to put smoke first, because we want to see whether this has any impact on prematureness. Alternatively, you can do it like this:
births %>% group_by(smoke, premature) %>%
  summarize(n = n())
`summarise()` has grouped output by 'smoke'. You can override using the
`.groups` argument.
Extra: There were 100 nonsmoking mothers and only 50 smoking ones, so it’s a bit difficult to compare these numbers. There are a couple of ways around this. One is to turn the counts into percentages, which goes like this. This requires the n() approach, and there is some extra subtlety that I will explain:
births %>% group_by(smoke, premature) %>%
  summarize(n = n()) %>%
  mutate(pct = n / sum(n) * 100)
`summarise()` has grouped output by 'smoke'. You can override using the
`.groups` argument.
The percentage of premature births is almost the same between the smoking and non-smoking mothers.
The group-by and summarize produces the column n, and the new column pct is calculated from n. But what does sum(n) mean? Specifically, what is it adding up? The answer is that it sums over the last thing in the group_by. This may seem like an odd way for it to work, but it was actually designed this way. So the 84 was calculated as \(42 / (42 + 8) \times 100\). This is the right way around to do the percentages, because the questions you are asking are “out of the smokers, what percentage of births were premature?” and then “out of the non-smokers, what percentage of births were premature?”.
Compare that with this, where I switched the order of the group-by around:
births %>% group_by(premature, smoke) %>%
  summarize(n = n()) %>%
  mutate(pct = n / sum(n) * 100)
`summarise()` has grouped output by 'premature'. You can override using the
`.groups` argument.
The counts are the same, but the percentages are different. That is because, now, we are asking “out of the full-term babies, what percentage had a mother who smoked?” and getting an answer “32.6%”. This, though, is logically backwards, because smoking might affect prematureness, not the other way around.
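The two orderings can be compared side by side in a self-contained sketch. The cell counts are the ones from the two-way table of smoke by premature for these data; births_toy and cell_counts are hypothetical names introduced here, and uncount (from tidyr) is used only to rebuild one row per birth. Writing .groups = "drop_last" spells out what summarize does by default, which is why sum(n) adds up within the first grouping variable.

```r
library(tidyverse)

# Cell counts from the two-way table of smoke by premature
cell_counts <- tribble(
  ~smoke,      ~premature,  ~n,
  "nonsmoker", "full term", 87L,
  "nonsmoker", "premie",    13L,
  "smoker",    "full term", 42L,
  "smoker",    "premie",     8L
)

# Expand back to one row per birth, so that group_by + summarize applies
births_toy <- cell_counts %>% uncount(n)

# After summarize, the result is still grouped by smoke only,
# so sum(n) is the total within each smoking group
pct_within_smoke <- births_toy %>%
  group_by(smoke, premature) %>%
  summarize(n = n(), .groups = "drop_last") %>%
  mutate(pct = n / sum(n) * 100)

# With the order switched, the result is grouped by premature instead,
# so sum(n) is the total within each prematureness group
pct_within_premature <- births_toy %>%
  group_by(premature, smoke) %>%
  summarize(n = n(), .groups = "drop_last") %>%
  mutate(pct = n / sum(n) * 100)

pct_within_smoke
pct_within_premature
```

Specifying .groups explicitly also makes the "has grouped output by" message go away.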
You might be getting some vague echoes of row and column percentages in contingency tables (from, probably, your first course). The base R table makes those:
with(births, table(smoke, premature))
           premature
smoke       full term premie
  nonsmoker        87     13
  smoker           42      8
Work out row and column percentages from that, and see how it compares to what I did above.
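One possible way to get the row and column percentages is base R's prop.table, sketched here on a matrix built by hand from the counts in the table above (the object names tab, row_pct and col_pct are mine). With the real data, the result of table(smoke, premature) could be passed to prop.table in the same way.

```r
# Two-way table of counts, entered by hand from the output above
tab <- matrix(
  c(87, 13,
    42,  8),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    smoke = c("nonsmoker", "smoker"),
    premature = c("full term", "premie")
  )
)

# Row percentages: each row (smoking group) adds up to 100
row_pct <- prop.table(tab, margin = 1) * 100

# Column percentages: each column (prematureness group) adds up to 100
col_pct <- prop.table(tab, margin = 2) * 100

round(row_pct, 1)
round(col_pct, 1)
```

The row percentages correspond to grouping by smoke first, and the column percentages to grouping by premature first.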
I said there was a second way around the issue of comparing prematureness rates. That is to make a graph. The starting point for this, with two categorical variables, is a grouped bar chart. We said that prematureness was the outcome (or that smoking was explanatory), so we’ll use premature as fill:
ggplot(births, aes(x = smoke, fill = premature)) + geom_bar(position = "dodge")
Most of the births are full term, whether the mother smoked or not, but there are fewer smoking mothers than non-smoking ones, so it is difficult to compare. I said in lecture that I was not a fan of stacking the bars. This is what happens if you do:
ggplot(births, aes(x = smoke, fill = premature)) + geom_bar(position = "stack")
The blue piece on the right is smaller than the blue piece on the left, but so is the whole bar. What we care about is what fraction of the whole bar is blue in each case, and a variation on stacking is useful here:
ggplot(births, aes(x = smoke, fill = premature)) + geom_bar(position = "fill")
What this does is to scale the two bars to have the same height, so now you can compare how much of each bar is red and how much is blue. This shows that the fraction of premature births among mothers who smoke is very slightly larger than among non-smoking mothers. That said, we can now also say that there is very little difference.
- (3 points) How is a premature birth defined in terms of the number of weeks that a pregnancy lasted? Calculate one or more numerical summaries that will enable you to figure this out, and describe what you find. (Hint: min and max do what you would expect.)
Taking the hint, let’s work out the largest and smallest number of weeks that go with premature and full term babies:
births %>%
  group_by(premature) %>%
  summarize(min_weeks = min(weeks), max_weeks = max(weeks))
The full-term pregnancies are all between 37 and 44 weeks, and the pregnancies of the premature births are all less than that. So a “full-term pregnancy” has been defined as 37 or more weeks; otherwise, it is a premature birth.
Extra: if you didn’t think of that, you could try drawing a graph. This is not a numerical summary, so you won’t get full marks for it, but you will get something if you follow it through. The relevant variables are premature (categorical) and weeks (quantitative), so a boxplot:
ggplot(births, aes(x = premature, y = weeks)) + geom_boxplot()
The boxplots don’t overlap, so read the scales to see where the dividing line is. The tick mark between 35 and 40 is at 37.5, so 37 or more weeks is full term and 36 or fewer weeks is premature.
Another route to full marks is to do this for yourself first, and then figure out how what you see here translates into min and max: the minimum of the full-term births is 37 weeks, and the maximum of the premature births is 36 weeks. Hence, if you do a numerical summary using min and max for the full-term and premature babies (that is, grouped by premature) you will get the same thing as you see here. So, do that, and hand it in.
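Once you have a candidate rule (37 or more weeks is full term), one way to confirm it is to apply the rule yourself and compare with the premature column. The sketch below does this on a hypothetical made-up dataframe births_toy; the column name derived is also my invention. With the real data, replace births_toy by births.

```r
library(tidyverse)

# Hypothetical made-up data with the same two columns as the real births data
births_toy <- tribble(
  ~weeks, ~premature,
  36,     "premie",
  37,     "full term",
  44,     "full term",
  32,     "premie",
  40,     "full term"
)

# Apply the candidate rule, then cross-classify it against the recorded value
check <- births_toy %>%
  mutate(derived = if_else(weeks >= 37, "full term", "premie")) %>%
  count(premature, derived)

check
```

If the rule is right, the only combinations that appear are the ones where premature and derived agree.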
- (3 points) Work out the mean and SD of birth weight for all the combinations of whether or not the mother smoked, and whether or not the birth was premature. What seems to be the effect of smoking during pregnancy?
There are now two categorical variables, smoke and premature, and one quantitative one, weight. The idea with something like this is that you put all the categorical variables into group_by, and then calculate whatever you want to in the summarize. If you are not sure about this, experiment:
births %>%
  group_by(smoke, premature) %>%
  summarize(mean_bwt = mean(weight), sd_bwt = sd(weight))
`summarise()` has grouped output by 'smoke'. You can override using the
`.groups` argument.
To assess the effect of smoking, fix the value of premature and compare the mean birthweight between smoker and nonsmoker:
- for full term babies, the mean birthweight for babies born to nonsmoking mothers is a little higher (7.50 pounds vs. 7.27)
- for premature babies, the mean birthweight for babies born to nonsmoking mothers is also a little higher (5.03 pounds vs. 4.20).
Thus the effect of smoking appears to be to reduce the mean birthweight overall.
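If you would rather have R do the comparison, one option (a sketch, not something you were asked for) is to put the two smoking groups side by side with pivot_wider and subtract. The four means below are typed in from the values quoted above, and group_means and smoking_effect are hypothetical names.

```r
library(tidyverse)

# Group means as quoted in the text (pounds)
group_means <- tribble(
  ~smoke,      ~premature,  ~mean_bwt,
  "nonsmoker", "full term", 7.50,
  "smoker",    "full term", 7.27,
  "nonsmoker", "premie",    5.03,
  "smoker",    "premie",    4.20
)

# One row per value of premature, one column per smoking group,
# then the difference in means within each row
smoking_effect <- group_means %>%
  pivot_wider(names_from = smoke, values_from = mean_bwt) %>%
  mutate(difference = nonsmoker - smoker)

smoking_effect
```

A positive difference in both rows says that, at each level of premature, babies of nonsmoking mothers were heavier on average.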
Extra 1: we should be cautious about cause and effect here, because neither the smoking nor the prematureness were, or could be, randomized. For example, the smoking mothers might have tended to also have other health conditions or diet differences that were really the cause of the lower birthweights. (Or it might just be chance.)
Extra 2: there is no effect of smoking on the standard deviations, but the birthweights of premature babies are noticeably more variable than those of full-term babies. This is because full-term babies have a relatively predictable birth weight (they are born when they are the right size to be born), but premature babies can be very small indeed (their weights vary from almost the same weight as a full-term baby to a lot smaller):
ggplot(births, aes(x = smoke, y = weight, fill = premature)) + geom_boxplot()
Sometimes it really takes a graph to show what is going on.
- (3 points) For each of the smoking and nonsmoking mothers, work out the mean age of the father and mother, without naming (or numbering) those columns explicitly. Hint: some of the fathers’ ages are not known.
Think before you code:
- The thing about each of smoking and nonsmoking mothers is meant to suggest group_by(smoke).
- To do something without naming columns explicitly means to figure out what those columns have in common: in this case, their names end in age, and they are the only ones that do.
- Finally, remember the na.rm from worksheet 3 to work out the mean of something without getting tripped up by missing values.
Hence:
births %>%
  group_by(smoke) %>%
  summarize(across(ends_with("age"), \(x) mean(x, na.rm = TRUE)))
The way you do something with several columns not explicitly named is to use across. Inside the across, two things: (i) something that will pick out the columns you want and only those, (ii) an “anonymous function” that says what to do with each of those columns, in this case work out the mean of it. The way to read the third line in English is “for each of the columns whose name ends with age, work out the mean of it, dropping any missing values.”
Possible variations:
- another way of selecting those two columns is good if it works
- using anything else as the input to the anonymous function is good as long as you use that same name inside mean, for example
births %>%
  group_by(smoke) %>%
  summarize(across(contains("age"), \(age) mean(age, na.rm = TRUE)))
As long as you get to that table without explicitly naming the columns f_age and m_age (or using the fact that they are columns number 1 and 2 in the dataframe), I don’t much mind precisely how you do it.
Extra: smoking mothers are almost a year younger on average, but the fathers have about the same average age whether the mother smokes or not.
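To see why the na.rm matters in the across above, here is a small sketch on a hypothetical made-up dataframe ages_toy with one missing father's age (the values are invented purely for illustration):

```r
library(tidyverse)

# Hypothetical made-up ages; one father's age is missing
ages_toy <- tribble(
  ~f_age, ~m_age, ~smoke,
  30,     28,     "nonsmoker",
  NA,     22,     "nonsmoker",
  35,     31,     "smoker",
  27,     25,     "smoker"
)

# Without na.rm, any group containing a missing value gets a missing mean
means_with_na <- ages_toy %>%
  group_by(smoke) %>%
  summarize(across(ends_with("age"), \(x) mean(x)))

# With na.rm = TRUE, the missing value is dropped before averaging
means_no_na <- ages_toy %>%
  group_by(smoke) %>%
  summarize(across(ends_with("age"), \(x) mean(x, na.rm = TRUE)))

means_with_na
means_no_na
```

Only the group and column with the missing value are affected; the other means are the same either way.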
Countries of the world
Data were collected on 77 countries of the world in 2008, with variables as follows:
- Country: Name of the country
- Code: Three letter country code
- LandArea: Size in sq. kilometers
- Population: Population in millions
- Energy: Energy usage (kilotons of oil)
- Rural: Percentage of population living in rural areas
- Military: Percentage of government expenditures directed toward the military
- Health: Percentage of government expenditures directed towards healthcare
- HIV: Percentage of the population with HIV
- Internet: Percentage of the population with access to the internet
- BirthRate: Births per 1000 people
- ElderlyPop: Percentage of the population at least 65 years old
- LifeExpectancy: Average life expectancy (years)
- CO2: CO2 emissions (metric tons per capita)
- GDP: Gross Domestic Product (per capita)
- Cell: Cell phone subscriptions (per 100 people)
- Electricity: Electric power consumption (kWh per capita)
- Electric_use: Electricity use, classified as Low, Medium, or High
The data are in http://ritsokiguess.site/datafiles/countries.csv. Note that most (but not all) of the variables are measured per person or as a percentage, so that these variables are not dependent on how big the country is.
In the questions below, unless stated otherwise, if you are asked to display some of the columns, your code may display all of the rows; if you are asked to display some of the rows, your code may display all of the columns. In the output you hand in, make sure that only 10 rows or as many columns as will display on the screen are actually shown. There are a lot of questions below, but each one is meant to be quick, except perhaps for the last one of them.
- (1 point) Read in and display (some of) the data.
As you would expect:
my_url <- "http://ritsokiguess.site/datafiles/countries.csv"
countries <- read_csv(my_url)
Rows: 77 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, Code, Electric_use
dbl (15): LandArea, Population, Energy, Rural, Military, Health, HIV, Intern...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
countries
There are indeed 77 countries (rows), and, if you scroll across, all the variables listed.
Extra: this is dataset AllCountries1e from the package Lock5Data, but I had to do a bit of reorganization first.
The first thing to observe is that this dataset has a lot more countries than ours, but also there are a lot of missing values:
library(Lock5Data)
data("AllCountries1e")
AllCountries1e
To see just how many missing values, you can run a summary of the entire dataframe:
AllCountries1e %>% summary()
Country Code LandArea Population
Afghanistan : 1 : 3 Min. : 2 Min. : 0.0200
Albania : 1 AFG : 1 1st Qu.: 10830 1st Qu.: 0.7728
Algeria : 1 ALB : 1 Median : 94080 Median : 5.6135
American Samoa: 1 ALG : 1 Mean : 608120 Mean : 31.4849
Andorra : 1 AND : 1 3rd Qu.: 446300 3rd Qu.: 20.5835
Angola : 1 ANG : 1 Max. :16376870 Max. :1324.6550
(Other) :207 (Other):205 NA's :1
Energy Rural Military Health
Min. : 159 Min. : 0.00 Min. : 0.000 Min. : 0.70
1st Qu.: 5252 1st Qu.:22.90 1st Qu.: 3.800 1st Qu.: 8.00
Median : 17478 Median :40.40 Median : 5.850 Median :11.30
Mean : 86312 Mean :42.13 Mean : 8.277 Mean :11.22
3rd Qu.: 52486 3rd Qu.:63.20 3rd Qu.:12.175 3rd Qu.:14.45
Max. :2283722 Max. :89.60 Max. :29.300 Max. :26.10
NA's :77 NA's :115 NA's :26
HIV Internet Developed BirthRate
Min. : 0.100 Min. : 0.20 Min. :1.000 Min. : 8.20
1st Qu.: 0.100 1st Qu.: 5.65 1st Qu.:1.000 1st Qu.:12.10
Median : 0.400 Median :22.80 Median :1.000 Median :19.40
Mean : 1.977 Mean :28.96 Mean :1.763 Mean :22.02
3rd Qu.: 1.300 3rd Qu.:48.15 3rd Qu.:3.000 3rd Qu.:28.90
Max. :25.900 Max. :90.50 Max. :3.000 Max. :53.50
NA's :68 NA's :14 NA's :78 NA's :16
ElderlyPop LifeExpectancy CO2 GDP
Min. : 1.000 Min. :43.90 Min. : 0.02262 Min. : 192.1
1st Qu.: 3.400 1st Qu.:62.80 1st Qu.: 0.61765 1st Qu.: 1252.7
Median : 5.400 Median :71.90 Median : 2.73694 Median : 4408.8
Mean : 7.473 Mean :68.94 Mean : 5.08557 Mean : 11298.4
3rd Qu.:11.600 3rd Qu.:76.03 3rd Qu.: 7.01656 3rd Qu.: 12431.0
Max. :21.400 Max. :82.80 Max. :49.05058 Max. :105437.7
NA's :22 NA's :17 NA's :15 NA's :40
Cell Electricity
Min. : 1.238 Min. : 35.68
1st Qu.: 59.206 1st Qu.: 800.32
Median : 93.696 Median : 2237.51
Mean : 91.093 Mean : 4109.13
3rd Qu.:121.160 3rd Qu.: 5824.24
Max. :206.429 Max. :51259.19
NA's :12 NA's :78
Some of the variables have a lot of missing values. There are sophisticated methods for estimating the values of variables that are missing (these fall under the umbrella of “imputation”), for example running a multiple regression to predict the values of variables that were missing from values of variables that were observed. Some of these variables you would expect to be correlated: for example, an industrialized country would be expected to have high energy use generally and electricity use in particular, along with high CO2 emissions and maybe a large amount of cellphone use. Having said all of that, we are going to be a lot less sophisticated: we are just going to throw away the data for any country that has any missing values anywhere, which is what drop_na does:
AllCountries1e %>% drop_na() -> countries0
countries0
I am using a “disposable” name countries0 here, so that we don’t get confused with the dataframe read in from the file.
These are the 77 countries in our dataset, for which you can check there are no missing values remaining:
countries0 %>% summary()
Country Code LandArea Population
Algeria : 1 ALG : 1 Min. : 320 Min. : 0.317
Armenia : 1 ARM : 1 1st Qu.: 62670 1st Qu.: 5.494
Austria : 1 AUT : 1 Median : 155360 Median : 10.708
Azerbaijan: 1 AZE : 1 Mean : 815544 Mean : 48.821
Bangladesh: 1 BAN : 1 3rd Qu.: 499110 3rd Qu.: 45.012
Belarus : 1 BEL : 1 Max. :16376870 Max. :1139.965
(Other) :71 (Other):71
Energy Rural Military Health
Min. : 819 Min. : 0.00 Min. : 0.000 Min. : 2.50
1st Qu.: 7735 1st Qu.:26.50 1st Qu.: 4.100 1st Qu.: 8.20
Median : 22009 Median :36.10 Median : 5.800 Median :11.90
Mean : 95623 Mean :37.26 Mean : 8.166 Mean :11.55
3rd Qu.: 72748 3rd Qu.:48.10 3rd Qu.:10.800 3rd Qu.:15.20
Max. :2283722 Max. :84.90 Max. :29.300 Max. :19.90
HIV Internet Developed BirthRate
Min. : 0.1000 Min. : 0.30 Min. :1.000 Min. : 8.30
1st Qu.: 0.1000 1st Qu.:11.10 1st Qu.:1.000 1st Qu.:11.00
Median : 0.2000 Median :32.60 Median :2.000 Median :14.90
Mean : 0.8519 Mean :38.43 Mean :1.857 Mean :17.64
3rd Qu.: 0.6000 3rd Qu.:62.30 3rd Qu.:3.000 3rd Qu.:22.00
Max. :17.9000 Max. :90.50 Max. :3.000 Max. :39.80
ElderlyPop LifeExpectancy CO2 GDP
Min. : 2.70 Min. :47.90 Min. : 0.2457 Min. : 523.1
1st Qu.: 5.00 1st Qu.:70.20 1st Qu.: 1.3373 1st Qu.: 2795.5
Median :10.40 Median :73.00 Median : 4.5414 Median : 7537.7
Mean :10.35 Mean :72.43 Mean : 5.2060 Mean :15984.9
3rd Qu.:15.90 3rd Qu.:78.80 3rd Qu.: 8.1236 3rd Qu.:22850.7
Max. :20.10 Max. :82.00 Max. :17.9417 Max. :84538.2
Cell Electricity
Min. : 40.69 Min. : 91.26
1st Qu.: 88.85 1st Qu.: 970.98
Median :108.60 Median : 3200.47
Mean :104.52 Mean : 4494.89
3rd Qu.:124.34 3rd Qu.: 6006.35
Max. :167.68 Max. :51259.19
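On a small scale, this is what drop_na does. The sketch below uses a hypothetical made-up dataframe toy (invented values, not from Lock5Data): any row with a missing value in any column is removed, and counting the missing values per column beforehand shows where they were.

```r
library(tidyverse)

# Hypothetical made-up data with two missing values
toy <- tribble(
  ~Country, ~Energy, ~GDP,
  "A",      100,     5000,
  "B",      NA,      7000,
  "C",      300,     NA,
  "D",      400,     9000
)

# Number of missing values in each column
n_missing <- toy %>%
  summarize(across(everything(), \(x) sum(is.na(x))))

# Keep only the rows that are complete in every column
complete <- toy %>% drop_na()

n_missing
complete
```

Note how much data this can cost: here two missing values, in different rows, remove half of the rows.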
I did one more thing: the variable that’s called Developed here, though a numeric 1, 2, or 3, is really a categorical “low”, “medium”, “high”, so I decided to make it this. It is actually related to electricity use, so I want to have it be called Electric_use. There are several ways you might do this. One is lvls_revalue from the forcats package (loaded with the tidyverse; this is where fct_inorder comes from), but I decided to use an idea like the Canadian Tire nails from lecture and make a little lookup table:
conversion <- tribble(
  ~Developed, ~Electric_use,
  1, "low",
  2, "moderate",
  3, "high"
)
conversion
and now we can left-join this onto our countries0:
countries0 %>%
  left_join(conversion, join_by(Developed)) -> countries0
countries0
You can check that Developed and the new column Electric_use (on the right-hand end) do actually correspond as they should. If you want to be sophisticated about that:
countries0 %>% count(Developed, Electric_use)
and you see that the only combinations of these two variables are the ones that match, so we have not introduced any errors.
The final step is to get rid of the no-longer-needed Developed column, and then I saved that result for you.
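The source does not show the code for that last step, so the following is only a sketch of how it might go, on a hypothetical made-up dataframe countries_toy: join on the lookup table, then drop Developed with a negative select. The file name in the commented-out write_csv line is also an assumption.

```r
library(tidyverse)

# Hypothetical made-up data with a numeric Developed column
countries_toy <- tribble(
  ~Country, ~Developed,
  "A",      1,
  "B",      3,
  "C",      2
)

# The lookup table from the text
conversion <- tribble(
  ~Developed, ~Electric_use,
  1, "low",
  2, "moderate",
  3, "high"
)

# Look up the text label for each country, then drop the numeric code
final <- countries_toy %>%
  left_join(conversion, join_by(Developed)) %>%
  select(-Developed)

final

# Saving the result might then look like this (file name is an assumption):
# write_csv(final, "countries.csv")
```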
- (2 points) Display only the country names and the percentage of population living in rural areas.
This is choosing columns (variables), so is a select, with the names of the columns you want:
countries %>% select(Country, Rural)
- (3 points) Display all the columns whose names have E as their first letter, uppercase or lowercase.
A select-helper, namely starts_with. The E itself can be uppercase or lowercase, since the select-helpers will select both:
countries %>% select(starts_with("E"))
or
countries %>% select(starts_with("e"))
Either is good.
- (3 points) Display all the columns whose names have a lowercase o in them somewhere.
“In them somewhere” translates to contains. To make sure that columns with an uppercase O don’t get chosen, that is to pay attention to case, you need the double negative “don’t ignore case”, that is:
countries %>% select(contains("o", ignore.case = FALSE))
Make sure you put the ignore.case inside the contains, not inside the select (or else you will get an error message that doesn’t give you much of a clue about what has gone wrong).
If you omit the ignore.case, you’ll get too many columns:
countries %>% select(contains("o"))
because the CO2 column’s name has an uppercase O in it.
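The difference between the two versions shows up clearly on a small example. Here countries_toy is a hypothetical one-row dataframe (made-up values) that borrows four of the real column names:

```r
library(tidyverse)

# Hypothetical one-row dataframe using four of the real column names
countries_toy <- tibble(Country = "A", Population = 1, CO2 = 2, GDP = 3)

# Case-sensitive: only names with a lowercase o
lower_only <- countries_toy %>%
  select(contains("o", ignore.case = FALSE)) %>%
  names()

# Default (case ignored): names with o or O
either_case <- countries_toy %>%
  select(contains("o")) %>%
  names()

lower_only
either_case
```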
- (3 points) Display only the columns that are text.
This uses where. Inside where goes something that will be TRUE for the columns that you want. is.character is the thing:
countries %>% select(where(is.character))
If you eyeball your dataframe, you’ll see that all the columns are either text or numbers, so an alternative way to do this is to note that is.numeric will be FALSE for the columns you want, so “is not numeric” will also get them, but you have to be careful:
countries %>% select(where(\(x) !is.numeric(x)))
Inside where, you need the name of a function (like is.character), or an anonymous function such as you would use inside across.
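A quick check that these routes agree, on a hypothetical made-up dataframe countries_toy. The third version, negating the whole where rather than the function inside it, is an extra variation of mine and not something from the text above.

```r
library(tidyverse)

# Hypothetical made-up data: three text columns and one numeric column
countries_toy <- tibble(
  Country = c("A", "B"),
  Code = c("AAA", "BBB"),
  Rural = c(10, 20),
  Electric_use = c("low", "high")
)

# Name of a function inside where
text_1 <- countries_toy %>% select(where(is.character)) %>% names()

# Anonymous function inside where
text_2 <- countries_toy %>% select(where(\(x) !is.numeric(x))) %>% names()

# Negating the selection as a whole (an extra variation)
text_3 <- countries_toy %>% select(!where(is.numeric)) %>% names()

text_1
text_2
text_3
```

What does not work is where(!is.numeric), because !is.numeric tries to negate the function itself rather than its result.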
- (2 points) Display the countries that have high electricity use.
This is displaying rows (only the observations that satisfy a condition), so filter. Don’t forget the double equals sign for testing whether something is true:
countries %>% filter(Electric_use == "high")
- (2 points) Display the five countries with the largest populations.
You have a choice here: the easier is to use slice_max:
countries %>% slice_max(Population, n = 5)
or, if you don’t think of that, sort all the countries by population (in descending order) and then grab just the top five:
countries %>% arrange(desc(Population)) %>%
  slice(1:5)
Extra: You are probably wondering where China went:
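countries0 %>% filter(Country == "China")
China is not in our dataset. It does seem to have been in the original data, though: the largest population in the summary of AllCountries1e above is about 1325 million, which can only be China, so it was presumably removed along with the other countries that had missing values.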
- (3 points) Are there any countries with population less than 5 million that have more than 60% of their population in rural areas? How do you know?
Try to find all of them. The countries you want are ones that satisfy both conditions, so a logical “and”:
countries %>% filter(Population < 5, Rural > 60)
or, equivalently, two filters one after the other, in either order:
countries %>% filter(Population < 5) %>%
  filter(Rural > 60)
There are no rows in the answer, so there are no countries that satisfy both conditions: that is, there are no countries with a population less than 5 million that have a rural population greater than 60%. (The answer to the question is “no”, but make sure you actually do answer it somewhere.)
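The same logic can be seen on a hypothetical made-up dataframe toy, in which each country satisfies at most one of the two conditions. Separating conditions with a comma inside filter is the same as joining them with &, and nrow gives a direct check that nothing was found.

```r
library(tidyverse)

# Hypothetical made-up data: A and B are small but not very rural,
# C is very rural but not small
toy <- tibble(
  Country = c("A", "B", "C"),
  Population = c(2, 3, 50),
  Rural = c(40, 55, 70)
)

# Two conditions separated by a comma: both must be true
with_comma <- toy %>% filter(Population < 5, Rural > 60)

# The same thing written with an explicit "and"
with_and <- toy %>% filter(Population < 5 & Rural > 60)

# Zero rows means that no country satisfies both conditions
nrow(with_comma)
nrow(with_and)
```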
- (3 points) Find the median of all the variables that are quantitative.
Doing something with multiple columns that you are not naming one by one is across. Inside the across goes something that will pick out the columns you want (where(is.numeric)), and an anonymous function that will work out whatever you want to work out for each of those columns. The input to the anonymous function can be called anything, as long as you use the same “anything” inside the function:
countries %>% summarize(across(where(is.numeric), \(x) median(x)))
Another approach you can try is to select the quantitative columns first, then work out the median of each of them. This perhaps does not actually make things much easier, because you still have to remember how to do something for all the columns. It goes like this:
countries %>%
  select(where(is.numeric)) %>%
  summarize(across(everything(), \(x) median(x)))
The key select-helper is everything(). Maybe this appeals to you if you like to break things down into small parts: “grab the quantitative columns, and then for each of the columns I have left, work out the median of it”.
- (2 points) Do countries that have an above-average rural population also tend to have an above-average percentage of the population with access to the Internet? Explain briefly.
Find a way to answer this (there is probably not one best way). I think the easiest way is to make a graph: these two are quantitative variables, so a scatterplot is called for:
ggplot(countries, aes(x = Rural, y = Internet)) + geom_point()
This is a downward trend, so countries with above-average rural populations are in fact below average in terms of Internet access.
Another way is to count the number of countries that are above or below average on these two variables combined. You just worked out the medians, so you can use those as averages. You may or may not realize that this does in fact work:
countries %>% count(Rural > 36.1, Internet > 32.6)
In the likely event that you didn’t realize you could do that, create new columns that are TRUE or FALSE according to those, and then count the new columns:
countries %>%
  mutate(rural_above = (Rural > 36.1),
         internet_above = (Internet > 32.6)) %>%
  count(rural_above, internet_above)
This tells the same story: most of the countries that are above-average rural are also below-average on Internet access.
Whichever way you do it, the answer to the original question is “no” plus this kind of sentence of explanation.
You don’t have to do it either of these ways, but you need to come to a conclusion somehow, via a graph or numerical summary, that highly rural countries are likely to have lower internet access.
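A variation on the counting approach is to have R work out the medians inside the pipeline, rather than typing in 36.1 and 32.6 by hand. This is a sketch on a hypothetical made-up dataframe toy (invented values with a clear downward trend); with the real data you would use countries.

```r
library(tidyverse)

# Hypothetical made-up data with a clear downward trend
toy <- tibble(
  Rural = c(10, 20, 30, 60, 70, 80),
  Internet = c(80, 70, 60, 30, 20, 10)
)

# Compare each value with the median of its own column, then count
above_counts <- toy %>%
  mutate(rural_above = Rural > median(Rural),
         internet_above = Internet > median(Internet)) %>%
  count(rural_above, internet_above)

above_counts
```

This way, the code still gives the right comparison if the data change.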
Extra: One of the reasons for this is that urbanization is a sign of development (or industrialization if you prefer), so countries that have less of their population living in rural areas are more likely to show signs of being developed, and internet access is (or was, in 2008) one of those signs.