library(tidyverse)
2 Reading in data
In this chapter, and from here on, the questions come first, and my answers follow under second copies of the problems with the same names. It is much better for your learning to spend some time thinking about how you would solve each problem before you look at how I did it.
2.1 Orange juice
The quality of orange juice produced by a manufacturer (identity unknown) is constantly being monitored. The manufacturer has developed a “sweetness index” for its orange juice, for which a higher value means sweeter juice. Is the sweetness index related to a chemical measure such as the amount of water-soluble pectin (parts per million) in the orange juice? Data were obtained from 24 production runs, and the sweetness and pectin content were measured for each run. The data are in link. Open that link up now. You can click on that link just above to open the file.
The data values are separated by a space. Use the appropriate Tidyverse function to read the data directly from the course website into a “tibble”.
Take a look at what got read in. Do you have data for all 24 runs?
In your data frame, where did the column (variable) names come from? How did R know where to get them from?
2.2 Making soap
A company operates two production lines in a factory for making soap bars. The production lines are labelled A and B. A production line that moves faster may produce more soap, but may possibly also produce more “scrap” (that is, bits of soap that can no longer be made into soap bars and will have to be thrown away).
The data are in link.
Read the data into R. Display the data.
There should be 27 rows. Are there? What columns are there?
2.3 Handling shipments
A company called Global Electronics from time to time imports shipments of a certain large part used as a component in several of its products. The size of the shipment varies each time. Each shipment is sent to one of two warehouses (labelled A and B) for handling. The data in link show the size of each shipment (in thousands of parts) and the direct cost of handling it, in thousands of dollars. Also shown is the warehouse (A or B) that handled each shipment.
Read the data into R and display your data frame.
Describe how many rows and columns your data frame has, and what they contain.
My solutions follow:
2.4 Orange juice
The quality of orange juice produced by a manufacturer (identity unknown) is constantly being monitored. The manufacturer has developed a “sweetness index” for its orange juice, for which a higher value means sweeter juice. Is the sweetness index related to a chemical measure such as the amount of water-soluble pectin (parts per million) in the orange juice? Data were obtained from 24 production runs, and the sweetness and pectin content were measured for each run. The data are in link. Open that link up now. You can click on that link just above to open the file.
- The data values are separated by a space. Use the appropriate Tidyverse function to read the data directly from the course website into a “tibble”.
Solution
Start with this (almost always):
library(tidyverse)
The appropriate function, the data values being separated by a space, will be read_delim. Put the URL as the first thing in read_delim, or (better) save it in a variable first:¹
url <- "http://ritsokiguess.site/datafiles/ojuice.txt"
juice <- read_delim(url, " ")
Rows: 24 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (3): run, sweetness, pectin
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
read_delim (or read_csv or any of the others) tells you what variables were read in, and also tells you about any “parsing errors” where it couldn’t work out what was what. Here, we have three variables, which is entirely consistent with the three columns of data values in the file.
read_delim can handle data values separated by any character, not just spaces. The separating character is known as a “delimiter”. Recent versions of readr will try to guess the delimiter if you leave it out, but it is safer to say what it is, every time.
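As a minimal sketch of reading values separated by something other than a space, the code below writes a small semicolon-separated file to a temporary location and reads it back. The file contents and the names tmp and semi are made up for illustration; they are not part of the course data.

```r
library(tidyverse)

# Hypothetical example file: values separated by semicolons, not spaces.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a;b", "1;2", "3;4", "5;6"), tmp)

# Name the delimiter explicitly with `delim`.
# show_col_types = FALSE just quiets the column specification message.
semi <- read_delim(tmp, delim = ";", show_col_types = FALSE)
semi
```

The only thing that changes from the orange juice example is the value given for the delimiter.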
\(\blacksquare\)
- Take a look at what got read in. Do you have data for all 24 runs?
Solution
Type the name of the data frame in a code chunk (a new one, or add it to the end of the previous one). Because this is actually a “tibble”, which is what read_delim reads in, you’ll only actually see the first 10 lines, but it will tell you how many lines there are altogether, and you can click on the appropriate thing to see the rest of it.
juice
I appear to have all the data. If you want further convincing, click Next a couple of times to be sure that the runs go down to number 24.
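If you are not working somewhere that lets you click through the rows, here is one possible way to check the same thing with code. This is only a sketch: juice_demo is a made-up stand-in with a single run column, used so that the example runs without the course website. With the real data you would use juice in its place.

```r
library(tidyverse)

# Stand-in for the real data: 24 rows, run numbers only (illustration).
juice_demo <- tibble(run = 1:24)

# How many rows are there?
nrow(juice_demo)

# Print every row instead of only the first 10.
print(juice_demo, n = Inf)

# Or look at just the last few rows to see where the runs end.
tail(juice_demo, 3)
```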
\(\blacksquare\)
- In your data frame, where did the column (variable) names come from? How did R know where to get them from?
Solution
They came from the top line of the data file, so we didn’t have to specify them. This is the default behaviour of all the read_ functions, so we don’t have to ask for it specially.
Extra: in fact, if the top line of your data file is not variable names, that’s when you have to say something special. The read_ functions have an option col_names which can be TRUE (the default), which means “read them in from the top line”, FALSE (“they are not there, so make some up”), or a character vector of column names to use. You might use the last alternative when the column names that are in the file are not the ones you want to use; in that case, you would also say skip=1 to skip the first line. For example, with file a.txt thus:
a b
1 2
3 4
5 6
you could read the same data but call the columns x and y thus:
read_delim("a.txt", " ", col_names = c("x", "y"), skip = 1)
Rows: 3 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (2): x, y
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
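To see the FALSE alternative in action, the sketch below recreates the small file from the text in a temporary location (the temporary file and the names a_file, with_header and made_up are my own, for illustration) and reads it twice:

```r
library(tidyverse)

# Recreate the contents of a.txt in a temporary file.
a_file <- tempfile(fileext = ".txt")
writeLines(c("a b", "1 2", "3 4", "5 6"), a_file)

# Default (col_names = TRUE): names are read from the top line.
with_header <- read_delim(a_file, delim = " ", show_col_types = FALSE)

# col_names = FALSE with skip = 1: the top line is skipped and
# readr makes up column names of its own.
made_up <- read_delim(a_file, delim = " ", col_names = FALSE,
                      skip = 1, show_col_types = FALSE)

names(with_header)
names(made_up)
```

The made-up names are X1 and X2. Without skip = 1, the line of names in the file would be read as if it were a row of data.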
\(\blacksquare\)
2.5 Making soap
A company operates two production lines in a factory for making soap bars. The production lines are labelled A and B. A production line that moves faster may produce more soap, but may possibly also produce more “scrap” (that is, bits of soap that can no longer be made into soap bars and will have to be thrown away).
The data are in link.
- Read the data into R. Display the data.
Solution
Read directly from the URL, most easily:
url <- "http://ritsokiguess.site/datafiles/soap.txt"
soap <- read_delim(url, " ")
Rows: 27 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): line
dbl (3): case, scrap, speed
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
soap
\(\blacksquare\)
- There should be 27 rows. Are there? What columns are there?
Solution
There are indeed 27 rows, one per observation. The column called case identifies each particular run of a production line (scroll down to see that it gets to 27 as well). Though it is a number, it is an identifier variable and so should not be treated quantitatively. The other columns (variables) are scrap and speed (quantitative) and line (categorical). These indicate which production line was used for each run, the speed it was run at, and the amount of scrap produced.
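One possible way to check the columns and their types in code is sketched below. The data frame soap_demo is a made-up stand-in with the same column names as the soap data (its values are invented for illustration only); with the real data you would use soap in its place.

```r
library(tidyverse)

# Made-up stand-in with the same column names as the soap data.
# The values are invented and are not the real measurements.
soap_demo <- tibble(
  case  = c(1, 2, 3, 4),
  scrap = c(10, 20, 30, 40),
  speed = c(100, 200, 150, 250),
  line  = c("a", "a", "b", "b")
)

# One line of output per column, showing its name and type.
glimpse(soap_demo)

# Count the runs on each production line (the categorical column).
line_counts <- soap_demo %>% count(line)
line_counts
```

Counting makes sense for line because it is categorical. It would not be sensible to average case, because that column is only an identifier.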
This seems like an odd place to end this question, but later we’ll be using these data to draw some graphs.
\(\blacksquare\)
2.6 Handling shipments
A company called Global Electronics from time to time imports shipments of a certain large part used as a component in several of its products. The size of the shipment varies each time. Each shipment is sent to one of two warehouses (labelled A and B) for handling. The data in link show the size of each shipment (in thousands of parts) and the direct cost of handling it, in thousands of dollars. Also shown is the warehouse (A or B) that handled each shipment.
- Read the data into R and display your data frame.
Solution
If you open the data file in your web browser, it will probably open as a spreadsheet, which is not really very helpful, since then it is not clear what to do with it. You could, I suppose, save it and upload it to r.datatools, but it requires much less brainpower to open it directly from the URL:
url <- "http://ritsokiguess.site/datafiles/global.csv"
shipments <- read_csv(url)
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): warehouse
dbl (2): size, cost
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you display your data frame and it looks like this, you are good (you can give the data frame any name):
shipments
\(\blacksquare\)
- Describe how many rows and columns your data frame has, and what they contain.
Solution
It has 10 rows and 3 columns. You need to say this.
That is, there were 10 shipments recorded, and for each of them, 3 variables were noted: the size and cost of the shipment, and the warehouse it was handled at.
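If you would rather have R count the rows and columns for you, something like the following sketch would do it. The three-row file here is made up (same column names as the shipments data, invented values) so that the code runs on its own; with the real data you would apply the same functions to shipments.

```r
library(tidyverse)

# Hypothetical small CSV with the same column names as the shipments
# data. The values are invented for illustration.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("warehouse,size,cost", "A,10,5", "B,20,8", "A,30,12"),
           csv_file)
shipments_demo <- read_csv(csv_file, show_col_types = FALSE)

nrow(shipments_demo)  # number of rows
ncol(shipments_demo)  # number of columns
dim(shipments_demo)   # rows and columns together
```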
We will also be making some graphs of these data later.
\(\blacksquare\)
¹ I say “better” because otherwise the read line gets rather long. This way you read it as “the URL is some long thing that I don’t care about especially, and what I need to do is to read the data from that URL, separated by spaces.”