Dataset Description

For this week’s optional practice problems, we have provided another dataset for you to use.

midwest is a dataset that’s available in R as part of the ggplot2 package, similar to the complete_old dataset we have been using from the ratdat package. The midwest dataset has demographic information for counties in several Midwest states from the 2000 U.S. Census.

  1. Write and execute the following code to read about where the dataset came from and what the variables are.
?midwest
  2. Use str to learn more about the data types in the midwest dataset. Of the primary vector types we discussed in this week’s workshop (character, integer, numeric, logical), which are represented in this dataset?

Answer: character, integer, and numeric. There are no logical vectors.

str(midwest)
## tibble [437 × 28] (S3: tbl_df/tbl/data.frame)
##  $ PID                 : int [1:437] 561 562 563 564 565 566 567 568 569 570 ...
##  $ county              : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
##  $ state               : chr [1:437] "IL" "IL" "IL" "IL" ...
##  $ area                : num [1:437] 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
##  $ poptotal            : int [1:437] 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
##  $ popdensity          : num [1:437] 1271 759 681 1812 324 ...
##  $ popwhite            : int [1:437] 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
##  $ popblack            : int [1:437] 1702 3496 429 127 547 50 1 111 16 16559 ...
##  $ popamerindian       : int [1:437] 98 19 35 46 14 65 8 30 8 331 ...
##  $ popasian            : int [1:437] 249 48 16 150 5 195 15 61 23 8033 ...
##  $ popother            : int [1:437] 124 9 34 1139 6 221 0 84 6 1596 ...
##  $ percwhite           : num [1:437] 96.7 66.4 96.6 95.3 90.2 ...
##  $ percblack           : num [1:437] 2.575 32.9 2.862 0.412 9.373 ...
##  $ percamerindan       : num [1:437] 0.148 0.179 0.233 0.149 0.24 ...
##  $ percasian           : num [1:437] 0.3768 0.4517 0.1067 0.4869 0.0857 ...
##  $ percother           : num [1:437] 0.1876 0.0847 0.2268 3.6973 0.1028 ...
##  $ popadults           : int [1:437] 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
##  $ perchsd             : num [1:437] 75.1 59.7 69.3 75.5 68.9 ...
##  $ percollege          : num [1:437] 19.6 11.2 17 17.3 14.5 ...
##  $ percprof            : num [1:437] 4.36 2.87 4.49 4.2 3.37 ...
##  $ poppovertyknown     : int [1:437] 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
##  $ percpovertyknown    : num [1:437] 96.3 99.1 95 98.5 82.5 ...
##  $ percbelowpoverty    : num [1:437] 13.15 32.24 12.07 7.21 13.52 ...
##  $ percchildbelowpovert: num [1:437] 18 45.8 14 11.2 13 ...
##  $ percadultpoverty    : num [1:437] 11.01 27.39 10.85 5.54 11.14 ...
##  $ percelderlypoverty  : num [1:437] 12.44 25.23 12.7 6.22 19.2 ...
##  $ inmetro             : int [1:437] 0 0 0 1 0 0 0 0 0 1 ...
##  $ category            : chr [1:437] "AAR" "LHR" "AAR" "ALU" ...
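
As a cross-check on the str output above, the column classes can also be tallied programmatically. This is a minimal sketch, assuming the ggplot2 package is installed; the object names col_classes and type_counts are illustrative and not part of the workshop materials.

```r
library(ggplot2)  # provides the midwest dataset

# Look up the class of every column, then count how many columns share each class.
col_classes <- sapply(midwest, class)
type_counts <- table(col_classes)
type_counts

# Confirm that no column is stored as a logical vector.
any(col_classes == "logical")
```

Each column appears under exactly one class, so the counts should add up to the 28 columns reported by str.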

Make a Fancy Boxplot

Create a graph from the midwest data that compares population density between states. Follow the instructions below:

  1. Compare state on the x-axis to popdensity on the y-axis.
  2. Create a combined boxplot and scatter plot graph with the following features:
    • Use geom_jitter to create the scatter plot layer.
      • Set the color of the scatter plot data points to vary by the variable state.
      • Choose a new point shape (shape =). Shape is identified using integers, and these are some of your point shape options:
        [Figure: point shape options in R]
      • Set a transparency level.
      • Make the point size larger (size =).
    • Use geom_boxplot to create the boxplot layer.
      • Remove the fill color.
      • Set outlier.shape to NA so that outliers are not plotted twice (once by the boxplot and once by the scatter plot layer).
  3. Set a new theme (e.g., theme_classic(), theme_bw())
  4. Change the labels so that:
    • Plot title is “Midwest Population Demographics (2000)”.
    • X-axis is “State”.
    • Y-axis is “Population Density (person/unit area)”.
  5. Change other plot features using the theme function so that:
    • The size of the plot title text is 16 and the text face is bold.
    • The position of the legend is “none” (removes legend).
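
To preview the integer shape codes mentioned in step 2, one option is to draw them yourself. The following is a minimal sketch, assuming ggplot2 is installed; the data frame shapes and the object shape_chart are illustrative names, and the grid layout is an arbitrary choice.

```r
library(ggplot2)

# One row per shape code (0 through 25), arranged on a grid six columns wide.
shapes <- data.frame(shape = 0:25,
                     x = 0:25 %% 6,
                     y = -(0:25 %/% 6))

shape_chart <- ggplot(shapes, aes(x = x, y = y)) +
  # Draw each point using its own code as the shape; fill only affects shapes 21 to 25.
  geom_point(aes(shape = shape), size = 5, fill = "grey70") +
  # Label each point with its integer code just below it.
  geom_text(aes(label = shape), nudge_y = -0.35) +
  scale_shape_identity() +
  theme_void()

shape_chart
```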

The code and graph you created should look similar to the following:

ggplot(data = midwest,
       aes(x = state,
           y = popdensity)) +
  geom_jitter(aes(color = state),
              shape = 18,
              alpha = 0.6,
              size = 3) +
  geom_boxplot(outlier.shape = NA,
               fill = NA) +
  theme_bw() +
  labs(title = "Midwest Population Demographics (2000)",
       x = "State",
       y = "Population Density (person/unit area)") +
  theme(plot.title = element_text(size = 16,
                                  face = "bold"),
        legend.position = "none")
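
By default, geom_jitter adds random noise in both the horizontal and vertical directions, so the plotted popdensity values shift slightly and the points move each time the plot is redrawn. The variation below is a sketch rather than part of the expected answer: it restricts the jitter to the x-axis and fixes the random seed. The width of 0.2, the seed of 42, and the name jitter_plot are arbitrary choices for illustration.

```r
library(ggplot2)

jitter_plot <- ggplot(data = midwest,
                      aes(x = state,
                          y = popdensity)) +
  geom_jitter(aes(color = state),
              # Spread points horizontally only; height = 0 keeps y-values exact.
              # The fixed seed makes the point positions reproducible.
              position = position_jitter(width = 0.2, height = 0, seed = 42),
              alpha = 0.6) +
  geom_boxplot(outlier.shape = NA,
               fill = NA) +
  theme_bw() +
  theme(legend.position = "none")

jitter_plot
```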

Summary Statistics

Compute the following calculations on the midwest dataset:

  1. What is the maximum value of total population (poptotal)?
max(midwest$poptotal)
## [1] 5105067
  2. What is the minimum value of population density (popdensity)?
min(midwest$popdensity)
## [1] 85.05
  3. What are the quartiles (25%, 50%, 75%) for the percent of a county’s population that is college educated (percollege)? The quantile function will be helpful here.
quantile(midwest$percollege, probs = c(0.25, 0.5, 0.75))
##      25%      50%      75% 
## 14.11372 16.79756 20.54989
  4. What is the average number of adults per county (popadults)?
mean(midwest$popadults)
## [1] 60972.61
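
Several of these statistics can also be obtained in a single call. As a minimal sketch, assuming ggplot2 is installed, the base R summary function reports the minimum, quartiles, mean, and maximum of a numeric vector; college_summary is an illustrative name.

```r
library(ggplot2)  # provides the midwest dataset

# Minimum, 1st quartile, median, mean, 3rd quartile, and maximum in one step.
college_summary <- summary(midwest$percollege)
college_summary
```

The 1st quartile, median, and 3rd quartile reported here correspond to the 25%, 50%, and 75% values returned by quantile.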

Sequences

  1. Create a sequence that runs from 8 to 85 at intervals of 7. Store this sequence as an object called weird_seq.
weird_seq <- seq(from = 8, to = 85, by = 7)

# print the sequence values
weird_seq
##  [1]  8 15 22 29 36 43 50 57 64 71 78 85
  2. Create a sequence that runs from 1900 to 2025 at intervals of 5. Store this sequence as an object called year_seq.
year_seq <- seq(from = 1900, to = 2025, by = 5)

# print the sequence values
year_seq
##  [1] 1900 1905 1910 1915 1920 1925 1930 1935 1940 1945 1950 1955 1960 1965 1970
## [16] 1975 1980 1985 1990 1995 2000 2005 2010 2015 2020 2025
  3. Create a sequence that runs from 12 to 22 and has a length of 47 (47 total items in the sequence). Store this sequence as an object called seq_length.
seq_length <- seq(from = 12, to = 22, length.out = 47)

# print the sequence values
seq_length
##  [1] 12.00000 12.21739 12.43478 12.65217 12.86957 13.08696 13.30435 13.52174
##  [9] 13.73913 13.95652 14.17391 14.39130 14.60870 14.82609 15.04348 15.26087
## [17] 15.47826 15.69565 15.91304 16.13043 16.34783 16.56522 16.78261 17.00000
## [25] 17.21739 17.43478 17.65217 17.86957 18.08696 18.30435 18.52174 18.73913
## [33] 18.95652 19.17391 19.39130 19.60870 19.82609 20.04348 20.26087 20.47826
## [41] 20.69565 20.91304 21.13043 21.34783 21.56522 21.78261 22.00000
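
When length.out is supplied instead of by, seq divides the range into equal steps. With 47 values there are 46 gaps, so the step is (22 - 12) / 46, or about 0.2173913. The sketch below rebuilds the same sequence by hand to illustrate this; step_size and manual_seq are illustrative names.

```r
# 47 values means 46 equal gaps between 12 and 22.
step_size <- (22 - 12) / (47 - 1)
step_size

# Build the sequence manually: start at 12 and add 0 to 46 steps.
manual_seq <- 12 + step_size * (0:46)

# Compare with seq(), allowing for floating-point rounding.
all.equal(manual_seq, seq(from = 12, to = 22, length.out = 47))
```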

Reflections

  1. What was the most challenging aspect of this week’s workshop? Were you able to overcome it? If not, what assistance do you need to continue working through it?

  2. What was the most rewarding aspect of this week’s workshop?