If you already have access to a copy of R and RStudio, you can skip to Required Packages.

Downloading R and RStudio

For OSU learners: since your access to TIGER was only temporary, you will need to download R and RStudio on your own computer if you want to continue working in this program.

For UCO learners: you may choose to download R and RStudio on your own computer, or continue to work in Buddy. If you want to download your own copies of the software, follow the instructions in this section.

Download and Install R

You can find installations for Windows, MacOS, and Linux at the Comprehensive R Archive Network website.

Download and Install RStudio

After you have R installed on your computer, you can then install RStudio Desktop from the Posit website. If you scroll down, you can find installations for MacOS and Linux.

Required Packages

For OSU learners: since these practice problems are not taking place on TIGER, make sure you install the necessary packages on your personal version of R and RStudio.

For UCO learners: if you are continuing to use Buddy, you should already have tidyverse installed. If you are working on your personal versions of R and RStudio, you will also need to install the necessary packages if you do not have them installed already.

install.packages("tidyverse") # only if you do not have it installed
library(tidyverse) # load the package
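If you are not sure whether tidyverse is already installed, you can check before running install.packages(). The following is a minimal sketch using base R's requireNamespace(); the helper name is_installed is hypothetical and not part of the workshop materials.

```r
# Hypothetical helper: TRUE if the package can be found on this machine.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Report what to do next instead of reinstalling every time.
if (is_installed("tidyverse")) {
  message("tidyverse is already installed; library(tidyverse) will load it.")
} else {
  message("tidyverse is missing; run install.packages(\"tidyverse\") first.")
}
```

This check has no side effects beyond printing a message, so it is safe to rerun at the top of a script.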

Dataset Description and Download

For this week’s optional practice problems, we have provided another dataset for you to use. This time, you will be downloading this file yourself and importing it into R.

This dataset is from Studying African Farmer-Led Irrigation (SAFI), a study that conducted interviews in Tanzania and Mozambique to assess farming and irrigation methods. Learn more about the dataset variables.

  1. Download the dataset (CSV)
    • This should open a page in GitHub. In the upper right-hand corner, you should see a button with three dots.
    • Click the button to open a menu and select Download to download the dataset.
  2. Pay attention to which folder you save the dataset in.

Importing the Dataset

If you are familiar with setting your working directory in R, use your preferred method to set your working directory to the folder that the dataset is in and skip to Read in the Dataset.

For Buddy users: You will need to import the file into Buddy using the import tool.

If you are new to R or have limited experience importing data, you will need to set your working directory. We outline this process below.

Set the Working Directory

Without getting too complicated: importing files was seamless in TIGER and Buddy because all of our files were in the main “folder” on our “computer”. If you are using your own computer and folders, you need to make sure R is looking at the correct location on your computer so it can find the files.

The simplest way to do this is to set the “Working Directory”. This is the location on your computer that R focuses its attention on. You can do this programmatically through code if you know the exact path to the folder with your data in it. For example:

setwd("C:/Users/username/OSU/Workshop_Files/Intro_R")

Alternatively, you can manually search for and set your working directory by going to the Session tab in RStudio >> Set Working Directory >> Choose Directory >> select the folder that contains your dataset.

RStudio interface with the “Session” tab selected, showing how to set the working directory manually

You can check that your working directory is set to the correct location by running the following code:

getwd()
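Beyond printing the working directory, you can ask R directly whether it can see the dataset from there. This is a minimal sketch using base R's file.exists(); the variable name data_file is only illustrative, and the file name is the one used for the downloaded dataset.

```r
data_file <- "SAFI_clean.csv"  # file name of the downloaded dataset

# file.exists() looks relative to the current working directory.
if (file.exists(data_file)) {
  message("Found ", data_file, " in ", getwd())
} else {
  message("No ", data_file, " in ", getwd(),
          "; check which folder you saved it in.")
}
```

If the second message appears, set the working directory again or move the file into the folder that getwd() reports.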

Projects are an alternative to repeatedly setting the working directory. If you want to learn more, read about RStudio Projects in the RStudio User Guide.

Read in the Dataset

Now that your working directory is set, you should be able to see the dataset we downloaded if you go to the Files tab on the lower right of the RStudio interface. If you see a file called SAFI_clean.csv, everything is in order.

Import the dataset using the read_csv() function since it is a CSV file. Don’t forget to put the file name in " " and include the file extension .csv. Store the dataset as an object called survey_data so that we can reference it in the next section.

survey_data <- read_csv("SAFI_clean.csv")
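After importing, it is worth a quick check that the data arrived with the shape and column types you expect. The sketch below uses a tiny inline CSV with made-up values (not the SAFI data) so that it runs without the file; the names toy_csv and toy_data are hypothetical. The same calls can be applied to survey_data.

```r
library(readr)

# Made-up three-row CSV standing in for a real file.
# I() tells read_csv() to treat the string as literal data, not a path.
toy_csv <- I("village,no_membrs\nGod,3\nChirodzo,12\nRuaca,10\n")
toy_data <- read_csv(toy_csv, show_col_types = FALSE)

dim(toy_data)    # number of rows and columns
names(toy_data)  # column names
spec(toy_data)   # column types that read_csv() guessed
```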

Practice Problems

Working with Data

In this section, you will write code to subset, transform, and create new variables from the survey_data dataset we imported.

  1. Subset the original dataset so that it includes the following components and store it as an object called bigger_households:
    • Only the columns village, no_membrs, respondent_wall_type, and rooms
    • Only households that have at least 8 people living in them (no_membrs)
bigger_households <- survey_data %>% 
  filter(no_membrs >= 8) %>% 
  select(village, no_membrs, respondent_wall_type, rooms)

bigger_households
## # A tibble: 49 × 4
##    village  no_membrs respondent_wall_type rooms
##    <chr>        <dbl> <chr>                <dbl>
##  1 God             10 burntbricks              1
##  2 Chirodzo        12 burntbricks              3
##  3 Chirodzo         8 burntbricks              1
##  4 Chirodzo        12 burntbricks              5
##  5 God             10 burntbricks              3
##  6 God              8 sunbricks                1
##  7 God              9 burntbricks              2
##  8 God              8 burntbricks              1
##  9 Ruaca           10 burntbricks              4
## 10 Ruaca           11 burntbricks              3
## # ℹ 39 more rows
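The order of filter() and select() does not matter here, because the column used in the filter (no_membrs) is also one of the selected columns. The sketch below illustrates this with made-up data (not the SAFI dataset); the object names are hypothetical.

```r
library(dplyr)

# Made-up households for illustration only.
toy <- tibble(village   = c("A", "B", "C"),
              no_membrs = c(4, 8, 12),
              years_liv = c(10, 20, 30))

# Filter rows first, then keep two columns.
filter_first <- toy %>%
  filter(no_membrs >= 8) %>%   # ">=" keeps households with exactly 8 too
  select(village, no_membrs)

# Keep two columns first, then filter rows.
select_first <- toy %>%
  select(village, no_membrs) %>%
  filter(no_membrs >= 8)

all.equal(filter_first, select_first)
```

If you select first and drop the column you want to filter on, filter() will fail because that column no longer exists, so filtering first is the safer habit.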
  2. Subset the original dataset so that it includes the following components and store it as an object called earth_households:
    • Only households where respondent_wall_type is “muddaub”, “burntbricks”, or “sunbricks”
    • Contains a new column called membrs_per_room that contains a calculation of the number of household members per room (no_membrs / rooms)
    • Contains a new column called wall_as_factor that contains the same data as respondent_wall_type but converted into a factor instead
    • Change the order of factor levels to “sunbricks”, “burntbricks” and “muddaub” (Hint: use fct_relevel())
    • Select only the variables no_membrs, rooms, membrs_per_room, and wall_as_factor
earth_households <- survey_data %>% 
  filter(respondent_wall_type %in% c("muddaub","burntbricks","sunbricks")) %>% 
  mutate(membrs_per_room = no_membrs / rooms,
         wall_as_factor = factor(respondent_wall_type),
         wall_as_factor = fct_relevel(wall_as_factor,
                                      c("sunbricks",
                                        "burntbricks",
                                        "muddaub"))) %>% 
  select(no_membrs, rooms, membrs_per_room, wall_as_factor)

earth_households
## # A tibble: 130 × 4
##    no_membrs rooms membrs_per_room wall_as_factor
##        <dbl> <dbl>           <dbl> <fct>         
##  1         3     1             3   muddaub       
##  2         7     1             7   muddaub       
##  3        10     1            10   burntbricks   
##  4         7     1             7   burntbricks   
##  5         7     1             7   burntbricks   
##  6         3     1             3   muddaub       
##  7         6     1             6   muddaub       
##  8        12     3             4   burntbricks   
##  9         8     1             8   burntbricks   
## 10        12     5             2.4 burntbricks   
## # ℹ 120 more rows
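The printed tibble does not show the order of the factor levels, so it helps to inspect them with levels(). The sketch below uses a small made-up vector (not the SAFI data) to show the default alphabetical order and the order after fct_relevel(); the object names are hypothetical.

```r
library(forcats)

# Made-up wall types for illustration only.
walls <- factor(c("muddaub", "burntbricks", "sunbricks", "burntbricks"))
levels(walls)  # factor() orders levels alphabetically by default

# Reorder the levels; the underlying values are unchanged.
walls_releveled <- fct_relevel(walls, "sunbricks", "burntbricks", "muddaub")
levels(walls_releveled)
```

On the real data, levels(earth_households$wall_as_factor) performs the same check.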
  3. Use the group_by() and summarize() approach to make the following comparisons between villages (village) from the original dataset (survey_data):
    • Average years lived in the area per household (years_liv)
    • Number of households in each village (Hint: use n())
survey_data %>% 
  group_by(village) %>% 
  summarize(mean_years_liv = mean(years_liv),
            n = n())
## # A tibble: 3 × 3
##   village  mean_years_liv     n
##   <chr>             <dbl> <int>
## 1 Chirodzo           23.6    39
## 2 God                20.4    43
## 3 Ruaca              24.9    49
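The solution above returns a number for every village, which suggests years_liv has no missing values. If a column did contain NA, mean() would return NA for that group unless you add na.rm = TRUE. The following is a sketch with made-up data (not the SAFI dataset) showing that behavior; the object names are hypothetical.

```r
library(dplyr)

# Made-up data with one missing value in village "B".
toy <- tibble(village   = c("A", "A", "B", "B"),
              years_liv = c(10, 20, 30, NA))

toy_summary <- toy %>%
  group_by(village) %>%
  summarize(mean_years_liv = mean(years_liv, na.rm = TRUE),  # ignore NA
            n = n())                                         # counts every row

toy_summary
```

Note that n() still counts the row with the missing value, so the count and the number of values averaged can differ.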

Putting It All Together

  1. Create a subset that includes the following components and store it as an object called survey_50_years:
    • Use str() to verify that interview_date is a date data type
      • Note: POSIXct is one way R refers to date-time data
    • Create a new variable called year by extracting the year from interview_date. (Hint: the function year() can extract the year from a date)
    • Convert year to a factor
    • Filter so that only households that have been living in the area for less than 50 years are represented
str(survey_data$interview_date)
##  POSIXct[1:131], format: "2016-11-17" "2016-11-17" "2016-11-17" "2016-11-17" "2016-11-17" ...
survey_50_years <- survey_data %>% 
  mutate(year = year(interview_date),
         year = factor(year)) %>% 
  filter(years_liv < 50)

survey_50_years
## # A tibble: 122 × 15
##    key_ID village  interview_date      no_membrs years_liv respondent_wall_type
##     <dbl> <chr>    <dttm>                  <dbl>     <dbl> <chr>               
##  1      1 God      2016-11-17 00:00:00         3         4 muddaub             
##  2      2 God      2016-11-17 00:00:00         7         9 muddaub             
##  3      3 God      2016-11-17 00:00:00        10        15 burntbricks         
##  4      4 God      2016-11-17 00:00:00         7         6 burntbricks         
##  5      5 God      2016-11-17 00:00:00         7        40 burntbricks         
##  6      6 God      2016-11-17 00:00:00         3         3 muddaub             
##  7      7 God      2016-11-17 00:00:00         6        38 muddaub             
##  8      9 Chirodzo 2016-11-16 00:00:00         8         6 burntbricks         
##  9     10 Chirodzo 2016-12-16 00:00:00        12        23 burntbricks         
## 10     11 God      2016-11-21 00:00:00         6        20 sunbricks           
## # ℹ 112 more rows
## # ℹ 9 more variables: rooms <dbl>, memb_assoc <chr>, affect_conflicts <chr>,
## #   liv_count <dbl>, items_owned <chr>, no_meals <dbl>, months_lack_food <chr>,
## #   instanceID <chr>, year <fct>
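The year() function comes from the lubridate package, which is attached by library(tidyverse) in tidyverse 2.0.0 and later; with an older tidyverse you may need to run library(lubridate) yourself. The sketch below shows the two steps (extract, then convert to a factor) on a single made-up date-time value; the object names are hypothetical.

```r
library(lubridate)

# Made-up date-time value for illustration only.
interview <- as.POSIXct("2016-11-17", tz = "UTC")

yr <- year(interview)   # extract the year as a number
yr

yr_factor <- factor(yr) # convert the number to a factor
class(yr_factor)
levels(yr_factor)
```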
  2. Create a data visualization using the survey_50_years subset that has the following features:
    • year (as a factor) on the x-axis and no_membrs on the y-axis
    • create a boxplot and remove the outlier points (outlier.shape = NA)
    • create a geom_jitter layer to plot the raw data points and specify that
      • color should vary by village (but only for the jitter layer)
      • change the transparency to 0.7 (alpha =)
      • change the point size to 2 (size =)
    • Add your own additional customizations. For example:
      • change the axis titles
      • change the color of the data points
      • change the graph theme
      • change the font size
    • Store this graph as an object called years_graph

Your exact customizations will vary, but the graph generally should look like the following:

years_graph <- ggplot(data = survey_50_years,
       mapping = aes(x = year,
                     y = no_membrs)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = village),
              alpha = 0.7,
              size = 2) +
  theme_bw() +
  labs(x = "Year Interviewed",
       y = "Number of Household Members",
       color = "Village",
       title = "Newer Households (< 50 years)") +
  scale_color_manual(values = c("purple","forestgreen","orange"))

years_graph # print the graph in the Plots pane
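One optional customization: by default, geom_jitter() moves points both horizontally and vertically. Because no_membrs is a count, vertical jitter nudges points away from their true values. The sketch below, using made-up data (not the SAFI dataset), limits the jitter to the horizontal direction; the width of 0.2 is an illustrative choice and the object names are hypothetical.

```r
library(ggplot2)

# Made-up data for illustration only.
toy <- data.frame(year      = factor(c("2016", "2016", "2017", "2017")),
                  no_membrs = c(3, 7, 10, 12),
                  village   = c("A", "B", "A", "B"))

toy_graph <- ggplot(toy, aes(x = year, y = no_membrs)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = village),
              width = 0.2,  # spread points horizontally only
              height = 0,   # keep counts at their true values
              alpha = 0.7,
              size = 2)

toy_graph
```

Jitter is random, so the horizontal positions will change each time the plot is drawn.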

  3. Export the subset survey_50_years and the graph you just made.
    • The functions ggsave() and write_csv() will be useful here.
ggsave(filename = "years_graph.png",
       plot = years_graph,
       width = 6,
       height = 4,
       dpi = 300)

write_csv(survey_50_years,
          file = "new_house_survey_data.csv")
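Both files are written to the current working directory unless you give a full path. To confirm that an export worked, you can check that the file exists and read it back in. The sketch below does this with a made-up data frame written to a temporary folder; the file name toy_export.csv and the object names are hypothetical.

```r
library(readr)

# Made-up rows for illustration only.
toy <- data.frame(key_ID = 1:3, years_liv = c(4, 9, 15))

# Write to R's temporary folder so nothing is left in your project.
out_file <- file.path(tempdir(), "toy_export.csv")
write_csv(toy, file = out_file)

file.exists(out_file)                        # was the file created?
reloaded <- read_csv(out_file, show_col_types = FALSE)
nrow(reloaded) == nrow(toy)                  # same number of rows?
```

For your own exports, file.exists("years_graph.png") and file.exists("new_house_survey_data.csv") perform the same check in the working directory.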

Reflections

  1. What was the most challenging aspect of this week’s workshop? Were you able to overcome it? If not, what assistance do you need to continue working through it?

  2. What was the most rewarding aspect of this week’s workshop?