Tutorial 1

Answer Model

R Syntax Fundamentals

Create a numeric vector called inflation_rates containing these values: 2.1, 3.4, 1.8, 4.2, 2.9. What is the result of inflation_rates[3]?

inflation_rates <- c(2.1, 3.4, 1.8, 4.2, 2.9)
inflation_rates[3]

[1] 1.8

Create a character vector called countries with elements: “Netherlands”, “Germany”, “France”, “Italy”. Use negative indexing to return all countries except “Germany” (the second element). Then confirm you get the same result using positive indexing. When would negative indexing be more convenient than positive indexing?

countries <- c("Netherlands", "Germany", "France", "Italy")
countries[-2]          # Negative indexing: exclude position 2

[1] "Netherlands" "France"      "Italy"

countries[c(1, 3, 4)]  # Positive indexing: select positions 1, 3, 4

[1] "Netherlands" "France"      "Italy"

Both return "Netherlands" "France" "Italy". Negative indexing is more convenient when you want to exclude a small number of elements from a long vector — specifying what to drop is simpler than listing all positions to keep.
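The convenience of negative indexing shows more clearly on a longer vector. A minimal sketch, using R's built-in month.name vector as a stand-in (the object names all_months, without_ends and same_result are chosen here for illustration only):

```r
# month.name is a built-in character vector with the 12 month names
all_months <- month.name

# Negative indexing: drop the first and last element by position
without_ends <- all_months[-c(1, 12)]

# Positive indexing: the ten positions to keep must all be listed
same_result <- all_months[2:11]

identical(without_ends, same_result)
length(without_ends)
```

Note that positive and negative positions cannot be mixed in a single index vector.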

Given the vector x <- c(10, 20, 30, 40, 50), what does x[c(2, 4)] return? What does x[x > 25] return?

x <- c(10, 20, 30, 40, 50)
x[c(2, 4)]    # Returns: 20 40

[1] 20 40

x[x > 25]     # Returns: 30 40 50

[1] 30 40 50
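To see why x[x > 25] works, it can help to evaluate the comparison on its own first. The sketch below splits the expression into steps; the intermediate object keep is introduced here only for illustration:

```r
x <- c(10, 20, 30, 40, 50)

# The comparison is evaluated first and yields one TRUE/FALSE per element
keep <- x > 25
keep

# Subsetting with a logical vector keeps the positions that are TRUE
x[keep]

# which() converts the logical vector into the matching positions
which(keep)
```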

Create a data frame called cities with three columns:
- name: “Amsterdam”, “Rotterdam”, “The Hague”
- population: 872680, 651406, 545838
- province: “Noord-Holland”, “Zuid-Holland”, “Zuid-Holland”
How would you access the population of Rotterdam using three different methods?

cities <- data.frame(
  name = c("Amsterdam", "Rotterdam", "The Hague"),
  population = c(872680, 651406, 545838),
  province = c("Noord-Holland", "Zuid-Holland", "Zuid-Holland")
)

# Method 1: Logical indexing
cities$population[cities$name == "Rotterdam"]

[1] 651406

# Method 2: Row/column indexing
cities[2, "population"]

[1] 651406

# Method 3: Subset with column extraction
subset(cities, name == "Rotterdam")$population

[1] 651406

All three return 651406.

Explain the difference between these three expressions when applied to a data frame df with columns x and y:
- df$x
- df[["x"]]
- df[, "x"]

Try running all three on your cities data frame. Is the output always identical?

df$x: Uses dollar-sign notation; convenient for interactive work, but the name after $ is taken literally, so it does not support programmatic column names (e.g., df$var looks for a column called var and returns NULL, even if var is a character object containing "x").
df[["x"]]: Extracts a single column as a vector; supports programmatic access (e.g., col <- "x"; df[[col]]) and is safer with non-syntactic column names.
df[, "x"]: Matrix-style subsetting. On a base data frame, selecting a single column returns a vector by default (use drop = FALSE to keep a one-column data frame); on a tibble it always returns a tibble.

cities$population

[1] 872680 651406 545838

cities[["population"]]

[1] 872680 651406 545838

cities[, "population"]

[1] 872680 651406 545838

The output is identical for a regular data frame: all three return the population column as a vector. The results differ for tibbles (such as those returned by read_csv()), where df[, "x"] returns a one-column tibble, while df$x and df[["x"]] still return a vector.
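The differences between the three forms can be checked on a small base data frame. This is a sketch using only base R; the data frame df and the variable col are made up for the example:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
col <- "x"   # a column name stored in a variable

by_dollar <- df$col                   # looks for a column literally named "col"
by_double <- df[[col]]                # uses the value of col, i.e. "x"
as_vector <- df[, "x"]                # base data frames drop to a vector here
as_frame  <- df[, "x", drop = FALSE]  # keeps a one-column data frame

is.null(by_dollar)
class(as_vector)
class(as_frame)
```

When a column name is stored in a variable, df[[col]] is the form that behaves as expected.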

Working Directories and File Paths

Your project folder structure looks like this:
```
my_project/
├── data/
│   └── gdp.csv
├── scripts/
│   └── analysis.R
└── report.qmd
```
If your working directory is my_project/scripts/, write the relative path to access gdp.csv. Then write the equivalent here() call (assuming my_project is your Positron project root).

"../data/gdp.csv"          # Relative path from my_project/scripts/

[1] "../data/gdp.csv"

here::here("data", "gdp.csv")  # here() call — always relative to project root

[1] "/home/bas/Documents/git/iads_website/data/gdp.csv"

Using the here package approach, what single command would reliably read gdp.csv regardless of your current working directory (assuming you’ve opened my_project as your Positron project)?

library(readr)
read_csv(here::here("data", "gdp.csv"))

What does the command list.files("../data") do when your working directory is my_project/scripts/?

Lists all files in the ../data directory relative to scripts/—i.e., it lists files in my_project/data/.

Why is using here("data", "filename.csv") generally safer than "./data/filename.csv" in scripts?

here() constructs paths relative to the project root (detected via .Rproj, .here, or version control files), making scripts portable across sessions and users. "./data/..." depends on the current working directory, which may change during an R session and cause path failures.
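The dependence of "./data/..." on the working directory can be demonstrated with base R alone. The sketch below builds a throwaway copy of the folder structure in a temporary directory; file.path(project, ...) stands in for what here::here() would do, and all object names are hypothetical:

```r
# Build a throwaway project in a temporary folder
project <- file.path(tempdir(), "my_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "scripts"), showWarnings = FALSE)
writeLines("country,gdp", file.path(project, "data", "gdp.csv"))

old_wd <- setwd(project)                       # working directory = project root
found_from_root <- file.exists("./data/gdp.csv")

setwd(file.path(project, "scripts"))           # working directory = scripts/
found_from_scripts <- file.exists("./data/gdp.csv")

# A path anchored at the project root works from either location
found_anchored <- file.exists(file.path(project, "data", "gdp.csv"))

setwd(old_wd)                                  # restore the original directory

c(found_from_root, found_from_scripts, found_anchored)
```

The same relative path succeeds from the project root and fails from scripts/, while the anchored path is found in both cases.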

Write R code to:
- Check your current working directory
- List all .csv files in your project’s data/ folder
- Change your working directory to the project root (without hardcoding the full path)

getwd()

[1] "/home/bas/Documents/git/iads_website/tutorial_solutions"

list.files(here::here("data"), pattern = "\\.csv$", full.names = TRUE)

character(0)

setwd(here::here())
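The pattern argument of list.files() is a regular expression, which is why the call above uses "\\.csv$" rather than "csv". A small sketch with grepl() on made-up file names shows what the escaped dot and the end-of-string anchor do:

```r
# Hypothetical file names
files <- c("gdp.csv", "notes.csv.bak", "csvdata.txt", "students.csv")

# "\\." matches a literal dot; "$" anchors the match at the end of the name
strict <- grepl("\\.csv$", files)

# Without the escape and anchor, any name containing "csv" matches
loose <- grepl("csv", files)

files[strict]
files[loose]
```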

Reading CSV Files

Download the students.csv file here and put it into a folder named tutorials in your working directory. Read the file students.csv from your tutorials/ folder using read_delim(). How many rows and columns does the resulting data frame have?

library(readr)
students <- read_delim(here::here("tutorials", "students.csv"))

Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (1): name
dbl (1): id
num (1): gpa
lgl (1): has_job

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nrow(students)   # 3 rows

[1] 3

ncol(students)   # 4 columns

[1] 4

The same students.csv file is available online at: http://basm92.quarto.pub/intro-to-applied-data-science/tutorials/students.csv. Write the code to read this remote file directly into R.

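# The file is semicolon-delimited (see the column specification above), so read_delim() is used, as for the local copy
students_remote <- read_delim("http://basm92.quarto.pub/intro-to-applied-data-science/tutorials/students.csv")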

When reading a CSV file with read_csv(), you notice that a column containing postal codes (e.g., “1012 AB”) is being converted to numeric. How would you prevent this and keep it as character/text?

read_csv("file.csv", col_types = cols(postal_code = col_character()))
# OR use col_types = cols(.default = col_character()) to read all columns as text, if appropriate
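The effect of col_types can be tried without a file on disk. The sketch below assumes readr 2.0 or later, where I() marks a string as literal data; the inline data and object names are made up, and the codes are purely numeric so that type guessing picks a number:

```r
library(readr)

# Hypothetical inline data standing in for file.csv
raw <- I("postal_code,city\n1012,Amsterdam\n3011,Rotterdam\n")

# Default guessing: an all-digit column is read as a number
guessed <- read_csv(raw, show_col_types = FALSE)

# Override a single column and let readr guess the rest
as_text <- read_csv(raw, col_types = cols(postal_code = col_character()))

# Read every column as text
all_text <- read_csv(raw, col_types = cols(.default = col_character()))

class(guessed$postal_code)
class(as_text$postal_code)
```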

What is the key practical difference between read_csv() (from readr) and read.csv() (base R) when importing large datasets? Name at least two advantages of read_csv().

read_csv() is significantly faster and more memory-efficient for large files.
It provides a progress bar during import.
It never converts strings to factors (read.csv() did so by default before R 4.0.0) and it reports the column types it guessed.
It returns a tibble, which prints neatly and avoids row name complications.

After reading gdp_data <- read_csv(here("tutorials", "gdp.csv")), write code to:
- View the first 6 rows
- Get the column names
- Check the data types of each column

gdp_data <- read_csv(here::here("tutorials", "gdp.csv"))

Rows: 30 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (4): year, gdp_per_capita, population_millions, inflation_rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(gdp_data)        # First 6 rows

# A tibble: 6 × 5
  country      year gdp_per_capita population_millions inflation_rate
  <chr>       <dbl>          <dbl>               <dbl>          <dbl>
1 Netherlands  2019          52300                17.3            4.8
2 Netherlands  2020          51800                17.4            7.9
3 Netherlands  2021          54200                17.5            3.8
4 Netherlands  2022          56700                17.6            3.9
5 Netherlands  2023          58900                17.7            6.5
6 Belgium      2019          47800                11.4            4.5

names(gdp_data)       # Column names

[1] "country"             "year"                "gdp_per_capita"     
[4] "population_millions" "inflation_rate"

str(gdp_data)         # Structure and data types

spc_tbl_ [30 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ country            : chr [1:30] "Netherlands" "Netherlands" "Netherlands" "Netherlands" ...
 $ year               : num [1:30] 2019 2020 2021 2022 2023 ...
 $ gdp_per_capita     : num [1:30] 52300 51800 54200 56700 58900 47800 NA 48900 50100 51300 ...
 $ population_millions: num [1:30] 17.3 17.4 17.5 17.6 17.7 11.4 11.5 11.6 11.7 11.8 ...
 $ inflation_rate     : num [1:30] 4.8 7.9 3.8 3.9 6.5 4.5 8.2 1.5 7.6 2.8 ...
 - attr(*, "spec")=
  .. cols(
  ..   country = col_character(),
  ..   year = col_double(),
  ..   gdp_per_capita = col_double(),
  ..   population_millions = col_double(),
  ..   inflation_rate = col_double()
  .. )
 - attr(*, "problems")=<pointer: 0x55f573ec8c40>

Reading Excel Files

You need to read an Excel file eurostat.xlsx that contains multiple sheets.¹ How would you:
- List all available sheet names in the file?
- Read the sheet named “Population_2023”?

library(readxl)
library(here)

here() starts at /home/bas/Documents/git/iads_website

file_location <- here("tutorials", "eurostat.xlsx") # Assumes eurostat.xlsx is in your tutorials/ folder
excel_sheets(file_location) # List sheet names

[1] "Sheet1"          "Population_2023"

read_excel(file_location, sheet = "Population_2023")

# A tibble: 5 × 2
  country     Population_2023
  <chr>                 <dbl>
1 Netherlands              18
2 Germany                  83
3 Belgium                  11
4 France                   67
5 Italy                    63

When reading an Excel file with read_excel(), you notice that date columns are being imported as character strings instead of proper dates. What parameter would you use to specify the correct column type during import?

Use the col_types parameter:

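read_excel("file.xlsx", col_types = c("text", "date", "numeric"))  # one type per column, in column order
# read_excel() has no locale argument. If the dates are stored as text in the workbook,
# import as usual and convert afterwards (df and its date column are hypothetical names):
# df$date <- as.Date(df$date, format = "%Y-%m-%d")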

An Excel file trade_data.xlsx has column headers starting on row 3 (rows 1-2 contain metadata). How would you skip these first two rows when reading the data?

read_excel("trade_data.xlsx", skip = 2)

Reading Text Files and Integration

You receive a tab-delimited text file survey_results.txt where missing values are coded as “NA” and “MISSING”.² Write code to read this file while treating both codes as missing values (NA).

read_tsv(
  here::here("tutorials", "survey_results.txt"),
  na = c("NA", "MISSING")
)

Rows: 10 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): gender, comments
dbl (4): respondent_id, age, satisfaction, score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 10 × 6
   respondent_id   age gender satisfaction score comments            
           <dbl> <dbl> <chr>         <dbl> <dbl> <chr>               
 1           101    28 Male              5    92 Great service       
 2           102    35 Female            4    85 <NA>                
 3           103    42 Male             NA    78 <NA>                
 4           104    29 Female            5    95 Excellent experience
 5           105    51 Male              3    70 <NA>                
 6           106    33 Female           NA    NA <NA>                
 7           107    45 Male              4    88 Good but slow       
 8           108    39 Female            2    NA Needs improvement   
 9           109    26 Male              5    97 Perfect             
10           110    48 Female           NA    75 <NA>
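The effect of the na argument is easiest to see by reading the same data with and without it. A minimal sketch on made-up inline data, assuming readr 2.0 or later for I():

```r
library(readr)

# Hypothetical tab-delimited data with both missing-value codes
raw <- I("id\tscore\tcomments\n1\t92\tGreat\n2\tMISSING\tNA\n3\t78\tMISSING\n")

# Default: only "" and "NA" are treated as missing
default_na <- read_tsv(raw, show_col_types = FALSE)

# Custom: "MISSING" is treated as missing as well
custom_na <- read_tsv(raw, na = c("NA", "MISSING"), show_col_types = FALSE)

class(default_na$score)   # "MISSING" forces the column to character
class(custom_na$score)    # numeric once "MISSING" becomes NA
```

Without the custom codes, a single "MISSING" entry is enough to turn an otherwise numeric column into text.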

Combine multiple skills:
- Read gdp.csv from your tutorials folder
- Filter to keep only observations with GDP per capita > 40000
- Calculate the average GDP per capita for these countries
- Store the result in an object called high_income_avg

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

gdp_data <- read_csv(here::here("tutorials", "gdp.csv"))

Rows: 30 Columns: 5

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (4): year, gdp_per_capita, population_millions, inflation_rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

high_income <- subset(gdp_data, gdp_per_capita > 40000)
high_income_avg <- mean(high_income$gdp_per_capita)
high_income_avg

[1] 48738.89
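The gdp_per_capita column contains missing values (see the str() output earlier), yet the mean above is not NA. That is because subset() treats a missing condition as FALSE and drops those rows. The sketch below illustrates this on a made-up miniature data frame; the values are hypothetical and are not taken from gdp.csv:

```r
# Hypothetical miniature of gdp_data, including one missing value
gdp <- data.frame(
  country        = c("A", "B", "C", "D"),
  gdp_per_capita = c(52000, NA, 38000, 47000)
)

# subset() treats an NA condition as FALSE, so row B is dropped
high_income <- subset(gdp, gdp_per_capita > 40000)
mean(high_income$gdp_per_capita)

# Plain logical indexing keeps an NA for row B, so the mean is NA ...
values <- gdp$gdp_per_capita[gdp$gdp_per_capita > 40000]
mean(values)

# ... unless missing values are removed explicitly
mean(values, na.rm = TRUE)
```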

Footnotes

¹ The file is available here.
² The file is available here.