Tutorial 2

Answer Model

R Fundamentals

Create a numeric vector called exam_scores containing the values 65, 78, 92, 88, and 73. Calculate the mean and standard deviation using built-in functions.

exam_scores <- c(65, 78, 92, 88, 73)
mean_score <- mean(exam_scores)
sd_score <- sd(exam_scores)

mean_score

[1] 79.2

sd_score

[1] 10.98636

Explain the difference between these three expressions when applied to a data frame df with a column named “price”:
1. df$price
2. df[["price"]]
3. df["price"]
What class does each return?

df$price: Returns a vector (atomic vector of the column’s underlying type). Most convenient for interactive use, but the column name must be typed literally, so it cannot take a name stored in a variable.
df[["price"]]: Returns a vector (same type as $). Preferred for programmatic access since it accepts character strings and handles special characters in column names.
df["price"]: Returns a data.frame containing only the “price” column (one-column subset). Maintains data frame structure.
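The three return types can be confirmed with class() on a small data frame. This is a minimal sketch; the df and its prices are made up for illustration and are not part of the tutorial data.

```r
# Hypothetical data frame for illustration only
df <- data.frame(price = c(9.99, 14.50, 3.25))

class(df$price)        # "numeric"
class(df[["price"]])   # "numeric"
class(df["price"])     # "data.frame"

# [[ ]] accepts a column name stored in a variable; $ does not
col_name <- "price"
identical(df[[col_name]], df$price)   # TRUE
```

The last line is the practical reason to prefer [[ ]] inside functions and loops, where the column name is usually held in a variable.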

Why would the following code produce an error? Fix it:
student name <- "Maria"
age <- twenty five

Errors:

Object names cannot contain spaces (student name is invalid)
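Character values must be quoted (unquoted, twenty five is read as two object names separated by a space, which is a syntax error; even a single unquoted word would be looked up as an object that does not exist)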

Fixed code:

student_name <- "Maria"
age <- "twenty five"  # Or numeric: age <- 25

Create a data frame called countries with three columns: name (character), population (numeric in millions), and continent (character). Include data for at least three countries.

countries <- data.frame(
  name = c("Netherlands", "Brazil", "Japan"),
  population = c(17.8, 214.3, 125.7),
  continent = c("Europe", "South America", "Asia"),
  stringsAsFactors = FALSE
)
countries

         name population     continent
1 Netherlands       17.8        Europe
2      Brazil      214.3 South America
3       Japan      125.7          Asia

What would be the result of executing x <- 10 followed by x <- x + 5? Explain what happens in memory during this operation.

After execution, x equals 15.
Memory behavior:

First assignment (x <- 10) allocates memory for the value 10 and binds the symbol x to it.
Second assignment (x <- x + 5) retrieves the current value of x (10), computes 10 + 5 = 15, stores the result as a new object, and rebinds x to it. The original value 10 becomes eligible for garbage collection once nothing else refers to it. R uses copy-on-modify semantics: changing or reassigning one name never alters a value seen through another name, although R may reuse memory in place when an object has only a single reference.
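The rebinding behavior can be observed without any memory tools by keeping a second name bound to the original value. A minimal sketch; the extra name y is introduced here only for illustration.

```r
x <- 10
y <- x        # y is bound to the same value as x
x <- x + 5    # x is rebound to a new value; y is not affected

x   # 15
y   # 10
```

If reassignment modified the value in place, y would also show 15. It does not, which is the user-visible consequence of copy-on-modify.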

You run ls() and see objects named data, data_clean, and data_final. Why is this naming convention preferable to repeatedly overwriting a single object called data?

This convention preserves data provenance and enables reproducibility:

Allows tracing the transformation pipeline (raw → cleaned → final)
Enables debugging by inspecting intermediate states
Prevents irreversible loss of original data if errors occur during cleaning
Supports collaborative work where others can understand processing steps
Facilitates rollback to earlier stages without rerunning entire workflows
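As a rough illustration of the staged naming, the sketch below keeps each step in its own object. The data values and cleaning steps are invented for the example; only the object names come from the question.

```r
# Hypothetical raw data with missing values and inconsistent capitalization
data <- data.frame(
  city   = c("paris", "Berlin", NA),
  temp_c = c(21.5, NA, 19.0),
  stringsAsFactors = FALSE
)

# Step 1: drop rows with any missing value
data_clean <- data[complete.cases(data), ]

# Step 2: standardize the text column
data_final <- transform(data_clean, city = toupper(city))

nrow(data)        # 3, the raw data is still intact
nrow(data_final)  # 1
```

Because data is never overwritten, a mistake in either step can be corrected by rerunning only that step.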

Indexing and Data Manipulation

Given vector v <- c(5, 10, 15, 20, 25), write R code to:
1. Extract the third element
2. Extract elements 2 through 4
3. Extract all elements greater than 15

v <- c(5, 10, 15, 20, 25)

# 1. Third element
v[3]

[1] 15

# 2. Elements 2 through 4
v[2:4]

[1] 10 15 20

# 3. Elements > 15
v[v > 15]

[1] 20 25

For a data frame employees with columns name, department, and salary:
1. Write code to get all employees in the “Finance” department
2. Write code to get only the names of employees earning more than 70000
3. Explain the difference between employees[3, 2] and employees[3, "department"]

# 1. All Finance employees
employees[employees$department == "Finance", ]

# 2. Names of high earners
employees$name[employees$salary > 70000]

# 3. Explanation:
# employees[3, 2] accesses row 3, column 2 by POSITION (numeric index)
# employees[3, "department"] accesses row 3, column named "department" by NAME
# Both return the same value as long as "department" is the second column,
# but the named approach keeps working if the columns are reordered
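The question does not supply an employees data frame, so the answers above cannot be run as written. The sketch below uses invented rows (names and salaries are assumptions) to show the three answers in action, including what happens to positional indexing after the columns are reordered.

```r
# Hypothetical employees data for illustration only
employees <- data.frame(
  name       = c("Ana", "Ben", "Chloe", "Dev"),
  department = c("Finance", "IT", "Finance", "HR"),
  salary     = c(72000, 65000, 58000, 81000),
  stringsAsFactors = FALSE
)

finance      <- employees[employees$department == "Finance", ]
high_earners <- employees$name[employees$salary > 70000]

employees[3, 2]              # "Finance"
employees[3, "department"]   # "Finance"

# After reordering columns, position 2 is now salary
reordered <- employees[, c("name", "salary", "department")]
reordered[3, 2]              # 58000
reordered[3, "department"]   # "Finance"
```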

Given list experiment <- list(trial1 = c(1.2, 1.5, 1.3), trial2 = c(2.1, 2.4, 2.0), success = TRUE), how would you:
1. Extract the entire trial1 vector?
2. Extract the second value from trial2?
3. Check if the experiment was successful?

experiment <- list(trial1 = c(1.2, 1.5, 1.3), trial2 = c(2.1, 2.4, 2.0), success = TRUE)

# 1. Entire trial1 vector
experiment$trial1          # or experiment[["trial1"]]

[1] 1.2 1.5 1.3

# 2. Second value from trial2
experiment$trial2[2]       # or experiment[["trial2"]][2]

[1] 2.4

# 3. Check success status
experiment$success         # Returns TRUE/FALSE

[1] TRUE

Why does R use 1-based indexing (first element is position 1) rather than 0-based indexing like some other programming languages? What common error might occur when someone assumes 0-based indexing?

R uses 1-based indexing primarily because of its origins in statistical computing (it descends from the S language), where human readability was prioritized: researchers naturally count starting from 1.

Common error: v[0] returns an empty vector (numeric(0)) rather than raising an error, so the mistake fails silently in loops or subsetting. For example, with the five-element vector v from above, for (i in 0:4) print(v[i]) prints numeric(0) on the first iteration, then elements 1 to 4, and never reaches the fifth element.
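The behavior of index 0 can be checked directly. This short sketch reuses the vector v from the earlier question; the name seen is introduced only to collect what the loop visits.

```r
v <- c(5, 10, 15, 20, 25)

v[0]           # numeric(0), not an error
length(v[0])   # 0

# A loop written with 0-based habits: index 0 contributes nothing,
# and the last element (25) is never visited
seen <- c()
for (i in 0:4) seen <- c(seen, v[i])
seen           # 5 10 15 20

# Idiomatic R: let seq_along() generate the valid indices
for (i in seq_along(v)) print(v[i])
```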

Create a logical vector that identifies which students in the students_df from Part A have grades above 8.0. Use this vector to subset the data frame to show only high-performing students.

# Assuming students_df has a 'grade' column
high_performers <- students_df$grade > 8.0
students_df[high_performers, ]
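Since students_df is defined elsewhere (Part A), here is a self-contained sketch with invented rows to show what the logical vector looks like before it is used for subsetting. The names and grades are assumptions.

```r
# Hypothetical stand-in for the students_df from Part A
students_df <- data.frame(
  name  = c("Lotte", "Sven", "Amira"),
  grade = c(8.5, 7.2, 9.1),
  stringsAsFactors = FALSE
)

high_performers <- students_df$grade > 8.0
high_performers                    # TRUE FALSE TRUE

top_students <- students_df[high_performers, ]
top_students$name                  # "Lotte" "Amira"
```

If the grade column can contain NA, wrapping the condition in which() avoids returning all-NA rows for the missing grades.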

API Concepts

Explain the restaurant analogy for APIs: who is the customer, who is the waiter, and who is the kitchen? Why is this analogy helpful for understanding API functionality?

Customer = Your application/client making the request
Waiter = The API (interface that takes requests and returns responses)
Kitchen = The server/backend system that processes requests and prepares data

This analogy clarifies that APIs act as intermediaries: you don’t need to know kitchen operations (server implementation details) to get your meal (data). You simply make a request through the waiter (API) using a standard protocol (menu), and receive a prepared response.

An API request returns status code 429. What does this mean, and what should you do in response? How is this different from status code 503?

429 Too Many Requests: Client has exceeded rate limits. Response: Implement exponential backoff, reduce request frequency, or check API documentation for quota limits.
503 Service Unavailable: Server is temporarily unable to handle requests, for example because of overload or maintenance (server-side issue). Response: Retry later with backoff; not caused by client behavior.
Key difference: 429 is client-induced (fix by throttling requests); 503 is server-induced (requires waiting for service restoration).
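Exponential backoff simply means doubling the wait between retries, usually up to a cap. The sketch below shows the arithmetic only, without making any network calls; backoff_delay(), should_retry(), and the default base and cap values are hypothetical choices, not taken from any particular API.

```r
# Hypothetical helper: seconds to wait before retry number `attempt`
backoff_delay <- function(attempt, base = 1, cap = 60) {
  min(cap, base * 2^(attempt - 1))
}

sapply(1:5, backoff_delay)   # 1 2 4 8 16
backoff_delay(10)            # 60 (capped)

# Hypothetical helper: both 429 and 503 are worth retrying after a pause
should_retry <- function(status) status %in% c(429, 503)

should_retry(429)   # TRUE
should_retry(401)   # FALSE
```

In a real script the delay would be passed to Sys.sleep() between attempts, and any wait time the API itself suggests should take priority.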

Deconstruct this URL into its components:
https://api.example.com/v2/products?category=electronics&limit=10&api_key=abc123
Identify: protocol, domain, endpoint, and all parameters.

Protocol: https
Domain: api.example.com
Endpoint: /v2/products
Parameters:
- category = "electronics"
- limit = "10"
- api_key = "abc123"
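The same decomposition can be reproduced in base R with string splitting. This is a rough sketch that only handles URLs shaped like the one in the question; the variable names are chosen for the example.

```r
url <- "https://api.example.com/v2/products?category=electronics&limit=10&api_key=abc123"

# Split off the protocol at "://"
parts    <- strsplit(url, "://", fixed = TRUE)[[1]]
protocol <- parts[1]

# Separate "domain/path" from the query string at "?"
rest_query <- strsplit(parts[2], "?", fixed = TRUE)[[1]]
domain     <- sub("/.*$", "", rest_query[1])     # everything before the first "/"
endpoint   <- sub("^[^/]+", "", rest_query[1])   # everything from the first "/"

# Split the query string into key=value pairs
pairs  <- strsplit(strsplit(rest_query[2], "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
params <- setNames(
  vapply(pairs, `[`, character(1), 2),   # values
  vapply(pairs, `[`, character(1), 1)    # keys
)

protocol   # "https"
domain     # "api.example.com"
endpoint   # "/v2/products"
params     # named character vector: category, limit, api_key
```

Note that every parameter value arrives as a character string, including limit. In practice, httr::parse_url() handles this more robustly, including edge cases this sketch ignores.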

Why do most APIs require authentication via API keys rather than allowing completely open access? Name two legitimate reasons API providers implement this requirement.

1. Usage monitoring and quota enforcement: Track client consumption to prevent abuse, allocate fair resource shares, and enable tiered pricing models.
2. Security and accountability: Identify malicious actors, restrict access to authorized users, and maintain audit trails for compliance (e.g., GDPR, HIPAA).

You need weather data for Paris, Berlin, and Rome. Why is it better to make three separate API requests (one per city) rather than downloading a complete global weather dataset containing billions of records?

Bandwidth efficiency: Transfer only needed data (~KB per city vs. GB/TB for global dataset)
Processing efficiency: Avoid filtering massive datasets locally (CPU/memory intensive)
Cost reduction: Many APIs charge by data volume transferred
Timeliness: Smaller requests complete faster with lower failure risk
Respect for provider resources: Prevents unnecessary server load on API infrastructure

JSON and Practical Implementation

Convert this JSON structure into its equivalent R objects (specify whether each becomes a vector, list, or data frame):

{
  "university": "Utrecht",
  "departments": ["Economics", "Computer Science", "Law"],
  "enrollment": [
    {"year": 2022, "students": 4500},
    {"year": 2023, "students": 4750}
  ]
}

"university" → character vector (length 1): "Utrecht"
"departments" → character vector: c("Economics", "Computer Science", "Law")
"enrollment" → data frame (after conversion):
```
data.frame(
  year = c(2022, 2023),
  students = c(4500, 4750)
)
```
Note: When parsed with jsonlite::fromJSON(), the entire structure becomes a list containing these components.
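To make the nesting concrete, the sketch below builds the equivalent list by hand in base R and inspects each component. The object name uni is an assumption for this example, and the result is only meant to mirror the overall shape described above (a parser may, for instance, store the whole numbers as integers rather than doubles).

```r
# Hand-built equivalent of the parsed JSON (no parsing library needed)
uni <- list(
  university  = "Utrecht",
  departments = c("Economics", "Computer Science", "Law"),
  enrollment  = data.frame(
    year     = c(2022, 2023),
    students = c(4500, 4750)
  )
)

class(uni)               # "list"
class(uni$university)    # "character"
length(uni$departments)  # 3
class(uni$enrollment)    # "data.frame"

# Nested access: enrollment in 2023
uni$enrollment$students[uni$enrollment$year == 2023]   # 4750
```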

When using GET() from the httr package, why should you always check status_code(response) before attempting to parse the content? What error might occur if you skip this step?

Checking the status code first stops you from parsing an error response (e.g., an HTML error page) as if it were valid data. Skipping this check may cause:

jsonlite::fromJSON() to fail with cryptic parsing errors when receiving HTML instead of JSON
Silent data corruption if partial/error content is misinterpreted as valid data
Wasted computation on unusable responses

Best practice: Verify status_code(response) == 200 before parsing, or call httr::stop_for_status(response) to turn any HTTP error into an R error.
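One way to enforce the check is a small wrapper that refuses to parse unless the request succeeded. The sketch below is illustrative only: safe_parse() is a hypothetical helper, and it takes the status code and body as plain arguments so that it runs without a network call.

```r
# Hypothetical helper: parse the body only when the request succeeded
safe_parse <- function(status, body, parser = identity) {
  if (status != 200) {
    stop(sprintf("Request failed with HTTP status %s; body not parsed", status),
         call. = FALSE)
  }
  parser(body)
}

# Successful request: the body is handed to the parser (identity by default)
safe_parse(200, "{\"ok\": true}")

# Failed request: stops before any parsing happens
result <- tryCatch(
  safe_parse(404, "<html>Not Found</html>"),
  error = function(e) conditionMessage(e)
)
result   # "Request failed with HTTP status 404; body not parsed"
```

With httr, the call would look roughly like safe_parse(status_code(response), content(response, as = "text"), jsonlite::fromJSON).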

You receive this error when making an API call: Error in open.connection(con, "rb") : HTTP error 401. What is the most likely cause, and what steps should you take to resolve it?

Cause: 401 Unauthorized indicates missing, invalid, or expired authentication credentials (e.g., incorrect API key).
Resolution steps:
1. Verify API key spelling and format
2. Confirm key hasn’t expired or been revoked
3. Check authentication method (header vs. query parameter) matches API requirements
4. Ensure key has required permissions/scopes for the endpoint
5. Test key validity using a simple endpoint (e.g., /status)

Design a safe workflow for using an API key in your R project that prevents accidental exposure when sharing code on GitHub. Describe two specific techniques you would implement.

1. Store keys in .Renviron:
  Use usethis::edit_r_environ() to add WEATHER_API_KEY="your_key" to your user-level .Renviron file (outside the project directory). Access it via Sys.getenv("WEATHER_API_KEY") in scripts. Never commit a .Renviron file to version control; if you use a project-level one, list it in .gitignore.
2. Include startup validation in scripts:
```
api_key <- Sys.getenv("WEATHER_API_KEY")
if (api_key == "") stop("API key not found! Set WEATHER_API_KEY in .Renviron")
```
Because the key is read from the environment rather than hard-coded, the script itself can be shared without exposing credentials, and it stops with a clear message if the key is missing.