Tutorial 3

Answer Model

HTML and DOM Fundamentals

The parent element of <span class="price"> is <div class="product">.
(The <span> is a direct child of this <div>)
id vs class:
- id: Must be unique per page (one element only). Use for scraping single, critical elements (e.g., main content container).
- class: Can be reused across multiple elements. Use for scraping repeating structures (e.g., product cards, list items).
Scraping tip: Prefer class for scalable patterns; use id only for guaranteed-unique anchors. Avoid id if values are dynamically generated (e.g., id="product-123").
The two <p> elements are not siblings.
- First <p>: Direct child of #container
- Second <p>: Child of <section>, which is a child of #container
- They share a common ancestor (#container) but reside in different branches of the DOM tree (no direct parent/child/sibling relationship).
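
The relationships above can be checked programmatically. The following is a minimal sketch, assuming markup that mirrors the structure described (the HTML string and the variable names are illustrative, not taken from the tutorial page):

```r
library(rvest)
library(xml2)

# Hypothetical markup: one <p> directly in #container, one nested in <section>.
page <- minimal_html('
  <div id="container">
    <p>First</p>
    <section><p>Second</p></section>
  </div>')

first_p  <- html_element(page, "#container > p")
second_p <- html_element(page, "#container > section > p")

# Each <p> has a different parent, so they cannot be siblings.
xml_name(xml_parent(first_p))   # "div"
xml_name(xml_parent(second_p))  # "section"

# The only element sibling of the first <p> is the <section>, not the other <p>.
xml_name(xml_siblings(first_p)) # "section"
```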
Why rvest fails with JS content:
rvest's read_html() parses only the static HTML returned by the server and does not execute JavaScript. Content that JS injects after the initial page load is therefore absent from the parsed document.

Alternatives:
- RSelenium (browser automation)
- splashr (headless browser via Docker)
- playwright (modern browser control; primarily used from Node.js or Python)

Always verify whether an API endpoint provides the data first (more efficient than browser automation).

CSS Selector Syntax

```
a[href^="https://"]
```
```
div.product.featured > span.price.sale
```
(Uses child combinator > and combines both classes on <span> to avoid matching other .price spans)
Difference:
- div p: Selects all descendant <p> elements inside <div> (any depth)
- div > p: Selects only direct children <p> elements of <div>
Example where results differ:
```
<div>
  <p>Matched by both</p>
  <article>
    <p>Matched ONLY by "div p"</p>
  </article>
</div>
```
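
To confirm the difference, the same fragment can be parsed with rvest and each selector counted. This is a small sketch; `minimal_html()` is used only to wrap the fragment in a full document:

```r
library(rvest)

page <- minimal_html('
  <div>
    <p>Matched by both</p>
    <article>
      <p>Matched ONLY by "div p"</p>
    </article>
  </div>')

descendants <- html_elements(page, "div p")    # any depth below <div>
children    <- html_elements(page, "div > p")  # direct children only

length(descendants)  # 2
length(children)     # 1
```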
```
ul > li:nth-of-type(3)
```
(:nth-of-type(3) targets the 3rd <li> specifically; > ensures direct child of <ul>)
```
[data-category*="electronics"]
```
(*= matches attribute values containing the substring)
```
span.stock.unavailable
```
(Combines both classes to uniquely identify the “Out of Stock” span; avoids matching .available)
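
The selectors in this section can be tried against a small test page. The markup below is invented for illustration (the URLs, category values, and stock labels are assumptions), so treat it as a sketch rather than the tutorial's actual page:

```r
library(rvest)

# Hypothetical markup covering each selector discussed above.
page <- minimal_html('
  <a href="https://example.com/a">secure</a>
  <a href="http://example.com/b">insecure</a>
  <ul>
    <li>one</li><li>two</li><li>three</li>
  </ul>
  <div data-category="home-electronics">TV</div>
  <div data-category="garden">Hose</div>
  <span class="stock available">In Stock</span>
  <span class="stock unavailable">Out of Stock</span>')

# ^= matches attribute values that start with the given prefix.
secure_links <- html_attr(html_elements(page, 'a[href^="https://"]'), "href")

# :nth-of-type(3) picks the third <li> among its siblings.
third_item <- html_text(html_elements(page, "ul > li:nth-of-type(3)"))

# *= matches attribute values that contain the substring anywhere.
electronics <- html_text(html_elements(page, '[data-category*="electronics"]'))

# Chaining classes requires the element to carry both of them.
out_of_stock <- html_text(html_elements(page, "span.stock.unavailable"))
```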

`rvest` Implementation

Purpose of trimws(): Removes leading/trailing whitespace (including newlines, tabs) from extracted text.

Example:

# Raw HTML: <span>  $24.99  </span>
html_text(html_element(page, "span"))        # Returns "  $24.99  "
trimws(html_text(html_element(page, "span"))) # Returns "$24.99"

html_element() vs html_elements():

| Function | Use Case | No Match Behavior |
| --- | --- | --- |
| `html_element()` | Extract one element per input (e.g., page title) | Returns a missing element (no error); `html_text()`/`html_attr()` then yield `NA` |
| `html_elements()` | Extract all matching elements (e.g., all product cards) | Returns an empty node set (no error) |

Neither function errors on a non-match. Use html_element() when you want exactly one value per input (it keeps vectors aligned by inserting missing elements); use html_elements() to collect every match.
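
A short sketch of the alignment behavior, assuming two hypothetical product cards where only the first has a price:

```r
library(rvest)

# Hypothetical cards: the second one has no price element.
page <- minimal_html('
  <div class="card"><h3>A</h3><span class="price">$1</span></div>
  <div class="card"><h3>B</h3></div>')

cards <- html_elements(page, "div.card")

# One result per card; the card without a price yields NA.
per_card <- html_text(html_element(cards, "span.price"))

# Every match across all cards; the card without a price contributes nothing.
all_matches <- html_text(html_elements(cards, "span.price"))

length(per_card)     # 2
length(all_matches)  # 1

# A selector that matches nothing gives an empty node set, not an error.
length(html_elements(page, "table"))  # 0
```

Because `per_card` has one entry per card, it can sit next to the card titles in a data frame, while `all_matches` cannot be lined up without extra bookkeeping.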

Extract image URLs:

page |>
  html_elements("img") |>
  html_attr("src")  # Extracts "src" attribute values

Data structure: tables is a list of data frames (one per matched <table>). Access second table:
```
tables[[2]]  # Double brackets for list element extraction
```
What default does in html_attr():

The default argument sets the value returned when the requested attribute is missing (or the element itself is missing). Its built-in value is NA_character_, so writing default = NA makes that behavior explicit: a book without a title attribute yields NA rather than an error or an empty string.
```
# If html_element(book, "h3 a") finds no match, it returns a "missing" node,
# and html_attr() then returns the supplied default:
html_attr(html_element(book, "h3 a"), "title", default = NA)  # -> NA when absent
```
Why it helps when scraping many products: html_element() (singular) always returns one node per input, using a missing node where nothing matched; it does not raise an error. html_attr() then returns the default for each missing node or attribute, so every book contributes exactly one (possibly NA) value, and the extracted vectors all stay the same length and line up correctly when combined into a data frame. Since NA_character_ is already the default, default = NA mainly documents intent; supply another value (e.g., default = "unknown") if you prefer an explicit sentinel.

Note: html_element() has no argument that switches between raising an error and returning a missing node. A non-matching selector always produces a missing element rather than an error, so no tryCatch() is needed just to guard against absent elements or attributes.
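
As an illustration of how the vectors stay aligned, here is a minimal sketch with three hypothetical books: one complete, one lacking a title attribute, and one lacking the link entirely. The markup, the "unknown" sentinel, and the variable names are assumptions made for this example:

```r
library(rvest)

# Hypothetical book cards with progressively less information.
books <- minimal_html('
  <article class="book"><h3><a title="Book One" href="/1">Book...</a></h3></article>
  <article class="book"><h3><a href="/2">Untitled link</a></h3></article>
  <article class="book"><p>No heading at all</p></article>') |>
  html_elements("article.book")

# One (possibly missing) link node per book.
links <- html_element(books, "h3 a")

catalog <- data.frame(
  title = html_attr(links, "title", default = "unknown"),  # explicit sentinel
  href  = html_attr(links, "href")                         # falls back to NA
)

nrow(catalog)  # 3, one row per book
```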

Workflow and Best Practices

Generate URLs:

urls <- paste0("https://store.com/item?id=", 1:50)

Purpose of Sys.sleep(2):
- Ethical: Respects server resources; helps honor a Crawl-delay directive in robots.txt (use a longer pause if the site asks for more than 2 seconds)
- Practical: Reduces the risk of IP bans, rate-limiting blocks, or triggering anti-bot systems
Always prioritize keeping server load low over scraping speed.
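
One way to combine URL generation with a fixed pause is sketched below. `scrape_politely()` and its `fetch` argument are hypothetical names introduced for this example; in real use `fetch` would typically be `rvest::read_html`, while here a stand-in function keeps the sketch runnable offline:

```r
# Hypothetical helper: fetch each URL in turn, pausing between requests.
scrape_politely <- function(urls, fetch, delay = 2) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # Assign via list() so a failed request (NULL) keeps its slot
    # instead of deleting the list element.
    results[i] <- list(tryCatch(fetch(urls[[i]]), error = function(e) NULL))
    # No need to wait after the final request.
    if (i < length(urls)) Sys.sleep(delay)
  }
  names(results) <- urls
  results
}

urls <- paste0("https://store.com/item?id=", 1:3)

# Stand-in fetcher that fails on the second URL; delay = 0 only for the demo.
fake_fetch <- function(u) if (grepl("id=2$", u)) stop("timeout") else toupper(u)
pages <- scrape_politely(urls, fetch = fake_fetch, delay = 0)
```

Catching errors per request means one failed page does not discard the pages already collected.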
Pre-scraping checks:
1. robots.txt: Check https://target-site.com/robots.txt for disallowed paths
2. Terms of Service: Verify scraping isn’t prohibited in legal/terms documentation
3. Data sensitivity: Confirm data isn’t personal, copyrighted, or behind auth walls without permission
Bonus: Check for a public API first (more reliable and ethical).
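
The robots.txt check can be partly automated. The sketch below is deliberately simplified: it works on a hypothetical robots.txt held in a character vector, ignores User-agent groups, Allow rules, and wildcards, and `is_allowed()` is an illustrative helper rather than a complete parser. A dedicated parser is preferable for real projects.

```r
# Hypothetical robots.txt content; in practice it could be read with
# readLines("https://target-site.com/robots.txt").
robots <- c(
  "User-agent: *",
  "Crawl-delay: 5",
  "Disallow: /checkout/",
  "Disallow: /account/"
)

# Collect the disallowed path prefixes, dropping empty "Disallow:" lines.
disallowed <- trimws(sub("^Disallow:", "", grep("^Disallow:", robots, value = TRUE)))
disallowed <- disallowed[nzchar(disallowed)]

# Read the requested delay (in seconds), if one is given.
crawl_delay <- as.numeric(
  sub("^Crawl-delay:\\s*", "", grep("^Crawl-delay:", robots, value = TRUE))
)

# A path is treated as allowed when it starts with none of the prefixes.
is_allowed <- function(path) !any(startsWith(path, disallowed))

is_allowed("/item?id=1")      # TRUE
is_allowed("/checkout/cart")  # FALSE
```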

Convert prices to numeric:

library(readr)
prices_raw <- c("$19.99", "$24.50", "$9.75")
prices_clean <- parse_number(prices_raw)  # Handles currency symbols automatically
# Output: [1] 19.99 24.50  9.75

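Alternative (base R): as.numeric(gsub("[$,]", "", prices_raw)), which strips both the currency symbol and thousands separators such as the comma in "$1,299".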
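
Thousands separators are a common way for a base R conversion to go wrong. The following sketch, which assumes US-style formatting and an invented price vector, contrasts removing only the dollar sign with removing the comma as well:

```r
prices_with_commas <- c("$19.99", "$1,299", "$9.75")

# Removing only "$" leaves "1,299", which as.numeric() cannot parse.
naive <- suppressWarnings(as.numeric(sub("\\$", "", prices_with_commas)))

# Removing both "$" and "," yields clean numeric input.
robust <- as.numeric(gsub("[$,]", "", prices_with_commas))

naive   # 19.99    NA  9.75
robust  # 19.99 1299.00  9.75
```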

Why save raw extracted data: enables reproducible debugging and reprocessing without re-scraping.

Time-saving scenario:
> After cleaning 10,000 product prices, you discover a regex error converted “$1,299” to 1 (not 1299). With raw HTML snippets saved, you fix the cleaning script and reprocess in seconds. Without raw data, you must re-scrape all pages (risking blocks, delays, or changed content).

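Best practice: Save raw HTML fragments alongside cleaned data. Convert them to text first (e.g., write_rds(as.character(raw_html), "raw_data.rds")), because parsed xml2 objects hold external pointers that do not survive serialization.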
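
A minimal sketch of the save-and-reprocess cycle follows. It assumes a single hypothetical price fragment and uses base R's `saveRDS()`/`readRDS()` with a temporary file; the fragment is stored as text and parsed again on reload:

```r
library(rvest)

# Hypothetical scraped fragment.
page <- minimal_html('<span class="price">  $1,299  </span>')

# Store the fragment as plain text, not as a parsed node.
raw_html <- as.character(html_elements(page, "span.price"))

path <- tempfile(fileext = ".rds")
saveRDS(raw_html, path)

# Later: reload the text, parse it again, and rerun the (fixed) cleaning step.
restored   <- readRDS(path)
reparsed   <- read_html(paste(restored, collapse = "\n"))
price_text <- trimws(html_text(html_element(reparsed, "span.price")))
price      <- as.numeric(gsub("[$,]", "", price_text))
```

No request is sent during reprocessing, so the cleaning logic can be revised as often as needed.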
