Tutorial 3

Answer Model

HTML and DOM Fundamentals

  1. The parent element of <span class="price"> is <div class="product">.
    (The <span> is a direct child of this <div>)

  2. id vs class:

    • id: Must be unique per page (one element only). Use for scraping single, critical elements (e.g., main content container).
    • class: Can be reused across multiple elements. Use for scraping repeating structures (e.g., product cards, list items).
      Scraping tip: Prefer class for scalable patterns; use id only for guaranteed-unique anchors. Avoid id if values are dynamically generated (e.g., id="product-123").
  3. The two <p> elements are not siblings.

    • First <p>: Direct child of #container
    • Second <p>: Child of <section>, which is a child of #container
    • They share a common ancestor (#container) but reside in different branches of the DOM tree (no direct parent/child/sibling relationship).
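
    The relationship can be checked in R. The sketch below is illustrative only: the markup is a hypothetical reconstruction of the structure described above, not the tutorial's original HTML, and it assumes the rvest and xml2 packages are installed.

    ```r
    library(rvest)

    # Hypothetical markup reconstructing the structure described above
    page <- minimal_html('
      <div id="container">
        <p>First</p>
        <section><p>Second</p></section>
      </div>
    ')

    first_p  <- html_element(page, "#container > p")
    second_p <- html_element(page, "#container > section > p")

    # Element siblings of the first <p>: only <section>, not the second <p>
    sibling_tags <- html_name(xml2::xml_siblings(first_p))

    # The two <p> elements have different parents
    parent_tags <- c(
      html_name(xml2::xml_parent(first_p)),
      html_name(xml2::xml_parent(second_p))
    )

    print(sibling_tags)
    print(parent_tags)
    ```

    Because the parents differ (div versus section), the two <p> elements cannot be siblings.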
  4. Why rvest fails with JS content:
    rvest parses static HTML only and does not execute JavaScript. Dynamic content (loaded via JS after initial page load) is absent in the raw HTML source.

    Alternatives:

    • RSelenium (browser automation)
    • splashr (headless browser via Docker)
    • Playwright (modern browser control, typically driven from Python or Node.js)
      Before resorting to browser automation, check whether an API endpoint already provides the data; it is usually more efficient.

CSS Selector Syntax

  1. a[href^="https://"]
  2. div.product.featured > span.price.sale

    (Uses child combinator > and combines both classes on <span> to avoid matching other .price spans)

  3. Difference:

    • div p: Selects all descendant <p> elements inside <div> (any depth)
    • div > p: Selects only direct children <p> elements of <div>

    Example where results differ:

    <div>
      <p>Matched by both</p>
      <article>
        <p>Matched ONLY by "div p"</p>
      </article>
    </div>
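
    To confirm the difference, both selectors can be run against the example markup. This is a minimal sketch assuming the rvest package is installed; minimal_html() is used only to build a small test page.

    ```r
    library(rvest)

    # The example markup from above, wrapped in a minimal page
    page <- minimal_html('
      <div>
        <p>Matched by both</p>
        <article>
          <p>Matched ONLY by "div p"</p>
        </article>
      </div>
    ')

    # Descendant combinator: every <p> at any depth inside <div>
    descendants <- page |> html_elements("div p") |> html_text2()

    # Child combinator: only <p> elements whose parent is the <div>
    children <- page |> html_elements("div > p") |> html_text2()

    print(length(descendants))  # 2
    print(length(children))     # 1
    ```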
  4. ul > li:nth-of-type(3)

    (:nth-of-type(3) targets the 3rd <li> specifically; > ensures direct child of <ul>)

  5. [data-category*="electronics"]

    (*= matches attribute values containing the substring)

  6. span.stock.unavailable

    (Combines both classes to uniquely identify the “Out of Stock” span; avoids matching .available)
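
    Several selectors from this section can be tried on a small page. The markup below is hypothetical (it is not the tutorial's original HTML) and the sketch assumes the rvest package is installed.

    ```r
    library(rvest)

    # Hypothetical markup covering answers 1, 4, 5, and 6
    page <- minimal_html('
      <a href="https://example.com">secure</a>
      <a href="http://example.com">insecure</a>
      <ul><li>one</li><li>two</li><li>three</li></ul>
      <div data-category="home-electronics">TV</div>
      <div data-category="garden">Hose</div>
      <span class="stock available">In Stock</span>
      <span class="stock unavailable">Out of Stock</span>
    ')

    # ^= : attribute value starts with the given prefix
    secure_links <- page |> html_elements('a[href^="https://"]') |> html_attr("href")

    # :nth-of-type(3) : third <li> among its siblings
    third_item <- page |> html_elements("ul > li:nth-of-type(3)") |> html_text2()

    # *= : attribute value contains the substring
    electronics <- page |> html_elements('[data-category*="electronics"]') |> html_text2()

    # Two classes on the same element
    out_of_stock <- page |> html_elements("span.stock.unavailable") |> html_text2()

    print(list(secure_links, third_item, electronics, out_of_stock))
    ```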

rvest Implementation

  1. Purpose of trimws(): Removes leading/trailing whitespace (including newlines, tabs) from extracted text.

    Example:

    # Raw HTML: <span>  $24.99  </span>
    html_text(html_element(page, "span"))        # Returns "  $24.99  "
    trimws(html_text(html_element(page, "span"))) # Returns "$24.99"
  2. html_element() vs html_elements():

    Function         Use case                                             No-match behavior
    html_element()   Extract one element (e.g., page title)               Returns a missing node (no error); html_text() and html_attr() then give NA
    html_elements()  Extract multiple elements (e.g., all product cards)  Returns an empty node set (length 0, no error)

    Use html_elements() for repeating structures; use html_element() for single elements and check for NA when the element may be absent. Note that rvest's html_element() takes no optional argument.

  3. Extract image URLs:

    page |>
      html_elements("img") |>
      html_attr("src")  # Extracts "src" attribute values
  4. Data structure: tables is a list of data frames (one per matched <table>). Access second table:

    tables[[2]]  # Double brackets for list element extraction
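
    A short sketch of this structure, assuming the rvest package is installed. The two tables are hypothetical stand-ins for the tutorial's page.

    ```r
    library(rvest)

    # Hypothetical page with two tables
    page <- minimal_html('
      <table>
        <tr><th>item</th><th>price</th></tr>
        <tr><td>Pen</td><td>1.5</td></tr>
      </table>
      <table>
        <tr><th>city</th><th>stock</th></tr>
        <tr><td>Oslo</td><td>12</td></tr>
      </table>
    ')

    # html_table() on a node set returns one data frame per <table>
    tables <- page |> html_elements("table") |> html_table()

    second <- tables[[2]]   # [[ ]] returns the data frame itself
    wrapped <- tables[2]    # [ ] returns a list of length 1 containing it

    print(names(second))
    ```

    Single brackets keep the list wrapper, which is why tables[2] cannot be used directly as a data frame.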
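  5. Why optional = TRUE matters:

    Tolerating a missing element prevents script failure when a page lacks it (common in production due to layout changes, A/B tests, or edge cases). Note that rvest's html_element() does not take an optional argument: it already returns a missing node when nothing matches, so no flag is needed to get this behavior.

    html_element(page, ".promo-banner")             # Missing node if the banner is absent (no error)
    html_text(html_element(page, ".promo-banner"))  # NA → script continues

    Check for NA downstream rather than assuming every element exists. This is critical for robust, maintainable scrapers.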
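
    A common way to handle optional fields is to select the repeating container with html_elements() and then call html_element() on each container. In rvest 1.0 and later a non-matching html_element() yields a missing node, so absent fields come back as NA. The sketch below assumes rvest is installed; the product markup and class names are hypothetical.

    ```r
    library(rvest)

    # Hypothetical product cards; the second has no promo banner
    page <- minimal_html('
      <div class="product"><h2>Mug</h2><span class="promo-banner">Sale</span></div>
      <div class="product"><h2>Lamp</h2></div>
    ')

    cards <- html_elements(page, "div.product")

    # html_element() returns one result per card, so the columns stay aligned
    products <- data.frame(
      name  = cards |> html_element("h2") |> html_text(),
      promo = cards |> html_element(".promo-banner") |> html_text()
    )

    print(products)
    ```

    Selecting ".promo-banner" with html_elements() on the whole page would instead return a single match, losing track of which card it belongs to.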

Workflow and Best Practices

  1. Generate URLs:

    urls <- paste0("https://store.com/item?id=", 1:50)
  2. Purpose of Sys.sleep(2):

    • Ethical: Respects server resources; helps honor any Crawl-delay directive in robots.txt (use a longer pause if the site requests one)
    • Practical: Prevents IP bans, rate-limiting blocks, or triggering anti-bot systems

    Always prioritize server load over scraping speed.
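
    One way to combine the URL list with a delay is a simple loop. The helper below is a hypothetical sketch, not part of rvest: scrape_all, fetch, and delay are illustrative names, and fetch stands for any function that takes a URL (for example rvest::read_html).

    ```r
    # Hypothetical helper: call `fetch` on each URL, pausing `delay` seconds
    # between requests
    scrape_all <- function(urls, fetch, delay = 2) {
      results <- vector("list", length(urls))
      for (i in seq_along(urls)) {
        # One failed page should not abort the whole run; record NA instead
        page <- tryCatch(fetch(urls[[i]]), error = function(e) NA)
        results[i] <- list(page)
        # Pause between requests, but not after the last one
        if (i < length(urls)) Sys.sleep(delay)
      }
      names(results) <- urls
      results
    }

    urls <- paste0("https://store.com/item?id=", 1:3)

    # Offline demonstration with a stand-in fetch function and no delay
    demo <- scrape_all(urls, function(u) nchar(u), delay = 0)
    print(length(demo))
    ```

    In a real run the call would look like scrape_all(urls, rvest::read_html), keeping the default two-second pause.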

  3. Pre-scraping checks:

    1. robots.txt: Check https://target-site.com/robots.txt for disallowed paths
    2. Terms of Service: Verify scraping isn’t prohibited in legal/terms documentation
    3. Data sensitivity: Confirm data isn’t personal, copyrighted, or behind auth walls without permission
      Bonus: Check for a public API first (more reliable and ethical).
  4. Convert prices to numeric:

    library(readr)
    prices_raw <- c("$19.99", "$24.50", "$9.75")
    prices_clean <- parse_number(prices_raw)  # Handles currency symbols automatically
    # Output: [1] 19.99 24.50  9.75

    Alternative: as.numeric(sub("\\$", "", prices_raw))  (works only when the values contain no thousands separators)
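
    The two approaches differ once thousands separators or other locales appear. The sketch below assumes the readr package is installed; the extra price strings are made up for illustration.

    ```r
    library(readr)

    # US-style prices: parse_number() drops the symbol and the grouping comma
    us_prices <- parse_number(c("$19.99", "$1,299.00"))

    # Decimal-comma prices need an explicit locale
    eu_price <- parse_number(
      "24,50 €",
      locale = locale(decimal_mark = ",", grouping_mark = ".")
    )

    # The base R alternative only strips the dollar sign, so the comma
    # makes the conversion fail (NA with a warning)
    naive <- suppressWarnings(as.numeric(sub("\\$", "", "$1,299.00")))

    print(us_prices)
    print(eu_price)
    print(naive)
    ```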

  5. Why save raw extracted data: enables reproducible debugging and reprocessing without re-scraping.

    Time-saving scenario:
    > After cleaning 10,000 product prices, you discover a regex error converted “$1,299” to 1 (not 1299). With raw HTML snippets saved, you fix the cleaning script and reprocess in seconds. Without raw data, you must re-scrape all pages (risking blocks, delays, or changed content).

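    Best practice: Save raw HTML fragments as character strings alongside cleaned data (e.g., write_rds(as.character(raw_html), "raw_data.rds")). Parsed xml2/rvest node objects hold external pointers and cannot be restored from an .rds file, so convert them to text before saving.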
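
    A minimal round trip might look like the sketch below. It assumes rvest is installed, uses base saveRDS()/readRDS() (readr's write_rds() is a wrapper around saveRDS()), and the markup, file path, and cleaning step are hypothetical.

    ```r
    library(rvest)

    # Hypothetical scraped fragment
    page  <- minimal_html('<span class="price">$1,299</span>')
    nodes <- html_elements(page, "span.price")

    # Convert nodes to plain text before saving
    raw_html <- as.character(nodes)

    path <- tempfile(fileext = ".rds")
    saveRDS(raw_html, path)

    # Later: reload and re-run a corrected cleaning step without re-scraping
    restored <- readRDS(path)
    reparsed <- minimal_html(paste(restored, collapse = "\n"))
    text     <- html_text(html_element(reparsed, "span.price"))
    price    <- as.numeric(gsub("[^0-9.]", "", text))  # keep digits and decimal point

    print(price)
    ```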