# Raw HTML: <span> $24.99 </span>
html_text(html_element(page, "span")) # Returns " $24.99 "
trimws(html_text(html_element(page, "span"))) # Returns "$24.99"

Tutorial 3
Answer Model
HTML and DOM Fundamentals
The parent element of <span class="price"> is <div class="product">. (The <span> is a direct child of this <div>.)

id vs class:
- id: Must be unique per page (one element only). Use for scraping single, critical elements (e.g., the main content container).
- class: Can be reused across multiple elements. Use for scraping repeating structures (e.g., product cards, list items).
Scraping tip: Prefer class for scalable patterns; use id only for guaranteed-unique anchors. Avoid id if values are dynamically generated (e.g., id="product-123").
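The id/class distinction can be sketched with rvest on a small page. The markup below is a made-up illustration (the `#main` container and `.product` cards are assumptions, not the tutorial's page).

```r
library(rvest)

# Hypothetical page: one unique container holding three repeating cards.
page <- minimal_html('
  <div id="main">
    <div class="product">A</div>
    <div class="product">B</div>
    <div class="product">C</div>
  </div>')

# id selector: anchors on the single, unique container.
main <- html_element(page, "#main")

# class selector: collects every repeating card inside that container.
products <- html_elements(main, ".product")

n_products <- length(products)
product_names <- html_text(products)
```

Anchoring on the id first and then matching the class keeps the selector scoped to the part of the page that actually holds the repeating items.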
The two <p> elements are not siblings.
- First <p>: Direct child of #container
- Second <p>: Child of <section>, which is a child of #container
- They share a common ancestor (#container) but sit in different branches of the DOM tree, so neither is the parent, child, or sibling of the other.
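One way to check this relationship is to look up each paragraph's parent with xml2. The HTML here is reconstructed from the description above and is an assumption about the exercise's markup.

```r
library(rvest)
library(xml2)

# Assumed structure: one <p> directly in #container, one nested in <section>.
page <- minimal_html('
  <div id="container">
    <p>First</p>
    <section><p>Second</p></section>
  </div>')

paragraphs <- html_elements(page, "p")

# Tag name of each paragraph's parent; siblings would share the same parent.
parent_tags <- vapply(
  seq_along(paragraphs),
  function(i) xml_name(xml_parent(paragraphs[[i]])),
  character(1)
)

# Only the first <p> is a direct child of #container.
n_direct_children <- length(html_elements(page, "#container > p"))
```

Because the two parents differ (`div` and `section`), the paragraphs cannot be siblings.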
Why rvest fails with JS content: rvest (via read_html()) parses static HTML only and does not execute JavaScript. Dynamic content (loaded via JS after the initial page load) is absent from the raw HTML source.

Alternatives:
- RSelenium (browser automation)
- splashr (headless browser via Docker)
- playwright (modern browser control)
Always check first whether an API endpoint provides the data; it is more efficient than browser automation.
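The effect can be reproduced offline. In this minimal sketch (the `#app` element and inline script are invented for illustration), a browser would fill in the price, but the parsed document still has an empty element because the script never runs.

```r
library(rvest)

# Hypothetical page whose content is written by JavaScript at load time.
page <- minimal_html('
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "$24.99";</script>')

# The parser only sees the static markup, so the <div> is still empty.
app_text <- html_text(html_element(page, "#app"))
```

An empty string here, despite the price being visible in a browser, is the typical symptom of JavaScript-rendered content.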
CSS Selector Syntax
a[href^="https://"]

div.product.featured > span.price.sale
(Uses the child combinator > and combines both classes on the <span> to avoid matching other .price spans.)

Difference:
- div p: Selects all descendant <p> elements inside a <div> (any depth)
- div > p: Selects only <p> elements that are direct children of a <div>
Example where results differ:
<div>
  <p>Matched by both</p>
  <article>
    <p>Matched ONLY by "div p"</p>
  </article>
</div>

ul > li:nth-of-type(3)
(:nth-of-type(3) targets the 3rd <li> specifically; > ensures it is a direct child of the <ul>.)

[data-category*="electronics"]
(*= matches attribute values containing the substring.)

span.stock.unavailable
(Combines both classes to uniquely identify the “Out of Stock” span; avoids matching .available.)
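The selectors above can be tried against a small test page. The markup below is a hypothetical stand-in for the exercise pages, written only so that each selector has something to match and something to skip.

```r
library(rvest)

# Hypothetical markup covering each selector in this section.
page <- minimal_html('
  <div class="product featured" data-category="home-electronics">
    <span class="price sale">$19.99</span>
    <span class="price">$24.99</span>
    <span class="stock unavailable">Out of Stock</span>
  </div>
  <div class="product" data-category="garden">
    <span class="price">$5.00</span>
    <span class="stock available">In Stock</span>
  </div>
  <div id="text"><p>A</p><article><p>B</p></article></div>
  <ul><li>one</li><li>two</li><li>three</li></ul>
  <a href="https://example.com">secure</a>
  <a href="http://example.com">plain</a>')

# Helper (illustrative name): number of nodes a selector matches.
count_matches <- function(css) length(html_elements(page, css))

n_descendant <- count_matches("div p")    # any depth inside a <div>
n_child <- count_matches("div > p")       # direct children only
third_item <- html_text(html_elements(page, "ul > li:nth-of-type(3)"))
secure_links <- html_attr(html_elements(page, 'a[href^="https://"]'), "href")
sale_price <- html_text(
  html_elements(page, "div.product.featured > span.price.sale")
)
n_electronics <- count_matches('[data-category*="electronics"]')
out_of_stock <- html_text(html_elements(page, "span.stock.unavailable"))
```

Comparing `n_descendant` with `n_child` shows the descendant/child difference directly: the nested paragraph inside `<article>` is counted only by `div p`.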
rvest Implementation
Purpose of trimws(): Removes leading/trailing whitespace (including newlines and tabs) from extracted text. Example: trimws(" $24.99 ") returns "$24.99".

html_element() vs html_elements():

| Function | Use Case | No Match Behavior |
|---|---|---|
| html_element() | Extract one element (e.g., page title) | Returns a missing node (no error); html_text() or html_attr() on it gives NA |
| html_elements() | Extract multiple elements (e.g., all product cards) | Returns an empty node set of length 0 (no error) |

Always use html_elements() for lists; use html_element() for single elements, including optional ones, and check the result for NA.

Extract image URLs:
page |> html_elements("img") |> html_attr("src") # Extracts "src" attribute values

Data structure: tables is a list of data frames (one per matched <table>). Access the second table:
tables[[2]] # Double brackets for list element extraction

Why handling missing elements matters: Elements often go missing in production due to layout changes, A/B tests, or edge cases. Note that rvest's html_element() has no optional argument; a selector with no match yields a missing node rather than an error, so downstream code must expect NA:
banner <- html_element(page, ".promo-banner") # Missing node if banner absent; script continues
html_text(banner) # Returns NA when the banner is absent
Checking for NA before using the value is critical for robust, maintainable scrapers.
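The no-match behaviour of the two functions can be checked on a tiny page. This is a minimal sketch; the page content and the `.promo-banner` selector are placeholders.

```r
library(rvest)

# Hypothetical page with two images and no promo banner.
page <- minimal_html('<h1>Shop</h1><img src="a.png"><img src="b.png">')

# Single-element lookup with no match: a missing node, not an error.
banner <- html_element(page, ".promo-banner")
banner_text <- html_text(banner)

# Multi-element lookup with no match: an empty node set.
n_banners <- length(html_elements(page, ".promo-banner"))

# Attribute extraction across all matches.
image_urls <- page |> html_elements("img") |> html_attr("src")

# Guard before using an optional value downstream.
banner_label <- if (is.na(banner_text)) "no banner" else banner_text
```

The final guard is one possible pattern; what to substitute for a missing value depends on the analysis.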
Workflow and Best Practices
Generate URLs:
urls <- paste0("https://store.com/item?id=", 1:50)

Purpose of Sys.sleep(2):
- Ethical: Respects server resources; helps comply with robots.txt crawl-delay directives
- Practical: Reduces the risk of IP bans, rate-limiting blocks, or triggering anti-bot systems
Always prioritize server load over scraping speed.
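Putting the URL vector and the delay together, a polite scraping loop might look like the sketch below. `scrape_politely()` and its `fetch` argument are hypothetical names introduced here; passing the fetch function in makes the loop testable without network access.

```r
# URLs for items 1 to 50, as generated above.
urls <- paste0("https://store.com/item?id=", 1:50)

# Hypothetical helper: fetch each URL in turn, pausing between requests.
scrape_politely <- function(urls, fetch, delay = 2) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[[i]])
    # Pause between requests, but not after the last one.
    if (i < length(urls)) Sys.sleep(delay)
  }
  results
}

# Dry run with a stand-in fetch function and no delay.
dry_run <- scrape_politely(urls[1:3], function(u) nchar(u), delay = 0)

# Real use would pass a reader instead, for example:
# pages <- scrape_politely(urls, rvest::read_html)
```

With the default two-second delay, 50 pages take a little over a minute and a half, which is usually an acceptable cost for not straining the server.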
Pre-scraping checks:
- robots.txt: Check https://target-site.com/robots.txt for disallowed paths
- Terms of Service: Verify scraping isn’t prohibited in legal/terms documentation
- Data sensitivity: Confirm data isn’t personal, copyrighted, or behind auth walls without permission
Bonus: Check for a public API first (more reliable and ethical).
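The robots.txt check can be roughed out in base R. This is a deliberately simplified sketch on made-up rules: it ignores user-agent groups, Allow lines, and wildcards, so a dedicated parser (for example the robotstxt package) would be the safer choice for real use.

```r
# Made-up robots.txt content; in practice it could be read with
# readLines("https://target-site.com/robots.txt").
robots_txt <- c(
  "User-agent: *",
  "Disallow: /checkout",
  "Disallow: /account",
  "Crawl-delay: 2"
)

# Collect the path prefixes listed on Disallow lines.
rules <- grep("^disallow:", robots_txt, ignore.case = TRUE, value = TRUE)
disallowed <- trimws(sub("^disallow:", "", rules, ignore.case = TRUE))
disallowed <- disallowed[nzchar(disallowed)]  # an empty Disallow blocks nothing

# Simplified rule: a path is blocked if it starts with any listed prefix.
is_disallowed <- function(path) any(startsWith(path, disallowed))

item_blocked <- is_disallowed("/item?id=1")
checkout_blocked <- is_disallowed("/checkout/cart")
```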
Convert prices to numeric:
library(readr)
prices_raw <- c("$19.99", "$24.50", "$9.75")
prices_clean <- parse_number(prices_raw) # Handles currency symbols automatically
# Output: [1] 19.99 24.50 9.75
Alternative: as.numeric(sub("\\$", "", prices_raw))

Why save raw extracted data: It enables reproducible debugging and reprocessing without re-scraping.

Time-saving scenario:
> After cleaning 10,000 product prices, you discover a regex error converted “$1,299” to 1 (not 1299). With raw HTML snippets saved, you fix the cleaning script and reprocess in seconds. Without raw data, you must re-scrape all pages (risking blocks, delays, or changed content).

Best practice: Save raw HTML fragments alongside cleaned data (e.g., write_rds(raw_html, "raw_data.rds")). Store the fragments as character strings (e.g., via as.character()), since parsed nodes do not survive being written to disk.
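The scenario can be replayed in miniature. In this sketch the two raw strings, the temporary file path, and the faulty regex are all invented for illustration; the point is only that the fix runs against the saved raw values rather than a fresh scrape.

```r
library(readr)

# Raw extracted strings, saved before any cleaning (temporary path for the demo).
raw_prices <- c("$1,299", "$19.99")
raw_path <- tempfile(fileext = ".rds")
write_rds(raw_prices, raw_path)

# Faulty first attempt: keeps only the leading run of digits.
buggy <- as.numeric(sub("^\\$([0-9]+).*$", "\\1", raw_prices))

# Corrected cleaning, rerun on the saved raw data with no re-scraping.
fixed <- parse_number(read_rds(raw_path))
```

Here `buggy` turns "$1,299" into 1, while `fixed` recovers 1299 because parse_number() treats the comma as a grouping mark under the default locale.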