Tutorial 7

Answer Model

LLM Fundamentals

Which of the following statements about tokens is incorrect?

A system prompt does not consume tokens since it’s only used for initialization

Explain the difference between a provider and a model in the context of LLM APIs. Provide one example of a provider that hosts multiple models and one example where provider and model names are often used interchangeably.

A provider is a company or platform that hosts and serves LLMs via APIs (e.g., OpenAI, Anthropic, Google Cloud). A model is a specific AI architecture/version with distinct capabilities (e.g., GPT-4o, Claude 3.5 Sonnet). Providers manage infrastructure, pricing, and access; models define the AI’s behavior and performance characteristics.

Provider hosting multiple models: OpenAI (hosts GPT-3.5-turbo, GPT-4, GPT-4o, and GPT-4o-mini)
Names used interchangeably: “Gemma” (Google’s model family). In casual usage, “Gemma” refers both to the model series and to Google as the provider, though technically Google is the provider and Gemma 2 9B is one specific model.

The following code initializes a chat session with a system prompt:

chat <- chat_ollama(
  model = "gemma3",
  system_prompt = "You are an expert R programmer who writes clean, efficient, and well-commented code. Return only code, no explanations."
)

The system_prompt argument establishes persistent behavioral constraints and role definition for the entire conversation session. It shapes the model’s persona, output style, and constraints before any user interaction occurs.
Unlike user prompts sent via chat$chat() which represent turn-by-turn inputs within the conversation, the system prompt is set once at initialization and remains constant throughout the session, providing foundational context that influences all subsequent responses.
Specifying “Return only code, no explanations” ensures machine-readable output that can be directly executed or piped into downstream R workflows without requiring fragile regex parsing to strip natural language commentary—critical for automation reliability.

1. Longer conversations consume more tokens in the context window (all prior turns must be resent with each new query). Since LLM pricing is token-based for both input and output, cumulative context size directly increases per-request costs.
2. Two cost/quality control strategies:
- Context window management: Periodically summarize prior conversation history into a condensed “memory” document and restart the session with this summary as the new context
- Turn pruning: Drop oldest conversation turns once context window approaches 70-80% capacity, retaining only recent/relevant exchanges
3. Starting a fresh conversation is preferable when: (1) The topic has fundamentally shifted (making prior context irrelevant noise), (2) Context window saturation causes critical information to be truncated, or (3) Error propagation has occurred (e.g., model has adopted incorrect assumptions from earlier turns).
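
The turn pruning strategy above can be sketched in base R. This is a minimal illustration, not a production implementation: the helper names estimate_tokens() and prune_turns() are hypothetical, and the 4-characters-per-token estimate is a rough assumption standing in for a real tokenizer.

```r
# Rough token estimate: about 4 characters per token (an assumption, not a real tokenizer).
estimate_tokens <- function(text) ceiling(nchar(text) / 4)

# Drop the oldest turns until the history fits within `threshold` of the context window.
prune_turns <- function(turns, context_window, threshold = 0.8) {
  budget <- context_window * threshold
  while (length(turns) > 1 && sum(estimate_tokens(turns)) > budget) {
    turns <- turns[-1]  # remove the oldest turn first
  }
  turns
}

# Made-up conversation history, oldest turn first.
history <- c(
  strrep("a", 400),  # about 100 estimated tokens
  strrep("b", 400),  # about 100 estimated tokens
  strrep("c", 200)   # newest turn, about 50 estimated tokens
)

pruned <- prune_turns(history, context_window = 200)
length(pruned)
```

With a 200-token window and an 80% threshold, the budget is 160 estimated tokens, so only the oldest turn is dropped here.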

Prompt Engineering

An example could be the following:

Refactor the following R correlation function to:
1. Replace nested for-loops with vectorized operations using cov() and sd()
2. Add complete documentation with parameter descriptions for x and y inputs
3. Include input validation checking that x and y are numeric vectors of equal length
4. Return NA with warning if inputs contain missing values
Return ONLY the improved function code with no explanatory text.
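
For orientation, a response that satisfies this prompt might look roughly like the sketch below. The function name safe_cor() is hypothetical, and a real model's output would differ in details.

```r
#' Pearson correlation between two numeric vectors
#'
#' @param x Numeric vector.
#' @param y Numeric vector of the same length as `x`.
#' @return The correlation coefficient, or NA (with a warning) if either
#'   input contains missing values.
safe_cor <- function(x, y) {
  # Input validation: numeric vectors of equal length
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("`x` and `y` must be numeric vectors.")
  }
  if (length(x) != length(y)) {
    stop("`x` and `y` must have the same length.")
  }
  # Missing values: warn and return NA instead of computing
  if (anyNA(x) || anyNA(y)) {
    warning("Inputs contain missing values; returning NA.")
    return(NA_real_)
  }
  # Vectorized computation, no explicit loops
  cov(x, y) / (sd(x) * sd(y))
}
```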

1. Using emotional language to motivate the model (“Please try your best!”)
- “Infinitely patient”: Encourages iterative refinement—you can safely provide multiple examples, correct errors explicitly (“Actually, extract the fiscal year not calendar year”), and request re-attempts without social friction. This supports techniques like few-shot learning and error correction loops.
- “Forgets everything each conversation”: Mandates complete context provision in every session—you must re-specify output formats, domain constraints, and examples in each new interaction rather than assuming retained knowledge from prior sessions. This necessitates self-contained prompts with all necessary instructions.

You need to extract company names and revenue figures from financial news articles. Design a system prompt that would optimize an LLM for this specific task. Include at least three specific instructions that would improve extraction reliability.

You are a financial data extraction specialist. Extract company names and revenue figures with strict precision:
1. Return ONLY valid JSON with keys "company_name" (normalized legal entity name) and "revenue" (numeric value with currency unit preserved as string, e.g., "4.2B USD")
2. Extract ONLY explicitly stated revenue figures—never calculate, estimate, or infer values from percentages or growth rates
3. When multiple revenue figures appear, prioritize the most recent fiscal period mentioned in the article
4. If no explicit revenue figure exists, set revenue to null—never hallucinate values

Structured Data Extraction

library(ellmer)
type_object(
  title = type_string(),
  publication_year = type_integer(),
  authors = type_array(type_string()),
  keywords = type_array(type_string(), required = FALSE),
  citation_count = type_integer(required = FALSE)
)

The following code attempts to extract people’s information but produces errors:

type_people <- type_array(
  type_object(
    name = type_string(),
    age = type_integer(),
    hobbies = type_string()  # Problem here
  )
)

The hobbies field incorrectly uses type_string() when a person typically has multiple hobbies. This forces the model to concatenate multiple hobbies into a single string (e.g., “hiking, reading”) rather than representing them as discrete elements.

type_people <- type_array(
  type_object(
    name = type_string(),
    age = type_integer(),
    hobbies = type_array(type_string())
  )
)

A list-column in a data frame where each element is a character vector (e.g., c("hiking", "reading")).
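
A list-column of this kind can be built by hand in base R, which may help when inspecting the extraction result. The names and values below are made up for illustration.

```r
people <- data.frame(
  name = c("Ana", "Ben"),
  age = c(31L, 45L)
)

# Assign the list afterwards so each row holds its own character vector.
people$hobbies <- list(c("hiking", "reading"), "chess")

lengths(people$hobbies)  # number of hobbies per person
people$hobbies[[1]]      # the character vector stored in the first row
```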

1. Each object represents a row in the resulting data frame
Complete the following code to extract product reviews containing rating (1-5 integer), reviewer name (string), and review text (string) from multiple prompts using parallel processing:

library(ellmer)

prompts <- c(
  "Maria gave the coffee maker 5 stars: 'Best purchase ever!'",
  "John rated it 2/5: 'Broke after one week'",
  "Anonymous user: 4 stars - good value but slow shipping"
)

type_review <- type_object(
  rating = type_integer(),
  reviewer_name = type_string(),
  review_text = type_string()
)

chat <- chat_ollama(model = "gemma3")
result <- parallel_chat_structured(chat, prompts, type = type_review)

Why might structured output (using $chat_structured()) be preferable to requesting JSON format in a regular prompt (using $chat() with “return JSON” instruction) for production data pipelines? Discuss two specific reliability advantages.
- Guaranteed schema compliance: Structured output uses the LLM’s constrained decoding capabilities to enforce valid schema adherence at generation time, eliminating malformed JSON errors that require fragile post-processing regex fixes in regular prompts.
- Elimination of natural language leakage: Prevents the model from prepending explanations (“Here is the JSON:”) or appending commentary after the structured data—common failure modes in prompt-based JSON requests that break automated parsers.

Tool Calling and RAG

You’re creating a tool to fetch current stock prices. The function signature is:

get_stock_price <- function(symbol, exchange = "NASDAQ") { ... }

Write a complete tool() wrapper including appropriate descriptions and argument specifications using type_string() and type_enum() where relevant. Justify your choice of required parameters.

tool(
  get_stock_price,
  name = "get_stock_price",
  description = "Retrieve real-time stock price for a specified ticker symbol",
  # Assumes ellmer >= 0.3.0, where argument types are passed as a named list
  arguments = list(
    symbol = type_string(
      description = "Stock ticker symbol (e.g., AAPL, TSLA)",
      required = TRUE
    ),
    exchange = type_enum(
      values = c("NASDAQ", "NYSE", "AMEX"),
      description = "Stock exchange where symbol is listed (defaults to NASDAQ)",
      required = FALSE
    )
  )
)

Justification: symbol is required because price lookup is impossible without it. exchange uses type_enum since valid values are constrained to major exchanges; it is optional because the R function already supplies NASDAQ as a sensible default for US equities.

Describe the complete 4-step flow of a tool calling interaction between user, LLM, and external function. Why is it important that the LLM requests tool execution rather than executing tools directly?
1. User submits query requiring external data (e.g., “What’s Apple’s current stock price?”)
2. LLM analyzes query, recognizes need for tool, and returns structured tool_call request with function name and arguments
3. Application executes the requested function in a sandboxed environment and captures result
4. Application returns result as tool_response to LLM, which synthesizes final user-facing answer
Importance of LLM requesting (not executing) tools: Prevents arbitrary code execution vulnerabilities, maintains security boundaries between LLM and system resources, and enables human-in-the-loop approval for sensitive operations.
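
The application side of this flow can be simulated in base R without any LLM. This is a sketch under stated assumptions: the stub get_stock_price(), the registry, and run_tool_call() are hypothetical, and the returned price is a made-up value.

```r
# Hypothetical stub standing in for a real price lookup.
get_stock_price <- function(symbol, exchange = "NASDAQ") {
  list(symbol = symbol, exchange = exchange, price = 123.45)  # made-up value
}

# Only functions registered here may be run on the model's behalf.
tool_registry <- list(get_stock_price = get_stock_price)

# Step 3: the application, not the model, decides whether and how to execute.
run_tool_call <- function(tool_call, registry) {
  if (!tool_call$name %in% names(registry)) {
    stop("Tool not allowed: ", tool_call$name)
  }
  do.call(registry[[tool_call$name]], tool_call$arguments)
}

# Step 2: what a structured request from the model might look like.
tool_call <- list(name = "get_stock_price", arguments = list(symbol = "AAPL"))

# Step 4: this result would be sent back to the model as the tool response.
tool_result <- run_tool_call(tool_call, tool_registry)
```

Because the model can only name a tool and its arguments, a request for anything outside the registry is rejected before any code runs.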
Explain why Retrieval-Augmented Generation (RAG) reduces hallucinations compared to standard LLM generation. In your answer, address:
1. Hallucinations fundamentally stem from LLMs generating text based solely on parametric knowledge (training data patterns) without access to ground-truth sources for specific queries—leading to plausible but incorrect fabrications when knowledge is outdated, incomplete, or ambiguous.
2. RAG changes the task from pure generation to synthesis: The LLM receives retrieved document snippets as context and must ground its response in these excerpts rather than relying on internal knowledge alone. This shifts the cognitive load from “recall correct fact” to “accurately summarize provided evidence.”
3. RAG doesn’t solve hallucinations when retrieved documents contain misinformation, or when the LLM misinterprets/misrepresents the retrieved content (e.g., inventing relationships between facts in the documents). It also fails when retrieval misses relevant documents entirely.
Correct sequence:
1. Create store with ragnar_store_create()
2. Convert documents to markdown using read_as_markdown()
3. Insert processed chunks with ragnar_store_insert()
4. Call ragnar_store_build_index() to finalize the search index
5. Retrieve relevant content using ragnar_retrieve()

Practical Applications

1. Input tokens: 10,000 × 100 = 1,000,000 tokens → $3.00
  Output tokens: 10,000 × 30 = 300,000 tokens → $4.50
  Total cost: $7.50
2. Switching to local LLM eliminates cloud API costs (both input/output tokens). Remaining costs: electricity for inference hardware, hardware depreciation, and engineering time for setup/maintenance.
3. Batch processing with parallel_chat_structured() reduces per-request overhead (connection setup, latency), enables better hardware utilization through concurrent requests, and may qualify for volume-based pricing tiers with cloud providers, lowering effective cost per token.
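
The arithmetic in the first answer can be reproduced in a few lines of R. The per-million-token rates below are an assumption inferred from the totals shown ($3 per million input tokens, $15 per million output tokens); the variable names are illustrative.

```r
n_requests <- 10000
input_tokens_per_request <- 100
output_tokens_per_request <- 30

# Assumed rates in USD per million tokens, inferred from the totals above.
input_rate <- 3
output_rate <- 15

input_cost <- n_requests * input_tokens_per_request / 1e6 * input_rate
output_cost <- n_requests * output_tokens_per_request / 1e6 * output_rate
total_cost <- input_cost + output_cost
```

Output tokens account for fewer than a quarter of the tokens but more than half of the cost, which is why limiting response length matters.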
You’re using an LLM to extract economic indicators from policy documents for a research paper.
1. Two specific risks:
- Hallucinated numerical values or misattributed statistics undermining empirical validity
- Systematic extraction bias (e.g., consistently missing negative sentiment indicators) distorting research conclusions
2. Validation workflow:
   1. Extract full dataset with LLM using structured schema
   2. Randomly sample 15% of extractions for manual verification by domain expert
   3. Calculate precision/recall metrics; if <95% accuracy, refine prompt/schema and re-extract problematic subset
   4. For critical values (e.g., GDP figures), implement mandatory human verification before inclusion
3. 80% accuracy provides value when analyzing macro-level trends across thousands of documents where minor errors average out (e.g., tracking sentiment directionality over time), or for preliminary exploration to identify candidate documents requiring deep manual analysis. Both uses accelerate research without compromising final conclusions.
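
The sampling and threshold check in the validation workflow can be sketched in base R. All data here is made up: the extractions data frame is a placeholder, and the correct column stands in for verdicts a domain expert would supply.

```r
set.seed(42)  # make the sample reproducible

# Hypothetical extraction results (placeholder values).
extractions <- data.frame(id = 1:200, value = round(runif(200, 0, 10), 1))

# Draw a 15% random sample for manual verification.
sample_ids <- sample(extractions$id, size = round(0.15 * nrow(extractions)))
review <- extractions[extractions$id %in% sample_ids, ]

# Placeholder verdicts: here 28 of the 30 sampled rows are marked correct.
review$correct <- c(rep(TRUE, 28), rep(FALSE, 2))

# Compare sample accuracy with the 95% threshold.
accuracy <- mean(review$correct)
needs_refinement <- accuracy < 0.95
```

In this made-up case the sample accuracy is 28/30, below the threshold, so the prompt or schema would be refined and the problematic subset re-extracted.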
Approach: RAG + structured extraction combination
- First use RAG to retrieve relevant passages discussing each indicator
- Then apply structured extraction on retrieved passages to minimize hallucination risk
Schema specification:

indicator_sentiment <- type_object(
    indicator = type_enum(c("inflation", "unemployment", "GDP growth")),
    sentiment = type_enum(c("positive", "negative", "neutral")),
    supporting_quote = type_string(description = "Verbatim text excerpt justifying sentiment assessment"),
    document_section = type_string(description = "Section heading or paragraph context")
)

Handling unstated sentiment: Default to “neutral” with supporting_quote containing purely factual description (e.g., “inflation rose to 3.2%”) without evaluative language. Include confidence score field if schema permits.

Practical constraint (accuracy): Risk of sentiment misclassification for nuanced language (e.g., “inflation remains elevated but shows signs of moderation”).

Mitigation strategy: Implement two-stage extraction—first identify indicator mentions with high recall¹, then apply stricter sentiment classification only on passages with explicit evaluative language (“concerning,” “welcome decline”), flagging ambiguous cases for human review.
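
One simple way to approximate the second-stage gate is a keyword filter, sketched below in base R. This is an illustrative simplification, not the only way to detect evaluative language: the term list and passages are hypothetical, and a real project would curate the cues with domain experts.

```r
# Hypothetical list of evaluative cues.
evaluative_terms <- c("concerning", "welcome", "alarming", "encouraging")

# Made-up passages that already passed the high-recall first stage.
passages <- c(
  "The rise in unemployment is concerning.",
  "Inflation rose to 3.2%.",
  "Inflation remains elevated but shows signs of moderation."
)

# Gate: only passages with an explicit cue go on to sentiment classification.
pattern <- paste(evaluative_terms, collapse = "|")
has_cue <- grepl(pattern, passages, ignore.case = TRUE)

to_classify <- passages[has_cue]
held_back <- passages[!has_cue]  # neutral default or human review
```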

Footnotes

Recall is defined as true positives / all actual positives (true positives + false negatives).