Tutorial 6

Answer Model

Conceptual Understanding

1. Converting qualitative survey responses into Likert scales
Explanation: Tokenization discretizes continuous text into atomic units (tokens), analogous to converting unstructured qualitative survey responses into discrete ordinal categories (Likert scales). Both processes transform continuous/qualitative inputs into structured, machine-processable representations while losing some granular information.
Word embeddings capture semantic relationships and contextual nuance through dense vector representations, whereas word-counting (e.g., bag-of-words) treats words as independent discrete units.

Embeddings encode economic concepts such as monetary policy stance: “hawkish” and “tightening” will have similar vectors despite rarely co-occurring, while “dovish” and “accommodative” form another cluster. Word counting misses these relationships because it treats every word as an unrelated unit. It can register that Fed communications described inflation as “transitory” in 2021 and “persistent” in 2022, but it cannot tell that the two terms sit at opposite ends of the same concept, so the policy shift they signal is lost even when overall word frequencies look similar.
The analogy is partially justified but incomplete. Attention dynamically weights input tokens based on context-specific relevance (e.g., in “The Fed raised rates because inflation exceeded target,” “inflation” receives high attention for predicting “exceeded”), similar to how economists prioritize variables with large coefficients when interpreting regression results.

However, attention weights are input-dependent (change per sequence) and capture relational dependencies (e.g., pronoun-antecedent links), whereas regression coefficients are fixed marginal effects. A better analogy: attention resembles an economist selecting which historical episodes are most relevant for forecasting current policy, not just coefficient magnitude.
1. The model optimizes for plausible-sounding text rather than factual accuracy.
Explanation: LLMs are trained to maximize the likelihood of observed text sequences, not truthfulness. When generating references, they produce statistically plausible patterns (e.g., “NBER Working Paper No. XXXXX”) based on training data distributions, without access to real-time databases or verification mechanisms. The other options do not explain the problem: parameter count (a) and terminology coverage (c) are generally sufficient for economics, and low temperature (d) favors high-probability tokens, which tends to reduce rather than cause fabricated output.
Positional encoding injects sequence order information into token representations, solving the problem of temporal ambiguity in narratives. Without it, transformers (which process tokens in parallel) would treat “First inflation rose, then the Fed acted” identically to “First the Fed acted, then inflation rose,” losing causal directionality critical for economic analysis. Positional encodings preserve the chronological sequence necessary to infer policy reactions versus preemptive actions in time-series narratives.
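The point about order can be illustrated with a minimal R sketch. It assumes a sinusoidal positional encoding with a toy dimension of 4; the scheme, the dimension, and the helper name `pos_encoding` are illustrative assumptions, not something the tutorial specifies. The two sentences contain exactly the same tokens, so only the position signal separates them.

```r
# Two narratives with the same words in a different order
s1 <- c("first", "inflation", "rose", "then", "the", "fed", "acted")
s2 <- c("first", "the", "fed", "acted", "then", "inflation", "rose")

# Without order information the two sentences are the same bag of tokens
same_bag <- identical(sort(s1), sort(s2))

# Sinusoidal positional encoding (assumed scheme; d_model = 4 for illustration)
pos_encoding <- function(pos, d_model = 4) {
  i <- 0:(d_model / 2 - 1)
  angle <- pos / 10000^(2 * i / d_model)
  as.vector(rbind(sin(angle), cos(angle)))  # interleave sin and cos terms
}

# "inflation" sits at position 2 in s1 and position 6 in s2
pe_s1 <- pos_encoding(which(s1 == "inflation"))
pe_s2 <- pos_encoding(which(s2 == "inflation"))

same_bag
round(rbind(s1 = pe_s1, s2 = pe_s2), 3)
```

Adding these position vectors to the token embedding gives “inflation” a different representation in each sentence, which is what lets the model tell the two orderings apart.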

Vector calculation:

inflation - prices + wages 
= [0.85 - 0.75 + 0.40, 0.30 - 0.20 + 0.65, -0.20 - (-0.30) + 0.10] 
= [0.50, 0.75, 0.20]

Cosine similarity calculations (using R for precision):

```
result <- c(0.50, 0.75, 0.20)
inflation <- c(0.85, 0.30, -0.20)
deflation <- c(-0.78, 0.25, -0.15)
wages <- c(0.40, 0.65, 0.10)
prices <- c(0.75, 0.20, -0.30)

cos_sim <- function(a, b) sum(a*b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_inflation <- cos_sim(result, inflation)
sim_wages <- cos_sim(result, wages)
sim_prices <- cos_sim(result, prices)
sim_deflation <- cos_sim(result, deflation)

c(inflation=sim_inflation, wages=sim_wages, prices=sim_prices, deflation=sim_deflation)
```
```
 inflation      wages     prices  deflation 
 0.7155425  0.9954858  0.6051958 -0.3024014
```

Closest term: "wages" (similarity ≈ 0.99)

This arithmetic approximates real wage growth (wage growth adjusted for inflation). The relationship emerges naturally because economics texts frequently discuss wages relative to price changes (e.g., “wages failed to keep pace with inflation”), creating co-occurrence patterns where “wages” appears in contexts contrasting with “prices” and “inflation,” allowing the embedding geometry to capture this economic identity without explicit theoretical training.

a) Diagnosis: The learning rate (α) is too high, so updates overshoot the loss minimum. When the step |α∇L| exceeds the distance to the optimum, each update lands on the opposite side of the minimum and the parameter oscillates.
  Adjustment: Reduce α (e.g., by 50–75%). In a locally convex (approximately quadratic) region near the optimum, gradient descent is stable only if α < 2/λ_max, where λ_max is the largest eigenvalue of the Hessian H. Oscillating β values (e.g., -0.8 → -1.9) indicate that the step α|∇L| carries the parameter past the minimum on every update. Halving α should dampen the oscillations:
```
w_{t+1} = w_t - (α/2)∇L(w_t)  # Smaller steps prevent crossing the minimum
```
b) The stalled optimization more likely reflects a learning rate that is too small than a true local minimum. Non-zero gradients (implied by the parameter changes) should keep reducing the loss if α is appropriate. Here, α may be too small relative to the gradient magnitude (e.g., α|∇L| ≈ 0.001 when the loss curvature requires steps >0.1), so progress is negligible. A local minimum would show ∇L ≈ 0, whereas the persistent tiny parameter changes suggest that gradients exist but the updates are too small to matter.
c) Fundamental difference: Learning rate issues (a, b) are optimization failures (the algorithm cannot find the minimum), while identification problems are statistical/modeling failures (multiple parameter sets yield identical likelihoods, so the minimum is not unique). No amount of optimization tuning resolves identification.
  Estimation strategy: Impose economic restrictions (e.g., β < 0 for downward-sloping demand) or use instrumental variables to break the parameter equivalence. Bayesian estimation with informative priors (e.g., β ~ N(-1, 0.3)) can also pin down the estimate by incorporating external theory, although the result then depends on the prior.
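The two learning-rate diagnoses can be reproduced with a minimal R sketch on a toy quadratic loss. The curvature `h`, the optimum `w_star`, the starting value, and the three learning rates are all illustrative assumptions chosen to make the behavior visible; they are not taken from the tutorial question.

```r
# Toy loss L(w) = 0.5 * h * (w - w_star)^2 with gradient h * (w - w_star).
# h, w_star, and the learning rates below are illustrative assumptions.
h <- 4        # curvature (the Hessian is the scalar h), so stability needs alpha < 2 / h = 0.5
w_star <- -1  # location of the minimum

run_gd <- function(alpha, w0 = 0, steps = 20) {
  w <- numeric(steps + 1)
  w[1] <- w0
  for (t in seq_len(steps)) {
    grad <- h * (w[t] - w_star)
    w[t + 1] <- w[t] - alpha * grad
  }
  w
}

path_high  <- run_gd(alpha = 0.49)    # just under 2 / h: jumps across w_star on every step
path_half  <- run_gd(alpha = 0.245)   # halved: settles at w_star within a few steps
path_small <- run_gd(alpha = 0.0005)  # too small: the parameter barely moves

round(head(path_high, 6), 3)
round(head(path_half, 6), 3)
round(tail(path_small, 1) - w_star, 3)  # remaining distance to the minimum after 20 steps
```

In the third run the gradient never vanishes, yet the parameter is still far from the optimum after 20 steps, which is the pattern described in (b) rather than a local minimum.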
1. Better handling of context-dependent language (e.g., “challenging environment” vs. “challenging our competitors”)
Explanation: LLMs use contextual embeddings to distinguish sentiment based on surrounding words (e.g., “challenging” is negative in “challenging environment” but positive in “challenging competitors”), whereas dictionary methods assign fixed sentiment scores per word, failing on economic jargon and sarcasm common in earnings calls.
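A minimal R sketch of the dictionary limitation follows. The one-entry dictionary, the helper `score_sentence`, and the two example sentences are hypothetical and only illustrate the fixed-score behavior.

```r
# Hypothetical one-entry sentiment dictionary: one fixed score per word
dictionary <- c(challenging = -1)

# Sum the dictionary scores of all tokens; words not in the dictionary count as 0
score_sentence <- function(sentence, dict) {
  tokens <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  sum(dict[tokens], na.rm = TRUE)
}

s_env  <- "We face a challenging environment this quarter"
s_comp <- "We are challenging our competitors on price"

c(environment = score_sentence(s_env, dictionary),
  competitors = score_sentence(s_comp, dictionary))
```

Both sentences receive the same score because the dictionary cannot see the surrounding words, whereas a contextual model can assign them different sentiment.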

Quantitative Reasoning

Cross-entropy loss = \(-\log(p_{\text{true}}) = -\log(0.5) = \log(2) \approx 0.6931\)
```
-log(0.5)
```
```
[1] 0.6931472
```
\(w_{\text{new}} = 2.0 - 0.1 \times (-4.0) = 2.0 + 0.4 = 2.4\)
```
2.0 - 0.1 * (-4.0)
```
```
[1] 2.4
```
Interpretation: The negative gradient (\(\partial L/\partial w = -4.0\)) indicates loss decreases as \(w\) increases. The update moves \(w\) toward higher values (2.4), analogous to an economist increasing a policy instrument (e.g., tax rate) when marginal welfare gain is positive (\(\partial \text{Welfare}/\partial \text{tax} > 0\)) to reach the welfare optimum.
Vector = \([0.9 - 0.7 + 0.3,\ 0.2 - 0.1 + 0.6] = [0.5,\ 0.7]\)
```
c(0.9 - 0.7 + 0.3, 0.2 - 0.1 + 0.6)
```
```
[1] 0.5 0.7
```
Economic concept: Real wage growth (nominal wage growth adjusted for price changes). The vector approximates purchasing power dynamics where wages rise relative to inflation/price pressures.
Softmax weights: \(\text{weight}_i = e^{z_i} / \sum_j e^{z_j}\) where \(z\) = attention scores. Highest raw score is inequality=3.0, so it receives the highest weight after softmax.
```
scores <- c(stimulus=2.1, package=0.8, passed=0.5, because=0.3, it=0.1, addressed=1.2, inequality=3.0)
exp_scores <- exp(scores)
weights <- exp_scores / sum(exp_scores)
weights[which.max(weights)]
```
```
inequality 
 0.5299458 
```
Economic meaning: The highest attention weight falls on “inequality,” which ties the pronoun “it” to the reason the package passed (addressing inequality). Grammatically, “it” still refers to the stimulus package, and “stimulus” carries the second-highest score (2.1). The weight on “inequality” captures the policy motivation, which is critical for analyzing whether fiscal interventions target distributional concerns versus aggregate demand.
Adjusted probabilities at \(T=0.5\):

Unnormalized: \(p_i^{1/T} = p_i^2\) → progressive: \(0.6^2=0.36\), flat: \(0.3^2=0.09\), regressive: \(0.1^2=0.01\)

Normalized: progressive = \(0.36/0.46 \approx 0.78\), flat = \(0.09/0.46 \approx 0.20\), regressive = \(0.01/0.46 \approx 0.02\)
```
p <- c(0.6, 0.3, 0.1)
temp <- 0.5  # named temp because T is R's shorthand for TRUE
p_adj <- p^(1/temp)
p_adj / sum(p_adj)
```
```
[1] 0.78260870 0.19565217 0.02173913
```
Effect on diversity: Lower temperature (0.5 < 1) reduces diversity by amplifying high-probability tokens (“progressive” dominates). Policy discussions become less exploratory, potentially overlooking minority viewpoints like flat taxes in progressive-leaning contexts.
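To see the diversity effect more directly, the following R sketch compares several temperatures and uses Shannon entropy as a simple diversity measure. Only the temperature 0.5 comes from the question; the temperatures 1 and 2, the entropy measure, and the helper names are assumptions added for comparison.

```r
p <- c(progressive = 0.6, flat = 0.3, regressive = 0.1)

# Rescale a probability vector by temperature and renormalize
apply_temperature <- function(p, temp) {
  p_adj <- p^(1 / temp)
  p_adj / sum(p_adj)
}

# Shannon entropy in nats: higher values mean more diverse sampling
entropy <- function(p) -sum(p * log(p))

temps <- c(0.5, 1, 2)  # 0.5 is from the question; 1 and 2 are added for comparison
dists <- sapply(temps, function(temp) apply_temperature(p, temp))
colnames(dists) <- paste0("T=", temps)
ent <- apply(dists, 2, entropy)

round(dists, 3)
round(ent, 3)
```

Entropy rises with temperature, so the low-temperature column concentrates on “progressive” while the high-temperature column spreads probability toward “flat” and “regressive.”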
Given \(L(2N)/L(N) = 0.85 = 2^b\) → \(b = \log_2(0.85) \approx -0.234\)
```
log2(0.85)
```
```
[1] -0.2344653
```
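Interpretation: Under a power law \(L(N) \propto N^{b}\), the exponent \(b \approx -0.23\) implies diminishing returns to scale: each doubling of parameters cuts loss by only 15%, so every further 15% reduction requires twice as many parameters again. This matches the power-law form of empirical scaling laws in LLMs.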
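The diminishing returns can be made concrete by projecting the same exponent to larger scale factors. This is a minimal sketch that assumes the power law holds unchanged beyond one doubling; the scale factors 4, 8, and 16 are illustrative.

```r
ratio_per_doubling <- 0.85
b <- log2(ratio_per_doubling)

# Relative loss L(kN) / L(N) = k^b under the assumed power law (L proportional to N^b)
scale_factor <- c(2, 4, 8, 16)
relative_loss <- scale_factor^b

data.frame(scale_factor, relative_loss = round(relative_loss, 3))
```

Under this assumption, a 16-fold increase in parameters still leaves roughly half of the original loss, since \(0.85^4 \approx 0.52\).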

Application & Critical Thinking (Questions 15-20)

Improvements:
1. Extract qualitative risk descriptions (e.g., “physical risk from sea-level rise”) beyond keyword counts, capturing nuanced disclosures like scenario analysis depth.
2. Detect forward-looking statements (e.g., “we anticipate regulatory changes by 2030”) using temporal reasoning, which bag-of-words misses.
Risk to mitigate: Hallucinated disclosures (e.g., inventing non-existent TCFD alignment). Mitigation: Constrain outputs to verbatim text spans with citation anchors to original transcripts.
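The verbatim-span constraint can be enforced with a simple substring check, sketched below in R. The transcript text and the two candidate extractions are hypothetical examples, and exact matching is one possible design choice among several (a production check might first normalize whitespace or punctuation).

```r
# Hypothetical transcript and two candidate extractions
transcript <- paste(
  "We see physical risk from sea-level rise at our coastal facilities.",
  "We anticipate regulatory changes by 2030."
)
extractions <- c(
  "physical risk from sea-level rise",      # appears verbatim in the transcript
  "our disclosures are fully TCFD-aligned"  # does not appear in the transcript
)

# Keep an extraction only if it is an exact substring of the source text
is_verbatim <- vapply(
  extractions,
  function(span) grepl(span, transcript, fixed = TRUE),
  logical(1),
  USE.NAMES = FALSE
)

data.frame(extraction = extractions, is_verbatim)
```

Any extraction that fails the check would be rejected or sent for manual review instead of entering the dataset.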
Improved prompt: “Analyze U.S. core PCE inflation trends from Q4 2023 to Q1 2024 using a dual mandate framework (maximum employment vs. price stability). Contrast demand-pull factors (e.g., wage growth) and supply-side constraints (e.g., shelter costs). Output a 4-sentence summary: sentence 1 = trend magnitude, sentence 2 = dominant driver, sentence 3 = Fed policy implication, sentence 4 = key uncertainty. Cite one FRED data series.”
- Bias source: Gendered language patterns in training data (e.g., historical news articles describing male/female leaders with stereotyped adjectives).
- Mitigation: Adversarial debiasing during fine-tuning—train a classifier to detect gender from text embeddings, then update LLM weights to minimize this classifier’s accuracy while preserving economic content fidelity.
Least appropriate: b) Calculating exact present value of a 30-year bond with variable coupons. Justification: LLMs lack reliable arithmetic precision for multi-step financial calculations (e.g., discounting up to 360 cash flows if coupons are paid monthly). Small errors accumulate across steps, risking material valuation mistakes. Tasks (a), (c), and (d) involve language understanding, where LLMs perform well under human oversight; (b) requires deterministic computation better handled by specialized software.
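For contrast, the deterministic calculation takes only a few lines in R. Every input in this sketch is a hypothetical assumption (face value, monthly payment frequency, a flat 5% annual yield, and a coupon schedule that steps up every ten years); the point is only that ordinary code computes the value exactly and reproducibly.

```r
# Hypothetical inputs: 30-year bond, monthly coupons, face value 100
face <- 100
n_periods <- 360
annual_yield <- 0.05                                   # assumed flat discount rate
coupon_rates <- rep(c(0.04, 0.05, 0.06), each = 120)   # assumed annual coupon rate in each month

periods <- seq_len(n_periods)
cash_flows <- face * coupon_rates / 12
cash_flows[n_periods] <- cash_flows[n_periods] + face  # principal repaid at maturity

# Discount each cash flow at the monthly rate and sum
discount_factors <- (1 + annual_yield / 12)^(-periods)
present_value <- sum(cash_flows * discount_factors)
present_value
```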
Validation protocol:
1. Source anchoring: Require LLM to output the exact sentence/paragraph containing the claim (e.g., “unemployment fell to 3.8%” → quote report section 3.2).
2. Database cross-check: Automatically query official sources (e.g., BLS API for U.S. unemployment) using extracted figures; flag mismatches >0.1 percentage point.
  (Optional third step: Human spot-check 5% of extractions for contextual accuracy, e.g., “3.8%” referring to state vs. national rate.)
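The cross-check and spot-check steps could look like the following R sketch. The extracted and official figures are made-up placeholders and no real API is called; in practice the `official` column would be filled from the statistical agency's data.

```r
# Hypothetical extracted figures and official values (percent); no real API is called
checks <- data.frame(
  claim     = c("national unemployment", "state unemployment"),
  extracted = c(3.8, 3.8),
  official  = c(3.8, 4.3)
)

# Flag mismatches larger than 0.1 percentage point
# (the small epsilon guards against floating-point noise at exactly 0.1)
tolerance_pp <- 0.1
checks$flag <- abs(checks$extracted - checks$official) > tolerance_pp + 1e-9

# Draw a 5% random sample (at least one row) for human review
set.seed(1)
n_review <- max(1, ceiling(0.05 * nrow(checks)))
checks$human_review <- seq_len(nrow(checks)) %in% sample(nrow(checks), n_review)

checks
```

The second row illustrates the contextual error mentioned above: the extracted “3.8%” is a correct number attached to the wrong geography, and the tolerance check catches it.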
Why misconduct: Presenting AI-generated text as original scholarly work constitutes plagiarism because it misrepresents authorship: the LLM is passed off as a mere research tool when it has in effect acted as an uncredited co-author. This violates academic integrity norms (e.g., APA, university policies) that require attribution of all substantive intellectual contributions.
Appropriate workflow:
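1. Use LLM to identify candidate papers via prompt: “List 5 seminal papers on gig economy wage volatility (2020–2024) with DOIs.” Verify every DOI against the publisher or a bibliographic database, because LLMs can generate plausible but non-existent references.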
2. Read sources personally; use LLM to brainstorm structure: “Suggest 3 thematic sections for a literature review on platform labor markets.”
3. Write original analysis in own words; cite all human-authored sources. Disclose LLM assistance per journal guidelines, and limit it in the written text to non-substantive tasks (e.g., “Grammar checked with Grammarly”).