library(dplyr)     # anti_join(), count()
library(tidytext)  # unnest_tokens(), stop_words, cast_dtm()

data |>
  unnest_tokens(word, text) |>          # Tokenizes text column into one word per row
  anti_join(stop_words, by = "word") |> # Removes stop words (e.g., "the", "and")
  count(doc_id, word) |>                # Counts word frequencies per document
  cast_dtm(doc_id, word, n)             # Converts to sparse document-term matrix
Tutorial 5
Answer Model
Conceptual Understanding
Document-term matrices are typically sparse because most words in a language vocabulary appear in only a small subset of documents. In any given document, only a tiny fraction of the total vocabulary is used—most cells in the matrix contain zero counts. This sparsity reflects the lexical diversity of language: documents focus on specific topics using specialized terminology, while the vast majority of possible words remain unused. For economic texts, this implies that policy documents discussing “monetary policy” rarely contain terms like “agricultural subsidy,” creating natural topic boundaries visible through sparsity patterns.
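As a rough illustration of sparsity, the sketch below builds a tiny document-term matrix in base R and reports the share of zero cells. The three texts are made up for the example and are not part of the tutorial data.

```r
# Hypothetical mini-corpus; the texts are illustrative assumptions
docs <- c(
  d1 = "central bank raises interest rates",
  d2 = "farm subsidy supports crop prices",
  d3 = "union wage talks stall"
)

# Tokenize on spaces and tag each token with its document
tokens <- strsplit(docs, " ", fixed = TRUE)
doc_id <- rep(names(tokens), lengths(tokens))

# Cross-tabulate documents against words to get a dense count matrix
dtm <- unclass(table(doc_id, unlist(tokens, use.names = FALSE)))

dim(dtm)                     # 3 documents x 14 distinct words
sparsity <- mean(dtm == 0)   # share of cells equal to zero
sparsity                     # 28 of 42 cells are zero, about 0.667
```

Even with three short documents, two thirds of the cells are zero. The share grows as more documents add vocabulary that the others never use.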
Stemming uses rule-based heuristics to chop word suffixes (e.g., “economies” → “economi”), often producing non-words. Lemmatization uses morphological analysis and dictionaries to return valid base forms (lemmas; e.g., “economies” → “economy”). For economic text analysis, prefer lemmatization when precise semantic interpretation matters (e.g., distinguishing “policy” from “politician” in regulatory analysis). Prefer stemming for rapid preprocessing of large corpora where computational efficiency outweighs morphological accuracy (e.g., initial topic screening across millions of SEC filings).
TF-IDF assigns near-zero weight to words appearing in every document because IDF = \(\log(N/DF)\) approaches \(\log(1) = 0\) when document frequency \(DF = N\). Such words lack discriminative power for distinguishing documents. Economic example: “the” appears universally in financial reports. Downweighting it prevents common function words from dominating similarity calculations, allowing meaningful terms like “quantitative easing” or “yield curve” to drive analysis of monetary policy documents.
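The limiting case can be written out explicitly. The symbol \(\mathrm{TF}_{w,d}\) for the count of word \(w\) in document \(d\) is introduced here only for illustration:

```latex
\[
DF_w = N
\;\Longrightarrow\;
\mathrm{IDF}_w = \log\frac{N}{DF_w} = \log 1 = 0
\;\Longrightarrow\;
\mathrm{TF}_{w,d} \times \mathrm{IDF}_w = 0
\quad \text{for every document } d .
\]
```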
The distributional hypothesis states that “words that occur in similar contexts tend to have similar meanings.” Word2Vec operationalizes this by training a model to predict neighboring words from a target word (skip-gram) or a target word from its neighbors (CBOW), creating dense vectors where semantic relationships (e.g., king - man + woman ≈ queen) emerge geometrically. This differs fundamentally from document-term matrices, which represent a word only by the documents it occurs in (bag-of-words), ignoring word order and local context and producing sparse, high-dimensional representations that do not capture semantic analogies.
In LDA:
- \(\beta_{k,w}\) represents the probability of word \(w\) appearing in topic \(k\). Economically, this defines a topic’s thematic signature (e.g., high \(\beta\) for “inflation,” “CPI,” “wage” in a “price stability” topic).
- \(\gamma_{d,k}\) (often denoted \(\theta_{d,k}\)) represents the proportion of document \(d\) attributable to topic \(k\). Economically, this quantifies topical composition (e.g., a Fed minutes document with \(\gamma=0.7\) for “monetary policy” and \(\gamma=0.3\) for “labor markets” reflects dual mandate focus).
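A minimal sketch of where these two quantities live in a fitted model, assuming the topicmodels package is installed. The AssociatedPress corpus that ships with the package stands in for an economic corpus, and the choice of \(k = 2\), the 100-document subset, and the seed are arbitrary.

```r
library(topicmodels)

# Example corpus shipped with topicmodels; subset to keep the fit fast
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:100, ]

# Two-topic model; k and the seed are arbitrary choices for this sketch
lda_fit <- LDA(ap_small, k = 2, control = list(seed = 1234))

post <- posterior(lda_fit)

# beta: topics x terms matrix of per-topic word probabilities
dim(post$terms)
rowSums(post$terms)        # each topic's word probabilities sum to 1

# gamma: documents x topics matrix of per-document topic proportions
dim(post$topics)
head(rowSums(post$topics)) # each document's topic proportions sum to 1
```

In a tidytext workflow the same matrices are usually retrieved in long form with tidy(lda_fit, matrix = "beta") and tidy(lda_fit, matrix = "gamma").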
Applied Calculations
Raw term frequency (TF) = count of term in document
- Doc1: “inflation” appears 2 times → TF = 2
- Doc2: “inflation” appears 1 time → TF = 1
IDF = \(\log_{10}(N / DF)\) where \(N = 3\) documents
- “inflation”: appears in Doc1 and Doc2 → \(DF = 2\)
IDF = \(\log_{10}(3/2) = \log_{10}(1.5) \approx 0.176\)
- “unemployment”: appears only in Doc3 → \(DF = 1\)
IDF = \(\log_{10}(3/1) = \log_{10}(3) \approx 0.477\)
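The two IDF values can be checked in a couple of lines of base R. The variable names are illustrative. The natural-log version is shown as well because tidytext's bind_tf_idf() uses the natural log, so its output will not match base-10 hand calculations.

```r
n_docs <- 3
doc_freq <- c(inflation = 2, unemployment = 1)  # DF values from the answer above

# Base-10 IDF, as in the worked answer
idf_log10 <- log10(n_docs / doc_freq)
round(idf_log10, 3)   # inflation = 0.176, unemployment = 0.477

# Same quantities with the natural log, for comparison
idf_ln <- log(n_docs / doc_freq)
round(idf_ln, 3)      # inflation = 0.405, unemployment = 1.099
```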
For “rising” in Doc1:
- TF (raw count) = 1 (“rising” appears once in “inflation inflation rising prices”)
- DF for “rising”: appears in Doc1 and Doc3 → \(DF = 2\)
IDF = \(\log_{10}(3/2) \approx 0.176\)
- TF-IDF = TF × IDF = \(1 \times 0.176 = 0.176\)
Given:
- Document word count = 200
- “recession” count in document = 8 → TF = 8
- \(DF = 5\), \(N = 10,000\)
- IDF = \(\log_{10}(10000/5) = \log_{10}(2000) \approx 3.301\)
- TF-IDF = \(8 \times 3.301 = 26.408\)
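The same arithmetic in base R, as a sketch with illustrative variable names. The answer above uses the raw count as TF. The length-normalized variant (count divided by document length) is included only for comparison, since it is the one place where the 200-word document length would enter.

```r
doc_length <- 200    # total words in the document
term_count <- 8      # occurrences of "recession"
doc_freq   <- 5      # documents containing "recession"
n_docs     <- 10000

idf <- log10(n_docs / doc_freq)

# Raw-count TF, as used in the answer above
tfidf_raw <- term_count * idf

# Length-normalized TF (count / document length), a common alternative
tfidf_norm <- (term_count / doc_length) * idf

round(c(idf = idf, raw = tfidf_raw, normalized = tfidf_norm), 3)
# idf = 3.301, raw = 26.408, normalized = 0.132
```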
Porter stemming results:
- Group A: Not merged. “economy” → “economi”, “economies” → “economi”, but “economic” → “econom” (different stem due to adjective suffix handling).
- Group B: Not merged. “policy”/“policies” → “polici”, but “politician” → “politician” (stemmer doesn’t reduce to “polici” due to morphological complexity).
- Group C: Partially merged. “growing” → “grow” and “grow” → “grow” share a stem after -ing removal, but “growth” → “growth” is left unchanged because the Porter rules do not strip -th.
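The stems can be checked directly rather than traced by hand. This sketch assumes the SnowballC package is installed and uses its wordStem() function with the original Porter algorithm. The group labels follow the answer above.

```r
library(SnowballC)

groups <- list(
  A = c("economy", "economies", "economic"),
  B = c("policy", "policies", "politician"),
  C = c("growth", "growing", "grow")
)

# Stem every word with the original Porter algorithm
stems <- lapply(groups, wordStem, language = "porter")
stems

# A group is fully merged only if all of its words share one stem
fully_merged <- sapply(stems, function(s) length(unique(s)) == 1)
fully_merged
```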
R Code Interpretation
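This pipeline constructs a document-term matrix in four steps: it tokenizes the text column into one word per row, removes stop words (e.g., “the”, “and”), counts word frequencies per document, and converts the counts to a sparse document-term matrix.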
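A runnable version of such a pipeline might look like the sketch below. The input data frame is hypothetical (three invented sentences with doc_id and text columns), and cast_dtm() additionally requires the tm package to be installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per document, with doc_id and text columns
data <- tibble(
  doc_id = c("d1", "d2", "d3"),
  text = c(
    "The central bank raised the policy rate",
    "Unemployment fell and wages rose",
    "The policy rate and unemployment both matter"
  )
)

dtm <- data |>
  unnest_tokens(word, text) |>          # one lowercase word per row
  anti_join(stop_words, by = "word") |> # drop common function words
  count(doc_id, word) |>                # term counts per document (column n)
  cast_dtm(doc_id, word, n)             # sparse DocumentTermMatrix

dim(dtm)  # documents x distinct remaining terms
```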
The critical limitation is that lexicons like “bing” lack domain-specific economic sentiment cues. Words like “tightening” (negative in consumer contexts but neutral/positive in monetary policy) or “volatile” (negative generally but expected in markets) receive inappropriate sentiment labels, misrepresenting tone in financial texts.
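One way to see the problem is to look up finance-flavored terms in the lexicon directly. In this sketch, “tightening” and “volatile” come from the answer above, while the remaining terms are illustrative additions. Which rows come back depends on the lexicon as shipped with the installed tidytext version, so no output is shown.

```r
library(dplyr)
library(tidytext)

# Terms whose tone depends on the economic context (partly illustrative picks)
finance_terms <- c("tightening", "volatile", "debt", "liability", "easing")

# Rows returned show the general-purpose label, if the lexicon has the word at all
bing_hits <- get_sentiments("bing") |>
  filter(word %in% finance_terms)

bing_hits
```

Terms that are missing from the result contribute nothing to a lexicon-based score, and terms that are present carry a label assigned without regard to monetary policy usage.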
This step solves case sensitivity in tokenization. Without lowercasing, “The” and “the” would be treated as distinct tokens, artificially inflating vocabulary size and fragmenting term frequencies. Lowercasing ensures consistent representation of words regardless of capitalization. One caveat for economic text: it also merges proper nouns with common words, so “Fed” (the Federal Reserve) and “fed” (the verb) become the same token.
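The effect on vocabulary size is easy to see in base R. The token vector below is a made-up example.

```r
tokens <- c("The", "the", "Fed", "fed", "rates")

table(tokens)            # each capitalization counted separately
table(tolower(tokens))   # counts pooled after lowercasing

n_types_raw   <- length(unique(tokens))           # 5
n_types_lower <- length(unique(tolower(tokens)))  # 3
```

In tidytext, unnest_tokens() lowercases by default (to_lower = TRUE), so this step often happens implicitly.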
The n variable represents the raw term frequency count (i.e., number of occurrences) of each word within each document in the input data frame prior to TF-IDF transformation.
The parameter k controls the number of latent topics to extract.
- Too low: Oversimplifies corpus structure (e.g., \(k=2\) might merge distinct “fiscal policy” and “trade policy” topics).
- Too high: Overfits noise, creating spurious topics (e.g., \(k=50\) for 100 documents may split coherent themes into fragmented subtopics).
Optimal \(k\) balances interpretability and coverage (often determined via perplexity or semantic coherence metrics).
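A sketch of a perplexity-based comparison across candidate values of \(k\), assuming the topicmodels package. The AssociatedPress corpus, the train and held-out split, the candidate grid, and the seed are all arbitrary choices for illustration.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
train    <- AssociatedPress[1:150, ]
held_out <- AssociatedPress[151:200, ]

candidate_k <- c(2, 5, 10)  # arbitrary grid for this sketch

# Fit one model per k and score it on documents it has not seen
perp <- sapply(candidate_k, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = held_out)
})

data.frame(k = candidate_k, perplexity = perp)
```

Lower held-out perplexity indicates better predictive fit, but it does not guarantee interpretable topics, so the numbers are best read alongside the top terms of each topic.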
Critical Thinking & Application
Bag-of-words fails to capture negation scope and modifier intensity critical in policy texts. Example: “The committee does not anticipate rate hikes” vs. “The committee anticipates rate hikes” have opposite policy implications, yet once stop words such as “does” and “not” are removed and terms are stemmed, their DTM rows are identical. Contextual embeddings (e.g., BERT) address this by modeling context: the vector for “anticipate” shifts based on surrounding tokens (“not” vs. no negation), preserving semantic nuance lost in the DTM.
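The collapse of the two sentences can be reproduced with standard preprocessing. This sketch assumes the SnowballC package for Porter stemming and uses a deliberately tiny stop list, which is an assumption made for the example (real lists are much longer).

```r
library(SnowballC)

s1 <- "The committee does not anticipate rate hikes"
s2 <- "The committee anticipates rate hikes"

# Tiny illustrative stop list (an assumption; real lists are much longer)
stop_list <- c("the", "does", "not")

# Bag of stems: lowercase, drop stop words, stem, then discard word order
bag <- function(sentence) {
  words <- strsplit(tolower(sentence), " ", fixed = TRUE)[[1]]
  words <- words[!words %in% stop_list]
  sort(wordStem(words, language = "porter"))
}

same_bag <- identical(bag(s1), bag(s2))
same_bag  # TRUE: the negation has vanished from the representation
```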
Removing numeric tokens discards quantitative guidance essential for market reactions. Example: Earnings reports stating “revenue growth of 3.8%” versus “2.1%” convey materially different performance signals. Deleting these values reduces analysis to vague qualitative sentiment (“growth” = positive), ignoring magnitude that drives investor decisions and stock price movements.
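A small base R sketch of the alternative: extract the figures before any cleaning step removes them. The two sentences are invented around the percentages quoted above.

```r
text <- c(
  "Revenue growth of 3.8% beat guidance",
  "Revenue growth of 2.1% missed guidance"
)

# Stripping numeric tokens leaves "growth of" with no magnitude attached
stripped <- gsub("[0-9.]+%", "", text)
stripped

# Pull the percentages out as numbers first, so they can be kept as a feature
matches <- regmatches(text, regexpr("[0-9]+(\\.[0-9]+)?%", text))
pct <- as.numeric(sub("%", "", matches, fixed = TRUE))
pct  # 3.8 2.1
```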
Stop word removal harms analysis of modal expressions central to policy communication. Example: Studying Fed forward guidance requires preserving “will” (strong commitment) vs. “may” (contingency) vs. “could” (possibility). Removing these stop words erases gradations of certainty that signal policy stance shifts. Keeping them enables analysis of commitment language in monetary policy statements.
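In a tidytext workflow, one way to keep the modals is to filter them out of the stop word list before the anti_join step. The object name stop_words_keep_modals is illustrative.

```r
library(dplyr)
library(tidytext)

modals <- c("will", "may", "could")

# Which of the modals does the default stop word list contain?
modals %in% stop_words$word

# Custom stop list that leaves the modals in the corpus
stop_words_keep_modals <- stop_words |>
  filter(!word %in% modals)

any(modals %in% stop_words_keep_modals$word)  # FALSE by construction
```

Passing stop_words_keep_modals to anti_join() in place of stop_words then preserves the commitment language for analysis.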
“Leverage” has specialized meaning in finance (debt-to-equity ratio) versus general usage (mechanical advantage or influence). Embeddings trained on news corpora may associate “leverage” primarily with physical mechanics or negotiation tactics, weakening its vector proximity to financial terms like “debt” or “margin.” This distorts similarity searches (e.g., failing to link “leverage” with “capital structure” in bank regulatory texts).
Raw sentiment scores are misleading due to temporal shifts in linguistic baselines. During recessions, even neutral statements (“output stabilized”) may carry positive sentiment relative to prior pessimism, inflating scores artificially. Normalization method: Calculate sentiment z-scores relative to a rolling 12-month window:
\[ z_t = \frac{\text{sentiment}_t - \mu_{t-12:t}}{\sigma_{t-12:t}} \]
This adjusts for era-specific language norms, enabling valid pre-recession vs. recession comparisons.
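A minimal base R sketch of the rolling z-score. It assumes monthly observations and reads the window as the 12 months up to and including \(t\); other conventions (for example, excluding month \(t\)) are equally defensible. The sentiment series is simulated purely for illustration.

```r
# Rolling z-score over a trailing window that ends at the current observation
rolling_z <- function(x, window = 12) {
  z <- rep(NA_real_, length(x))
  for (t in seq_along(x)) {
    if (t >= window) {
      w <- x[(t - window + 1):t]
      s <- sd(w)
      if (s > 0) z[t] <- (x[t] - mean(w)) / s
    }
  }
  z
}

# Simulated monthly sentiment: 24 calmer months, then 12 recession months
set.seed(1)
sentiment <- c(rnorm(24, mean = 0.2, sd = 0.1), rnorm(12, mean = -0.3, sd = 0.1))

z <- rolling_z(sentiment)
head(z, 14)  # NA until a full 12-month window is available
```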