- Research Question(s) / Problem Statement: The study seeks to
address how the integration of Named Entity Recognition (NER) and Latent
Dirichlet Allocation (LDA) can improve the quality and interpretability of
topic modeling results, particularly in large-scale social media datasets.
Traditional LDA often struggles with identifying specific entities or
distinguishing between general and domain-specific terms, leading to
overlapping or ambiguous topics. This research investigates whether
incorporating NER — which extracts meaningful entities such as people,
organizations, and locations — can enhance topic coherence and
relevance. The central question is: Can NER preprocessing significantly
improve the performance and interpretability of LDA-based topic modeling
in noisy text data such as Twitter posts?
- Motivation / Relevance: With the exponential growth of
unstructured text data from social media, understanding public discourse
has become increasingly important for businesses, governments, and
researchers. However, textual data from platforms like Twitter or Reddit
are noisy, informal, and context-dependent. This makes extracting coherent
topics challenging. The study is motivated by the need to develop more
accurate text mining pipelines that combine linguistic and probabilistic
models to identify hidden patterns in text. By integrating NER with LDA,
analysts can isolate key entities before clustering related discussions,
thus improving topic granularity. This approach is highly relevant for
brand monitoring, trend analysis, policy evaluation, and crisis
communication, where identifying who and what people talk about is
crucial.
- Theoretical Framework: The research is grounded in Natural
Language Processing (NLP) and probabilistic topic modeling theories. NER
is based on sequence labeling techniques such as Conditional Random Fields
(CRF) and transformer-based architectures like BERT, which classify words
into entity categories (e.g., PERSON, ORGANIZATION, LOCATION). LDA,
introduced by Blei et al. (2003), assumes each document is a mixture of
topics, and each topic is a distribution over words. The integration of
NER and LDA theoretically assumes that filtering and grouping named
entities before topic modeling can reduce word noise and enhance topic
purity. This aligns with the distributional semantics theory, where
entity-level preprocessing ensures that semantically significant tokens
are preserved, leading to more interpretable topic clusters.
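The generative assumption above (each document a mixture of topics, each topic a distribution over words) can be sketched with a toy example. The topic names, vocabularies, and probabilities below are purely illustrative and not taken from the study:

```python
import random

# Hypothetical parameters: two topics, each a probability
# distribution over a tiny shared vocabulary.
topics = {
    "markets":  {"gold": 0.4, "crypto": 0.4, "policy": 0.2},
    "politics": {"policy": 0.5, "election": 0.3, "gold": 0.2},
}

def generate_document(topic_mixture, n_words, seed=0):
    """Sample a document under the LDA generative story:
    for each word, draw a topic from the document's topic mixture,
    then draw a word from that topic's word distribution."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mixture),
                            weights=topic_mixture.values())[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=dist.values())[0])
    return words

doc = generate_document({"markets": 0.7, "politics": 0.3}, n_words=8)
print(doc)  # eight tokens drawn from the two topic distributions
```

LDA inference runs this story in reverse: given only the observed words, it estimates the per-document topic mixtures and per-topic word distributions.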
- Method: The study employs a quantitative and experimental approach
using a dataset of 50,000 cleaned tweets related to global economic
discussions collected between January 2023 and December 2024. The
methodology consists of the following steps:
- Preprocessing: Text normalization, tokenization, stopword removal,
and stemming.
- Named Entity Recognition: Entity extraction using a pre-trained
spaCy model (en_core_web_trf) to identify and retain entities such as
companies, countries, and key individuals.
- Topic Modeling: Application of standard LDA and NER-enhanced LDA
using Gensim to compare results.
- Evaluation Metrics: Coherence Score (C_v), Perplexity, and
qualitative analysis of topic interpretability.
- Visualization: Topic distributions were visualized using pyLDAvis
to compare clustering differences between the two approaches. The
experiment aims to measure improvements in topic coherence and human
interpretability after integrating NER preprocessing.
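The entity-aware preprocessing step above can be sketched as follows. To keep the example self-contained, a hypothetical hard-coded entity list stands in for the spaCy en_core_web_trf model (in the real pipeline the entity spans would come from `nlp(text).ents`), and the downstream Gensim calls are noted only in comments:

```python
import re

# Hypothetical stand-in for spaCy NER output: entity spans to preserve.
KNOWN_ENTITIES = {"federal reserve", "bank of japan", "elon musk"}
STOPWORDS = {"the", "a", "is", "and", "about", "again"}

def entity_aware_tokens(text):
    """Merge known entity spans into single tokens (e.g. 'federal_reserve')
    so LDA treats them as one unit, then tokenize and drop stopwords."""
    text = text.lower()
    # Replace longer entities first so sub-spans are not merged prematurely.
    for ent in sorted(KNOWN_ENTITIES, key=len, reverse=True):
        text = text.replace(ent, ent.replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)
    return [t for t in tokens if t not in STOPWORDS]

tweet = "The Federal Reserve is talking about gold again"
print(entity_aware_tokens(tweet))  # ['federal_reserve', 'talking', 'gold']
```

The resulting token lists would then be passed to Gensim's `corpora.Dictionary` and `LdaModel`, with `CoherenceModel(coherence="c_v")` used to compare the baseline and NER-enhanced runs as described above.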
- Results / Arguments: The findings indicate that integrating NER
before LDA significantly improves topic coherence by an average of 18%
compared to the baseline LDA model. The NER-LDA approach produced clearer
topic boundaries and minimized the mixing of unrelated terms. For
instance, in discussions about “gold” and
“cryptocurrency,” the enhanced model successfully separated
topics related to financial institutions and market influencers, which
were previously merged. Additionally, the analysis revealed that
entity-aware topics provided richer contextual insights, as entities like
“Federal Reserve,” “Elon Musk,” and “Bank of Japan” became central nodes
in the topic structure. This demonstrates that NER preprocessing can help
capture contextual entities that drive discourse, making the model more
practical for real-world analytical applications.
- Conclusion: The study concludes that integrating NER with LDA
substantially enhances topic modeling performance, especially in noisy,
real-world datasets such as social media text. The hybrid approach
improves topic coherence, interpretability, and domain relevance by
reducing lexical ambiguity and emphasizing key entities. This finding
implies that text mining systems and sentiment analysis pipelines can
benefit from entity-aware preprocessing for more accurate trend detection
and decision-making. Future research could explore combining this approach
with deep learning-based topic models (e.g., BERTopic) or sentiment
analysis integration to further enrich contextual understanding in dynamic
text streams.
My Own Opinion
This
study presents a highly relevant and innovative approach to improving topic
modeling in unstructured data environments. The integration of NER adds
semantic depth and helps overcome one of the main limitations of traditional
LDA — the lack of context awareness. I appreciate the clear methodological
structure and strong theoretical justification. However, the study could be
further strengthened by testing across multiple domains (e.g., news, product
reviews, or policy debates) to verify generalizability. Moreover, adding a
visual comparison of topic networks before and after NER integration would make
the findings more tangible for data analysts and NLP practitioners.