- Research Question(s) / Problem Statement: The study seeks to
address how the integration of Named Entity Recognition (NER) and Latent
Dirichlet Allocation (LDA) can improve the quality and interpretability of
topic modeling results, particularly in large-scale social media datasets.
Traditional LDA often struggles with identifying specific entities or
distinguishing between general and domain-specific terms, leading to
overlapping or ambiguous topics. This research investigates whether
incorporating NER — which extracts meaningful entities such as people,
organizations, and locations — can enhance topic coherence and
relevance. The central question is: Can NER preprocessing significantly
improve the performance and interpretability of LDA-based topic modeling
in noisy text data such as Twitter posts?
- Motivation / Relevance: With the exponential growth of
unstructured text data from social media, understanding public discourse
has become increasingly important for businesses, governments, and
researchers. However, textual data from platforms like Twitter or Reddit
are noisy, informal, and context-dependent. This makes extracting coherent
topics challenging. The study is motivated by the need to develop more
accurate text mining pipelines that combine linguistic and probabilistic
models to identify hidden patterns in text. By integrating NER with LDA,
analysts can isolate key entities before clustering related discussions,
thus improving topic granularity. This approach is highly relevant for
brand monitoring, trend analysis, policy evaluation, and crisis
communication, where identifying who and what people talk about is
crucial.
- Theoretical Framework: The research is grounded in Natural
Language Processing (NLP) and probabilistic topic modeling theories. NER
is based on sequence labeling techniques such as Conditional Random Fields
(CRF) and transformer-based architectures like BERT, which classify words
into entity categories (e.g., PERSON, ORGANIZATION, LOCATION). LDA,
introduced by Blei et al. (2003), assumes each document is a mixture of
topics, and each topic is a distribution over words. The integration of
NER and LDA theoretically assumes that filtering and grouping named
entities before topic modeling can reduce word noise and enhance topic
purity. This aligns with the distributional semantics theory, where
entity-level preprocessing ensures that semantically significant tokens
are preserved, leading to more interpretable topic clusters.
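The generative assumption above (each document a mixture of topics, each topic a distribution over words) can be sketched with a toy example. The topic names, vocabularies, and probabilities below are purely illustrative and not taken from the study:

```python
import random

# Hypothetical parameters: two topics, each a probability
# distribution over a tiny shared vocabulary.
topics = {
    "markets":  {"gold": 0.4, "crypto": 0.4, "policy": 0.2},
    "politics": {"policy": 0.5, "election": 0.3, "gold": 0.2},
}

def generate_document(topic_mixture, n_words, seed=0):
    """Sample a document under the LDA generative story:
    for each word, draw a topic from the document's topic mixture,
    then draw a word from that topic's word distribution."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mixture),
                            weights=topic_mixture.values())[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=dist.values())[0])
    return words

doc = generate_document({"markets": 0.7, "politics": 0.3}, n_words=8)
print(doc)  # eight tokens drawn from the two topic distributions
```

LDA inference runs this story in reverse: given only the observed words, it estimates the per-document topic mixtures and per-topic word distributions.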
- Method: The study employs a quantitative and experimental approach
using a dataset of 50,000 cleaned tweets related to global economic
discussions collected between January 2023 and December 2024. The
methodology consists of the following steps:
- Preprocessing: Text normalization, tokenization, stopword removal,
and stemming.
- Named Entity Recognition: Entity extraction using a pre-trained
spaCy model (en_core_web_trf) to identify and retain entities such as
companies, countries, and key individuals.
- Topic Modeling: Application of standard LDA and NER-enhanced LDA
using Gensim to compare results.
- Evaluation Metrics: Coherence Score (C_v), Perplexity, and
qualitative analysis of topic interpretability.
- Visualization: Topic distributions were visualized using pyLDAvis
to compare clustering differences between the two approaches. The
experiment aims to measure improvements in topic coherence and human
interpretability after integrating NER preprocessing.
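The entity-aware preprocessing step above can be sketched as follows. To keep the example self-contained, a hypothetical hard-coded entity list stands in for the spaCy en_core_web_trf model (in the real pipeline the entity spans would come from `nlp(text).ents`), and the downstream Gensim calls are noted only in comments:

```python
import re

# Hypothetical stand-in for spaCy NER output: entity spans to preserve.
KNOWN_ENTITIES = {"federal reserve", "bank of japan", "elon musk"}
STOPWORDS = {"the", "a", "is", "and", "about", "again"}

def entity_aware_tokens(text):
    """Merge known entity spans into single tokens (e.g. 'federal_reserve')
    so LDA treats them as one unit, then tokenize and drop stopwords."""
    text = text.lower()
    # Replace longer entities first so sub-spans are not merged prematurely.
    for ent in sorted(KNOWN_ENTITIES, key=len, reverse=True):
        text = text.replace(ent, ent.replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)
    return [t for t in tokens if t not in STOPWORDS]

tweet = "The Federal Reserve is talking about gold again"
print(entity_aware_tokens(tweet))  # ['federal_reserve', 'talking', 'gold']
```

The resulting token lists would then be passed to Gensim's `corpora.Dictionary` and `LdaModel`, with `CoherenceModel(coherence="c_v")` used to compare the baseline and NER-enhanced runs as described above.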
- Results / Arguments: The findings indicate that integrating NER
before LDA significantly improves topic coherence by an average of 18%
compared to the baseline LDA model. The NER-LDA approach produced clearer
topic boundaries and minimized the mixing of unrelated terms. For
instance, in discussions about “gold” and
“cryptocurrency,” the enhanced model successfully separated
topics related to financial institutions and market influencers, which
were previously merged. Additionally, the analysis revealed that
entity-aware topics provided richer contextual insights, as entities like
“Federal Reserve,” “Elon Musk,” and “Bank of Japan” became central nodes
in the topic structure. This demonstrates that NER preprocessing can help
capture contextual entities that drive discourse, making the model more
practical for real-world analytical applications.
- Conclusion: The study concludes that integrating NER with LDA
substantially enhances topic modeling performance, especially in noisy,
real-world datasets such as social media text. The hybrid approach
improves topic coherence, interpretability, and domain relevance by
reducing lexical ambiguity and emphasizing key entities. This finding
implies that text mining systems and sentiment analysis pipelines can
benefit from entity-aware preprocessing for more accurate trend detection
and decision-making. Future research could explore combining this approach
with deep learning-based topic models (e.g., BERTopic) or sentiment
analysis integration to further enrich contextual understanding in dynamic
text streams.
My Own Opinion
This
study presents a highly relevant and innovative approach to improving topic
modeling in unstructured data environments. The integration of NER adds
semantic depth and helps overcome one of the main limitations of traditional
LDA — the lack of context awareness. I appreciate the clear methodological
structure and strong theoretical justification. However, the study could be
further strengthened by testing across multiple domains (e.g., news, product
reviews, or policy debates) to verify generalizability. Moreover, adding a
visual comparison of topic networks before and after NER integration would make
the findings more tangible for data analysts and NLP practitioners.