Topic Modeling
- Goal: automatically identify the main topics discussed in the documents.
- Method: apply a topic modeling technique such as LDA (Latent Dirichlet Allocation) to surface the main themes in the text. This reveals the different axes of discussion related to the company.
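As a minimal sketch of the idea (the three toy "documents" below are invented for illustration and are not from the research notes), gensim learns topic-word and document-topic distributions from a bag-of-words corpus:

```python
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Toy corpus: three tiny, already-tokenized documents (illustration only)
docs = [
    ["배터리", "전기차", "충전", "배터리"],
    ["반도체", "파운드리", "공정", "반도체"],
    ["전기차", "충전", "인프라", "배터리"],
]

dictionary = corpora.Dictionary(docs)           # token -> integer id
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words per document

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Top words per topic, and the topic mixture of the first toy document
print(lda.print_topics(num_words=3))
print(lda.get_document_topics(corpus[0]))
```

The cells below apply the same pipeline (tokenize → dictionary → bag-of-words → LdaModel) to the actual research notes and visualize the result with pyLDAvis.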
In [1]:
!pip install pyLDAvis
Requirement already satisfied: pyLDAvis in /usr/local/lib/python3.11/dist-packages (3.4.1)
(remaining "Requirement already satisfied" lines for pyLDAvis dependencies omitted)
In [2]:
## One-time download of the NLTK tokenizer data
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
Out[2]:
True
In [3]:
import os
import re
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from nltk.tokenize import word_tokenize
import pyLDAvis.gensim as gensimvis  # gensim helper module for pyLDAvis
import pyLDAvis

# Manually defined Korean stopword list
korean_stopwords = {
    '의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로',
    '자', '에', '와', '한', '하다', '에서', '것', '및', '위해', '그', '되다'
}

# Extra stopwords that add no value for this analysis
additional_stopwords = {'강점', '약점', '경쟁사'}
korean_stopwords.update(additional_stopwords)

# Paths of the source text files
file_paths = [
    "01_다른경쟁사와간단비교.txt",
    "02_기업리서치관련정리.txt",
    "03_생성AI분석.txt"
]

# Concatenate the contents of all files into one string
combined_text = ""
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        combined_text += file.read() + "\n"

# Text preprocessing and tokenization
def preprocess(text):
    # Lowercase, remove punctuation/special characters, then tokenize
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(text)
    # Drop stopwords and single-character tokens
    tokens = [word for word in tokens if word not in korean_stopwords and len(word) > 1]
    return tokens

# Preprocess the combined text into one token list
# (note: all files are merged, so the corpus below contains a single document)
documents = preprocess(combined_text)

# Build the dictionary (token -> id mapping)
dictionary = corpora.Dictionary([documents])

# Build the corpus: bag-of-words representation of the document
corpus = [dictionary.doc2bow(documents)]

# Train the LDA model
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)

# Visualize with pyLDAvis
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis_data)

# Optionally save the visualization as an HTML file
pyLDAvis.save_html(vis_data, 'lda_visualization.html')
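As a hedged follow-up (not part of the original notebook): once `lda_model` has been trained, the topics can also be inspected as plain text, and a coherence score gives a rough sense of whether `num_topics=3` is reasonable. This sketch assumes the `lda_model`, `corpus`, and `dictionary` objects from the cell above; because all files were merged into a single document, the u_mass coherence here is of limited use, and splitting the files into separate documents would make it more meaningful.

```python
from gensim.models import CoherenceModel

# Top 5 words of each topic as plain text
for topic_id, words in lda_model.print_topics(num_words=5):
    print(topic_id, words)

# Topic mixture of the (single) combined document
print(lda_model.get_document_topics(corpus[0]))

# u_mass coherence of the current model (computed from the BoW corpus); higher is better
coherence = CoherenceModel(
    model=lda_model,
    corpus=corpus,
    dictionary=dictionary,
    coherence='u_mass',
).get_coherence()
print(f"Coherence (u_mass): {coherence:.3f}")
```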
In [1]:
# 2025-06-19: after this install, restart the runtime session so the new versions take effect,
# then run the commands again.
!pip install gensim
!pip install numpy==1.26.4 scipy==1.13.1 gensim==4.3.3 --force-reinstall
Requirement already satisfied: gensim in /usr/local/lib/python3.11/dist-packages (4.3.3)
Collecting numpy==1.26.4
Collecting scipy==1.13.1
Collecting gensim==4.3.3
(cached-wheel download and uninstall messages omitted)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
tsfresh 0.21.0 requires scipy>=1.14.0; python_version >= "3.10", but you have scipy 1.13.1 which is incompatible.
Successfully installed gensim-4.3.3 numpy-1.26.4 scipy-1.13.1 smart-open-7.1.0 wrapt-1.17.2
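To confirm the pinned versions are actually the ones loaded after the runtime restart, a quick sanity check (not in the original notebook) can be run before re-executing the analysis cell:

```python
# Verify that the reinstalled, pinned versions are now active
import numpy, scipy, gensim

print(numpy.__version__)   # expected: 1.26.4
print(scipy.__version__)   # expected: 1.13.1
print(gensim.__version__)  # expected: 4.3.3
```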
The preprocessing, LDA, and pyLDAvis cell shown above is then re-run unchanged after the runtime restart.
View the visualization directly in the notebook
In [5]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

# Enable inline rendering in the notebook, then prepare and show the visualization
pyLDAvis.enable_notebook()
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
vis_data
Out[5]: