Topic Modeling

  • Purpose: automatically identify the main topics discussed in the documents.
  • Method: apply a topic modeling technique such as LDA (Latent Dirichlet Allocation) to extract the main themes found in the text, revealing the different axes of discussion related to the company.
In [1]:
!pip install pyLDAvis
Requirement already satisfied: pyLDAvis in /usr/local/lib/python3.11/dist-packages (3.4.1)
Requirement already satisfied: numpy>=1.24.2 in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (1.26.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (1.13.1)
Requirement already satisfied: pandas>=2.0.0 in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (2.2.2)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (1.5.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (3.1.6)
Requirement already satisfied: numexpr in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (2.11.0)
Requirement already satisfied: funcy in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (2.0)
Requirement already satisfied: scikit-learn>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (1.6.1)
Requirement already satisfied: gensim in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (4.3.3)
Requirement already satisfied: setuptools in /usr/local/lib/python3.11/dist-packages (from pyLDAvis) (75.2.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas>=2.0.0->pyLDAvis) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas>=2.0.0->pyLDAvis) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas>=2.0.0->pyLDAvis) (2025.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=1.0.0->pyLDAvis) (3.6.0)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.11/dist-packages (from gensim->pyLDAvis) (7.1.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->pyLDAvis) (3.0.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas>=2.0.0->pyLDAvis) (1.17.0)
Requirement already satisfied: wrapt in /usr/local/lib/python3.11/dist-packages (from smart-open>=1.8.1->gensim->pyLDAvis) (1.17.2)
In [2]:
## Only needs to be downloaded once, on the first run
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
Out[2]:
True
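The punkt and punkt_tab packages provide the tokenizer data that word_tokenize relies on. Punkt is trained on English, so on Korean text it effectively splits on whitespace and punctuation and leaves particles attached to the preceding word; the manual stopword list below therefore only removes particles that happen to appear as standalone tokens. A minimal sketch (the sample sentence is made up for illustration):

from nltk.tokenize import word_tokenize

sample = "네이버와 카카오의 생성 AI 전략을 비교한다."
print(word_tokenize(sample))
# Splits mostly on whitespace and punctuation; '네이버와' and '카카오의' remain
# single tokens with their particles attached.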
In [3]:
import os
import re
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from nltk.tokenize import word_tokenize
import pyLDAvis.gensim as gensimvis  # adjusted import
import pyLDAvis
import matplotlib.pyplot as plt

# Manually defined Korean stopword list
korean_stopwords = {
    '의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로',
    '자', '에', '와', '한', '하다', '에서', '것', '및', '위해', '그', '되다'
}

# Additional stopwords (domain words not useful for the analysis)
additional_stopwords = {'강점', '약점', '경쟁사'}
korean_stopwords.update(additional_stopwords)

# Paths of the text files to analyze
file_paths = [
    "01_다른경쟁사와간단비교.txt",
    "02_기업리서치관련정리.txt",
    "03_생성AI분석.txt"
]

# Combine the contents of all files into a single string
combined_text = ""

for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        combined_text += file.read() + "\n"

# Text preprocessing and tokenization
def preprocess(text):
    # Lowercase, remove special characters, tokenize
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in korean_stopwords and len(word) > 1]
    return tokens

# Tokenize the combined text (everything is treated as a single document here)
documents = preprocess(combined_text)

# Build the word dictionary
dictionary = corpora.Dictionary([documents])

# Build the corpus (the document as a Bag-of-Words vector)
corpus = [dictionary.doc2bow(documents)]

# Train the LDA model
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)

# Visualize with pyLDAvis
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis_data)

# Optionally save the visualization as an HTML file
pyLDAvis.save_html(vis_data, 'lda_visualization.html')
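Note that the three files are concatenated into one string above, so the corpus contains exactly one document; with a single document, LDA has little to contrast and the three topics tend to end up with very similar word distributions. A minimal sketch of an alternative, assuming the same three files and reusing the preprocess() defined above (the per_file_* names are introduced here only for illustration):

# Treat each file as a separate document (a sketch, not the original notebook approach)
per_file_texts = []
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as f:
        per_file_texts.append(preprocess(f.read()))

per_file_dictionary = corpora.Dictionary(per_file_texts)
per_file_corpus = [per_file_dictionary.doc2bow(doc) for doc in per_file_texts]

per_file_lda = LdaModel(per_file_corpus, num_topics=3, id2word=per_file_dictionary,
                        passes=15, random_state=42)

# Print the top words of each topic as plain text
for topic_id, words in per_file_lda.print_topics(num_words=8):
    print(topic_id, words)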
In [1]:
# 25/06/19: after installing, restart the runtime session for the changes to take effect,
# then run the commands again.
!pip install gensim
!pip install numpy==1.26.4 scipy==1.13.1 gensim==4.3.3 --force-reinstall
Requirement already satisfied: gensim in /usr/local/lib/python3.11/dist-packages (4.3.3)
Requirement already satisfied: numpy<2.0,>=1.18.5 in /usr/local/lib/python3.11/dist-packages (from gensim) (1.26.4)
Requirement already satisfied: scipy<1.14.0,>=1.7.0 in /usr/local/lib/python3.11/dist-packages (from gensim) (1.13.1)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.11/dist-packages (from gensim) (7.1.0)
Requirement already satisfied: wrapt in /usr/local/lib/python3.11/dist-packages (from smart-open>=1.8.1->gensim) (1.17.2)
Collecting numpy==1.26.4
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy==1.13.1
  Using cached scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting gensim==4.3.3
  Using cached gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting smart-open>=1.8.1 (from gensim==4.3.3)
  Using cached smart_open-7.1.0-py3-none-any.whl.metadata (24 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim==4.3.3)
  Using cached wrapt-1.17.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.4 kB)
Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Using cached scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Using cached gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Using cached smart_open-7.1.0-py3-none-any.whl (61 kB)
Using cached wrapt-1.17.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (83 kB)
Installing collected packages: wrapt, numpy, smart-open, scipy, gensim
  Attempting uninstall: wrapt
    Found existing installation: wrapt 1.17.2
    Uninstalling wrapt-1.17.2:
      Successfully uninstalled wrapt-1.17.2
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  Attempting uninstall: smart-open
    Found existing installation: smart-open 7.1.0
    Uninstalling smart-open-7.1.0:
      Successfully uninstalled smart-open-7.1.0
  Attempting uninstall: scipy
    Found existing installation: scipy 1.13.1
    Uninstalling scipy-1.13.1:
      Successfully uninstalled scipy-1.13.1
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.3
    Uninstalling gensim-4.3.3:
      Successfully uninstalled gensim-4.3.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
tsfresh 0.21.0 requires scipy>=1.14.0; python_version >= "3.10", but you have scipy 1.13.1 which is incompatible.
Successfully installed gensim-4.3.3 numpy-1.26.4 scipy-1.13.1 smart-open-7.1.0 wrapt-1.17.2
In [2]:
import os
import re
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from nltk.tokenize import word_tokenize
import pyLDAvis.gensim as gensimvis  # adjusted import
import pyLDAvis
import matplotlib.pyplot as plt

# Manually defined Korean stopword list
korean_stopwords = {
    '의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로',
    '자', '에', '와', '한', '하다', '에서', '것', '및', '위해', '그', '되다'
}

# Additional stopwords (domain words not useful for the analysis)
additional_stopwords = {'강점', '약점', '경쟁사'}
korean_stopwords.update(additional_stopwords)

# Paths of the text files to analyze
file_paths = [
    "01_다른경쟁사와간단비교.txt",
    "02_기업리서치관련정리.txt",
    "03_생성AI분석.txt"
]

# Combine the contents of all files into a single string
combined_text = ""

for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        combined_text += file.read() + "\n"

# Text preprocessing and tokenization
def preprocess(text):
    # Lowercase, remove special characters, tokenize
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in korean_stopwords and len(word) > 1]
    return tokens

# Tokenize the combined text (everything is treated as a single document here)
documents = preprocess(combined_text)

# Build the word dictionary
dictionary = corpora.Dictionary([documents])

# Build the corpus (the document as a Bag-of-Words vector)
corpus = [dictionary.doc2bow(documents)]

# Train the LDA model
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)

# Visualize with pyLDAvis
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis_data)

# Optionally save the visualization as an HTML file
pyLDAvis.save_html(vis_data, 'lda_visualization.html')
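If you want a number to back up the choice of num_topics=3, gensim's CoherenceModel can score each candidate topic count. A minimal sketch, assuming the per_file_* texts, dictionary, and corpus from the sketch earlier; the 'c_v' measure and the 2-5 range are arbitrary choices for illustration:

from gensim.models import CoherenceModel

for k in range(2, 6):
    model_k = LdaModel(per_file_corpus, num_topics=k, id2word=per_file_dictionary,
                       passes=15, random_state=42)
    cm = CoherenceModel(model=model_k, texts=per_file_texts,
                        dictionary=per_file_dictionary, coherence='c_v')
    print(k, round(cm.get_coherence(), 3))  # higher coherence is generally better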

Viewing the result directly in the notebook

In [5]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

pyLDAvis.enable_notebook()

vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
vis_data
Out[5]:
(interactive pyLDAvis topic visualization rendered here)