
LDA topic models: speed and visualization

admin · 08-24 10:31

Import the packages

Documentation for the `lda` package: https://github.com/lda-project/lda

We use the `lda` library; install it with `pip install lda`.
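Before loading the real data, it helps to know the input format: `lda.LDA.fit` expects an integer document-term matrix, where entry (d, w) counts how often word w appears in document d. A toy stand-in (the vocabulary and counts here are made up for illustration, not from the Reuters corpus):

```python
import numpy as np

# Toy document-term matrix: 3 documents over a 4-word vocabulary.
# X[d, w] = number of times word w occurs in document d.
vocab = ["church", "pope", "music", "tour"]
X = np.array([
    [3, 2, 0, 0],   # doc 0: mostly "church"/"pope"
    [1, 4, 0, 0],   # doc 1: mostly "pope"
    [0, 0, 5, 2],   # doc 2: mostly "music"/"tour"
])
print(X.shape)   # (3, 4)
print(X.sum())   # 17 word tokens in total
```

The Reuters matrix loaded below has the same structure, just bigger: 395 documents over a 4258-word vocabulary.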

```python
import numpy as np
import lda

# Load the bundled Reuters sample corpus as a document-term count matrix.
X = lda.datasets.load_reuters()
X.shape
```

```
(395, 4258)
```

So X has 395 rows and 4258 columns: 395 training documents.

```python
vocab = lda.datasets.load_reuters_vocab()  # the full vocabulary
len(vocab)
```

```
4258
```

That is, the corpus contains 4258 distinct words.

Take a look at the titles of the first ten training documents:

```python
title = lda.datasets.load_reuters_titles()
title[:10]
```

```
('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
 '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
 "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
 '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25',
 '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25',
 "5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25",
 '6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26',
 "7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25",
 '8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26',
 '9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26')
```

Now start training, with the number of topics set to 20 and 1500 iterations:

```python
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)  # n_iter: number of Gibbs sampling iterations
model.fit(X)
```

Console output:

```
INFO:lda:n_documents: 395
INFO:lda:vocab_size: 4258
INFO:lda:n_words: 84010
INFO:lda:n_topics: 20
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -1051748
INFO:lda:<10> log likelihood: -719800
INFO:lda:<20> log likelihood: -699115
INFO:lda:<30> log likelihood: -689370
INFO:lda:<40> log likelihood: -684918
...
INFO:lda:<1450> log likelihood: -654884
INFO:lda:<1460> log likelihood: -655493
INFO:lda:<1470> log likelihood: -655415
INFO:lda:<1480> log likelihood: -655192
INFO:lda:<1490> log likelihood: -655728
INFO:lda:<1499> log likelihood: -655858
<lda.lda.LDA at 0x7effa0508550>
```

Look at the word distributions for the 20 topics:

```python
topic_word = model.topic_word_
print(topic_word.shape)
topic_word
```
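The log-likelihood lines in the console output are worth watching: they climb steeply at first and then level off, the usual sign that the Gibbs sampler is converging. Taking the first five values logged above (one every 10 iterations):

```python
# First five log-likelihood values from the console output above.
ll = [-1051748, -719800, -699115, -689370, -684918]

# Improvement between consecutive checkpoints: every gain is positive,
# but each one is much smaller than the last, i.e. training is flattening out.
gains = [b - a for a, b in zip(ll, ll[1:])]
print(gains)  # [331948, 20685, 9745, 4452]
```

The fitted model also keeps these values in `model.loglikelihoods_` (one entry per logging interval), which is convenient for plotting a convergence curve.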

Output:

```
(20, 4258)
array([[3.62505347e-06, 3.62505347e-06, 3.62505347e-06, ...,
        3.62505347e-06, 3.62505347e-06, 3.62505347e-06],
       [1.87498968e-02, 1.17916463e-06, 1.17916463e-06, ...,
        1.17916463e-06, 1.17916463e-06, 1.17916463e-06],
       [1.52206232e-03, 5.05668544e-06, 4.05040504e-03, ...,
        5.05668544e-06, 5.05668544e-06, 5.05668544e-06],
       ...,
       [4.17266923e-02, 3.93610908e-06, 9.05698699e-03, ...,
        3.93610908e-06, 3.93610908e-06, 3.93610908e-06],
       [2.37609835e-06, 2.37609835e-06, 2.37609835e-06, ...,
        2.37609835e-06, 2.37609835e-06, 2.37609835e-06],
       [3.46310752e-06, 3.46310752e-06, 3.46310752e-06, ...,
        3.46310752e-06, 3.46310752e-06, 3.46310752e-06]])
```

Print the top 8 words of each topic:

```python
for i, topic_dist in enumerate(topic_word):
    print(np.array(vocab)[np.argsort(topic_dist)][:-9:-1])
```

```
['british' 'churchill' 'sale' 'million' 'major' 'letters' 'west' 'britain']
['church' 'government' 'political' 'country' 'state' 'people' 'party' 'against']
['elvis' 'king' 'fans' 'presley' 'life' 'concert' 'young' 'death']
['yeltsin' 'russian' 'russia' 'president' 'kremlin' 'moscow' 'michael' 'operation']
['pope' 'vatican' 'paul' 'john' 'surgery' 'hospital' 'pontiff' 'rome']
['family' 'funeral' 'police' 'miami' 'versace' 'cunanan' 'city' 'service']
['simpson' 'former' 'years' 'court' 'president' 'wife' 'south' 'church']
['order' 'mother' 'successor' 'election' 'nuns' 'church' 'nirmala' 'head']
['charles' 'prince' 'diana' 'royal' 'king' 'queen' 'parker' 'bowles']
['film' 'french' 'france' 'against' 'bardot' 'paris' 'poster' 'animal']
['germany' 'german' 'war' 'nazi' 'letter' 'christian' 'book' 'jews']
['east' 'peace' 'prize' 'award' 'timor' 'quebec' 'belo' 'leader']
["n't" 'life' 'show' 'told' 'very' 'love' 'television' 'father']
['years' 'year' 'time' 'last' 'church' 'world' 'people' 'say']
['mother' 'teresa' 'heart' 'calcutta' 'charity' 'nun' 'hospital' 'missionaries']
['city' 'salonika' 'capital' 'buddhist' 'cultural' 'vietnam' 'byzantine' 'show']
['music' 'tour' 'opera' 'singer' 'israel' 'people' 'film' 'israeli']
['church' 'catholic' 'bernardin' 'cardinal' 'bishop' 'wright' 'death' 'cancer']
['harriman' 'clinton' 'u.s' 'ambassador' 'paris' 'president' 'churchill' 'france']
['city' 'museum' 'art' 'exhibition' 'century' 'million' 'churches' 'set']
```

Finally, get each document's distribution over the topics, and from it the document's most likely topic:

```python
doc_topic = model.doc_topic_
print(doc_topic.shape)  # (395, 20): one row per training document, its distribution over the 20 topics
print("Topic distribution of the first document:", doc_topic[0])
print("Most likely topic of the first document:", doc_topic[0].argmax())
```

```
(395, 20)
Topic distribution of the first document: [...]
Most likely topic of the first document: 8
```
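The two indexing idioms above are worth unpacking: `np.argsort` sorts ascending, so the reversed slice `[:-9:-1]` picks the 8 highest-probability words, and `argmax` on a row of `doc_topic` picks that document's single most likely topic. A toy illustration with made-up numbers:

```python
import numpy as np

# 5-word toy vocabulary and one topic's word distribution (made up).
vocab = np.array(["apple", "banana", "cherry", "date", "fig"])
topic_dist = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

# argsort is ascending, so the reversed slice [:-4:-1] walks backwards
# over the last 3 positions: the 3 highest-probability words.
# (The article uses [:-9:-1] to get the top 8.)
top3 = vocab[np.argsort(topic_dist)][:-4:-1]
print(top3)  # ['banana' 'date' 'fig']

# For doc_topic, argmax along each row gives each document's main topic.
doc_topic = np.array([[0.1, 0.9],
                      [0.8, 0.2]])
print(doc_topic.argmax(axis=1))  # [1 0]
```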

Reposted from: https://blog.csdn.net/jiangzhenkang/article/details/84335646
