NLP Introduction -- NLTK

import nltk

Counting word frequencies

freq = nltk.FreqDist(tokens)
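
A minimal sketch of how this is typically used, assuming tokens is a list of words (for example produced by word_tokenize, shown further down); the sample sentence is chosen only for illustration, and most_common is a standard FreqDist method:

import nltk
from nltk.tokenize import word_tokenize

# hypothetical sample sentence, used only to have something to count
tokens = word_tokenize("the cat sat on the mat and the cat slept")
freq = nltk.FreqDist(tokens)
print(freq.most_common(3))   # the three most frequent tokens with their counts, e.g. ('the', 3) first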

Handling stop words

from nltk.corpus import stopwords
stopwords.words('english')
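
Loading the list is only half of the job; the usual pattern is to turn it into a set and filter your tokens against it. A rough sketch, with the sample sentence and variable names chosen only for illustration (requires the stopwords corpus, downloaded once with nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))   # set membership tests are faster than list lookups
tokens = word_tokenize("This is a simple example showing the removal of stop words")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # roughly: ['simple', 'example', 'showing', 'removal', 'stop', 'words']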

Tokenizing text with NLTK

Text cannot be processed until it has been tokenized; tokenization means breaking larger pieces of text into smaller units.

You can tokenize a paragraph into sentences and a sentence into individual words; NLTK provides a sentence tokenizer and a word tokenizer for each of these tasks.

sentence_tokenizer

from nltk.tokenize import sent_tokenize   # Punkt-based sentence splitter; download its models once with nltk.download('punkt')
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

word_tokenizer

from nltk.tokenize import word_tokenize   # splits a sentence into word and punctuation tokens
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']

Stemming

In linguistic morphology and information retrieval, stemming is the process of removing affixes from a word to obtain its stem; for example, the stem of working is work.

Search engines use this technique when indexing pages: people write many different forms of the same word, and stemming reduces them all to a common root.

Many stemming algorithms can handle this; the most common is the Porter stemming algorithm. NLTK provides it through a class called PorterStemmer.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('working'))
print(stemmer.stem('worked'))
work
work
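
In practice the stemmer is applied to every token of a tokenized text, which is how the pieces above fit together. A minimal sketch, with the sentence chosen only for illustration:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
tokens = word_tokenize("The workers worked late and are still working")
print([stemmer.stem(t) for t in tokens])
# the inflected forms 'worked' and 'working' both collapse to the stem 'work'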