13th March, 2019
Textual data generally starts out as nothing more than a sequence of characters. Most text-analysis processes, however, operate on the individual words in the data set, so the text must first be parsed by tokenization. Tokenization gives later processing steps a reliable unit to work with. The step is often straightforward because the text is already stored in a machine-readable format, but some problems remain, such as removing punctuation marks and handling characters like brackets and hyphens. In short, tokenization divides a large quantity of text into smaller parts called tokens.

Natural language processing is used for building applications such as text classification, intelligent chatbots, sentiment analysis, language translation, etc.

The Natural Language Toolkit (NLTK) has a very important module, tokenize, which provides two key functions:

  1. word_tokenize
  2. sent_tokenize

Tokenization of words

word_tokenize() is used to split a sentence into words. The output of word tokenization can be stored in a DataFrame for better text understanding in machine learning applications (a sketch of this appears after the example below). It can also feed further text-cleaning steps such as punctuation removal, numeric-character removal, or stemming, and the resulting tokens are the units that are later converted into numeric features.

Install the NLTK package from the Anaconda prompt:

pip install nltk

Download the necessary data in a Python script:

import nltk
nltk.download()
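
Calling nltk.download() with no arguments opens an interactive downloader. For this tutorial only a couple of resources are needed, and they can be fetched directly (resource names as used by recent NLTK releases; they may differ in other versions):

import nltk

# Fetch only the resources this tutorial uses
nltk.download('punkt')                        # pretrained sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')   # default tagger behind nltk.pos_tag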

Steps:

  1. Import the word_tokenize function from the NLTK library.
  2. Initialize a variable to store the text data (here the variable is text).
  3. word_tokenize breaks the text into individual words and punctuation marks, as seen in the output.

Example:

from nltk.tokenize import word_tokenize

text = "The shining sun, the warm breeze, the cool surf, and palms blowing in the breeze are great!"

print(word_tokenize(text))

Output:

['The', 'shining', 'sun', ',', 'the', 'warm', 'breeze', ',', 'the', 'cool', 'surf', ',', 'and', 'palms', 'blowing', 'in', 'the', 'breeze', 'are', 'great', '!']
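
As mentioned earlier, the tokens can be stored in a DataFrame and cleaned further. Below is a minimal sketch, assuming pandas is installed (pandas is not part of NLTK), that keeps one token per row and drops pure-punctuation tokens:

import string

import pandas as pd
from nltk.tokenize import word_tokenize

text = "The shining sun, the warm breeze, the cool surf, and palms blowing in the breeze are great!"

# One token per row makes inspection and cleaning easy
df = pd.DataFrame({'token': word_tokenize(text)})

# Drop tokens that are just punctuation marks (',' and '!' here)
df = df[~df['token'].isin(list(string.punctuation))]
print(df['token'].tolist())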

Tokenization of sentences

Sentence tokenization is different from word tokenization: it splits text into sentences, using punctuation marks such as full stops, question marks, and exclamation points as boundaries.

Steps:

  1. Import the sent_tokenize function.
  2. sent_tokenize parses the text at sentence-ending punctuation and breaks it into individual sentences.

Example:

from nltk.tokenize import sent_tokenize

text = "How awful! What a chaos!."

print(sent_tokenize(text))

Output:

['How awful!', 'What a chaos!.']
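
The two tokenizers are often combined: first split the text into sentences, then split each sentence into words. A short sketch using the same example text:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "How awful! What a chaos!."

# Word-tokenize each sentence separately
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))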

POS Tagging

Part-of-speech (POS) tagging is responsible for reading text in a language and assigning a specific part-of-speech tag to each word.
Steps:

  1. Tokenize the text using word_tokenize.
  2. Apply nltk.pos_tag(tokenize_text).
  3. The POS tagger adds linguistic (mostly grammatical) part-of-speech information, such as verb, noun, or preposition, to the sub-sentential units (tokens).

Example:

import nltk
from nltk.tokenize import word_tokenize

text = "I got the concert tickets!"

tokenize_text = word_tokenize(text)

print(nltk.pos_tag(tokenize_text))

Output:

[('I', 'PRP'), ('got', 'VBD'), ('the', 'DT'), ('concert', 'JJ'), ('tickets', 'NNS'), ('!', '.')]
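
The tags make it easy to filter tokens by grammatical category. Here is a minimal sketch that keeps only the nouns (the Penn Treebank noun tags all start with 'NN'); the exact tags you get depend on the tagger's output:

import nltk
from nltk.tokenize import word_tokenize

text = "I got the concert tickets!"
tagged = nltk.pos_tag(word_tokenize(text))

# Keep tokens tagged as nouns: NN, NNS, NNP, NNPS
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)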

Some examples of POS tags are as below:

Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun, plural (desks)
NNP proper noun, singular (Sarah)
NNPS proper noun, plural (Indians or Americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him, himself)
PRP$ possessive pronoun (her, his, mine, my, our)
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb, base form (ask)
VBG verb, gerund (judging)
VBD verb, past tense (pleaded)
VBN verb, past participle (reunified)
VBP verb, present tense, not 3rd person singular (wrap)
VBZ verb, present tense, 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh-pronoun (who)
WRB wh-adverb (how)
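
NLTK can also print the official definition and examples for any of these tags via nltk.help.upenn_tagset (this needs the 'tagsets' resource to be downloaded first):

import nltk
nltk.download('tagsets')   # documentation for the Penn Treebank tag set

# Look up one tag; calling it with no argument lists every tag
nltk.help.upenn_tagset('JJ')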