The game
(To optimize this game, a test phase and a confusion matrix need to be added)
1. GATHER YOUR SAMPLE DATA: write 6 short sentences in the same style: 3 positive and 3 negative.
2. PROCESS THE TEXT:
2.1. Decide the unit of analysis (word/character/bigram…).
2.2. Split your sentences into units.
2.3. Mark each unit as being positive or negative.
2.4. Create your vocabulary: a collection of all unique units of all 6 sentences.
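Steps 1 and 2 could be sketched in Python as follows. This is only an illustration: the six sentences are invented, the unit of analysis is assumed to be the word, and the names `sentences`, `split_into_units`, `labelled_units` and `vocabulary` are hypothetical.

```python
# Hypothetical sample data: 3 positive and 3 negative sentences in the same style.
sentences = [
    ("i love this game", "positive"),
    ("this game is fun", "positive"),
    ("what a great day", "positive"),
    ("i hate this game", "negative"),
    ("this game is boring", "negative"),
    ("what a bad day", "negative"),
]


def split_into_units(sentence):
    """Split a sentence into word units (assumption: the unit is the word)."""
    return sentence.lower().split()


# Step 2.3: every unit inherits the label of the sentence it comes from.
labelled_units = [
    (unit, label)
    for text, label in sentences
    for unit in split_into_units(text)
]

# Step 2.4: the vocabulary is the collection of all unique units.
vocabulary = sorted({unit for unit, _ in labelled_units})

print(len(labelled_units), "units,", len(vocabulary), "unique")
```

Choosing characters or bigrams instead would only change `split_into_units`; the rest of the sketch stays the same.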
3. PREPARE THE TRANSFORMATION from WORDS to NUMBERS:
3.1. Display the units of your vocabulary in 1 row of a grid. These are the columns of your matrix.
3.2. For each sentence, add 1 row to the grid and place its units in the matching columns.
3.3. Calculate the probability that a sentence in your model is positive:
number of positive sentences / total number of sentences
3.4. Calculate the probability that a sentence in your model is negative:
number of negative sentences / total number of sentences
3.5. Count all positive units.
3.6. Count all negative units.
3.7. Count all unique units (your vocabulary size).
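The counts of step 3 could be computed like this. It is a minimal sketch on invented sentences, with words as units; `training`, `prior`, `unit_counts`, `total_units` and `vocabulary_size` are hypothetical names.

```python
from collections import Counter

# Hypothetical training data: 3 positive and 3 negative sentences.
training = [
    ("i love this game", "positive"),
    ("this game is fun", "positive"),
    ("what a great day", "positive"),
    ("i hate this game", "negative"),
    ("this game is boring", "negative"),
    ("what a bad day", "negative"),
]

# Steps 3.3 and 3.4: sentences per class / total number of sentences.
sentence_counts = Counter(label for _, label in training)
prior = {label: count / len(training) for label, count in sentence_counts.items()}

# Steps 3.5 and 3.6: how often each unit occurs in each class.
unit_counts = {"positive": Counter(), "negative": Counter()}
for text, label in training:
    unit_counts[label].update(text.split())
total_units = {label: sum(counts.values()) for label, counts in unit_counts.items()}

# Step 3.7: vocabulary size is the number of unique units over both classes.
vocabulary_size = len(set(unit_counts["positive"]) | set(unit_counts["negative"]))

print(prior, total_units, vocabulary_size)
```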
4. The TRAINING starts! For each unit you make the following calculation:
4.1. If the unit is positive:
the probability that a sentence in your model is positive * the probability that the unit is positive
This means:
(number of positive sentences / total number of sentences)
*
(number of times that the unit is used as a positive example + 1)
/
(total number of positive units + vocabulary size)
4.2. Else:
the probability that a sentence in your model is negative * the probability that the unit is negative
This means:
(number of negative sentences / total number of sentences)
*
(number of times that the unit is used as a negative example + 1)
/
(total number of negative units + vocabulary size)
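As a worked example of the calculation in step 4, here is a small sketch. The function name `cell_value` and the numbers passed to it are made up for illustration: a unit seen once in positive sentences, 3 positive sentences out of 6, 12 positive units in total and a vocabulary of 13 units.

```python
def cell_value(class_sentences, total_sentences, unit_count, class_units, vocabulary_size):
    """Step 4: class probability times the add-one smoothed unit probability."""
    class_probability = class_sentences / total_sentences
    unit_probability = (unit_count + 1) / (class_units + vocabulary_size)
    return class_probability * unit_probability


# Hypothetical numbers: (3 / 6) * ((1 + 1) / (12 + 13))
value = cell_value(
    class_sentences=3,
    total_sentences=6,
    unit_count=1,
    class_units=12,
    vocabulary_size=13,
)
print(round(value, 4))  # 0.04
```

The "+ 1" in the numerator and the "+ vocabulary size" in the denominator go together: every unit of the vocabulary gets one extra imaginary occurrence, so the unit probabilities of a class still add up to 1.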
5. SMOOTHING UNITS: each cell in your grid should have a number now. Add 0.000001 to cells that have 0. This prevents the multiplications in step 7 from ending up at zero.
6. SMOOTHING UNKNOWN UNITS: add one last column to your grid with the label ‘Unknown’. Fill this column with smoothing numbers.
7. THE PREDICTION CAN START!
7.1. Invent a new sentence in the same style as your training data.
7.2. Split your sentence into the type of units you chose in step 2.1.
7.3. Calculate the probability that the new sentence is positive:
7.3.1. Find the corresponding positive probability in your grid for each unit of the new sentence.
7.3.2. If the unit does not exist, pick the smoothing number of the ‘unknown unit’.
7.3.3. Multiply the probability that a sentence in your model is positive by all individual probabilities of the units of your new sentence.
7.4. Calculate the probability that the new sentence is negative:
7.4.1. Find the corresponding negative probability in your grid for each unit of the new sentence.
7.4.2. If the unit does not exist, pick the smoothing number of the ‘unknown unit’.
7.4.3. Multiply the probability that a sentence in your model is negative by all individual probabilities of the units of your new sentence.
8. Compare the outcomes of 7.3.3. and 7.4.3.
9. ORACLE: the highest value in step 8 is the prediction made by this model.
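The whole game, from training to oracle, could be sketched in Python as below. This is one possible reading of the steps, not a definitive implementation, and it makes a few assumptions: the training sentences are invented, the unit is the word, unknown units get the smoothing number 0.000001, and the probability of the class is multiplied in once per sentence (as in the usual Naive Bayes formulation) rather than once per cell. The names `TRAINING`, `train`, `score` and `predict` are hypothetical.

```python
from collections import Counter

# Hypothetical training data: 3 positive and 3 negative sentences.
TRAINING = [
    ("i love this game", "positive"),
    ("this game is fun", "positive"),
    ("what a great day", "positive"),
    ("i hate this game", "negative"),
    ("this game is boring", "negative"),
    ("what a bad day", "negative"),
]
UNKNOWN_PROBABILITY = 0.000001  # assumption: smoothing number for unknown units


def train(examples):
    """Count sentences and units per class and collect the vocabulary."""
    sentence_counts = Counter()
    unit_counts = {}
    for text, label in examples:
        sentence_counts[label] += 1
        unit_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {unit for counts in unit_counts.values() for unit in counts}
    return sentence_counts, unit_counts, vocabulary


def score(sentence, label, model):
    """Class probability times the probability of every unit in the sentence."""
    sentence_counts, unit_counts, vocabulary = model
    result = sentence_counts[label] / sum(sentence_counts.values())
    class_units = sum(unit_counts[label].values())
    for unit in sentence.lower().split():
        if unit in vocabulary:
            result *= (unit_counts[label][unit] + 1) / (class_units + len(vocabulary))
        else:
            result *= UNKNOWN_PROBABILITY
    return result


def predict(sentence, model):
    """The oracle: return the class with the highest score."""
    sentence_counts = model[0]
    return max(sentence_counts, key=lambda label: score(sentence, label, model))


model = train(TRAINING)
print(predict("i love this day", model))
print(predict("i hate this zebra", model))
```

With longer sentences the product of many small numbers quickly becomes tiny; a common variation is to add up logarithms of the probabilities instead of multiplying the probabilities themselves.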