Markov-chain sentence generator in PythonMarkov Chain generatorShakespeare and dictionariesMarkov chain-based word salad generatorMarkov chain text generation in PythonCumulative transition matrix Markov Chain simulationMarkov Chain in PythonSentence generation using Markov ChainsCalculate probability of word occurenceGiven a string and a word dict, find all possible sentences
Is there a command to install basic applications on Ubuntu 16.04?
Why did I get only 5 points even though I won?
Why is the result of ('b'+'a'+ + 'a' + 'a').toLowerCase() 'banana'?
Loading military units into ships optimally, using backtracking
Safest way to store environment variable value in a file
What does the phrase "pull off sick wheelies and flips" mean here?
Are differences between uniformly distributed numbers uniformly distributed?
How to disable "Completion time:..." in SQL Server Messages window
What gave Harry Potter the idea of writing in Tom Riddle's diary?
How do some PhD students get 10+ papers? Is that what I need for landing good faculty position?
Plotting octahedron inside the sphere and sphere inside the cube
Not going forward with internship interview process
Breadcrumb history decision
Is there any way to stop a user from creating executables and running them?
How to describe accents?
Do beef farmed pastures net remove carbon emissions?
What should I call bands of armed men in the Middle Ages?
PhD advisor lost funding, need advice
Simplification of numbers
The cat ate your input again!
Why does the standard fingering / strumming for a D maj chord leave out the 5th string?
Redis Cache Shared Session Configuration
When were the tantalum capacitors first used in computing?
What is the status of the F-1B engine development?
Markov-chain sentence generator in Python
Markov Chain generatorShakespeare and dictionariesMarkov chain-based word salad generatorMarkov chain text generation in PythonCumulative transition matrix Markov Chain simulationMarkov Chain in PythonSentence generation using Markov ChainsCalculate probability of word occurenceGiven a string and a word dict, find all possible sentences
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;
$begingroup$
I wrote a Markov-chain based sentence generator as my first non-trivial Python program. I mainly used C before, so I probably have ignored a lot of Python conventions and features, so any advice would be appreciated.
text-gen.py
import sys
import random
class MarkovChain:
# Class constant that serves as an initial state for the Markov chain
START = ""
# The Markov chain is modelled as a directed graph,
# with the START state acting as the only source,
# and the tranisition probabilities as the graph weights.
#
# The graph is implemented using an adjacency list,
# which in turn is implemented as a dictionary of dictionaries.
#
# self.adjList is a dictionary keyed by all words of the text
# or START (the states). For each key/state, it contains
# another dictionary indexed by the words of the text
# that succeed the key in the text (the next states in the chain),
# and for each of those words/next states the dictionary contains
# the transition probability from the present state to them.
#
# This implementation was chosen because of it being easy to code,
# and offering an easy way to iterate on both the probabilities and
# the words/next states of each dictionary using items().
def __init__(self, file):
self.adjList =
# The totals dictionary is used in calculating the probabilities,
# for every word in the text/chain state it contains the total
# number of transitions from it to another state.
totals =
# Start by insering the initial state to the structures
self.adjList[MarkovChain.START] =
totals[MarkovChain.START] = 0
# prev: Contains the previously encountered word or the START state,
# initialized to the START state.
prev = MarkovChain.START
for line in file:
for word in line.split():
# If the word ends with a terminating punctuation mark,
# ignore the mark, and treat the word as a terminating state as
# it does not preceed another word in the current sentence.
# So prev is set to START, in order for the text model
# to account for the fact that some words start sentences
# more frequently than others (not all words are next states of START).
endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
if (endsTerm):
word = word[0:-1]
# If this is the first time the word is encountered,
# add it to the adjacency list, and initialize its dictionary
# and transition total.
if (word not in self.adjList):
self.adjList[word] =
totals[word] = 0
# If this is the first time the prev->word transition
# was detected, initialize the prev->word transition frequency to 1,
# else increment it.
if (word in self.adjList[prev]):
self.adjList[prev][word] += 1
else:
self.adjList[prev][word] = 1
# There is a prev->word state transition, so increment
# the total transition number of the prev state.
totals[prev] += 1
if (endsTerm):
prev = START
# Using total, convert the transition frequencies
# to transition probabilities.
for word, neighbors in self.adjList.items():
for name in neighbors:
neighbors[name] /= totals[word]
# chooseNextWord: Chooses the next state/word,
# by sampling the non uniform transition probability distribution
# of the current word/state.
def chooseNextWord(self, curWord):
# Convert the dict_keys object to a list
# to use indexing
nextWords = list(self.adjList[curWord].keys())
# Sampling is done through linear search.
for word in nextWords[0:-1]:
prob = self.adjList[curWord][word]
roll = random.random()
if (roll <= prob):
return word
# If none of the first N-1 words were chosen,
# only the last one was left.
return nextWords[-1]
# genSentence: Generates a sentence. If a positive
# limit is not provided by the caller, the sentences grow to
# an arbitrary number of words, until the last word of a sentence/a terminal state
# is reached.
def genSentence(self, limit = 0):
sentence = ""
curWord = self.chooseNextWord(MarkovChain.START)
sentence += curWord + " "
if (limit > 0):
wordsUsed = 1
while (wordsUsed < limit and self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
wordsUsed += 1
else:
while (self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
return sentence
if (__name__ == "__main__"):
if (len(sys.argv) < 3):
print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
sys.exit(1)
try:
with open(sys.argv[1], "r") as f:
markov = MarkovChain(f)
except OSError as error:
print(error.strerror)
sys.exit(1)
# Generate and print as many sentences as asked.
for k in range(0, int(sys.argv[2])):
print(markov.genSentence(20) + "n")
```
python python-3.x statistics markov-chain
New contributor
$endgroup$
add a comment |
$begingroup$
I wrote a Markov-chain based sentence generator as my first non-trivial Python program. I mainly used C before, so I probably have ignored a lot of Python conventions and features, so any advice would be appreciated.
text-gen.py
import sys
import random
class MarkovChain:
# Class constant that serves as an initial state for the Markov chain
START = ""
# The Markov chain is modelled as a directed graph,
# with the START state acting as the only source,
# and the tranisition probabilities as the graph weights.
#
# The graph is implemented using an adjacency list,
# which in turn is implemented as a dictionary of dictionaries.
#
# self.adjList is a dictionary keyed by all words of the text
# or START (the states). For each key/state, it contains
# another dictionary indexed by the words of the text
# that succeed the key in the text (the next states in the chain),
# and for each of those words/next states the dictionary contains
# the transition probability from the present state to them.
#
# This implementation was chosen because of it being easy to code,
# and offering an easy way to iterate on both the probabilities and
# the words/next states of each dictionary using items().
def __init__(self, file):
self.adjList =
# The totals dictionary is used in calculating the probabilities,
# for every word in the text/chain state it contains the total
# number of transitions from it to another state.
totals =
# Start by insering the initial state to the structures
self.adjList[MarkovChain.START] =
totals[MarkovChain.START] = 0
# prev: Contains the previously encountered word or the START state,
# initialized to the START state.
prev = MarkovChain.START
for line in file:
for word in line.split():
# If the word ends with a terminating punctuation mark,
# ignore the mark, and treat the word as a terminating state as
# it does not preceed another word in the current sentence.
# So prev is set to START, in order for the text model
# to account for the fact that some words start sentences
# more frequently than others (not all words are next states of START).
endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
if (endsTerm):
word = word[0:-1]
# If this is the first time the word is encountered,
# add it to the adjacency list, and initialize its dictionary
# and transition total.
if (word not in self.adjList):
self.adjList[word] =
totals[word] = 0
# If this is the first time the prev->word transition
# was detected, initialize the prev->word transition frequency to 1,
# else increment it.
if (word in self.adjList[prev]):
self.adjList[prev][word] += 1
else:
self.adjList[prev][word] = 1
# There is a prev->word state transition, so increment
# the total transition number of the prev state.
totals[prev] += 1
if (endsTerm):
prev = START
# Using total, convert the transition frequencies
# to transition probabilities.
for word, neighbors in self.adjList.items():
for name in neighbors:
neighbors[name] /= totals[word]
# chooseNextWord: Chooses the next state/word,
# by sampling the non uniform transition probability distribution
# of the current word/state.
def chooseNextWord(self, curWord):
# Convert the dict_keys object to a list
# to use indexing
nextWords = list(self.adjList[curWord].keys())
# Sampling is done through linear search.
for word in nextWords[0:-1]:
prob = self.adjList[curWord][word]
roll = random.random()
if (roll <= prob):
return word
# If none of the first N-1 words were chosen,
# only the last one was left.
return nextWords[-1]
# genSentence: Generates a sentence. If a positive
# limit is not provided by the caller, the sentences grow to
# an arbitrary number of words, until the last word of a sentence/a terminal state
# is reached.
def genSentence(self, limit = 0):
sentence = ""
curWord = self.chooseNextWord(MarkovChain.START)
sentence += curWord + " "
if (limit > 0):
wordsUsed = 1
while (wordsUsed < limit and self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
wordsUsed += 1
else:
while (self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
return sentence
if (__name__ == "__main__"):
if (len(sys.argv) < 3):
print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
sys.exit(1)
try:
with open(sys.argv[1], "r") as f:
markov = MarkovChain(f)
except OSError as error:
print(error.strerror)
sys.exit(1)
# Generate and print as many sentences as asked.
for k in range(0, int(sys.argv[2])):
print(markov.genSentence(20) + "n")
```
python python-3.x statistics markov-chain
New contributor
$endgroup$
add a comment |
$begingroup$
I wrote a Markov-chain based sentence generator as my first non-trivial Python program. I mainly used C before, so I probably have ignored a lot of Python conventions and features, so any advice would be appreciated.
text-gen.py
import sys
import random
class MarkovChain:
# Class constant that serves as an initial state for the Markov chain
START = ""
# The Markov chain is modelled as a directed graph,
# with the START state acting as the only source,
# and the tranisition probabilities as the graph weights.
#
# The graph is implemented using an adjacency list,
# which in turn is implemented as a dictionary of dictionaries.
#
# self.adjList is a dictionary keyed by all words of the text
# or START (the states). For each key/state, it contains
# another dictionary indexed by the words of the text
# that succeed the key in the text (the next states in the chain),
# and for each of those words/next states the dictionary contains
# the transition probability from the present state to them.
#
# This implementation was chosen because of it being easy to code,
# and offering an easy way to iterate on both the probabilities and
# the words/next states of each dictionary using items().
def __init__(self, file):
self.adjList =
# The totals dictionary is used in calculating the probabilities,
# for every word in the text/chain state it contains the total
# number of transitions from it to another state.
totals =
# Start by insering the initial state to the structures
self.adjList[MarkovChain.START] =
totals[MarkovChain.START] = 0
# prev: Contains the previously encountered word or the START state,
# initialized to the START state.
prev = MarkovChain.START
for line in file:
for word in line.split():
# If the word ends with a terminating punctuation mark,
# ignore the mark, and treat the word as a terminating state as
# it does not preceed another word in the current sentence.
# So prev is set to START, in order for the text model
# to account for the fact that some words start sentences
# more frequently than others (not all words are next states of START).
endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
if (endsTerm):
word = word[0:-1]
# If this is the first time the word is encountered,
# add it to the adjacency list, and initialize its dictionary
# and transition total.
if (word not in self.adjList):
self.adjList[word] =
totals[word] = 0
# If this is the first time the prev->word transition
# was detected, initialize the prev->word transition frequency to 1,
# else increment it.
if (word in self.adjList[prev]):
self.adjList[prev][word] += 1
else:
self.adjList[prev][word] = 1
# There is a prev->word state transition, so increment
# the total transition number of the prev state.
totals[prev] += 1
if (endsTerm):
prev = START
# Using total, convert the transition frequencies
# to transition probabilities.
for word, neighbors in self.adjList.items():
for name in neighbors:
neighbors[name] /= totals[word]
# chooseNextWord: Chooses the next state/word,
# by sampling the non uniform transition probability distribution
# of the current word/state.
def chooseNextWord(self, curWord):
# Convert the dict_keys object to a list
# to use indexing
nextWords = list(self.adjList[curWord].keys())
# Sampling is done through linear search.
for word in nextWords[0:-1]:
prob = self.adjList[curWord][word]
roll = random.random()
if (roll <= prob):
return word
# If none of the first N-1 words were chosen,
# only the last one was left.
return nextWords[-1]
# genSentence: Generates a sentence. If a positive
# limit is not provided by the caller, the sentences grow to
# an arbitrary number of words, until the last word of a sentence/a terminal state
# is reached.
def genSentence(self, limit = 0):
sentence = ""
curWord = self.chooseNextWord(MarkovChain.START)
sentence += curWord + " "
if (limit > 0):
wordsUsed = 1
while (wordsUsed < limit and self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
wordsUsed += 1
else:
while (self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
return sentence
if (__name__ == "__main__"):
if (len(sys.argv) < 3):
print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
sys.exit(1)
try:
with open(sys.argv[1], "r") as f:
markov = MarkovChain(f)
except OSError as error:
print(error.strerror)
sys.exit(1)
# Generate and print as many sentences as asked.
for k in range(0, int(sys.argv[2])):
print(markov.genSentence(20) + "n")
```
python python-3.x statistics markov-chain
New contributor
$endgroup$
I wrote a Markov-chain based sentence generator as my first non-trivial Python program. I mainly used C before, so I probably have ignored a lot of Python conventions and features, so any advice would be appreciated.
text-gen.py
import sys
import random
class MarkovChain:
# Class constant that serves as an initial state for the Markov chain
START = ""
# The Markov chain is modelled as a directed graph,
# with the START state acting as the only source,
# and the tranisition probabilities as the graph weights.
#
# The graph is implemented using an adjacency list,
# which in turn is implemented as a dictionary of dictionaries.
#
# self.adjList is a dictionary keyed by all words of the text
# or START (the states). For each key/state, it contains
# another dictionary indexed by the words of the text
# that succeed the key in the text (the next states in the chain),
# and for each of those words/next states the dictionary contains
# the transition probability from the present state to them.
#
# This implementation was chosen because of it being easy to code,
# and offering an easy way to iterate on both the probabilities and
# the words/next states of each dictionary using items().
def __init__(self, file):
self.adjList =
# The totals dictionary is used in calculating the probabilities,
# for every word in the text/chain state it contains the total
# number of transitions from it to another state.
totals =
# Start by insering the initial state to the structures
self.adjList[MarkovChain.START] =
totals[MarkovChain.START] = 0
# prev: Contains the previously encountered word or the START state,
# initialized to the START state.
prev = MarkovChain.START
for line in file:
for word in line.split():
# If the word ends with a terminating punctuation mark,
# ignore the mark, and treat the word as a terminating state as
# it does not preceed another word in the current sentence.
# So prev is set to START, in order for the text model
# to account for the fact that some words start sentences
# more frequently than others (not all words are next states of START).
endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
if (endsTerm):
word = word[0:-1]
# If this is the first time the word is encountered,
# add it to the adjacency list, and initialize its dictionary
# and transition total.
if (word not in self.adjList):
self.adjList[word] =
totals[word] = 0
# If this is the first time the prev->word transition
# was detected, initialize the prev->word transition frequency to 1,
# else increment it.
if (word in self.adjList[prev]):
self.adjList[prev][word] += 1
else:
self.adjList[prev][word] = 1
# There is a prev->word state transition, so increment
# the total transition number of the prev state.
totals[prev] += 1
if (endsTerm):
prev = START
# Using total, convert the transition frequencies
# to transition probabilities.
for word, neighbors in self.adjList.items():
for name in neighbors:
neighbors[name] /= totals[word]
# chooseNextWord: Chooses the next state/word,
# by sampling the non uniform transition probability distribution
# of the current word/state.
def chooseNextWord(self, curWord):
# Convert the dict_keys object to a list
# to use indexing
nextWords = list(self.adjList[curWord].keys())
# Sampling is done through linear search.
for word in nextWords[0:-1]:
prob = self.adjList[curWord][word]
roll = random.random()
if (roll <= prob):
return word
# If none of the first N-1 words were chosen,
# only the last one was left.
return nextWords[-1]
# genSentence: Generates a sentence. If a positive
# limit is not provided by the caller, the sentences grow to
# an arbitrary number of words, until the last word of a sentence/a terminal state
# is reached.
def genSentence(self, limit = 0):
sentence = ""
curWord = self.chooseNextWord(MarkovChain.START)
sentence += curWord + " "
if (limit > 0):
wordsUsed = 1
while (wordsUsed < limit and self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
wordsUsed += 1
else:
while (self.adjList[curWord]):
curWord = self.chooseNextWord(curWord)
sentence += curWord + " "
return sentence
if (__name__ == "__main__"):
if (len(sys.argv) < 3):
print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
sys.exit(1)
try:
with open(sys.argv[1], "r") as f:
markov = MarkovChain(f)
except OSError as error:
print(error.strerror)
sys.exit(1)
# Generate and print as many sentences as asked.
for k in range(0, int(sys.argv[2])):
print(markov.genSentence(20) + "n")
```
python python-3.x statistics markov-chain
python python-3.x statistics markov-chain
New contributor
New contributor
edited 7 hours ago
vnp
42.7k2 gold badges35 silver badges109 bronze badges
42.7k2 gold badges35 silver badges109 bronze badges
New contributor
asked 8 hours ago
HashewHashew
534 bronze badges
534 bronze badges
New contributor
New contributor
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
chooseNextWord
distorts the probabilities.For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).
A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.
The code may benefit from using
defaultdict
rather than a plain dictionaries. Lesserif
s is better.Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.
$endgroup$
$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago
2
$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago
$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't isP = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)
which is equal to(probability that the word was chosen) / (probability that the previous weren't)
. So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago
add a comment |
$begingroup$
endTerms
When a word ends with an endTerm, think you need to include an START or END symbol in adjList. Most words can appear anywhere in a sentence. So it is unlikely that you can end a sentence only when words don't have any follow-on words. Include the START/END symbol in adjList and the Markov process can also end a sentence.
the standard library
collections.defaultdict
provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.
collections.Counter
provides a dictionary that counts things.
random.choices
selects items from a population according to specified weights.
import collections
import random
class MarkovChain:
START = ""
def __init__(self, file):
adjList = collections.defaultdict(collections.Counter)
# this inserts START into the defaultdict
adjList[MarkovChain.START]
prev = MarkovChain.START
for line in file:
for word in line.split():
endsTerm = word[-1] in ('.', '?', '!')
if (endsTerm):
word = word[:-1]
adjList[prev].update([word])
if endsTerm:
# mark the end of a sentence
adjList[word].update([MarkovChain.START])
prev = MarkovChain.START
else:
prev = word
# convert defaultdict to a regular dict
# the values are a tuple: ([follow words], [counts])
# for use in random.choices() in chooseNextWord()
self.adjList = k:(list(v.keys()), list(v.values()))
for k,v in adjList.items()
#print(self.adjList)
def chooseNextWord(self, word):
# random.choices returns a list, hence the [0]
return random.choices(*self.adjList[word])[0]
def genSentence(self, limit = 0):
sentence = []
curWord = MarkovChain.START
while True:
curWord = self.chooseNextWord(curWord)
sentence.append(curWord)
if 0 < limit < len(sentence) or curWord == MarkovChain.START:
break
return ' '.join(sentence).strip()
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Hashew is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f225992%2fmarkov-chain-sentence-generator-in-python%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
chooseNextWord
distorts the probabilities.For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).
A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.
The code may benefit from using
defaultdict
rather than a plain dictionaries. Lesserif
s is better.Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.
$endgroup$
$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago
2
$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago
$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't isP = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)
which is equal to(probability that the word was chosen) / (probability that the previous weren't)
. So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago
add a comment |
$begingroup$
chooseNextWord
distorts the probabilities.For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).
A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.
The code may benefit from using
defaultdict
rather than a plain dictionaries. Lesserif
s is better.Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.
$endgroup$
$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago
2
$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago
$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't isP = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)
which is equal to(probability that the word was chosen) / (probability that the previous weren't)
. So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago
add a comment |
$begingroup$
chooseNextWord
distorts the probabilities.For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).
A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.
The code may benefit from using
defaultdict
rather than a plain dictionaries. Lesserif
s is better.Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.
$endgroup$
chooseNextWord
distorts the probabilities.For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).
A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.
The code may benefit from using
defaultdict
rather than a plain dictionaries. Lesserif
s is better.Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.
edited 6 hours ago
Toby Speight
31.1k7 gold badges45 silver badges135 bronze badges
31.1k7 gold badges45 silver badges135 bronze badges
answered 8 hours ago
vnpvnp
42.7k2 gold badges35 silver badges109 bronze badges
42.7k2 gold badges35 silver badges109 bronze badges
$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago
2
$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago
$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't isP = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)
which is equal to(probability that the word was chosen) / (probability that the previous weren't)
. So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago
add a comment |
$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago
2
$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago
$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't isP = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)
which is equal to(probability that the word was chosen) / (probability that the previous weren't)
. So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago
$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago
$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago
2
2
$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago
$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago
$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is
P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)
which is equal to (probability that the word was chosen) / (probability that the previous weren't)
. So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?$endgroup$
– Hashew
7 hours ago
$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is
P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)
which is equal to (probability that the word was chosen) / (probability that the previous weren't)
. So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?$endgroup$
– Hashew
7 hours ago
add a comment |
$begingroup$
endTerms
When a word ends with an endTerm, think you need to include an START or END symbol in adjList. Most words can appear anywhere in a sentence. So it is unlikely that you can end a sentence only when words don't have any follow-on words. Include the START/END symbol in adjList and the Markov process can also end a sentence.
the standard library
collections.defaultdict
provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.
collections.Counter
provides a dictionary that counts things.
random.choices
selects items from a population according to specified weights.
import collections
import random
class MarkovChain:
START = ""
def __init__(self, file):
adjList = collections.defaultdict(collections.Counter)
# this inserts START into the defaultdict
adjList[MarkovChain.START]
prev = MarkovChain.START
for line in file:
for word in line.split():
endsTerm = word[-1] in ('.', '?', '!')
if (endsTerm):
word = word[:-1]
adjList[prev].update([word])
if endsTerm:
# mark the end of a sentence
adjList[word].update([MarkovChain.START])
prev = MarkovChain.START
else:
prev = word
# convert defaultdict to a regular dict
# the values are a tuple: ([follow words], [counts])
# for use in random.choices() in chooseNextWord()
self.adjList = k:(list(v.keys()), list(v.values()))
for k,v in adjList.items()
#print(self.adjList)
def chooseNextWord(self, word):
# random.choices returns a list, hence the [0]
return random.choices(*self.adjList[word])[0]
def genSentence(self, limit = 0):
sentence = []
curWord = MarkovChain.START
while True:
curWord = self.chooseNextWord(curWord)
sentence.append(curWord)
if 0 < limit < len(sentence) or curWord == MarkovChain.START:
break
return ' '.join(sentence).strip()
$endgroup$
add a comment |
$begingroup$
endTerms
When a word ends with an endTerm, think you need to include an START or END symbol in adjList. Most words can appear anywhere in a sentence. So it is unlikely that you can end a sentence only when words don't have any follow-on words. Include the START/END symbol in adjList and the Markov process can also end a sentence.
the standard library
collections.defaultdict
provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.
collections.Counter
provides a dictionary that counts things.
random.choices
selects items from a population according to specified weights.
import collections
import random
class MarkovChain:
START = ""
def __init__(self, file):
adjList = collections.defaultdict(collections.Counter)
# this inserts START into the defaultdict
adjList[MarkovChain.START]
prev = MarkovChain.START
for line in file:
for word in line.split():
endsTerm = word[-1] in ('.', '?', '!')
if (endsTerm):
word = word[:-1]
adjList[prev].update([word])
if endsTerm:
# mark the end of a sentence
adjList[word].update([MarkovChain.START])
prev = MarkovChain.START
else:
prev = word
# convert defaultdict to a regular dict
# the values are a tuple: ([follow words], [counts])
# for use in random.choices() in chooseNextWord()
self.adjList = k:(list(v.keys()), list(v.values()))
for k,v in adjList.items()
#print(self.adjList)
def chooseNextWord(self, word):
# random.choices returns a list, hence the [0]
return random.choices(*self.adjList[word])[0]
def genSentence(self, limit = 0):
sentence = []
curWord = MarkovChain.START
while True:
curWord = self.chooseNextWord(curWord)
sentence.append(curWord)
if 0 < limit < len(sentence) or curWord == MarkovChain.START:
break
return ' '.join(sentence).strip()
$endgroup$
add a comment |
$begingroup$
endTerms
When a word ends with an endTerm, think you need to include an START or END symbol in adjList. Most words can appear anywhere in a sentence. So it is unlikely that you can end a sentence only when words don't have any follow-on words. Include the START/END symbol in adjList and the Markov process can also end a sentence.
the standard library
collections.defaultdict
provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.
collections.Counter
provides a dictionary that counts things.
random.choices
selects items from a population according to specified weights.
import collections
import random
class MarkovChain:
START = ""
def __init__(self, file):
adjList = collections.defaultdict(collections.Counter)
# this inserts START into the defaultdict
adjList[MarkovChain.START]
prev = MarkovChain.START
for line in file:
for word in line.split():
endsTerm = word[-1] in ('.', '?', '!')
if (endsTerm):
word = word[:-1]
adjList[prev].update([word])
if endsTerm:
# mark the end of a sentence
adjList[word].update([MarkovChain.START])
prev = MarkovChain.START
else:
prev = word
# convert defaultdict to a regular dict
# the values are a tuple: ([follow words], [counts])
# for use in random.choices() in chooseNextWord()
self.adjList = k:(list(v.keys()), list(v.values()))
for k,v in adjList.items()
#print(self.adjList)
def chooseNextWord(self, word):
# random.choices returns a list, hence the [0]
return random.choices(*self.adjList[word])[0]
def genSentence(self, limit = 0):
sentence = []
curWord = MarkovChain.START
while True:
curWord = self.chooseNextWord(curWord)
sentence.append(curWord)
if 0 < limit < len(sentence) or curWord == MarkovChain.START:
break
return ' '.join(sentence).strip()
$endgroup$
endTerms
When a word ends with an endTerm, think you need to include an START or END symbol in adjList. Most words can appear anywhere in a sentence. So it is unlikely that you can end a sentence only when words don't have any follow-on words. Include the START/END symbol in adjList and the Markov process can also end a sentence.
the standard library
collections.defaultdict
provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.
collections.Counter
provides a dictionary that counts things.
random.choices
selects items from a population according to specified weights.
import collections
import random
class MarkovChain:
START = ""
def __init__(self, file):
adjList = collections.defaultdict(collections.Counter)
# this inserts START into the defaultdict
adjList[MarkovChain.START]
prev = MarkovChain.START
for line in file:
for word in line.split():
endsTerm = word[-1] in ('.', '?', '!')
if (endsTerm):
word = word[:-1]
adjList[prev].update([word])
if endsTerm:
# mark the end of a sentence
adjList[word].update([MarkovChain.START])
prev = MarkovChain.START
else:
prev = word
# convert defaultdict to a regular dict
# the values are a tuple: ([follow words], [counts])
# for use in random.choices() in chooseNextWord()
self.adjList = k:(list(v.keys()), list(v.values()))
for k,v in adjList.items()
#print(self.adjList)
def chooseNextWord(self, word):
# random.choices returns a list, hence the [0]
return random.choices(*self.adjList[word])[0]
def genSentence(self, limit = 0):
sentence = []
curWord = MarkovChain.START
while True:
curWord = self.chooseNextWord(curWord)
sentence.append(curWord)
if 0 < limit < len(sentence) or curWord == MarkovChain.START:
break
return ' '.join(sentence).strip()
answered 1 hour ago
RootTwoRootTwo
1,0593 silver badges6 bronze badges
1,0593 silver badges6 bronze badges
add a comment |
add a comment |
Hashew is a new contributor. Be nice, and check out our Code of Conduct.
Hashew is a new contributor. Be nice, and check out our Code of Conduct.
Hashew is a new contributor. Be nice, and check out our Code of Conduct.
Hashew is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f225992%2fmarkov-chain-sentence-generator-in-python%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown