Markov-chain sentence generator in PythonMarkov Chain generatorShakespeare and dictionariesMarkov chain-based word salad generatorMarkov chain text generation in PythonCumulative transition matrix Markov Chain simulationMarkov Chain in PythonSentence generation using Markov ChainsCalculate probability of word occurenceGiven a string and a word dict, find all possible sentences

Is there a command to install basic applications on Ubuntu 16.04?

Why did I get only 5 points even though I won?

Why is the result of ('b'+'a'+ + 'a' + 'a').toLowerCase() 'banana'?

Loading military units into ships optimally, using backtracking

Safest way to store environment variable value in a file

What does the phrase "pull off sick wheelies and flips" mean here?

Are differences between uniformly distributed numbers uniformly distributed?

How to disable "Completion time:..." in SQL Server Messages window

What gave Harry Potter the idea of writing in Tom Riddle's diary?

How do some PhD students get 10+ papers? Is that what I need for landing good faculty position?

Plotting octahedron inside the sphere and sphere inside the cube

Not going forward with internship interview process

Breadcrumb history decision

Is there any way to stop a user from creating executables and running them?

How to describe accents?

Do beef farmed pastures net remove carbon emissions?

What should I call bands of armed men in the Middle Ages?

PhD advisor lost funding, need advice

Simplification of numbers

The cat ate your input again!

Why does the standard fingering / strumming for a D maj chord leave out the 5th string?

Redis Cache Shared Session Configuration

When were the tantalum capacitors first used in computing?

What is the status of the F-1B engine development?

Markov-chain sentence generator in Python

Markov Chain generatorShakespeare and dictionariesMarkov chain-based word salad generatorMarkov chain text generation in PythonCumulative transition matrix Markov Chain simulationMarkov Chain in PythonSentence generation using Markov ChainsCalculate probability of word occurenceGiven a string and a word dict, find all possible sentences

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;

I wrote a Markov-chain based sentence generator as my first non-trivial Python program. I mainly used C before, so I probably have ignored a lot of Python conventions and features, so any advice would be appreciated.

text-gen.py

import sys
import random

class MarkovChain:
 # Class constant that serves as an initial state for the Markov chain
 START = ""

 # The Markov chain is modelled as a directed graph,
 # with the START state acting as the only source,
 # and the tranisition probabilities as the graph weights.
 #
 # The graph is implemented using an adjacency list,
 # which in turn is implemented as a dictionary of dictionaries.
 #
 # self.adjList is a dictionary keyed by all words of the text
 # or START (the states). For each key/state, it contains
 # another dictionary indexed by the words of the text
 # that succeed the key in the text (the next states in the chain),
 # and for each of those words/next states the dictionary contains
 # the transition probability from the present state to them.
 #
 # This implementation was chosen because of it being easy to code,
 # and offering an easy way to iterate on both the probabilities and
 # the words/next states of each dictionary using items().
 def __init__(self, file):
 self.adjList = 

 # The totals dictionary is used in calculating the probabilities,
 # for every word in the text/chain state it contains the total
 # number of transitions from it to another state.
 totals = 


 # Start by insering the initial state to the structures
 self.adjList[MarkovChain.START] = 
 totals[MarkovChain.START] = 0

 # prev: Contains the previously encountered word or the START state,
 # initialized to the START state.
 prev = MarkovChain.START


 for line in file:
 for word in line.split():
 # If the word ends with a terminating punctuation mark,
 # ignore the mark, and treat the word as a terminating state as
 # it does not preceed another word in the current sentence.
 # So prev is set to START, in order for the text model
 # to account for the fact that some words start sentences
 # more frequently than others (not all words are next states of START).
 endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
 if (endsTerm):
 word = word[0:-1]

 # If this is the first time the word is encountered,
 # add it to the adjacency list, and initialize its dictionary
 # and transition total.
 if (word not in self.adjList):
 self.adjList[word] = 
 totals[word] = 0

 # If this is the first time the prev->word transition
 # was detected, initialize the prev->word transition frequency to 1,
 # else increment it.
 if (word in self.adjList[prev]):
 self.adjList[prev][word] += 1 
 else:
 self.adjList[prev][word] = 1

 # There is a prev->word state transition, so increment
 # the total transition number of the prev state.
 totals[prev] += 1

 if (endsTerm):
 prev = START

 # Using total, convert the transition frequencies
 # to transition probabilities.
 for word, neighbors in self.adjList.items():
 for name in neighbors:
 neighbors[name] /= totals[word]


 # chooseNextWord: Chooses the next state/word,
 # by sampling the non uniform transition probability distribution
 # of the current word/state. 
 def chooseNextWord(self, curWord):
 # Convert the dict_keys object to a list
 # to use indexing
 nextWords = list(self.adjList[curWord].keys())

 # Sampling is done through linear search.
 for word in nextWords[0:-1]:
 prob = self.adjList[curWord][word]
 roll = random.random()
 if (roll <= prob):
 return word

 # If none of the first N-1 words were chosen,
 # only the last one was left.
 return nextWords[-1]


 # genSentence: Generates a sentence. If a positive
 # limit is not provided by the caller, the sentences grow to
 # an arbitrary number of words, until the last word of a sentence/a terminal state
 # is reached.
 def genSentence(self, limit = 0):
 sentence = ""

 curWord = self.chooseNextWord(MarkovChain.START)
 sentence += curWord + " "

 if (limit > 0):
 wordsUsed = 1
 while (wordsUsed < limit and self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "
 wordsUsed += 1
 else:
 while (self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "

 return sentence



if (__name__ == "__main__"):
 if (len(sys.argv) < 3):
 print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
 sys.exit(1)

 try:
 with open(sys.argv[1], "r") as f:
 markov = MarkovChain(f)

 except OSError as error:
 print(error.strerror)
 sys.exit(1)

 # Generate and print as many sentences as asked.
 for k in range(0, int(sys.argv[2])):
 print(markov.genSentence(20) + "n")
```

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

asked 8 hours ago

Hashew

534 bronze badges

New contributor

add a comment |

text-gen.py

import sys
import random

class MarkovChain:
 # Class constant that serves as an initial state for the Markov chain
 START = ""

 # The Markov chain is modelled as a directed graph,
 # with the START state acting as the only source,
 # and the tranisition probabilities as the graph weights.
 #
 # The graph is implemented using an adjacency list,
 # which in turn is implemented as a dictionary of dictionaries.
 #
 # self.adjList is a dictionary keyed by all words of the text
 # or START (the states). For each key/state, it contains
 # another dictionary indexed by the words of the text
 # that succeed the key in the text (the next states in the chain),
 # and for each of those words/next states the dictionary contains
 # the transition probability from the present state to them.
 #
 # This implementation was chosen because of it being easy to code,
 # and offering an easy way to iterate on both the probabilities and
 # the words/next states of each dictionary using items().
 def __init__(self, file):
 self.adjList = 

 # The totals dictionary is used in calculating the probabilities,
 # for every word in the text/chain state it contains the total
 # number of transitions from it to another state.
 totals = 


 # Start by insering the initial state to the structures
 self.adjList[MarkovChain.START] = 
 totals[MarkovChain.START] = 0

 # prev: Contains the previously encountered word or the START state,
 # initialized to the START state.
 prev = MarkovChain.START


 for line in file:
 for word in line.split():
 # If the word ends with a terminating punctuation mark,
 # ignore the mark, and treat the word as a terminating state as
 # it does not preceed another word in the current sentence.
 # So prev is set to START, in order for the text model
 # to account for the fact that some words start sentences
 # more frequently than others (not all words are next states of START).
 endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
 if (endsTerm):
 word = word[0:-1]

 # If this is the first time the word is encountered,
 # add it to the adjacency list, and initialize its dictionary
 # and transition total.
 if (word not in self.adjList):
 self.adjList[word] = 
 totals[word] = 0

 # If this is the first time the prev->word transition
 # was detected, initialize the prev->word transition frequency to 1,
 # else increment it.
 if (word in self.adjList[prev]):
 self.adjList[prev][word] += 1 
 else:
 self.adjList[prev][word] = 1

 # There is a prev->word state transition, so increment
 # the total transition number of the prev state.
 totals[prev] += 1

 if (endsTerm):
 prev = START

 # Using total, convert the transition frequencies
 # to transition probabilities.
 for word, neighbors in self.adjList.items():
 for name in neighbors:
 neighbors[name] /= totals[word]


 # chooseNextWord: Chooses the next state/word,
 # by sampling the non uniform transition probability distribution
 # of the current word/state. 
 def chooseNextWord(self, curWord):
 # Convert the dict_keys object to a list
 # to use indexing
 nextWords = list(self.adjList[curWord].keys())

 # Sampling is done through linear search.
 for word in nextWords[0:-1]:
 prob = self.adjList[curWord][word]
 roll = random.random()
 if (roll <= prob):
 return word

 # If none of the first N-1 words were chosen,
 # only the last one was left.
 return nextWords[-1]


 # genSentence: Generates a sentence. If a positive
 # limit is not provided by the caller, the sentences grow to
 # an arbitrary number of words, until the last word of a sentence/a terminal state
 # is reached.
 def genSentence(self, limit = 0):
 sentence = ""

 curWord = self.chooseNextWord(MarkovChain.START)
 sentence += curWord + " "

 if (limit > 0):
 wordsUsed = 1
 while (wordsUsed < limit and self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "
 wordsUsed += 1
 else:
 while (self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "

 return sentence



if (__name__ == "__main__"):
 if (len(sys.argv) < 3):
 print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
 sys.exit(1)

 try:
 with open(sys.argv[1], "r") as f:
 markov = MarkovChain(f)

 except OSError as error:
 print(error.strerror)
 sys.exit(1)

 # Generate and print as many sentences as asked.
 for k in range(0, int(sys.argv[2])):
 print(markov.genSentence(20) + "n")
```

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

asked 8 hours ago

Hashew

534 bronze badges

New contributor

add a comment |

text-gen.py

import sys
import random

class MarkovChain:
 # Class constant that serves as an initial state for the Markov chain
 START = ""

 # The Markov chain is modelled as a directed graph,
 # with the START state acting as the only source,
 # and the tranisition probabilities as the graph weights.
 #
 # The graph is implemented using an adjacency list,
 # which in turn is implemented as a dictionary of dictionaries.
 #
 # self.adjList is a dictionary keyed by all words of the text
 # or START (the states). For each key/state, it contains
 # another dictionary indexed by the words of the text
 # that succeed the key in the text (the next states in the chain),
 # and for each of those words/next states the dictionary contains
 # the transition probability from the present state to them.
 #
 # This implementation was chosen because of it being easy to code,
 # and offering an easy way to iterate on both the probabilities and
 # the words/next states of each dictionary using items().
 def __init__(self, file):
 self.adjList = 

 # The totals dictionary is used in calculating the probabilities,
 # for every word in the text/chain state it contains the total
 # number of transitions from it to another state.
 totals = 


 # Start by insering the initial state to the structures
 self.adjList[MarkovChain.START] = 
 totals[MarkovChain.START] = 0

 # prev: Contains the previously encountered word or the START state,
 # initialized to the START state.
 prev = MarkovChain.START


 for line in file:
 for word in line.split():
 # If the word ends with a terminating punctuation mark,
 # ignore the mark, and treat the word as a terminating state as
 # it does not preceed another word in the current sentence.
 # So prev is set to START, in order for the text model
 # to account for the fact that some words start sentences
 # more frequently than others (not all words are next states of START).
 endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
 if (endsTerm):
 word = word[0:-1]

 # If this is the first time the word is encountered,
 # add it to the adjacency list, and initialize its dictionary
 # and transition total.
 if (word not in self.adjList):
 self.adjList[word] = 
 totals[word] = 0

 # If this is the first time the prev->word transition
 # was detected, initialize the prev->word transition frequency to 1,
 # else increment it.
 if (word in self.adjList[prev]):
 self.adjList[prev][word] += 1 
 else:
 self.adjList[prev][word] = 1

 # There is a prev->word state transition, so increment
 # the total transition number of the prev state.
 totals[prev] += 1

 if (endsTerm):
 prev = START

 # Using total, convert the transition frequencies
 # to transition probabilities.
 for word, neighbors in self.adjList.items():
 for name in neighbors:
 neighbors[name] /= totals[word]


 # chooseNextWord: Chooses the next state/word,
 # by sampling the non uniform transition probability distribution
 # of the current word/state. 
 def chooseNextWord(self, curWord):
 # Convert the dict_keys object to a list
 # to use indexing
 nextWords = list(self.adjList[curWord].keys())

 # Sampling is done through linear search.
 for word in nextWords[0:-1]:
 prob = self.adjList[curWord][word]
 roll = random.random()
 if (roll <= prob):
 return word

 # If none of the first N-1 words were chosen,
 # only the last one was left.
 return nextWords[-1]


 # genSentence: Generates a sentence. If a positive
 # limit is not provided by the caller, the sentences grow to
 # an arbitrary number of words, until the last word of a sentence/a terminal state
 # is reached.
 def genSentence(self, limit = 0):
 sentence = ""

 curWord = self.chooseNextWord(MarkovChain.START)
 sentence += curWord + " "

 if (limit > 0):
 wordsUsed = 1
 while (wordsUsed < limit and self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "
 wordsUsed += 1
 else:
 while (self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "

 return sentence



if (__name__ == "__main__"):
 if (len(sys.argv) < 3):
 print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
 sys.exit(1)

 try:
 with open(sys.argv[1], "r") as f:
 markov = MarkovChain(f)

 except OSError as error:
 print(error.strerror)
 sys.exit(1)

 # Generate and print as many sentences as asked.
 for k in range(0, int(sys.argv[2])):
 print(markov.genSentence(20) + "n")
```

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

asked 8 hours ago

Hashew

534 bronze badges

New contributor

text-gen.py

import sys
import random

class MarkovChain:
 # Class constant that serves as an initial state for the Markov chain
 START = ""

 # The Markov chain is modelled as a directed graph,
 # with the START state acting as the only source,
 # and the tranisition probabilities as the graph weights.
 #
 # The graph is implemented using an adjacency list,
 # which in turn is implemented as a dictionary of dictionaries.
 #
 # self.adjList is a dictionary keyed by all words of the text
 # or START (the states). For each key/state, it contains
 # another dictionary indexed by the words of the text
 # that succeed the key in the text (the next states in the chain),
 # and for each of those words/next states the dictionary contains
 # the transition probability from the present state to them.
 #
 # This implementation was chosen because of it being easy to code,
 # and offering an easy way to iterate on both the probabilities and
 # the words/next states of each dictionary using items().
 def __init__(self, file):
 self.adjList = 

 # The totals dictionary is used in calculating the probabilities,
 # for every word in the text/chain state it contains the total
 # number of transitions from it to another state.
 totals = 


 # Start by insering the initial state to the structures
 self.adjList[MarkovChain.START] = 
 totals[MarkovChain.START] = 0

 # prev: Contains the previously encountered word or the START state,
 # initialized to the START state.
 prev = MarkovChain.START


 for line in file:
 for word in line.split():
 # If the word ends with a terminating punctuation mark,
 # ignore the mark, and treat the word as a terminating state as
 # it does not preceed another word in the current sentence.
 # So prev is set to START, in order for the text model
 # to account for the fact that some words start sentences
 # more frequently than others (not all words are next states of START).
 endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
 if (endsTerm):
 word = word[0:-1]

 # If this is the first time the word is encountered,
 # add it to the adjacency list, and initialize its dictionary
 # and transition total.
 if (word not in self.adjList):
 self.adjList[word] = 
 totals[word] = 0

 # If this is the first time the prev->word transition
 # was detected, initialize the prev->word transition frequency to 1,
 # else increment it.
 if (word in self.adjList[prev]):
 self.adjList[prev][word] += 1 
 else:
 self.adjList[prev][word] = 1

 # There is a prev->word state transition, so increment
 # the total transition number of the prev state.
 totals[prev] += 1

 if (endsTerm):
 prev = START

 # Using total, convert the transition frequencies
 # to transition probabilities.
 for word, neighbors in self.adjList.items():
 for name in neighbors:
 neighbors[name] /= totals[word]


 # chooseNextWord: Chooses the next state/word,
 # by sampling the non uniform transition probability distribution
 # of the current word/state. 
 def chooseNextWord(self, curWord):
 # Convert the dict_keys object to a list
 # to use indexing
 nextWords = list(self.adjList[curWord].keys())

 # Sampling is done through linear search.
 for word in nextWords[0:-1]:
 prob = self.adjList[curWord][word]
 roll = random.random()
 if (roll <= prob):
 return word

 # If none of the first N-1 words were chosen,
 # only the last one was left.
 return nextWords[-1]


 # genSentence: Generates a sentence. If a positive
 # limit is not provided by the caller, the sentences grow to
 # an arbitrary number of words, until the last word of a sentence/a terminal state
 # is reached.
 def genSentence(self, limit = 0):
 sentence = ""

 curWord = self.chooseNextWord(MarkovChain.START)
 sentence += curWord + " "

 if (limit > 0):
 wordsUsed = 1
 while (wordsUsed < limit and self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "
 wordsUsed += 1
 else:
 while (self.adjList[curWord]):
 curWord = self.chooseNextWord(curWord)
 sentence += curWord + " "

 return sentence



if (__name__ == "__main__"):
 if (len(sys.argv) < 3):
 print("Not enough arguements, run with python3 text-gen.py <input-filename> <sentence-num>")
 sys.exit(1)

 try:
 with open(sys.argv[1], "r") as f:
 markov = MarkovChain(f)

 except OSError as error:
 print(error.strerror)
 sys.exit(1)

 # Generate and print as many sentences as asked.
 for k in range(0, int(sys.argv[2])):
 print(markov.genSentence(20) + "n")
```

python python-3.x statistics markov-chain

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

asked 8 hours ago

Hashew

534 bronze badges

New contributor

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

asked 8 hours ago

Hashew

534 bronze badges

New contributor

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

edited 7 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

asked 8 hours ago

Hashew

534 bronze badges

New contributor

asked 8 hours ago

Hashew

534 bronze badges

asked 8 hours ago

Hashew

534 bronze badges

New contributor

add a comment |

2 Answers
2

active

oldest

votes

chooseNextWord distorts the probabilities.

For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).

A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.

The code may benefit from using defaultdict rather than a plain dictionaries. Lesser ifs is better.

Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago

2

$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago

$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen) which is equal to (probability that the word was chosen) / (probability that the previous weren't). So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago

add a comment |

endTerms

When a word ends with an endTerm, think you need to include an START or END symbol in adjList. Most words can appear anywhere in a sentence. So it is unlikely that you can end a sentence only when words don't have any follow-on words. Include the START/END symbol in adjList and the Markov process can also end a sentence.

the standard library

collections.defaultdict provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.

collections.Counter provides a dictionary that counts things.

random.choices selects items from a population according to specified weights.

import collections
import random

class MarkovChain:
 START = ""

 def __init__(self, file):
 adjList = collections.defaultdict(collections.Counter)

 # this inserts START into the defaultdict
 adjList[MarkovChain.START]

 prev = MarkovChain.START

 for line in file:
 for word in line.split():
 endsTerm = word[-1] in ('.', '?', '!')

 if (endsTerm):
 word = word[:-1]

 adjList[prev].update([word])

 if endsTerm:
 # mark the end of a sentence
 adjList[word].update([MarkovChain.START])
 prev = MarkovChain.START
 else:
 prev = word

 # convert defaultdict to a regular dict
 # the values are a tuple: ([follow words], [counts])
 # for use in random.choices() in chooseNextWord()
 self.adjList = k:(list(v.keys()), list(v.values()))
 for k,v in adjList.items()

 #print(self.adjList)


 def chooseNextWord(self, word):
 # random.choices returns a list, hence the [0]
 return random.choices(*self.adjList[word])[0]


 def genSentence(self, limit = 0):
 sentence = []
 curWord = MarkovChain.START

 while True:
 curWord = self.chooseNextWord(curWord)
 sentence.append(curWord)

 if 0 < limit < len(sentence) or curWord == MarkovChain.START:
 break

 return ' '.join(sentence).strip()

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

Hashew is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f225992%2fmarkov-chain-sentence-generator-in-python%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

chooseNextWord distorts the probabilities.

For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).

A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.

The code may benefit from using defaultdict rather than a plain dictionaries. Lesser ifs is better.

Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago

2

$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago

$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen) which is equal to (probability that the word was chosen) / (probability that the previous weren't). So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago

add a comment |

chooseNextWord distorts the probabilities.

For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).

A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.

The code may benefit from using defaultdict rather than a plain dictionaries. Lesser ifs is better.

Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago

2

$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago

$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen) which is equal to (probability that the word was chosen) / (probability that the previous weren't). So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago

add a comment |

chooseNextWord distorts the probabilities.

For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).

A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.

The code may benefit from using defaultdict rather than a plain dictionaries. Lesser ifs is better.

Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

chooseNextWord distorts the probabilities.

For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with the probability ⅓. The second, however is selected with probability ½ (⅓ that word in the second round is selected, divided by ⅔ that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (⅓ that word in the third round is selected, divided by ⅓, the probability that both the first and the second word aren't chosen).

A standard approach is to compute an accumulated sums of probabilities (in the constructor), then to choose a word roll once, and search for a value just above the rolled one.

The code may benefit from using defaultdict rather than a plain dictionaries. Lesser ifs is better.

Nitpicking. You may want to account for possible typos, such as a space between a word and a terminating punctuation.

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

edited 6 hours ago

Toby Speight

31.1k7 gold badges45 silver badges135 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

answered 8 hours ago

vnp

42.7k2 gold badges35 silver badges109 bronze badges

$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago

2

$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago

$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen) which is equal to (probability that the word was chosen) / (probability that the previous weren't). So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago

add a comment |

$begingroup$
I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.
$endgroup$
– Hashew
7 hours ago

2

$begingroup$
@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.
$endgroup$
– vnp
7 hours ago

$begingroup$
I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen) which is equal to (probability that the word was chosen) / (probability that the previous weren't). So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?
$endgroup$
– Hashew
7 hours ago

I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you.

– Hashew
7 hours ago

@Hashew I rolled back your last edit. It is against the CR chapter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question.

– vnp
7 hours ago

I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is

P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)

which is equal to (probability that the word was chosen) / (probability that the previous weren't). So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think?

– Hashew
7 hours ago

I think this approach based on conditional probability is correct: The probability that a word is chosen on the condition that all of the previous ones weren't is

P = (probability that the word was chosen and the previous weren't) / (probability that the previous ones weren't chosen)

add a comment |

endTerms

the standard library

collections.defaultdict provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.

collections.Counter provides a dictionary that counts things.

random.choices selects items from a population according to specified weights.

import collections
import random

class MarkovChain:
 START = ""

 def __init__(self, file):
 adjList = collections.defaultdict(collections.Counter)

 # this inserts START into the defaultdict
 adjList[MarkovChain.START]

 prev = MarkovChain.START

 for line in file:
 for word in line.split():
 endsTerm = word[-1] in ('.', '?', '!')

 if (endsTerm):
 word = word[:-1]

 adjList[prev].update([word])

 if endsTerm:
 # mark the end of a sentence
 adjList[word].update([MarkovChain.START])
 prev = MarkovChain.START
 else:
 prev = word

 # convert defaultdict to a regular dict
 # the values are a tuple: ([follow words], [counts])
 # for use in random.choices() in chooseNextWord()
 self.adjList = k:(list(v.keys()), list(v.values()))
 for k,v in adjList.items()

 #print(self.adjList)


 def chooseNextWord(self, word):
 # random.choices returns a list, hence the [0]
 return random.choices(*self.adjList[word])[0]


 def genSentence(self, limit = 0):
 sentence = []
 curWord = MarkovChain.START

 while True:
 curWord = self.chooseNextWord(curWord)
 sentence.append(curWord)

 if 0 < limit < len(sentence) or curWord == MarkovChain.START:
 break

 return ' '.join(sentence).strip()

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

add a comment |

endTerms

the standard library

collections.defaultdict provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.

collections.Counter provides a dictionary that counts things.

random.choices selects items from a population according to specified weights.

import collections
import random

class MarkovChain:
 START = ""

 def __init__(self, file):
 adjList = collections.defaultdict(collections.Counter)

 # this inserts START into the defaultdict
 adjList[MarkovChain.START]

 prev = MarkovChain.START

 for line in file:
 for word in line.split():
 endsTerm = word[-1] in ('.', '?', '!')

 if (endsTerm):
 word = word[:-1]

 adjList[prev].update([word])

 if endsTerm:
 # mark the end of a sentence
 adjList[word].update([MarkovChain.START])
 prev = MarkovChain.START
 else:
 prev = word

 # convert defaultdict to a regular dict
 # the values are a tuple: ([follow words], [counts])
 # for use in random.choices() in chooseNextWord()
 self.adjList = k:(list(v.keys()), list(v.values()))
 for k,v in adjList.items()

 #print(self.adjList)


 def chooseNextWord(self, word):
 # random.choices returns a list, hence the [0]
 return random.choices(*self.adjList[word])[0]


 def genSentence(self, limit = 0):
 sentence = []
 curWord = MarkovChain.START

 while True:
 curWord = self.chooseNextWord(curWord)
 sentence.append(curWord)

 if 0 < limit < len(sentence) or curWord == MarkovChain.START:
 break

 return ' '.join(sentence).strip()

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

add a comment |

endTerms

the standard library

collections.defaultdict provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.

collections.Counter provides a dictionary that counts things.

random.choices selects items from a population according to specified weights.

import collections
import random

class MarkovChain:
 START = ""

 def __init__(self, file):
 adjList = collections.defaultdict(collections.Counter)

 # this inserts START into the defaultdict
 adjList[MarkovChain.START]

 prev = MarkovChain.START

 for line in file:
 for word in line.split():
 endsTerm = word[-1] in ('.', '?', '!')

 if (endsTerm):
 word = word[:-1]

 adjList[prev].update([word])

 if endsTerm:
 # mark the end of a sentence
 adjList[word].update([MarkovChain.START])
 prev = MarkovChain.START
 else:
 prev = word

 # convert defaultdict to a regular dict
 # the values are a tuple: ([follow words], [counts])
 # for use in random.choices() in chooseNextWord()
 self.adjList = k:(list(v.keys()), list(v.values()))
 for k,v in adjList.items()

 #print(self.adjList)


 def chooseNextWord(self, word):
 # random.choices returns a list, hence the [0]
 return random.choices(*self.adjList[word])[0]


 def genSentence(self, limit = 0):
 sentence = []
 curWord = MarkovChain.START

 while True:
 curWord = self.chooseNextWord(curWord)
 sentence.append(curWord)

 if 0 < limit < len(sentence) or curWord == MarkovChain.START:
 break

 return ' '.join(sentence).strip()

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

endTerms

the standard library

collections.defaultdict provides a dictionary that when you attempt to access a new key automatically initializes the new key to a default value.

collections.Counter provides a dictionary that counts things.

random.choices selects items from a population according to specified weights.

import collections
import random

class MarkovChain:
 START = ""

 def __init__(self, file):
 adjList = collections.defaultdict(collections.Counter)

 # this inserts START into the defaultdict
 adjList[MarkovChain.START]

 prev = MarkovChain.START

 for line in file:
 for word in line.split():
 endsTerm = word[-1] in ('.', '?', '!')

 if (endsTerm):
 word = word[:-1]

 adjList[prev].update([word])

 if endsTerm:
 # mark the end of a sentence
 adjList[word].update([MarkovChain.START])
 prev = MarkovChain.START
 else:
 prev = word

 # convert defaultdict to a regular dict
 # the values are a tuple: ([follow words], [counts])
 # for use in random.choices() in chooseNextWord()
 self.adjList = k:(list(v.keys()), list(v.values()))
 for k,v in adjList.items()

 #print(self.adjList)


 def chooseNextWord(self, word):
 # random.choices returns a list, hence the [0]
 return random.choices(*self.adjList[word])[0]


 def genSentence(self, limit = 0):
 sentence = []
 curWord = MarkovChain.START

 while True:
 curWord = self.chooseNextWord(curWord)
 sentence.append(curWord)

 if 0 < limit < len(sentence) or curWord == MarkovChain.START:
 break

 return ' '.join(sentence).strip()

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

answered 1 hour ago

RootTwo

1,0593 silver badges6 bronze badges

add a comment |

Hashew is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Hashew is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Mfcttrf

text-gen.py

text-gen.py

text-gen.py

text-gen.py

2 Answers
2

endTerms

the standard library

Your Answer

Post as a guest

2 Answers
2

2 Answers
2

endTerms

the standard library

endTerms

the standard library

endTerms

the standard library

endTerms

the standard library

Post as a guest

Popular posts from this blog

text-gen.py

text-gen.py

text-gen.py

text-gen.py

2 Answers 2

endTerms

the standard library

Your Answer

Sign up or log in

Post as a guest

Post as a guest

2 Answers 2

2 Answers 2

endTerms

the standard library

endTerms

the standard library

endTerms

the standard library

endTerms

the standard library

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

2 Answers
2

2 Answers
2

2 Answers
2