Markov-chain sentence generator in Python


I wrote a Markov-chain based sentence generator as my first non-trivial Python program. I mainly used C before, so I have probably ignored a lot of Python conventions and features; any advice would be appreciated.



text-gen.py



```
import sys
import random

class MarkovChain:
    # Class constant that serves as an initial state for the Markov chain
    START = ""

    # The Markov chain is modelled as a directed graph,
    # with the START state acting as the only source,
    # and the transition probabilities as the graph weights.
    #
    # The graph is implemented using an adjacency list,
    # which in turn is implemented as a dictionary of dictionaries.
    #
    # self.adjList is a dictionary keyed by all words of the text
    # or START (the states). For each key/state, it contains
    # another dictionary indexed by the words of the text
    # that succeed the key in the text (the next states in the chain),
    # and for each of those words/next states the dictionary contains
    # the transition probability from the present state to them.
    #
    # This implementation was chosen because it is easy to code,
    # and offers an easy way to iterate over both the probabilities and
    # the words/next states of each dictionary using items().
    def __init__(self, file):
        self.adjList = {}

        # The totals dictionary is used in calculating the probabilities;
        # for every word in the text/chain state it contains the total
        # number of transitions from it to another state.
        totals = {}

        # Start by inserting the initial state into the structures
        self.adjList[MarkovChain.START] = {}
        totals[MarkovChain.START] = 0

        # prev: Contains the previously encountered word or the START state,
        # initialized to the START state.
        prev = MarkovChain.START

        for line in file:
            for word in line.split():
                # If the word ends with a terminating punctuation mark,
                # ignore the mark, and treat the word as a terminating state, as
                # it does not precede another word in the current sentence.
                # So prev is set to START, in order for the text model
                # to account for the fact that some words start sentences
                # more frequently than others (not all words are next states of START).
                endsTerm = word[-1] == "." or word[-1] == "?" or word[-1] == "!"
                if endsTerm:
                    word = word[0:-1]

                # If this is the first time the word is encountered,
                # add it to the adjacency list, and initialize its dictionary
                # and transition total.
                if word not in self.adjList:
                    self.adjList[word] = {}
                    totals[word] = 0

                # If this is the first time the prev->word transition
                # was detected, initialize the prev->word transition frequency to 1,
                # else increment it.
                if word in self.adjList[prev]:
                    self.adjList[prev][word] += 1
                else:
                    self.adjList[prev][word] = 1

                # There is a prev->word state transition, so increment
                # the total transition number of the prev state.
                totals[prev] += 1

                if endsTerm:
                    prev = MarkovChain.START
                else:
                    prev = word

        # Using totals, convert the transition frequencies
        # to transition probabilities.
        for word, neighbors in self.adjList.items():
            for name in neighbors:
                neighbors[name] /= totals[word]

    # chooseNextWord: Chooses the next state/word,
    # by sampling the non-uniform transition probability distribution
    # of the current word/state.
    def chooseNextWord(self, curWord):
        # Convert the dict_keys object to a list
        # to use indexing
        nextWords = list(self.adjList[curWord].keys())

        # Sampling is done through linear search.
        for word in nextWords[0:-1]:
            prob = self.adjList[curWord][word]
            roll = random.random()
            if roll <= prob:
                return word

        # If none of the first N-1 words were chosen,
        # only the last one is left.
        return nextWords[-1]

    # genSentence: Generates a sentence. If a positive
    # limit is not provided by the caller, the sentence grows to
    # an arbitrary number of words, until the last word of a sentence/a
    # terminal state is reached.
    def genSentence(self, limit=0):
        sentence = ""

        curWord = self.chooseNextWord(MarkovChain.START)
        sentence += curWord + " "

        if limit > 0:
            wordsUsed = 1
            while wordsUsed < limit and self.adjList[curWord]:
                curWord = self.chooseNextWord(curWord)
                sentence += curWord + " "
                wordsUsed += 1
        else:
            while self.adjList[curWord]:
                curWord = self.chooseNextWord(curWord)
                sentence += curWord + " "

        return sentence


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Not enough arguments, run with: python3 text-gen.py <input-filename> <sentence-num>")
        sys.exit(1)

    try:
        with open(sys.argv[1], "r") as f:
            markov = MarkovChain(f)
    except OSError as error:
        print(error.strerror)
        sys.exit(1)

    # Generate and print as many sentences as asked.
    for k in range(0, int(sys.argv[2])):
        print(markov.genSentence(20) + "\n")
```









python python-3.x statistics markov-chain

asked 8 hours ago by Hashew, edited 7 hours ago by vnp
2 Answers






Answer 1, by vnp (score 5, answered 8 hours ago):


• chooseNextWord distorts the probabilities.

  For example, consider a list of 3 words with the inherent probabilities [⅓, ⅓, ⅓]. The first word is selected with probability ⅓. The second, however, is selected with probability ½ (the ⅓ chance that the word in the second round is selected, divided by the ⅔ chance that the word in the first round isn't selected). Following the same reasoning, selecting the third word is certain (the ⅓ chance that the word in the third round is selected, divided by ⅓, the probability that neither the first nor the second word was chosen).

  A standard approach is to compute the accumulated sums of the probabilities (in the constructor), then roll once, and search for the first accumulated value just above the rolled one (see the sketch after this list).

• The code may benefit from using defaultdict rather than plain dictionaries. Fewer ifs is better.

• Nitpicking: you may want to account for possible typos, such as a space between a word and its terminating punctuation.
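
A minimal sketch of that cumulative-sum sampling, assuming the per-word probability dictionaries the constructor builds; build_sampler and sample are made-up helper names, not part of the reviewed code:

```
import bisect
import itertools
import random

def build_sampler(transitions):
    # transitions: dict mapping next word -> transition probability
    words = list(transitions)
    # Running totals of the probabilities, e.g. [0.25, 0.75, 1.0]
    cumulative = list(itertools.accumulate(transitions.values()))
    return words, cumulative

def sample(words, cumulative):
    # One roll in [0, 1); pick the first word whose accumulated
    # probability lies strictly above the roll.
    roll = random.random()
    # min() guards against the last total falling just short of 1.0
    # due to floating-point round-off.
    return words[min(bisect.bisect_right(cumulative, roll), len(words) - 1)]
```

Built once per state, this turns each choice into one roll plus a binary search instead of up to N rolls.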






Comments:

• I completely forgot about conditional probability, I will implement sampling using the CDF then. I will look into the rest, thank you. – Hashew, 7 hours ago

• @Hashew I rolled back your last edit. It is against the CR charter to edit the code after a review was posted, because it invalidates the review. You are welcome to post a separate follow-up question. – vnp, 7 hours ago

• I think this approach based on conditional probability is correct: the probability that a word is chosen on the condition that all of the previous ones weren't is P = (probability that the word was chosen and the previous ones weren't) / (probability that the previous ones weren't chosen), which is equal to (probability that the word was chosen) / (probability that the previous ones weren't). So if I initialize a variable (say notPrevProb) to 1 and subtract the probability of each word once I am done with it, I should be able to produce the correct probabilities. What do you think? – Hashew, 7 hours ago
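
For reference, a minimal sketch of the renormalization that last comment describes, with a hypothetical choose_next_word standing in for the method under review; each remaining word is accepted with its probability divided by the probability mass not yet ruled out:

```
import random

def choose_next_word(transitions):
    # transitions: dict mapping next word -> transition probability (sums to 1)
    words = list(transitions)
    not_prev_prob = 1.0  # probability that none of the earlier words was chosen
    for word in words[:-1]:
        # Conditional probability of `word`, given the earlier rejections
        if random.random() <= transitions[word] / not_prev_prob:
            return word
        not_prev_prob -= transitions[word]
    # Only the last word is left
    return words[-1]
```

The conditional factors telescope, so each word is returned with exactly its original probability; e.g. for [⅓, ⅓, ⅓] the second word is accepted with probability ½ only in the ⅔ of runs where the first was rejected.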



















Answer 2, by RootTwo (score 0, answered 1 hour ago):

endTerms

When a word ends with an endTerm, I think you need to include a START or END symbol in adjList. Most words can appear anywhere in a sentence, so it is unlikely that you can end a sentence only when a word doesn't have any follow-on words. Include the START/END symbol in adjList and the Markov process can also end a sentence.

the standard library

collections.defaultdict provides a dictionary that, when you attempt to access a new key, automatically initializes it to a default value.

collections.Counter provides a dictionary that counts things.

random.choices selects items from a population according to specified weights.



```
import collections
import random

class MarkovChain:
    START = ""

    def __init__(self, file):
        adjList = collections.defaultdict(collections.Counter)

        # this inserts START into the defaultdict
        adjList[MarkovChain.START]

        prev = MarkovChain.START

        for line in file:
            for word in line.split():
                endsTerm = word[-1] in ('.', '?', '!')

                if endsTerm:
                    word = word[:-1]

                adjList[prev].update([word])

                if endsTerm:
                    # mark the end of a sentence
                    adjList[word].update([MarkovChain.START])
                    prev = MarkovChain.START
                else:
                    prev = word

        # convert the defaultdict to a regular dict;
        # the values are a tuple: ([follow words], [counts])
        # for use in random.choices() in chooseNextWord()
        self.adjList = {k: (list(v.keys()), list(v.values()))
                        for k, v in adjList.items()}

        #print(self.adjList)

    def chooseNextWord(self, word):
        # random.choices returns a list, hence the [0]
        return random.choices(*self.adjList[word])[0]

    def genSentence(self, limit=0):
        sentence = []
        curWord = MarkovChain.START

        while True:
            curWord = self.chooseNextWord(curWord)
            sentence.append(curWord)

            if 0 < limit < len(sentence) or curWord == MarkovChain.START:
                break

        return ' '.join(sentence).strip()
```
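
Hypothetical usage of the revised class above, with an in-memory corpus standing in for the input file (any iterable of lines works):

```
import io

corpus = io.StringIO("the cat sat. the cat ran. a dog sat.")
chain = MarkovChain(corpus)
print(chain.genSentence(limit=10))
```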




