0 Replies - 494 Views - Last Post: 21 February 2013 - 12:58 PM

#1 donkatsu

MapReduce/MRjob

Posted 21 February 2013 - 12:58 PM

How would I go about getting something like this:

"act according" 1
"act accordingly" 1
"act against" 1
"act as" 1
"act but" 1
"act if" 1
"act in" 1
"act of" 7
"act on" 2
"act with" 1
"act without" 1
"acted as" 1
"acted for" 1
"acted he" 1
"acted powerfully" 1
"acted wisely" 1
"acting thus" 2
"action as" 1
"action beyond" 1
"action caring" 1
"action gideon" 1
"action he" 1
"action must" 1
"action of" 6
"action on" 1
"action one" 1
"action unavailing" 1
"action was" 1
"action which" 3

to this:
"act"   [["of", 7, 0.3888888888888889], ["on", 2, 0.1111111111111111], ["according", 1, 0.05555555555555555], ["accordingly", 1, 0.05555555555555555], ["against", 1, 0.05555555555555555], ["as", 1, 0.05555555555555555], ["but", 1, 0.05555555555555555], ["if", 1, 0.05555555555555555], ["in", 1, 0.05555555555555555], ["with", 1, 0.05555555555555555], ["without", 1, 0.05555555555555555]]
"acted" [["as", 1, 0.2], ["for", 1, 0.2], ["he", 1, 0.2], ["powerfully", 1, 0.2], ["wisely", 1, 0.2]]
"acting"        [["thus", 2, 1.0]]
"action"        [["of", 6, 0.3157894736842105], ["which", 3, 0.15789473684210525], ["as", 1, 0.05263157894736842], ["beyond", 1, 0.05263157894736842], ["caring", 1, 0.05263157894736842], ["gideon", 1, 0.05263157894736842], ["he", 1, 0.05263157894736842], ["must", 1, 0.05263157894736842], ["on", 1, 0.05263157894736842], ["one", 1, 0.05263157894736842], ["unavailing", 1, 0.05263157894736842], ["was", 1, 0.05263157894736842]]


using MapReduce/mrjob? That is, for each first word, how do I compute the number of occurrences and the probability of each second word across all bigrams?

The first column lists every word that appears as the first word of a bigram. The second column is a JSON string representing a list of lists. Each inner list has three elements: the second word, the number of times that second word follows the first-column word in a bigram, and the probability of that second word given the first word. For example, ["of", 7, 0.3888888888888889] means that 'of' appeared 7 times after 'act', and because 'act' appeared 18 times altogether as the first word of a bigram, the probability is 7/18 = 0.389. The list should also be sorted in decreasing order of probability.
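To make the transformation concrete, here is a minimal sketch of the map and reduce steps in plain Python, without mrjob itself. The names mapper, reducer and run_local are hypothetical, and the sketch assumes each input line is a JSON-quoted bigram followed by whitespace and an integer count, as in the first listing. Ties in probability are broken alphabetically, which is an assumption based on the order shown in the desired output.

```python
import json
from collections import defaultdict


def mapper(line):
    """Turn one line such as '"act of" 7' into (first_word, (second_word, count)).

    Assumes the bigram is a JSON string and the count is the last field.
    """
    bigram_json, count_text = line.rsplit(None, 1)
    first, second = json.loads(bigram_json).split()
    yield first, (second, int(count_text))


def reducer(first, pairs):
    """Combine all (second_word, count) pairs that share one first word."""
    counts = defaultdict(int)
    for second, count in pairs:
        counts[second] += count
    total = sum(counts.values())
    # Sorting by descending count equals sorting by descending probability,
    # because every count is divided by the same total.
    # Alphabetical tie-breaking is an assumption, not stated in the post.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    yield first, [[second, count, count / total] for second, count in ranked]


def run_local(lines):
    """Hypothetical stand-in for the shuffle step: group mapper output by key."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        for key, value in mapper(line):
            groups[key].append(value)
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            yield out_key, out_value


if __name__ == "__main__":
    sample = ['"acted as" 1', '"acted for" 1', '"acting thus" 2']
    for word, rows in run_local(sample):
        print(json.dumps(word) + "\t" + json.dumps(rows))
```

In mrjob, the same two functions would most likely become the mapper(self, _, line) and reducer(self, first, pairs) methods of an MRJob subclass, with the framework doing the grouping that run_local fakes here. As far as I know, mrjob's default output is already a JSON key, a tab, and a JSON value, which matches the target format.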

I'm quite new to Python and have only just started learning this, so I apologize if this is supposed to be extremely easy.
Any help would be greatly appreciated!
