My Profile Photo

All The Language


Welcome! I'm Pepijn. I study natural language, programming languages and logical languages. Basically, all the language...


Breadboxes, Plenty Questions and Distributional Semantics

Quite a while ago, UnicornPower introduced me to a game called Breadbox. It’s an experimental cousin of 20 Questions, also known as Plenty Questions, which is played by two players—or more, really—who we’ll name Allie and Blake:

  1. Allie thinks of something.
  2. As their first question, Blake asks “Is it a breadbox?”
  3. Allie—who, seeing the mandatory first question, obviously wouldn’t choose a breadbox—answers “No, it’s not!”

From there on out, all Blake’s questions have to be of the form…

Is it more like a breadbox, or more like…?

…where breadbox is replaced by whatever the current guess is, and the dots are filled in with whatever Blake wants. For instance, a quick game might look like this:

Allie
I’m thinking of something…
Blake
Is it a breadbox?
Allie
No, it’s not.
Blake
Is it more like a breadbox or more like a dog?
Allie
It’s more like a dog…
Blake
Is it more like a dog or more like a cat?
Allie
It’s more like a cat…
Blake
Is it more like a cat or more like a unicorn?
Allie
It’s more like a cat…
Blake
Is it more like a cat or more like a garden?
Allie
It’s more like a garden…
Blake
Is it more like a garden or more like a house?
Allie
It’s more like a house…
Blake
Is it more like a house or more like a friend?
Allie
It’s more like a friend…
Blake
Is it more like a friend or more like a lover?
Allie
It’s more like a friend…
Blake
Is it more like a friend or more like a relative?
Allie
It’s more like a friend…
Blake
Is it more like a friend or more like a neighbour?
Allie
That’s exactly what I was thinking of!

Since this game tends to bring out the… best in people, Blake might just have to give up. After this, Allie may be called on to explain what the word she chose even means:

Blake
I give up… what was it?
Allie
Reverberation!
Allie
It’s the repetition of a sound resulting from reflection of the sound waves!

The game does a fantastic job of revealing how hierarchical people tend to think. For instance, I would say that a dolphin is more like an orca than it is like a mammal, even though a dolphin isn’t an orca. But while playing the game, I often find myself slipping into hierarchical thinking: “Oh, it’s not like an animal… so we can exclude all animals.”

It is exactly this—the fact that the game forces you to consider a distance metric for your internal ontology, insofar as one exists—which makes it so fascinating to play. I heartily recommend you try it!

If you’re upset because Breadbox is hard, or because you think that the choices it makes are weird or wrong… try playing it with a person! ^^


In the summer of 2014, I took a course on distributional semantics (by Marco Baroni and Georgiana Dinu) and the first thing I thought to do was to use their dataset to implement an AI for Breadbox:

Welcome to Breadbox 1.0! Type 'help' for instructions, or simply start guessing!
 
>

So how does it work? The core hypothesis of distributional semantics is

You can know the meaning of a word by the company it keeps

Have a look at the two sentences below:1

  1. He filled the wampimuk, passed it around and we all drunk some.
  2. We found a little, hairy wampimuk sleeping behind the tree.

While you’ve probably never read the word wampimuk before, it’s likely that either sentence will give you a pretty clear idea of what a wampurnik is in that context.

So if you want to know what a word (e.g. “bear”) means, you take a huge corpus, and you look for all the occurances of that word…2

       over the mountains. A bear also features prominentl
    rnejakt" (An Unfortunate Bear Hunt) by Theodor Kittels
       to his hagiography, a bear killed Saint Corbinian's
         however, he let the bear go. The saddled "bear
       bear go. The saddled "bear of St. Corbinian" the
    tails on this topic, see Bear in heraldry. The British
         Cat and the Russian Bear (see The Great Game)
     Great Game) The Russian bear is a common national
    Soviet Union). The brown bear is also Finland's nation
    animals and had the same bear carry him from his hermi
         thus Christianised, bear clasping each gable
     evidence of prehistoric bear worship. Anthropologists
     peoples, considered the bear as the spirit of one's
    fathers. This is why the bear (karhu) was a greatly
    ikämmen and kontio). The bear is the national animal
      tries to kill a mother bear and her cubs—and is
         society. "The Brown Bear of Norway" is a Scottish
     magically turned into a bear, and who managed to get
     television. Evidence of bear worship has been found
      mythology identify the bear as their ancestor and
    shopric of Freising, the bear is the dangerous totem

…and then you count, for every other word, how often it occurs together with your word. The above text, for instance, gives us a number of obvious co-occurances for bear: mountain, kill, hunt, brown, animal and cub. The idea is that, given a large enough corpus, these co-occurances will drown out the noisier ones.

By doing this for every word, you build up a co-occurance matrix, which lists how often every word occurs with every other word. For instance,1

      leash   walk   run   owner   pet   bark  
  dog   3   5   2   5   3   2  
  cat   0   3   3   2   3   0  
  lion   0   3   2   0   1   0  
  light   0   0   0   0   0   0  
  bark   1   0   0   2   1   0  
  car   0   0   1   3   0   0  

Now the meaning for a word is determined by its co-occurances with some select group of words. For instance, if we look at ‘dog’ we see that it strongly co-occurs with things such as ‘leash’, ‘walk’, ‘owner’, ‘pet’ and ‘bark’. For ‘cat’, we lose ‘leash’ and ‘bark’—since cats don’t bark, and are rarely leashed. And for ‘lion’, we also lose ‘owner’ and ‘pet’—while a lion could conceivably be a pet, it’d be a lot rarer than having a cat or dog… and we could never really feel like we owned the lion.

You can think of the rows in this matrix as points, in a six dimensional space—one dimension for every column. And because it’s a space, you can measure the distance between words. And this is where we get back to Breadbox: in order to play this bizarre game, we needed a distance metric for meanings, to be able to compare and order any two objects. And this is exactly what a co-occurance matrix gives us!

Obviously, there’s a lot more to distributional semantics than just this. For instance, the matrices that you derive this way tend to be huge—one axis per word, one point per word—so there’s a whole bunch of work which goes into selecting exactly which set of words should be the axes. Then there’s the difficultly of composing word meanings. You may have noticed that my breadbox implementation doesn’t work all too well for compound nouns: that’s because I’m not taking the effort to compose meaning vectors.3

If you wish to read more about distributional semantics, there’s a pretty good overview of introductions and surveys here. Additionally, there’s a whole branch of work which uses neural networks to learn the word meanings: for instance, have a look at Word2Vec.


  1. Taken from https://www.cs.utexas.edu/~mooney/cs388/slides/dist-sem-intro-NLP-class-UT.pdf. 2

  2. Taken from http://www.webcorp.org.uk.

  3. A bit of a shame, really, since the course I took was about composing meaning vectors. Also, full disclosure, the vectors that I used were created using Word2Vec, using neural networks. Such vectors generally outperform counting vectors in tasks of relatedness (see Don’t Count, Predict! by Baroni et al.).

comments powered by Disqus