H. Chase Stevens Logo

Most Common Tags

  1. programming
  2. python
  3. code
  4. philosophy
  5. evolution
  6. game design
  7. probability
  8. video games
  9. genetic algorithms
  10. government



Posts with tag "information theory":

How complex is English, and why?

posted on: Saturday, March 29, 2014 (2:38 pm) by Chase Stevens

Last night, I was reading a thread on reddit titled "When did the use of gender start showing up in language, and what purpose did it serve?". One of the top posts, by a user named ggchappell, said that the original poster seemed to have it the wrong way around: languages "start complicated and sometimes get simpler", especially when those languages are used for trade. In response to this, user mojitz asked,

"The simplest languages are those that ended up being used for trade between multiple groups of people."
Does English conform to this generalization by your estimate?
I thought that was a really exciting question, and was prompted to write the response below.

Depending on how you define "simple", English is a relatively simple language, but probably not because of trade.

The question of how to define language complexity (or, in fact, if languages differ in complexity in the first place) has been a hotly contested, very political issue in the world of linguistics. Because of this, it’s really only been in the last few years that we’ve made substantive advances toward finding quantitative metrics of language complexity. Some really exciting work was done a few years ago by Max Bane, who applied Kolmogorov complexity – a concept from information theory that measures how densely you can compress some data without losing any information – to the inflectional and affixal morphology of various languages [1]. This is to say, instead of measuring how complex a language’s sound system is, or how difficult that language is to learn, Bane was looking at how complex the words of a language are in terms of their internal structure: the rules dictating which prefixes and suffixes you can (or must) add to a "base" word form, e.g. "type" becoming "typed" to indicate the past tense.

If we use Bane’s metric for complexity, we end up with languages like Hungarian and Latin (which, as others have pointed out, has a very rich inflectional system) being ranked as most complex, while morphologically "isolating" languages like Vietnamese and Chinese – languages that have on average fewer units of meaning per word (contrast "establish" with "anti-dis-establish-ment-arian-ism") – are ranked as the least complex.

Slightly more complex than isolating languages are a class of languages called "pidgins". Pidgins are communicative systems formed when two or more groups of people without a common language need to communicate. Getting back to what /u/ggchappel said, pidgins are very frequently the result of different language speakers coming together to engage in trade. Moreover, pidgins are very widely held as being particularly simple languages - in pidgins, we very often observe things like the reduplication of words or morphemes to augment their intensity (e.g., in Nigerian Pidgin, "sikisiki", meaning "someone who is always sickly", and "E big well well" to mean "It is very big" [2]). When children grow up learning a pidgin, it transitions into what linguists call a "creole", and usually as a result of this undergoes an increase in regularization and linguistic complexity. However, children don’t normally grow up learning a pidgin as a result of trade – this is much more likely to be indicative of one of two situations: either multiple groups of people without a common language have been brought together as a result of slavery, or one group of people has invaded another. In the latter scenario, we see the establishment of a "prestige" language – a language of the elite and the learned, the language of the conquerors – and a "vulgar" language, the language of the conquered masses.

Now, this is where things start to get really neat. Let’s take a quick look at the history of England (and English) from the 9th century through to the 13th century. In 9th century England, Old English was being spoken, a Germanic language strongly related to other continental European languages. In northern England, however, English was hardly being spoken at all: the Old Norse-speaking Danish had invaded, and established Danelaw. At this time Old Norse and Old English would have been mutually intelligible, much like modern-day Swedish and Norwegian – a speaker of Old English would have been able to understand a speaker of Old Norse, and vice-versa, although probably with some difficulty. It’s very hard to overstate the amount of influence Old Norse had on Old English, but to give an example, the pronouns "they", "their" and "them" – core components of Modern English – derive directly from the Old Norse "þeir" ("their") [3]. But the amount of change Old English experienced interacting with Old Norse pales in comparison to what happened next.

In 1066, William the Conqueror invaded England and, in doing so, established French as a prestige language. Now suddenly we see incredible amounts of French influence on Old English vocabulary and grammar, so much so that by the 12th or 13th century Old English isn’t even Old English anymore, it’s Middle English. To give you an idea of how radical a change this was, Modern English speakers can understand Middle English, whereas Old English might as well have no relation to what we think of as English today. Middle English is the language of Chaucer, who in the 14th century wrote lines like

"Whan that Aprill, with his shoures soote The droghte of March hath perced to the roote And bathed every veyne in swich licour" [4].
Middle English is the language of Orm, who wrote in the introduction of his 12th century book Orrmulum,
"[Th]iss boc iss nemmned Orrmulum, for[th]i [th]att Orrm itt wrohhte" [5].
Old English is the language Beowulf is written in, which contains such unintelligible passages as
"Hwæt! We Gardena in geardagum [th]eodcyninga [th]rym gefrunon hu [th]a æ[th]elingas elle fremedon" [6].
About 90% of the words in Modern English are from either Latin or Greek, with the vast majority of these having come through French during this time period or later [7] (although roughly half of those used in everyday speech are still of Germanic origin [8]).

Let’s get back to Bane. You might have noticed that I didn’t mention where English ranked in Bane’s complexity survey. Let me tell you, English is much less complex than Latin. English is much less complex than Icelandic, which has changed so little as to be mutually intelligible with Old Norse. English is less complex than French or German. In fact, in Bane’s ranking of 20 languages using his complexity metric, English was less complex than any other European language. Instead, English falls somewhere between Dutch and Maori. More notably, though, the English language’s complexity is, at 16.88%, closer to Nigerian Pidgin (at 9.8%) than French (at 23.05%). Old English underwent creolization to become Middle English [9].

In short, and to answer your question, yes, English can be construed as being one of the least complex European languages, not because of trade – but because of conquest.

  1. Bane, Max, 2008, "Quantifying and measuring morphological complexity", Proceedings of the 26th West Coast Conference on Formal Linguistics, http://www.lingref.com/cpp/wccfl/26/paper1657.pdf .
  2. Ugot, Mercy and Ogundipe, Afolabi, 2011, "Reduplication in Nigerian Pidgin: A Versatile Communication Tool?", Pakistan Journal of Social Sciences, http://www.medwelljournals.com/fulltext/?doi=pjssci.2011.227.233 .
  3. Harper, Douglas, n.d., "they (pron.)", Online Etymology Dictionary, http://www.etymonline.com/index.php?term=they .
  4. Eliot, Charles W., 1909, "English poetry I: from Chaucher to Gray", The Harvard Classics, http://www.bartleby.com/40/0101.html .
  5. Breen, Katharine, 2010, "Imagining an English Reading Public, 1150-1400", Volume 79 of Cambridge Studies in Medieval Literature, http://books.google.co.uk/books?id=HPW8KrD88OUC&lpg=PA116&dq=boc%20is%20nemmned%20Orrmulum&pg=PA116#v=onepage&q&f=false .
  6. Slade, Benjamin, 2012, "Beowulf: diacritically-marked text and facing translation", http://www.heorot.dk/beowulf-rede-text.html .
  7. Dictionary.com, "Word FAQs", http://dictionary.reference.com/help/faq/language/t16.html .
  8. Nation, I.S.P., 2001, "Learning Vocabulary in Another Language", Cambridge Applied Linguistics.
  9. Danchev, Adrei, 1997, "The Middle English creolization hypothesis revisited", Trends in Linguistics Studies and Monographs 103.

Tags: english, history, information theory, linguistics

Reconstructing Images Using Damaged Copies (in Python)

posted on: Thursday, September 13, 2012 (4:01 pm) by Chase Stevens

I've recently been working on implementing various methods of image reconstruction in python. The idea is to, given several imperfect (that is to say, noisy, incomplete, photoshopped, or otherwise damaged) copies of some original image, attempt to arrive at something close to (if not precisely) the original image by combining said copies. Through these means, an approximation of the original image can be generated should the original be lost, itself damaged, irretrievable, or otherwise unavailable or useless. In writing the various functions for doing this, I implemented techniques used in signal averaging and variants thereof. I also implemented a “modal image" function which, for each pixel, uses the “modal" RGB values across all image copies or, failing that, performs a simple mean of values.

Examples and Analysis

Original Christian Bale photograph

For the following examples, I modified the above image of actor Christian Bale. Ironically enough, in testing for this article, I overwrote the original image and had to employ the use of Google's reverse image-search in order to find it.

Damaged images
Damaged Christian Bale #1 Damaged Christian Bale #2 Damaged Christian Bale #3
Delta: 105.122025 Delta: 88.026425 Delta: 68.5942
Function results (listed with difference from original image as given by get_delta, lower is better):
<function average_images_add at 0x00000000030E8F28> : 155.35693875
<function average_images_sub at 0x00000000030E95F8> : 72.4844575
<function average_images at 0x0000000002EF0208> : 43.92254625
<function average_noisefilter_all_sub at 0x0000000002EF0278> : 51.1805645833
<function average_noisefilter_all_delta at 0x0000000002EF02E8> : 36.9071316667
<function modal_image at 0x0000000002EF0358> : 42.53322
Christian Bale Reconstructed Using modal_image Christian Bale Reconstructed Using average_noisefilter_all_delta
Function: modal_image
Delta: 42.53322
Function: average_noisefilter_all_delta
Delta: 36.9071316667
As is readily visible, in this example the naïve “voting”-type approach used by modal_image is deceived in any case where forms of damage across multiple images “agree” with each other. Given a larger number of copies or a less artificial form of damage, this would likely cease to be an issue; in theory, modal_image could even go so far as to reconstruct the original image perfectly. average_noisefilter_all_delta produced the best results on average, although, to its detriment, its output relies on the order of the list of image copies passed to it. In addition, while it manages to be closer to the original image, the subtraction of image differences it employs creates a slightly “jarring” visual effect (as seen above). The inherent issue in reconstructing damaged images is one of knowledge. To humans, it seems obvious that the original image of Christian Bale didn't have streaks of white across it. However, this is a mere assumption based on our vast pool of experience in viewing photographs. Who's to say that the original photo didn't have white scribbles on it? The computer is hampered, from our perspective, by its inability to make these assumptions when reconstructing the original image, so although it produces results better than a mere superposition of the copies, they rarely are better than what could be accomplished by human hands.

Noisy images
Noisy Christian Bale #1 Noisy Christian Bale #2 Noisy Christian Bale #3
Delta: 92.715475 Delta: 93.352175 Delta: 93.08885
Function results (listed with difference from original image as given by get_delta, lower is better):
<function average_images_add at 0x00000000030E8F28> : 103.01672125
<function average_images_sub at 0x00000000030E95F8> : 81.929801875
<function average_images at 0x0000000002EF0208> : 82.971985
<function average_noisefilter_all_sub at 0x0000000002EF0278> : 69.6356495833
<function average_noisefilter_all_delta at 0x0000000002EF02E8> : 75.23682
<function modal_image at 0x0000000002EF0358> : 65.2880416667
Christian Bale De-Noised Using modal_image Christian Bale De-Noised Using average_noisefilter_all_sub
Function: modal_image
Delta: 65.2880416667
Function: average_noisefilter_all_sub
Delta: 69.6356495833
In this case, modal_image produced the best results, although its output is still quite noisy. Again, having more copies would significantly improve the end product. The output also appears to be slightly brighter than would be expected. average_noisefilter_all_sub comes in at a close second, although (to the human eye) its results appear to be quite a bit noisier than modal_image.

I tried to use these image reconstruction methods on compressed images with jpeg artifacts as well, although the results were much poorer. Output deltas ended up being worse than the least compressed copy of a given set, although better than an average copy. modal_image and average_images_add seemed to perform best. The overall problem was again one of knowledge: could less compressed images be prioritized or weighted given their closeness to the source, results would likely be much better. However, the computer has no means by which to determine what is closest to the original, and thus fails to leverage this. Some kind of programmatic detection of jpeg artifacts could be of great help in improving this situation.

As a final note before my code, I just recently launched Broverbs. Please do check it out if you've got the time!

from __future__ import division
from PIL import Image as PILImage
from Tkinter import *
from tkFileDialog import askopenfilename as openfile
from itertools import product
from random import shuffle

class Image():
    def __init__(self,filename=None,dimensions=None):
        if filename!=None:
            self.pic = PILImage.open(filename)
            self.imgData = self.pic.load()
            self.size = self.pic.size
            if dimensions!=None:
                self.size = dimensions
                self.size = [0,0]
            self.pic = None
            self.imgData = {}
            for x,y in product(range(self.size[0]),range(self.size[1])):
    def setpixel(self,x,y,r,g,b):
        self.imgData[x,y] = tuple(map(max,zip((r,g,b),(0,0,0))))
    def getpixellist(self):
        return product(range(self.size[0]),range(self.size[1]))
    def show(self):
        tempImage = PILImage.new('RGB',self.size)
        for x,y in self.getpixellist():

def get_delta(image1,image2):
    if image1.size != image2.size: raise ValueError("Image sizes do not match")
    delta = 0
    for x,y in image1.getpixellist():
        delta += sum(abs_pixel_diff(image1,image2,x,y))
    return delta / (image1.size[0] * image1.size[1])

mode = (lambda l: max([(l.count(i), i) for i in set(l)])[1] if len(set(l)) < len(l) else sum(l)/float(len(l)))

pixel_operation = lambda f: lambda image1,image2,x,y: map(f,zip(image1.imgData[x,y],image2.imgData[x,y]))
abs_pixel_diff = pixel_operation(lambda (x,y): abs(x-y))
pixel_diff = pixel_operation(lambda (x,y): x-y)
pixel_avg = pixel_operation(lambda (x,y): (x+y)/2)
pixel_sum = pixel_operation(lambda (x,y): x+y)

def image_operation(operation,image1,image2):
    if image1.size != image2.size: raise ValueError("Image sizes do not match")
    result = Image(dimensions=image1.size)
    for x,y in result.getpixellist():
        r,g,b = operation(image1,image2,x,y)
    return result

get_delta_image = lambda image1,image2: image_operation(abs_pixel_diff,image1,image2)
subtract_image = lambda minuend,subtrahend: image_operation(pixel_diff,minuend,subtrahend)
average_image = lambda image1,image2: image_operation(pixel_avg,image1,image2)
add_image = lambda image1,image2: image_operation(pixel_sum,image1,image2)

def get_average_image(image_list):
    output = Image(dimensions=set1[0].size)
    for x,y in output.getpixellist():
        r,g,b = reduce(lambda a,b: (a[0]+b[0],a[1]+b[1],a[2]+b[2]),[image.imgData[x,y] for image in image_list])
    return output

def generalized_average(image_list, combination_function, function1, function2):
    set1 = image_list[0::2]
    image1 = get_average_image(set1)
    set2 = image_list[1::2]
    image2 = get_average_image(set2)
    return combination_function(function1(image1,image2),function2(image1,image2))

def average_images_add(image_list): 
    return generalized_average(image_list,add_image,average_image,subtract_image)

def average_images_sub(image_list):
    return generalized_average(image_list,subtract_image,average_image,subtract_image)

def average_images(image_list): 
    return generalized_average(image_list,subtract_image,average_image,get_delta_image)

def generalized_noisefilter(image_list,function):
    set1 = image_list[0::2]
    image1 = get_average_image(set1)
    set2 = image_list[1::2]
    image2 = get_average_image(set2)
    noise = function(image1,image2)
    denoise_set = [subtract_image(image,noise) for image in image_list]
    return get_average_image(denoise_set)

def average_noisefilter_all_sub(image_list):
    return generalized_noisefilter(image_list,subtract_image)

def average_noisefilter_all_delta(image_list):
    return generalized_noisefilter(image_list,get_delta_image)

def modal_image(image_list):
    result = Image(dimensions=image_list[0].size)
    for x,y in result.getpixellist():
        r = mode([image.imgData[x,y][0] for image in image_list])
        g = mode([image.imgData[x,y][1] for image in image_list])
        b = mode([image.imgData[x,y][2] for image in image_list])
    return result

def test_func(f,images,original,trials=25):
    results = list()
    for x in range(trials):
        print "Trial #"+str(x+1)
    return sum(results)/len(results)

function_list = [average_images_add,average_images_sub,average_images,average_noisefilter_all_sub,average_noisefilter_all_delta,modal_image]

def test_functions(image_list,original,functions=function_list,trials=10):
    out = ''
    for f in functions:
        out += (str(f) + ' ')
        out += (': ')
        out += (str(test_func(f,image_list,original,trials)))
        out += ('n')
    print out

Tags: code, image processing, information theory, programming, python, voting