Information
Philosophy of Information deals with the philosophical analysis of the notion of information both from a historical and a systematic perspective. With the emergence of the empiricist theory of knowledge in early modern philosophy, the development of various mathematical theories of information in the twentieth century and the rise of information technology, the concept of “information” has conquered a central place in the sciences and in society. This interest also led to the emergence of a separate branch of philosophy that analyzes information in all its guises (Adriaans & van Benthem 2008a,b; Lenski 2010; Floridi 2002, 2011).
Information has become a central category in both the sciences and the humanities and the reflection on information influences a broad range of philosophical disciplines varying from logic (Dretske 1981; van Benthem & van Rooij 2003; van Benthem 2006, see the entry on logic and information), epistemology (Simondon 1989) to ethics (Floridi 1999) and esthetics (Schmidhuber 1997a; Adriaans 2008) to ontology (Zuse 1969; Wheeler 1990; Schmidhuber 1997b; Wolfram 2002; Hutter 2010). There is no consensus about the exact nature of the field of philosophy of information. Several authors have proposed a more or less coherent philosophy of information as an attempt to rethink philosophy from a new perspective: e.g., quantum physics (Mugur-Schächter 2002), logic (Brenner 2008), semantic information (Floridi 2011; Adams & de Moraes 2016, see the entry on semantic conceptions of information), communication and message systems (Capurro & Holgate 2011) and meta-philosophy (Wu 2010, 2016).
Others (Adriaans & van Benthem 2008a; Lenski 2010) see it more as a technical discipline with deep roots in the history of philosophy and consequences for various disciplines like methodology, epistemology and ethics. Whatever one’s interpretation of the nature of philosophy of information is, it seems to imply an ambitious research program consisting of many sub-projects varying from the reinterpretation of the history of philosophy in the context of modern theories of information, to an in depth analysis of the role of information in science, the humanities and society as a whole. The term “information” in colloquial speech is currently predominantly used as an abstract mass-noun used to denote any amount of data, code or text that is stored, sent, received or manipulated in any medium.
The detailed history of both the term “information” and the various concepts that come with it is complex and for the larger part still has to be written (Seiffert 1968; Schnelle 1976; Capurro 1978, 2009; Capurro & Hjørland 2003). The exact meaning of the term “information” varies in different philosophical traditions and its colloquial use varies geographically and over different pragmatic contexts. Although an analysis of the notion of information has been a theme in Western philosophy from its early inception, the explicit analysis of information as a philosophical concept is recent, and dates back to the second half of the twentieth century.
At this moment it is clear that information is a pivotal concept in the sciences and humanities and in our every day life. Everything we know about the world is based on information we received or gathered and every science in principle deals with information. There is a network of related concepts of information, with roots in various disciplines like physics, mathematics, logic, biology, economy and epistemology.
All these notions cluster around two central properties: The elegance of this formula however does not shield us from the conceptual problems it harbors. In the twentieth century various proposals for formalization of concepts of information were made: Until recently the possibility of a unification of these theories was generally doubted (Adriaans & van Benthem 2008a), but after two decades of research, perspectives for unification seem better. The contours of a unified concept of information emerges along the following lines: The emergence of a coherent theory to measure information quantitatively in the twentieth century is closely related to the development of the theory of computing.
Central in this context are the notions of Universality, Turing equivalence and Invariance: because the concept of a Turing system defines the notion of a universal programmable computer, all universal models of computation seem to have the same power. This implies that all possible measures of information definable for universal models of computation (Recursive Functions, Turing Machine, Lambda Calculus etc.) are invariant modulo an additive constant. This gives a perspective on a unified theory of information that might dominate the research program for the years to come.
📑 Contents
1. Information in Colloquial Speech
The lack of preciseness and the universal usefulness of the term “information” go hand in hand. In our society, in which we explore reality by means of instruments and installations of ever increasing complexity (telescopes, cyclotrons) and communicate via more advanced media (newspapers, radio, television, SMS, the Internet), it is useful to have an abstract mass-noun for the “stuff” that is created by the instruments and that “flows” through these media. Historically this general meaning emerged rather late and seems to be associated with the rise of mass media and intelligence agencies (Devlin & Rosenberg 2008; Adriaans & van Benthem 2008b).
In present colloquial speech the term information is used in various loosely defined and often even conflicting ways. Most people, for instance, would consider the following inference prima facie to be valid:
The same people would probably have no problems with the statement that “Secret services sometimes distribute false information”, or with the sentence “The information provided by the witnesses of the accident was vague and conflicting”. The first statement implies that information necessarily is true, while the other statements allow for the possibility that information is false, conflicting and vague. In everyday communication these inconsistencies do not seem to create great trouble and in general it is clear from the pragmatic context what type of information is designated. These examples suffice to argue that references to our intuitions as speakers of the English language are of little help in the development of a rigorous philosophical theory of information. There seems to be no pragmatic pressure in everyday communication to converge to a more exact definition of the notion of information.
2. History of the Term and the Concept of Information
Until the second half of the twentieth century almost no modern philosopher considered “information” to be an important philosophical concept. The term has no lemma in the well-known encyclopedia of Edwards (1967) and is not mentioned in Windelband (1903). In this context the interest in “Philosophy of Information” is a recent development. Yet, with hindsight from the perspective of a history of ideas, reflection on the notion of “information” has been a predominant theme in the history of philosophy. The reconstruction of this history is relevant for the study of information.
A problem with any “history of ideas” approach is the validation of the underlying assumption that the concept one is studying has indeed continuity over the history of philosophy. In the case of the historical analysis of information one might ask whether the concept of “informatio” discussed by Augustine has any connection to Shannon information, other than a resemblance of the terms. At the same time one might ask whether Locke’s “historical, plain method” is an important contribution to the emergence of the modern concept of information although in his writings Locke hardly uses the term “information” in a technical sense. As is shown below, there is a conglomerate of ideas involving a notion of information that has developed from antiquity till recent times, but further study of the history of the concept of information is necessary.
An important recurring theme in the early philosophical analysis of knowledge is the paradigm of manipulating a piece of wax: either by simply deforming it, by imprinting a signet ring in it or by writing characters on it. The fact that wax can take different shapes and secondary qualities (temperature, smell, touch) while the volume (extension) stays the same, make it a rich source of analogies, natural to Greek, Roman and medieval culture, where wax was used both for sculpture, writing (wax tablets) and encaustic painting. One finds this topic in writings of such diverse authors as Democritus, Plato, Aristotle, Theophrastus, Cicero, Augustine, Avicenna, Duns Scotus, Aquinas, Descartes and Locke.
2.1 Classical Philosophy
In classical philosophy “information” was a technical notion associated with a theory of knowledge and ontology that originated in Plato’s (427–347 BCE) theory of forms, developed in a number of his dialogues (Phaedo, Phaedrus, Symposium, Timaeus, Republic). Various imperfect individual horses in the physical world could be identified as horses, because they participated in the static atemporal and aspatial idea of “horseness” in the world of ideas or forms. When later authors like Cicero (106–43 BCE) and Augustine (354–430 CE) discussed Platonic concepts in Latin they used the terms informare and informatio as a translation for technical Greek terms like eidos (essence), idea (idea), typos (type), morphe (form) and prolepsis (representation). The root “form” still is recognizable in the word in-form-ation (Capurro & Hjørland 2003). Plato’s theory of forms was an attempt to formulate a solution for various philosophical problems: the theory of forms mediates between a static (Parmenides, ca. 450 BCE) and a dynamic (Herakleitos, ca. 535–475 BCE) ontological conception of reality and it offers a model to the study of the theory of human knowledge. According to Theophrastus (371–287 BCE) the analogy of the wax tablet goes back to Democritos (ca. 460–380/370 BCE) (De Sensibus 50). In the Theaetetus (191c,d) Plato compares the function of our memory with a wax tablet in which our perceptions and thoughts are imprinted like a signet ring stamps impressions in wax. Note that the metaphor of imprinting symbols in wax is essentially spatial (extensive) and can not easily be reconciled with the aspatial interpretation of ideas supported by Plato.
One gets a picture of the role the notion of “form” plays in classical methodology if one considers Aristotle’s (384–322 BCE) doctrine of the four causes. In Aristotelian methodology understanding an object implied understanding four different aspects of it:
Note that Aristotle, who rejects Plato’s theory of forms as atemporal aspatial entities, still uses “form” as a technical concept. This passage states that knowing the form or structure of an object, i.e., the information, is a necessary condition for understanding it. In this sense information is a crucial aspect of classical epistemology.
The fact that the ratio 2:1 is cited as an example also illustrates the deep connection between the notion of forms and the idea that the world was governed by mathematical principles. Plato believed under influence of an older Pythagorean (Pythagoras 572–ca. 500 BCE) tradition that “everything that emerges and happens in the world” could be measured by means of numbers (Politicus 285a). On various occasions Aristotle mentions the fact that Plato associated ideas with numbers (Vogel 1968: 139). Although formal mathematical theories about information only emerged in the twentieth century, and one has to be careful not to interpret the Greek notion of a number in any modern sense, the idea that information was essentially a mathematical notion, dates back to classical philosophy: the form of an entity was conceived as a structure or pattern that could be described in terms of numbers. Such a form had both an ontological and an epistemological aspect: it explains the essence as well as the understandability of the object. The concept of information thus from the very start of philosophical reflection was already associated with epistemology, ontology and mathematics.
Two fundamental problems that are not explained by the classical theory of ideas or forms are 1) the actual act of knowing an object (i.e., if I see a horse in what way is the idea of a horse activated in my mind) and 2) the process of thinking as manipulation of ideas. Aristotle treats these issues in De Anime, invoking the signet-ring-impression-in-wax analogy:
By a “sense” is meant what has the power of receiving into itself the sensible forms of things without the matter. This must be conceived of as taking place in the way in which a piece of wax takes on the impress of a signet-ring without the iron or gold; we say that what produces the impression is a signet of bronze or gold, but its particular metallic constitution makes no difference: in a similar way the sense is affected by what is coloured or flavoured or sounding, but it is indifferent what in each case the substance is; what alone matters is what quality it has, i.e., in what ratio its constituents are combined. (De Anime, Book II, Chp. 12)
Have not we already disposed of the difficulty about interaction involving a common element, when we said that mind is in a sense potentially whatever is thinkable, though actually it is nothing until it has thought? What it thinks must be in it just as characters may be said to be on a writing-tablet on which as yet nothing actually stands written: this is exactly what happens with mind. (De Anime, Book III, Chp. 4)
These passages are rich in influential ideas and can with hindsight be read as programmatic for a philosophy of information: the process of informatio can be conceived as the imprint of characters on a wax tablet (tabula rasa), thinking can be analyzed in terms of manipulation of symbols.
2.2 Medieval Philosophy
Throughout the Middle Ages the reflection on the concept of informatio is taken up by successive thinkers. Illustrative for the Aristotelian influence is the passage of Augustine in De Trinitate book XI. Here he analyzes vision as an analogy for the understanding of the Trinity. There are three aspects: the corporeal form in the outside world, the informatio by the sense of vision, and the resulting form in the mind. For this process of information Augustine uses the image of a signet ring making an impression in wax (De Trinitate, XI Cap 2 par 3). Capurro (2009) observes that this analysis can be interpreted as an early version of the technical concept of “sending a message” in modern information theory, but the idea is older and is a common topic in Greek thought (Plato Theaetetus 191c,d; Aristotle De Anime, Book II, Chp. 12, Book III, Chp. 4; Theophrastus De Sensibus 50).
The tabula rasa notion was later further developed in the theory of knowledge of Avicenna (c. 980–1037 CE):
The human intellect at birth is rather like a tabula rasa, a pure potentiality that is actualized through education and comes to know. Knowledge is attained through empirical familiarity with objects in this world from which one abstracts universal concepts. (Sajjad 2006 [Other Internet Resources [hereafter OIR]])
The idea of a tabula rasa development of the human mind was the topic of a novel Hayy ibn Yaqdhan by the Arabic Andalusian philosopher Ibn Tufail (1105–1185 CE, known as “Abubacer” or “Ebn Tophail” in the West). This novel describes the development of an isolated child on a deserted island. A later translation in Latin under the title Philosophus Autodidactus (1761) influenced the empiricist John Locke in the formulation of his tabula rasa doctrine.
Apart from the permanent creative tension between theology and philosophy, medieval thought, after the rediscovery of Aristotle’s Metaphysics in the twelfth century inspired by Arabic scholars, can be characterized as an elaborate and subtle interpretation and development of, mainly Aristotelian, classical theory. Reflection on the notion of informatio is taken up, under influence of Avicenna, by thinkers like Aquinas (1225–1274 CE) and Duns Scotus (1265/66–1308 CE). When Aquinas discusses the question whether angels can interact with matter he refers to the Aristotelian doctrine of hylomorphism (i.e., the theory that substance consists of matter (hylo (wood), matter) and form (morphè)). Here Aquinas translates this as the in-formation of matter (informatio materiae) (Summa Theologiae, 1a 110 2; Capurro 2009). Duns Scotus refers to informatio in the technical sense when he discusses Augustine’s theory of vision in De Trinitate, XI Cap 2 par 3 (Duns Scotus, 1639, “De imagine”, Ordinatio, I, d.3, p.3).
The tension that already existed in classical philosophy between Platonic idealism(universalia ante res) and Aristotelian realism (universalia in rebus) is recaptured as the problem of universals: do universal qualities like “humanity” or the idea of a horse exist apart from the individual entities that instantiate them? It is in the context of his rejection of universals that Ockham (c. 1287–1347 CE) introduces his well-known razor: entities should not be multiplied beyond necessity. Throughout their writings Aquinas and Scotus use the Latin terms informatio and informare in a technical sense, although this terminology is not used by Ockham.
2.3 Modern Philosophy
The history of the concept of information in modern philosophy is complicated. Probably starting in the fourteenth century the term “information” emerged in various developing European languages in the general meaning of “education” and “inquiry”. The French historical dictionary by Godefroy (1881) gives action de former, instruction, enquête, science, talent as early meanings of “information”. The term was also used explicitly for legal inquiries (Dictionnaire du Moyen Français (1330–1500) 2015). Because of this colloquial use the term “information” loses its association with the concept of “form” gradually and appears less and less in a formal sense in philosophical texts.
At the end of the Middle Ages society and science are changing fundamentally (Hazard 1935; Ong 1958; Dijksterhuis 1986). In a long complex process the Aristotelian methodology of the four causes was transformed to serve the needs of experimental science:
In this changing context the analogy of the wax-impression is reinterpreted. A proto-version of the modern concept of information as the structure of a set or sequence of simple ideas is developed by the empiricists, but since the technical meaning of the term “information” is lost, this theory of knowledge is never identified as a new “theory of information”.
The consequence of this shift in methodology is that only phenomena that can be explained in terms of mechanical interaction between material bodies can be studied scientifically. This implies in a modern sense: the reduction of intensive properties to measurable extensive properties. For Galileo this insight is programmatic:
To excite in us tastes, odors, and sounds I believe that nothing is required in external bodies except shapes, numbers, and slow or rapid movements. (Galileo 1623 [1960: 276)
These insights later led to the doctrine of the difference between primary qualities (space, shape, velocity) and secondary qualities (heat, taste, color etc.). In the context of philosophy of information Galileo’s observations on the secondary quality of “heat” is of particular importance since they lay the foundations for the study of thermodynamics in the nineteenth century:
Having shown that many sensations which are supposed to be qualities residing in external objects have no real existence save in us, and outside ourselves are mere names, I now say that I am inclined to believe heat to be of this character. Those materials which produce heat in us and make us feel warmth, which are known by the general name of “fire,” would then be a multitude of minute particles having certain shapes and moving with certain velocities. (Galileo 1623 [1960: 277)
A pivotal thinker in this transformation is René Descartes (1596–1650 CE). In his Meditationes, after “proving” that the matter (res extensa) and mind (res cogitans) are different substances (i.e., forms of being existing independently), the question of the interaction between these substances becomes an issue. The malleability of wax is for Descartes an explicit argument against influence of the res extensa on the res cogitans (Meditationes II, 15). The fact that a piece of wax loses its form and other qualities easily when heated, implies that the senses are not adequate for the identification of objects in the world. True knowledge thus can only be reached via “inspection of the mind”. Here the wax metaphor that for more than 1500 years was used to explain sensory impression is used to argue against the possibility to reach knowledge via the senses. Since the essence of the res extensa is extension, thinking fundamentally can not be understood as a spatial process. Descartes still uses the terms “form” and “idea” in the original scholastic non-geometric (atemporal, aspatial) sense. An example is the short formal proof of God’s existence in the second answer to Mersenne in the Meditationes de Prima Philosophia
I use the term idea to refer to the form of any given thought, immediate perception of which makes me aware of the thought.
(Idea nomine intelligo cujuslibet cogitationis formam illam, per cujus immediatam perceptionem ipsius ejusdem cogitationis conscious sum)
I call them “ideas” says Descartes
only in so far as they make a difference to the mind itself when they inform that part of the brain.
(sed tantum quatenus mentem ipsam in illam cerebri partem conversam informant). (Descartes, 1641, Ad Secundas Objections, Rationes, Dei existentiam & anime distinctionem probantes, more Geometrico dispositae.)
Because the res extensa and the res cogitans are different substances, the act of thinking can never be emulated in space: machines can not have the universal faculty of reason. Descartes gives two separate motivations:
Of these the first is that they could never use words or other signs arranged in such a manner as is competent to us in order to declare our thoughts to others: (…) The second test is, that although such machines might execute many things with equal or perhaps greater perfection than any of us, they would, without doubt, fail in certain others from which it could be discovered that they did not act from knowledge, but solely from the disposition of their organs: for while reason is an universal instrument that is alike available on every occasion, these organs, on the contrary, need a particular arrangement for each particular action; whence it must be morally impossible that there should exist in any machine a diversity of organs sufficient to enable it to act in all the occurrences of life, in the way in which our reason enables us to act. (Discourse de la méthode, 1647)
The passage is relevant since it directly argues against the possibility of artificial intelligence and it even might be interpreted as arguing against the possibility of a universal Turing machine: reason as a universal instrument can never be emulated in space. This conception is in opposition to the modern concept of information which as a measurable quantity is essentially spatial, i.e., extensive (but in a sense different from that of Descartes).
Descartes does not present a new interpretation of the notions of form and idea, but he sets the stage for a debate about the nature of ideas that evolves around two opposite positions:
Locke’s reinterpretation of the notion of idea as a “structural placeholder” for any entity present in the mind is an essential step in the emergence of the modern concept of information. Since these ideas are not involved in the justification of apodeictic knowledge, the necessity to stress the atemporal and aspatial nature of ideas vanishes. The construction of concepts on the basis of a collection of elementary ideas based in sensorial experience opens the gate to a reconstruction of knowledge as an extensive property of an agent: more ideas implies more probable knowledge.
In the second half of the seventeenth century formal theory of probability is developed by researchers like Pascal (1623–1662), Fermat (1601 or 1606–1665) and Christiaan Huygens (1629–1695). The work De ratiociniis in ludo aleae of Huygens was translated in to English by John Arbuthnot (1692). For these authors, the world was essentially mechanistic and thus deterministic, probability was a quality of human knowledge caused by its imperfection:
It is impossible for a Die, with such determin’d force and direction, not to fall on such determin’d side, only I don’t know the force and direction which makes it fall on such determin’d side, and therefore I call it Chance, wich is nothing but the want of art;… (John Arbuthnot Of the Laws of Chance (1692), preface)
This text probably influenced Hume, who was the first to marry formal probability theory with theory of knowledge:
Though there be no such thing as Chance in the world; our ignorance of the real cause of any event has the same influence on the understanding, and begets a like species of belief or opinion. (…) If a dye were marked with one figure or number of spots on four sides, and with another figure or number of spots on the two remaining sides, it would be more probable, that the former would turn up than the latter; though, if it had a thousand sides marked in the same manner, and only one side different, the probability would be much higher, and our belief or expectation of the event more steady and secure. This process of the thought or reasoning may seem trivial and obvious; but to those who consider it more narrowly, it may, perhaps, afford matter for curious speculation. (Hume 1748: Section VI, “On probability” 1)
Here knowledge about the future as a degree of belief is measured in terms of probability, which in its turn is explained in terms of the number of configurations a deterministic system in the world can have. The basic building blocks of a modern theory of information are in place. With this new concept of knowledge empiricists laid the foundation for the later development of thermodynamics as a reduction of the secondary quality of heat to the primary qualities of bodies.
At the same time the term “information” seems to have lost much of its technical meaning in the writings of the empiricists so this new development is not designated as a new interpretation of the notion of “information”. Locke sometimes uses the phrase that our senses “inform” us about the world and occasionally uses the word “information”.
For what information, what knowledge, carries this proposition in it, viz. “Lead is a metal” to a man who knows the complex idea the name lead stands for? (Locke 1689: bk IV, ch 8, para 4)
Hume seems to use information in the same casual way when he observes:
Two objects, though perfectly resembling each other, and even appearing in the same place at different times, may be numerically different: And as the power, by which one object produces another, is never discoverable merely from their idea, it is evident cause and effect are relations, of which we receive information from experience, and not from any abstract reasoning or reflection. (Hume 1739: Part III, section 1)
The empiricists methodology is not without problems. The biggest issue is that all knowledge becomes probabilistic and a posteriori. Immanuel Kant (1724–1804) was one of the first to point out that the human mind has a grasp of the meta-concepts of space, time and causality that itself can never be understood as the result of a mere combination of “ideas”. What is more, these intuitions allow us to formulate scientific insights with certainty: i.e., the fact that the sum of the angles of a triangle in Euclidean space is 180 degrees. This issue cannot be explained in the empirical framework. If knowledge is created by means of combination of ideas then there must exist an a priori synthesis of ideas in the human mind. According to Kant, this implies that the human mind can evaluate its own capability to formulate scientific judgments. In his Kritik der reinen Vernunft (1781) Kant developed transcendental philosophy as an investigation of the necessary conditions of human knowledge. Although Kant’s transcendental program did not contribute directly to the development of the concept of information, he did influence research in to the foundations of mathematics and knowledge relevant for this subject in the nineteenth and twentieth century: e.g., the work of Frege, Husserl, Russell, Brouwer, L. Wittgenstein, Gödel, Carnap, Popper and Quine.
2.4 Historical Development of the Meaning of the Term “Information”
The history of the term “information” is intricately related to the study of central problems in epistemology and ontology in Western philosophy. After a start as a technical term in classical and medieval texts the term “information” almost vanished from the philosophical discourse in modern philosophy, but gained popularity in colloquial speech. Gradually the term obtained the status of an abstract mass-noun, a meaning that is orthogonal to the classical process-oriented meaning. In this form it was picked up by several researchers (Fisher 1925; Shannon 1948) in the twentieth century who introduced formal methods to measure “information”. This, in its turn, lead to a revival of the philosophical interest in the concept of information. This complex history seems to be one of the main reasons for the difficulties in formulating a definition of a unified concept of information that satisfies all our intuitions. At least three different meanings of the word “information” are historically relevant:
3. Building Blocks of Modern Theories of Information
With hindsight many notions that have to do with optimal code systems, ideal languages and the association between computing and processing language have been recurrent themes in the philosophical reflection since the seventeenth century.
3.1 Languages
One of the most elaborate proposals for a universal “philosophical” language was made by bishop John Wilkins (Maat 2004): “An Essay towards a Real Character, and a Philosophical Language” (1668). Wilkins’ project consisted of an elaborate system of symbols that supposedly were associated with unambiguous concepts in reality. Proposals such as these made philosophers sensitive to the deep connections between language and thought. The empiricist methodology made it possible to conceive the development of language as a system of conventional signs in terms of associations between ideas in the human mind. The issue that currently is known as the symbol grounding problem (how do arbitrary signs acquire their inter-subjective meaning) was one of the most heavily debated questions in the eighteenth century in the context of the problem of the origin of languages. Diverse thinkers as Vico, Condillac, Rousseau, Diderot, Herder and Haman made contributions. The central question was whether language was given a priori (by God) or whether it was constructed and hence an invention of man himself. Typical was the contest issued by the Royal Prussian Academy of Sciences in 1769:
En supposant les hommes abandonnés à leurs facultés naturelles, sont-ils en état d’inventer le langage? Et par quels moyens parviendront-ils d’eux-mêmes à cette invention?
Assuming men abandoned to their natural faculties, are they able to invent language and by what means will they come to this invention?[1]
The controversy raged on for over a century without any conclusion and in 1866 the Linguistic Society of Paris (Société de Linguistique de Paris) banished the issue from its arena. [2]
Philosophically more relevant is the work of Leibniz (1646–1716) on a so-called characteristica universalis: the notion of a universal logical calculus that would be the perfect vehicle for scientific reasoning. A central presupposition in Leibniz’ philosophy is that such a perfect language of science is in principle possible because of the perfect nature of the world as God’s creation (ratio essendi = ration cognoscendi, the origin of being is the origin of knowing). This principle was rejected by Wolff (1679–1754) who suggested more heuristically oriented characteristica combinatoria (van Peursen 1987). These ideas had to wait for thinkers like Boole (1854, An Investigation of the Laws of Thought), Frege (1879, Begriffsschrift), Peirce (who in 1886 already suggested that electrical circuits could be used to process logical operations) and Whitehead and Russell (1910–1913, Principia Mathematica) to find a more fruitful treatment.
3.2 Optimal Codes
The fact that frequencies of letters vary in a language was known since the invention of book printing. Printers needed many more “e”s and “t”s than “x”s or “q”s to typeset an English text. This knowledge was used extensively to decode ciphers since the seventeenth century (Kahn 1967; Singh 1999). In 1844 an assistant of Samuel Morse, Alfred Vail, determined the frequency of letters used in a local newspaper in Morristown, New Jersey, and used them to optimize Morse code. Thus the core of theory of optimal codes was already established long before Shannon developed its mathematical foundation (Shannon 1948; Shannon & Weaver 1949). Historically important but philosophically less relevant are the efforts of Charles Babbage to construct computing machines (Difference Engine in 1821, and the Analytical Engine 1834–1871) and the attempt of Ada Lovelace (1815–1852) to design what is considered to be the first programming language for the Analytical Engine.
3.3 Numbers
The simplest way of representing numbers is via a unary system. Here the length of the representation of a number is equal to the size of the number itself, i.e., the number “ten” is represented as “\\\\\\\\\\”. The classical Roman number system is an improvement since it contains different symbols for different orders of magnitude (one = I, ten = X, hundred = C, thousand = M). This system has enormous drawbacks since in principle one needs an infinite amount of symbols to code the natural numbers and because of this the same mathematical operations (adding, multiplication etc.) take different forms at different orders of magnitude. Around 500 CE the number zero was invented in India. Using zero as a placeholder we can code an infinity of numbers with a finite set of symbols (one = I, ten = 10, hundred = 100, thousand = 1000 etc.). From a modern perspective an infinite number of position systems is possible as long as we have 0 as a placeholder and a finite number of other symbols. Our normal decimal number system has ten digits “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” and represents the number two-hundred-and-fifty-five as “255”. In a binary number system we only have the symbols “0” and “1”. Here two-hundred-and-fifty-five is represented as “11111111”. In a hexadecimal system with 16 symbols (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f) the same number can be written as “ff”. Note that the length of these representations differs considerable. Using this representation, mathematical operations can be standardized irrespective of the order of magnitude of numbers we are dealing with, i.e., the possibility of a uniform algorithmic treatment of mathematical functions (addition, subtraction, multiplication and division etc.) is associated with such a position system.
The concept of a positional number system was brought to Europe by the Persian mathematician al-Khwarizmi (ca. 780–ca. 850 CE). His main work on numbers (ca. 820 CE) was translated into Latin as Liber Algebrae et Almucabola in the twelfth century, which gave us amongst other things the term “algebra”. Our word “algorithm” is derived from Algoritmi, the Latin form of his name. Positional number systems simplified commercial and scientific calculations.
In 1544 Michael Stifel introduced the concept of the exponent of a number in Arithmetica integra (1544). Thus 8 can be written as \(2^3\) and 25 as \(5^2\). The notion of an exponent immediately suggests the notion of a logarithm as its inverse function: \(\log_b b^a) = a\). Stifel compared the arithmetic sequence:
in which the term 1 have a difference of 1 with the geometric sequence:
in which the terms have a ratio of 2. The exponent notation allowed him to rewrite the values of the second table as:
which combines the two tables. This arguably was the first logarithmic table. A more definitive and practical theory of logarithms is developed by John Napier (1550–1617) in his main work (Napier 1614). He coined the term logarithm (logos + arithmetic: ratio of numbers). As is clear from the match between arithmetic and geometric progressions, logarithms reduce products to sums:
They also reduce divisions to differences:
and powers to products:
After publication of the logarithmic tables by Briggs (1624) this new technique of facilitating complex calculations rapidly gained popularity.
3.4 Physics
Galileo (1623) already had suggested that the analysis of phenomena like heat and pressure could be reduced to the study of movements of elementary particles. Within the empirical methodology this could be conceived as the question how the sensory experience of the secondary quality of heat of an object or a gas could be reduced to movements of particles. Bernoulli (Hydrodynamica published in 1738) was the first to develop a kinetic theory of gases in which macroscopically observable phenomena are described in terms of microstates of systems of particles that obey the laws of Newtonian mechanics, but it was quite an intellectual effort to come up with an adequate mathematical treatment. Clausius (1850) made a conclusive step when he introduced the notion of the mean free path of a particle between two collisions. This opened the way for a statistical treatment by Maxwell who formulated his distribution in 1857, which was the first statistical law in physics. The definitive formula that tied all notions together (and that is engraved on his tombstone, though the actual formula is due to Planck) was developed by Boltzmann:
It describes the entropy S of a system in terms of the logarithm of the number of possible microstates W, consistent with the observable macroscopic states of the system, where k is the well-known Boltzmann constant. In all its simplicity the value of this formula for modern science can hardly be overestimated. The expression “\(\log W\)” can, from the perspective of information theory, be interpreted in various ways:
Thus it connects the additive nature of logarithm with the extensive qualities of entropy, probability, typicality and information and it is a fundamental step in the use of mathematics to analyze nature. Later Gibbs (1906) refined the formula:
where \(p_i\) is the probability that the system is in the \(i^{\textrm{th}}\) microstate. This formula was adopted by Shannon (1948; Shannon & Weaver 1949) to characterize the communication entropy of a system of messages. Although there is a close connection between the mathematical treatment of entropy and information, the exact interpretation of this fact has been a source of controversy ever since (Harremoës & Topsøe 2008; Bais & Farmer 2008).
4. Developments in Philosophy of Information
The modern theories of information emerged in the middle of the twentieth century in a specific intellectual climate in which the distance between the sciences and parts of academic philosophy was quite big. Some philosophers displayed a specific anti-scientific attitude: Heidegger, “Die Wissenschaft denkt nicht.” On the other hand the philosophers from the Wiener Kreis overtly discredited traditional philosophy as dealing with illusionary problems (Carnap 1928). The research program of logical positivism was a rigorous reconstruction of philosophy based on a combination of empiricism and the recent advances in logic. It is perhaps because of this intellectual climate that early important developments in the theory of information took place in isolation from mainstream philosophical reflection. A landmark is the work of Dretske in the early eighties (Dretske 1981). Since the turn of the century, interest in Philosophy of Information has grown considerably, largely under the influence of the work of Luciano Floridi on semantic information. Also the rapid theoretical development of quantum computing and the associated notion of quantum information have had it repercussions on philosophical reflection.
4.1 Popper: Information as Degree of Falsifiability
The research program of logical positivism of the Wiener Kreis in the first half of the twentieth century revitalized the older project of empiricism. Its ambition was to reconstruct scientific knowledge on the basis of direct observations and logical relation between statements about those observations. The old criticism of Kant on empiricism was revitalized by Quine (1951). Within the framework of logical positivism induction was invalid and causation could never be established objectively. In his Logik der Forschung (1934) Popper formulates his well-known demarcation criterion and he positions this explicitly as a solution to Hume’s problem of induction (Popper 1934 [1977: 42]). Scientific theories formulated as general laws can never be verified definitively, but they can be falsified by only one observation. This implies that a theory is “more” scientific if it is richer and provides more opportunity to be falsified:
Thus it can be said that the amount of empirical information conveyed by a theory, or its empirical content, increases with its degree of falsifiability. (Popper 1934 [1977: 113], emphasis in original)
This quote, in the context of Popper’s research program, shows that the ambition to measure the amount of empirical information in scientific theory conceived as a set of logical statements was already recognized as a philosophical problem more than a decade before Shannon formulated his theory of information. Popper is aware of the fact that the empirical content of a theory is related to its falsifiability and that this in its turn has a relation with the probability of the statements in the theory. Theories with more empirical information are less probable. Popper distinguishes logical probability from numerical probability (“which is employed in the theory of games and chance, and in statistics”; Popper 1934 [1977: 119]). In a passage that is programmatic for the later development of the concept of information he defines the notion of logical probability:
The logical probability of a statement is complementary to its falsifiability: it increases with decreasing degree of falsifiability. The logical probability 1 corresponds to the degree 0 of falsifiability and vice versa. (Popper 1934 [1977: 119], emphasis in original)
It is possible to interpret numerical probability as applying to a subsequence (picked out from the logical probability relation) for which a system of measurement can be defined, on the basis of frequency estimates. (Popper 1934 [1977: 119], emphasis in original)
Popper never succeeded in formulating a good formal theory to measure this amount of information although in later writings he suggests that Shannon’s theory of information might be useful (Popper 1934 [1977], 404 [Appendix IX, from 1954]). These issues were later developed in philosophy of science. Theory of conformation studies induction theory and the way in which evidence “supports” a certain theory (Huber 2007 [OIR]). Although the work of Carnap motivated important developments in both philosophy of science and philosophy of information the connection between the two disciplines seems to have been lost. There is no mention of information theory or any of the more foundational work in philosophy of information in Kuipers (2007a), but the two disciplines certainly have overlapping domains. (See, e.g., the discussion of the so-called Black Ravens Paradox by Kuipers (2007b) and Rathmanner & Hutter (2011).)
4.2 Shannon: Information Defined in Terms of Probability
In two landmark papers Shannon (1948; Shannon & Weaver 1949) characterized the communication entropy of a system of messages A:
Here \(p_i\) is the probability of message i in A. This is exactly the formula for Gibb’s entropy in physics. The use of base-2 logarithms ensures that the code length is measured in bits (binary digits). It is easily seen that the communication entropy of a system is maximal when all the messages have equal probability and thus are typical.
The amount of information I in an individual message x is given by:
This formula, that can be interpreted as the inverse of the Boltzmann entropy, covers a number of our basic intuitions about information:
Information as the negative log of the probability is the only mathematical function that exactly fulfills these constraints (Cover & Thomas 2006). Shannon offers a theoretical framework in which binary strings can be interpreted as words in a (programming) language containing a certain amount of information (see 3.1 Languages). The expression \(-\log p_x\) exactly gives the length of an optimal code for message x and as such formalizes the old intuition that codes are more efficient when frequent letters get shorter representations (see 3.2 Optimal codes). Logarithms as a reduction of multiplication to addition (see 3.3 Numbers) are a natural representation of extensive properties of systems and already as such had been used by physicists in the nineteenth century (see 3.4 Physics).
One aspect of information that Shannon’s definition explicitly does not cover is the actual content of the messages interpreted as propositions. So the statement “Jesus was Caesar” and “The moon is made of green cheese” may carry the same amount of information while their meaning is totally different. A large part of the effort in philosophy of information has been directed to the formulation of more semantic theories of information (Bar-Hillel & Carnap 1953; Floridi 2002, 2003, 2011). Although Shannon’s proposals at first were almost completely ignored by philosophers it has in the past decennia become apparent that their impact on philosophical issues is big. Dretske (1981) was one of the first to analyze the philosophical implications of Shannon’s theory, but the exact relation between various systems of logic and theory of information are still unclear (see 6.6 Logic and Semantic Information).
4.3 Solomonoff, Kolmogorov, Chaitin: Information as the Length of a Program
This problem of relating a set of statements to a set of observations and defining the corresponding probability was taken up by Carnap (1945, 1950). He distinguished two forms of probability: Probability\(_1\) or “degree of confirmation” \(P_1 (h ; e)\) is a logical relation between two sentences, a hypothesis h and a sentence e reporting a series of observations. Statements of this type are either analytical or contradictory. The second form, Probability\(_2\) or “relative frequency”, is the statistical concept. In the words of his student Solomonoff (1997):
Carnap’s model of probability started with a long sequence of symbols that was a description of the entire universe. Through his own formal linguistic analysis, he was able to assign a priori probabilities to any possible string of symbols that might represent the universe.
The method for assigning probabilities Carnap used, was not universal and depended heavily on the code systems used. A general theory of induction using Bayes’ rule can only be developed when we can assign a universal probability to “any possible string” of symbols. In a paper in 1960 Solomonoff (1960, 1964a,b) was the first to sketch an outline of a solution for this problem. He formulated the notion of what is now called a universal probability distribution: consider the set of all possible finite strings to be programs for a universal Turing machine U and define the probability of a string x of symbols in terms of the length of the shortest program p that outputs x on U.
This notion of Algorithmic Information Theory was invented independently somewhat later separately by Kolmogorov (1965) and Chaitin (1969). Levin (1974) developed a mathematical expression of the universal a priori probability as a universal (that is, maximal) lower semicomputable semimeasure M, and showed that the negative logarithm of \(M(x)\) coincides with the Kolmogorov complexity of x up to an additive logarithmic term. The actual definition of the complexity measure is:
Algorithmic Information Theory (a.k.a. Kolmogorov complexity theory) has developed into a rich field of research with a wide range of domains of applications many of which are philosophically relevant (Li & Vitányi 2019):
There are also down-sides:
Algorithmic Information Theory has gained rapid acceptance as a fundamental theory of information. The well-known introduction in Information Theory by Cover and Thomas (2006) states: “… we consider Kolmogorov complexity (i.e., AIT) to be more fundamental than Shannon entropy” (2006: 3).
The idea that algorithmic complexity theory is a foundation for a general theory of artificial intelligence (and theory of knowledge) has already been suggested by Solomonoff (1997) and Chaitin (1987). Several authors have defended that data compression is a general principle that governs human cognition (Chater & Vitányi 2003; Wolff 2006). Hutter (2005, 2007a,b) argues that Solomonoff’s formal and complete theory essentially solves the induction problem. Hutter (2007a) and Rathmanner & Hutter (2011) enumerate a plethora of classical philosophical and statistical problems around induction and claim that Solomonoff’s theory solves or avoids all these problems. Probably because of its technical nature, the theory has been largely ignored by the philosophical community. Yet, it stands out as one of the most fundamental contributions to information theory in the twentieth century and it is clearly relevant for a number of philosophical issues, such as the problem of induction.
5. Systematic Considerations
In a mathematical sense information is associated with measuring extensive properties of classes of systems with finite but unlimited dimensions (systems of particles, texts, codes, networks, graphs, games etc.). This suggests that a uniform treatment of various theories of information is possible. In the Handbook of Philosophy of Information three different forms of information are distinguished (Adriaans & van Benthem 2008b):
Because of recent development the connections between Information-B (Shannon) and Information-C (Kolmogorov) are reasonably well understood (Cover & Thomas 2006). The historical material presented in this article suggests that reflection on Information-A (logic, knowledge) is historically much more interwoven than was generally known up till now. The research program of logical positivism can with hindsight be characterized as the attempt to marry a possible worlds interpretation of logic with probabilistic reasoning (Carnap 1945, 1950; Popper 1934; for a recent approach see Hutter et al. 2013). Modern attempt to design a Bayesian epistemology (Bovens & Hartmann 2003) do not seem to be aware of the work done in the first half of the twentieth century. However, an attempt to unify Information-A and Information-B seems a viable exercise. Also the connection between thermodynamics and information theory have become much closer, amongst others, due to the work of Gell-Mann & Lloyd (2003) (see also: Bais and Farmer 2008). Verlinde (2011, 2017) even presented a reduction of gravity to information (see the entry on information processing and thermodynamic entropy).
5.1 Philosophy of Information as An Extension of Philosophy of Mathematics
With respect to the main definitions of the concept of information, like Shannon Information, Kolmogorov complexity, semantic information and quantum information, a unifying approach to a philosophy of information is possible, when we interpret it as an extension to the philosophy of mathematics. The answer to questions like “What is data?” and “What is information?” then evolves from one’s answer to the related questions like “What is a set?” and “What is a number?” With hindsight one can observe that many open problems in the philosophy of mathematics revolve around the notion of information.
If we look at the foundations of information and computation there are two notions that are crucial: the concept of a data set and the concept of an algorithm. Once we accept these notions as fundamental the rest of the theory data and computation unfolds quite naturally. One can “plug in” one’s favorite epistemological or metaphysical stance here, but this does not really affect foundational issues in the philosophy of computation and information. One might sustain a Formalist, Platonic or intuitionistic view of the mathematical universe (see entry on philosophy of mathematics) and still agree on the basic notion of what effective computation is. The theory of computing, because of its finitistic and constructivist nature, seems to live more or less on the common ground in which these theories overlap.
Information as a scientific concept emerges naturally in the context of our every day dealing with nature when we measure things. Examples are ordinary actions like measuring the size of an object with a stick, counting using our fingers, drawing a straight line using a piece of rope. These processes are the anchor points of abstract concepts like length, distance, number, straight line that form the building blocks of science. The fact that these concepts are rooted in our concrete experience of reality guarantees their applicability and usefulness. The earliest traces of information processing evolved around the notions of counting, administration and accountancy.
This example helps to understand the importance of context in the analysis of information. In itself a scratch on a stick may have no meaning at all, but as soon as we decide that such a scratch represents another object or event it becomes a meaningful symbol. When we manipulate it in such a context we process information. In principle a simple scratch can represent any event or object we like: symbols are conventional.
Symbols are the semantic anchors by which symbol manipulating systems are tied to the world. Observe that the meta-statement:
if true, specifies semantic information:
Symbol manipulation can take many forms and is not restricted to sequences. Many examples of different forms of information processing can be found in prehistoric times.
The historical transformation from sets to strings is important. It is a more sophisticated form of coding of information. Formally we can distinguish several levels of complexity of token combination:
Sequences of symbols code more information than multisets and multisets are more expressive than sets. Thus the emergence of writing itself can be seen as a quest to find the most expressive representation of administrative data. When measuring information in sequences of messages it is important to distinguish the aspects of repetition, order and grouping. The extensive aspects of information can be studied in terms of such structural operations (see entry on substructural logics). We can study sets of messages in terms of operators defined on sequences of symbols.
The structural aspects of sets, multisets and strings can be formulated in terms of these properties:
Sets may be interpreted as spaces in which objects can move freely. When the same objects are in each others vicinity they collapse in to one object. Multisets can be interpreted as spaces in which objects can move freely, with the constraint that the total number of objects stays constant. This is the standard notion of extensiveness: the total volume of a space stays constant, but the internal structure may differ. Sequences may be interpreted as spaces in which objects have a fixed location. In general a sequence contains more information than the derived multiset, which contains more information than the associated set.
The notion of a set in mathematics is considered to be fundamental. Any identifiable collection of discrete objects can be considered to be a set. The relation between theory of sets and the concept of information becomes clear when we analyze the basic statement:
Which reads the object e is an element of the set A. Observe that this statement, if true, represents a piece of semantic information. It is wellformed, meaningful and truthful. (see entry on semantic conceptions of information) The concept of information is already at play in the basic building blocks of mathematics.The philosophical question “What are sets?” the answer to the ti esti question, is determined implicitly by the Zermelo-Fraenkel axioms (see entry on set theory), the first of which is that of extensionality:
The idea that mathematical concepts are defined implicitly by a set of axioms was proposed by Hilbert but is not uncontroversial (see entry on the Frege-Hilbert controversy). The fact that the definition is implicit entails that we only have examples of what sets are without the possibility to formulate any positive predicate that defines them. Elements of a set are not necessarily physical, nor abstract, nor spatial or temporal, nor simple, nor real. The only prerequisite is the possibility to formulate clear judgments about membership. This implicit definition of the notion of a set is not unproblematic. We might define objects that at first glance seem to be proper sets, which after scrutiny appear to be internally inconsistent. This is the basis for:
The implicit definition of the concepts of sets, entails that the class is essentially open itself. There are mathematical definitions of objects of which it is unclear or highly controversial whether they define a set or not.
Modern philosophy of mathematics starts with the Frege-Russell theory of numbers (Frege 1879, 1892, Goodstein 1957, see entry on alternative axiomatic set theories) in terms of sets. If we accept the notion of a class of objects as valid and fundamental, together with the notion of a one-to-one correspondence between classes of objects, then we can define numbers as sets of equinumerous classes.
Any set of, say four, objects then becomes a representation of the number 4 and for any other set of objects we can establish membership to the equivalence class defining the number 4 by defining a one to one correspondence to our example set.
We can reconstruct large parts of the mathematical universe by selecting appropriate mathematical example objects to populate it, beginning with the assumption that there is a single unique empty set \(\emptyset\) which represents the number 0. This gives us the existence of a set with only one member \(\{\varnothing\}\) to represent the number 1 and repeating this construction, \(\{\varnothing,\{\varnothing\}\}\) for 2, the whole set of natural numbers \(\mathbb{N}\) emerges. Elementary arithmetic then is defined on the basis of Peano’s axioms:
The fragment of the mathematical universe that emerges is relatively uncontroversial and both Platonists and constructivists might agree on its basic merits. On the basis of Peano’s axioms we can define more complex functions like addition and multiplication which are closed on \(\mathbb{N}\) and the inverse functions, subtraction and division, which are not closed and lead to the set of whole numbers \(\mathbb{Z}\) and the rational numbers \(\mathbb{Q}\).
We can define the concept of information for a number n by means of an unspecified function \(I(n)\). We observe that addition and multiplication specify multisets: both are non-contractive and commutative and associative. Suppose we interpret the tensor operator \(\oplus\) as multiplication \(\times\). It is natural to define the semantics for \(I(m \times n)\) in terms of addition. If we get both messages m and n, the total amount of information in the combined messages is the sum of the amount of information in the individual messages. This leads to the following constraints:
Furthermore we want bigger numbers to contain more information than smaller ones, which gives a:
We also want to select a certain number a as our basic unit of measurement:
The following theorem is due to Rényi (1961):
We define:
For finite sets we can now specify the amount of information we get when we know a certain element of a set conditional to knowing the set as a whole.
The bigger the set, the harder the search is, the more information we get when we find what we are looking for. Conversely, without any further information the probability of selecting a certain element of S is \(p_S(x) = \frac{1}{|S|}\). The associated function is the so-called Hartley function:
The combination of these definitions gives a theorem that ties together the notions of conditional information and probability:
The information about an element x of a set S conditional to the set is equal to the log of the probability that we select this element x under uniform distribution, which is a measure of our ignorance if we know the set but not which element of the set is to be selected.
Using these results we define the conditional amount of information in a subset of a finite set as:
This is just an application of our basic definition of information: the cardinality of the class of subsets of A with size k is \({n \choose k}\).
The formal properties of the concept of probability are specified by the Kolmogorov Axioms of Probability:
Let \(P(E)\) be the probability P that some event E occurs. Let \((\Omega, F,P)\), with \(P(\Omega)=1\), be a probability space, with sample space \(\Omega\), event space F and probability measure P.
One of the consequences is monotonicity: if \(A \subseteq B\) implies \(P(A) \leq P(B)\). Note that this is the same notion of additivity as defined for the concept of information. At subatomic level the Kolmogorov Axiom of additivity loses its validity in favor of a more subtle notion (see section 5.3).
From a philosophical point of view the importance of this construction lies in the fact that it leads to an ontologically neutral concept of information based on a very limited robust base of axiomatic assumptions:
This shows how Shannon’s theory of information and Boltzmann’s notion of entropy are rooted in more fundamental mathematical concepts. The notions of a set of messages or a set of micro states are specializations of the more general mathematical concept of a set. The concept of information already exists on this more fundamental level. Although many open questions still remain, specifically in the context of the relation between information theory and physics, perspectives on a unified theory of information now look better than at the beginning of the twenty-first century.
The definition of the amount of information in a number in therms of logarithms allows us to classify other mathematical functions in terms of their capacity to process information. The Information Efficiency of a function is the difference between the amount of information in the input of a function and the amount of information in the output (Adriaans 2016 [OIR]). It allows us to measure how information flows through a set of functions. We use the shorthand \(f(\overline{x})\) for \(f(x_1,x_2,\dots,x_k)\):
In general deterministic information processing systems do not create new information. They only process it. The following fundamental theorem about the interaction between information and computation is due to Adriaans and Van Emde Boas (2011):
This is in line with both Shannon’s theory and Kolmogorov complexity. The outcome of a deterministic program is always the same, so the probability of the outcome is 1 which gives under Shannon’s theory, 0 bits of new information. Likewise for Kolmogorov complexity, the output of a program can never be more complex than the length of the program itself, plus a constant. This is analyzed in depth in Adriaans and Van Emde Boas (2011). In a deterministic world it is the case that if:
The essence of information is uncertainty and a message that occurs with probability “1” contains no information. The fact that it might take a long time to compute the number is irrelevant as long as the computation halts. Infinite computations are studied in the theory of Scott domains (Abramsky & Jung 1994).
Estimating the information efficiency of elementary functions is not trivial. The primitive recursive functions (see entry on recursive functions) have one information expanding operation, the increment operation, one information discarding operation, choosing, all the others are information neutral. The information efficiency of more complex operations is defined by a combination of counting and choosing. From an information efficiency point of view the elementary arithmetical functions are complex families of functions that describe computations with the same outcome, but with different computational histories.
Some arithmetical operations expand information, some have constant information and some discard information. During the execution of deterministic programs expansion of information may take place, but, if the program is effective, the descriptive complexity of the output is limited. The flow of information is determined by the succession of types of operations, and by the balance between the complexity of the operations and the number of variables.
We briefly discuss the information efficiency of the two basic recursive functions on two variables and their coding possibilities:
Addition Addition is associated with information storage in terms of sequences or strings of symbols. It is information discarding for natural numbers bigger than 1. We have \(\delta(a + b) \lt 0\) since \(\log (a + b) \lt \log a + \log b\). Still, addition has information preserving qualities. If we add numbers with different log units we can reconstruct the frequency of the units from the resulting number:
\[\begin{align} 232 & = 200 + 30 + 2 \\ & = (2 \times 10^2) + (3 \times 10^1) + (2 \times 10^0)\\ & = 100 + 100 + 10 + 10 + 10 + 1 + 1 \end{align} \]
Since the information in the building blocks, 100, 10 and 1, is given the number representation can still be reconstructed. This implies that natural numbers code in terms of addition of powers of k in principle two types of information: value and frequency. We can use this insight to code complex typed information in single natural numbers. Basically it allows us to code any natural numbers in a string of symbols of length \(\lceil \log_k n \rceil \), which specifies a quantitative measure for the amount of information in a number in terms of the length of its code. See section 3.3 for a historical analysis of the importance of the discovery of position systems for information theory.
Multiplication is by definition information conserving. We have: \(\delta(a \times b) = 0\), since \(\log (a \times b) = \log a + \log b\). Still multiplication does not preserve all information in its input: the order of the operation is lost. This is exactly what we want from an operator that characterizes an extensive measure: only the extensive qualities of the numbers are preserved. If we multiply two numbers \(3 \times 4\), then the result, 12, allows us to reconstruct the original computation, in so far as we can reduce all its components to their most elementary values: \(2 \times 2 \times 3 = 12\). This leads to the observation that some numbers act as information building blocks of other numbers, which gives us the concept of a prime number:
The concept of a prime number gives rise to the Fundamental Theorem of Arithmetic:
The Fundamental Theorem of Arithmetic can be seen as a theorem about conservation of information: for every natural number there is a set of natural numbers that contains exactly that same amount of information. The factors of a number form a so-called multiset: a set that may contain multiple copies of the same element: e.g., the number 12 defines the multiset \(\{2,2,3\}\) in which the number 2 occurs twice. This makes multisets a powerful device for coding information since it codes qualitative information (i.e., the numbers 2 and 3) as well as quantitative information (i.e., the fact that the number 2 occurs twice and the number 3 only once). This implies that natural numbers in terms of multiplication of primes also code two types of information: value and frequency. Again we can use this insight to code complex typed information in single natural numbers.
Position based number representations using addition of powers are straightforward and easy to handle and form the basis of most of our mathematical functions. This is not the case for coding systems based on multiplication. Many of the open questions in the philosophy of mathematics and information arise in the context of the concepts of the Fundamental Theorem of Arithmetic and Primes. We give a short overview:
There is an infinite number of observations we can make about the set \(\mathbb{N}\) that are not implied directly by the axioms, but involve a considerable amount of computation.
In a landmark paper in 1931 Kurt Gödel proved that any consistent formal system that contains elementary arithmetic is fundamentally incomplete in the sense that it contains true statements that cannot be proved within the system. In a philosophical context this implies that the semantics of a formal system rich enough to contain elementary mathematics cannot be defined in terms of mathematical functions within the system, i.e., there are statements that contain semantic information about the system in the sense of being well-formed, meaningful and truthful without being provable.
Central is the concept of a Recursive Function. (see entry on recursive functions). Such functions are defined on numbers. Gödel’s notion of a recursive function is closest to what we would associate with computation in every day life. Basically they are elementary arithmetical functions operating on natural numbers like addition, subtraction, multiplication and division and all other functions that can be defined on top of these.
We give the basic structure of the proof. Suppose F is a formal system, with the following components:
Assume furthermore that F is consistent, i.e., it will never derive false statements form true ones. In his proof Gödel used the coding possibilities of multiplication to construct an image of the system (see the discussion of Gödel numbering from the entry on Gödel’s Incompleteness Theorems). According to the fundamental theorem of arithmetic any number can be uniquely factored in to its primes. This defines a one-to-one relationship between multisets of numbers and numbers: the number 12 can be constructed on the basis of the multiset \(\{2,2,3\}\) as \(12=2 \times 2\times 3\) and vice versa. This allows us to code any sequence of symbols as a specific individual number in the following way:
On the basis of this we can code any sequence of symbols as a so-called Gödel number, e.g., the number:
codes the multiset \(\{2,3,3,5,5,7\}\), which represents the string “abba” under the assumption \(a=1\), \(b=2\). With this observation conditions close to those that lead to the paradox of Russel are satisfied: elementary arithmetic itself is rich enough to express: Universality, Negation, and Self-reference.
Since arithmetic is consistent this does not lead to paradoxes, but to incompleteness. By a construction related to the liars paradox Gödel proved that such a system must contain statements that are true but not provable: there are true sentences of the form “I am not provable”.
In the context of philosophy of information the incompleteness of mathematics is a direct consequence of the rich possibilities of the natural numbers to code information. In principle any deterministic formal system can be represented in terms of elementary arithmetical functions. Consequently, If such a system itself contains arithmetic as a sub system, it contains a infinite chain of endomorphisms (i.e., images of itself). Such a system is capable of reasoning about its own functions and proofs but since it is consistent (and thus the construction of paradoxes is not possible within the system) it is by necessity incomplete.
5.2 Information and Symbolic Computation
Recursive functions are abstract relations defined on natural numbers. In principle they can be defined without any reference to space and time. Such functions must be distinguished from the operations that we use to compute them. These operations mainly depend on the type of symbolic representations that we choose for them. We can represent the number seven as unary number \(|||||||\), binary number 111, Roman number VII, or Arabic number 7 and depending on our choice other types of sequential symbol manipulation can be used to compute the addition two plus five is seven, which can be represented as:
\[ \begin{align} || + ||||| & = ||||||| \\ 10 + 101 & = 111 \\ \textrm{II} + \textrm{V} & = \textrm{VII}\\ 2 + 5 &= 7 \\ \end{align} \]
Consequently we can read these four sentences as four statements of the same mathematical truth, or as statements specifying the results of four different operations.
This leads to the following tentative definition:
In nature there are many other ways to perform such computations. One could use an abacus, study chemical processes or simply manipulate sequences of pebbles on a beach. The fact that the objects we manipulate are discrete together with the observation that the dataset is self-referential implies that the data domain is in principle Dedekind Infinite:
Since the data elements are discrete and finite the data domain will be countable infinite and therefore isomorphic to the set of natural numbers.
For infinite countable sets the notion of information is defined as follows:
Note that the correspondence f is specified explicitly. As soon as such an index function is defined for a class of objects in the real world, the manipulation of these objects can be interpreted a form of computing.
Once we choose a finite set of symbols and our operational rules the system starts to produce statements about the world.
We can study symbol manipulation in general on an abstract level, without any semantic implications. Such a theory was published by Alan Turing (1912–1954). Turing developed a general theory of computing focusing on the actual operations on symbols a mathematician performs (Turing 1936). For him a computer was an abstraction of a real mathematician sitting behind a desk, receiving problems written down on an in-tray (the inut), solving them according to fixed rules (the process) and leaving them to be picked up in an out-tray (the output).
Turing first formulated the notion of a general theory of computing along these lines. He proposed abstract machines that operate on infinite tapes with three symbols: blank \((b)\), zero \((0)\) and one \((1)\). Consequently the data domain for Turing machines is the set of relevant tape configurations, which can be associated with the set of binary strings, consisting of zero’s and one’s. The machines can read and write symbols on the tape and they have a transition function that determines their actions under various conditions. On an abstract level Turing machines operate like functions.
There is an infinite number of Turing machines. Turing discovered that there are so-called universal Turing machines \(U_j\) that can emulate any other Turing machine \(T_i\).
The self-delimiting code is necessary because the input for \(U_j\) is coded as one string \(\overline{T_i}x\). The universal machine \(U_j\) separates the input string \(\overline{T_i}x\) in to its two constituent parts: the description of the machine \(\overline{T_i}\) and the input for this machine x.
The self-referential nature of general computational systems allows us to construct machines that emulate other machines. This suggests the possible existence of a ‘super machine’ that emulates all possible computations on all possible machines and predicts their outcome. Using a technique called diagonalization, where one analyzes an enumeration of all possible machines running on descriptions of all possible machines, Turing proved that such a machine can not exist. More formally:
This implies that for a certain universal machine \(U_i\) the set of inputs on which it stops in finite time, is uncomputable. In recent years the notion of infinite computations on Turing machines has also been studied (Hamkins and Lewis 2000.) Not every machine will stop on every input, but in some case infinite computations compute useful output (consider the infinite expansion of the number pi).
The existence of universal Turing machines indicates that the class embodies a notion of universal computing: any computation that can be performed on a specific Turing machine can also be performed on any other universal Turing machine. This is the mathematical foundation of the concept of a general programmable computer. These observations have bearing on the theory of information: certain measures of information, like Kolmogorov complexity, are defined, but not computable.
The proof of the existence uncomputable functions in the class of Turing machines is similar to the incompleteness result of Gödel for elementary arithmetic. Since Turing machines were defined to study the notion of computation and thus contain elementary arithmetic. The class of Turing machines is in itself rich enough to express: Universality, Negation and Self-reference. Consequently Turing machines can model universal negative statements about themselves. Turing’s uncomputability proof is also motivated by the liars paradox, and the notion of a machine that stops on a certain input is similar to the notion of a proof that exists for a certain statement. At the same time Turing machines satisfy the conditions of Gödel’s theorem: they can be modeled as a formal system F that contains elementary Peano arithmetic.
This insight can be generalized:
The philosophical implications of this observation are strong and rich, not only for the theory of computing but also for our understanding of the concept of information.
There is an intricate ration between the notion of universal computing and that of information. Precisely the fact that Turing Systems are universal allows us to say that they process information, because their universality entails invariance:
Proof: The proof is simple and relevant for philosophy of information. Let \(l(x)\) be the length of the string of symbols x. Suppose we have two different universal Turing machines \(U_j\) and \(U_k\). Since they are universal they can both emulate the computation \(T_i(x)\) of Turing machine \(T_i\) on input x:
Here \(l(\overline{T}_i^j)\) is the length of the code for \(T_i\) on \(U_j\) and \(l(\overline{T}_i^k)\) is the length of the code for \(T_i\) on \(U_k\). Suppose \(l(\overline{T}_i^jx) \ll l(\overline{T}_i^kx)\), i.e., the code for \(T_i\) on \(U_k\) is much less efficient that on \(U_j\). Observe that the code for \(U_j\) has constant length, i.e., \(l(\overline{U}_j^k)=c\). Since \(U_k\) is universal we can compute:
The length of the input for this computation is:
Consequently the specification of the input for the computation \(T_i(x)\) on the universal machine \(U_k\) never needs to longer than a constant. \(\Box\)
This proof forms the basis of the theory of Kolmogorov complexity and is originally due to Solomonoff (1964a,b) and discovered independently by Kolmogorov (1965) and Chaitin (1969). Note that this notion of invariance can be generalized over the class of Turing Complete Systems:
How strong this result is becomes clear when we analyze the class of Turing complete systems in more detail. In the first half of the twentieth century three fundamentally different proposals for a general theory of computation were formulated: Gödel’s recursive functions ( Gödel 1931), Turing’s automata (Turing 1937) and Church’s Lambda Calculus (Church 1936). Each of these proposals in its own way clarifies aspects of the notion of computing. Later much more examples followed. The class of Turing equivalent systems is diverse. Apart from obvious candidates like all general purpose programming languages (C, Fortran, Prolog, etc.) it also contains some unexpected elements like various games (e.g., Magic: The Gathering [Churchill 2012 OIR]). The table below gives an overview of some conceptually interesting systems:
We make the following:
A direct consequence of this observation is:
It is not possible to derive any necessary qualities of computational systems and data domains beyond the fact that they are general mathematical operations and structures. Data domains on which Turing equivalent systems are defined are not necessarily physical, nor temporal, nor spatial, not binary or digital. At any moment a new member for the class can be introduced. We know that there are computational systems that are weaker than the class of Turing machines (e.g., regular languages). We cannot rule out the possibility that one-day we come across a system that is stronger. The thesis that such a system does not exist is known as the Church-Turing thesis (see entry on Church-Turing thesis):
We give an overview of the arguments for and against the thesis:
Arguments in favor of the thesis: The theory of Turing machines seems to be the most general theory possible that we can formulate since it is based on a very limited set of assumptions about what computing is. The fact that it is universal also points in the direction of its generality. It is difficult to conceive in what sense a more powerful system could be “more” universal. Even if we could think of such a more powerful system, the in- and output for such a system would have to be finite and discrete and the computation time also finite. So, in the end, any computation would have the form of a finite function between finite data sets, and, in principle, all such relations can be modeled on Turing machines. The fact that all known systems of computation we have defined so far have the same power also corroborates the thesis.
Arguments against the thesis: The thesis is, in its present form, unprovable. The class of Turing Complete systems is open. It is defined on the basis of the existence of equivalence relations between known systems. In this sense it does not define the notion of computing intrinsically. It doesn’t not provide us with a philosophical theory that defines what computing exactly is. Consequently it does not allow us to exclude any system from the class a priori. At any time a proposal for a notion of computation might emerge that is fundamentally stronger. What is more, nature provides us with stronger notions of computing in the form of quantum computing. Quantum bits are really a generalization of the normal concept of bits that is associated with symbol manipulation, although in the end quantum computing does not seem to necessitate us to redefine the notion of computing so far. We can never rule out that research in physics, biology or chemistry will define systems that will force us to do so. Indeed various authors have suggested such systems but there is currently no consensus on convincing candidates (Davis 2006). Dershowitz and Gurevich (2008) claim to have vindicated the hypothesis, but this result is not generally accepted (see the discussion on “Computability – What would it mean to disprove the Church-Turing thesis”, in the Other Internet Resources [OIR]).
Being Turing complete seems to be quite a natural condition for a (formal) system. Any system that is sufficiently rich to represent the natural numbers and elementary arithmetical operations is Turing complete. What is needed is a finite set of operations defined on a set of discrete finite data elements that is rich enough to make the system self-referential: its operations can be described by its data elements. This explains, in part, why we can use mathematics to describe our world. The abstract notion of computation defined as functions on numbers in the abstract world mathematics and the concrete notion of computing by manipulation objects in our every day world around us coincide. The concepts of information end computation implied by the Recursive Function Paradigm and the Symbol Manipulation Paradigm are the same.
5.3 Quantum Information and Beyond
We have a reasonable understanding of the concept of classical computing, but the implications of quantum physics for computing and information may determine the philosophical research agenda for decades to come if not longer. Still it is already clear that the research has repercussions for traditional philosophical positions: the Laplacian view (Laplace 1814 [1902]) that the universe is essentially deterministic seems to be falsified by empirical observations. Quantum random generators are commercially available (see Wikipedia entry on Hardware random number generator [OIR]) and quantum fluctuations do affect neurological, biological and physical processes at a macroscopic scale (Albrecht & Phillips 2014). Our universe is effectively a process that generates information permanently. Classical deterministic computing seems to be too weak a concept to understand its structure.
Standard computing on a macroscopic scale can be defined as local, sequential, manipulation of discrete objects according to deterministic rules. Is has a natural interpretation in operations on the set of natural numbers N and a natural measurement function in the log operation \(\log: \mathbb{N} \rightarrow \mathbb{R}\) associating a real number to every natural number. The definition gives us an adequate information measure for countable infinite sets, including number classes like the integers \(\mathbb{Z}\), closed under subtraction, and the rational numbers \(\mathbb{Q}\), closed under division.
The operation of multiplication with the associated logarithmic function characterizes our intuitions about additivity of the concept of information exactly. It leads to a natural bijection between the set of natural numbers \(\mathbb{N}\) and the set of multisets of numbers (i.e., sets of prime factors). The notion of a multiset is associated with the properties of commutativity and associativity. This program can be extended to other classes of numbers when we study division algebras in higher dimensions. The following table gives an overview of some relevant number classes together with the properties of the operation of multiplication for these classes:
The table is ordered in terms of increasing generality. Starting from the set of natural numbers \(\mathbb{N}\), various extensions are possible taking into account closure under subtraction, \(\mathbb{Z}\), and division, \(\mathbb{Q}\). This are the number classes for which we have adequate finite symbolic representations on a macroscopic scale. For elements of the real numbers \(\mathbb{R}\) such a representations are not available. The real numbers \(\mathbb{R}\) introduce the aspect of manipulation of infinite amounts of information in one operation.
More complex division algebras can be defined when we introduce imaginary numbers as negative squares \(i^2 = -1\). We can now define complex numbers: \(a + bi\), where a is the real part and \(bi\) the imaginary part. Complex numbers can be interpreted as vectors in a two dimensional plane. Consequently they lack the notion of a strict linear order between symbols. Addition is quite straightforward:
Multiplication follows the normal distribution rule but the result is less intuitive since it involves a negative term generated by \(i^2\):
In this context multiplication ceases to be a purely extensive operation:
More complicated numbers systems with generalizations of this type of multiplication in 4 and 8 dimensions can be defined. Kervaire (1958) and Bott & Milnor (1958) independently proved that the only four division algebras built on the reals are \(\mathbb{R}\), \(\mathbb{C}\), \(\mathbb{H}\) and \(\mathbb{O}\), so the table gives a comprehensive view of all possible algebra’s that define a notion of extensiveness. For each of the number classes in the table a separate theory of information measurement, based on the properties of multiplication, can be developed. For the countable classes \(\mathbb{N}\), \(\mathbb{Z}\) and \(\mathbb{Q}\) these theories ware equivalent to the standard concept of information implied by the notion of Turing equivalence. Up to the real numbers these theories satisfy our intuitive notions of extensiveness of information. For complex numbers the notion of information efficiency of multiplication is destroyed. The quaternions lack the property of commutativity and the octonions that of associativity. These models are not just abstract constructions since the algebras play an important role in our descriptions of nature:
We briefly discuss the application of vector spaces in quantum physics. Classical information is measured in bits. Implementation of bits in nature involves macroscopic physical systems with at least two different stable states and a low energy reversible transition process (i.e., switches, relays, transistors). The most fundamental way to store information in nature on an atomic level involves qubits. The qubit is described by a state vector in a two-level quantum-mechanical system, which is formally equivalent to a two-dimensional vector space over the complex numbers (Von Neumann 1932; Nielsen & Chuang 2000). Quantum algorithms have, in some cases, a fundamentally lower complexity (e.g., Shor’s algorithm for factorization of integers (Shor 1997)).
Under this mathematical model our intuitions about computing as local, sequential, manipulation of discrete objects according to deterministic rules evolve in to a much richer paradigm:
From this analysis it is clear that the description of our universe at very small (and very large) scales involves mathematical models that are alien to our experience of reality in everyday life. The properties that allow us to understand the world (the existence of stable, discrete objects that preserve their identity in space and time) seem to be emergent aspects of a much more complex reality that is incomprehensible to us outside its mathematical formulation. Yet, at a macroscopic level, the universe facilitates elementary processes, like counting, measuring lengths, and the manipulation of symbols, that allow us to develop a consistent hierarchy of mathematical models some of which seems to describe the deeper underlying structure of reality.
In a sense the same mathematical properties that drove the development of elementary accounting systems in Mesopotamia four thousand years ago, still help us to penetrate in to the world of subatomic structures. In the past decennia information seems to have become a vital concept in physics. Seth Lloyd and others (Zuse 1969; Wheeler 1990; Schmidhuber 1997b; Wolfram 2002; Hutter 2010) have analyzed computational models of various physical systems. The notion of information seems to play a major role in the analysis of black holes (Lloyd & Ng 2004; Bekenstein 1994 [OIR]). Erik Verlinde (2011, 2017) has proposed a theory in which gravity is analyzed in terms of information. For the moment these models seem to be purely descriptive without any possibility of empirical verification.
6. Anomalies, Paradoxes, and Problems
Some of the fundamental issues in philosophy of Information are closely related to existing philosophical problems, others seem to be new. In this paragraph we discuss a number of observations that may determine the future research agenda. Some relevant questions are:
Since Frege most mathematicians seem to believe that the answer to the first question is positive (Frege 1879, 1892). The descriptions “The morning star” and “The evening star” are associated with procedures to identify the planet Venus, but they do not give access to all information about the object itself. If this were so the discovery that the evening star is in fact also the morning star would be uninformative. If we want to maintain this position we get into conflict, because in terms of information theory the answer to the second question is negative (see section 5.1.7). Yet this observation is highly counter intuitive, because it implies that we never can construct new information on the basis of deterministic computation, which leads to the third question. These issues cluster around one of the fundamental open problems of Philosophy of Information:
Why would we compute at all, if according to our known information measures, deterministic computing does not produce new information? The question could be rephrased as: should we use Kolmogorov or Levin complexity (Levin 1973, 1974, 1984) as our basic information measure? In fact both choices lead to relevant, but fundamentally different, theories of information. When using the Levin measure, computing generates information and the answer to the three questions above is a “yes”, when using Kolmogorov this is not the case. The questions are related to many problems both in mathematics and computer science. Related issues like approximation, computability and partial information are also studied in the context of Scott domains (Abramsky & Jung 1994). Below we discuss some relevant observations.
6.1 The Paradox of Systematic Search
The essence of information is the fact that it reduces uncertainty. This observation leads to problems in opaque contexts, for instance, when we search an object. This is illustrated by Meno’s paradox (see entry on epistemic paradoxes):
And how will you enquire, Socrates, into that which you do not know? What will you put forth as the subject of enquiry? And if you find what you want, how will you ever know that this is the thing which you did not know? (Plato, Meno, 80d1-4)
The paradox is related to other open problems in computer science and philosophy. Suppose that John is looking for a unicorn. It is very unlikely that unicorns exist, so, in terms of Shannon’s theory, John gets a lot of information if he finds one. Yet from a descriptive Kolmogorov point of view, John does not get new information, since he already knows what unicorns are. The related paradox of systematic search might be formulated as follows:
Any information that can be found by means of systematic search has no value, since we are certain to find it, given enough time. Consequently information only has value as long as we are uncertain about its existence, but then, since we already know what we are looking for, we get no new information when we find out that it exists.
Example: Goldbach conjectured in 1742 that every even number bigger than 2 could be written as the sum of two primes. Until today this conjecture remains unproved. Consider the term “The first number that violates Goldbach’s conjecture”. It does not give us all information about the number, since the number might not exist. The prefix “the first” ensures the description, if it exists, is unique, and it gives us an algorithm to find the number. It is a partial uniquely identifying description. This algorithm is only effective if the number really exists, otherwise it will run forever. If we find the number this will be great news, but from the perspective of descriptive complexity the number itself will be totally uninteresting, since we already know the relevant properties to find it. Observe that, even if we have a number n that is a counter example to Goldbach’s conjecture, it might be difficult to verify this: we might have to check almost all primes \( \leq n\). This can be done effectively (we will always get a result) but not, as far as we know, efficiently (it might take “close” to n different computations) .
A possible solution is to specify the constraint that it is illegal to measure the information content of an object in terms of partial descriptions, but this would destroy our theory of descriptive complexity. Note that the complexity of an object is the length of the shortest program that produces an object on a universal Turing machine. In this sense the phrase “the first number that violates Goldbach’s conjecture” is a perfect description of a program, and it adequately measures the descriptive complexity of such a number. The short description reflects the fact that the number, if it exists, is very special, and thus it has a high possibility to occur in some mathematical context.
There are relations which well-studied philosophical problems like the Anselm’s ontological argument for God’s existence and the Kantian counter claim that existence is not a predicate. In order to avoid similar problems Russell proposed to interpret unique descriptions existentially (Russell 1905): A sentence like “The king of France is bald” would have the following logical structure:
This interpretation does not help us to analyze decision problems that deal with existence. Suppose the predicate L is true of x if I’m looking for x, then the logical structure of the phrase “I’m looking for the king of France” would be:
i.e., if the king of France does not exist it cannot be true that I am looking for him, which is unsatisfactory. Kripke (1971) criticized Russell’s solution and proposed his so-called causal theory of reference in which a name get its reference by an initial act of “baptism”. It then becomes a rigid designator (see entry on rigid designators) that can be followed back to that original act via causal chains. In this way ad hoc descriptions like “John was the fourth person to step out of the elevator this morning” can establish a semantics for a name.
In the context of mathematics and information theory the corresponding concept is that of names, constructive predicates and ad-hoc predicates of numbers. For any number there will be in principle an infinite number of true statements about that number. Since elementary arithmetic is incomplete there will be statements about numbers that are true but unprovable. In the limit a vanishing fragment of numbers will have true predicates that actually compress their description. Consider the following statements:
The first statement simply specifies a name for a number. The second statement gives a partial description that is constructive, information compressing and unique. The 1000th Fibonacci number has 209 digits, so the description “the 1000th Fibonacci number” is much more efficient than the actual name of the number. Moreover, we have an algorithm to construct the number. This might not be that case for the description in the third statement. We do not know whether the first number that violates the Goldbach conjecture exists, but if it does, the description might well be ad hoc and thus gives us no clue to construct the number. This rise to the conjecture that there are data compressing effective ad hoc descriptions:
The conjecture is a more general variant of the so-called P vs. NP thesis (see section 6.3). If one replaces the term “effective” with the term “efficient” one gets a formulation of the \(\textrm{P} \neq \textrm{NP}\) thesis.
6.2 Effective Search in Finite Sets
When we restrict ourselves to effective search in finite sets, the problem of partial descriptions, and construction versus search remain. It seems natural to assume that when one has a definition of a set of numbers, then one also has all the information about the members of the set and about its subsets, but this is not true. In general the computation of the amount of information in a set of numbers is a highly non-trivial issue. We give some results:
Proof: Consider the set S of all natural numbers smaller than n. The descriptive complexity of this set in bits is \( \log_2 n + c\). Now construct A by selecting half of the elements of S randomly. Observe that:
We have:
The conditional descriptive complexity of this set will be: \(I(A\mid S) \approx n + c \gg \log n + c\). \(\Box\)
A direct consequence is that we can lose information when we merge two sets. An even stronger result is:
Proof: Consider the set S of natural numbers smaller then \(2^n\). The cardinality of S is \(2^n\). The descriptive complexity of this set is \(\log n + c\) bits, but for half of the elements of S we need n bits to describe them. \(\Box\)
In this case the description of the set itself is highly compressible, but it still contains non-compressible elements. When we merge or split sets of numbers, or add or remove elements, the effects on the amount of information are in general hard to predict and might even be uncomputable:
Proof: Immediate consequence of the lemmas above. \(\Box\)
This shows how the notion of information pervades our everyday life. When John has two apples in his pocket it seems that he can do whatever he wants with them, but, in fact, as soon as he chooses one of the two, he has created (new) information. The consequences for search problems are clear: we can always effectively perform bounded search on the elements and the set of subsets of a set. Consequently when we search for such a set of subsets by means of partial descriptions then the result generates (new) information. This analysis prima facie appears to force us to accept that in mathematics there are simple descriptions that allow us to identify complex objects by means of systematic search. When we look for the object we have only little information about it, when we finally find it our information increases to the set of full facts about the object searched. This is in conflict with our current theories of information (Shannon and Kolmogorov): any description that allows us to identify an object effectively by deterministic search contains all relevant information about the object. The time complexity of the search process then is irrelevant.
6.3 The P versus NP Problem, Descriptive Complexity Versus Time Complexity
In the past decennia mathematicians have been pondering about a related question: suppose it would be easy to check whether I have found what I’m looking for, how hard can it be to find such an object? In mathematics and computer science there seems to be a considerable class of decision problems that cannot be solved constructively in polynomial time, \(t(x)=x^c\), where c is a constant and x is the length of the input), but only through systematic search of a large part of the solution space, which might take exponential time, \(t(x)=c^x\). This difference roughly coincides with the separation of problems that are computationally feasible from those that are not.
The issue of the existence of such problems has been framed as the possible equivalence of the class P of decision problems that can be solved in time polynomial to the input to the class NP of problems for which the solution can be checked in time polynomial to the input. (Garey & Johnson 1979; see also Cook 2000 [OIR] for a good introduction.)
This is an example of a so-called decision problem. The answer is a simple “yes” or “no”, but it might be hard to find the answer. Observe that the formulation of the question conditional to S has descriptive complexity \(\log k + c\), whereas most random subsets of S have a conditional descriptive complexity of \(|S|\). So any subset \(S^{\prime}\) that adds up to k might have a descriptive complexity that is bigger then the formulation of the search problem. In this sense search seems to generate information. The problem is that if such a set exists the search process is bounded, and thus effective, which means that the phrase “the first subset of S that adds up to k” is an adequate description. If \(\textrm{P} = \textrm{NP}\) then the Kolmogorov complexity and the Levin complexity of the set \(S^{\prime}\) we find roughly coincide, if \(P \neq \textit{NP}\) then in some cases \(Kt(S^{\prime}) \gg K(S^{\prime})\). Both positions, the theory that search generates new information and the theory that it does not, are counterintuitive from different perspectives.
The P vs. NP problem, that appears to be very hard, has been a rich source of research in computer science and mathematics although relatively little has been published on its philosophical relevance. That a solution might have profound philosophical impact is illustrated by a quote from Scott Aaronson:
If P = NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss…. (Aaronson 2006 – in the Other Internet Resources)
In fact, if \(\textrm{P}=\textrm{NP}\) then every object that has a description that is not too large and easy to check is also easy to find.
6.4 Model Selection and Data Compression
In current scientific methodology the sequential aspects of the scientific process are formalized in terms of the empirical cycle, which according to de Groot (1969) has the following stages:
In the context of information theory the set of observations will be a data set and we can construct models by observing regularities in this data set. Science aims at the construction of true models of our reality. It is in this sense a semantical venture. In the 21-st century the process of theory formation and testing will for the largest part be done automatically by computers working on large databases with observations. Turing award winner Jim Grey framed the emerging discipline of e-science as the fourth data-driven paradigm of science. The others are empirical, theoretical and computational. As such the process of automatic theory construction on the basis of data is part of the methodology of science and consequently of philosophy of information (Adriaans & Zantinge 1996; Bell, Hey, & Szalay 2009; Hey, Tansley, and Tolle 2009). Many well-known learning algorithms, like decision tree induction, support vector machines, normalized information distance and neural networks, use entropy based information measures to extract meaningful and useful models out of large data bases. The very name of the discipline Knowledge Discovery in Databases (KDD) is witness to the ambition of the Big Data research program. We quote:
At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction. (Fayyad, Piatetsky-Shapiro, & Smyth 1996: 37)
Much of the current research focuses on the issue of selecting an optimal computational model for a data set. The theory of Kolmogorov complexity is an interesting methodological foundation to study learning and theory construction as a form of data compression. The intuition is that the shortest theory that still explains the data is also the best model for generalization of the observations. A crucial distinction in this context is the one between one- and two-part code optimization:
A possible approach is suggested by an interpretation of Bayes’ law. If we combine Shannon’s notion of an optimal code with Bayes’ law, we get a rough theory about optimal model selection. Let \(\mathcal{H}\) be a set of hypotheses and let x be a data set. Using Bayes’ law, the optimal computational model under this distribution would be:
This is equivalent to optimizing:
Here \(-\log P(M)\) can be interpreted as the length of the optimal model code in Shannon’s sense and \(- \log P(x\mid M)\) as the length of the optimal data-to-model code; i.e., the data interpreted with help of the model. This insight is canonized in the so-called:
The MDL principle is often referred to as a modern version of Ockham’s razor (see entry on William of Ockham), although in its original form Ockham’s razor is an ontological principle and has little to do with data compression. In many cases MDL is a valid heuristic tool and the mathematical properties of the theory have been studied extensively (Grünwald 2007). Still MDL, Ockham’s razor and two-part code optimization have been the subject of considerable debate in the past decennia (e.g., Domingos 1998; McAllister 2003).
The philosophical implications of the work initiated by Solomonoff, Kolmogorov and Chaitin in the sixties of the 20-th century are fundamental and diverse. The universal distribution m proposed by Solomonoff, for instance, codifies all possible mathematical knowledge and when updated on the basis of empirical observations would in principle converge to an optimal scientific model of our world. In this sense the choice of a universal Turing machine as basis for our theory of information measurement has philosophical importance, specifically for methodology of science. A choice for a universal Turing machine can be seen as a choice of a set of bias for our methodology. There are roughly two schools:
Both approaches have their value. For rigid mathematical proofs the poor machine approach is often best. For practical applications on finite data sets the rich model strategy often gets much better results, since a poor machine would have to “re-invent the wheel” every time it compresses a data set. This leads to the conclusion that Kolmogorov complexity inherently contains a theory about scientific bias and as such implies a methodology in which the class of admissible universal models should be explicitly formulated and motivated a priori. In the past decennia there have been a number of proposals to define a formal unit of measurement of the amount of structural (or model-) information in a data set.
Three intuitions dominate the research. A string is “interesting” when …
Such models penalize both maximal entropy and low information content. The exact relationship between these intuitions is unclear. The problem of meaningful information has been researched extensively in the past years, but the ambition to formulate a universal method for model selection based on compression techniques seems to be misguided:
This appears to be the case even if we restrict ourselves to weaker computational models like total functions, but more research is necessary. There seems to be no a priori mathematical justification for the approach, although two-part code optimization continues to be a valid approach in an empirical setting of data sets that have been created on the basis of repeated observations. Phenomena that might be related to a theory of structural information and that currently are ill-understood are: phase transitions in the hardness of satisfiability problems related to their complexity (Simon & Dubois 1989; Crawford & Auton 1993) and phase transitions in the expressiveness of Turing machines related to their complexity (Crutchfield & Young 1989, 1990; Langton 1990; Dufort & Lumsden 1994).
6.5 Determinism and Thermodynamics
Many basic concepts of information theory were developed in the nineteenth century in the context of the emerging science of thermodynamics. There is a reasonable understanding of the relationship between Kolmogorov Complexity and Shannon information (Li & Vitányi 2008; Grünwald & Vitányi 2008; Cover & Thomas 2006), but the unification between the notion of entropy in thermodynamics and Shannon-Kolmogorov information is very incomplete apart from some very ad hoc insights (Harremoës & Topsøe 2008; Bais & Farmer 2008). Fredkin and Toffoli (1982) have proposed so-called billiard ball computers to study reversible systems in thermodynamics (Durand-Lose 2002) (see the entry on information processing and thermodynamic entropy). Possible theoretical models could with high probability be corroborated with feasible experiments (e.g., Joule’s adiabatic expansion, see Adriaans 2008).
Questions that emerge are:
These problems seem to be hard because 150 years of research in thermodynamics still leaves us with a lot of conceptual unclarities in the heart of the theory of thermodynamics itself (see entry on thermodynamic asymmetry in time).
Real numbers are not accessible to us in finite computational processes yet they do play a role in our analysis of thermodynamic processes. The most elegant models of physical systems are based on functions in continuous spaces. In such models almost all points in space carry an infinite amount of information. Yet, the cornerstone of thermodynamics is that a finite amount of space has finite entropy. There is, on the basis of the theory of quantum information, no fundamental reason to assume that the expressiveness of real numbers is never used in nature itself on this level. This problem is related to questions studied in philosophy of mathematics (an intuitionistic versus a more platonic view). The issue is central in some of the more philosophical discussions on the nature of computation and information (Putnam 1988; Searle 1990). The problem is also related to the notion of phase transitions in the description of nature (e.g., thermodynamics versus statistical mechanics) and to the idea of levels of abstraction (Floridi 2002).
In the past decade some progress has been made in the analysis of these questions. A basic insight is that the interaction between time and computational processes can be understood at an abstract mathematical level, without the burden of some intended physical application (Adriaans & van Emde Boas 2011). Central is the insight that deterministic programs do not generate new information. Consequently deterministic computational models of physical systems can never give an account of the growth of information or entropy in nature:
A statistical reduction of thermodynamics to a deterministic theory like Newtonian physics leads to a notion of entropy that is fundamentally different from the information processed by deterministic computers. From this perspective the mathematical models of thermodynamics, which are basically differential equations on spaces of real numbers, seem to operate on a level that is not expressive enough. More advanced mathematical models, taking in to account quantum effects, might resolve some of the conceptual difficulties. At a subatomic level nature seems to be inherently probabilistic. If probabilistic quantum effects play a role in the behavior of real billiard balls, then the debate whether entropy increases in an abstract gas, made out of ideal balls, seems a bit academic. There is reason to assume that stochastic phenomena at quantum level are a source of probability at a macroscopic scale (Albrecht & Phillips 2014). From this perspective the universe is a constant source of, literally, astronomical amounts of information at any scale.
6.6 Logic and Semantic Information
Logical and computational approaches to the understanding of information both have their roots in the “linguistic turn” that characterized the philosophical research in the beginning of the twentieth century and the elementary research questions originate from the work of Frege (1879, 1892, see the entry on logic and information). The ambition to quantify information in sets of true sentences, as apparent in the work of researchers like Popper, Carnap, Solomonoff, Kolmogorov, Chaitin, Rissanen, Koppel, Schmidthuber, Li, Vitányi and Hutter is an inherently semantic research program. In fact, Shannon’s theory of information is the only modern approach that explicitly claims to be non-semantic. More recent quantitative information measures like Kolmogorov complexity (with its ambition to codify all scientific knowledge in terms of a universal distribution) and quantum information (with its concept of observation of physical systems) inherently assume a semantic component. At the same time it is possible to develop quantitative versions of semantic theories (see entry on semantic conceptions of information).
The central intuition of algorithmic complexity theory that an intension or meaning of an object can be a computation, was originally formulated by Frege (1879, 1892). The expressions “1 + 4” and “2 + 3” have the same extension (Bedeutung) “5”, but a different intension (Sinn). In this sense one mathematical object can have an infinity of different meanings. There are opaque contexts in which such a distinction is necessary. Consider the sentence “John knows that \(\log_2 2^2 = 2\)”. Clearly the fact that \(\log_2 2^2\) represents a specific computation is relevant here. The sentence “John knows that \(2 = 2\)” seems to have a different meaning.
Dunn (2001, 2008) has pointed out that the analysis of information in logic is intricately related to the notions of intension and extension. The distinction between intension and extension is already anticipated in the Port Royal Logic (1662) and the writings of Mill (1843), Boole (1847) and Peirce (1868) but was systematically introduced in logic by Frege (1879, 1892). In a modern sense the extension of a predicate, say “X is a bachelor”, is simply the set of bachelors in our domain. The intension is associated with the meaning of the predicate and allows us to derive from the fact that “John is a bachelor” the facts that “John is male” and “John is unmarried”. It is clear that this phenomenon has a relation with both the possible world interpretation of modal operators and the notion of information. A bachelor is by necessity also male, i.e., in every possible world in which John is a bachelor he is also male, consequently: If someone gives me the information that John is a bachelor I get the information that he is male and unmarried for free.
The possible world interpretation of modal operators (Kripke 1959) is related to the notion of “state description” introduced by Carnap (1947). A state description is a conjunction that contains exactly one of each atomic sentence or its negation (see section 4.3). The ambition to define a good probability measure for state descriptions was one of the motivations for Solomonoff (1960, 1997) to develop algorithmic information theory. From this perspective Kolmogorov complexity, with its separation of data types (programs, data, machines) and its focus on true sentences describing effects of processes is basically a semantic theory. This is immediately clear if we evaluate the expression:
As is explained in section 5.2.1 the expression \(U_j(\overline{T_i}x)\) denotes the result of the emulation of the computation \(T_i(x)\) by \(U_j\) after reading the self-delimiting description \(\overline{T_i}\) of machine \(T_j\). This expression can be interpreted as a piece of semantic information in the context of the informational map (See entry on semantic conceptions of information) as follows:
The logical structure of the sentence \(U_j(\overline{T_i}x)= y\) is comparable to a true sentence like:
Mutatis mutandis one could develop the following interpretation: \(U_j\) can be seen as a context that, for instance, codifies a bias for scientific observations on earth, y is the extension Venus, \(\overline{T_i}x\) is the intension “the bright star you can see in the morning in the eastern sky”. The intension consists of \(T_i\), which can be interpreted as some general astronomical observation routine (e.g., instructional data), and x provides the well-formed data that tells one where to look (bright star in the morning in the eastern sky).
This suggests a possible unification between more truth oriented theories of information and computational approaches in terms of the informational map presented in the entry of semantic conceptions of information. We delineate some research questions:
6.7 Meaning and Computation
Ever since Descartes, the idea that the meaningful world, we perceive around us, can be reduced to physical processes has been a predominant theme in western philosophy. The corresponding philosophical self-reflection in history neatly follows the technical developments from: Is the human mind an automaton, to is the mind a Turing machine and, eventually, is the mind a quantum computer? It is not the place here to discuss these matters extensively, but the corresponding problem in philosophy of information is relevant:
The question is interwoven with more general issues in philosophy and its answer directly forces a choice between a more positivistic or a more hermeneutical approach to philosophy, with consequences for theory of knowledge, metaphysics, aesthetics and ethics. It also effects direct practical decisions we take on a daily basis. Should the actions of a medical doctor be guided by evidence based medicine or by the notion of caritas? Is a patient a conscious human being that wants to lead a meaningful life, or is he ultimately just a system that needs to be repaired?
The idea that meaning is essentially a computational phenomenon may seem extreme, but here are many discussions and theories in science, philosophy and culture that implicitly assume such a view. In popular culture, e.g., there is a remarkable collection of movies and books in which we find evil computers that are conscious of themselves (2001, A Space Odyssey), individuals that upload their consciousness to a computer (1992, The Lawnmower Man), and fight battles in virtual realities (1999, The Matrix). In philosophy the position of Bostrom (2003), who defends the view that it is very likely that we already live in a computer simulation, is illustrative. There are many ways to argue the pros and cons of the reduction of meaning to computation. We give an overview of possible arguments for the two extreme positions:
Apart from that, a discipline like physics, that until recently overlooked about 68% of the energy in the universe and 27% of the matter, that has no unified theory of elementary forces and only explains the fundamental aspects of our world in terms of mathematical models that lack any intuitive foundation, for the moment does not seem to converge to a model that could be an adequate basis for a reductionistic metaphysics.
As soon as one defines information in terms of true statements, some meanings become computational and others lack that feature. In the context of empirical science we can study groups of researchers that aim at the construction of theories generalizing structural information in data sets of repeated observations. Such processes of theory construction and intersubjective verification and falsification have an inherent computational component. In fact, this notion of intersubjective verification seems an essential element of mathematics. This is the main cause of the fact that central questions of humanities are not open for quantitative analysis: We can disagree on the question whether one painting is more beautiful than the other, but not on the fact that there are two paintings.
It is clear that computation as a conceptual model pays a role in many scientific disciplines varying from cognition (Chater & Vitányi 2003), to biology (see entry on biological information) and physics (Lloyd & Ng 2004; Verlinde 2011, 2017). Extracting meaningful models out of data sets by means of computation is the driving force behind the Big Data revolution (Adriaans & Zantinge 1996; Bell, Hey, & Szalay 2009; Hey, Tansley, & Tolle 2009). Everything that multinationals like Google and Facebook “know” about individuals is extracted from large data bases by means of computational processes, and it cannot be denied that this kind of “knowledge” has a considerable amount of impact on society. The research question “How can we construct meaningful data out of large data sets by means of computation?” is a fundamental meta-problem of science in the twenty-first century and as such part of philosophy of information, but there is no strict necessity for a reductionistic view.
7. Conclusion
The first domain that could benefit from philosophy of information is of course philosophy itself. The concept of information potentially has an impact on almost all philosophical main disciplines, ranging from logic, theory of knowledge, to ontology and even ethics and esthetics (see introduction above). Philosophy of science and philosophy of information, with their interest in the problem of induction and theory formation, probably both could benefit from closer cooperation (see 4.1 Popper: Information as degree of falsifiability). The concept of information plays an important role in the history of philosophy that is not completely understood (see 2. History of the term and the concept of information).
As information has become a central issue in almost all of the sciences and humanities this development will also impact philosophical reflection in these areas. Archaeologists, linguists, physicists, astronomers all deal with information. The first thing a scientist has to do before he can formulate a theory is gathering information. The application possibilities are abundant. Datamining and the handling of extremely large data sets seems to be an essential for almost every empirical discipline in the twenty-first century.
In biology we have found out that information is essential for the organization of life itself and for the propagation of complex organisms (see entry on biological information). One of the main problems is that current models do not explain the complexity of life well. Valiant has started a research program that studies evolution as a form of computational learning (Valiant 2009) in order to explain this discrepancy. Aaronson (2013) has argued explicitly for a closer cooperation between complexity theory and philosophy.
Until recently the general opinion was that the various notions of information were more or less isolated but in recent years considerable progress has been made in the understanding of the relationship between these concepts. Cover and Thomas (2006), for instance, see a perfect match between Kolmogorov complexity and Shannon information. Similar observations have been made by Grünwald and Vitányi (2008). Also the connections that exist between the theory of thermodynamics and information theory have been studied (Bais & Farmer 2008; Harremoës & Topsøe 2008) and it is clear that the connections between physics and information theory are much more elaborate than a mere ad hoc similarity between the formal treatment of entropy and information suggests (Gell-Mann & Lloyd 2003; Verlinde (2011, 2017). Quantum computing is at this moment not developed to a point where it is effectively more powerful than classical computing, but this threshold might be passed in the coming years. From the point of view of philosophy many conceptual problems of quantum physics and information theory seem to merge into one field of related questions:
The notion of information has become central in both our society and in the sciences. Information technology plays a pivotal role in the way we organize our lives. It also has become a basic category in the sciences and the humanities. Philosophy of information, both as a historical and a systematic discipline, offers a new perspective on old philosophical problems and also suggests new research domains.