Limits of Learning Algorithms

A learning algorithm is a program that, following some fixed set of rules, produces new programs from sets or sequences of input data. Recent progress on the development of some learning algorithms (especially the “deep learning” approach based on “deep”, i.e. multi-layered systems of artificial neurons) have lead some researchers to the claim that, after a long “AI winter” of lack of success, true AI (artificial intelligence) deserving that name is finally arriving. But are such claims justified? Or is this hype?

Initial attempts to build artificial intelligent systems were based on algorithms programmed by people. While such systems could do some interesting special things, like playing checkers and even doing so better than most people, they did not display any ability that could even remotely be called “intelligence”. Just like our cars can move faster than we are able to walk, just as our drilling machines are better at making holes than we are with our fingers alone (etc., etc.), such special purpose programs were better than humans at doing some special purpose information processing tasks. But there is nothing special about this; it is one of the reasons why we are using software instead of doing everything by ourselves. It has nothing to do with intelligence. An electronic calculator is better at multiplying numbers than most people, but it is in no way intelligent.

Now the idea that intelligence could instead be achieved by building machines that learn seems plausible at first. The task of producing the special purpose software is shifted to the machine itself. The machine learns from examples and produces the special purpose software by itself. What human programmers provide is not any of the special purpose software but instead a software for producing the special software.

But there is a problem: the learning software is an algorithm itself. As such, it is limited. It can be represented by a finite text, a finite sequence of signs or characters, so it contains only a limited amount of information. The following line of thought (in fact a sketch of a mathematical proof) will show that any learning algorithm is going to be incomplete, i.e. it will not be able to find all the regularity existing in the data presented to it. It will simply not be capable of spotting some patterns in some of the data, i.e., it is going to have systematic blind spots.

Moreover, the amount of programming knowledge contained in such an algorithm, i.e. the amount of ways it is able to put together the building blocks of the underlying programming language (basic commands and programming language constructs, artificial neurons and the like) to form new programs, is going to be limited as well. For any given learning algorithm, there are possible programs that another learning algorithm might be able to construct that are out of reach for this particular algorithm.

To see why there always must be such a limit, let us now look back at the proof, sketched in a previous article, that total computable functions are not enumerable by an algorithm. What we have shown there is the following: if we have a program K1 that produces programs calculating total computable functions, we can always produce another program K2 that produces all of these programs, plus another one not produced by the previous program.

Computable total functions are used here as an abstract model for knowledge in general, see Knowledge and Total Functions. In a nutshell, thinking is viewed here as the application of knowledge to contents of mind or sense data, producing new contents of mind. Application of knowledge is viewed as the application of programs that perform such transformations, so any kind of knowledge can be modeled as a program calculating some function. Learning would then consist of constructing such programs. The application of a certain bit of knowledge might yield nothing in some cases. We can think of this case as our calculation producing some error value. In any case, something comes out after a limited time, i.e. all these bits of knowledge are modelled as computable total functions (i.e. functions that can be computed by a program yielding an output for every possible input after a limited time). A system that retrieves or constructs such knowledge-programs for given inputs can be viewed as a knowledge base or a learning program.

The proof tells us that each such learning program/knowledge base must be incomplete. We can always construct some knowledge it is not able to produce.

Since the new knowledge can actually be constructed from the old formal system (by means of a productive function), there is a process that brings us from one formal description of a cognitive system (think of the “brain” of a robot, for example) to another one. We can always extend the formal system. Could we build this mechanism into the program? Well, this is of course possible and will yield another program K’ that will produce additional programs kx that the original program K could not produce. But this program, no matter how we build it, would also be incomplete. We can apply the same operation to it as before and generate a program it could not produce. If we apply the productive function from the outside, we can apply it to the formal system as a whole. If we build it into the system, the productive system will no longer see the formal system in its totality, i.e. the formal system cannot contain a complete reference onto itself (see Shifting the Vantage Point and Kurt Ammon: Informal Physical Reasoning Processes).

So, if general intelligence means the ability to produce arbitrary knowledge, a learning algorithm, no matter how impressive its results might be, cannot be generally intelligent. Every algorithm is limited and has some blind spot. Applied to human beings, this means that either we can be described as algorithms, but then we must be limited in our ability to achieve knowledge of the world. There would be things we could not learn in principle because of the way our brain works, just as a fly bumping against a window pane is unable to learn about glass. Or, we are universal in the sense that we can learn anything. But then we cannot be described completely by algorithms or formal theories.

But can we not solve the problem on the meta-level? Let us assume that we could produce learning programs automatically by some meta-algorithm. If each of them is limited, we might be able to come up with an algorithm instead that produces all such programs automatically. This would be a system that can learn how to learn and produce different ways of learning. Could such a meta-learning algorithm that produces other algorithms automatically be a possible way to achieve general intelligence? Could such an algorithm produce every possible learning algorithm, so that it would not be limited in the way special purpose learning procedures are?

The following proof sketch will show that this is also impossible (if you are not interested in the technical detail, you can skip the indented section):

We start again with the assumption that such a program could exist and derive a contradiction from that assumption. Let’s assume that we can enumerate all the programs that produce enumerations of total computable functions. So we assume we have a program K that produces all possible programs K1, K2, K3, which in turn produce programs k1, k2, k3 etc. calculating total functions, i.e. they enumerate programs computing total functions. Our assumption means that every such enumerating program Kn that is possible must be in this enumeration produced by K.

To make this understandable, we may think of K as producing a table. Each column in this table stands for a program Kn that enumerates programs calculating total functions (the kn in the fields of the table).

meta_learning

Note that here the fields of the table stand for programs calculating total functions, not for the individual values these programs assign to inputs, as in the other proof.

Now, every program km that computes any total function must be produced by at least one of the Kn because for each km one could easily write a program producing it for one of its inputs (or all of its inputs). Therefore, every program computing a total function must occur somewhere in the table, at least once.

Now, it is possible to bring all the fields of the table into a single sequence. There are several ways of doing this. One is to start with the left upper corner of the table (k1, green field), then the two fields marked in blue (k2, k3), then the fields marked in yellow, then the orange ones and so on. You can walk through the whole table on such diagonal lines. You can write a computer program that does so (e.g. by means of two loops nested into each other). So from the program K you can generate another program K’ that enumerates all the programs k1, k2, k3 etc. We can make sure that every program occurs only once in the enumeration if we want to, by checking if it has already been generated before. Since every program calculating a total function must occur somewhere on the table, the resulting enumeration must be complete. However, in our previous proof, it has already been shown that a complete enumeration of these programs is impossible. Hence, K cannot be complete, i.e. programs that produce computable total functions are in turn not enumerable.

Note that we can construct a sequence k1, k2, k3 etc. and apply the “diagonalization + modification”-operation of our first proof to that. This yields a new k’. We can then construct a program producing k’ and modify K so it would also produce that program. So on this meta-level again, we can go from one algorithm to another, more powerful one, i.e. we can produce a productive function that extends every single given formalism but that cannot be used to achieve completeness.

We could construct similar argumentations for meta-meta-meta-programs etc. The argumentations would become more convoluted, but the result will remain the same. It is not possible to construct an algorithm that can produce every possible item of knowledge, or that can produce all possible meta-programs that can generate knowledge, and so on.

If we define a generally intelligent system as a system that is universal in the sense that it can spot every pattern existing in arbitrary data and can construct every possible program that can embody such patterns, i.e. that it can construct every possible bit of knowledge, such a system cannot be an algorithm. General learning, which we might understand to be the historical or biographical development of a learning system, cannot be an algorithm. It cannot be described completely by a formal theory or algorithm. Such a system may contain formal theories or algorithms, but it would not be such a formal system itself.

Algorithms or formal theories cannot describe their own evolution into other formal theories or algorithms. Everything that is derivable inside a given formal system is so from the outset. Formal systems are ahistorical. The development of knowledge and ideas, and hence human culture, on the other hand, is a historical process in which knowledge is changed. It may be mathematically modelled as a process of applying productive functions to existing knowledge, but this process is not under the complete control of the existing knowledge itself. As a result, the resulting process of evolution of knowledge or culture cannot be described in terms of a single formal theory.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: