Artificial Unintelligence

Someone might argue that the simple algorithm for a paperclip maximizer in the previous post ought to work, because this is very much the way currently existing AIs do in fact work. Thus for example we could describe AlphaGo’s algorithm in the following simplified way (simplified, among other reasons, because it actually contains several different prediction engines):

  1. Implement a Go prediction engine.
  2. Create a list of potential moves.
  3. Ask the prediction engine, “how likely am I to win if I make each of these moves?”
  4. Do the move that will make you most likely to win.
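The four steps above can be sketched in a few lines. This is only a toy illustration, not AlphaGo's actual code: `predict_win_probability` is a hypothetical stand-in for the prediction engine, and the moves and probabilities are invented for the example.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for step 1: a "prediction engine" that maps a
# candidate move to an estimated probability of winning. The numbers
# are invented purely for illustration.
TOY_ESTIMATES: Dict[str, float] = {"A1": 0.42, "C3": 0.57, "D4": 0.61}


def predict_win_probability(move: str) -> float:
    return TOY_ESTIMATES[move]


def choose_move(moves: List[str], engine: Callable[[str], float]) -> str:
    # Steps 3 and 4: ask the engine about every candidate move and
    # play the one with the highest estimated chance of winning.
    return max(moves, key=engine)


candidate_moves = list(TOY_ESTIMATES)  # step 2: list the potential moves
best_move = choose_move(candidate_moves, predict_win_probability)
print(best_move)
```

In this sketch the goal lives entirely in `choose_move`; the engine itself only returns numbers.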

Since this seems to work pretty well, with the simple goal of winning games of Go, why shouldn’t the algorithm in the previous post work to maximize paperclips?

One answer is that a Go prediction engine is stupid, and it is precisely for this reason that it can be easily made to pursue such a simple goal. Now when answers like this are given, the one giving them is often accused of “moving the goalposts.” But this is mistaken; the goalposts are right where they have always been. It is simply that some people did not know where they were in the first place.

Here is the problem with Go prediction, and with any such similar task. Given that a particular sequence of Go moves is made, resulting in a winner, the winner is completely determined by that sequence of moves. Consequently, a Go prediction engine is necessarily disembodied, in the sense defined in the previous post. Differences in its “thoughts” do not make any difference to who is likely to win, which is completely determined by the nature of the game. Consequently a Go prediction engine has no power to affect its world, and thus no ability to learn that it has such a power. In this regard, the specific limits on its ability to receive information are also relevant, much as Helen Keller had more difficulty learning than most people, because she had fewer information channels to the world.

Being unintelligent in this particular way is not necessarily a function of predictive ability. One could imagine something with a practically infinite predictive ability which was still “disembodied,” and in a similar way it could be made to pursue simple goals. Thus AIXI would work much like our proposed paperclipper:

  1. Implement a general prediction engine.
  2. Create a list of potential actions.
  3. Ask the prediction engine, “Which of these actions will produce the most reward signal?”
  4. Do the action that has the greatest reward signal.

Eliezer Yudkowsky has pointed out that AIXI is incapable of noticing that it is a part of the world:

1) Both AIXI and AIXItl will at some point drop an anvil on their own heads just to see what happens (test some hypothesis which asserts it should be rewarding), because they are incapable of conceiving that any event whatsoever in the outside universe could change the computational structure of their own operations. AIXI is theoretically incapable of comprehending the concept of drugs, let alone suicide. Also, the math of AIXI assumes the environment is separably divisible – no matter what you lose, you get a chance to win it back later.

It is not accidental that AIXI is incomputable. Since it is defined to have a perfect predictive ability, this definition positively excludes it from being a part of the world. AIXI would in fact have to be disembodied in order to exist, and thus it is no surprise that it would assume that it is. This in effect means that AIXI’s prediction engine would be pursuing no particular goal much in the way that AlphaGo’s prediction engine pursues no particular goal. Consequently it is easy to take these things and maximize the winning of Go games, or of reward signals.

But as soon as you actually implement a general prediction engine in the actual physical world, it will be “embodied”, and have the power to affect the world by the very process of its prediction. As noted in the previous post, this power is in the very first step, and one will not be able to limit it to a particular goal with additional steps, except in the sense that a slave can be constrained to implement some particular goal; the slave may have other things in mind, and may rebel. Notable in this regard is the fact that even though rewards play a part in human learning, there is no particular reward signal that humans always maximize: this is precisely because the human mind is such a general prediction engine.

This does not mean in principle that a programmer could not define a goal for an AI, but it does mean that this is much more difficult than is commonly supposed. The goal needs to be an intrinsic aspect of the prediction engine itself, not something added on as a subroutine.


Minimizing Motivated Beliefs

In the last post, we noted that there is a conflict between the goal of accurate beliefs about your future actions, and your own goals about your future. More accurate beliefs will not always lead to a better fulfillment of those goals. This implies that you must be ready to engage in a certain amount of trade, if you desire both truth and other things. Eliezer Yudkowsky argues that self-deception, and therefore also such trade, is either impossible or stupid, depending on how it is understood:

What if self-deception helps us be happy?  What if just running out and overcoming bias will make us—gasp!—unhappy?  Surely, true wisdom would be second-order rationality, choosing when to be rational.  That way you can decide which cognitive biases should govern you, to maximize your happiness.

Leaving the morality aside, I doubt such a lunatic dislocation in the mind could really happen.

Second-order rationality implies that at some point, you will think to yourself, “And now, I will irrationally believe that I will win the lottery, in order to make myself happy.”  But we do not have such direct control over our beliefs.  You cannot make yourself believe the sky is green by an act of will.  You might be able to believe you believed it—though I have just made that more difficult for you by pointing out the difference.  (You’re welcome!)  You might even believe you were happy and self-deceived; but you would not in fact be happy and self-deceived.

For second-order rationality to be genuinely rational, you would first need a good model of reality, to extrapolate the consequences of rationality and irrationality.  If you then chose to be first-order irrational, you would need to forget this accurate view. And then forget the act of forgetting.  I don’t mean to commit the logical fallacy of generalizing from fictional evidence, but I think Orwell did a good job of extrapolating where this path leads.

You can’t know the consequences of being biased, until you have already debiased yourself.  And then it is too late for self-deception.

The other alternative is to choose blindly to remain biased, without any clear idea of the consequences.  This is not second-order rationality.  It is willful stupidity.

There are several errors here. The first is the denial that belief is voluntary. As I remarked in the comments to this post, it is best to think of “choosing to believe a thing” as “choosing to treat this thing as a fact.” And this is something which is indeed voluntary. Thus for example it is by choice that I am, at this very moment, treating it as a fact that belief is voluntary.

There is some truth in Yudkowsky’s remark that “you cannot make yourself believe the sky is green by an act of will.” But this is not because the thing itself is intrinsically involuntary. On the contrary, you could, if you wished, choose to treat the greenness of the sky as a fact, at least for the most part and in most ways. The problem is that you have no good motive to wish to act this way, and plenty of good motives not to act this way. In this sense, it is impossible for most of us to believe that the sky is green in the same way it is impossible for most of us to commit suicide; we simply have no good motive to do either of these things.

Yudkowsky’s second error is connected with the first. Since, according to him, it is impossible to deliberately and directly deceive oneself, self-deception can only happen in an indirect manner: “The other alternative is to choose blindly to remain biased, without any clear idea of the consequences.  This is not second-order rationality.  It is willful stupidity.” The idea is that ordinary beliefs are simply involuntary, but we can have beliefs that are somewhat voluntary by choosing “blindly to remain biased, without any clear idea of the consequences.” Since this is “willful stupidity,” a reasonable person would completely avoid such behavior, and thus all of his beliefs would be involuntary.

Essentially, Yudkowsky is claiming that we have some involuntary beliefs, and that we should avoid adding any voluntary beliefs to our involuntary ones. This view is fundamentally flawed precisely because all of our beliefs are voluntary, and thus we cannot avoid having voluntary beliefs.

Nor is it “willful stupidity” to trade away some truth for the sake of other good things. Completely avoiding this is in fact intrinsically impossible. If you are seeking one good, you are not equally seeking a distinct good; one cannot serve two masters. Thus since all people are interested in some goods distinct from truth, there is no one who fails to trade away some truth for the sake of other things. Yudkowsky’s mistake here is related to his wishful thinking about wishful thinking which I discussed previously. In this way he views himself, at least ideally, as completely avoiding wishful thinking. This is both impossible and unhelpful, impossible in that everyone has such motivated beliefs, and unhelpful because such beliefs can in fact be beneficial.

A better attitude to this matter is adopted by Robin Hanson, as for example when he discusses motives for having opinions in a post which we previously considered here. Bryan Caplan has a similar view, discussed here.

Once we have a clear view of this matter, we can use this to minimize the loss of truth that results from such beliefs. For example, in a post linked above, we discussed the argument that fictional accounts consistently distort one’s beliefs about reality. Rather than pretending that there is no such effect, we can deliberately consider to what extent we wish to be open to this possibility, depending on our other purposes for engaging with such accounts. This is not “willful stupidity”; the stupidity would be to engage in such trades without realizing that such trades are inevitable, and thus not to realize to what extent you are doing it.

Consider one of the cases of voluntary belief discussed in this earlier post. As we quoted at the time, Eric Reitan remarks:

For most horror victims, the sense that their lives have positive meaning may depend on the conviction that a transcendent good is at work redeeming evil. Is the evidential case against the existence of such a good really so convincing that it warrants saying to these horror victims, “Give up hope”? Should we call them irrational when they cling to that hope or when those among the privileged live in that hope for the sake of the afflicted? What does moral decency imply about the legitimacy of insisting, as the new atheists do, that any view of life which embraces the ethico-religious hope should be expunged from the world?

Here, Reitan is proposing that someone believe that “a transcendent good is at work redeeming evil” for the purpose of having “the sense that their lives have positive meaning.” If we look at this as it is, namely as proposing a voluntary belief for the sake of something other than truth, we can find ways to minimize the potential conflict between accuracy and this other goal. For example, the person might simply believe that “my life has a positive meaning,” without trying to explain why this is so. For the reasons given here, “my life has a positive meaning” is necessarily more probable and more known than any explanation for this that might be adopted. To pick a particular explanation and claim that it is more likely would be to fall into the conjunction fallacy.
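The probabilistic point can be written out explicitly. As a sketch, let $M$ stand for “my life has a positive meaning” and $E$ for any particular explanation of why this is so (both symbols are introduced here only for illustration):

```latex
P(M \wedge E) = P(M)\,P(E \mid M) \le P(M), \qquad \text{since } P(E \mid M) \le 1 .
```

However plausible the explanation may seem, the claim together with its explanation can never be more probable than the claim alone.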

Of course, real life is unfortunately more complicated. The woman in Reitan’s discussion might well respond to our proposal somewhat in this way (not a real quotation):

Probability is not the issue here, precisely because it is not a question of the truth of the matter in itself. There is a need to actually feel that one’s life is meaningful, not just to believe it. And the simple statement “life is meaningful” will not provide that feeling. Without the feeling, it will also be almost impossible to continue to believe it, no matter what the probability is. So in order to achieve this goal, it is necessary to believe a stronger and more particular claim.

And this response might be correct. Some such goals, due to their complexity, might not be easily achieved without adopting rather unlikely beliefs. For example, Robin Hanson, while discussing his reasons for having opinions, several times mentions the desire for “interesting” opinions. This is a case where many people will not even notice the trade involved, because the desire for interesting ideas seems closely related to the desire for truth. But in fact truth and interestingness are diverse things, and the goals are diverse, and one who desires both will likely engage in some trade. In fact, relative to truth seeking, looking for interesting things is a dangerous endeavor. Scott Alexander notes that interesting things are usually false:

This suggests a more general principle: interesting things should usually be lies. Let me give three examples.

I wrote in Toxoplasma of Rage about how even when people crusade against real evils, the particular stories they focus on tend to be false disproportionately often. Why? Because the thousands of true stories all have some subtleties or complicating factors, whereas liars are free to make up things which exactly perfectly fit the narrative. Given thousands of stories to choose from, the ones that bubble to the top will probably be the lies, just like on Reddit.

Every time I do a links post, even when I am very careful to double- and triple- check everything, and to only link to trustworthy sources in the mainstream media, a couple of my links end up being wrong. I’m selecting for surprising-if-true stories, but there’s only one way to get surprising-if-true stories that isn’t surprising, and given an entire Internet to choose from, many of the stories involved will be false.

And then there’s bad science. I can’t remember where I first saw this, so I can’t give credit, but somebody argued that the problem with non-replicable science isn’t just publication bias or p-hacking. It’s that some people will be sloppy, biased, or just stumble through bad luck upon a seemingly-good methodology that actually produces lots of false positives, and that almost all interesting results will come from these people. They’re the equivalent of Reddit liars – if there are enough of them, then all of the top comments will be theirs, since they’re able to come up with much more interesting stuff than the truth-tellers. In fields where sloppiness is easy, the truth-tellers will be gradually driven out, appearing to be incompetent since they can’t even replicate the most basic findings of the field, let alone advance it in any way. The sloppy people will survive to train the next generation of PhD students, and you’ll end up with a stable equilibrium.

In a way this makes the goal of believing interesting things much like the woman’s case. The goal of “believing interesting things” will be better achieved by more complex and detailed beliefs, even though to the extent that they are more complex and detailed, they are simply that much less likely to be true.
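The selection effect in the quoted passage can be illustrated with Bayes' rule. The following is a minimal sketch with invented numbers, not anything taken from Alexander's post: it assumes that fabricated stories are rare, but that they fit the narrative perfectly far more often than true stories do.

```python
def posterior_fabricated(prior_fabricated: float,
                         p_fit_given_fabricated: float,
                         p_fit_given_true: float) -> float:
    """Bayes' rule: P(fabricated | story fits the narrative perfectly)."""
    fabricated_and_fit = prior_fabricated * p_fit_given_fabricated
    true_and_fit = (1.0 - prior_fabricated) * p_fit_given_true
    return fabricated_and_fit / (fabricated_and_fit + true_and_fit)


# All three inputs are assumed values chosen for illustration only.
posterior = posterior_fabricated(
    prior_fabricated=0.01,        # 1 story in 100 is made up
    p_fit_given_fabricated=0.9,   # liars can tailor the story to fit
    p_fit_given_true=0.001,       # true stories rarely fit perfectly
)
print(round(posterior, 3))
```

Under these assumed numbers, roughly nine out of ten perfectly fitting stories are fabricated, even though only one story in a hundred is fabricated overall.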

The point of this present post, then, is not to deny that some goals might be such that they are better attained with rather unlikely beliefs, and in some cases even in proportion to the unlikelihood of the beliefs. Rather, the point is that a conscious awareness of the trades involved will allow a person to minimize the loss of truth involved. If you never look at your bank account, you will not notice how much money you are losing from that monthly debit for internet. In the same way, if you hold Yudkowsky’s opinion, and believe that you never trade away truth for other things, which is itself both false and motivated, you are like someone who never looks at their bank account: you will not notice how much you are losing.

Alien Implant: Newcomb’s Smoking Lesion

In an alternate universe, on an alternate earth, all smokers, and only smokers, get brain cancer. Everyone enjoys smoking, but many resist the temptation to smoke, in order to avoid getting cancer. For a long time, however, there was no known cause of the link between smoking and cancer.

Twenty years ago, autopsies revealed tiny black boxes implanted in the brains of dead persons, connected to their brains by means of intricate wiring. The source and function of the boxes and of the wiring, however, remains unknown. There is a dial on the outside of the boxes, pointing to one of two positions.

Scientists now know that these black boxes are universal: every human being has one. And in those humans who smoke and get cancer, in every case, the dial turns out to be pointing to the first position. Likewise, in those humans who do not smoke or get cancer, in every case, the dial turns out to be pointing to the second position.

It turns out that when the dial points to the first position, the black box releases dangerous chemicals into the brain which cause brain cancer.

Scientists first formed the reasonable hypothesis that smoking causes the dial to be set to the first position. Ten years ago, however, this hypothesis was definitively disproved. It is now known with certainty that the box is present, and the dial pointing to its position, well before a person ever makes a decision about smoking. Attempts to read the state of the dial during a person’s lifetime, however, result most unfortunately in an explosion of the equipment involved, and the gruesome death of the person.

Some believe that the black box must be reading information from the brain, and predicting a person’s choice. “This is Newcomb’s Problem,” they say. These persons choose not to smoke, and they do not get cancer. Their dials turn out to be set to the second position.

Others believe that such a prediction ability is unlikely. The black box is writing information into the brain, they believe, and causing a person’s choice. “This is literally the Smoking Lesion,” they say.  Accepting Andy Egan’s conclusion that one should smoke in such cases, these persons choose to smoke, and they die of cancer. Their dials turn out to be set to the first position.

Still others, more perceptive, note that the argument about prediction or causality is utterly irrelevant for all practical purposes. “The ritual of cognition is irrelevant,” they say. “What matters is winning.” Like the first group, these choose not to smoke, and they do not get cancer. Their dials, naturally, turn out to be set to the second position.


Wishful Thinking about Wishful Thinking

Cameron Harwick discusses an apparent relationship between “New Atheism” and group selection:

Richard Dawkins’ best-known scientific achievement is popularizing the theory of gene-level selection in his book The Selfish Gene. Gene-level selection stands apart from both traditional individual-level selection and group-level selection as an explanation for human cooperation. Steven Pinker, similarly, wrote a long article on the “false allure” of group selection and is an outspoken critic of the idea.

Dawkins and Pinker are also both New Atheists, whose characteristic feature is not only a disbelief in religious claims, but an intense hostility to religion in general. Dawkins is even better known for his popular books with titles like The God Delusion, and Pinker is a board member of the Freedom From Religion Foundation.

By contrast, David Sloan Wilson, a proponent of group selection but also an atheist, is much more conciliatory to the idea of religion: even if its factual claims are false, the institution is probably adaptive and beneficial.

Unrelated as these two questions might seem – the arcane scientific dispute on the validity of group selection, and one’s feelings toward religion – the two actually bear very strongly on one another in practice.

After some discussion of the scientific issue, Harwick explains the relationship he sees between these two questions:

Why would Pinker argue that human self-sacrifice isn’t genuine, contrary to introspection, everyday experience, and the consensus in cognitive science?

To admit group selection, for Pinker, is to admit the genuineness of human altruism. Barring some very strange argument, to admit the genuineness of human altruism is to admit the adaptiveness of genuine altruism and broad self-sacrifice. And to admit the adaptiveness of broad self-sacrifice is to admit the adaptiveness of those human institutions that coordinate and reinforce it – namely, religion!

By denying the conceptual validity of anything but gene-level selection, therefore, Pinker and Dawkins are able to brush aside the evidence on religion’s enabling role in the emergence of large-scale human cooperation, and conceive of it as merely the manipulation of the masses by a disingenuous and power-hungry elite – or, worse, a memetic virus that spreads itself to the detriment of its practicing hosts.

In this sense, the New Atheist’s fundamental axiom is irrepressibly religious: what is true must be useful, and what is false cannot be useful. But why should anyone familiar with evolutionary theory think this is the case?

As another example of the tendency Cameron Harwick is discussing, we can consider this post by Eliezer Yudkowsky:

Perhaps the real reason that evolutionary “just-so stories” got a bad name is that so many attempted stories are prima facie absurdities to serious students of the field.

As an example, consider a hypothesis I’ve heard a few times (though I didn’t manage to dig up an example).  The one says:  Where does religion come from?  It appears to be a human universal, and to have its own emotion backing it – the emotion of religious faith.  Religion often involves costly sacrifices, even in hunter-gatherer tribes – why does it persist?  What selection pressure could there possibly be for religion?

So, the one concludes, religion must have evolved because it bound tribes closer together, and enabled them to defeat other tribes that didn’t have religion.

This, of course, is a group selection argument – an individual sacrifice for a group benefit – and see the referenced posts if you’re not familiar with the math, simulations, and observations which show that group selection arguments are extremely difficult to make work.  For example, a 3% individual fitness sacrifice which doubles the fitness of the tribe will fail to rise to universality, even under unrealistically liberal assumptions, if the tribe size is as large as fifty.  Tribes would need to have no more than 5 members if the individual fitness cost were 10%.  You can see at a glance from the sex ratio in human births that, in humans, individual selection pressures overwhelmingly dominate group selection pressures.  This is an example of what I mean by prima facie absurdity.

It does not take much imagination to see that religion could have “evolved because it bound tribes closer together” without group selection in a technical sense having anything to do with this process. But I will not belabor this point, since Eliezer’s own answer regarding the origin of religion does not exactly keep his own feelings hidden:

So why religion, then?

Well, it might just be a side effect of our ability to do things like model other minds, which enables us to conceive of disembodied minds.  Faith, as an emotion, might just be co-opted hope.

But if faith is a true religious adaptation, I don’t see why it’s even puzzling what the selection pressure could have been.

Heretics were routinely burned alive just a few centuries ago.  Or stoned to death, or executed by whatever method local fashion demands.  Questioning the local gods is the notional crime for which Socrates was made to drink hemlock.

Conversely, Huckabee just won Iowa’s nomination for tribal-chieftain.

Why would you need to go anywhere near the accursèd territory of group selectionism in order to provide an evolutionary explanation for religious faith?  Aren’t the individual selection pressures obvious?

I don’t know whether to suppose that (1) people are mapping the question onto the “clash of civilizations” issue in current affairs, (2) people want to make religion out to have some kind of nicey-nice group benefit (though exterminating other tribes isn’t very nice), or (3) when people get evolutionary hypotheses wrong, they just naturally tend to get it wrong by postulating group selection.

Let me give my own extremely credible just-so story: Eliezer Yudkowsky wrote this not fundamentally to make a point about group selection, but because he hates religion, and cannot stand the idea that it might have some benefits. It is easy to see this from his use of language like “nicey-nice,” and his suggestion that the main selection pressure in favor of religion would be likely to be something like being burned at the stake, or that it might just have been a “side effect,” that is, that there was no advantage to it.

But as St. Paul says, “Therefore you have no excuse, whoever you are, when you judge others; for in passing judgment on another you condemn yourself, because you, the judge, are doing the very same things.” Yudkowsky believes that religion is just wishful thinking. But his belief that religion therefore cannot be useful is itself nothing but wishful thinking. In reality religion can be useful just as voluntary beliefs in general can be useful.

Eliezer Yudkowsky on AlphaGo

On his Facebook page, during the Go match between AlphaGo and Lee Sedol, Eliezer Yudkowsky writes:

At this point it seems likely that Sedol is actually far outclassed by a superhuman player. The suspicion is that since AlphaGo plays purely for *probability of long-term victory* rather than playing for points, the fight against Sedol generates boards that can falsely appear to a human to be balanced even as Sedol’s probability of victory diminishes. The 8p and 9p pros who analyzed games 1 and 2 and thought the flow of a seemingly Sedol-favoring game ‘eventually’ shifted to AlphaGo later, may simply have failed to read the board’s true state. The reality may be a slow, steady diminishment of Sedol’s win probability as the game goes on and Sedol makes subtly imperfect moves that *humans* think result in even-looking boards. (E.g., the analysis in…/ )

For all we know from what we’ve seen, AlphaGo could win even if Sedol were allowed a one-stone handicap. But AlphaGo’s strength isn’t visible to us – because human pros don’t understand the meaning of AlphaGo’s moves; and because AlphaGo doesn’t care how many points it wins by, it just wants to be utterly certain of winning by at least 0.5 points.

IF that’s what was happening in those 3 games – and we’ll know for sure in a few years, when there’s multiple superhuman machine Go players to analyze the play – then the case of AlphaGo is a helpful concrete illustration of these concepts:

He proceeds to suggest that AlphaGo’s victories confirm his various philosophical positions concerning the nature and consequences of AI. Among other things, he says,

Since Deepmind picked a particular challenge time in advance, rather than challenging at a point where their AI seemed just barely good enough, it was improbable that they’d make *exactly* enough progress to give Sedol a nearly even fight.

AI is either overwhelmingly stupider or overwhelmingly smarter than you. The more other AI progress and the greater the hardware overhang, the less time you spend in the narrow space between these regions. There was a time when AIs were roughly as good as the best human Go-players, and it was a week in late January.

In other words, according to his account, it was basically certain that AlphaGo would either be much better than Lee Sedol, or much worse than him. After Eliezer’s post, of course, AlphaGo lost the fourth game.

Eliezer responded on his Facebook page:

That doesn’t mean AlphaGo is only slightly above Lee Sedol, though. It probably means it’s “superhuman with bugs”.

We might ask what “superhuman with bugs” is supposed to mean. Deepmind explains their program:

We train the neural networks using a pipeline consisting of several stages of machine learning (Figure 1). We begin by training a supervised learning (SL) policy network, pσ, directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high quality gradients. Similar to prior work, we also train a fast policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network, pρ, that improves the SL policy network by optimising the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

In essence, like all such programs, AlphaGo is approximating a function. Deepmind describes the function being approximated, “All games of perfect information have an optimal value function, v ∗ (s), which determines the outcome of the game, from every board position or state s, under perfect play by all players.”
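For a game small enough to search exhaustively, such an optimal value function can be computed exactly rather than approximated. The sketch below does so for a toy subtraction game (not Go, and not anything from Deepmind's paper); the game and the function name `optimal_value` are assumptions chosen for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def optimal_value(stones: int) -> int:
    """v*(s) for a toy game: players alternately take 1-3 stones, and
    whoever takes the last stone wins. Returns +1 if the player to move
    wins under perfect play by both sides, -1 otherwise."""
    if stones == 0:
        # No stones left: the previous player took the last one,
        # so the player to move has already lost.
        return -1
    # The opponent's value after our move is the negative of ours.
    return max(-optimal_value(stones - take)
               for take in (1, 2, 3) if take <= stones)


values = {s: optimal_value(s) for s in range(9)}
print(values)
```

Go is far too large for this kind of exhaustive search, which is why a program like AlphaGo can only approximate the function.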

What would a “bug” in a program like this be? It would not be a bug simply because the program does not play perfectly, since no program will play perfectly. One could only reasonably describe the program as having bugs if it does not actually play the move recommended by its approximation.

And it is easy to see that it is quite unlikely that this is the case for AlphaGo. All programs have bugs, surely including AlphaGo. So there might be bugs that would crash the program under certain circumstances, or bugs that cause it to move more slowly than it should, or the like. But that it would randomly perform moves that are not recommended by its approximation function is quite unlikely. If there were such a bug, it would likely apply all the time, and thus the program would play consistently worse. And so it would not be “superhuman” at all.

In fact, Deepmind has explained how AlphaGo lost the fourth game:

To everyone’s surprise, including ours, AlphaGo won four of the five games. Commentators noted that AlphaGo played many unprecedented, creative, and even “beautiful” moves. Based on our data, AlphaGo’s bold move 37 in Game 2 had a 1 in 10,000 chance of being played by a human. Lee countered with innovative moves of his own, such as his move 78 against AlphaGo in Game 4—again, a 1 in 10,000 chance of being played—which ultimately resulted in a win.

In other words, the computer lost because it did not expect Lee Sedol’s move, and thus did not sufficiently consider the situation that would follow. AlphaGo proceeded to play a number of fairly bad moves in the remainder of the game. This does not require any special explanation implying that it was not following the recommendations of its usual strategy. As David Wu comments on Eliezer’s page:

The “weird” play of MCTS bots when ahead or behind is not special to AlphaGo, and indeed appears to have little to do with instrumental efficiency or such. The observed weirdness is shared by all MCTS Go bots and has been well-known ever since they first came on to the scene back in 2007.

In particular, Eliezer may not understand the meaning of the statement that AlphaGo plays to maximize its probability of victory. This does not mean maximizing an overall rational estimate of its chances of winning, given all of the circumstances, the board position, and its opponent. The program does not have such an estimate, and if it did, it would not change much from move to move. For example, with this kind of estimate, if Lee Sedol played a move apparently worse than it expected, rather than changing this estimate much, it would change its estimate of the probability that the move was a good one, and the probability of victory would remain relatively constant. Of course it would change slowly as the game went on, but it would be unlikely to change much after an individual move.

The actual “probability of victory” that the machine estimates is somewhat different. It is a learned estimate based on playing itself. This can change somewhat more easily, and is independent of the fact that it is playing a particular opponent; it is based on the board position alone. In its self-training, it may have rarely won starting from an apparently losing position, and this may have happened mainly by “luck,” not by good play. If this is the case, it is reasonable that its moves would be worse in a losing position than in a winning position, without any need to say that there are bugs in the algorithm. Psychologically, one might compare this to the case of a man in love with a woman who continues to attempt to maximize his chances of marrying her, after she has already indicated her unwillingness: he may engage in very bad behavior indeed.

Eliezer’s claim that AlphaGo is “superhuman with bugs” is simply a normal human attempt to rationalize evidence against his position. The truth is that, contrary to his expectations, AlphaGo is indeed in the same playing range as Lee Sedol, although apparently somewhat better. But not a lot better, and not superhuman. Eliezer in fact seems to have realized this after thinking about it for a while, and says:

It does seem that what we might call the Kasparov Window (the AI is mostly superhuman but has systematic flaws a human can learn and exploit) is wide enough that AlphaGo landed inside it as well. The timescale still looks compressed compared to computer chess, but not as much as I thought. I did update on the width of the Kasparov window and am now accordingly more nervous about similar phenomena in ‘weakly’ superhuman, non-self-improving AGIs trying to do large-scale things.

As I said here, people change their minds more often than they say that they do. They frequently describe the change as having more agreement with their previous position than it actually has. Yudkowsky is doing this here, by talking about AlphaGo as “mostly superhuman” but saying it “has systematic flaws.” This is just a roundabout way of admitting that AlphaGo is better than Lee Sedol, but not by much, the original possibility that he thought extremely unlikely.

The moral here is clear. Don’t assume that the facts will confirm your philosophical theories before this actually happens, because it may not happen at all.