Which values are stable under ontology shifts?

Here's a rough argument which I've been thinking about lately:

We have coherence theorems which say that, if you’re not acting like you’re maximizing expected utility over outcomes, you’d make payments which predictably lose you money. But in general I don't see any principled distinction between “predictably losing money” (which we view as incoherent) and “predictably spending money” (to fulfill your values): it depends on the space of outcomes over which you define utilities, which seems pretty arbitrary. You could interpret an agent being money-pumped as a type of incoherence, or as an indication that it enjoys betting and is willing to pay to do so; similarly you could interpret an agent passing up a “sure thing” bet as incoherence, or just a preference for not betting which it’s willing to forgo money to satisfy. Many humans have one of these preferences!
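
To make the money-pump idea concrete, here's a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of any particular coherence theorem: the three items, the cyclic preference relation `PREFERS`, the fee of 1, and the helper `run_money_pump` are all hypothetical.

```python
# Hypothetical cyclic preferences: each pair (x, y) means "x is strictly
# preferred to y". The agent prefers A to B, B to C, and C to A.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def run_money_pump(holding, money, offers, fee):
    """Offer items one at a time. The agent pays `fee` to swap whenever it
    strictly prefers the offered item to the one it currently holds."""
    for offered in offers:
        if (offered, holding) in PREFERS and money >= fee:
            holding = offered
            money -= fee
    return holding, money


# One trip around the cycle: the agent ends up holding what it started with,
# but with less money.
one_cycle = run_money_pump("C", 100, ["B", "A", "C"], fee=1)

# Two trips around the cycle: same item again, even less money.
two_cycles = run_money_pump("C", 100, ["B", "A", "C"] * 2, fee=1)

print(one_cycle, two_cycles)
```

Note that nothing in the trace itself tells you which interpretation is right. If outcomes are defined as "which item I end up holding", the agent paid for nothing. If outcomes include the trades themselves, the same behaviour is consistent with an agent that values trading and is paying for it.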

Now, these preferences are somewhat odd ones, because you can think of every action under uncertainty as a type of bet. In other words, “betting” isn't a very fundamental category in an ontology which has a sophisticated understanding of reasoning under uncertainty. Then the obvious follow-up question is: which human values will naturally fit into much more sophisticated ontologies? I worry that not many of them will:
  • In a world where minds can be easily copied, our current concepts of personal identity and personal survival will seem very strange. You could think of those values as “predictably losing money” by forgoing the benefits of temporarily running multiple copies. (This argument was inspired by this old thought experiment from Wei Dai.)
  • In a world where minds can be designed with arbitrary preferences, our values related to “preference satisfaction” will seem very strange, because it’d be easy to create people with meaningless preferences that are by default satisfied to an arbitrary extent.
  • In a world where we understand minds very well, our current concepts of happiness and wellbeing may seem very strange. In particular, if happiness is understood in a more sophisticated ontology as caused by positive reward prediction error, then happiness is intrinsically in tension with having accurate beliefs. And if we understand reward prediction error in terms of updates to our policy, then deliberately invoking happiness would be in tension with acting effectively in the world.
    • If there's simply a tradeoff between them, we might still want to sacrifice accurate beliefs and effective action for happiness. But what I'm gesturing towards is the idea that happiness might not actually be a concept which makes much sense given a complete understanding of minds - as implied by the Buddhist view of happiness as an illusion, for example.
  • In a world where people can predictably influence the values of their far future descendants, and there’s predictable large-scale growth, any non-zero discounting will seem very strange, because it predictably forgoes orders of magnitude more resources in the future.
    • This might result in the strategy described by Carl Shulman of utilitarian agents mimicking selfish agents by spreading out across the universe as fast as they can to get as many resources as they can, and only using those resources to produce welfare once the returns to further expansion are very low. It does seem possible that we design AIs which spend millions or billions of years optimizing purely for resource acquisition, and then eventually use all those resources for doing something entirely different. But it seems like those AIs would need to have minds that are constructed in a very specific and complicated way to retain terminal values which are so unrelated to most of their actions.
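
The discounting point in the list above can be illustrated with some rough arithmetic. This is only a sketch under made-up assumptions: the 2% growth rate, 3% discount rate, and 1000-period horizon are hypothetical numbers I picked for illustration, and both function names are my own.

```python
import math


def forgone_multiple(growth_rate, periods):
    """How many times more resources you'd have by reinvesting one unit for
    `periods` periods instead of spending it now."""
    return (1 + growth_rate) ** periods


def discounted_value_of_waiting(growth_rate, discount_rate, periods):
    """Value today, to an exponential discounter, of reinvesting one unit
    for `periods` periods (a unit spent now is worth exactly 1)."""
    return ((1 + growth_rate) / (1 + discount_rate)) ** periods


# Assumed, purely illustrative numbers.
growth, discount, periods = 0.02, 0.03, 1000

multiple = forgone_multiple(growth, periods)
value = discounted_value_of_waiting(growth, discount, periods)

print(f"Waiting yields roughly 10^{math.log10(multiple):.1f} times the resources.")
print(f"The discounter values that at {value:.2e} of a unit spent now.")
```

Under these assumptions the discounter prefers to spend now, since waiting is worth less than 1 to them, even though waiting would yield many orders of magnitude more resources. That gap is what I mean by "predictably forgoes".
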
A more general version of these arguments: human values are generalizations of learned heuristics for satisfying innate drives, which in turn are evolved proxies for maximizing genetic fitness. In theory, you can say “this originated as a heuristic/proxy, but I terminally value it”. But in practice, heuristics tend to be limited, messy concepts which don't hold up well under ontology improvement. So they're often hard to continue caring about once you deeply understand them - kinda like how it’s hard to endorse “not betting” as a value once you realize that everything is a kind of bet, or endorse faith in god as a value if you no longer believe that god exists. And they're especially hard to continue caring about at scale.

Given all of this, how might future values play out? Here are four salient possibilities:
  1. Some core notion of happiness/conscious wellbeing/living a flourishing life is sufficiently “fundamental” that it persists even once we have a very sophisticated understanding of how minds work.
  2. No such intuitive notions are strongly fundamental, but we decide to ignore that fact, and optimize for values that seem incoherent to more intelligent minds. We could think of this as a way of trading away the value of consistency.
  3. We end up mainly valuing something like “creating as many similar minds as possible” for its own sake, as the best extrapolation of what our other values are proxies for.
  4. We end up mainly valuing highly complex concepts which we can’t simplify very easily - like “the survival and flourishing of humanity”, as separate from the survival and flourishing of any individual human. In this world, asking whether an outcome is good for individuals might feel like asking whether human actions are good or bad for individual cells - even if we can sometimes come up with a semi-coherent answer, that’s not something we care about very much.

