Augmented Science

Artificial intelligence is becoming an ever more natural aspect of the research process. But before it can trigger a scientific revolution, researchers first have to better understand just what kind of assistant they've invited into their labs. By Roland Fischer

(From "Horizons" no. 113 June 2017)​​​

Intelligent machines and self-learning systems have been keeping researchers busy for decades now. The first reports of attempts to use machine learning to identify genetic patterns were published over 20 years ago. And in particle physics, they've been experimenting with artificial intelligence (AI) for so long that some reviews from the year 2000 even reported a sagging interest and called for a rapid revival.

"Neuronal networks were actually studied and employed in various experiments at CERN back in the 1990s", recalls Sigve Haug from the Laboratory for High Energy Physics at the University of Bern. They simply didn't call it 'machine learning' at the time.

AI everywhere

Today, the use of such AI methods in large experiments in particle physics is almost the norm, whether in data reconstruction or data analysis. And they are also often used in distributed computing, where programs have to learn when and how computing processes can be distributed in the most efficient manner. But AI isn't just omnipresent at CERN. Suddenly, the situation is very similar everywhere. Artificial intelligence is the current credo in research. Physical chemistry, molecular biology, medical genetics, astrophysics and even the digital humanities: wherever large amounts of data are to be found, AI isn't far away.

Is the development towards AI as laboratory assistant – in other words, towards a mixed research team of man and machine – the next necessary step? "Absolutely", says Karsten Borgwardt, a professor at the Machine Learning and Computational Biology Lab at ETH Zurich. "In many fields in the life sciences where we work with high-throughput technologies, we simply can't do without it any more". The amounts of data are simply too big if you want to link half a million medical histories with the corresponding genetic data. "No human being can recognise any meaningful, hitherto unrecognised patterns with the naked eye any more". Such data volumes can only be handled with efficient statistical procedures such as those currently being developed by specialists like Borgwardt. In any case, the border between statistics and machine learning is fluid today, he says.

Science on steroids

Artificial intelligence as a natural partner in the research process: this vision is reminiscent of Garry Kasparov's 'Advanced Chess' idea that he came up with shortly after his defeat against Deep Blue, almost exactly 20 years ago. In future, he said, humans should no longer play against each other or against machines, but joint man/machine teams should compete instead. This would enable the game to be raised to a whole new level, believed Kasparov: a game of chess beyond the bounds of human strategic possibilities.

"Machine learning is the scientific method on steroids", writes the AI expert Pedro Domingos of the University of Washington in his book 'The Master Algorithm'. In it, he postulates something along the lines of a super-machine-learning method. By means of an intensive use of AI, research would become quicker, more efficient and more profound. This would free researchers from their statistical routine and let them concentrate wholly on the creative aspects of their work. Domingos promises nothing less than a new, golden age of science.

Not all researchers engaged with AI are keen to sing from the same happy song sheet. Neven Caplar of the Institute for Astronomy at ETH Zurich is a data nerd through and through: he runs the data blog astrodataiscool.com and has recently used machine learning to quantify the gender bias in astronomical publications.

For a few years now, Caplar has noticed a definite upswing in publications that include AI. But he doubts whether the methods will allow for any big breakthrough in his field. Astronomy is "a science of biases", he says: it is also about controlling the instruments as well as possible. For this reason, AI shouldn't be treated as a 'black box' – a practical tool that delivers good results but whose precise workings remain incomprehensible. When it comes to handling observation data, its interpretation by a human researcher is still the crucial aspect, says Caplar.

The black-box problem

"Oh, this black box!", cries his colleague Kevin Schawinski (see also: 'The physics of everything', p. 30). Everyone is talking about AI being a 'black box', claiming we aren't able to scrutinise the logic and arguments of a machine. Schawinski is an astronomer, and doesn't see AI like that. From his perspective, it's simply a new research method that has to be calibrated and tested for us to understand it properly. That isn't different from any other method that science has appropriated, he says. After all, there is no one who can comprehend every single aspect of complex experimental assemblies such as the Large Hadron Collider at CERN or the Hubble Telescope. Here, Schawinski trusts the research community just as much. They know how to ensure that the scientific process functions robustly.

Together with colleagues from the computer sciences, Schawinski has launched the platform space.ml, a collection of easy-to-use tools for interpreting astronomical data. He has himself developed a method that uses a neural network to improve images of galaxies; more information can thereby be extracted, without the computer needing further specifications. Other applications employ so-called supervised learning, which draws on a training data set. Pre-sorted by humans, or provided with meta-information, these training sets help the computer devise rules of its own that enable it to fulfil a task.
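
To illustrate the principle, here is a minimal sketch of supervised learning in Python. The data, the labels and the choice of model are invented stand-ins, not Schawinski's actual pipeline:

```python
# Supervised learning in miniature: a classifier derives rules from examples
# that humans have already sorted, then applies them to unseen objects.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: 1,000 objects, each described by 20 measured features and
# pre-sorted into two classes (say, spiral vs. elliptical galaxies).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)          # devise rules from the labelled examples
print(model.score(X_test, y_test))   # test them on objects never seen before
```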

An over-eager assistant

As a biostatistician, Borgwardt uses supervised-learning methods to find out, for example, whether changes in the genome have a harmful effect on an organism. He feeds the computer patterns that have already been determined, hoping that it will subsequently be independent enough to find hitherto unrecognised connections.

But there is a stumbling block: 'overfitting'. He has to check whether the computer really recognises the fundamental characteristics in the training set, or whether, in the rush of data, it is mistaking chance patterns for the rule. Domingos has coined a laconic phrase for this: he sees machine learning as "forever walking the narrow path between blindness and hallucination". On the one hand, an algorithm might recognise nothing at all in the mountain of data before it. On the other, there is overfitting, when it suddenly begins to see things that aren't actually there. In this manner, you can actually 'over-teach' a system: beyond a certain point, the more you train it, the worse it performs on new data.
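
The phenomenon can be made visible in a few lines. In this illustrative sketch, polynomials of increasing flexibility are fitted to the same noisy data: the training error keeps shrinking, but the error on held-out data eventually grows again once the model starts memorising the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # true signal + noise
x_tr, y_tr, x_va, y_va = x[:20], y[:20], x[20:], y[20:]  # train / validation

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_tr, y_tr, degree)              # fit on training half
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    err_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    # degree 15 'hallucinates': tiny training error, large validation error
    print(f"degree {degree:2d}: train {err_tr:.3f}, validation {err_va:.3f}")
```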

According to Borgwardt, one of the main reasons for overfitting in genomics and medicine is that the computer's training set doesn't always have the necessary transparency. This means that you cannot always estimate how much the training data overlap with the data that is to be evaluated later. If the sets are too similar, the worst-case scenario is that the machine doesn't 'generalise' at all, but simply recalls cases it has already memorised whenever it finds a correlation. In this manner, we achieve no real gain in knowledge: the artificial intelligence remains on the level of a common or garden database.
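
The first line of defence is mundane: check the overlap before evaluating. A toy sketch, with invented patient identifiers:

```python
# If training and test data share cases, a model can score well simply by
# memorising them: the appearance of learning without any generalisation.
train_ids = {"patient_001", "patient_002", "patient_003", "patient_004"}
test_ids = {"patient_003", "patient_005", "patient_006"}

overlap = train_ids & test_ids
print(f"{len(overlap)} of {len(test_ids)} test cases were already seen in training")
```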

But even if everything has gone right in training, there remains the problem of differentiating chance correlations from real, statistically significant connections. The bigger the amounts of data, the bigger the probability that genome variants will come together on a purely chance basis, says Borgwardt – and these chance combinations can even correlate with the appearance of a disease. So an important part of his work consists of evaluating significance in extremely high-dimensional spaces – in other words, getting to grips statistically with highly complex situations that are by their very nature multi-causal.
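
A small simulation shows why. Test enough purely random 'genome variants' against a purely random diagnosis and hundreds will look significant at the usual threshold; a standard correction for multiple testing, such as Bonferroni's, is one way of weeding them out. The data here are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patients, n_variants = 200, 10_000
variants = rng.integers(0, 2, size=(n_variants, n_patients))  # random 0/1 variants
disease = rng.integers(0, 2, size=n_patients)                 # random diagnosis

# p-value of the correlation between each variant and the diagnosis
pvals = np.array([stats.pearsonr(v, disease)[1] for v in variants])
print("nominally significant (p < 0.05):", int((pvals < 0.05).sum()))  # roughly 500
print("after Bonferroni correction:", int((pvals < 0.05 / n_variants).sum()))  # typically 0
```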

Machine learning for quanta

AI can trace more than just connections in complex datasets. It can also create completely new materials. But in contrast to the life sciences, machine learning in physical chemistry is not yet very widespread, according to Anatole von Lilienfeld, a chemist and materials researcher at the University of Basel. Nevertheless, he can also see a "rapid upswing" and believes that AI will "inevitably" be an integral part of most study programmes in ten years' time.

His group's work is pioneering. Thanks to AI, he and his team were able to calculate the characteristics of millions of theoretically possible crystals constructed from four specific elements. In the course of this, their AI identified 90 unknown crystals that are thermodynamically stable and would be conceivable as new types of materials. The increase in efficiency when calculating crystal properties – AI is faster by several orders of magnitude – even astonishes an expert like von Lilienfeld. It is so vast that "it's not just about solving conventional problems; whole new research questions are being opened up". But even von Lilienfeld has some reservations. Machine learning only functions if there is a cause-and-effect principle at work, and if there are enough data available. It's also essential that the researchers in question "have enough expertise to devise efficient representations of the objects to be investigated, along with their characteristics".
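
A common workhorse for such property predictions is kernel ridge regression, which learns to map a numerical representation (a 'descriptor') of each material to a property such as its formation energy. The sketch below uses random stand-in data, not von Lilienfeld's actual descriptors; it only shows where the speed-up comes from, since predicting is cheap once the model has been trained on expensive reference calculations:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))          # descriptor vectors of 500 known crystals
y = X[:, :3].sum(axis=1) + rng.normal(0, 0.1, 500)  # stand-in target property

model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
model.fit(X, y)                         # train on the expensive reference data
candidates = rng.normal(size=(5, 30))   # descriptors of unseen candidate crystals
print(model.predict(candidates))        # near-instant predictions for all of them
```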

This degree of expertise was also the decisive issue for Giuseppe Carleo, a theoretical physicist at ETH Zurich. Together with his colleagues, he has found a way of replicating the wave function of quantum systems with a neural network. Once this step had succeeded, optimising the wave function was "really just child's play": the algorithm carried out its task quickly and without any problems. Conventional methods quickly reach their limits with such computational tasks; simulating complex quantum systems was until recently regarded as a computational "impossibility".

Carleo's new approach is based on the methods of 'unsupervised learning', in which the computer learns without any prior knowledge. This is interesting for theoretical physicists, says Carleo, because it makes it possible to see "old problems from new perspectives". Even the engineering sciences and pure research could benefit from this progress.
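
In Carleo's published work, the network in question was a so-called restricted Boltzmann machine, which assigns an amplitude to every configuration of a collection of spins. Here is a minimal sketch of such an ansatz, with random, unoptimised parameters purely for illustration; in practice, the parameters are tuned until the represented state minimises the system's energy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_spins, n_hidden = 10, 20
a = rng.normal(0, 0.01, n_spins)               # visible (spin) biases
b = rng.normal(0, 0.01, n_hidden)              # hidden-unit biases
W = rng.normal(0, 0.01, (n_hidden, n_spins))   # spin-hidden couplings

def psi(spins):
    """Unnormalised amplitude the network assigns to one spin configuration."""
    return np.exp(a @ spins) * np.prod(2 * np.cosh(b + W @ spins))

config = rng.choice([-1, 1], n_spins)  # one of the 2**10 possible configurations
print(psi(config))
```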

Carleo was inspired by last year's triumph of the AlphaGo algorithm in its match against a go master. In that case, the AI had become stronger and stronger by playing innumerable games against itself. This method of reinforcement learning raised AlphaGo's playing intelligence to new strategic levels. Carleo has now adapted it to his own purposes.
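
The principle of reinforcement learning fits into a few lines. In this invented two-action 'game', an agent tries moves, collects rewards and gradually shifts its preference towards whatever wins most often:

```python
import random

random.seed(0)
values = [0.0, 0.0]       # the agent's estimated value of each action
counts = [0, 0]
payoff = [0.3, 0.7]       # hidden win probabilities, unknown to the agent

for step in range(1000):
    # mostly pick the best-looking action, but keep exploring occasionally
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = values.index(max(values))
    reward = 1.0 if random.random() < payoff[action] else 0.0
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # running mean

print(values)             # the estimates converge towards 0.3 and 0.7
```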

Metaphorically speaking, Carleo taught the machine to regard the hunt for the wave function's solution as a game in which the goal is clear, but the path to it is completely open. The point was for the AI to learn to prefer good solution strategies. And indeed, the AI liked the game a lot – so much so that it now masters it like no other intelligence in the world.

Roland Fischer is a freelance science journalist in Bern.