Anomaly hunters: how CERN searches for rare particles using Yandex algorithms

Andrey Ustyuzhanin — Head of the Research and Educational Laboratory of Big Data Analysis Methods at the National Research University Higher School of Economics.

Head of joint projects at Yandex and CERN. Participates in the development of the EventIndex and EventFilter services, which Yandex has been providing for the LHCb experiment since 2011.

Graduated from the Moscow Institute of Physics and Technology in 2000; Candidate of Physical and Mathematical Sciences. Served as a judge at the Microsoft Imagine Cup international final; before that, he mentored the MIPT team that won the cup in 2005.

How to look for anomalies in the data of the Large Hadron Collider

— What are data anomalies?

— If we talk about data obtained with the Large Hadron Collider (LHC), these may be discoveries that do not fit the standard picture of how the decays of particles produced in proton collisions proceed. Those discoveries would be anomalies.

For example, if we are talking about asset quotes on an exchange, anomalies there may arise because some hedge fund decided to pump an asset, or the Wall Street Bets crowd decided to make some extra money and effectively set up their own distributed hedge fund. That is, the underlying physics is completely different, and the way that physics shows up in the data is also unlike other cases.

So when we talk about anomalies, we first need to understand what kind of data and what kind of physics we are dealing with.

— Then let's narrow it down to colliders.

— Here it is a little simpler, although a fork arises here as well. The point is that there are data on the processes that happen to particles inside the detector, and there are data on how the collider itself operates. People whose main interest is discovering new particles or laws mostly care about the first type of data. But everything that happens in physics passes through a rather long chain of collecting and processing this information, and if any node in that chain starts behaving worse than we assumed, that is, goes beyond certain permissible limits, it distorts the measurements. We may then see anomalies where, physically, there were none.

Discoveries that do not fit the standard picture of how the decays of particles produced in proton collisions proceed will be anomalies

To avoid such unpleasant situations, people write dedicated data quality control systems that monitor all the readings from the measuring instruments and try to exclude from consideration the periods of time when there is a suspicion that something is going wrong.

One example that physicists at the LHC like to recall: in the early stages of the collider's operation (at the time it was not yet the LHC but its predecessor), they noticed anomalies that did not fit the physical picture. In the end, the physicists found that the anomalies correlated very strongly with the schedule of trains on a nearby railway line. Unless you correct for those fluctuations, you end up with a non-physical picture of the world.

One has to take external factors into account and understand which of them need to be compensated for, and how. The simplest solution is to throw out the data that does not fit the usual picture of the world. A more sophisticated approach is to use well-understood physical principles to bring these anomalies back into the mainstream of normal data and try to extract value from them.

Throwing out data is a waste of budget money: every kilobyte and megabyte has its price.

Andrey Ustyuzhanin, Head of the Research and Educational Laboratory for Big Data Analysis Methods at the National Research University Higher School of Economics

— And how, then, can anomalies be detected in these data using a machine learning system?

— There are two groups of algorithms that work with anomalies. The first group, one-class classification methods, includes algorithms that use information only about events labeled as good. They essentially try to build a hull that encloses everything we consider normal. The logic is this: anything that falls outside this hull we will treat as an anomaly. For example, 99% of the data is covered by such a hull, and everything else looks suspicious.
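
As a rough illustration of this one-class idea (not tied to the actual LHC pipelines), here is a minimal scikit-learn sketch: a OneClassSVM is fit only on events labeled "good" and flags everything outside the learned boundary. The data and the roughly 1% threshold are invented for the example.

```python
# A minimal sketch of the one-class idea above (not the actual LHC pipeline):
# a OneClassSVM is fit only on "good" events and flags everything outside
# the learned boundary. Data and the ~1% threshold are invented.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(5_000, 5))    # "good" events only
anomalies = rng.normal(loc=4.0, scale=1.0, size=(100, 5))   # unseen at training time

# nu ~ fraction of training points allowed to fall outside the boundary (~1%)
clf = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(normal)

# +1 = inside the "hull" (normal), -1 = outside (suspicious)
print("flagged among normal events: ", np.mean(clf.predict(normal) == -1))
print("flagged among anomalies:     ", np.mean(clf.predict(anomalies) == -1))
```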

The other group of algorithms relies on partial labeling of what we consider wrong. In practice there is a set of events known to be undesirable results, and the search for anomalies then reduces to a two-class classification problem: an ordinary classifier that can be built on neural networks or decision trees.

The nuance is that in anomaly detection tasks the sample is usually unbalanced: the number of positive examples far exceeds the number of negative ones. Under such conditions standard classification algorithms may not work as well as we would like. The default loss function treats all correctly classified instances equally and may overlook the fact that among 10,000 correct results there are a hundred that are classified incorrectly, and that hundred is exactly the set of negative examples we are most interested in. This can, of course, be countered, for example by assigning larger weights to the negative examples so that errors on them contribute much more to the loss.
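
A minimal sketch of the re-weighting trick mentioned above, on made-up data: a two-class classifier sees 10,000 normal events and 100 anomalies, and the rare class is up-weighted in the loss via scikit-learn's class_weight option.

```python
# Sketch of the re-weighting idea: 10,000 "good" events vs 100 anomalies,
# with the rare class up-weighted in the loss. Numbers are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
X_good = rng.normal(0.0, 1.0, size=(10_000, 5))
X_bad = rng.normal(2.0, 1.0, size=(100, 5))
X = np.vstack([X_good, X_bad])
y = np.array([0] * len(X_good) + [1] * len(X_bad))   # 1 = anomaly

# class_weight="balanced" rescales each class by n_samples / (n_classes * n_in_class),
# so mistakes on the 100 anomalies are not drowned out by the 10,000 normal events.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(classification_report(y, clf.predict(X), digits=3))
```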

A loss function is a function that, in statistical decision theory, characterizes the losses incurred by making incorrect decisions on the basis of the observed data.
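
For concreteness (this formula is an illustration, not taken from the interview), the re-weighting described above can be written as a weighted binary cross-entropy, where the weight on the rare negative class, w_-, is chosen much larger than w_+:

```latex
% Weighted binary cross-entropy: y_i = 1 for "good" (positive) events,
% y_i = 0 for anomalies; p_theta(x_i) is the predicted probability of the positive class.
L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}
  \Big[\, w_{+}\, y_i \log p_\theta(x_i)
        + w_{-}\, (1 - y_i) \log\big(1 - p_\theta(x_i)\big) \Big],
\qquad w_{-} \gg w_{+}
```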

Our laboratory's contribution to anomaly detection lies in proposing methods that combine features of the first and second approaches, that is, of one-class and two-class classification. Such a combination becomes possible if generative models of the anomalous examples are built.

Using approaches such as generative adversarial networks or normalizing flows, we can learn to reproduce the examples labeled as negative and generate an additional sample, so that a regular classifier can work more effectively with the augmented synthetic dataset. This approach works well both for tabular data and for images. We published an article about this last year that describes how such a system is built and gives practical examples of its use.
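
The paper itself is not reproduced here, but the augmentation idea can be sketched in a few lines: fit a generative model on the handful of labeled anomalies, sample synthetic anomalies from it, and train an ordinary classifier on the augmented set. In this toy version a kernel density estimate stands in for the GANs and normalizing flows mentioned above, and all the data are synthetic.

```python
# Toy illustration of the augmentation idea: learn a generative model of the
# few labeled anomalies, sample extra synthetic anomalies from it, and train an
# ordinary classifier on the augmented set. KernelDensity stands in for the
# GANs / normalizing flows mentioned above; all data here are synthetic.
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
X_normal = rng.normal(0.0, 1.0, size=(10_000, 4))
X_anom = rng.normal(3.0, 0.5, size=(50, 4))          # only 50 labeled anomalies

# 1. Fit a generative model on the labeled anomalies and sample from it.
kde = KernelDensity(bandwidth=0.5).fit(X_anom)
X_synth = kde.sample(2_000, random_state=2)

# 2. Train a regular two-class classifier on the augmented sample.
X = np.vstack([X_normal, X_anom, X_synth])
y = np.array([0] * len(X_normal) + [1] * (len(X_anom) + len(X_synth)))
clf = GradientBoostingClassifier().fit(X, y)

print("mean anomaly score, normal data:", clf.predict_proba(X_normal)[:, 1].mean().round(3))
print("mean anomaly score, anomalies:  ", clf.predict_proba(X_anom)[:, 1].mean().round(3))
```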

— You mentioned working with images. How does it work in this case?

— There are examples on which we demonstrated how this algorithm works. We simply took one of the image classes, say handwritten digits, and declared zero to be the anomaly: the neural network has to decide that zeros are unlike everything else and assign them to the negative class. Naturally, it does not have to be zeros; it could be, for example, digits that contain closed loops, such as 0, 6 and 8, or digits with horizontal strokes, or simply images rotated by some angle relative to the rest of the sample.
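
A small sketch of this "zeros as anomalies" setup, using scikit-learn's built-in 8x8 digits as a lightweight stand-in for MNIST; keeping only about twenty zeros is an arbitrary choice that simply makes the anomalous class rare.

```python
# Sketch of the "zeros as anomalies" setup on scikit-learn's built-in 8x8 digits
# (a small stand-in for MNIST): digit 0 is declared anomalous and kept rare.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, digit = load_digits(return_X_y=True)
y = (digit == 0).astype(int)                     # 1 = "anomalous" zeros

# keep only ~20 zeros so the anomalous class is genuinely rare
keep = np.concatenate([np.where(y == 0)[0], np.where(y == 1)[0][:20]])
X, y = X[keep], y[keep]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]).round(3))
```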

“We can simulate physics under certain external parameters with good accuracy and say what observable characteristics will describe the correct signal events, for example, the decay of the Higgs boson”

There is a dataset called Omniglot: characters written in many different alphabets. There is a huge number of scripts, from the Futurama alphabet and Gothic to handwritten characters from less common writing systems such as Sanskrit or Hebrew. We can declare the Sanskrit characters to be an anomaly, or the characters written in a particular hand.

We ask the system to learn to distinguish these anomalous symbols from everything else. The key point is that there are far fewer of them than of everything else, which is exactly what makes them difficult for conventional machine learning algorithms.

Symbiosis of physics and IT: how machine learning is used in LHC research

— What tasks of the LHC are solved with the help of machine learning?

— One big task we are working on is speeding up the computations that simulate physical collisions and particle decays. The point is that the decision about whether observed events resemble particular physical decays is made only after analyzing a fairly large number of simulated decays. We can simulate the physics under certain external parameters with good accuracy and say which observable characteristics will describe the correct signal events, for example the decay of the Higgs boson.

But there are caveats: we do not always know the parameters at which these decays should be generated; as a rule, there is only a rough idea of them. And the task of finding the right physics comes down to distinguishing signal events from background events, which can be caused either by incorrect behavior of the reconstruction algorithms or by the physics of other processes that look very similar to what we are trying to find. Machine learning algorithms handle this well, but that is a fairly well-known story.

But training such algorithms requires a rather large statistical sample of simulated events, and computing these synthetic data takes considerable resources: simulating a single event takes about a minute, or even ten minutes, of computing time at modern computing centers. Since the number of real events that physicists will work with is set to grow by orders of magnitude in the coming years, the number of synthesized events must grow as well. Already, computing resources barely cover researchers' needs, because to simulate one event we have to compute the interaction of the particles with the structure of the detector and model, with very high accuracy, the response we will see on the detector's sensors.

The idea behind the speed-up is to train a neural network on events simulated with a certified package, Geant4, which models everything that happens inside the collider's detectors. This network learns to map the inputs, the parameters of the particles we want to simulate, to the outputs, the observable characteristics that the detector produces. Neural networks are already quite good at interpolating data, and several of our laboratory's projects are aimed precisely at this: reconstructing the characteristics of decays from the available synthetic sample, producing, as it were, second-order synthetics. There is one more point: an advantage of neural networks is that we can fine-tune them on real data, making the model more accurate for a specific physical decay.
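
A toy sketch of this surrogate idea, under heavy simplification: a small neural network learns the mapping from generator-level parameters to a "detector response". The expensive_detector_sim function below is invented for illustration and only stands in for a real Geant4-based simulation; in practice generative models are used so that the full response distribution, not just its mean, is reproduced.

```python
# Toy sketch of the surrogate idea: a small neural network learns the mapping
# from generator-level parameters (momentum, angle) to a "detector response".
# expensive_detector_sim is invented and only stands in for a real Geant4-based
# simulation; real fast-simulation work uses generative models for the full
# response distribution rather than a plain regressor.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

def expensive_detector_sim(params):
    """Pretend this takes minutes per event in a full detector simulation."""
    p, theta = params[:, 0], params[:, 1]
    energy = p * np.cos(theta) + rng.normal(0.0, 0.05, size=len(p))  # smeared deposit
    width = 0.1 + 0.02 * p + rng.normal(0.0, 0.01, size=len(p))      # shower width
    return np.column_stack([energy, width])

params = np.column_stack([rng.uniform(1, 10, 10_000), rng.uniform(0, 1, 10_000)])
response = expensive_detector_sim(params)

# The surrogate interpolates the simulated mapping and could later be fine-tuned.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300,
                         random_state=0).fit(params, response)
print("surrogate output for p=5, theta=0.3:", surrogate.predict([[5.0, 0.3]]).round(3))
```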

People who do full-fledged physical simulation spend a great deal of time and effort on it; with neural networks it turns out somewhat less labor-intensive. From the results we obtained for the LHCb experiment at CERN and for the project with the MPD experiment at the NICA accelerator in Dubna, it became clear that neural networks can cover the phase space of simulated events with very high accuracy. And they speed up the computation significantly: tens and even hundreds of times faster than an honest full simulation.

— How does the training of such a neural network proceed?

— The training process itself is no different. But there is one feature: for such a neural network, in addition to the training sample, you need to formulate quality criteria, that is, define a loss function that best matches the task this network has to handle well. Moreover, the quality of such a network is not something researchers can judge directly: it can only be adequately assessed through the computational steps that happen at a later stage of data processing.

We can determine whether a simulation is good only after we pass the events through the chain of analysis and reconstruction and see that the same characteristics we originally put into them are recovered from them. This means that, for example, a simple metric such as MSE (Mean Squared Error) is not enough.

MSE (Mean Squared Error) measures the average squared difference between the estimated values and the actual values.
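
To illustrate why per-event MSE can be misleading for simulation quality (a simplified example, not the actual validation procedure used at the LHC): a "generator" that always outputs the average response minimizes MSE, yet its spectrum is useless, which only a distribution-level check reveals. A two-sample KS test is used here as a simple stand-in for the checks done after reconstruction.

```python
# Why per-event MSE alone is not enough: always predicting the average response
# minimizes MSE, yet its spectrum is a spike at the mean and physically useless.
# A two-sample KS test stands in for the distribution-level checks done after
# reconstruction; the data are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)
true_response = rng.normal(0.0, 1.0, 50_000)                        # "real" spectrum

mean_predictor = np.full_like(true_response, true_response.mean())  # minimizes MSE
honest_sample = rng.normal(0.0, 1.0, 50_000)                        # reproduces the spectrum

for name, pred in [("mean predictor", mean_predictor), ("honest sample", honest_sample)]:
    mse = np.mean((pred - true_response) ** 2)
    pval = ks_2samp(pred, true_response).pvalue
    print(f"{name:14s}  MSE = {mse:.3f}  KS p-value = {pval:.3g}")
```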

The behavior of the neural network also needs to be evaluated further, in particular on parameter ranges that may not have been present in the training set. Building models that behave well beyond the parameter values known at the training stage is a large and still largely theoretical task.

Neural networks are good in the regions where they saw something at the training stage. Outside those regions they can output whatever they please. In our case this is especially sensitive, because the correctness of the physical interpretation of the world around us depends on it.

“If a dark matter particle decays into particles with which we know how to interact, it can be assumed that this dark matter particle really existed”

— So the neural network looks for rare events that can occur at the collider?

— It is built on the work of generative models. First we synthesize everything that can happen; we do this with compact generative models. Then, on top of the output of such networks, we can build a model that will look for what we need: what we managed to generate with the generative neural network.

How to search for dark matter and why neural networks are needed for this

— Can a similar search principle be applied to dark matter?

— The point is that dark matter can be searched for in different ways. One way is to build a detector that can be isolated fairly well from the effects of ordinary matter, that is, one that blocks the signal coming from particles already known to physicists. This is essentially a process of elimination: if such a detector sees something other than noise, it is seeing something we have never seen before, and one possibility is that these are dark matter particles.

If, for example, a dark matter particle decays into particles we know how to interact with, and it is clear that the decay traces could not have come from anywhere else, then we can assume that this dark matter particle really existed.

Such experiments are being discussed and planned. One of them is called SHiP (Search for Hidden Particles), and, by the way, the approaches I spoke about apply to it as well: it needs simulation and algorithms for recognizing rare events. But since the luminosity of this experiment is much lower (luminosity being the number of particles expected to be detected per unit time), the need to simulate a huge number of similar events is not as acute as for the Hadron Collider detectors. Still, for example, the task of assessing how well the shielding system suppresses particles of known physics does require simulating a fairly large number of events, in order to make sure the shielding copes well with the enormous number of incoming particles of various types.

SHiP is an experiment aimed at searching for hidden particles, including dark matter particles, in the particle stream from the SPS accelerator, filtered by magnetic fields and a five-meter layer of concrete and metal.

There are other ways to search for dark matter, associated with observations of cosmic phenomena. One approach, in particular, is to build sensing elements that can recognize the direction of very weakly interacting particles from the angle at which a particle arrives. The logic of the experiment is that the sensitive elements can be placed so that they are oriented along the motion vector of the solar system, that is, toward the constellation Cygnus. Then we will be able to distinguish particles that move with the Earth's coordinate system from particles that move differently, like a fixed ether spread through outer space according to its own laws, unconnected to the orientation and motion of the planets. Only instead of ether, the assumption is that there are dark matter particles. They may interact weakly with the sensors of our experiment, and by analyzing the readings it is possible to derive patterns in the angular distributions of the interacting particles. If we see a significant component that does not depend on the Earth's position in space, it will point to the existence of previously unknown particles, and perhaps these will be candidates for dark matter particles.

Simulation is quite important in such an experiment, because to build an algorithm that recognizes signal events you need to know what the signal of interest looks like. So the tasks of fast simulation and anomaly detection are relevant and applicable there as well.

They speak different languages, but the goals are common

— Let's talk about working at CERN. What is it like for an IT person to work with physicists? What is special about working in such a cross-disciplinary space as the LHC?

— Good question. People really do speak different languages: it goes as far as the same concepts being drawn differently. For example, the ROC curves that machine learning specialists are used to are usually drawn in physics rotated by 90 degrees, and the axes are not called True Positive Rate and False Positive Rate but Signal efficiency and Background rejection. Here Signal efficiency is simply the True Positive Rate, while Background rejection is one minus the False Positive Rate.

A ROC curve (from the English receiver operating characteristic) is a plot for assessing the quality of binary classification. It shows the relationship between the proportion of objects carrying the feature that are correctly classified as carrying it, and the proportion of objects without the feature that are erroneously classified as carrying it.
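
The two conventions can be connected in a few lines of scikit-learn (synthetic scores, purely illustrative): compute the usual ROC curve, then relabel its axes as signal efficiency and background rejection.

```python
# The same ROC information in both conventions: TPR vs FPR (machine learning)
# and signal efficiency vs background rejection (particle physics). Scores are synthetic.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(5)
scores = np.concatenate([rng.normal(0, 1, 5_000), rng.normal(2, 1, 5_000)])
labels = np.concatenate([np.zeros(5_000), np.ones(5_000)])   # 1 = signal

fpr, tpr, _ = roc_curve(labels, scores)
signal_efficiency = tpr            # fraction of signal kept
background_rejection = 1.0 - fpr   # fraction of background rejected

# e.g. report the background rejection at roughly 90% signal efficiency
i = np.searchsorted(signal_efficiency, 0.90)
print(f"at {signal_efficiency[i]:.2f} signal efficiency, "
      f"background rejection = {background_rejection[i]:.2f}")
```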

Such things are, of course, superficial and relatively easy to get used to, but the main difficulty lies in understanding the basic assumptions researchers start from when writing their papers. As a rule, those assumptions stay outside the scope of what they write about. It is a kind of tacit knowledge that is passed on while a person is trained in graduate school and works on their research projects, gradually taking shape in their mind.

For people from another field of science this is like a different cultural environment, and those assumptions may not be obvious to them at all. Because the vocabularies turn out to be extensive and different, building a dialogue can drag on or even prove unproductive. So, as a recommendation, one option is to ask people to step outside what they are used to and formulate the problem in terms maximally abstracted from the physics. We do this in part when we organize competitions as part of our IDAO olympiad: in the course of the dialogue we find a problem setting that does not require deep immersion in physics, yet is still interesting for machine learning specialists.

This year we had a joint project with an Italian laboratory that is searching for exactly this kind of dark matter. They provided synthetic data for the olympiad problem on the search for dark matter. There is no actual dark matter in the data, because decays of known physics were simulated: collisions of electrons and helium ions. But collisions of dark matter particles can look very similar to some of these collisions; they are very difficult to simulate and even harder to interpret. So, especially for people who are not experts in the field, we decided not to include that data and to limit ourselves to data that merely resembles it. The algorithms we end up with work on this approximate data but can be applied to the real thing.

Andrey Ustyuzhanin. Photo from the speaker's archives

To sum up, one way is to agree on terms that are clear to everyone; the other is to invest time and effort, attend summer schools and take part in practical research projects.

Books about machine learning and physical experiments recommended by Andrey Ustyuzhanin:

  • Deepak Kar, Experimental Particle Physics: Understanding the Measurements and Searches at the Large Hadron Collider.
  • Ilya Narsky, Statistical Analysis Techniques in Particle Physics: Fits, Density Estimation and Supervised Learning.
  • Giuseppe Carleo et al., Machine Learning and the Physical Sciences.

— Are there any conflicts between the values of physicists and those of IT specialists: for example, does one side care more about the nature of the interactions and the other, on the contrary, about accuracy?

— If we are talking specifically about accuracy, there is probably no real disagreement. The issue is more that IT people do not always understand the nature of the data. Put simply, if the data were measured to millimeter precision, there is no point in computing an area to within square microns. With complex neural networks we run into the fact that they output numbers down to the last digit of the mantissa, but those digits carry no more meaning than the precision that was there at the input.

Perhaps a general wish for people who evaluate the accuracy of models: report not only point values but also the limits of acceptable ranges or the spread within which those values were obtained. That is a good recommendation not only for those who work with physicists or biologists; it is, in principle, the right way to present results.
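
A tiny sketch of that recommendation on synthetic data: quote the spread across cross-validation folds rather than a single score (a bootstrap over test events would serve the same purpose).

```python
# Report a spread, not a single number: here the spread comes from
# cross-validation folds (a bootstrap over test events would also do).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3_000, n_features=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=5, scoring="roc_auc")
print(f"ROC AUC = {scores.mean():.3f} +/- {scores.std():.3f} (5-fold CV)")
```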

As for how expectations may differ on one side and the other, these are really all working matters: if there is interest on both sides, they get resolved easily and well. For physicists in the broadest sense, machine learning is now in demand because it provides more accurate tools for working with their data. And it works the other way too, because it is far more interesting for machine learning specialists to see how their algorithms help discover new particles, as in the case of our laboratory: we have long been working on an algorithm that determines the type of a particle, and when the discovery of new tetraquarks was announced recently, our algorithms played a direct part in it.

So for people from IT, broadly speaking from Data Science and Computer Science, it is very important to feel that the algorithms they develop are useful. That is why our faculty, for example, has an International Bioinformatics Laboratory.

Such interactions are becoming more and more normal. I don't know whether they can be considered mainstream already or whether we still have to wait, but one way or another this trend is inevitable. Just look at the workshops organized at today's leading artificial intelligence conferences: the workshop on applying AI in the physical sciences is among the leaders in the number of people interested.
