Content filtering and diagnosis: how AI is taught to do complex tasks without big data

Huge datasets are not needed

The history of machine learning began in the middle of the 20th century. Since then, models have come a long way: from simple algorithms that could filter email and detect malware to systems that predict disease progression in patients and beat world-class chess players.

Whatever a model is built for, its job is the same: to predict an output from input data. The more diverse the dataset (the data that “feeds” the model), the easier it is for the algorithm to find patterns, and the more accurate its predictions.

A model needs two main components to work: data and an algorithm. Here, data means labeled information: each input example (for instance, a photograph of a street with pedestrians) is paired with the expected output of the neural network (the outlines of the pedestrians that the network should highlight).
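
As a minimal illustration (not tied to any particular framework, and with invented file names and coordinates), a labeled example for such a pedestrian-detection task might simply pair an image with the boxes an annotator drew:

```python
# A hypothetical labeled example for a pedestrian-detection dataset.
# File names and box coordinates are made up for illustration.
labeled_example = {
    "image": "frames/street_0001.jpg",          # input: a street photo
    "annotations": [                            # expected output: pedestrian outlines
        {"label": "pedestrian", "bbox": [412, 180, 466, 335]},  # x_min, y_min, x_max, y_max
        {"label": "pedestrian", "bbox": [598, 172, 640, 320]},
    ],
}
```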

The world of machine learning is currently dominated by a model-centric approach, so ML engineers spend most of their time on algorithms, the second key component of model performance. The speed and accuracy of a model depend on the choice of algorithm. But even though this approach is simpler and more interesting for engineers, the old principle of “garbage in, garbage out” still applies: if the collected data is not representative, no algorithmic tricks will improve the quality of the model. That is why the focus of engineers is gradually shifting to data.

ML engineers are increasingly turning to data-centric AI, whose core idea is to collect less data, but of better quality. This is more efficient: improving the algorithm typically raises model performance by 0-10%, while improving data quality raises it by 10-30%.

It all starts with data 

In an ideal world, a company that uses machine learning maintains a strong data-collection culture. But collecting data is just the beginning; next comes the time-consuming and expensive labeling process. By following the data-centric approach, ML engineers can achieve much higher model performance than by labeling data “as cheaply as possible”. Here are its main principles:

  • High-quality labeling guidelines

You might think: why formalize every step of defining and solving the task when it can be stated in one sentence? Say we are labeling data for an autopilot; the instruction might read: “select all pedestrians in the photos.” But annotators will quickly run into ambiguous cases: should a cyclist, a person on a scooter, or a passenger in an open vehicle be labeled as a pedestrian? Each annotator will come up with their own answer, and the answers will differ, destroying the consistency of the data. That is why all such ambiguous examples should be entered into a shared reference database that annotators can consult when in doubt. And for such a document to appear, you need feedback from the annotators.
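
As an illustrative sketch (the categories and decisions below are invented, not taken from any real guideline), such an edge-case registry can be as simple as a lookup table that annotators or their tooling consult before escalating to a reviewer:

```python
# Hypothetical edge-case registry for a "select all pedestrians" task.
# Every decision below is an example policy, not an official rule.
EDGE_CASES = {
    "cyclist":              "not_pedestrian",   # riding a bicycle
    "person_on_scooter":    "not_pedestrian",
    "person_pushing_bike":  "pedestrian",       # walking next to the bicycle
    "passenger_open_truck": "not_pedestrian",
    "child_in_stroller":    "pedestrian",
}

def resolve(case: str) -> str:
    """Return the agreed label for a known edge case, or flag it for review."""
    return EDGE_CASES.get(case, "escalate_to_reviewer")

print(resolve("cyclist"))            # not_pedestrian
print(resolve("person_on_stilts"))   # escalate_to_reviewer -> add to the registry
```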

  • Feedback

Such a database cannot appear out of nowhere. It requires two things: a culture of taking annotators’ feedback seriously, and people responsible for keeping the database up to date. As a rule, this is either the most experienced annotator or the data scientist themselves.

These resources should be committed as the core of the team takes shape: people who appreciate the responsibility and importance of the process and help newcomers get involved in it.


  • Cross-checking annotations

A company usually employs more than one annotator, and their skill levels differ, so the same dataset can end up labeled in different ways. The results should therefore be compared periodically: this shows where annotators run into difficulties that should be added to the edge-case database, and it reduces human error.
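
One simple way to run such a check (a sketch, assuming two annotators labeled the same sample of items) is to measure how often they agree, for example with Cohen's kappa from scikit-learn:

```python
# Compare two annotators' labels on the same items to spot systematic disagreement.
# The label lists below are made up for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pedestrian", "pedestrian", "cyclist", "pedestrian", "cyclist",    "pedestrian"]
annotator_b = ["pedestrian", "cyclist",    "cyclist", "pedestrian", "pedestrian", "pedestrian"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below 1.0 signal unclear guidelines

# Items the annotators disagree on are candidates for the edge-case database.
disagreements = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b]
print("Items to review:", disagreements)
```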

  • Running the data through a data scientist first

Before handing the data to annotators, it helps to have a data scientist dive into it and label the first couple of hundred examples themselves. This gives a sense of how solvable the problem is for the model.

Although this division of labor is attractive in terms of cost, one should not expect the same depth of work with the data from annotators as from data scientists: annotators cannot and should not be expected to spot machine learning problems.

If you have to work with domain-specific data, you need industry knowledge. For example, if an algorithm must recognize tumors in X-ray images, the model can only be trained correctly if human specialists are confident that each labeled fragment really contains a neoplasm rather than an imaging defect.

  • "Border" examples are important

The main principle of manual labeling is that it must be targeted. During training, it is possible to estimate which examples the neural network is most likely to “stumble” on. These are the ones worth sending for manual labeling: they improve model quality more than millions of labeled examples the model already gets right.
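
A common way to pick such examples is uncertainty sampling: rank unlabeled items by how unsure the current model is about them. Below is a sketch; the model and data are placeholders, not a specific production setup.

```python
# Uncertainty sampling: send the examples the model is least sure about to annotators.
# `model` is any classifier with predict_proba; the data below is random for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(200, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_unlabeled = rng.normal(size=(1000, 5))

model = LogisticRegression().fit(X_labeled, y_labeled)

# Confidence = probability of the most likely class; low confidence = borderline example.
proba = model.predict_proba(X_unlabeled)
confidence = proba.max(axis=1)
to_annotate = np.argsort(confidence)[:50]   # the 50 least confident examples

print("Indices to send for manual labeling:", to_annotate[:10])
```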

  • Augmentation or synthetic data

If there is little data, or labeling the collected data is too expensive, you can multiply it with augmentation. For example, if the data is text, the same user requests can be rephrased; if it is images, you can change the brightness, crop, and flip some of the pictures (see the sketch at the end of this subsection).

Another way to increase the amount of data is to synthesize it. But synthetic data cannot always replace real data, especially if the generator produces uniform or idealized samples. In that case, synthetic data can be used only at certain stages of training the model.
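
For images, the brightness, crop, and flip transformations mentioned above can be expressed, for instance, with torchvision (a sketch; the parameter values and file name are arbitrary):

```python
# Simple image augmentation pipeline: brightness jitter, random crop, horizontal flip.
# Parameter values are arbitrary; tune them to the task.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3),     # vary brightness by up to ±30%
    transforms.RandomResizedCrop(size=224),     # crop a random region and resize
    transforms.RandomHorizontalFlip(p=0.5),     # flip half of the images
])

image = Image.open("frames/street_0001.jpg")    # hypothetical input image
augmented = [augment(image) for _ in range(5)]  # five new variants of one photo
```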

From theory to practice

  • Social networks

To shield users from abuse, the largest social networks are integrating machine-learning-based detectors of toxic content. In this work, the main challenge is not choosing a model but collecting and analyzing data: toxic content is far rarer than normal content, so the team has to gather a database of such posts across the platform, which itself cannot be done without an algorithm. As a result, data collection takes up to 90% of the data scientists’ time, but the quality of the final model improves.

  • Online retail

A model that turns a recipe into a shopping list, trained on 2 million examples, predictably reached 97% accuracy. At scale it worked well, but for one specific retailer with an atypical product range the quality dropped sharply, to an unacceptable 70%. To solve the problem, the annotation team made sure the new retailer’s data was not drowned out by the mature dataset. It was enough to train the model on a couple of thousand such examples, and the quality climbed back to 97%.

AI helps in retail, and not only by selecting preferred products

  • Conveyor-belt manufacturing

A company that used artificial intelligence to detect defects in parts on a conveyor belt reached 90% model accuracy after the initial work with the data. But this did not meet the client’s requirements.

Trying to improve performance, the ML engineers first “polished” the algorithms without touching the data, which improved the result by only 0.4%. After re-analyzing the data, cleaning poorly labeled examples out of the dataset, and re-labeling the newly collected data, the result rose by 8%.

  • Recommender systems

A recipe app’s recommender system consistently showed a low click-through rate of 5%. Tweaking the algorithms did not help, and data analysis showed that the customers whose data was used to train the model were mostly vegetarians, while the general user population mostly ate meat. A system skewed towards vegetarians captured the interests of everyone else poorly. Rebalancing the training data improved the click-through rate to 11%.
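
Such rebalancing can be as simple as down-sampling the overrepresented group before training. Here is a sketch with pandas; the table contents and the "diet" column are hypothetical.

```python
# Rebalance training data so one user group no longer dominates the model.
# DataFrame contents are invented for illustration.
import pandas as pd

interactions = pd.DataFrame({
    "user_id": range(10),
    "diet":    ["vegetarian"] * 8 + ["meat_eater"] * 2,
    "clicked": [1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
})

smallest = interactions["diet"].value_counts().min()

# Keep an equal number of rows per group; in practice you might up-sample instead.
balanced = pd.concat(
    group.sample(n=smallest, random_state=42)
    for _, group in interactions.groupby("diet")
)
print(balanced["diet"].value_counts())
```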

In the past, the field of artificial intelligence focused mainly on big data: models were trained on extensive datasets. Progress on such models continues, but the focus is gradually shifting to small data and how to work with it. This lowers the barrier to entry into AI: complex solutions can now be built even with a small amount of data.
