Some lessons from the Google machine learning team on developing the machine learning system Seti

A post titled "Lessons learned developing a practical large scale machine learning system" recently appeared on the Google Research blog. The author, who appears to be a member of Google's machine learning team, lists some of the lessons the team learned while developing Seti, a scalable, large-scale machine learning system. The three lessons look simple and obvious, yet in practice we are easily tempted to the opposite extreme. A typical scenario: engineers with some front-line project experience already sense these principles intuitively, but because the principles have never been clearly stated, or stated by an authoritative voice, those engineers can still be led astray by attractive ideas that only work on the design document, wasting both labor and money. The simplest truths are the ones most often violated, so I want to write them down. On the one hand, this reminds me to resist the same temptations in the future; on the other hand, I hope it gives readers with similar application scenarios and ideas an authoritative summary and supporting evidence.

Most of what follows is a translation that stays faithful to the original post, interspersed with some of my own views.

The machine learning system discussed here is really a classification system (or pattern recognition system). In most cases the two terms can be treated as equivalent, although a reply to the post points out that machine learning is broader in scope than pattern recognition. I will not dwell on which is broader; it is enough to know that the machine learning system, Seti, discussed below is mainly a classifier, with the following characteristics:

For a prediction or classification problem, if you do not have enough data, you must focus on using statistical knowledge to build a sophisticated classifier from a small sample. Conversely, if your data volume is overwhelming, you must focus on how your system can handle such a large sample and mine useful information from it. The scale of the problems Seti solves is roughly as shown in the following table:

          Training set size    Unique features
Mean      100 Billion          1 Billion
Median    1 Billion            10 Million

Generally speaking, a good machine learning system is expected to emphasize accuracy, but for a large-scale system such a one-sided emphasis easily leads to mistakes. Here are some lessons we accumulated while developing Seti. Some of them were only recognized in hindsight; we did not realize them at the time. (Note: the author seems to be saying that some factors cannot be ignored and may matter even more than accuracy.)

1. Keep it simple, even at the expense of a little accuracy

Temptation: High accuracy matters for classifiers in every application, so we should focus on the accuracy of the algorithm.

In practice, however, several other aspects of the system are just as important as the algorithm's accuracy:

Ease of use: If other people or other teams use the system, they will want it to be simple to configure and use. They may not be machine learning experts, so they do not want to waste time getting the system started and running.

System reliability: People are far more willing to deploy a machine learning system in a real-world environment when it is reliable. It must be stable, and nobody should need to watch constantly for crashes. Early versions of Seti were more accurate, but their complexity, the load they put on the network and the GFS file system, and the need to keep an eye on them made many people unwilling to deploy them.

(In many cases the two points above can be treated as one: a system that is easy to use is also one that is stable and reliable.)

Seti is usually used in scenarios where it greatly improves on the existing system (see Lesson 3), so people care less about the small accuracy differences between the algorithms Seti might use. Moreover, such small differences can often be made up by other means, such as better data filtering, adding more appropriate features, or tuning parameters. If the system is stable, scalable, and easy to use, these additional steps are easier to carry out, and these system characteristics often determine whether a team adopts the system or abandons it.

In academia, designing an algorithm that is less accurate but more stable and simpler to use is not an attractive goal, but in our experience it has great value in practice.

2. Start with a few specific applications in mind

Temptation: Build a system that is not tied to any particular application, one that serves not only current applications but also future classification tasks.

However, we decided to focus on a small set of initial applications, for several reasons:

3. Know when to say "no"

Temptation: When you have a hammer, everything looks like a nail, so any problem can be solved with a machine learning system.

We learned long ago that although machine learning systems bring significant benefits, they also add complexity, opacity, and unpredictability to the overall system. In some cases, simple techniques are enough to solve the problem at hand. In the long run, the effort spent integrating, maintaining, and diagnosing a live machine learning system may be better spent on other ways of improving system performance.

The precondition for applying Seti is that it significantly improves prediction quality over the current system. We also often advise against applying it in scenarios where the improvement is not significant.
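As a rough illustration of this precondition, the following Python sketch compares a candidate classifier against the existing system on the same labeled examples and recommends adoption only when the accuracy gain clears a margin. The names `accuracy`, `should_adopt`, and `min_gain`, the 0.05 default, and the toy data are all assumptions made for illustration; the original post does not describe how the Seti team measured improvement.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def should_adopt(baseline_preds, candidate_preds, labels, min_gain=0.05):
    """Return True only if the candidate beats the baseline by at least min_gain.

    min_gain is a hypothetical threshold: the extra accuracy that would justify
    the added complexity of integrating and maintaining a learned system.
    """
    gain = accuracy(candidate_preds, labels) - accuracy(baseline_preds, labels)
    return gain >= min_gain


# Toy evaluation set (invented data).
labels        = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
rule_preds    = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # existing simple rule: 8 of 10 correct
learned_preds = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # candidate classifier: 9 of 10 correct

print(should_adopt(rule_preds, learned_preds, labels, min_gain=0.05))
print(should_adopt(rule_preds, learned_preds, labels, min_gain=0.2))
```

With the lower margin the gain of roughly ten points is enough to adopt the candidate; with the stricter margin the answer is "no", which is the case the lesson warns about.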

Supplement 1: When I saw the scale of the data Seti works with, my first reaction was to wonder how such a large amount of labeled data could be obtained. Training a classifier requires a labeled data set, the text repeatedly mentions classifier accuracy, and computing that accuracy also requires labeled data. My guess is that there are two possibilities: in one, the labels come from clicks that Google users contribute without intending to; in the other, the team uses a semi-supervised learning approach that starts with a small, manually labeled data set and then extends the labels to the full data set.
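To make the second guess concrete, here is a toy Python sketch of self-training, one common form of semi-supervised learning. It is not Seti's method, which the post does not describe. The nearest-centroid classifier, the one-dimensional features, and the names `centroids`, `predict_with_margin`, `self_train`, and `min_margin` are all assumptions chosen to keep the example short.

```python
def centroids(points, labels):
    """Mean feature value per class for 1-D points."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_margin(x, cents):
    """Return (nearest class, margin); margin is the distance gap to the runner-up.

    Assumes at least two classes are present in cents.
    """
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best, second = ranked[0], ranked[1]
    margin = abs(x - cents[second]) - abs(x - cents[best])
    return best, margin


def self_train(seed_points, seed_labels, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow a small labeled seed set by adopting confident predictions as labels."""
    points, labels = list(seed_points), list(seed_labels)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, labels)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict_with_margin(x, cents)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing left that the model is sure about
        for x, label in confident:
            points.append(x)
            labels.append(label)
        pool = remaining
    return centroids(points, labels), pool


# Two hand-labeled seeds and five unlabeled points (invented data).
final_centroids, still_unlabeled = self_train(
    [0.0, 10.0], ["neg", "pos"], [1.0, 2.0, 8.0, 9.0, 5.1]
)
print(final_centroids)
print(still_unlabeled)
```

Points far from the decision boundary are absorbed into the training set, while the ambiguous point near the middle stays unlabeled. In a real system the confidence rule and the classifier would be far more elaborate.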

Supplement 2: One of the main points of the post is that commercial systems differ from the systems academia pursues. Academia tends to study how to get statistically meaningful results even when data is scarce, and accuracy is always the most important criterion. Industry, which has no shortage of data, needs to focus on how to filter valuable information out of noisy data, and its systems must be scalable. In that setting, accuracy is not the only important factor.
