Can big data-based data mining technology be "fortune-telling"?

Although BAT (Baidu, Alibaba, Tencent) has an advantage in data volume, its data is limited in richness, and it lacks even the user-generated content (UGC) data that vertical domains produce. SMEs can use their deep penetration of vertical fields to enrich their data and gain a differentiated advantage.


The threshold of big data

"If you only have a bunch of people's phone numbers, that may not mean much. But data like Ctrip's, such as everyone's bookings, searches, browsing and comments, is valuable. The deeper core, though, is that you can apply this data to a specific product and have it really help," said Jiao Yu, general manager of the data intelligence business unit of the company.

The head of the Meituan Cloud big data platform agreed. "The first thing to do is to find out whether the data is valuable and whether anyone is willing to pay for it. In addition, richer source data can supplement and improve the value of the data."

Obviously, the purpose of collecting data is not just to centralize it, but ultimately to put it to work in actual operations. Having data is only the beginning. How to analyze the data and understand the relationships within it is the key to big data applications, and it is also the watershed that separates many big data companies.

However, one problem in this process cannot be ignored: the quality of the data. "Wrong input inevitably means wrong output," Han Xin, director of cellular data technology, pointed out in an interview.

"What really decides the success of data mining is the quality of the data itself; the sound use and optimization of algorithms is secondary. With the rise of big data, we can easily obtain complex data. However, simply hoping that advanced algorithms will yield the information we want while ignoring the quality of the data itself often amounts to building a castle in the air."


For big data, it seems on the surface that the more data the better, because more data can produce a picture that fits the real situation more closely. But more data also produces more noise, so a pure increase in the amount of data does not increase the accuracy of the calculation.
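The point about volume versus noise can be illustrated with a small simulation. This is only a sketch under invented assumptions: the "true value", the share of corrupted records and their distribution are all hypothetical, chosen to show how a small clean data set can give a better estimate than a much larger contaminated one.

```python
import random
import statistics

TRUE_VALUE = 10.0  # hypothetical quantity we want to estimate


def clean_sample(rng, n):
    """n measurements with small, unbiased noise around the true value."""
    return [rng.gauss(TRUE_VALUE, 1.0) for _ in range(n)]


def noisy_sample(rng, n, bad_fraction=0.3):
    """n measurements of which roughly bad_fraction are corrupted records."""
    data = []
    for _ in range(n):
        if rng.random() < bad_fraction:
            # Corrupted record: unrelated to the true value (assumed distribution).
            data.append(rng.gauss(25.0, 5.0))
        else:
            data.append(rng.gauss(TRUE_VALUE, 1.0))
    return data


def estimation_error(data):
    """Absolute error of the sample mean as an estimate of TRUE_VALUE."""
    return abs(statistics.fmean(data) - TRUE_VALUE)


rng = random.Random(0)  # fixed seed so the run is repeatable
small_clean_error = estimation_error(clean_sample(rng, 100))
large_noisy_error = estimation_error(noisy_sample(rng, 10_000))

print(f"100 clean records:    error = {small_clean_error:.2f}")
print(f"10,000 noisy records: error = {large_noisy_error:.2f}")
```

Under these assumptions the corrupted records pull the average far from the true value, and adding more records of the same quality does not remove that bias.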

Therefore, having high-quality data is far more valuable than holding a pile of messy data: it reduces the difficulty of data mining and improves its accuracy. But is this the core threshold for big data?

Han Xin believes: "The establishment of a complete big data system requires two important factors, the richness of the business and the integration of data thinking."

Drawing on his own practical experience, Jiao Yu gave his view: "For a particularly good product manager, the threshold for big data is, first, understanding what this thing is, and second, strong modeling ability. On both counts, talent is relatively scarce. For example, some companies have big data, but finding a very good person to work on it, although possible in theory, is in fact very difficult."

"The first is big data itself. Second, some people compare data to 'oil': if there is an oil deposit, you need tools to dig it out, and that tool is machine learning. The third is the advance of computing power: however strong the tool is, without very strong computing power it still cannot run." He Xiaofei, president of the Didi Research Institute, gave this answer.

Difficulties in data mining

Data mining is not like collecting a few sheets of data, which can be handled simply by asking a few questions. It is a highly specialized field, and the knowledge and technical difficulty it involves are significantly greater. Most data mining is therefore done by professionals or professional teams.

In addition, the quality of the modeling has a very important impact on the results the data yields. Different models tend to give different results.

"Anyone can come up with a model, as long as the model can be used to produce results. But does the result reflect the real world? Because the relationship between the data is not a direct linear relationship, the model can be very complicated. So first you have to know what problem you are trying to solve: statistically speaking, what type of problem is it, what characteristics does it have, and what are the limitations of your data collection? Then find the model closest to this problem," Jiao Yu said.
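A minimal sketch of what "finding the model closest to the problem" can look like in practice. The data-generating process here is invented for illustration (a quadratic relationship with noise), and the two candidate models are deliberately simple; the comparison on held-out data is the point, not the specific models.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(slope, intercept, xs, ys):
    """Average squared prediction error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def squares(values):
    return [v * v for v in values]


rng = random.Random(1)  # fixed seed so the run is repeatable
xs = [rng.uniform(-3.0, 3.0) for _ in range(400)]
ys = [x * x + rng.gauss(0.0, 0.5) for x in xs]  # hypothetical nonlinear process

# Hold out the last 100 points to check each model against unseen data.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

# Model A assumes y depends linearly on x.
a_slope, a_intercept = fit_line(train_x, train_y)
linear_mse = mean_squared_error(a_slope, a_intercept, test_x, test_y)

# Model B assumes y depends linearly on x squared.
b_slope, b_intercept = fit_line(squares(train_x), train_y)
squared_mse = mean_squared_error(b_slope, b_intercept, squares(test_x), test_y)

print(f"Model A (linear in x):   held-out MSE = {linear_mse:.2f}")
print(f"Model B (linear in x^2): held-out MSE = {squared_mse:.2f}")
```

Both models "produce results", but only the one whose form matches the underlying relationship predicts the held-out data well.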

“The difficulty of data mining lies in the interrelated but contradictory relationship between the initial data collection and the final application. This is similar to the 'chicken or egg' problem. The two influence and complement each other, which makes data mining a longer and more complicated process than other types of program development,” Han Xin said.

Whether it is the model Jiao Yu describes or the algorithm Han Xin describes, both emphasize one key point: adjust the model and algorithm to the actual situation. There are no fixed rules, only data that is constantly updated and circumstances that keep changing, so the rules applied should be adjusted accordingly.

The head of the Meituan Cloud big data platform believes that obtaining "standardized data" is the real difficulty: "Meituan-Dianping produces petabyte-scale data every day, including a large amount of merchant, user and interaction data. Every day this data is cleaned in batch and in real time with big data tools such as Hadoop, Hive, Spark and Storm to form standardized data."
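To make "cleaning into standardized data" concrete, here is a toy batch-cleaning step in plain Python. It is not the pipeline described in the quote: the record fields (`merchant`, `user_id`, `amount`) and the cleaning rules are hypothetical, and at real scale the same logic would run on tools like those named above rather than in a single process.

```python
def clean_records(raw_records):
    """Return standardized, de-duplicated records; drop unusable ones."""
    seen = set()
    cleaned = []
    for record in raw_records:
        # Normalize text fields: trim whitespace, lower-case merchant names.
        merchant = (record.get("merchant") or "").strip().lower()
        user_id = (record.get("user_id") or "").strip()
        if not merchant or not user_id:
            continue  # drop records missing a required field
        try:
            amount = round(float(record.get("amount")), 2)
        except (TypeError, ValueError):
            continue  # drop records whose amount is not a number
        key = (merchant, user_id, amount)
        if key in seen:
            continue  # drop exact duplicates after normalization
        seen.add(key)
        cleaned.append({"merchant": merchant, "user_id": user_id, "amount": amount})
    return cleaned


# Hypothetical raw input with typical defects.
raw = [
    {"merchant": " Noodle House ", "user_id": "u1", "amount": "12.50"},
    {"merchant": "noodle house", "user_id": "u1", "amount": 12.5},  # duplicate once cleaned
    {"merchant": "Tea Shop", "user_id": "", "amount": "3"},         # missing user
    {"merchant": "Tea Shop", "user_id": "u2", "amount": "n/a"},     # bad amount
]
standardized = clean_records(raw)
print(standardized)
```

Of the four raw records only one survives, which is the kind of reduction that makes downstream analysis trustworthy.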

However, perhaps the hardest part is also the most practical one. The rapid development of technology provides methods such as statistical methods, case-based reasoning, decision trees, rule-based reasoning, fuzzy sets, neural networks and genetic algorithms for processing information. These not only reduce the difficulty of data mining but also improve its efficiency and precision. But all of this requires a lot of money.

Many people may have heard of the well-known examples of big data in use: Facebook stores about 100TB of user data every day, and NASA processes about 24TB of data every day. So what does it cost to process this data?

According to Amazon Redshift pricing, NASA would need to pay more than $1 million for 45 days of data storage. According to a foreign survey, the CIOs of most companies say that their budgets cannot cover the cost of big data deployment, and that the cost of data storage and processing is too high.
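The scale behind these figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted above and two simplifying assumptions that are not in the source: that every day's data is retained for the full 45 days, and that the quoted $1 million is treated as a lower bound spread evenly over the stored volume. It does not use any actual Redshift price list.

```python
DAILY_TB = 24                 # NASA's daily volume as quoted in the article
DAYS = 45                     # storage period quoted in the article
QUOTED_COST_USD = 1_000_000   # "more than $1 million", treated as a lower bound

# Assumption: all data is kept, so volume accumulates linearly.
total_tb = DAILY_TB * DAYS
implied_usd_per_tb = QUOTED_COST_USD / total_tb

print(f"Data accumulated over {DAYS} days: {total_tb} TB")
print(f"Implied cost: at least ${implied_usd_per_tb:,.0f} per TB stored")
```

Under these assumptions about 1,080 TB accumulates in 45 days, which implies a cost of at least roughly $926 per terabyte stored.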
