Recruitment at Haitian Sheng Rui

By Ye Li

While getting to know the data collection and labeling industry, Heizhi heard a story about Lu Qi and the labeling factories of Henan.

Reportedly, most labeling factories in Henan use Baidu's labeling tools to do Baidu's work. While Lu Qi was at the helm of Baidu, the company released a large volume of labeling demand. Surviving was not difficult at the time: the required accuracy was only 90%, and the profit margin on labeling could reach 60%-70%. Some enterprises expanded blindly, recruiting hundreds of people at once. After Lu Qi left, Baidu's demand decreased. In the second half of 2018, the required accuracy was generally raised to 95%-96%, which made the work harder. These factories knew only Baidu's labeling tools and found it difficult to take on other business, so a number of them died. The factories that survived had to lay off employees and are now in a difficult transition period.

While the labeling factories in Henan were struggling through this transition, Zhang San's labeling company officially opened. The company has only just been established, and there is a great deal to sort out. A few days ago, Heizhi reached him at noon. He said that an earlier order needed to be reworked and that he had been busy ever since. For a startup, being busy is better than being idle. Zhang San said that if there is a single idle day, he cannot sleep that night. "A day without work means thousands of yuan wasted. Monthly expenditure is 150,000 yuan (note: the company currently has 65 employees)."

In his view, labeling is a bitter industry. "For the first six months you will definitely lose money. You should be prepared to lose 10,000 yuan on your own." He smiled and told Heizhi, "If you have a grudge against someone, persuade him to go into labeling." This is a well-known joke in labeling circles. The labeling world is neither very big nor very small, and it is divided into four echelons. Zhang San said his company belongs to the third echelon. The first echelon includes Baidu Zhongce and JD.COM Cizhi. The second echelon includes Totoro Data, Testin Cloud Measurement, Besay BasicFinder, Data Hall and others. He compared the relationship between the second and third echelons to that between small real estate developers and the laborers who carry their bricks. Below the third echelon are a large number of small workshops with teams of 3-5 people.

The labeling industry is a promising new industry.

Being new means uncertainty and unlimited possibilities. "Doing labeling is like pouring water into a bucket. Every box you draw adds a bowl of water. Right now nobody knows how long this can go on; we will only know when the water overflows." This does not stop Zhang San from planning for the future. "The first step, at this stage, is to serve the second echelon. After that, we build a platform and turn the company into a second-echelon player."

30 billion market and inflection point

How big is the market for data collection and labeling? 30 billion yuan.

The industry first appeared around 1984. Xinboyou was one of many companies at the time. Back then, these companies were more like "data entry companies": they digitized paper content rather than labeling it. Data entry is labor-intensive work, and a company needs to hire many people to do it. On its Zhaopin recruitment profile, Xinboyou selected "1000-9999" as its company size.

Unlike Xinboyou, Haitian Sheng Rui was founded in 1998, works on speech annotation, and has built many speech databases of its own. An insider told Heizhi that Haitian Sheng Rui's repeated sales of its existing speech databases are a relatively big business. Data Hall was established in 2011. What usually impresses outsiders most is that "it is the largest data trading platform in China". This is related to how the company started out.

Around 2015, with the strong rise of the companies on the AI TOP50 list, demand for data annotation and collection gradually increased. What really took shape in this market is the four echelons mentioned above. As Party B, they entered this expanding market, served AI unicorns valued at more than US$1 billion, and taught the artificial intelligence products that may change the world.

1.

Data is a necessity for AI companies. Just as people need three meals a day, AI models need to be fed data every day. Du Lin, founder and CEO of Besay BasicFinder, deeply understands the relationship between data and AI models. He began studying computer vision in high school and published papers in his final year there. He continued related research throughout college. He is well aware of how important data is to an AI model, and has concluded that "there is no threshold for AI modeling; data is the threshold".

In his view, artificial intelligence at this stage is simple cognitive intelligence. "Cognitive intelligence helps you identify and classify the world. Building a classifier is a mathematical problem, and it is formed by accumulating data." "Deep learning is essentially a mathematical problem: a process of working backward from a large amount of sample data to the coefficient space of the classifier. You must have many samples. What is a sample? A sample is data for which the correct answer is known. It is the same as when we were students solving for the coefficients of various formulas: we need many known points in space to fit a polynomial. Deep learning follows the same pattern, and it also requires a large number of samples, that is, labeled data."

Du Lin therefore concluded that "at this stage of industrial AI research and development, labeled data definitely cannot be skipped, and AI may depend on labeled data for the next 10 years." Data is this important to AI, yet companies that label and collect data are not recognized by academia, industry, capital or even the media. The halo belongs to the AI companies that have done model research and development from the start, such as SenseTime and Megvii.

"When a company makes a good artificial intelligence product, everyone says the algorithms are brilliant or the scientists are brilliant, but no one has ever said the data collection was good," said Jia Yuhang, VP of Testin Cloud Measurement. Jia Yuhang told Heizhi that data collection and labeling not only gets no spotlight, it is also "hard labor", so bitter that no one wants to do it. It is much like the mobile internet: when a product does well, no one thinks the app testers deserve a share of the credit, but once something goes wrong, the testing department is the first to be blamed.
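Du Lin's analogy of recovering coefficients from known points can be illustrated with the simplest possible case, a least-squares straight-line fit. This is only a minimal sketch: the function name `fit_line` and the sample points are hypothetical, and a real classifier has vastly more coefficients, but the principle that labeled samples determine the coefficients is the same.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) samples.

    Each (x, y) pair plays the role of a labeled sample: x is the input,
    y is the known correct answer.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    a = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - a * sum_x) / n
    return a, b


# Hypothetical labeled samples that lie exactly on y = 2x + 1.
labeled_samples = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
a, b = fit_line(labeled_samples)
print(a, b)
```

With too few known points the coefficients cannot be determined at all, which is the sense in which labeled data, rather than the modeling code, is the threshold.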

2. The 30 billion yuan data collection and labeling market

The importance of data to AI companies is self-evident. Reportedly, AI companies spend 10%-15% of their financing on data collection and labeling; others put the ratio at 20%-30%. In 2018, total financing for Chinese AI companies exceeded 100 billion yuan, which puts the data collection and labeling market at roughly 10 billion to 30 billion yuan. One-third of that is absorbed by the internal labeling departments of AI companies, part is carved up by business process outsourcing companies, and the remaining 25%-33% flows to third-party companies specializing in data collection and labeling. At present, AI financing is growing at about 25% per year.
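The market estimate above is simple arithmetic, sketched below. The sketch assumes total 2018 financing of about 100 billion yuan and applies the lowest and highest spending ratios quoted; the names `market_range_bn` and `FINANCING_2018_BN` are illustrative only.

```python
def market_range_bn(financing_bn, low_pct, high_pct):
    """Return (low, high) market size in billions of yuan.

    Percentages are passed as whole numbers to avoid float rounding noise.
    """
    return financing_bn * low_pct / 100, financing_bn * high_pct / 100


FINANCING_2018_BN = 100  # assumed total AI financing in 2018, billion yuan

# Lowest quoted ratio is 10%, highest is 30%.
low_bn, high_bn = market_range_bn(FINANCING_2018_BN, 10, 30)

# Share of the upper estimate that flows to third-party specialists (25%-33%).
third_party_low_bn = high_bn * 25 / 100
third_party_high_bn = high_bn * 33 / 100

print(low_bn, high_bn)                          # 10.0 30.0
print(third_party_low_bn, third_party_high_bn)  # 7.5 9.9
```

On these assumptions, the part of the market actually open to specialist providers is closer to 7.5-10 billion yuan than to the 30 billion headline figure.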

As the threshold for AI technology falls, more and more companies have open-sourced their frameworks, and a model can be produced simply by feeding in data. More and more companies in vertical industries have begun to set up AI departments. Previously, they would hand this work to companies that build AI models. Over the past two years, many customers of Totoro Data, Testin Cloud Measurement and Besay BasicFinder have come not from the AI industry but from the AI business departments of traditional companies. Qi Zhi, founder and CEO of Totoro Data, believes that from this perspective the market size is hard to calculate. How much budget internet companies such as BAT, Xiaomi, JD.COM and TMD, and enterprises in traditional industries, will spend on AI is unknown. The only certainty is that over the past two or three years the market for data collection and labeling has kept growing.

First, over the past two or three years, AI models have demanded ever more complexity and precision from data collection and labeling. For example, when drawing a face box today, the box must be accurate to within five pixels or even three pixels, or the accuracy of the whole batch of data must exceed 97%. Jia Yuhang believes that rising accuracy requirements are the inevitable result of the AI industry's development. The AI industry has a saying, "garbage in, garbage out": low-precision labeled data is meaningless to an algorithm. Only by continuously delivering high-precision labeled data can service providers maintain their competitive advantage.

Second, data is becoming larger in scale and more diverse. Larger means the volume of data will keep growing. Take sensors as an example: as sensor costs fall and sensors are more widely deployed, more and more data needs to be labeled. More diverse means richer data dimensions. At this year's CES, Panasonic introduced a smart home solution that can not only observe facial fatigue through the camera on the TV but also detect heartbeat through a capacitive sensor in the chair. Previously, fatigue detection only captured faces through cameras. In the future, data in more dimensions will be collected: not only 2D images and sound, but also 3D lidar and heartbeat data will fall within the scope of labeling.
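The two acceptance criteria mentioned above, a pixel tolerance per face box and a minimum accuracy for the whole batch, can be checked together. The sketch below is only an illustration: the `Box` type, the function names, and the rule that every box edge must be within tolerance of a reference box are assumptions, not any provider's actual acceptance procedure.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned face box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def within_tolerance(labeled, reference, tol_px):
    """True if every edge of the labeled box is within tol_px of the reference."""
    return all(abs(a - b) <= tol_px for a, b in zip(labeled, reference))


def batch_accuracy(labeled_boxes, reference_boxes, tol_px):
    """Fraction of boxes in the batch that meet the pixel tolerance."""
    if not labeled_boxes or len(labeled_boxes) != len(reference_boxes):
        raise ValueError("batches must be non-empty and of equal length")
    ok = sum(
        within_tolerance(lab, ref, tol_px)
        for lab, ref in zip(labeled_boxes, reference_boxes)
    )
    return ok / len(labeled_boxes)


def batch_passes(labeled_boxes, reference_boxes, tol_px=3, required=0.97):
    """Accept the batch only if its accuracy reaches the required level."""
    return batch_accuracy(labeled_boxes, reference_boxes, tol_px) >= required


# Hypothetical batch: 98 boxes are off by 2 px, 2 boxes are off by 10 px.
reference = [Box(10, 10, 110, 110)] * 100
labeled = [Box(12, 10, 110, 112)] * 98 + [Box(20, 10, 110, 110)] * 2
accuracy = batch_accuracy(labeled, reference, tol_px=3)
accepted = batch_passes(labeled, reference)
```

Tightening `tol_px` or raising `required` is exactly what makes the same batch of work more expensive to deliver.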

3. Turning point

Changes on the demand side inevitably cause a considerable upheaval on the supply side. The supply side has begun to transform from a labor-intensive industry into a new industry with a new model: tools plus crowdsourcing. A reshuffle has begun, and data collection and labeling has entered its second half.

The fourth echelon has been hit hardest. Greater complexity and higher precision requirements are both bad news for them. Since the middle of last year, ten to twenty small workshops have been asking every day to affiliate themselves with Besay BasicFinder, which shows that small workshops have lost their sources of business. "Their model of grabbing the market with low-quality data and low prices is no longer sustainable, because AI engineers cannot accept low-quality data or unreliable delivery," Du Lin said.

Zhang San thinks the fourth echelon has broken the rules. They first grab orders at a low price, then work out which kind of project produces the most output per unit of time, and do that project themselves. Other projects are subcontracted to even smaller teams, so quality is hard to guarantee. "They don't count rent, management fees and so on; they only count labor costs. Their logic is that a person costs 50 yuan a day, and anything above that price is profit. So they quote a unit price of 100 yuan. The third echelon has to bear rent, taxes, management fees and all the miscellaneous daily costs of drinking water and meals. It has to quote a unit price of at least 200 yuan to take the job."

In the early days, the fourth echelon made some money this way, recovered its hardware costs and had something left over. But at the beginning of 2018, the second echelon began doing on-site factory inspections: "They look at how many people you have and at your premises. If you are unprofessional, the industry slowly eliminates you." Elimination means no source of business while many people still need to eat and be paid, and so the crisis of the unprofessional fourth echelon emerged. Even when a project can be found, the requirements for labeling projects have risen. For example, accuracy must reach 95% or even 99%. A small workshop then has to pull people from the team for full-time quality inspection plus a final sampling inspection, which raises costs.

The pressure is the same for everyone in this industry. Second-echelon companies such as Totoro Data, Testin Cloud Measurement and Besay BasicFinder need to iterate as startups do, find ways to break through their own limits, keep innovating and step out of their comfort zones. They have found a point of breakthrough; what they now need to think about is how to win in the future. Insiders believe that the fourth-echelon crisis helps the strong second-echelon players use service quality and efficiency to take over the market gap left by the small workshops that have withdrawn.

New stage and new competition

Data annotation and collection is a technical activity.

When a request comes in, the collection and labeling company does two things: first, it configures and develops the required modules; second, it does trial labeling to summarize the rules and carries out training. Once both are complete, the company quotes a price to the client. During the quotation process, the company also prepares the relevant bid or response materials.

After winning the bid, the company transfers the data to its platform and configures the production and labeling workflow. Reportedly, configuring a data labeling workflow is a complex mathematical model. For example, some tasks need a serial-parallel workflow. A parallel workflow means many people working on the same task at once. In a serial workflow, each step builds on the result of the previous step. A serial-parallel workflow needs a platform that can configure such business workflows. For example, some NLP text tagging jobs require several people to tag the same item, after which one result is chosen or decided by vote. Serial-parallel configuration involves distributing the underlying data streams.

During labeling, collaborative quality management and performance statistics are very important. The platform needs to track each person's accuracy, stability and efficiency in a timely way. After labeling and before customer acceptance, the company still has to carry out sampling inspection. Finally, the company delivers in the format agreed with the customer, which involves format conversion.

The process above covers all the core technical points of a labeling system. Labeling and collection services cannot be delivered simply by gathering a crowd of people. For the third and fourth echelons, which rely on manpower, Jia Yuhang believes that if they want to move to the new production model of crowdsourcing plus tools, "the limitations are relatively large." There are two reasons:
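A serial-parallel workflow of the kind described above can be sketched in a few lines: several annotators label the same item in parallel, the results are merged by majority vote, and a serial review step then acts on the voted result. This is a toy sketch under stated assumptions; the function names, the tie handling and the example annotators are hypothetical and do not describe any particular platform.

```python
from collections import Counter


def parallel_stage(item, annotators):
    """Parallel step: every annotator labels the same item independently."""
    return [annotate(item) for annotate in annotators]


def majority_vote(labels):
    """Return the most common label, or None when there is no clear winner."""
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two labels
    return ranked[0][0]


def serial_parallel(item, annotators, reviewer):
    """Serial step: the reviewer works from the output of the parallel step."""
    labels = parallel_stage(item, annotators)
    voted = majority_vote(labels)
    return reviewer(item, voted, labels)


# Hypothetical annotators for a sentiment tagging task.
annotators = [
    lambda text: "positive",
    lambda text: "positive",
    lambda text: "negative",
]


def reviewer(item, voted, labels):
    """Accept a clear vote; escalate ties for manual arbitration."""
    return voted if voted is not None else "escalate"


result = serial_parallel("great phone", annotators, reviewer)
print(result)  # positive
```

The platform's real work lies in routing each item to the right people at each step, which is what the text means by distributing the underlying data streams.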
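The per-person statistics mentioned above could be computed along the following lines. This is a minimal sketch: the record layout, the function name `annotator_report`, and the choice to measure stability as the spread of daily accuracy are all assumptions made for illustration.

```python
from statistics import pstdev


def annotator_report(daily_records):
    """Summarize one annotator's work.

    daily_records: list of (items_done, items_correct, hours_worked) per day.
    Returns overall accuracy, throughput, and the spread of daily accuracy
    (a smaller spread is read here as more stable work).
    """
    total_items = sum(done for done, _, _ in daily_records)
    total_correct = sum(correct for _, correct, _ in daily_records)
    total_hours = sum(hours for _, _, hours in daily_records)
    if total_items == 0 or total_hours == 0:
        raise ValueError("no work recorded")
    daily_accuracy = [correct / done for done, correct, _ in daily_records if done]
    return {
        "accuracy": total_correct / total_items,
        "items_per_hour": total_items / total_hours,
        "accuracy_spread": pstdev(daily_accuracy),
    }


# Hypothetical three-day record for one annotator.
report = annotator_report([(200, 190, 8), (220, 209, 8), (180, 171, 8)])
print(report)
```

A platform would typically recompute such a report continuously, so that a drop in one person's accuracy is caught before the final sampling inspection rather than by the customer.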

First, brand. Over three years of continuous service, the leaders of the data industry have built a reputation among customers, and the brand effect brings them commercial momentum. Companies that care more about quality and return on investment will gradually lean toward the leaders. Second, technology. The leading labeling companies have the funds to optimize their own tools and meet customers' customization needs, and they refine their service systems and processes through management experience. Small teams, by contrast, are limited in how quickly they can build tools and process systems covering one or more industries. They have two paths to choose from. First, streamline the team, specialize in the business of one or a few AI companies, and run a small but excellent operation. Second, cooperate with the leaders and use the leaders' tools to do tasks allocated by their platforms.

As for latecomers who have not yet entered the market, if they set out from the start to be a crowdsourcing plus tools platform, then beyond overcoming business barriers, a crowdsourcing platform needs strong operational capabilities and enough people on the platform. The platform has to consider how to acquire new users and how to maintain daily and monthly active users. On the tools side, having only an app that can do labeling is not enough. Without convenient communication, it is hard to stop errors from spreading. As in the barrel theory, a barrel that is missing a single stave cannot hold water. In other words, the window for new entrants is gradually closing.

Insiders believe that the collection and labeling market will enter a Warring States period. The strong second-echelon players inevitably face a melee, and the market has begun to move toward consolidation. The first echelon is destined not to be the protagonist in this struggle for hegemony: because of industry competition and other considerations, clients will not hand their data to the crowdsourcing platforms of Baidu and JD.COM. Listed human resources outsourcing companies will gain a certain market share in the second half, which poses some threat to the five collection and labeling companies, but the threat is not great.

How will the second echelon compete in the second half? Through in-depth conversations with three second-echelon companies, Heizhi found that they understand the future and the competition differently and have positioned themselves differently. These differences were set from the moment each company was born.

1. Go light or go heavy?

Totoro Data, Testin Cloud Measurement and Besay BasicFinder give different answers to the question "go light or go heavy". Testin Cloud Measurement and Besay BasicFinder both have their own labeling teams, while Totoro Data insists on crowdsourced labeling.

Different genes lie behind the different choices. Testin Cloud Measurement was established in 2011. It started with app compatibility testing, moved into enterprise services, then added functional testing, automated testing, security testing, performance testing and other services, and became a one-stop testing platform. By 2017, Testin Cloud Measurement had accumulated a large number of customers. Some AI companies approached it hoping to collect data through its crowdtesting platform. That was the starting point of Testin Cloud Measurement's collection and labeling business.

Testin Cloud Measurement has gone heavy on its collection and labeling business. For example, in addition to crowdsourcing, it does customized scene collection, and even cooperates with the Hengdian film and television base, using Hengdian Group's performance resources to build dedicated scenes and complete customers' customized scene collection. For labeling, Testin Cloud Measurement has built its own labeling base and cooperated with the Fangshan Municipal Government on data labeling. Jia Yuhang said that everything Testin Cloud Measurement does is driven by customer needs. "Through tool development, we guarantee the efficiency, accuracy and security of labeling. And through project management, risk control and so on, we ensure that labeling accuracy meets customer standards."

Judging from Besay BasicFinder's product genes, its tools lean toward team-mode management tools rather than crowdsourcing. On February 18, 2018, Besay acquired Xinboyou. As mentioned above, Xinboyou is a Beijing data processing company that has been operating for 30 years. The company puts forward requirements and provides technical support. "We have iterated many times, and the optimization of every tool, every shortcut key and every setting was worked out in actual data production. Besay started its business later than other companies. We took on essentially no business in 2016 and only began taking orders in 2017. Our tools are very strong."

Besides Xinboyou, Besay BasicFinder has been actively expanding its production capacity. Du Lin said that Besay BasicFinder's branch factories now total nearly 3,000 people. "By expanding our own production capacity, we can deliver the most professional service." In September 2018, Besay BasicFinder acquired 100% of the equity of Ding Huo Intelligent. Ding Huo Intelligent's "Juju APP" has accumulated hundreds of thousands of active crowdsourcing users. "We have established an independent collection system and combined it with the app to carry out data collection and complete more diverse tasks."

Unlike Testin Cloud Measurement and Besay BasicFinder, Totoro Data does not have its own annotation team, and its tools lean toward crowdsourcing. Qi Zhi and his co-founders, who came from internet companies, prefer to do collection and labeling in a platform-based way rather than "being a pure data factory". Qi Zhi's past experience told him that the system should handle this complex data processing rather than relying on people managing people, because managing through people is very inefficient.

According to Qi Zhi, Totoro Data adopted the crowdsourcing model early on. "We made crowdsourcing work, and many followers then started doing it through crowdsourcing too." Qi Zhi thinks that Totoro Data has forged an "Eternal Sword". He does not think those imitating Totoro Data can do crowdsourcing well. "Players who entered this industry early hold a treasure saber. They profit from that saber, and then they see someone else profiting even more with an Eternal Sword. To forge an Eternal Sword, they would have to let go of the saber. If they let go of the saber, they may lose everything. But without letting go of the saber, it is hard for them to forge the sword. People's energy and thinking are limited, so it is impossible to focus on the saber and the Eternal Sword at the same time, and it would be unscientific for them to make an Eternal Sword better than ours."

Qi Zhi believes that Totoro Data has no treasure saber. "After receiving a customer's requirements, all we can do is optimize the system to ensure accurate data output. For them, after receiving the customer's requirements, there is still a fallback: supervise everyone on site. They have a way to retreat and we do not. We have to solve the problem. When there is a way to retreat, people easily take it when they are anxious." It is understood that the Totoro crowdsourcing platform currently has more than 4 million users, of whom only 1,000 do labeling. Totoro Data's labeling business is mainly undertaken by more than 1,000 channel providers.

2. Build models or not?

Jia Yuhang said that the data labeling industry chain can be divided into three parts: people, tools and algorithms. Testin Cloud Measurement insists on doing people plus tools well, and not algorithms. "Data is replicable. If a collection and labeling company also does algorithms, it is a bit like one algorithm company asking another algorithm company to label its data. Whether that data would then be used to improve Party B's own algorithms becomes a point of contention." "We are a company that serves the data field, not a company that sells algorithms. We are only responsible for fulfilling enterprises' data collection and labeling requirements. After delivery, we delete the customer's data completely."

Du Lin may disagree with Jia Yuhang, because Besay BasicFinder is building a foolproof modeling system: users only need to feed in data to get an AI model. "If a customer wants to set up an AI department, they only need to deploy on Besay's system and then find two or three AI engineers to tune the parameters, and they can build their own models. In this way, labeling, collection and modeling become one big closed loop, because the customer understands the business and knows what the business data should look like," Du Lin said.

For now, Besay BasicFinder avoids building models directly. Du Lin emphasized, "We have unified our self-developed, privately deployed labeling system and the mainstream deep learning frameworks into BasicAI, Besay's foundational AI system, to manage the whole life cycle of AI data and models. Besay does not build models; we only provide customers with a set of low-level tools so that they can build models themselves." Du Lin explained, "The emergence of deep learning libraries such as TensorFlow, Keras and PyTorch has removed the threshold for modeling, and in the future even high school students will be able to build models."

If a car company asked Besay BasicFinder to help build an autonomous driving system, Du Lin says that would be impossible. But he also said, "Our foundation has achieved efficient process management from labeling to modeling. The customer labels data on Besay, the data flows to the modeling platform, the customer adjusts some parameters in TensorFlow, and the model comes out." This year, Besay will launch version 3.0 and provide a SaaS labeling tool service to help customers manage data labeling. Du Lin said that labeling and modeling process tools built for teams can extend Besay's business and strengthen its competitive advantage.

There are no good or bad choices, but the market will give a clear answer to all of them. The Warring States melee may well play out over the next few years. However, customers do not want a single company to dominate, since nothing grows in the shade of a big tree. A situation with several strong players will persist for a long time.

Epilogue

One scene, one market, one industry, one world of its own.

The crowds who enter the market do so actively or passively, but once they are in, the logic of the market and of capital takes over. They, you and I all become factors of production in the production chain, to be selected, improved or eliminated.

The position of each industry participant was decided, or half decided, at birth. From the moment a company came into being, it has followed the existing logic, which does not bend to anyone's personal will. In the first half, grassroots heroes emerged in large numbers. In the second half, the fight is over price, brand, service and efficiency. The strong have begun to clear the field, and the grassroots players either leave or find a new footing. Capital accelerates the iteration of the whole industry.

The second half has only just begun, and it seems premature to talk about the final. Many of today's uncertainties will be settled by the competition of the next few years, but more uncertainties may appear. The banner over the city wall can change in an instant.

Heizhi believes that in the next few years, although uncertainty is the mainstream, there are still several things that are certain:

1. The second half will still be a battle over cost-performance. Customers always want higher-quality data at the lowest cost. To survive and stand out from the competition, suppliers have to meet that demand, and they have to use technology to create room for both price cuts and profit. Jia Yuhang thinks that technology always matters most. "Use technical means to force yourself not to earn too much. That lets you cut prices and become more competitive."

2. Don't ignore the AI needs of traditional companies. Undoubtedly, the AI demand of traditional enterprises will surge over the next few years. How to win these customers and serve them well is an urgent problem for all collection and labeling companies. Of course, the new kinds of data in the AI industry, such as 3D lidar and heartbeat data, cannot be ignored either.

3. Don't ignore commercial capability. Weak commercial capability may become the new shortcoming of collection and labeling companies. At this stage, their products and business models have basically been validated by the market. They now need to widen their product coverage by increasing their commercial leverage.

4. Establish a second growth curve. Over the next few years, some will leave and some will stay. Everyone in the industry chain is either dominating, leading or being dominated. All the remaining companies should look for a second growth curve to break through their existing cost-benefit limits. As for Zhang San, his dream still has to be pursued and realized. One should always have a dream, in case it comes true. (Note: Zhang San is a pseudonym)