Data as Assets: DataFi Is Opening a New Blue Ocean
Author: Biteye core contributor @anci_hu49074
“We are in an era of global competition to build the best foundation models. Computing power and model architecture matter, but the real moat is training data.”
—Sandeep Chinchali, Chief AI Officer, Story
Let’s look at the potential of the AI data track through the lens of Scale AI.
The biggest gossip in the AI circle this month is Meta flexing its financial muscle: Zuckerberg recruited talent everywhere, assembling a star-studded Meta AI team composed largely of Chinese researchers. The team is led by 28-year-old Alexander Wang, founder of Scale AI, a company currently valued at $29 billion. Its clients include the US military as well as competing AI giants such as OpenAI, Anthropic, and Meta, all of which rely on Scale AI's data services. Scale AI's core business is providing large volumes of accurately labeled data.
Why has Scale AI stood out from the crowd of unicorns?
Because it recognized the importance of data to the AI industry early on.
Computing power, models, and data are the three pillars of AI. If a large model were a person, the model architecture would be its body, computing power its food, and data the knowledge and information it learns from.
In the years since LLMs took off, the industry's focus has shifted from models to computing power. Today most models have settled on the transformer as their framework, with occasional innovations such as MoE or MoRe. The giants have either built their own super clusters to erect a great wall of compute, or signed long-term agreements with powerful cloud providers such as AWS. Once baseline computing needs are met, the importance of data grows ever more prominent.
Unlike established B2B big-data companies with strong reputations in the public markets, such as Palantir, Scale AI (as its name suggests) is committed to building a solid data foundation for AI models. Its business goes beyond mining existing data: it also bets on the longer-term business of data generation, assembling teams of human experts across different fields to act as AI trainers and supply higher-quality training data.
If the value of this business isn't obvious to you, let's look at how a model is trained.
Model training is divided into two phases: pre-training and fine-tuning.
Pre-training is a bit like a human baby gradually learning to speak: we feed the model huge volumes of text, code, and other information gathered by web crawlers, and it learns from this material on its own, acquiring human language (natural language, in academic terms) and basic communication skills.
Fine-tuning is more like going to school, where there are clear rights and wrongs, answers and direction. Just as schools shape students into different kinds of talent according to their goals, we use pre-processed, targeted datasets to give the model the specific capabilities we expect. As the sketch below illustrates, the two phases can share the same learning step; what really differs is the data.
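To make that concrete, here is a minimal PyTorch sketch: a toy next-token model in which pre-training and fine-tuning run the identical training step and differ only in what they are fed, raw crawled text versus a curated question-answer pair. Everything in it (the model, the data, the hyperparameters) is invented for illustration.

```python
# Toy sketch: the same next-token objective, two kinds of data.
import torch
import torch.nn as nn

vocab_size, dim = 256, 64  # byte-level "vocabulary", tiny hidden size
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(token_ids: torch.Tensor) -> float:
    # Predict every byte from the byte before it (next-token prediction).
    opt.zero_grad()
    loss = loss_fn(model(token_ids[:-1]), token_ids[1:])
    loss.backward()
    opt.step()
    return loss.item()

# Phase 1: pre-training data -- raw, uncurated text scraped from the web.
raw_text = b"the model teaches itself to speak by reading everything it can"
# Phase 2: fine-tuning data -- a curated prompt/answer pair with a known-good
# target (real fine-tuning would also mask the prompt tokens from the loss).
labeled_pair = b"Q: What is the capital of France? A: Paris"

for phase, data in [("pre-train", raw_text), ("fine-tune", labeled_pair)]:
    loss = train_step(torch.tensor(list(data), dtype=torch.long))
    print(f"{phase}: loss {loss:.3f}")
```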
At this point you may have realized that the data we need likewise falls into two categories: massive raw corpora for pre-training, and curated, targeted datasets for fine-tuning.
These two kinds of datasets make up the main body of the AI data track. Don't underestimate these seemingly low-tech datasets: the prevailing view is that as the compute advantage promised by scaling laws fades, data will become the most important pillar by which large-model vendors sustain a competitive edge.
As model capabilities keep improving, ever finer and more specialized training data will become the key variable determining those capabilities. If we push the analogy further and compare model training to the cultivation of a martial-arts master, then high-quality datasets are the best secret manuals (to complete the metaphor, computing power is the miracle elixir and the model architecture the innate aptitude).
Viewed over the long run, AI data is also a track with the ability to snowball: as earlier work accumulates, data assets compound, growing more valuable as they age.
Web3 DataFi: The Chosen Fertile Ground for AI Data
Compared with Scale AI's remote labeling workforce of hundreds of thousands of people in the Philippines, Venezuela, and elsewhere, Web3 has natural advantages in the AI data field, and a new term, DataFi, was born.
Ideally, the advantages of Web3 DataFi are as follows:
With existing public data close to being fully mined and exhausted, how to tap undisclosed and even private data has become an important direction for expanding data sources. This poses a stark trust choice: sell your data outright under a centralized company's contract buyout, or take the blockchain route, keep the IP of your data in your own hands, and know precisely, via smart contracts, who uses your data, when, and for what purpose (a conceptual sketch follows below).
Meanwhile, for sensitive information, zk, TEE, and similar techniques can ensure that your private data is handled only by machines that keep their mouths shut, and is never leaked.
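As a thought experiment, the sketch below uses plain Python to model the bookkeeping such a data-licensing contract would expose: the contributor keeps ownership, uses outside the licensed purpose are refused, and every permitted use leaves an attributable record. This is illustrative pseudologic, not real smart-contract code, and all names in it are hypothetical.

```python
# Conceptual model of on-chain data licensing: who used it, when, and why.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class UsageRecord:
    consumer: str   # who used the data
    purpose: str    # what it was used for
    timestamp: str  # when

@dataclass
class DataLicense:
    owner: str                  # the data IP stays with the contributor
    allowed_purposes: set
    usage_log: list = field(default_factory=list)

    def request_use(self, consumer: str, purpose: str) -> bool:
        if purpose not in self.allowed_purposes:
            return False  # a real contract would revert the transaction here
        self.usage_log.append(UsageRecord(
            consumer, purpose, datetime.now(timezone.utc).isoformat()))
        return True

grant = DataLicense(owner="0xalice", allowed_purposes={"model-training"})
print(grant.request_use("0xlab", "model-training"))  # True: logged and allowed
print(grant.request_use("0xads", "ad-targeting"))    # False: never licensed
print(grant.usage_log)  # the owner can audit every permitted use
```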
Perhaps it is also time to challenge traditional labor relations. Instead of scouring the world for cheap labor as Scale AI does, why not use blockchain's distributed nature to let a workforce scattered across the globe contribute data under open, transparent incentives guaranteed by smart contracts?
For labor-intensive tasks such as data labeling and model evaluation, Web3 DataFi also fosters far greater participant diversity than the centralized data-factory approach, which matters over the long run for avoiding data bias.
How do we avoid the tragedy of the "Jiangnan Leather Factory" (the infamous Chinese meme of a boss absconding with his workers' wages)? Naturally, by replacing the dark side of human nature with smart-contract incentives whose price tags are written out in the open.
Amid what looks like inevitable deglobalization, how can low-cost geographic arbitrage continue? Opening companies all over the world is only getting harder, so why not bypass the old world's barriers and embrace on-chain settlement?
"Middlemen making a profit from the price difference" is an eternal pain for both supply and demand sides. Instead of letting a centralized data company act as a middleman, it is better to create a platform on the chain, through an open market like Taobao, so that the supply and demand sides of data can connect more transparently and efficiently.
As the on-chain AI ecosystem develops, demand for on-chain data will grow more vigorous, more segmented, and more diverse; only a decentralized market can efficiently absorb that demand and turn it into ecosystem prosperity.
For retail investors, DataFi is also the corner of decentralized AI most open to ordinary participants.
Granted, AI tools have lowered the learning threshold somewhat, and the founding aim of decentralized AI is to break the giants' monopoly on the AI business. Still, it must be admitted that many current projects are not friendly to retail users without a technical background: participating in decentralized compute-network mining usually demands expensive upfront hardware, and the technical threshold of model marketplaces easily scares off ordinary participants.
DataFi, by contrast, is one of the few opportunities in the AI revolution that ordinary users can actually seize. Web3 lets you take part through simple tasks: contributing data, labeling, or evaluating models using nothing more than human intuition and instinct, or going a step further by using AI tools for light creative work and data trading. For veteran airdrop farmers, the difficulty level is basically zero.
Potential projects in Web3 DataFi
Where the money flows is where the direction lies. Beyond the Web2 world, where Scale AI took a $14.3 billion investment from Meta and Palantir's stock rose more than fivefold in a year, DataFi has also performed very well in Web3 fundraising. Here is a brief introduction to these projects.
Sahara AI, @SaharaLabsAI, raised $49 million
Sahara AI's ultimate goal is to build a decentralized AI super-infrastructure and trading marketplace, and AI data is the first sector it is testing. The public beta of its DSP (Data Services Platform) launches on July 22; users can earn token rewards by contributing data, taking part in data labeling, and similar tasks.
Link: app.saharaai.com
Yupp, @yupp_ai, raised $33 million
Yupp is an AI model feedback platform that collects user feedback on model outputs. The main task at present: users compare different models' outputs to the same prompt and pick the one they think is better. Completing tasks earns Yupp points, which can be redeemed for stablecoins such as USDC (see the sketch below for what this kind of feedback produces).
Link:
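For intuition, here is a rough Python sketch of what pairwise feedback like this yields: each vote records which model won a head-to-head comparison, and even a naive win-rate tally already turns scattered human preferences into a model ranking. The record format and the aggregation are my assumptions, not Yupp's published design.

```python
# Pairwise preference votes -> a simple model ranking via win rates.
from collections import Counter

# Each vote: (prompt, model_a, model_b, winner picked by the user).
votes = [
    ("summarize this article", "model-a", "model-b", "model-a"),
    ("write a haiku", "model-a", "model-c", "model-c"),
    ("explain scaling laws", "model-b", "model-c", "model-b"),
]

wins, appearances = Counter(), Counter()
for _, a, b, winner in votes:
    appearances[a] += 1
    appearances[b] += 1
    wins[winner] += 1

ranked = sorted(appearances, key=lambda m: wins[m] / appearances[m], reverse=True)
for model in ranked:
    print(f"{model}: win rate {wins[model] / appearances[model]:.2f}")
```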
Vana, @vana, raised $23 million
Vana focuses on turning users' personal data (social media activity, browsing history, and so on) into monetizable digital assets. Users can authorize uploads of their personal data into the corresponding data liquidity pools (DLPs) within DataDAOs; the pooled data is used for tasks such as AI model training, and users receive token rewards in return.
Link:
Chainbase, @ChainbaseHQ, raised $16.5 million
Chainbase's business centers on on-chain data, currently covering more than 200 blockchains and turning on-chain activity into structured, verifiable, and monetizable data assets for dApp development. Its data is acquired mainly through multi-chain indexing and processed via its Manuscript system and Theia AI model. For now there is little for ordinary users to do.
Sapien, @JoinSapien, raised $15.5 million
Sapien aims to convert human knowledge into high-quality AI training data at scale. Anyone can annotate data on the platform, with quality enforced through peer verification; users are also encouraged to build long-term reputations or stake tokens as a commitment, earning greater rewards (a sketch of this kind of mechanism follows below).
Link:
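To show how peer verification plus staking can police label quality, here is a hedged Python sketch: a label is accepted by majority vote among reviewers, agreeing with the consensus earns a reward, and deviating gets a slice of stake slashed. The numbers and rules are illustrative guesses, not Sapien's actual parameters.

```python
# Majority-vote label verification with staking carrots and sticks.
from collections import Counter

stakes = {"ann1": 100.0, "ann2": 100.0, "ann3": 100.0}   # staked tokens
labels = {"ann1": "cat", "ann2": "cat", "ann3": "dog"}   # one task, 3 peers

consensus, _ = Counter(labels.values()).most_common(1)[0]  # majority label
REWARD, SLASH_RATE = 5.0, 0.10                             # made-up economics

for annotator, label in labels.items():
    if label == consensus:
        stakes[annotator] += REWARD           # reward agreement with consensus
    else:
        stakes[annotator] *= 1 - SLASH_RATE   # slash deviation from consensus

print(f"accepted label: {consensus}")
print(stakes)  # {'ann1': 105.0, 'ann2': 105.0, 'ann3': 90.0}
```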
Prisma X, @PrismaXai, raised $11 million
Prisma X wants to be an open coordination layer for robots, with physical-world data collection at its core. The project is still in its early stages: according to the recently released white paper, participation may include investing in robots that collect data, teleoperating robots to gather data, and more. A quiz based on the white paper is currently open, and answering it earns points.
Link:
Masa, @getmasafi, raised $8.9 million
Masa is one of the leading subnet projects in the Bittensor ecosystem, currently operating data subnet 42 and agent subnet 59. The data subnet aims to provide real-time access to data; today its miners mainly crawl real-time data from X/Twitter inside TEE hardware. For ordinary users, participation is relatively difficult and costly.
Irys, @irys_xyz, raised $8.7 million
Irys focuses on programmable data storage and compute, aiming to provide efficient, low-cost solutions for AI, decentralized applications (dApps), and other data-intensive uses. On the data-contribution side there is little for ordinary users to do yet, but the current testnet stage offers several activities to join.
Link:
ORO, @getoro_xyz, raised $6 million
ORO wants to empower ordinary people to contribute to AI. Supported routes: 1. link your personal accounts to contribute personal data, including social, health, e-commerce, and financial accounts; 2. complete data tasks. The testnet is now live and open for participation.
Link: app.getoro.xyz
Gata, @Gata_xyz, raised $4 million
Positioned as a decentralized data layer, Gata currently offers three key products to take part in: 1. Data Agent: a set of AI agents that automatically run and process data whenever the user opens the web page; 2. All-in-one Chat: a Yupp-like mechanism that rewards model evaluation; 3. GPT-to-Earn: a browser plug-in that collects users' conversation data on ChatGPT.
Link:
How should we view these projects?
For now the barriers to entry across these projects are generally low, but it must be acknowledged that once users and ecosystem stickiness accumulate, platform advantages compound quickly. Early efforts should therefore concentrate on incentives and user experience: only by attracting enough users can a big-data business be built.
However, as labor-intensive platforms, these projects must also think about how to manage their workforce and guarantee the quality of data output even as they attract labor. A common failing of many Web3 projects, after all, is that most users on a platform are mercenary yield farmers who readily sacrifice quality for short-term gain. If they become a platform's main users, bad money drives out good, data quality cannot be guaranteed, and buyers stay away. We have already seen projects such as Sahara and Sapien stress data quality and work to build long-term, healthy relationships with the labor on their platforms.
Opacity is another problem with today's on-chain projects. Admittedly, blockchain's impossible triangle has forced many projects onto a path of "centralization first, decentralization later" in their startup phase. But more and more on-chain projects now come across as "old Web2 projects in Web3 skins": little publicly trackable data on-chain, and roadmaps that show no lasting commitment to openness and transparency. This is undoubtedly toxic to the long-term healthy development of Web3 DataFi, and we hope more projects will stay true to their founding intent and speed up their march toward openness and transparency.
Finally, DataFi's path to mass adoption should run on two tracks: attracting enough to-C participants into the network, forming both a new workforce for data collection and generation and a consumer base for the AI economy, thereby closing the ecosystem loop; and winning recognition from today's mainstream to-B companies, which, with their deep pockets, will remain the main source of large data orders in the short term. On this front, Sahara AI, Vana, and others have already made good progress.
Conclusion
To put it fatalistically, DataFi is about using human intelligence to nurture machine intelligence over the long run, with smart contracts as the covenant that guarantees human intellectual labor gets paid and ultimately shares in the returns machine intelligence generates.
If you feel anxious about the uncertainties of the AI era, or still hold blockchain ideals amid crypto's ups and downs, then following the capital giants into DataFi is a sound way to ride the trend.