Step right up. Come one, come all. The world has never seen a game like this one.
The lack of labeled data in the enterprise is the main roadblock to progress in deep learning.
It is time to find the answer.
A staggering number of techniques claim to address the Data Problem at the core of artificial intelligence. Under one of these cards lies the secret to the next crop of decacorns.
Weak supervision, unsupervised learning, foundation models, transfer learning, representation learning, semi-supervised learning, self-supervised learning, synthetic data, knowledge graphs, physical simulations, symbol manipulation, active learning, zero-shot learning and generative models, to name a few.
These concepts overlap and divide one another in strange ways, and not one of these terms has a universally agreed-upon definition. The sheer plethora of techniques and tools throws even the savviest customers and investors off balance.
Which one do you choose?
Here's the trick: we shouldn't have been watching the cards at all. The Data Problem was never really about data. Not exactly.
Data by itself is not useful. I could set my computer to generate enough random noise to keep a neural network busy until the heat death of the universe. With a little more effort, I could turn a single picture from a 10-megapixel phone into more data than currently exists on the internet.
Data is the vehicle; information is the cargo it carries. It is critical not to confuse the two.
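To see the difference concretely, here is a minimal sketch in Python (a toy of my own, not drawn from any of the sources above): two datasets of identical size, carrying the same amount of data, where only one contains any information about the target. A model can only ever learn from the second.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)  # the labels we care about

# Two feature sets of identical size: the same amount of *data*.
X_noise = rng.normal(size=(n, 20))   # pure noise, zero cargo
X_signal = X_noise.copy()
X_signal[:, 0] += 2 * y              # one column actually carries information

for name, X in [("noise", X_noise), ("signal", X_signal)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"{name}: held-out accuracy = {model.score(X_te, y_te):.2f}")
    # noise lands at roughly 0.50 (chance); signal lands well above it
```

Byte for byte, the two datasets are the same size. Only the information content differs, and only the information moves the model.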
Random noise is all data and no information. In systems like loan approvals and industrial supply chains, the problem is reversed: dense veins of human thought and expression packed into comparatively little data. Extracting it is like trying to mine a mountain with a pickaxe.
This is where The Data Problem really begins. The information is out there, as tangible as a billion cars on the road, yet maddeningly hard to extract. Thousands of people and billions of dollars haul out little loads of gravel and tailings, one CAPTCHA test at a time.
This is where the wave of buzzwords comes in. Behind the hundreds of papers, the motivations and core principles are easy to understand. The best and simplest explanation I've seen comes from Google's paper on underspecification.
Imagine a neural network as a fuzzy cloud of possible functions. It could do almost anything, but by itself it does nothing in particular.
Until we specify what we want it to do, the network is unmolded clay. Underspecification is a mathematical formalization of the amount of freedom left in such a system, and eliminating the possibilities we don't want takes an enormous amount of information and work.
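To make that concrete, here is a minimal sketch (my own toy illustration, not an example from the Google paper): two networks with identical architectures, trained on the same small dataset and fitting it equally well, can still disagree sharply away from that data, because the data alone never pinned the function down.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# A tiny training set: far too little information to mold the clay fully.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(20, 1))
y_train = np.sin(3 * X_train).ravel()

# Two identical architectures, differing only in their random seed.
models = [
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=seed)
    for seed in (1, 2)
]
for i, m in enumerate(models):
    m.fit(X_train, y_train)
    print(f"model {i} train R^2: {m.score(X_train, y_train):.3f}")  # both fit well

# Away from the training data, the leftover freedom shows up as disagreement.
X_far = np.linspace(2, 4, 10).reshape(-1, 1)
gap = np.abs(models[0].predict(X_far) - models[1].predict(X_far))
print(f"mean disagreement off the training range: {gap.mean():.2f}")
```

As far as the training data is concerned, the two models are interchangeable. Everything that separates them is freedom the data never removed.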
And because what we want today is to mimic humans, that information and work must come from humans.
Every human decision moves us forward by winnowing down that huge space: a reduction in entropy, in Shannon's sense. Finding the one perfect drop of water in an ocean of possibilities is impractical. What we can do is find the right part of the ocean, a subset in which every remaining option is equivalently optimal.
You get the idea.
Supervision is how we winnow the ocean: out of everything you could do, this is what you should do. To cut through the buzzword noise, focus on where the information comes from and how it flows, because there is no free lunch here. The toy sketch below shows what this winnowing looks like in miniature.
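This sketch is entirely my own construction: a hypothesis space of 2^16 candidate binary functions, where each human label is checked against the candidates and cuts the remaining entropy by exactly one bit.

```python
import numpy as np

# The "ocean": every possible binary labeling of 16 fixed inputs,
# encoded as the 16 bits of an integer. 2**16 candidate functions.
hypotheses = np.arange(2**16, dtype=np.uint32)

rng = np.random.default_rng(0)
target = int(rng.integers(2**16))  # the one behavior we actually want

print(f"start: {np.log2(hypotheses.size):.0f} bits of freedom")
for i in rng.permutation(16)[:8]:
    label = (target >> i) & 1  # one human labeling decision
    hypotheses = hypotheses[((hypotheses >> i) & 1) == label]  # winnow
    print(f"labeled input {i:2d}: {np.log2(hypotheses.size):.0f} bits left")
```

After eight labels, 256 candidates remain, and every one of them agrees with everything specified so far. That residue is the equivalence set, and shrinking it is what supervision, in all its guises, is for.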
Nvidia's Omniverse Replicator is a great example. It is billed as a synthetic data platform, which by itself doesn't tell you much. Synthetic data describes only the vehicle; the physical simulations underneath are the real source of information. That makes it fundamentally different from synthetic data platforms that use generative models to convert information trapped in personally identifiable data into non-identifiable synthetic data carrying the same information.
Tesla's approach to active learning makes for an instructive case study. In classic active learning, the data scientist is a key source of information: if you specify a selection strategy well suited to the task, each new training example cuts down your equivalence set. In one of his recent talks, Andrej Karpathy describes how Tesla improved on this technique. Rather than having data scientists craft a single optimal active learning strategy, they combine several noisy strategies to identify the most impactful examples, roughly as sketched below.
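Here is a rough sketch of combining several noisy signals. The scoring functions and the rank-averaging rule are my own placeholders, not a description of Tesla's actual system:

```python
import numpy as np

def uncertainty_score(probs: np.ndarray) -> np.ndarray:
    """Prediction entropy: higher means the model is less sure."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def disagreement_score(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """Gap between two models' predictions on the same inputs."""
    return np.abs(probs_a - probs_b).sum(axis=1)

def novelty_score(embeddings: np.ndarray) -> np.ndarray:
    """Distance from the mean embedding, a crude rarity signal."""
    return np.linalg.norm(embeddings - embeddings.mean(axis=0), axis=1)

def select_for_labeling(scores, budget):
    """Rank-average the noisy signals and pick the top `budget` examples."""
    ranks = [s.argsort().argsort() for s in scores]  # per-signal ranks
    combined = np.mean(ranks, axis=0)
    return np.argsort(-combined)[:budget]

# Toy usage with random stand-ins for model outputs and embeddings.
rng = np.random.default_rng(0)
n = 1000
probs_a = rng.dirichlet(np.ones(5), size=n)
probs_b = rng.dirichlet(np.ones(5), size=n)
emb = rng.normal(size=(n, 32))

picks = select_for_labeling(
    [uncertainty_score(probs_a),
     disagreement_score(probs_a, probs_b),
     novelty_score(emb)],
    budget=50,
)
print("send these unlabeled examples to human annotators:", picks[:10])
```

No single score has to be right. As long as each carries some signal, the combination routes labeling effort, and therefore information, toward the examples that shrink the equivalence set fastest.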
Through a naive automation lens, adding human intervention would be considered a regression: more intervention means less automation, and less automation is supposed to be worse. Seen through the lens of information, though, it is a clear win: you have dramatically improved the rate at which the system improves.
That is the name of the game. Many of the people who have co-opted these buzzwords have misrepresented the promise they carry, but the words themselves are indicative of real progress. Those of us who have explored these fields for a long time know there is no magic bullet. Each field has delivered benefits in its own right, and there are still significant gains to be made by combining and unifying these supervision paradigms.
This is an era of great possibility. We can now draw information from previously untapped sources, an embarrassment of riches buried in noise. When it all seems like too much, and you can't sort fact from fiction, remember one thing:
Follow the information.
The author is a founder and the chief technology officer of Indico Data.
Welcome to the VentureBeat community!

DataDecisionMakers is where experts, including the technical people doing data work, can share data-related insights and innovation.

If you want to read about cutting-edge ideas and up-to-date information, best practices, and the future of data and data tech, join us at DataDecisionMakers.

You might even consider contributing an article of your own!

Read more from DataDecisionMakers.