NVIDIA: We are now in the second wave of intelligent agents, spanning software agents and embodied agents
When ChatGPT was first released, everyone in the field of artificial intelligence was talking about the new generation of AI assistants. But in the past year, this excitement has shifted to a new target: AI agents.
AI agents played a significant role at Google's annual I/O conference in May 2024. At that time, the company launched a new AI agent called Astra, with which users can interact using audio and video.
OpenAI's new GPT-4o model is also referred to as an AI agent.
There is some hype involved, but it's not just hype. Technology companies are investing huge sums of money to create AI agents, and their research work may bring us the truly useful artificial intelligence we have been dreaming of for decades.
Many experts, including Sam Altman, have said that AI agents are the next industry focus. But what exactly are they? How should we use them, and how are they defined?
The study of agents is still in its early stages, and the field has not yet provided a clear definition for them.
Jim Fan, a senior research scientist at Nvidia and the head of the company's agent project, says that they are essentially artificial intelligence models and algorithms that can make decisions autonomously in a dynamic world.
The grand vision of an agent is a system that can perform a multitude of tasks, much like a human assistant.
In the future, it could help you book a holiday. It would remember that you prefer luxury hotels, so it would only suggest hotels rated four stars or above, and then book one of them. It would also recommend the flights that best fit your schedule, plan your itinerary according to your preferences, and list the items you need to pack based on your travel plans and the weather forecast.
The intelligent agent may even send your itinerary to your friends and invite them to go together.
At work, it can analyze your to-do list and complete appropriate tasks, such as sending meeting invitations, memos, or emails.
One vision for intelligent agents is multimodality: the ability to handle language, audio, and video simultaneously.
For example, in Google's Astra demonstration, users can point their smartphone camera at objects and ask the agent questions; the agent can respond to text, audio, and video inputs.

David Barber, head of the Artificial Intelligence Center at University College London, said these agents could also streamline the processes of businesses and public organizations. For instance, agents might act as more sophisticated customer service bots.
The current generation of language-model-based assistants can only predict the next likely word to form sentences, but intelligent agents will be able to autonomously process natural-language commands and handle customer service tasks without supervision.
Barber gave an example: an agent could analyze a customer's complaint email, know how to look up the customer's order number, query databases such as the customer relationship management and delivery systems to check whether the complaint is valid, and then handle it according to company policy.
Fan said that, broadly speaking, there are two different types of intelligent agents: software agents and embodied agents.
Software agents operate on computers or mobile phones and use applications. "These agents are very useful for office work, sending emails, or completing a series of related tasks," he said.

Embodied agents live in 3D worlds (such as video games) or in robots. These agents let people interact with non-player characters controlled by artificial intelligence, making video games more engaging.
These agents can also help build more useful robots that assist us in completing daily chores, such as folding clothes and cooking.
Fan's team has built MineDojo, an agent framework based on the popular computer game Minecraft.
Using a vast amount of data collected from the internet, Fan's agent can learn new skills and tasks, freely explore the virtual 3D world, and complete complex tasks such as fencing in animals or scooping lava into a bucket.
Video games can simulate the real world well because they require agents to understand physics, reasoning, and common sense.

Researchers at Princeton University in the United States stated, in a new paper that has not yet been peer-reviewed, that agents tend to share three characteristics.
If artificial intelligence systems can pursue difficult goals in complex environments without guidance, they are considered "agents." Systems that can accept natural-language instructions and then act autonomously, without supervision, also qualify.
Finally, the term "agent" also applies to systems capable of using tools such as web searches and programming, or systems that can plan.
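The tool-use criterion above is often implemented as a dispatch loop: the agent picks a tool, runs it, and observes the result. The sketch below uses stub tools and a hard-coded plan, both invented for illustration; in a real agent, a language model would choose each step based on the task and the observations so far.

```python
# Minimal sketch of a tool-using agent loop with two stub tools.

def web_search(query: str) -> str:
    return f"search results for '{query}'"      # stub: would call a real search API

def run_code(source: str) -> str:
    return str(eval(source))                    # stub: evaluates a Python expression

TOOLS = {"web_search": web_search, "run_code": run_code}

def run_agent(plan: list[tuple[str, str]]) -> list[str]:
    """Execute a plan: a sequence of (tool_name, tool_input) steps."""
    observations = []
    for tool_name, tool_input in plan:
        observations.append(TOOLS[tool_name](tool_input))
    return observations

# A two-step hard-coded "plan": search, then compute.
print(run_agent([("web_search", "flights to Lisbon"), ("run_code", "2 + 3")]))
# → ["search results for 'flights to Lisbon'", '5']
```

Planning, the third characteristic the Princeton researchers name, is exactly the part this sketch hard-codes: deciding which tool to call next is the open research problem.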
Are they something new?

Chirag Shah, a professor of computer science at the University of Washington, said that the term "agent" has been around for many years and has meant different things at different times.
Fan said that there have been two waves of agent enthusiasm. The current wave is attributed to the boom in language models and the rise of systems such as ChatGPT.
The previous wave came in 2016, when Google DeepMind unveiled AlphaGo, a powerful artificial intelligence system for the board game Go. AlphaGo could make decisions and devise strategies, relying on reinforcement learning, a technique that rewards an algorithm for taking desirable actions.
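The reward idea behind reinforcement learning can be shown with a toy example. Everything here (the two actions, the reward values, the learning rate) is invented for illustration and has nothing to do with AlphaGo's actual training setup; it only shows how repeated rewards push an agent's value estimates toward the better action.

```python
import random

# Toy reinforcement learning: the agent tries actions at random and
# nudges its value estimate for each action toward the reward it got.

random.seed(0)
values = {"good": 0.0, "bad": 0.0}   # estimated value of each action
ALPHA = 0.1                          # learning rate

for _ in range(500):
    action = random.choice(list(values))                  # explore randomly
    reward = 1.0 if action == "good" else 0.0             # environment's feedback
    values[action] += ALPHA * (reward - values[action])   # update the estimate

print(values["good"] > values["bad"])  # → True
```

After enough trials, the rewarded action's estimate climbs toward 1.0 while the other stays near 0, so a greedy agent would start preferring it. Systems like AlphaGo apply the same principle at vastly larger scale.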
Oriol Vinyals, Vice President of Research at Google DeepMind, stated: "But these agents do not perform other tasks."
They were created for very specific tasks; AlphaGo, for instance, only plays Go. The new generation of agents built on foundation models is more versatile, because the models can learn from the world humans interact with.

Vinyals said: "You would feel that this model is interacting with the world, and then giving you better answers or better assistance, and so on."
What are the limitations?
There are still many unresolved questions that need to be answered. Kanjun Qiu, CEO and founder of the artificial intelligence startup Imbue, is committed to developing intelligent agents that can reason and program. She compares the current state of intelligent agents to self-driving cars more than a decade ago.
They can do some things, but they are not reliable enough and still do not have true autonomy.
Qiu said that a programming agent, for example, can generate code but sometimes makes mistakes, and it does not know how to test the code it is creating. Humans therefore still need to stay actively involved in the process. Artificial intelligence systems are still unable to fully reason, which is a key step in operating in the complex and ambiguous human world.
Fan said: "We are still far from having an agent that can automate all these household chores for us." He noted that current systems "will hallucinate, and they do not always strictly follow instructions."
Another limitation is that, after a while, an agent will "forget" the work it has already done. Artificial intelligence systems are limited by their context window: the amount of data they can "think about" at once is bounded.
"ChatGPT can write code, but it cannot handle particularly long content well. A human developer, by contrast, can read through an entire GitHub codebase with tens of thousands of lines of code," said Fan.
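The "forgetting" Fan describes can be illustrated with the simplest possible context-management scheme: when the conversation exceeds the window, the oldest messages are dropped. The token counting below (a crude word count) and the tiny budget are simplifications for illustration; real systems use proper tokenizers and far larger windows.

```python
# Sketch of why agents "forget": a fixed context window forces the
# oldest messages out of what the model can see.

CONTEXT_BUDGET = 8  # assumed token budget, deliberately tiny

def count_tokens(message: str) -> int:
    return len(message.split())  # crude stand-in for a real tokenizer

def fit_to_window(history: list[str], budget: int = CONTEXT_BUDGET) -> list[str]:
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for message in reversed(history):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                       # everything older is forgotten
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = ["book a flight to Lisbon", "prefer four star hotels", "pack light clothes"]
print(fit_to_window(history))
# → ['prefer four star hotels', 'pack light clothes']
```

Note that the oldest instruction, the flight booking, falls outside the window and is silently lost, which is exactly the failure mode longer context windows are meant to reduce.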
To address this, Google has improved its models' ability to process data, which lets users interact with them for longer and lets the models better remember past interactions. The company stated that it is working toward an effectively unlimited context window.
For embodied intelligent agents such as robots, there are even more limitations. We do not have enough training data to train them, and researchers are just beginning to harness the power of robot foundational models.
Therefore, amidst all the hype and excitement, what we must remember is that the study of agents is still in its early stages, and it may take us several years to fully experience their potential.
Can they be experienced now?

To a certain extent, yes. You may have already tried their early prototypes, such as OpenAI's ChatGPT and GPT-4. Qiu said, "If you are interacting with software that feels intelligent, then it is an agent."
She said that the best agents we currently have are systems with specific use cases, such as programming assistants, customer service robots, or workflow automation software like Zapier. But these are far from general agents capable of performing complex tasks.
Qiu said, "Today we have these computers, and they are really powerful, but we have to micromanage them."
Qiu said that OpenAI's ChatGPT plugin allows people to create artificial intelligence assistants for web browsers, which is an attempt at agents. But she said that these systems are still clumsy, unreliable, and unable to reason.
Nevertheless, Qiu believes these systems will one day change the way we interact with technology, and that is a trend worth watching. She said: "This is not to say that suddenly we have general artificial intelligence. Rather, it means that my computer can do more things than it could five years ago."