Tesla held their 2nd AI Day yesterday (Sep 30), following their 1st-ever AI Day about a year ago [1]. The amount of progress and detail was super impressive, to say the least. To solve the fully vision-based full self-driving (FSD) challenge they envisioned years ago, they have embarked on a multi-year pursuit of building arguably the largest real-world AI application so far.
They started their journey with then-state-of-the-art vision neural network models on Nvidia GPU clusters (which they still use — currently around 14,000 Nvidia GPUs). Now they are reinventing everything — hardware and software. They are building their own ASIC silicon for neural compute, then tiles, racks, all the way up to an ExaFLOP-scale supercomputing cluster — Tesla Dojo [2]. When they get there, it will likely land in the top spot on the global supercomputer list. And Dojo’s scale is limited only by the speed of light, and maybe heat dissipation :-) ?
On the software side, their AI stack has evolved significantly — even from just a year ago. Back then, their production deployment was still analyzing 2D images with object detection and lane-line markings — not materially different from hobbyists’ ad hoc models, just much bigger and trained on much, much more data. Today, they have moved to reconstructing the 3D world around each Tesla vehicle using its surrounding cameras [3]. They then try to make sense of that 3D world and plan out a safe and comfortable execution path.
Here is a snapshot from their AI Day 2 presentation [1]. It is a schematic diagram of their vision-based AI stack: it starts with input from the 8 surround cameras on the vehicle, goes through layers of neural computation to build a 3D vector-field representation of the world, and then performs object detection (kind, position, velocity, acceleration, intent, etc.), execution planning, and so on.
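Purely to make the shape of that pipeline concrete, here is a minimal Python sketch of the cameras → neural features → fused world representation → detection → planning flow. Every class name (CameraBackbone, VectorFieldFusion, DetectionHeads, Planner) and the dummy computations inside are my own placeholders, not Tesla’s actual components.

```python
# Hypothetical sketch of a surround-camera driving stack; not Tesla's code.
import numpy as np


class CameraBackbone:
    """Per-camera feature extractor (stand-in for a real CNN/transformer)."""
    def extract(self, image: np.ndarray) -> np.ndarray:
        # A learned network in a real stack; here just downsample and average
        # the color channels to keep the sketch runnable.
        return image[::8, ::8].mean(axis=-1)


class VectorFieldFusion:
    """Fuse the 8 camera feature maps into one top-down world representation."""
    def fuse(self, features: list[np.ndarray]) -> np.ndarray:
        return np.stack(features).mean(axis=0)


class DetectionHeads:
    """Predict objects, lanes, drivable space from the fused representation."""
    def detect(self, world: np.ndarray) -> dict:
        return {"objects": [], "lanes": [], "drivable_space": world > 0.5}


class Planner:
    """Turn the detected world state into a trajectory."""
    def plan(self, world_state: dict) -> list[tuple[float, float]]:
        return [(0.0, 0.0), (1.0, 0.1)]  # placeholder waypoints


def run_stack(camera_images: list[np.ndarray]) -> list[tuple[float, float]]:
    backbone, fusion, heads, planner = CameraBackbone(), VectorFieldFusion(), DetectionHeads(), Planner()
    features = [backbone.extract(img) for img in camera_images]
    world = fusion.fuse(features)
    return planner.plan(heads.detect(world))


# 8 surround cameras, each a dummy 960x1280 RGB frame
trajectory = run_stack([np.random.rand(960, 1280, 3) for _ in range(8)])
```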
This is very impressive in itself. What got me thinking further was the future of real-world applied AI — it is all about designing and building a custom AI stack with various neural compute modules that transform the input into intermediate (internal) representations, which can be viewed as abstractions of the world, and eventually into the output the application needs for downstream execution.
From a purely architectural point of view, this is not too different from a traditional simulation-based computational stack, where you start from the input and go through layers of computation until you get to the output. The key difference is that these AI stacks mostly rely on neural network modules to transform and conceptualize input data, rather than, say, solving partial differential equations to 16 significant digits of precision. That is because AI-based applications need to mimic human sensory processing, which is great at pattern recognition and filling in the blanks (interpolation and extrapolation), but not so good at precision calculation (at least not efficiently). You can’t write a differential equation to detect a bird from vision input. That is the biggest differentiating factor from a traditional computational/processing stack.
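To make that contrast concrete, here is a toy comparison. The numeric part uses scipy’s real solve_ivp API; the BirdClassifier is a purely hypothetical stand-in for any trained vision model, since no closed-form equation describes “bird-ness.”

```python
# Traditional stack vs. learned perception module (toy illustration).
import numpy as np
from scipy.integrate import solve_ivp

# Traditional stack: solve dy/dt = -y to high precision (exact answer: exp(-1)).
sol = solve_ivp(lambda t, y: -y, t_span=(0.0, 1.0), y0=[1.0], rtol=1e-12, atol=1e-12)
print(abs(sol.y[0, -1] - np.exp(-1)))  # tiny numerical error


# AI stack: a trained network maps pixels to a probability learned from
# labeled examples. This class is a placeholder, not a real model.
class BirdClassifier:
    def predict_proba(self, image: np.ndarray) -> float:
        return 0.93  # stand-in for a learned model's output


is_bird = BirdClassifier().predict_proba(np.random.rand(224, 224, 3)) > 0.5
```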
Secondly, having said the above, it is not sufficient for our custom AI stack to only do sensory perception through neural network modules. We need a conceptual representation of the detected objects, their attributes, and their relationships. In Tesla’s FSD stack, they need to abstract the surrounding world into objects (other vehicles, bicycles, pedestrians, pets, road signs, lanes, road surfaces…), attributes (position, velocity, acceleration, drivable space, non-drivable space, surface irregularities…), and their relationships → these can then be computed over, through rule-based or optimization-based algorithms or yet more neural networks, to plan responses — execution planning. The custom AI stack is not just neural networks — it needs to be a hybrid architecture (maybe even including direct math and solving some differential equations).
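Here is a hypothetical sketch of what such a hybrid layer could look like: neural perception heads would populate typed objects with attributes, and a non-neural rule (which could equally be an optimizer or another network) reasons over them. All class names, fields, and thresholds below are illustrative assumptions, not Tesla’s design.

```python
# Hybrid layer sketch: neural outputs become typed objects, then plain logic
# plans over the conceptual representation. All values are illustrative.
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    VEHICLE = "vehicle"
    PEDESTRIAN = "pedestrian"
    CYCLIST = "cyclist"


@dataclass
class TrackedObject:
    kind: Kind
    position_m: tuple[float, float]    # (x, y) in ego frame, meters
    velocity_mps: tuple[float, float]
    intent: str                        # e.g. "crossing", from a neural head


def plan_speed(objects: list[TrackedObject], ego_speed_mps: float) -> float:
    """Rule-based speed planning over the conceptual representation."""
    for obj in objects:
        in_corridor = 0.0 < obj.position_m[0] < 30.0 and abs(obj.position_m[1]) < 2.0
        if obj.kind is Kind.PEDESTRIAN and obj.intent == "crossing" and in_corridor:
            return 0.0  # stop for a crossing pedestrian ahead of us
    return min(ego_speed_mps, 13.9)  # otherwise cap at roughly 50 km/h


objects = [TrackedObject(Kind.PEDESTRIAN, (12.0, 0.5), (0.0, 1.2), "crossing")]
print(plan_speed(objects, ego_speed_mps=15.0))  # -> 0.0
```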
Thirdly — although they didn’t speak about it in yesterday’s AI Day presentation, or maybe they haven’t gotten to it yet — there is memory. Humans are very good at remembering and learning from past experiences. You hit a pothole in a rough section of road, you figure out how to navigate into a narrow garage with a 3-point turn, or maybe you just want the vehicle to learn your particular driving style (as long as it is safe). That memory needs to be stored somewhere as background context — a knowledge base.
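As a rough sketch of where such a memory layer might sit, imagine a store keyed by coarse location that the planner reads as background context and writes back into. Everything below (the DrivingMemory class, the tile key, the note fields) is a made-up illustration, not any shipping system.

```python
# Hypothetical memory / knowledge-base layer keyed by coarse location.
# A real system might use a learned embedding store; a dict shows the idea.
from dataclasses import dataclass, field


@dataclass
class DrivingMemory:
    # key: coarse location tile, value: notes learned from past drives
    store: dict[tuple[int, int], dict] = field(default_factory=dict)

    def remember(self, tile: tuple[int, int], note: dict) -> None:
        self.store.setdefault(tile, {}).update(note)

    def recall(self, tile: tuple[int, int]) -> dict:
        return self.store.get(tile, {})


memory = DrivingMemory()
memory.remember((37, -122), {"pothole_ahead": True, "garage_maneuver": "3-point turn"})

# Later, the planner retrieves this as background context:
context = memory.recall((37, -122))
if context.get("pothole_ahead"):
    pass  # e.g., slow down or shift within the lane
```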
A framework for mimicking human intelligence?
You see where I am going with this. In the end, many real-world AI applications are not that different from Tesla’s full self-driving, including the Tesla Bot. The input data formats may be different, but it comes down to visual, auditory, and language input — and in the future, with humanoid robots, maybe touch, smell, and possibly taste… For the common visual, auditory, and language data, ever more powerful models are emerging on a weekly basis [4].
A real-world AI application stack would likely start with a combination of these perception neural network modules, conceptualizing the input into the involved agents/objects, their associated attributes, and their relationships — basically forming a knowledge representation of the world. Then we need to add further abstractions: understanding numbers, doing math, applying physical laws, accounting rules… Finally, solve for the optimal actions/suggestions given the objectives — driving safely, performing factory labor efficiently, or automating finance back-office routines. Throughout this process, you would interact with a memory layer to retrieve previously learned knowledge and context, and to write new memories.
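Pulling those pieces together, here is a schematic sketch of that generic loop — perception, knowledge representation, abstract reasoning, action selection, with a memory layer read and written along the way. All stage names and signatures are placeholders for whatever concrete modules a given application would use.

```python
# Hypothetical generic real-world AI application loop; stage names are mine.
from typing import Any


def perceive(raw_inputs: dict[str, Any]) -> dict[str, Any]:
    """Neural perception: vision, audio, language -> structured percepts."""
    return {"objects": [], "text": raw_inputs.get("language", "")}


def build_knowledge(percepts: dict[str, Any], memory: dict[str, Any]) -> dict[str, Any]:
    """Conceptualize percepts into agents/objects, attributes, relationships."""
    return {"world": percepts, "context": memory}


def reason(knowledge: dict[str, Any]) -> dict[str, Any]:
    """Further abstraction: math, physical laws, accounting rules, etc."""
    return {"constraints": [], "knowledge": knowledge}


def act(reasoned: dict[str, Any], objective: str) -> str:
    """Solve for actions/suggestions against the stated objective."""
    return f"best action for: {objective}"


def step(raw_inputs: dict[str, Any], memory: dict[str, Any], objective: str) -> str:
    percepts = perceive(raw_inputs)
    knowledge = build_knowledge(percepts, memory)
    action = act(reason(knowledge), objective)
    memory["last_action"] = action  # write new memories for future steps
    return action


memory: dict[str, Any] = {}
print(step({"language": "invoice received"}, memory, objective="automate a back-office routine"))
```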
All paths lead to AGI
What is AGI, and what is human intelligence anyway? Unfortunately, there is no consensus on this. People would literally fight each other over these definitions.
Some say humans have emotions — okay, that may be true, but let’s not go there yet.
Some say human intelligence is all about making predictions — but the most primitive algae can do that too: they actively seek out sunlight and predict where nutrients can be found.
My humble opinion is that at least one defining characteristic is humans’ ability to do long-term planning for long-term gain. Once you realize this, you see that language, social interactions, willpower, maybe even emotions or what have you — are all abstractions (human inventions) that allow this long-term planning to happen. The longer the horizon, the more sophisticated the abstractions need to be. In fact, people use “he/she always thinks three steps ahead of the competition” to indicate superior intelligence.
Going back to the real-world AI application stack: it is in this sense, I believe, that all paths eventually converge toward human-like intelligence, or AGI.
The timeline for surpassing average human intelligence? Nobody knows precisely, but the median expectation is around 2050, based on surveys of domain experts [5]. Given what I saw in Tesla’s AI Day presentation, and knowing their execution power, I tend to think this timeline could be a lot closer.
References:
- [1] Tesla AI Day 2, Sep 30, 2022: https://www.youtube.com/watch?v=ODSJsviD_SU
- [2] Tesla Dojo microarchitecture and cluster: https://www.servethehome.com/tesla-dojo-ai-system-microarchitecture/, https://www.servethehome.com/tesla-dojo-custom-ai-supercomputer-at-hc34/
- [3] Andrej Karpathy — AI for Full-Self Driving at Tesla, presentation at ScaledML 2020: https://www.youtube.com/watch?v=hx7BXih7zx8&t=0s
- [4] State-of-the-art ML model benchmarks in various categories: https://paperswithcode.com/sota
- [5] AGI singularity timing: https://research.aimultiple.com/artificial-general-intelligence-singularity-timing/