Thank you very much for sharing this interview with us, Doug! I found it extremely helpful in shaping my thinking.
Thoroughly enjoyed this pod. Appreciate it!
Doug, so it seems we have two curves: exploding demand for inference meeting these disaggregation architectures that let GPUs be used much more efficiently.
Is this a net negative for NVDA, since its revenue hockey stick no longer ticks up one-for-one with demand? Does the demand driver of more agentic/persistent-context workloads translate into more of this new networked memory and slower growth in installed GPUs?
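To make the two-curves tension concrete, here is a hypothetical back-of-envelope; every number below is made up purely for illustration:

```python
# Hypothetical back-of-envelope: if demand for inference tokens grows faster
# than disaggregation improves per-GPU throughput, installed GPUs still grow,
# just more slowly than demand. Both numbers are invented for illustration.
demand_growth = 3.0      # assumed YoY growth in inference demand (3x)
efficiency_gain = 1.8    # assumed tokens-per-GPU uplift from disaggregation

gpu_growth = demand_growth / efficiency_gain
print(f"Installed-GPU growth: {gpu_growth:.2f}x vs {demand_growth:.1f}x demand")
# -> Installed-GPU growth: 1.67x vs 3.0x demand
```

So the question is really which curve is steeper.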
Hi Doug, that was a fantastic post. Thanks for sharing!
I'm trying to pinpoint the inflection point for disaggregated inference. As you mentioned, deployment at scale seems to be taking off right now, although interesting papers on the idea appeared 12-18 months ago.
In your view, is this shift primarily a 'demand-pull' driven by the changing nature of AI workloads (prefill-heavy, long-context 'agentic' models)?
Or is it more of a 'supply-push' from the technology/hardware finally maturing and enabling this solution to be deployed at scale?
A bit of both. I think it becomes the standard via vLLM, but the decode scale-out for agents' output tokens helps too.
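For anyone curious what "standard via vLLM" looks like in practice, here is a rough sketch modeled on vLLM's experimental disaggregated-prefill example (circa v0.6-0.7). Config field names and connectors vary by version, and the model and prompts are placeholders, so treat it as illustrative rather than canonical:

```python
# Sketch of disaggregated prefill/decode, loosely following vLLM's
# experimental disaggregated-prefill example. Run one process per role:
#   CUDA_VISIBLE_DEVICES=0 python disagg.py prefill
#   CUDA_VISIBLE_DEVICES=1 python disagg.py decode
import sys

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

role = sys.argv[1]  # "prefill" (KV producer) or "decode" (KV consumer)
prompts = ["Explain disaggregated inference in one paragraph."]
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model

if role == "prefill":
    # Prefill worker: computes the KV cache and ships it over the connector
    # instead of keeping it pinned to this GPU.
    cfg = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}')
    llm = LLM(model=MODEL, kv_transfer_config=cfg)
    # max_tokens=1 forces a prefill-only pass; no real decoding happens here.
    llm.generate(prompts, SamplingParams(temperature=0, max_tokens=1))
else:
    # Decode worker: receives the KV cache and generates the output tokens,
    # never paying the prefill compute itself.
    cfg = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer",'
        '"kv_rank":1,"kv_parallel_size":2}')
    llm = LLM(model=MODEL, kv_transfer_config=cfg)
    outputs = llm.generate(prompts,
                           SamplingParams(temperature=0, max_tokens=256))
    print(outputs[0].outputs[0].text)
```

The point of the split is that prefill is compute-bound while decode is memory-bandwidth-bound, so giving each its own pool (with the KV cache moving between them) lets both kinds of hardware run closer to full utilization, and decode capacity can scale out independently as agents emit more output tokens.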