Running #LLMs requires significant computational power, which scales with model size and context length.
Ye (Charlotte) Qi from #Meta shares strategies for fitting models onto different hardware types, plus techniques for optimizing inference latency & throughput.
Watch the #InfoQ video: https://bit.ly/3FCugyK
Full #transcript included.