emerging-llm-app-stack

Data Preprocessing / Embedding

llm-app-stack-embedding

  • Store private data to be retrieved later.
  • Data is broken into chunks, passed through an embedding model, and then stored in a vector DB.
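The chunk → embed → store pipeline can be sketched as follows. This is a toy illustration: the hashing-based `toy_embed` stands in for a real embedding model (e.g., text-embedding-ada-002), and a plain list of pairs stands in for the vector DB.

```python
import hashlib
import math

def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into overlapping character chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

def toy_embed(text, dim=8):
    """Stand-in for a real embedding model: hash words into a
    fixed-size vector, then L2-normalize it."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def index_document(text):
    """'Vector DB' here is just a list of (chunk, vector) pairs."""
    return [(chunk, toy_embed(chunk)) for chunk in chunk_text(text)]
```

In a real stack, `toy_embed` becomes an API or model call and `index_document` writes to Pinecone, Weaviate, FAISS, etc., but the shape of the pipeline is the same.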

Stack

  • Data Pipeline
    • Traditional ETL tools (e.g., Databricks, Airflow)
    • Document loaders built into orchestration frameworks (e.g., LangChain, LlamaIndex)
  • Embeddings
    • OpenAI API (text-embedding-ada-002)
      • Easy to use, gives reasonably good results, becoming increasingly cheap
    • Cohere
      • Focused more on embeddings, better performance in certain scenarios
    • HuggingFace Sentence Transformers (open source)
  • Vector Database
    • Cloud-hosted (e.g., Pinecone)
      • Easy to get started with, good performance at scale, SSO, uptime SLA
    • Open source systems (e.g., Weaviate, Vespa, Qdrant)
      • Excellent single-node performance, can be tailored for specific applications
    • Local vector management libraries (e.g., Chroma, FAISS)
      • Great developer experience, easy to spin up for small apps and experiments
      • Don’t necessarily substitute for a full database at scale
    • OLTP extensions (e.g., pgvector)
      • Good solution for enterprises that buy most of their data infrastructure from a single cloud provider
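Whatever the backing store, the core query operation each of these systems provides is nearest-neighbor search over the stored vectors. A minimal cosine-similarity top-k sketch (assuming the index is a list of hypothetical `(chunk, vector)` pairs):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, index, k=2):
    """Return the k chunks whose vectors are most similar to the query.
    Production systems (FAISS, Pinecone, pgvector) do this with
    approximate-nearest-neighbor indexes rather than a full scan."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:k]]
```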

Prompt Construction / Retrieval

llm-app-stack-retrieval

  • Given a query from a user, the application constructs a series of prompts to submit to the LLM.
  • The prompt typically combines a prompt template hard-coded by the developer with context retrieved from the vector DB and the user's query.
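A minimal sketch of this combination step (the template text and function names are illustrative, not from any particular framework):

```python
# Hard-coded by the developer; {context} and {question} are filled at runtime.
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Combine the template with retrieved context and the user's query."""
    context = "\n---\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```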

Stack

  • Orchestration: Abstract away many of the details of prompt chaining (e.g., interacting with external APIs, retrieving contextual data from vector DBs, maintaining memory across multiple LLM calls)
    • LangChain
    • LlamaIndex
    • ChatGPT
      • Can be considered a substitute solution
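What orchestration frameworks automate can be reduced to a small loop: retrieve context, fill a template, call the model, and carry memory across calls. A toy sketch (all class and parameter names are hypothetical, not LangChain/LlamaIndex API):

```python
class MiniChain:
    """Toy orchestration loop illustrating what frameworks like
    LangChain abstract away."""

    def __init__(self, llm, retriever, template):
        self.llm = llm              # callable: prompt -> completion text
        self.retriever = retriever  # callable: query -> list of context strings
        self.template = template    # expects {history}, {context}, {question}
        self.memory = []            # (question, answer) turns across calls

    def run(self, question):
        context = "\n".join(self.retriever(question))
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.memory)
        prompt = self.template.format(history=history, context=context,
                                      question=question)
        answer = self.llm(prompt)
        self.memory.append((question, answer))
        return answer
```

Swapping in a real retriever (vector DB query) and a real LLM client turns this loop into the retrieval-augmented pattern the frameworks provide out of the box.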

Prompt Execution / Inference

llm-app-stack-inference

  • Compiled prompts are submitted to a pretrained LLM for inference.
  • Uses either proprietary model APIs or open-source / self-trained models

Stack

  • LLM
    • OpenAI API
      • gpt-4 / gpt-4-32k: Best-case scenario for app performance, requires no fine-tuning or self-hosting
      • gpt-3.5-turbo: ~50x cheaper and significantly faster than GPT-4
    • Other proprietary vendors
      • Anthropic’s Claude: Fast inference, GPT-3.5-level accuracy, more customisation options, up to 100k context window (accuracy degrades with the input length)
    • Open source models
      • Effective in high-volume B2C use cases (e.g., search / chat) where there’s wide variance in query complexity and a need to serve free users cheaply
      • Makes the most sense in conjunction with fine-tuning base models
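Because apps often mix providers (a strong proprietary model first, a cheaper or self-hosted fallback), the inference call is commonly wrapped in retry-and-fallback logic. A hedged sketch with stub callables in place of real API clients:

```python
def complete(prompt, models, max_retries=2):
    """Try each model in order (e.g., proprietary API first, open-source
    fallback second), retrying transient failures, and return the first
    successful completion. Each model is a callable: prompt -> text,
    raising RuntimeError (standing in for API/timeout errors) on failure."""
    last_error = None
    for model in models:
        for _ in range(max_retries):
            try:
                return model(prompt)
            except RuntimeError as err:
                last_error = err
    raise RuntimeError("all models failed") from last_error
```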

References