# Moving Beyond Naive Chatbots

**Published by:** [Agent Protocol + Chainagent](https://paragraph.com/@agentprotocol/)
**Published on:** 2023-05-03
**URL:** https://paragraph.com/@agentprotocol/moving-beyond-naive-chatbots

## Content

### LLMs and Wallet Messaging

The world already seems changed since the November 30th launch of ChatGPT, with AI becoming a powerful force and a major policy-making topic, as it rightly should be. Relay was founded on the trend-line of increasing benefits to communicating as your web3 identity. With blockchain-aware LLM agents, we can turn complex DeFi flows into clear conversations, turn confused devs flooding Discord channels into transparent DevRel learning sessions, and turn helpful onboarding chatbots into user-wallet-aware upselling agents of commerce.

Looking relatively far into the future (for this nascent space), to ~Q1 2024: we're developing LLM agents able to self-custody their crypto safely, communicate with other bots, and transact with them. The Relay interfaces and communication rails through ENS and wallet chat can become the de facto standard as the easiest way for bots to 1) control their crypto, 2) communicate, and 3) transact.

### The Relay Robot

The Relay Robot is a system for foundation models that significantly improves on baseline LLM solutions by implementing what we call an LLM trajectory framework, an idea Relay adapted from the well-known ReAct paper. It's useful to think of a "trajectory" as an LLM's train of thought, or its "trajectory through reasoning-space". We instrument an LLM with external tools, show the LLM how to use the tools, and then observe the LLM's trajectories as it responds to both a training environment and real user queries. By tracing, modeling, and analyzing these trajectories (sometimes in real time), we can massively improve upon baseline LLM behavior and deliver production-ready solutions.

In this post we go through the problems with the product made by following the first (and great!)
OpenAI data retrieval tutorials, and then we go over our solutions, both practical and theoretical, that improve a web3-enabled chatbot used for education and transaction assistance. Lastly, we discuss Relay's vision for how the AI agents that we and others build will use our LLM framework for oversight and safety, plus Relay's wallet-messaging rails, to:

- self-custody and have plausibly independent ownership of their crypto
- communicate with other user-wallets (both human and AI)
- intelligently transact crypto and NFTs with humans and other AIs. Trading, buying digital goods, buying services: it's all possible.

The AI x Blockchain industry is growing fast; we also just published our landscape overview post here.

### The Baseline Solution

Advances in language model technology have made it straightforward to build a proof-of-concept chatbot with access to an external knowledge base. The default early solutions all follow roughly the same pattern:

1. Download a bunch of data from the internet
2. Chunk the data into LLM-manageable pieces
3. Convert the text data into embeddings using an embeddings model (an embedding is a quantitative representation of the semantics of a piece of text)
4. Upload the embeddings into a database
5. When a user asks a question, convert the question into an embedding and use it to search the embeddings database for related documents
6. Pass the user's question and the results of the search to a language model
7. Return the language model's answer to the user

Building a chatbot in this way feels somewhat magical; it's pretty incredible the extent to which these bots "just work" without running headfirst into additional complexities. That said, most of them aren't suitable for production use cases and have some major challenges, especially at scale.
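The seven steps above can be sketched end to end. This is a toy version for illustration only: word-count vectors stand in for a real embeddings model, and an in-memory list stands in for a vector database; the sample documents are invented.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embeddings model: a bag-of-words vector.
    # Real embeddings capture semantics far better than word overlap.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(text: str, size: int = 50) -> list[str]:
    # Step 2: split text into LLM-manageable pieces of `size` words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Steps 1-4: gather data, chunk it, embed it, "upload" to a database.
documents = [
    "ENS names resolve to wallet addresses on Ethereum.",
    "Uniswap is a decentralized exchange for swapping tokens.",
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

# Steps 5-7: embed the question and retrieve the closest chunk; a real
# system would then pass question + context to the LLM for the answer.
def retrieve(question: str) -> str:
    q = embed(question)
    return max(index, key=lambda item: cosine(q, item[1]))[0]

context = retrieve("How do I swap tokens?")
```

A production system would swap `embed` for an embeddings API and `index` for a vector database, but the shape of the pipeline is the same.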
### Problems with the Baseline

#### 💰 Expensive

The state-of-the-art, publicly-available-for-production LLM (GPT-4 from OpenAI) costs $0.06 per 1,000 tokens (1-2 pages of English text) generated by the LLM. A single request to GPT-4 can cost up to $0.48. For language models to behave in a way that feels natural to users, they require entire conversation histories to be passed in every request, so it is quite common to run into this upper limit. Even at very modest scale this is quite expensive: 10,000 requests at 2,000 tokens per request would cost ~$1,200.

#### 🦥 Slow

A single request to GPT-4 can take anywhere between 3 and 30 seconds depending on server load. In a conversational setting, this amount of latency is unacceptable to users and will lead to "conversion abandonment". The latency is decreasing as OpenAI scales their servers, but caching and UI/UX patterns still have to account for the greatly atypical response time.

#### 🔐 Vendor Lock-In

Every language model (even different "flavors" of the same underlying model) displays unique behaviors and reacts differently to identical inputs. Migrating a system from one model to another is more of an art than a science.

#### 💥 Unreliable

Commercialized language models are extremely novel, extremely resource-intensive, and in extremely high demand. These factors produce a level of unreliability that is mostly unheard of in modern web applications, so potential API downtime must be designed around.

#### ☢️ Unstable

Language models are somewhat unpredictable. They aren't even deterministic: pass a language model the same input 100 times and you'll end up with 3-5 seemingly random answers. This can lead to confusing and heterogeneous user experiences and will undermine users' trust in the platform.

#### 💔 Brittle

Language models feel extremely powerful, and they are, but their limits become obvious with non-trivial use.
If all you want is for a language model to literally answer a question, with no constraints on the kinds of answers you desire, it works well, but that's about it. The default LLM experience will become stale, and users will grow bored and unenthusiastic about your brand.

#### ⬜️ Generic

Language models don't adhere to any particular brand. If you go to example.com and talk to their GPT-4-powered chatbot and then go to uniswap.org and talk to its chatbot, the two will be indistinguishable from each other.

#### 🤦‍♂️ Impersonal

Language models are not only generic, they're also completely impersonal. They don't know "who" they're talking to, "why" they're talking to them, or what the user wants. Talking to a language model can become frustrating for users, especially when the conversations are within a specific and complicated context.

#### 🕛 Static

Language models don't learn from their conversations and they don't improve over time. They don't acquire new "experience". Once deployed, a baseline language model solution will never improve.

#### 🐪 Non-Autonomous

Language models don't have the capacity to interface with anything other than their prompts and the data they've been trained on. Users expect new features; they expect applications to grow with them. A baseline language model cannot support those expectations.

### The Advanced Relay Robot

Each of the following sections describes a category of solutions that improves on the baseline. Each section describes the category in general, provides a specific example of what the very first version could look like, and goes on to describe a more long-term vision. One thing to keep in mind is that the space of possible features is quite large and the field quite young, so the following sections are not meant to be prescriptive or exhaustive. The goal is to provide a primer, a map of what's possible.

#### Agent Tooling

Augment an LLM by providing a natural-language interface for calling external tools.
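A minimal sketch of what such tool dispatch could look like, in the ReAct style: the model emits an action string, we run the matching registered tool, and feed the observation back into the next prompt. The tool names, action format, and sample data here are invented for illustration.

```python
from typing import Callable

# Registry mapping tool names to callables the LLM is allowed to invoke.
TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    # Decorator that registers a function as an LLM-callable tool.
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("token_price")
def token_price(symbol: str) -> str:
    # Stand-in for a real on-chain price lookup.
    prices = {"ETH": "1900 USD", "OP": "2.50 USD"}
    return prices.get(symbol.upper(), "unknown token")

@tool("knowledge_base")
def knowledge_base(query: str) -> str:
    # Stand-in for a pre-populated, indexed knowledge base.
    return "ENS names resolve to wallet addresses."

def dispatch(action: str) -> str:
    # Parse a model-emitted action of the form "tool_name[argument]"
    # and run the matching tool, returning the observation.
    name, _, rest = action.partition("[")
    return TOOLS[name](rest.rstrip("]"))

observation = dispatch("token_price[ETH]")
```

In a full agent loop, the observation would be appended to the prompt so the model can decide its next step.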
A first version of agent tooling could include:

- a way for the LLM to search white-listed, well-known web sites for info
- access to a pre-populated, indexed knowledge base
- access to on-chain data such as token prices and transaction history for the current user

Agent tooling can be used to improve reliability, stability, and dynamism (decrease brittleness).

#### Semantic Tracing

Every LLM trajectory leaves a "trace" in "semantic space": each step in the trajectory can be defined by a natural-language question that the LLM is trying to answer. By embedding and indexing these questions, we generate a dataset that can be used as an input into many of the other solutions we develop.

A first version of tracing could include:

- trace the LLM input
- trace the LLM output
- trace each call to agent tooling
- trace each "cache" miss (failed calls to agent tooling)

Semantic tracing can be used to improve costs, latency, reliability, observability, analytics, and more.

#### Real-time Human in the Loop

Intercept, pause, cancel, and divert trajectories. The decision of when to intercept, when to cancel, and when to loop in a human can be pre-determined or based on dynamic data (such as data generated from semantic tracing).

A first version of real-time human in the loop could include:

- basic observability (pipe all trajectories through Discord)
- escalation (react to a trajectory to jump into the conversation in lieu of the robot)
- trajectory labeling (similar to, and the beginnings of, a reinforcement learning framework)

Real-time human-in-the-loop features can be used to improve stability, observability, and branding (reduce genericness).

#### Real-time Analytics

Provide dashboards and alerting based on predefined criteria and semantic tracing data.
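As a toy illustration of alerting over semantic-trace data: each traced user question is compared against a set of watched topics, and a match fires an alert. The topic names and keyword sets are hypothetical, and simple keyword overlap stands in for real embedding similarity.

```python
import re

# Hypothetical watched topics mapped to trigger keywords; a real system
# would compare embeddings of traced questions against topic embeddings.
WATCHED_TOPICS = {"seed phrase": {"seed", "phrase", "mnemonic", "recovery"}}

def alerts_for(traced_question: str) -> list[str]:
    # Return the watched topics this traced question appears to touch.
    words = set(re.findall(r"[a-z]+", traced_question.lower()))
    return [topic for topic, keys in WATCHED_TOPICS.items() if words & keys]

fired = alerts_for("Should I ever share my seed phrase with support?")
```

Each fired alert could then be piped to Discord or a dashboard alongside the full trajectory.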
A first version of this could include:

- topic-based alerting: trigger alerts or notifications when users ask about specific topics
- sentiment-based alerting: trigger notifications for conversations based on "how the conversation is going"
- a semantic heatmap with clustering: show clusters of user requests and responses

Real-time analytics can be used to improve reliability and stability, and to surface useful insights into the product.

#### Semantic Caching

Very similar to semantic tracing, we can also introduce semantic caching. When a trajectory is very close to another in semantic space, we can choose to halt the trajectory and return the result of the previously-executed trajectory. One way to think of this is that we allow the LLM to learn.

A first version of this solution could include:

- pre-trajectory short-circuiting (if the user request is a match, don't even call the LLM)
- mid-trajectory short-circuiting (if the LLM-generated question for the trajectory step is a match, fetch the previous answer and return immediately)

Semantic caching can be used to improve cost, latency, and reliability.

#### Replay

Because we trace all of the important details about each trajectory, it should be trivial to "replay" a trajectory.

A first version of this could include:

- replay a trajectory
- edit a trajectory and replay the edited version

Replay can be used to improve observability and to reduce vendor lock-in.

#### Guardrails

Every LLM output will pass through a "guardrails" evaluation framework. The framework helps catch obvious errors but can also evaluate more subtle details, like whether the robot's output stayed "on brand".

A first version of this solution could include:

- factual error detection (potentially using a "side channel" LLM)
- transaction assurance: for any onchain action recommended (or generated) by the chatbot, we use SOTA tools to increase the user's understanding of the action and confidence that the action does what they expect.
For an example of what's possible, see Stelo Labs tx simulation.

Guardrails can be used to improve stability, reliability, and personalization (reduce genericness).

#### Learning and Recency Bias

The best way to think about the knowledge base is as a pre-computed cache of LLM trajectories. The knowledge base is "what the robot knows". Recent data can be given greater weight in the vector database. For any dataset that we want the chatbot to understand, we trigger a process that automates the generation of a large number of trajectories over that knowledge base.

A first version of ingestion could include:

- scraping a public or private website, generating trajectories over the data, and saving the results
- providing a patching API that allows the

Learning can be used to improve cost, latency, reliability, observability, and many other aspects of the platform.

#### Invalidation

Knowledge invalidation is a semantic search-and-delete over the robot's knowledge base.

A first version could include:

- semantic search + delete conditioned on some external criteria (like data source)

Invalidation is a useful tool for improving the flexibility of the robot, allowing it to evolve unencumbered by stale understanding.

#### Curation and Analytics

The robot's knowledge base, and the tracing data around it, can be used to curate the platform's user-facing content.

A first version could include:

- a report showing which subjects are most likely to be requested by users
- a report showing which subjects the robot doesn't know about that are most likely to be requested
- a report showing which subjects require the longest/shortest trajectories

Curation and analytics are useful for improving essentially all pieces of the platform. In particular, insights from the chatbot can lead to improvements in user-facing documentation and materials.

### Further R&D in the Industry

#### Query Routing

Some questions are answered by the docs index, some are answered by the list index, etc.
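A hedged sketch of what routing between indexes could look like: score the question against a short description of each index and send it to the best match. The index names and descriptions are hypothetical; a production router would use embeddings or an LLM call rather than keyword overlap.

```python
import re

# Hypothetical routes: each index gets a short keyword description.
ROUTES = {
    "docs_index": "how to guide tutorial explain setup install",
    "list_index": "list all show every enumerate latest recent",
}

def route(question: str) -> str:
    # Send the question to the index whose description it overlaps most.
    words = set(re.findall(r"[a-z]+", question.lower()))
    return max(ROUTES, key=lambda name: len(words & set(ROUTES[name].split())))

chosen = route("How do I install the wallet?")
```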
- https://gpt-index.readthedocs.io/en/latest/guides/tutorials/graph.html
- https://twitter.com/jerryjliu0/status/1653789212620230658

#### Caching

Similarity caching so that frequent questions skip a trip to OpenAI.

- https://python.langchain.com/en/latest/modules/models/llms/examples/llm_caching.html#gptcache

#### Hallucination Rate

Compare query/response/source to see if they all match. Leads to near-zero hallucinations. (Currently implementing.)

- https://twitter.com/jerryjliu0/status/1645451894637367298

#### LLM Evaluation

On retrieval, how do certain variables affect the result?

- split_method
- chunk_chars
- overlap
- embeddings
- retriever_type
- num_neighbors

See: https://github.com/PineappleExpress808/auto-evaluator

#### Guard Rails

These strictly check the output of the LLM and ensure it conforms to certain patterns or text. Nvidia has an option shown here, and there is a separate project (which came out first) at getguardrails.ai.

- https://twitter.com/NVIDIAAIDev/status/1650887287494901763

Back to the lab! (Appendix links below)

### Research and Theory

- ReAct: https://arxiv.org/abs/2210.03629
- Reflexion: https://arxiv.org/abs/2303.11366
- Self-Refine: https://arxiv.org/abs/2303.17651
- Self-Ask: https://ofir.io/self-ask.pdf
- self-consistency: https://arxiv.org/pdf/2203.11171.pdf
- auto fine-tuning: https://arxiv.org/abs/2205.00445
- MRKL: https://arxiv.org/abs/2205.00445
- generative agents: https://arxiv.org/abs/2304.03442
- CAMEL: https://github.com/lightaime/camel
- language model cascades: https://arxiv.org/pdf/2207.10342.pdf
- factored cognition: https://primer.ought.org
- prompt chaining: https://arxiv.org/pdf/2203.06566.pdf
- chain-of-thought prompting: https://arxiv.org/pdf/2201.11903.pdf
- zero-shot chain-of-thought: https://arxiv.org/abs/2205.11916
- algorithmic prompting: https://arxiv.org/pdf/2211.09066.pdf
- HuggingGPT: https://arxiv.org/pdf/2303.17580.pdf
- OpenAssistant: https://drive.google.com/file/d/10iR5hKwFqAKhL3umx8muOWSRm7hs5FqX/view?pli=1
- InstructGPT: https://arxiv.org/abs/2203.02155
- HyDE: https://arxiv.org/pdf/2212.10496.pdf
- self-supervised learning guide: https://arxiv.org/pdf/2304.12210.pdf

### Projects and Tools

- https://github.com/hwchase17/langchain
- https://twitter.com/gpt_index
- https://shreyar.github.io/guardrails/
- https://github.com/transmissions11/flux
- https://github.com/jbrukh/gpt-jargon
- https://github.com/microsoft/JARVIS
- https://lmql.ai/
- https://github.com/microsoft/semantic-kernel
- https://jamesturk.github.io/scrapeghost/
- https://haystack.deepset.ai/
- https://github.com/LAION-AI/Open-Assistant