Anthropic's Opinionated Memory Bet
Anthropic ships native memory as six file operations, betting that models should handle their own retrieval and storage rather than relying on separate systems.
October 14, 2025

In a previous post, I covered how the Claude app uses tool calls to retrieve memories, and how this differs from ChatGPT's approach. A few days ago, Anthropic brought this same architecture to their developer platform with the release of a memory tool. While the AI world was distracted by flashier launches, this one flew under the radar. But I think it deserves attention for two reasons.

First, this is the first time a model provider has taken an opinionated stance on how memory should work. Until now, teams have been left to figure it out themselves—building with RAG, summarization, knowledge graphs, whatever made sense for their use case. With this release, Anthropic is saying: if you're building with Claude, this is how you should handle memory.

Second, it's fast, flexible, powerful, and genuinely good.

In this post, I'll walk through how the memory tool works, show you different ways to use it, discuss the bets Anthropic is making here, and help you figure out whether it's a good fit for what you're building.

You can view the accompanying code, including Python CLI and Next.js implementations, here:

What is the Memory Tool

The memory tool is a specialized tool that ships with Anthropic's SDK, alongside their file search and web search tools. Claude models are specifically trained on these tools—the model knows when to check memory, what to look for, and how to update it. Unlike user-defined tools that Claude encounters for the first time at runtime, this is native behavior built into the model.

The system itself is intentionally straightforward: Claude stores memories as files in a directory on your filesystem (typically /memories). When Claude needs to remember something, it creates or updates these files. When it needs to recall something, it reads them. It doesn't use embeddings, vector databases, or knowledge graphs—just files.

Here's what happens during a typical interaction (a minimal wiring sketch in Python follows the list):

  1. You ask Claude to help with a task
  2. Claude automatically checks the memory directory
  3. It finds and reads relevant files
  4. It uses that context to inform its response
  5. If it learns something new worth remembering, it updates or creates files
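
To make that loop concrete, here's a minimal sketch of the wiring in Python: enable the memory tool, then keep calling Claude, executing its memory operations, until it produces a text reply. The tool type string, beta flag, and model id are assumptions based on Anthropic's docs at launch—verify them against the current docs before relying on them.

```python
# Minimal agent loop wiring Claude to a memory backend. The tool type,
# beta flag, and model id are assumptions from Anthropic's launch docs;
# check the current docs before relying on them.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MEMORY_TOOL = {"type": "memory_20250818", "name": "memory"}

def run_turn(messages: list[dict]) -> str:
    """Keep calling Claude, executing memory ops, until it replies in text."""
    while True:
        response = client.beta.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            betas=["context-management-2025-06-27"],
            tools=[MEMORY_TOOL],
            messages=messages,
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            return "".join(b.text for b in response.content if b.type == "text")
        messages.append({"role": "assistant", "content": response.content})
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    # execute_memory_op is your storage backend; a
                    # filesystem sketch appears later in this post.
                    "content": execute_memory_op(block.input),
                }
                for block in tool_uses
            ],
        })
```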

How it Works

Before diving into the technical details, the best way to gain an intuition for the memory tool is to see it in action. Get yourself an Anthropic API key and try this fitness tracker implementation below. You can also run it locally with the Python CLI or Next.js implementations.

I earlier called this system powerful because of its separation of concerns. Anthropic has defined six fundamental operations—view, create, str_replace, insert, delete, and rename—that form the interface between Claude and your storage system. While Anthropic defines what these tools should do conceptually, you implement them to create the actual system:

  1. Storage: You handle everything about how memories are physically stored: the file format (JSON, XML, plain text), where data lives (local disk, S3, encrypted volumes), and how it's secured (encryption, access controls, audit logs). Claude doesn't care about these implementation details—it just needs those six operations to work. (A minimal filesystem backend is sketched after this list.)
  2. Strategy: Through the system prompt, you guide Claude on how to use these tools effectively. Without instructions, Claude might dump all memories into a single file or invent its own schemas. But with proper guidance, you can create sophisticated memory structures.
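
To ground the storage side, here's a minimal local-filesystem backend. It assumes the per-command fields (path, file_text, old_str/new_str, insert_line/insert_text, old_path/new_path) match Anthropic's published schema—treat the exact field names as something to verify against the docs.

```python
# A minimal local-filesystem backend for the six memory operations.
# The per-command field names are assumptions to verify against
# Anthropic's published schema.
from pathlib import Path

MEMORY_ROOT = Path("./memories")  # map Claude's /memories onto this directory

def _resolve(path: str) -> Path:
    # Guard against path traversal: memory paths must stay under MEMORY_ROOT.
    full = (MEMORY_ROOT / path.removeprefix("/memories").lstrip("/")).resolve()
    if not full.is_relative_to(MEMORY_ROOT.resolve()):
        raise ValueError(f"path escapes memory root: {path}")
    return full

def execute_memory_op(op: dict) -> str:
    cmd = op["command"]
    if cmd == "view":
        target = _resolve(op["path"])
        if target.is_dir():
            return "\n".join(sorted(p.name for p in target.iterdir()))
        return target.read_text()
    if cmd == "create":
        target = _resolve(op["path"])
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(op["file_text"])
        return f"created {op['path']}"
    if cmd == "str_replace":
        target = _resolve(op["path"])
        text = target.read_text()
        if op["old_str"] not in text:
            return "error: old_str not found"  # Claude sees this and retries
        target.write_text(text.replace(op["old_str"], op["new_str"], 1))
        return f"updated {op['path']}"
    if cmd == "insert":
        target = _resolve(op["path"])
        lines = target.read_text().splitlines(keepends=True)
        lines.insert(op["insert_line"], op["insert_text"] + "\n")
        target.write_text("".join(lines))
        return f"inserted into {op['path']}"
    if cmd == "delete":
        _resolve(op["path"]).unlink()
        return f"deleted {op['path']}"
    if cmd == "rename":
        _resolve(op["old_path"]).rename(_resolve(op["new_path"]))
        return f"renamed {op['old_path']} -> {op['new_path']}"
    return f"error: unknown command {cmd}"
```

Note the path check: since Claude controls the path argument, sandboxing every operation under a single root directory is a sensible precaution.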

Consider the fitness tracking app above. Our system prompt instructions include the following (condensed into an example prompt after the list):

  • Maintain six distinct files: profile, diet, workouts, progress, goals, and notes
  • Always check memories first before responding to understand the user's current state
  • Create timestamped entries for every diet and workout log
  • Keep 30 days of diet history and 60 days of workout history
  • Track not just exercises but also energy levels and how the user felt
  • Update progress measurements while preserving historical trends
  • Set expiration dates on goals to track achievement
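
Written out as an actual prompt, those rules might look something like this (an illustrative excerpt, not the app's verbatim system prompt):

```python
# Illustrative system-prompt excerpt encoding the rules above; the
# real app's wording may differ.
FITNESS_SYSTEM_PROMPT = """\
You are a fitness coach with persistent memory under /memories.

Maintain exactly six files: profile.xml, diet.xml, workouts.xml,
progress.xml, goals.xml, notes.xml.

- Before responding, view /memories and read any relevant files.
- Timestamp every diet and workout entry (ISO 8601).
- Keep 30 days of diet history and 60 days of workout history;
  archive older entries.
- Record energy levels and how the user felt, not just exercises.
- When updating progress measurements, preserve historical trends.
- Give every goal an expiration date so achievement can be tracked.
"""
```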

Now, when a new user starts chatting, Claude uses view to check if memories exist. Finding none, it uses create to set up the six-file structure with placeholders for required information. When the user mentions "I'm 80kg and 183cm," Claude uses str_replace on /memories/profile.xml to replace the weight and height placeholders with actual values. A casual "I snacked on a large banana" triggers insert to add a timestamped entry in /memories/diet.xml. "Did legs today, squats felt heavy" becomes a structured workout log with exercises, weights, and recovery notes.
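
Under the hood, those turns translate into tool calls like these (hypothetical payloads, using the field names from the backend sketch above):

```python
# Hypothetical tool_use inputs Claude might emit for the turns above.
update_weight = {
    "command": "str_replace",
    "path": "/memories/profile.xml",
    "old_str": "<weight>unknown</weight>",
    "new_str": "<weight>80kg</weight>",
}
log_snack = {
    "command": "insert",
    "path": "/memories/diet.xml",
    "insert_line": 2,
    "insert_text": '<entry time="2025-10-14T15:30">large banana</entry>',
}
```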

You don't need to create parsers or complex logic to extract and structure this information, nor do you need to give explicit instructions on how to use each of the six tools. The model's reasoning capabilities, world knowledge, and tool-use training handle the heavy lifting. It knows that "squats felt heavy" is worth recording as a fatigue indicator, that diet entries need timestamps, and that measurements should preserve historical trends. You also don't need complex retention logic—tell Claude to keep 30 days of diet history, and on day 31, it automatically archives the oldest entry.

This combination of natural language instructions and flexible, general-purpose tools gives you extraordinary range. You can build anything from a simple key-value store to a sophisticated multi-agent system with different retention policies, access patterns, and data relationships. Want memories that expire after certain events? Memories that track confidence levels? Memories organized by project, relationship, or time? The same six operations can support them all. And you can define all of this behavior through natural language rather than code!

Anthropic's Bets

Anthropic is making a few big bets with this architecture that shape how developers should think about memory.

Memory Unification

Most memory solutions today split into three separate systems: retrieval (vector search or keyword matching decides what's relevant), storage (an async pipeline extracts and saves memories in the background), and the main conversation thread (the LLM responds to the user). Each operates independently with its own logic—the retriever guesses what might be relevant, the extractor decides what's worth remembering, and the LLM works with whatever context it gets.

Anthropic is betting on unification. With tool calls, Claude handles all three in a single conversation flow—deciding what to retrieve, when to retrieve it, what to store, and how to respond. Memory becomes a core part of Claude's reasoning process, not a separate system.

Of course, unification has trade-offs. Latency increases with each tool call, though this will improve—every major lab is racing to make tool use faster (the new Sonnet 4.5 is already rapid). Cost is the bigger challenge. A single interaction that reads, processes, and writes memory requires three LLM calls versus one in traditional systems. If your application is cost-sensitive, this might be a dealbreaker.

Files over Databases

Anthropic chose text files over databases or knowledge graphs. This is a bet on flexibility. Structured systems force you to define schemas upfront: what fields matter, what relationships exist, what queries you'll need. By using unstructured files, Anthropic gives Claude room to evolve its memory organization based on what it learns, not what you predicted it would need to remember.

For example, in a fitness app, Claude might start with six files but naturally split "workouts" into "strength_training" and "cardio" when it notices distinct patterns, or create an "injuries" file when the user mentions knee pain. With a database, you'd need migrations and new fields. With files, Claude just creates what makes sense.

When you do need structure, you have options: trust Claude to follow instructions in the system prompt (ISO dates, specific schemas, consistent field names), or implement validation that throws errors and forces a retry when Claude breaks the rules.
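
The second option is straightforward to layer over the backend sketch from earlier: validate each write, and return an error string as the tool result so Claude sees the failure and retries. The diet-file date rule below is an illustrative assumption, not part of Anthropic's tool contract.

```python
# Enforce structure by validating writes before committing them and
# returning an error so Claude corrects itself. The date rule is an
# illustrative assumption for the fitness example.
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def validated_memory_op(op: dict) -> str:
    written = op.get("file_text") or op.get("new_str") or op.get("insert_text")
    if op.get("path", "").endswith("diet.xml") and written:
        if not ISO_DATE.search(written):
            # Claude receives this as the tool result and retries with a fix.
            return "error: diet entries must include an ISO date (YYYY-MM-DD)"
    return execute_memory_op(op)
```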

No Search

Anthropic's third bet is counterintuitive: there's no search function for memories. Instead, the view tool explores directories and lists available files, and when Claude decides to read a file, it reads all of it.

This follows a broader pattern emerging across LLM apps: there's no such thing as too much context anymore. A year ago, we worried that irrelevant information would confuse models or dilute their focus. We built elaborate retrieval systems to serve only the most relevant snippets. But as models have gotten smarter, cheaper, and gained larger context windows, this calculus has flipped. More context means better results.

When Claude reads an entire file instead of search results, it might spot patterns you wouldn't think to query for. In a fitness app, Claude might notice that workout performance drops three days after poor sleep, or that certain foods correlate with energy levels. These connections would never surface without complete context from multiple files.

The trade-off, once again, is cost—reading entire files means processing more tokens. Developers need to decide whether the insight gains justify the expense.

Beyond Memory

Most people think of memory as a way to preserve user context over time—solving the stateless nature of LLMs. But Anthropic is positioning these file operations as a general-purpose tool for context management.

Some other use cases include agent communication, where one Claude instance writes findings to /analysis/market_research.md and another reads and builds on them. Or complex task management, where Claude uses files as a workspace, storing intermediate results step by step rather than holding everything in context. Most intriguingly, self-improvement—Claude can document what it learns from each interaction, building its own knowledge base over time.

If you've used Sonnet 4.5 recently, you've probably noticed its (somewhat annoying) tendency to create markdown files unprompted—saving its analysis, documenting its reasoning, preserving work for later. While it's still early days, Anthropic is clearly betting on files as extensions of working memory for agents tackling increasingly complex, long-running tasks.

Teams like Cognition (who built sophisticated memory architectures for Devin) are watching these tools closely—why maintain custom memory infrastructure when the model provider handles it natively?

Who Should Use This

Let's start with the limitations. This isn't for cost-sensitive applications—multiple LLM calls for each interaction add up quickly. It's still early, so teams with battle-tested memory infrastructure shouldn't rip it out for something less proven. You'll be locked into Claude since these tools are exclusive to Anthropic's SDK. And if your use case requires predictable, structured memory with well-defined schemas, you're probably better off with traditional approaches.

But every team building memory systems should still experiment with this tool. It offers a glimpse into how model providers are thinking about memory's future—where the model itself decides what's worth remembering and how to organize it. If you're building applications with unpredictable memory needs—coding agents, general assistants, or any system where you can't predict what users will want to remember—this flexibility is essential.

Of course, you can use this to complement your existing memory architecture—there's no requirement to go all-in.

Regardless of how developers adopt it, I think it's fascinating that Anthropic launched this tool and planted the seed of a powerful, opinionated memory stack. I'll be tracking how it evolves, along with other developments in the memory space. Follow me on X and subscribe to my newsletter below to follow along!
