Yesterday, OpenAI released study mode, which you could think of as a persona on top of ChatGPT that encourages learning.
In the release article, the authors hinted that study mode is essentially just a custom system prompt:
> Under the hood, study mode is powered by custom system instructions we’ve written in collaboration with teachers, scientists, and pedagogy experts to reflect a core set of behaviors that support deeper learning including: encouraging active participation, managing cognitive load, proactively developing metacognition and self reflection, fostering curiosity, and providing actionable and supportive feedback. These behaviors are based on longstanding research in learning science and shape how study mode responds to students.
Simon Willison immediately extracted the system prompt. This got me thinking: how would different models behave when given the same system prompt?
I added it as a custom system prompt option to my own conversational AI tool, telegram-llm, created a conversation using Claude Sonnet 4, and sure enough it seemed to produce a pretty similar experience. Here's a conversation that I had with it about the event loop in Node.js.
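telegram-llm handles the plumbing for me, but the core of the experiment is tiny: take the extracted study mode prompt and hand it to a different model as its system prompt. A minimal sketch using the Anthropic Python SDK (the prompt file name, user message, and exact model ID are my own placeholders):

```python
# A stripped-down version of the experiment: reuse the extracted study mode
# prompt as the system prompt for a different model. The prompt file name,
# user message, and model ID are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("study_mode_prompt.txt") as f:
    study_mode_prompt = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=study_mode_prompt,  # the extracted study mode instructions
    messages=[
        {"role": "user", "content": "Can you explain the event loop in Node.js?"},
    ],
)

print(response.content[0].text)
```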
I definitely need to do some more testing, both with the real ChatGPT study mode and my own concoction, but so far here's what I've noticed:
telegram-llm to make sure I don't spend too much money on tokens. I'll have more of a play in ChatGPT to see if I experience the same. If not, though, this almost feels like it reintroduces a problem with the lecture that LLMs previously solved: that the lecturer has to tailor the conversation to a more general audience.

The prompt itself is an interesting read. It's also surprisingly short. It might seem, then, that the new feature is nothing more than some pretty basic prompt engineering dressed up with marketing. I would hazard, however, that quite a bit of work went into engineering this prompt so that it works well with OpenAI's models.
The prompt certainly makes use of the caps lock:
> The user is currently STUDYING
>
> you MUST obey these rules
I know from my own prompt engineering experiments that using certain terms, e.g. "step-by-step", or emboldening text, does make a difference to behaviour. It would be really interesting to know what the different kinds of highlighting do, for example CAPS versus **bold**. And most importantly, what would happen if I said "The user is currently **STUDYING**!"? Would that be the equivalent of !important in CSS?
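One way to find out would be to send the same question alongside differently emphasised variants of that one line and compare the answers. A rough sketch against the OpenAI chat completions API (the variants and the stripped-down system prompt are placeholders, not the real study mode instructions):

```python
# A hypothetical mini-experiment: vary only the emphasis on one line of the
# system prompt and compare how the model responds. Everything here is a
# simplified placeholder, not the real study mode prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

variants = [
    "The user is currently studying.",
    "The user is currently STUDYING.",
    "The user is currently **STUDYING**!",
]

for line in variants:
    system_prompt = f"{line} Guide the user with questions rather than giving answers outright."
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Explain the event loop in Node.js."},
        ],
    )
    print(f"--- {line}")
    print(completion.choices[0].message.content[:300])
```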
If you pop the word "STUDYING" into the GPT-4o tokeniser, it uses 4 separate tokens, whereas "studying" uses just 1.
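You can check that for yourself with OpenAI's tiktoken library; a quick sketch:

```python
# Count the tokens GPT-4o's tokeniser assigns to each casing of the word.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # o200k_base

for word in ["studying", "STUDYING", "**STUDYING**"]:
    tokens = enc.encode(word)
    print(f"{word!r}: {len(tokens)} token(s) -> {tokens}")
```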
I don't know how they engineered the prompt. Evaluating how well a prompt like this performs at teaching seems on a par with evaluating how well different methods of learning perform, which is to say, pretty damn hard. Still, I'd love to know what they did, how many iterations they went through, how they tested it, and so on.