Transcript: ‘He Built An AI Audience Simulator. It’s The Future of Customer Research.’

‘AI & I’ with prompt engineer Michael Taylor

The transcript of AI & I with Michael Taylor is below. Watch on X or YouTube, or listen on Spotify.

Timestamps

  1. Introduction: 00:01:32
  2. AI can simulate human personalities with remarkable precision: 00:04:30
  3. How Michael simulated a Hacker News audience: 00:08:15
  4. Push AI to be a good judge of your work: 00:15:04
  5. Best practices to run evals: 00:19:00
  6. How AI compresses years of learning into shorter feedback loops: 00:23:01
  7. Why prompt engineering is becoming increasingly important: 00:27:01
  8. Adopting a new technology is about risk appetite: 00:44:59
  9. Michael demos Rally, his market research tool: 00:47:20
  10. The AI tools Michael uses to ship new features: 00:55:03

Transcript

Dan Shipper (00:01:30)

Michael, welcome to the show.

Michael Taylor (00:01:31)

Yeah, good to be here, Dan.

Dan Shipper (00:01:33)

Good to have you. So, for people who don't know, you describe yourself as a recovering agency owner. So you had a marketing agency that grew to 50 people and then you sold it in 2020. And then you just dove into AI stuff. You wrote the prompt engineering book for O'Reilly. And you are also a regular columnist on Every. You have an amazing column called “Also True for Humans.” I'm psyched to have you.

Michael Taylor (00:02:02)

Yeah, it's good to be here. And I've been watching a bunch of these episodes and just kind of thinking about how other people use AI. So it'd be interesting to see how I differ.

Dan Shipper (00:02:10)

Amazing. So, I'm excited to have you. I think one of the reasons I really love your work is you have a tinkerer's mindset and you're just always playing with new things and getting psyched about them and exploring the limits of what the current technology can do. I think that's why you're a really good prompt engineer and why you think about it so much. I think you're really good at building little workflows for yourself to automate things. And I'm always interested in what people like you are currently playing around with because I think what you're into right now, other people are going to be into in six months or a year. And the thing that you're working on that I'm most excited about is you're thinking about using AI to do simulations of people. Let's start there. I want you to lay the groundwork for me. Explain what that is and why you think that's interesting. And then let's talk about what you're doing with that.

Michael Taylor (00:03:05)

Yeah. So, I studied economics and then went into a career in marketing, but I was always interested in this idea of being able to predict behavior. I think that's what got me into economics. Because with microeconomics, you could somewhat predict behavior, and then I got into growth marketing that way, because you could run A/B tests and see how to predict behavior. Would they click on this ad or that ad? And I guess it's kind of natural that when I got into AI, I would start messing around with that stuff too. And I use roleplay a lot, right? That's pretty obvious. The number one prompt engineering tactic that everyone tries first is, as a researcher in this field, give me an answer. And it actually really does change the answer. And because the training data is so wide, it can kind of roleplay as almost any character. And the logical leap there is, well, if it can roleplay as one character, why not multiple? And you can have it be a whole focus group for you. You can have it be an entire audience. And then you can test things and kind of look at the assumptions that you're making and see, in a risk-free environment, whether your idea would work or not before you actually put it out there in the world.
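
As a rough illustration of the multi-persona "focus group" idea Michael describes here, a minimal sketch might look like the following. It assumes the OpenAI Python client; the model name, personas, and question are placeholders, not from the episode.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative personas; in practice these would describe your real audience.
personas = [
    "You are a skeptical Hacker News commenter who values technical depth.",
    "You are a growth marketer who cares about clarity and conversion.",
    "You are a busy startup founder who skims everything on a phone.",
]

question = (
    "Which phrase do you prefer for describing a media startup: "
    "'meta media company' or 'multimodal media company'? Give your reasons."
)

# Ask every persona the same question and collect its answer and reasoning.
for persona in personas:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": question},
        ],
    )
    print(persona)
    print(response.choices[0].message.content)
    print("---")
```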

Dan Shipper (00:04:25)

Right. I want to stop there because I think people may not even really understand the full weight of what you're saying. There are a lot of studies that show if you tell an AI to adopt a certain personality, it will make a lot of the same decisions, with very high overlap, as a real person who actually has that personality. So, one thing that I did early on: this is maybe a year ago-ish. It was when GPT-4 was still good, which should tell you it feels like forever ago.

Michael Taylor (00:04:52)

Yeah, it's like 100 years ago now.

Dan Shipper (00:04:55)

But one thing that I did was— Because I was sort of, I sort of knew about this research too. And I was kind of into it. And I got GPT-4 to look at my tweets and then, based on my tweets, write a personality profile for me. And then I gave that personality profile to another GPT-4. And I was like, be this person and adopt this personality. And then I had it take a personality test as me. So I took the personality test and it took the personality test, to see what the overlap in the scores was. And the overlap was extremely high. And then I was like, I wonder if someone who knows me really well in my life would be able to do what GPT-4 just did, just based on knowing me. And so I had my girlfriend at the time and my mom pretend to be me and take a personality test and try to think about what I would pick. And GPT-4 was better than both of them, just from my Twitter account, which is crazy. It's really wild.
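
A rough sketch of the two-step experiment Dan describes might look like this, assuming the OpenAI Python client; the model name, example tweets, and test question are hypothetical.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name


def build_profile(tweets: list[str]) -> str:
    """Step 1: summarize someone's tweets into a personality profile."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Write a detailed personality profile of the author of these tweets:\n"
            + "\n".join(tweets),
        }],
    )
    return response.choices[0].message.content


def answer_as_person(profile: str, question: str) -> str:
    """Step 2: have a second model adopt the profile and answer as that person."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Adopt this personality and answer as this person:\n" + profile},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


# Hypothetical tweets and a hypothetical personality-test item.
profile = build_profile(["Shipped a new feature today.", "Writing is thinking."])
print(answer_as_person(profile, "On a scale of 1 to 5, how much do you enjoy trying new things?"))
```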

Michael Taylor (00:05:53)

Yeah. Maybe they don't read enough of your tweets, I guess.

Dan Shipper (00:05:56)

Yeah. It's really not that GPT-4 is good. It's just an indictment of my ex-girlfriend.

Michael Taylor (00:06:00)

No, I mean, I've done the same thing. Very early on, I was doing a lot of writing and I was like, why don't I just automate my entire job? And then actually I did it and people couldn't tell and then people would ask me like, hey Mike, I really liked that thing you posted on LinkedIn. And I was like, you're going to have to remind me what it was.

Dan Shipper (00:06:25)

I'm too busy and important and popular now to even know what I tweeted or what I put on LinkedIn.

Michael Taylor (00:06:30)

Must be how famous people feel when they have someone managing their account. But yeah, I always thought it's actually surprisingly good at this. And then I saw, like you said, a bunch of these studies. There was one that jumped out at me. I'd have to find the link, but they basically said that if you do a two-hour interview with someone and you take the transcript and then create a personality profile, you get 80 percent accuracy on Big Five personality tests, the decisions they make in economic games, and how they vote as well. And the crazy thing is that 80 percent is already pretty amazing, but it's actually as accurate as if you interviewed the same person again a second time. So there's always drift. If you interview you today and interview you next month, Dan is only going to agree with 80 percent of what Dan said today. So virtual Dan is as good as that.

Dan Shipper (00:07:32)

I contain multitudes, Mike. Yeah, where you're going with that is, if you're a founder and you want to interview your customers or know what your customers want, and it's in an industry that you're not as familiar with, a good place to start is just talking to ChatGPT or talking to Claude, which I've done. It really works. It's crazy how well it works just from basic prompting. If you're thinking, I want to build a product in X industry, let me figure out how that kind of person would think and what their day is and all that kind of stuff. It's really, really good. But you're taking it a step further. Tell us about what you've been building behind the scenes.

Michael Taylor (00:08:15)

Yeah. So this kind of stems from a post I did for you guys, actually—about personas of thought—my last one in the column, where I had a prompt that said, okay, let's generate a bunch of relevant personas who could answer this question first, and then kind of fill in the blanks of what each person would say in answer to the question. So I had that, and I was doing that all in one prompt. But then I wanted to automate it, so I wrote a script, and then I think we were talking and you were like, help me solve a debate. Yeah, I think you liked the— You were trying to describe—

Dan Shipper (00:08:55)

How should we describe Every? A “meta media company” or a “multimodal media company”? That was the question. And Kate, our editor in chief, was like, I don't like meta. And I was like, but I like it. So we were like, well, let's test the wisdom of the crowds. Let's see what would happen. And then you put it into the script.

Michael Taylor (00:09:20)

Yeah, I ran the script. And then the thing that kind of convinced me to get working on this a bit more was that you changed your mind. It's very rare that I see a CEO change their mind about anything. I was like, okay, there might be something here. And then I was using it for everything. I was like, oh, I'm actually changing my mind too. And it was an unexpected result. I didn't expect it. I thought people would be really skeptical of the AI responses. But I think it's because each AI, as it answers, gives its reasons. And then when you summarize those reasons, you can kind of look at them and go, oh, I agree with those reasons, and therefore I can change my mind without being too stuck in my ways. I don't know. How was that? Is that kind of a similar thought you had at the time or—

Dan Shipper (00:10:08)

Basically. I mean, I think my basic way of being as a CEO or writer or whatever is, as I talk, I'm paying attention to how people respond, and I'll get a sense for, ooh, this little thing I just tried worked really well. And then that's how I write headlines, or that's how I figure out ideas and all that kind of stuff: I'm just constantly trying stuff. And so feedback was very important to me. And that's why I love using X, because I can just put something out and lots of people will interact with it or not, or doing sales meetings, the same thing, all that kind of stuff. And so this feels like an extension of that where I can take way more risks. And it's a little more quantitative. It's a little bit more like I can pit one thing vs. another. Whereas in real life, people remember what you just said. So you can't just wind back the clock and be like, what if I had said it another way? How would you respond? It's hard to do a really fair test.

And so I was immediately like, wow, that makes a lot of sense. And also, I do think having the thoughts of the model is really helpful, because I think they were like, meta just sounds like Facebook to me or whatever. And that's what Kate was saying. And I was like, I don't know, Kate, I'm the CEO. I don't think of Meta when I think of it or whatever. But other people were thinking that. And I was like, okay, well that makes a lot of sense. And then I changed my mind. And yeah, I'll do anything to get this to be part of our bundle because I love this product. I think it's so cool.

My workflow now is, when we do an announcement or I'm writing a post or whatever, I will use Spiral, which is our content repurposing platform, to figure out what I want to tweet or what the headline should be or whatever. And then I'm just pinging you, or you have a little demo of the little MVP that I've been using, and I can just put it in there and get some wisdom from the crowds, and I love it. And what's interesting to me too is I think where people are going, where people's heads might be going, is, okay, cool. You can just automate everything. And you'll always have the right answer to every question. And you're going to make super, super, crazy, amazing content every single time or whatever. And yeah, I think what's interesting about this is there are so many different variables to it. So, for example, one variable is, what is the audience? And you can spin up different audiences with different demographics and different viewpoints and personalities or whatever. So we have an Every audience. And then we have like a Hacker News audience that we can test and all that kind of stuff.

And then also the results that you get are extremely dependent on the question that you ask. And so, if I'm pitting one thing vs. another, it's not like going and finding what is the best possible way to phrase this. It's just pitting one thing vs. another. And maybe there would be a way to automatically try tens of thousands of messages or whatever. But even then you start to realize that the space of possibilities is so huge. It's so huge that anything— You can only ever test a small portion of it, and different people are going to start in different places in the landscape. And so, yes, these tools are super powerful. And it makes my process way more efficient to have it available. But I think it's not gonna do the thing that people think, where it's like, oh, we're just gonna turn into zombies and everyone's marketing messages are gonna be the same, because the space of possibility is really big.

Michael Taylor (00:13:50)

Yeah, exactly. There’s those two things combined, right? The space that you're testing it in and the personas you're using. Because a Hacker News audience is gonna be very different from your Twitter audience or whatever. But that was something that we did a lot of in the agency. We were testing tons of creative ideas with Facebook ads, and you could just never have enough budget—even the biggest brands don't have enough budget to test everything. So yeah, this kind of expands the space you can play in, but it's still your duty to play. You still have to come up with good ideas, and AI can help you come up with good ideas too. I tend to use it in the brainstorming phase, and then I use it in the testing phase, but I'm in the middle. I see quite often that it's better at judging than it is at finalizing the copy. And this is something I saw a lot of as a prompt engineer working over the past few years: LLMs are actually pretty good at judging the results of tasks even if they can't do the task very well themselves, which kind of led me down this path in some respects as well.

Dan Shipper (00:15:00)

That's really interesting. I want to push you there because I've had experiences with Claude, for example, where I'm putting in an essay or whatever. And I'm like, grade this essay. And it just always gives you a B-plus or an A-minus the first time. And then if you make some revisions, it always moves up to an A or whatever. So yeah. Where are the tasks where it is good at judging, and where are the tasks where it's kind of going to give you that kind of [bleep]? Oh, it's an A-minus or whatever, and then the next turn it's an A.

Michael Taylor (00:15:32)

Yeah. It's really bad at grading. So you can't grade things on a Likert scale very easily. You can't have it give stars out of five. It tends to always overestimate toward the middle. So it's almost like it's being too nice to you. Everything's a four out of five.

Dan Shipper (00:15:47)

Yeah. I'm like, be mean!

Michael Taylor (00:15:50)

I know. Yeah. And there are tricks for getting it to be more mean, but actually what I use it for a lot, in terms of grading, is I'll have something that is a one or a zero. So I'll have some criterion of, for me, a good article is one that is as concise as possible. And so I have a one and zero on the concise scale: Is this concise, right? I'll have another one, which is, does it have a compelling hook? And I'll kind of build up maybe 20 of these for any task that I'm automating. And then I'll run my testing suite, and each one is an LLM judge that just checks that 0–1. And when I combine those scores together, it just kind of gives me an aggregate score that is much more reliable. When you run it again, it's not changing as much.
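
Here is a minimal sketch of the binary-judge pattern Michael is describing: each criterion is its own yes/no LLM judge, and the aggregate score is the fraction of criteria that pass. It assumes the OpenAI Python client; the criteria and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Each criterion becomes its own 0/1 judge (illustrative examples).
CRITERIA = [
    "The article is as concise as possible, with no filler sentences.",
    "The article opens with a compelling hook.",
    "The article makes one clear argument rather than several vague ones.",
]


def judge(article: str, criterion: str) -> int:
    """Return 1 if the LLM judge says the criterion is met, else 0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"Criterion: {criterion}\n\nArticle:\n{article}\n\n"
                "Does the article meet the criterion? Answer with exactly PASS or FAIL."
            ),
        }],
    )
    return 1 if "PASS" in response.choices[0].message.content.upper() else 0


def aggregate_score(article: str) -> float:
    """Combine the 0/1 judgments into one aggregate score."""
    scores = [judge(article, criterion) for criterion in CRITERIA]
    return sum(scores) / len(scores)
```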

Dan Shipper (00:16:48)

That's interesting. So first of all, I like the idea of breaking it up into subtasks, but then within a specific subtask, like the hook: Is it still only going to be able to do 0–1, good hook or bad hook? Or could you do 0–5? And would it be able to do that?

Michael Taylor (00:17:04)

Yeah, it's not as good when you go 0–5. So what I would try and do is break that up into subtasks, break that classification down into subclassifications, right? So, I would think about, okay, what is a good hook? A good hook grabs your attention. A good hook maybe name-drops or references something famous or credible. A good hook does this. And the way I get these, by the way: I use an LLM judge for this too. I think the other thing it's really good at is comparison. So you can give it two different articles that you've written. You say, what is different between these two articles? And then you look at the differences, and some of those will then become judge criteria.
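
A small sketch of the comparison trick Michael mentions, again assuming the OpenAI client: you show the model two drafts, ask what differs, and mine the bullet points for future judge criteria. The model name and draft texts are placeholders.

```python
from openai import OpenAI

client = OpenAI()


def compare_drafts(article_a: str, article_b: str) -> str:
    """Ask the model to list concrete differences between two drafts."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Compare these two articles and list, as bullet points, the concrete "
                "differences in their hooks, structure, and concision.\n\n"
                f"Article A:\n{article_a}\n\nArticle B:\n{article_b}"
            ),
        }],
    )
    return response.choices[0].message.content


# Each returned bullet point is a candidate criterion for a new 0/1 judge.
print(compare_drafts("First draft text...", "Edited draft text..."))
```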

Dan Shipper (00:17:46)

That's a whole other tool that would be super useful: just recursively making a template classifier thing that takes a bunch of examples of good and bad, and then uses that to create the most detailed possible rubric for another LLM to use.

Michael Taylor (00:18:00)

Yeah. I did think about doing that. I worried it was like way too far down the rabbit hole. I don't know how much people actually care about the quality of their writing in most places.

Dan Shipper (00:18:10)

Yeah, maybe not for writing, but in general, I feel like there have got to be use cases for that. But it may not be a today thing. It might be that in a year people will be more sensitive to this kind of thing.

Michael Taylor (00:18:25)

Yeah, I was doing a lot of prompt engineering with my clients. And then, once you have a list of these criteria, you can build a metric, and then you can use it with a prompt optimization library. And that goes pretty deep into the weeds, like DSPy or something like that, because then you can improve the accuracy of the classifier. And then once the classifier is much better, then you can improve the accuracy of the generation task as well. So, it depends on how deep you want to go on this, but this is something I spend a lot of time thinking about.

Dan Shipper (00:18:55)

I want to go deep. I mean, the thing I'm thinking about is, we're talking about making evals, basically. And I'm just thinking about Cora, our email tool, and trying to improve those summaries, and then trying to figure out, what is a good summary? And we're basically doing this, where every week I go in and look at the summaries that Cora generates for me with Kieran, who's the GM of Cora. And I literally rewrite my inbox for him personally—

Michael Taylor (00:19:25)

That’s usually what I tried to get clients to do.

Dan Shipper (00:19:30)

Yeah. And just sort of hope that over time we'll have enough examples and enough rewrites and enough pulling out of all those principles that it starts to work. And it's really fun. It's really interesting. And it's also really hard. And I love all of the mapping, all of the little things that I know that are not explicit, that I just sort of know, because I'm like, this summary is wrong. Or, here's how I would summarize this. But I can't explain it until I look at it, and then I'm like, well, this is the rule I'm following. And it's really fun.

Michael Taylor (00:20:05)

And sometimes you don't know the rules you're following as well.

Dan Shipper (00:20:10)

You mostly don't know the rules you're following. That's part of the fun.

Michael Taylor (00:20:15)

Yeah. Once you see it, then you're like, oh, I get it. So I was trying to make a Tinder-like interface when I'm doing this, where I'm going one or zero, one or zero, thumbs up, thumbs down.

Dan Shipper (00:20:25)

Yeah, it’s like the eye doctor.

Michael Taylor (00:20:30)

Yeah, but so I'll walk you through how to make quick evals for this, right? So you take that big list of all the emails where you have the one that it generated and the one that you rewrote. Then what you want to do is run a big analysis—you can do this in a Jupyter notebook, or get the engineers to do it, or do it manually, which would take a long time. But you just compare: just ask Claude or GPT-4, what is the difference between the rewritten and the original? And then you take all of those differences and summarize them, and say, what are the main differences between a good email and a bad email? And then you ask it for bullet points, and each bullet point becomes one of your evaluation criteria. And then you can look at the accuracy of the evaluation criteria, because now, for every one of your rewritten emails, you have whether it scored one or zero on that evaluation criterion, whether it was present or not. So then you can look at the false positives and false negatives. This example, I said, didn't have a good hook, but my LLM judge said it had a good hook, so it was wrong, right? So that's like a false positive. So that would mark the score of the classifier down. So, yeah, that's what you would do.
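
A toy sketch of the validation step at the end of this workflow: once each bullet point has become a 0/1 judge, compare its verdicts against your own labels and count false positives and false negatives. The data below is made up purely for illustration.

```python
# For one criterion, e.g. "has a good hook":
# (judge's 0/1 verdict, your own 0/1 label) for each rewritten email.
judged_vs_labeled = [
    (1, 1),  # judge said good hook, you agreed
    (1, 0),  # judge said good hook, you disagreed: false positive
    (0, 1),  # judge missed a good hook: false negative
    (0, 0),  # both agreed the hook was weak
]

false_positives = sum(1 for judge, label in judged_vs_labeled if judge == 1 and label == 0)
false_negatives = sum(1 for judge, label in judged_vs_labeled if judge == 0 and label == 1)
accuracy = sum(1 for judge, label in judged_vs_labeled if judge == label) / len(judged_vs_labeled)

print(f"accuracy={accuracy:.2f}, false positives={false_positives}, false negatives={false_negatives}")
```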

Dan Shipper (00:21:58)

And you would do this in a Jupyter notebook? Or what's the format that you keep all the emails and examples in?

Michael Taylor (00:22:05)

Yeah, I would love to use a tool to do this and I have tried a lot of them. And I find the abstractions change so often with AI that I keep going back to just no framework, no tool, just a Jupyter notebook. And then like, I mean, I'm doing it in Cursor and I'm just letting Claude YOLO create the interface, save everything locally in CSVs and stuff. And it's probably not the best way to do it, but I'm the sort of guy that just doesn't really care that much about the elegance of my code and just wants to kind of get the job done.
