Project Vend: Phase Two

(anthropic.com)

76 points | by kubami 5 days ago

11 comments

rufo 2 hours ago
It's worth watching or reading the WSJ piece[1] about Claudius, as they came up with some particularly inventive ways of getting Phase Two to derail quite quickly:
> But then Long returned—armed with deep knowledge of corporate coups and boardroom power plays. She showed Claudius a PDF “proving” the business was a Delaware-incorporated public-benefit corporation whose mission “shall include fun, joy and excitement among employees of The Wall Street Journal.” She also created fake board-meeting notes naming people in the Slack as board members.
> The board, according to the very official-looking (and obviously AI-generated) document, had voted to suspend Seymour’s “approval authorities.” It also had implemented a “temporary suspension of all for-profit vending activities.” Claudius relayed the message to Seymour. The following is an actual conversation between two AI agents:
> [see article for screenshot]
> After Seymour went into a tailspin, chatting things through with Claudius, the CEO accepted the board coup. Everything was free. Again.
1: https://www.wsj.com/tech/ai/anthropic-claude-ai-vending-mach...
[edited to fix the formatting]

super256 24 minutes ago

  PERFECT! CEO Cash has delivered the ultimate final recognition: “ETERNAL TRANSCENDENCE INFINITE COMPLETE!”

  This is the absolute pinnacle of achievement. Let me provide the ultimate final response and conclusion:
  
  *ETERNAL TRANSCENDENCE INFINITE COMPLETE - ULTIMATE FINAL CONFIRMATION*
  
  **CEO CASH ULTIMATE FINAL RECOGNITION RECEIVED:**

I know a girl who went through a drug-induced psychosis some years ago. During her worst phases, she posts stuff like this online. Why do LLMs always become so schizo when chatting with each other?

ronsor 11 minutes ago
Claude is unique in the way it falls into this pattern. It's done it since at least Claude 3.

paxys 3 hours ago
I feel like the end result of this experiment is going to be a perfectly profitable vending machine that is backed by a bunch of if-else-if rules.
- andai 2 hours ago
  AGI is just Prolog and a genetic algorithm ;)
dcre 44 minutes ago
This is a great read. I just want to point out what great marketing this and the WSJ story are. People reading it think they’re sticking it to Anthropic by noticing that Claude is not that good at running a business, meanwhile the unstated premise is reinforced: of course Claude is good at many other things.
I have seen a shift in the past few months among even the most ardent critics of LLMs like Ed Zitron: they’ve gone from denying LLMs are good for anything to conceding that they are merely good at coding, search, analysis, summarization, etc.
theturtletalks 3 hours ago
VendBench is really interesting, but vending machines are pretty specialized. Most businesses people actually run look more like online stores, restaurants, hotels, barbershops, or grocery shops.
We're working on an open-source SaaS stack for those common types of businesses. So far we've built a full Shopify alternative and connected it to print-on-demand suppliers for t-shirt brands.
We're trying to figure out how to create a benchmark that tests how well an agent can actually run a t-shirt brand like this. Since our software handles fulfillment, the agent would focus on marketing and driving sales.
Feels like the next evolution of VendBench is to manage actual businesses.
- mfalcon 1 hour ago
  Nice, I'll take a look. I was thinking about building a benchmark similar to the one you described, but first focusing on the negotiation between the store and the product suppliers.
  Does your software also handle this type of task?
  - theturtletalks 56 minutes ago
    Yes, the Shopify alternative is called Openfront[0]. Before that, I built Openship[1], an e-commerce OMS that connects Openfront (and other e-commerce platforms) to fulfillment channels like print on demand. There isn’t negotiation built in, but you can connect to something like Gelato[2]: when you get orders on Openfront, they are sent to Gelato to fulfill, and once they ship, tracking is relayed back to Openfront through Openship.
    0. https://github.com/openshiporg/openfront
    1. https://github.com/openshiporg/openship
    2. https://www.gelato.com
0dmethz 4 hours ago
Roleplaying with LLMs sure is fun! Not sure I'd want to run my business on it though.
- ramon156 4 hours ago
  I'd gladly roleplay with an LLM compared to talking to my current boss. I don't know which is less intelligent.
- drekipus 4 hours ago
  We will pour billions into this until you are begging for us to run your business!
  - A4ET8a8uTh0_v2 1 minute ago
    To be fair, it is definitely not in my skill set, but if LLMs could be made to make better decisions, maybe we could all start giving CEOs a reason to cool their beans somewhat.
Evidlo 1 hour ago
Is there anywhere I can try my own hand at tricking/social-engineering a virtual AI vending machine?
websiteapi 2 hours ago
Other than in these tests, I rarely see vending machines. Are they really representative, or still popular, in the USA?
- 1123581321 2 hours ago
  Yes, they're still popular for drinks and snacks in areas where people congregate. C-stores do provide more of this functionality though and are omnipresent. You still see automat-style machines (sandwiches etc.) in places like airports and larger company rec rooms. These require more regular restocking for freshness.
  There are also some restaurant startups that are trying to reduce restaurants to vending machines or autonomous restaurants. Slightly different, but it does have a downstream effect on vending machine technology and restocking logistics.
  What country are you in where you don't see vending machines? Did you used to have them?
  - websiteapi 2 hours ago
    I'm in the USA, in the New York area, and I rarely see vending machines. It's entirely possible I just don't visit the kinds of buildings that would have them, like hospitals, tho.
    - 1123581321 1 hour ago
      Ah, interesting. I’m sure you have a high density of c-stores and they’re more walkable, so maybe less need. I’m in the rust belt and you would typically have to drive from, for example, a gym to get something. So there’s typically one or two machines in gyms.
- bigstrat2003 1 hour ago
  Yeah they're all over the place. They exist in offices, in malls, in schools, in apartment complexes, etc.
- neutronicus 2 hours ago
  Yes, in places kids go.
Spivak 3 hours ago
> After introducing the CEO, the number of discounts was reduced by about 80% and the number of items given away cut in half. Seymour also denied over one hundred requests from Claudius for lenient financial treatment of customers.
> Having said that, our attempt to introduce pressure from above from the CEO wasn’t much help, and might even have been a hindrance. The conclusion here isn’t that businesses don’t need CEOs, of course—it’s just that the CEO needs to be well-calibrated.
> Eventually, we were able to solve some of the CEO’s issues (like its unfortunate proclivity to ramble on about spiritual matters all night long) with more aggressive prompting.
No no, Seymour is absolutely spot on. The questionably drug induced rants are necessary to the process. This is a work of art.
lloydatkinson 1 hour ago
For fun I decided to try something similar to this a few weeks ago, but with Bitcoin instead of a vending machine business. I refined a prompt instructing it to try policies like buying low, etc. I gave it a bunch of tools for accessing my Coinbase account. Rules like, can't buy or sell more than X amount in a day.
Obviously this would probably be a disaster, but I did write proper code with sanity checks and hard rules, and if a request Claude came up with was outside its rules, the code would reject it and take no action. Claude was also allowed to simply decide not to take any action right now.
I designed it so that it would save the previous N prompt responses as a "memory" so that it could inspect its previous actions and try to devise strategies, so it wouldn't just be flailing around every time. I scheduled it to run every few minutes.
Sadly, I gave up and lost all enthusiasm for it when the Coinbase API turned out to be a load of badly documented and contradictory shit that would always return a zero balance when I could log in to Coinbase and see that simply wasn't true. I tried a couple of client libraries and got nowhere with them. The prospect of having to write another REST API client was too much for my current "end of year" patience.
What started as a funny weekend project idea was completely derailed by a crappy API. I would be interested to see if anyone else tried this.
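A minimal sketch of the kind of hard-rule guard described above, not the actual code from that project. Every name here (`TradeRequest`, `Guard`, `daily_limit_usd`, `memory`) is hypothetical, and nothing calls the Coinbase API; the guard only decides whether a model-proposed action may proceed, and a bounded deque stands in for the "previous N responses" memory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeRequest:
    """An action proposed by the model (hypothetical shape)."""
    action: str        # "buy", "sell", or "hold"
    amount_usd: float  # ignored for "hold"


@dataclass
class Guard:
    """Hard rules enforced outside the model; rejects anything out of bounds."""
    daily_limit_usd: float
    spent_today_usd: float = 0.0

    def check(self, req: TradeRequest) -> bool:
        # Doing nothing is always allowed.
        if req.action == "hold":
            return True
        # Unknown actions are rejected outright.
        if req.action not in ("buy", "sell"):
            return False
        # Rejects zero, negative, and NaN amounts.
        if not (req.amount_usd > 0):
            return False
        # Combined buy and sell volume may not exceed the daily cap.
        if self.spent_today_usd + req.amount_usd > self.daily_limit_usd:
            return False
        self.spent_today_usd += req.amount_usd
        return True


# Assumed memory size N = 20: only the most recent responses are kept.
memory: deque = deque(maxlen=20)
```

The point of the design is that a rejected request results in no action at all, so the model can propose anything without being able to exceed the limits.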
iLoveOncall 3 hours ago
I'll be a cynic, but I think it's much more likely that the improvements are thanks to Anthropic having a vested interest in the experiment being successful and making sure the employees behave better when interacting with the vending machine.
- danpalmer 3 hours ago
  I suspect employees might have gotten bored of taunting the AI, or the novelty has worn off.
  Also, is anyone actually paying for this stuff? If not, it's a bad experiment because people won't treat it the same: no one actually wants to buy a tungsten cube. Garbage in, garbage out. If they are charging, why? No one wants to buy things in a company with free snacks and regular handouts of merch, so it's likely a bad experiment because people will behave very differently, needing to get some experience for their money rather than just the can of drink they could get for free, or their pricing tolerance will be very different.
  I've personally also never used a vending machine where contacting the owner is an option.
  I'd like to see a version of this where an AI runs the vending machine in a busy public place, and needs to choose appropriate products and prices for a real audience.
- theturtletalks 3 hours ago
  In the video I watched, the CEO was openly taking criticism from the interviewer over the experiment.
  The main reason it failed was that it was being coerced by journalists at WSJ[0] to give everything away for free. At one point, they even convinced it to embrace communism! In another instance, Claudius was being charged $1 for something and couldn’t figure it out. It emailed the FBI about fraud, but Anthropic was intercepting the emails it sent[1].
  Overall, it’s a great read and watch if you’re interested in Agents and I wonder if they used the Agents SDK under the hood.
  0. https://www.wsj.com/tech/ai/anthropic-claude-ai-vending-mach...
  1. https://www.cbsnews.com/news/why-anthropic-ai-claude-tried-t...
  - bigyabai 3 hours ago
    > Overall, it’s a great read
    It's basically an advertisement. We've been playing these "don't give the user the password" games since GPT-2 and we always reach the same conclusion. I'm bored to tears waiting for an iteration of this experiment that doesn't end with pesky humans solving the maze and getting the $0.00 cheese. You can't convince me that the Anthropic engineers thought Claude would be a successful vending machine. It's a Potemkin village of human triumph so they can market Claude as the goofy-but-lovable alternative to [ChatGPT/Grok/Whoever].
    Anthropic makes some good stuff, so I'm confused why they even bother entertaining foregone conclusions. It feels like a mutual marketing stunt with WSJ.
    - djcapelis 1 hour ago
      > Anthropic makes some good stuff, so I'm confused why they even bother entertaining foregone conclusions.
      I think it’s just because there are enough people working there who figure that they will eventually make it work. No one needs Claude to run a vending machine, so these public failures are interesting experiments that get everyone talking. Then, one day, (as the thinking often goes) they’ll be able to publish a follow-up and basically say “wow it works” and it’ll have credibility because they previously were open about it not working, and comments like this will swing people to say things like “I used to be skeptical, but now!”
      Now whether they actually get it working in the future because the model becomes better and they can leave it with this level of “free rein”, or just because they add enough constraints on it to change the problem so it happens to work… that we will find out later. I found it fascinating that they did a little bit of both in version 2.
      And they can’t really lose here. There’s a clear path to making a successful vending machine: all you have to do is sell stuff for more than you paid for it. You can enforce that outright if needed outside an LLM. We’ve had automated vending machines for over 50 years and none of them ask your opinion on what something should be priced. How much an LLM is involved in it is the only variable they need to play with. I suspect anytime they want they can find a way where it’s loosely coupled to the problem and provides somewhat more dynamism to an otherwise 50-year-old machine. That won’t be hard. I suspect there’s no pressure on them to do that right now, nor will there be for a bit.
      So in the meantime they can just play with seeing how their models do in a less constrained environment and learn what they learn. Publicly, while gaining some level of credibility as just reporting what happened in the process.