Computer use in Gemini 3.5 Flash

(blog.google)

87 points | by swolpers 2 hours ago

9 comments

revolvingthrow 49 minutes ago
People using google’s models: am I holding it wrong or are the guardrails really overtuned?
I had the dubious pleasure of testing gemini of late and I kept running into refusals. How do I transfer a SIM number from one provider to another? No. What should I consider to make backups on NTFS less prone to data loss and more bitrot resistant? No. Evaluate this piece of code? No.
I’m not sure if it’s cold feet from the mythos situation or what, but it reminds me of the dark days where you couldn’t use ai for much of anything. But then I go to chatgpt 5.5 and it does mostly everything I want outside of the usual cybersecurity boogeyman that you run into now and then.
- Chu4eeno 6 minutes ago
  I've always found all versions of gemini to be (for lack of a better word) lazy.
  I guess it's economical wrt. token use, but it often either refuses for absurd safety reasons or does other weird stuff, like responding that an LLM like itself isn't a suitable tool for the job, and it very quickly gives up.
  Claude is on the other end of the spectrum, which makes it more noticeable when switching between them.
- k8sToGo 23 minutes ago
  The context window size is also very small if you use Gemini in the app. It starts to forget quite fast. In my opinion Gemini in the app is useless, in addition to the guardrails problem.
- nout 30 minutes ago
  I just asked gemini the SIM number question and it gave me a full step-by-step guide.
- kordlessagain 37 minutes ago
  I love antigravity. I’ve had zero issues with it.
airstrike 2 hours ago
Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.
I guess if you're trying to get people to tokenmaxx it may look like a valid strategy, but ain't no way this will be delightful to users.
I think it's a symptom of just not understanding how LLMs should interface with the OS because we're still in their early days.
Eventually there'll be an iPhone moment for the ergonomics of LLM usage outside of coding
- gdudeman 55 minutes ago
  Computer use is a great idea. It gets the job done when nothing else will.
  If you're a person trying to get their job done at a big company, but half your job is in 1-2 proprietary tools or is stuck behind an API you can't program against, computer use can allow you, a non-techie, to do your job more efficiently.
  I think it's an awesome way to circumvent gate keepers and the IT department to let people accomplish their goals.
  - uejfiweun 30 minutes ago
    Yeah, it's not that computer use is the most theoretically optimal paradigm, but there's a reasonable case that, given the constraints of modern software systems and how they're built, it's the most realistically optimal paradigm.
- thorum 1 hour ago
  The “correct”, elegant way for AI to interact with existing software would take decades and billions of dollars to build. Someone would have to do the hard work of building new APIs, solving decades of accessibility issues, etc.
  Or you can show an AI screenshots and ask it where to click.
  - sarreph 1 hour ago
    I disagree if your application is networked. Most SaaS is built on RESTful APIs that can be converted trivially into interfaces / contracts for tool use.
    - chatmasta 39 minutes ago
      So you can either wait for every application to do that, or at least make it possible for an LLM to do it… or you can make the LLM use a computer interface that works with every application by definition.
  - jubilanti 20 minutes ago
    it takes decades and billions of dollars to develop APIs?
- orbital-decay 43 minutes ago
  Spreadsheet is such a terrible idea. It may look like a valid tool, but ain't no way it's delightful to users. Most of the time people need a database instead. Eventually there'll be an iPhone moment for this.
  Meanwhile, the entire world economy:
- api 1 hour ago
  It's great for testing and QA automation for UIs. It's also possibly good for the vision impaired.
  - orbital-decay 24 minutes ago
    UI QA only works well if your model plausibly matches the average user behavior and/or real-world edge cases. These models are far from that, and they are much less random than you'd like them to be for fuzzing (mode collapse).
- nzach 1 hour ago
  > Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.
  And yet having an agent able to use a computer on your behalf is really useful.
  Recently I gave a NixOS VM to my hermes agent and it has been a good experience. I don't really care if he destroys the machine, since I can just roll back to an earlier version, and for any meaningful data he creates for me I make sure he creates a repo, commits, and pushes to my private Gitea instance.
  - dbbk 46 minutes ago
    > And yet having an agent able to use a computer on your behalf is really useful.
    I honestly cannot think of a single use case
  - airstrike 1 hour ago
    > And yet having an agent able to use a computer on your behalf is really useful.
    It is, but there's no need for it to be viewing your screen, browsing websites and watching ads.
    That stuff is for humans, not for LLMs.
    - nzach 52 minutes ago
      Sure, I don't want an agent watching MY screen. That's why I gave him his own environment, and pretty quickly he discovered that you can open Chrome and make it render to a framebuffer; this way he is able to 'view' the website. And apparently with this he is able to bypass a lot of 'anti-bot' measures.
satvikpendem 2 hours ago
There's still no MCP support in the Gemini app, which is very useful to get various pieces of info as a user just via chatting. For example I recently wanted to get an Airbnb and wanted to filter by specific criteria including house image analysis and Gemini couldn't do it so I had to do it in Codex.
- anticorporate 2 hours ago
  Yeah, it seems like this is the biggest missing feature from the Gemini ecosystem.
  If I can't connect MCP, there's really no selling point for me to use Gemini from my watch, car, smart speaker, etc. If I'm already bound to using my own front end, then I'm only evaluating Gemini as a model/API, at which point it has many competitors that may be cheaper or better fit for the task.
  - thejaycampbell 2 hours ago
    agreed... this is where they lost me too
- mitchell_h 58 minutes ago
  I'm fairly convinced Claude's strongest point is the app. AI users aren't anywhere near as mature or smart as youtube/hn would have folks believe. The claude app is amazing for bridging that gap.
  - dr_dshiv 21 minutes ago
    Didn’t it take them like 2 days to build the first one?
  - dr_dshiv 21 minutes ago
    Didn’t it take them like 2 days to build it?
- solarkraft 27 minutes ago
  They only pretty recently fixed the bug where stopping the model mid-generation lost the entire session.
  The Gemini apps suck.
- tonyrice 2 hours ago
  This is why I don't always use the official Gemini Web app. Lately I've found that it's more useful to utilize a CLI. I'm looking forward to the day they add MCP in the web.
  - pregseahorses 1 hour ago
    Gemini CLI now requires an antigravity subscription..
  - singingtoday 1 hour ago
    CLI doesn't work with my subscription..
mlmonkey 2 hours ago
It's funny how in their own graph, https://storage.googleapis.com/gweb-uniblog-publish-prod/ima... Gemini 3.5 Flash is beaten hands down by both Opus 4.8 and GPT 5.5, and yet the graph is drawn as if Gemini wins ... :-D
- mroche 1 hour ago
  The graph has Gemini 3.5 Flash matching Sonnet 4.6, losing to Opus 4.8, and slightly behind GPT-5.5 by 0.3 points... That's not that much of a hands-down loss for Gemini for this specific workload benchmark.
  The methodology used:
  https://deepmind.google/models/evals-methodology/gemini-3-5-...
  Methodology: All Gemini scores are pass @1 except where otherwise noted. "Single attempt" settings allow no majority voting or parallel test-time compute. All of the results are all run with the Gemini API for the model-id gemini-3.5-flash with default sampling settings unless indicated otherwise below. To reduce variance, we average over multiple trials for smaller benchmarks.
  All the results for non-Gemini models are sourced from providers' self reported numbers unless otherwise mentioned below. For Claude Opus 4.7, Sonnet 4.6, and GPT-5.5 we default to reporting maximum thinking/reasoning settings available, but when reported results are not available we use best available reasoning results.
- sheept 2 hours ago
  It highlights the Gemini models blue since that's what the article is about. The bar heights seem consistent with the values.
- gb2d_hn 1 hour ago
  It's honest - people who know what they are looking at will take speed and token costs into account. I don't use Gemini 3.5 for coding, but I use it as something in between a search engine and agent.
- data-ottawa 1 hour ago
  I think 3.5 flash is trying to target agentic work, like Google Search or ADK (agent development kit) use cases.
  It’s something cheap enough you’d put out in front of your customers, and Opus is expensive enough you wouldn’t.
fridder 58 minutes ago
I wonder if it will be better at building TUIs. It has been absolutely abysmal at interacting with them and building them.
- chatmasta 41 minutes ago
  Claude can build UI but it sucks at testing it and iterating on it. Fable showed some improvements in this regard but alas.
knollimar 1 hour ago
Where is 3.5 pro?
beastman82 2 hours ago
No UI like their competitors Claude CoWork or Codex. This is vaporware
villgax 1 hour ago
Will it skip Ads lol
- humblyCrazy 1 hour ago
  I looked at their demo and it does not
  - chatmasta 39 minutes ago
    Better question might be will it skip recaptcha?
zuzululu 1 hour ago
performance is quite impressive given that it's 3x cheaper than 5.5