6 comments

  • cjonas 48 minutes ago
    We just create mini data "ponds" on the fly by copying tenant-isolated gold-tier data to parquet in S3. The user/agent queries are executed with DuckDB. We run this process when the user starts a session and generate an STS token scoped to their tenant bucket path. It's extremely simple and works well (at least at our data volumes).
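    A minimal sketch of the token-scoping step (bucket name, path layout, and tenant id below are invented for illustration, not the setup described above): build an inline session policy that pins temporary credentials to one tenant's prefix.

```python
import json

# Hypothetical helper: an inline session policy that restricts temporary
# credentials to a single tenant's prefix in a shared bucket. The bucket
# name and "gold/<tenant>/" layout are assumptions.
def tenant_session_policy(bucket: str, tenant_id: str) -> str:
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {   # read only this tenant's parquet files
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/gold/{tenant_id}/*",
            },
            {   # list only within this tenant's prefix, so the token
                # cannot enumerate other tenants' keys
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {
                    "StringLike": {"s3:prefix": f"gold/{tenant_id}/*"}
                },
            },
        ],
    })

policy = tenant_session_policy("gold-ponds", "tenant-42")
```

    The JSON string would be passed as the `Policy` parameter of STS `AssumeRole` (e.g. via boto3), and the returned credentials handed to the per-session DuckDB connection that reads the tenant's parquet files.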
    • Waterluvian 37 minutes ago
      Is that why it’s called DuckDb? Because data ponds?
      • QuantumNomad_ 3 minutes ago
        The DuckDB website has the following to say about the name:

        > Why call it DuckDB?

        > Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck's song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system.

        https://duckdb.org/faq#why-call-it-duckdb

      • cjonas 14 minutes ago
        Idk but I named everything in the related code "duckpond" :)
      • mritchie712 9 minutes ago
        Hannes (one of the creators) had a pet duck
    • boundlessdreamz 30 minutes ago
      How do you copy all the relevant data? Doesn't this create unnecessary load on your source DB?
      • cjonas 0 minutes ago
        We have various data sources (which is another benefit of this approach). Data from the application DB is currently pulled using the FE APIs, which handle tenant isolation and allow the application database to deal with the load. I think pg_duckdb could be a good solution here as well, but I haven't gotten around to testing it. Other data comes from an analytics DB. Most of this is landed on an interval via pipeline scripts.
  • zie 2 hours ago
    We do the same thing, every employee can access our main financial/back office SQL database, but we just use PostgreSQL with row level security[0]. We never bothered to complicate it like the post does.

    0: https://www.postgresql.org/docs/18/ddl-rowsecurity.html
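
    As a rough sketch (the table and setting names below are invented, not the actual schema), a Postgres RLS setup is just a couple of DDL statements plus a per-connection setting:

```python
# Hypothetical RLS setup for a multi-tenant table. "invoices" and the
# "app.current_tenant" setting are made-up names. The statements are
# plain Postgres DDL; run them with any client (psql, psycopg, ...).
RLS_SETUP = """
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
-- apply the policy to the table owner too, not just other roles
ALTER TABLE invoices FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON invoices
    USING (tenant_id = current_setting('app.current_tenant'));
"""

# Each connection then declares who it is acting for before querying;
# rows from other tenants simply never come back.
PER_SESSION = "SET app.current_tenant = 'tenant-42';"
```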

    • orf 1 hour ago
      Back-office employee access is a completely different problem from what is described in the post.

      How do you enforce tenant isolation with that method, or prevent unbounded table reads?

      • tossandthrow 1 hour ago
        They likely don't need tenant isolation, and unbounded table reads can be mitigated using timeouts.

        We do something similar for our backoffice - just with the difference that it is Claude that has full freedom to write queries.
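
        A sketch of the timeout mitigation (the role name is invented): Postgres lets you pin a statement timeout to the role the agent connects as, so a runaway scan gets cancelled instead of running unbounded.

```python
# Hypothetical guardrails for a role that runs ad-hoc or LLM-written
# queries. "adhoc_reader" is a made-up role name; the settings are
# standard Postgres parameters, applied once by an admin.
GUARDRAILS = [
    "ALTER ROLE adhoc_reader SET statement_timeout = '5s';",
    "ALTER ROLE adhoc_reader SET idle_in_transaction_session_timeout = '30s';",
]
```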

      • weird-eye-issue 32 minutes ago
        RLS...
    • staticassertion 33 minutes ago
      I'd be so uncomfortable with this. It sounds like you're placing the full burden of access on a single boundary. I mean, maybe there's more to it that you haven't spoken about here, but "everything rests on this one postgres feature" is an unacceptably unsafe state to me.
      • skeeter2020 8 minutes ago
        Row-level security isn't a feature specific to Postgres; it's a pretty standard, accepted way to control access in a multitenant or multi-context environment, and pretty much every data provider supports/implements it. When it comes to answering a single specific question (like the one RLS targets), I believe you DO want a single, simple answer, vs. something like "it uses these n independent things working in conjunction..."
        • staticassertion 4 minutes ago
          Right, RLS is great. What they are saying is this:

          > every employee can access our main financial/back office SQL database

          This means that there is no access gate other than RLS, which includes financial data. That is a lot of pressure on one control.

      • Philip-J-Fry 12 minutes ago
        Conceptually that's no different from any other security measure that prevents you from accessing data you're not supposed to. At the end of the day, with colocated data, you're trusting that some permission feature somewhere is preventing you from accessing data you're not supposed to.

        We trust that Amazon or Google or Microsoft are successful in protecting customer data for example. We trust that when you log into your bank account the money you see is yours, and when you deposit it we trust that the money goes into your account. But it's all just mostly logical separation.

        • staticassertion 6 minutes ago
          > At the end of the day with all data that is colocated you're trusting that some permission feature somewhere is preventing you from accessing data you're not supposed to.

          Right but ideally more than one.

          > But it's all just mostly logical separation.

          Yes, ideally multiple layers of this. You don't all share one RDS instance and then get row level security.

          • Philip-J-Fry 2 minutes ago
            Can you give an example of more than one layer of logical separation at the data layer?

            We all know that authentication should have multiple factors. But that's a different problem. Fundamentally, at the point you're reading or writing data you're asking the question "does X have permission to read/write Y?"

            I don't see what you're getting at.

      • weird-eye-issue 29 minutes ago
        It's not like RLS is just some random feature they are misusing. It's specifically for security and is absolutely reliable. Maybe you should do a bit more research before making comments like this.
        • staticassertion 15 minutes ago
          Of course it's designed for security... that has nothing to do with my statement. No single boundary is "absolutely reliable", that's my entire point.
          • weird-eye-issue 2 minutes ago
            This is the real world; not everybody can get a perfectly isolated database instance. Also, you do realize that isn't necessarily any more secure than RLS, right? Something still has to control which database the code connects to. That could have a flaw just as much as your RLS setup could.
  • senorrib 2 hours ago
    Reasons 1-3 could very well be done with ClickHouse policies (RLS) and good data warehouse design. In fact, that's more secure than a compiler adding a WHERE to a query run by an almighty user.

    Reason 4 is probably an improvement, but could probably be done with CH functions.

    The problem with custom DSLs like this is that they trade off a massive ecosystem for very little benefit.

    • efromvt 1 hour ago
      As long as you don't deviate too much from ANSI, I think the "light SQL DSL" approach has a lot of pros when you control the UX. (So UIs, in particular, are fantastic for this approach - which is what they seem to be targeting with queries and dashboards.) It's more of a product experience; tables are a terrible product surface to manage.

      Agreed with the ecosystem cons getting much heavier as you move outside the product surface area.

  • jelder 1 hour ago
    We did this with MotherDuck, and without introducing a new language. Every tenant has their own isolated storage and compute, so it’s trivial to grant internal users access to specific tenants as needed. DuckDB’s SQL dialect is mostly just Postgres’ with some nice ergonomic additions and a host of extra functionality.
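    The isolation model can be sketched with a stdlib stand-in (sqlite3 here in place of DuckDB/MotherDuck; file layout and tenant names are invented): each tenant is a separate database, so isolation comes from which database you open, not from a filter in the query.

```python
import os
import sqlite3
import tempfile

# Sketch of the database-per-tenant pattern: one database file per
# tenant, so granting an internal user access to a tenant is just
# granting access to that tenant's database.
def tenant_db(root: str, tenant_id: str) -> sqlite3.Connection:
    return sqlite3.connect(os.path.join(root, f"tenant_{tenant_id}.db"))

root = tempfile.mkdtemp()
a = tenant_db(root, "acme")
b = tenant_db(root, "globex")

a.execute("CREATE TABLE events (msg TEXT)")
a.execute("INSERT INTO events VALUES ('acme only')")
a.commit()

b.execute("CREATE TABLE events (msg TEXT)")
# globex's database has its own empty events table; acme's row is
# unreachable from here by construction.
rows = b.execute("SELECT count(*) FROM events").fetchone()[0]
```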
    • raw_anon_1111 1 hour ago
      This is explicitly not the problem they are trying to solve. With a single-tenant database you don't, by definition, have to worry about multi-tenant isolation.
      • DangitBobby 58 minutes ago
        I guess the question then becomes: what problem does a multi-tenancy setup solve that an isolated database setup doesn't? Are they really not solving the same problem from a user's perspective, or is it only from their own engineering perspective? And how do those decisions ultimately impact the product they can surface to users?
        • raw_anon_1111 36 minutes ago
          Off the top of my head, managing 100 different database instances takes a lot more work from the business standpoint than managing 1 database with 100 users.

          The article also mentioned that they isolate by project_id. That implies one customer (assume a business) can isolate permissions more granularly.

        • steveBK123 16 minutes ago
          With the multi-tenant vs. multi-database decision, one driver would be the level of legal/compliance/risk/cost/resource pressure around how segregated users really need to be.

          Multi-database is generally more expensive, but it's a more brain-dead, guaranteed way to ensure users are properly segregated, and it's resilient across cloud/database/etc. software releases that might regress something in a multi-tenant setup.

          Multi-tenant, you always run the risk of a software update, misconfiguration, or operational error exposing the existence of other users / their metadata / their data / their usage / etc. You also have a lot more of a challenge engineering for resource contention.

  • elnatro 56 minutes ago
    New to ClickHouse here. Would you say this kind of database has a niche when compared to usual RDBMSs like MySQL and PostgreSQL?
  • baalimago 48 minutes ago
    The evolution of this is to use agents, and have users "chat with the data"
    • mattaitken 40 minutes ago
      Yes, you can actually do this already because we expose a REST API and TypeScript SDK functions to execute the queries.