Enabling Codex to Analyze Two Decades of Hacker News Data

(modolap.com)

43 points | by ronfriedhaber 4 hours ago

12 comments

zeroxfe 2 hours ago
I've done this kind of thing many times with codex and sqlite, and it works very well. It's one prompt that looks something like this:
- inspect and understand the downloaded data in directory /path/..., then come up with an sqlite data model for doing detailed analytics and ingest everything into an sqlite db in data.sqlite, and document the model in model.md.
Then you can query the database ad hoc pretty easily with codex prompts (and also generate PDF graphs as needed).
I typically use the highest reasoning level for the initial prompt, and as I get deeper into the data, continuously improve on the model, indexes, etc., and just have codex handle any data migration.
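For illustration, the kind of model and ad hoc query this workflow tends to produce might look like the sketch below. The table name, column names, and index are hypothetical assumptions, not the schema from any real run; an in-memory database stands in for data.sqlite.

```python
import sqlite3

# Hypothetical schema: the table, columns, and index are illustrative
# assumptions, not taken from the comment or from any real HN dump.
SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    id      INTEGER PRIMARY KEY,
    type    TEXT NOT NULL,      -- 'story' or 'comment'
    author  TEXT,
    time    INTEGER NOT NULL,   -- Unix timestamp in seconds (UTC)
    parent  INTEGER,
    text    TEXT
);
CREATE INDEX IF NOT EXISTS idx_items_type_time ON items (type, time);
"""


def build_db(rows, path=":memory:"):
    """Create the schema and ingest an iterable of dict rows."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO items (id, type, author, time, parent, text) "
            "VALUES (:id, :type, :author, :time, :parent, :text)",
            rows,
        )
    return conn


def avg_comment_length_by_month(conn):
    """Return (YYYY-MM, average comment length in characters) per month."""
    query = """
        SELECT strftime('%Y-%m', time, 'unixepoch') AS month,
               AVG(LENGTH(text)) AS avg_len
        FROM items
        WHERE type = 'comment' AND text IS NOT NULL
        GROUP BY month
        ORDER BY month
    """
    return conn.execute(query).fetchall()
```

Passing a file path such as "data.sqlite" instead of the default keeps the database on disk, so later prompts can reuse it.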
Brajeshwar 31 minutes ago
The “Hacker News - Complete Archive” on Hugging Face,[1] recently popped up here. “The data is stored as monthly Parquet files sorted by item ID, making it straightforward to query with DuckDB, load with the datasets library, or process with any tool that reads Parquet.”
Out of curiosity, I tinkered with it using Claude to see trends and patterns (I did find a few embarrassing things about me!).
1. https://huggingface.co/datasets/open-index/hacker-news
sd9 14 minutes ago
I'm kind of surprised that postgres was quite that dominated by mongodb back in the day. I remember the mongo fever, but I always thought postgres held reasonable market share. I guess it was other SQL dbs back then; MySQL was still viable.
mike_hearn 3 hours ago
I don't quite understand how Modolap differs from just asking AI to use any other OLAP engine? Both your website and the github readme just emphasise that it's idiosyncratic and your personal approach, without explaining what that is or why anyone should care.
- ronfriedhaber 3 hours ago
  Appreciate the feedback. I shall certainly revamp the README; it is rather stale.
  > "how Modolap differs from just asking AI to use any other OLAP engine"
  There presently exist two components, the OLAP query engine and the remote infrastructure service. The service enables systems like Codex (or developers as well) to manage datasets, maintain version control over queries, and offload the computational burden to dedicated machines. This is especially beneficial given the current trend of running agents inside micro-VMs.
  In addition, it is designed with AI usage in mind. There is significant value in co-design. One could argue that models can use Polars or DuckDB just as well, and that there is no room for improvement, but I do not think this is true.
  - bastawhiz 2 hours ago
    What room for improvement is there?
  - esafak 1 hour ago
    I don't get the value proposition either; your landing page is underdeveloped. Tracking the query history is trivial. Offloading computation could be done with Polars Cloud or MotherDuck. Can you expand on the "manage datasets" part?
hakrgrl 23 minutes ago
That last chart of average comment length shows a clear downtrend, especially in recent months. I wonder why that is.
- hakrgrl 23 minutes ago
I noticed some topics and comments that were usually in violation of HN guidelines are no longer flagged, and discourse decays into reddit-like jabs and echo chambers. Only a small percentage, but still, more than the previous 0% I was accustomed to.
  Would be interesting to see how many comments violate the guidelines over time. https://news.ycombinator.com/newsguidelines.html
voidUpdate 1 hour ago
When searching for references to Go, what does it actually look for? "Go" is a relatively common word, and I hardly see anyone referring to it as Golang.
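The thread does not say how the article matched "Go". To make the ambiguity concrete, here is a minimal sketch of one possible heuristic: count "golang" outright, and count a capitalized "Go" only when a programming-related word appears in the same comment. The context word list and the function name are hypothetical, hand-picked assumptions.

```python
import re

# Assumption: an illustrative, hand-picked list of programming-related words.
CONTEXT_WORDS = (
    r"(?:lang|language|programming|compiler|goroutines?|gofmt|"
    r"modules?|rust|python|java)"
)

GOLANG_RE = re.compile(r"\bgolang\b", re.IGNORECASE)
GO_WORD_RE = re.compile(r"\bGo\b")  # case-sensitive: skips lowercase "go"
CONTEXT_RE = re.compile(r"\b" + CONTEXT_WORDS + r"\b", re.IGNORECASE)


def mentions_go_language(text):
    """Heuristically decide whether a comment refers to the Go language."""
    if GOLANG_RE.search(text):
        return True
    # A bare "Go" only counts if the comment also contains a context word.
    return bool(GO_WORD_RE.search(text) and CONTEXT_RE.search(text))
```

A rule like this still misfires: "Go ahead and use Python" counts as a mention, which is the kind of false positive the question is getting at.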
xnorswap 1 hour ago
5% of all comments mention Claude Code?
Am I reading that right?
- epaga 1 hour ago
  Well now it's 5.00001%.
moralestapia 54 minutes ago
Do not estimate/plot DAUs/MAUs, it's not a pretty picture :'(.
- hakrgrl 14 minutes ago
  Why do you say that?
throwaway290 2 hours ago
HN data is open? Under what conditions is it distributed?
- bastawhiz 2 hours ago
  There's an API link at the bottom of every page.