lacoolj · 2025-12-24T17:34:06 1766597646

Normally I'd say "it's never too late!" but clearly this would diverge and require an entirely new project, maintaining two bases for the same thing, etc.

Good to see you alive and kicking. Happy holidays

lacoolj · 2025-12-24T16:22:59 1766593379

Probably because some apps aren't NSFW apps but have it (Reddit)

lacoolj · 2025-12-18T19:01:02 1766084462

lol I love how OpenAI just straight up doesn't compare their model to others on these release pages. Basically telling us they know Gemini and Opus are better but they don't want to draw attention to it

qwesr123 · 2025-12-18T19:06:51 1766084811

Not sure why they don't compare with others, but they are actually leading on the benchmarks they published. See here (bottom) for a chart comparing to other models: https://marginlab.ai/blog/swe-bench-deep-dive/

mistercheph · 2025-12-18T19:34:27 1766086467

It's like apple, they just don't want users or anyone to even be thinking of their competitors, the competition doesn't exist, it's not relevant.

whimsicalism · 2025-12-18T20:02:35 1766088155

is swe-bench saturated? or they switch to swe-bench pro because...?

Mkengin · 2025-12-18T22:45:22 1766097922

At least on swe-rebench it does pretty well: https://swe-rebench.com/

dbbk · 2025-12-18T20:14:02 1766088842

This was the one thing I scanned for. No comparison against Opus. See ya.

Mkengin · 2025-12-18T22:49:10 1766098150

Though this Codex version isn't on the leaderboard, GPT-5.2-Medium already seems to be a bit better than Opus 4.5: https://swe-rebench.com/

gizmodo59 · 2025-12-18T23:36:50 1766101010

Is that your website or something? You keep promoting it

Mkengin · 2025-12-21T13:48:02 1766324882

No, I am not affiliated with the website, I just want to see more discussions based on uncontaminated benchmarks and feel that people rely too much on benchmarks that companies can conduct themselves. If that is the case, I don't feel I can trust them. For general LLM capabilities, for example, I would also tend to rely on dubesor [1] rather than artificial analysis or similar leaderboards.

[1] https://dubesor.de/benchtable

lacoolj · 2025-12-17T00:30:48 1765931448

This is a bigger problem than just Instacart. The prices are much harder to revert after they've been established (as we've seen since 2020).

lacoolj · 2025-12-12T18:33:26 1765564406

I'm reading these articles in Addy's voice. It's quite wonderful

lacoolj · 2025-12-11T22:58:20 1765493900

This is a whole bunch of patting themselves on the back.

Let me know when Gemini 3 Pro and Opus 4.5 are compared against it.

lacoolj · 2025-12-10T18:13:28 1765390408

How did you run a 123B model locally? Or did you do this on a GPU host somewhere? If so, what spec was it?

simonw · 2025-12-10T22:04:50 1765404290

I haven't run the 123B one locally yet. I used Mistral's own API models for this.

lacoolj · 2025-12-10T18:12:24 1765390344

Wonder why Gemini 3 Pro and Sonnet 4.5 are on this comparison but Opus 4.5 is not?

lacoolj · 2025-12-10T18:07:50 1765390070

Based on the prompt you wrote, this wouldn't be a "Hallucination"

And as I write this critique of your HN title, I see you have edited it since I last refreshed. I'm guessing a few others have already echoed this sentiment a few times.

dang · 2025-12-10T18:08:42 1765390122

Indeed so: https://news.ycombinator.com/item?id=46216933, https://news.ycombinator.com/item?id=46213179

lacoolj · 2025-12-10T18:00:09 1765389609

I never knew "Parisian" was how to refer to "one from Paris"

Learn something new every day.

To be clear, I just didn't think anyone would refer to someone from Paris specifically (rather than "French").

I mean, a lot of places you would add "-ite" but I'm guessing that would be a less-than-ideal suffix for this particular city lol

Freedom2 · 2025-12-10T19:30:30 1765395030

As someone who knows a lot of New Yorkers and Texans, it's definitely curious that people would refer to themselves as from a city versus from the country itself.