Should control down-weight negative net-sabotage-value threats?

Fabien Roger 16 Jan 2026 4:18 UTC

24 points

0 comments 10 min read LW link

Total utilitarianism is fine

Abhimanyu Pallavi Sudhir 16 Jan 2026 0:32 UTC

2 points

1 comment 3 min read LW link

Test your interpretability techniques by de-censoring Chinese models

Khoi Tran, aryaj, Senthooran Rajamanoharan and Neel Nanda

15 Jan 2026 16:33 UTC

50 points

4 comments 20 min read LW link

Reflections on TA-ing Harvard’s first AI safety course

Roy Rinberg 15 Jan 2026 16:28 UTC

60 points

2 comments 9 min read LW link

I Made a Judgment Calibration Game for Beginners (Calibrate)

Luise Woehlke 15 Jan 2026 15:04 UTC

12 points

1 comment 1 min read LW link

Corrigibility Scales To Value Alignment

PeterMcCluskey 15 Jan 2026 0:05 UTC

7 points

5 comments 5 min read LW link

(bayesianinvestor.com)

Deeper Reviews for the top 15 (of the 2024 Review)

Raemon 14 Jan 2026 23:59 UTC

42 points

0 comments 5 min read LW link

If we get primary cruxes right, secondary cruxes will be solved automatically

Jordan Arel 14 Jan 2026 22:44 UTC

1 point

1 comment 4 min read LW link

Boltzmann Tulpas

Mariven 14 Jan 2026 21:45 UTC

20 points

2 comments 13 min read LW link

(mariven.substack.com)

Status In A Tribe Of One

J Bostock 14 Jan 2026 20:44 UTC

7 points

1 comment 2 min read LW link

Quantifying Love and Hatred

RobinHa 14 Jan 2026 20:40 UTC

8 points

7 comments 1 min read LW link

Why we are excited about confession!

Boaz Barak, Gabriel Wu and Manas Joglekar

14 Jan 2026 20:37 UTC

82 points

11 comments 9 min read LW link

(alignment.openai.com)

Why Motivated Reasoning?

johnswentworth 14 Jan 2026 19:55 UTC

60 points

12 comments 5 min read LW link

The Many Ways of Knowing

Gordon Seidoh Worley 14 Jan 2026 17:00 UTC

14 points

1 comment 5 min read LW link

(www.uncertainupdates.com)

GD Roundup #4 - inference, monopolies, and AI Jesus

Raymond Douglas 14 Jan 2026 15:43 UTC

32 points

0 comments 6 min read LW link

Backyard cat fight shows Schelling points preexist language

jchan 14 Jan 2026 14:10 UTC

114 points

13 comments 3 min read LW link

Parameters Are Like Pixels

omegastick 14 Jan 2026 13:45 UTC

13 points

6 comments 2 min read LW link

(dumbideas.xyz)

The Evolution of Agentic AI Evaluation

Dinkar Juyal 14 Jan 2026 6:35 UTC

−6 points

0 comments 11 min read LW link

If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively

Jordan Arel 14 Jan 2026 6:25 UTC

5 points

3 comments 2 min read LW link

How Much of AI Labs’ Research Is Safety?

Lennart Finke 14 Jan 2026 1:40 UTC

14 points

6 comments 3 min read LW link