🧿🪬🍄🌈🎮💻🚲🥓🎃💀🏴🛻🇺🇸<p>What are the results of the '<a href="https://mastodon.social/tags/AccountingBench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AccountingBench</span></a>' <a href="https://mastodon.social/tags/benchmark" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>benchmark</span></a>, which tests <a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> models on monthly <a href="https://mastodon.social/tags/accounting" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>accounting</span></a> tasks?</p><p>> <a href="https://mastodon.social/tags/Gemini" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Gemini</span></a> 2.5 Pro, <a href="https://mastodon.social/tags/chatGPT" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>chatGPT</span></a> o3, and o4-mini were unable to close the books for a single month, giving up midway. <a href="https://mastodon.social/tags/Claude" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Claude</span></a> 4 and <a href="https://mastodon.social/tags/Grok" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Grok</span></a> 4 maintained over 95% accuracy for the first few months, but Grok's score dropped sharply in the fifth month, and Claude 4's score also declined gradually, eventually falling below 85%.</p><p><a href="https://gigazine.net/gsc_news/en/20250724-accountingbench/" rel="nofollow noopener" translate="no" target="_blank">https://gigazine.net/gsc_news/en/20250724-accountingbench/</a></p><p><a href="https://mastodon.social/tags/llm" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>llm</span></a></p>