#MINERVA


MINERVA: Evaluating Complex Video Reasoning

Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand
arxiv.org/abs/2505.00681 arxiv.org/pdf/2505.00681 arxiv.org/html/2505.00681

arXiv:2505.00681v1 Announce Type: new
Abstract: Multimodal LLMs are turning their focus to video benchmarks, however most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.


🟡 Minerva & Grammatica 16th Century Roman Greek Goddess

#16thCentury, #Arts, #Expression, #Goddess, #Gods, #Grammar, #Grammatica, #Greek, #Inspiration, #Justice, #Language, #MarinBonnemere, #Minerva, #Roman, #Wisdom

Vintage ◦ Classic ◦ Historical | Art ◦ Design ◦ Inspiration | Restored ◦ Enhanced ◦ Remixed

Prints, T-Shirts, Stickers, & More by @rocketshipretro via RedBubble → bigplanetprints.com/go/OB1R2v

🪷 March 19 is the feast of #Athena #Minerva as goddess of weaving and weavers. Kiss an LGBT tapestry weaver artist today. #Ovid wrote: "Minerva teaches us how to weave on the upright loom warp with a shuttle. She tightens the loose threads with a comb. Worship her, you who create wonders in wool in bright colours. Cherish her, you who carve and sculpt in stone, or you who paint brightly coloured pictures. Minerva is the goddess of a thousand works. Surely she is the goddess of poetry as well." 🪷

Freshly blogged: On March 9, 2025, we will present the new listening station at the #Bezirksmuseum #Neubau, which transports visitors into the soundscape of the postwar era. In keeping with this year's theme, "Wien 1945–1955", I converted a historic #Minerva #Radio to tell the story of the Minerva company and of this particular radio model in a four-part radio series, almost like on Radio Rot-Weiß-Rot in the 1950s. Come along, everyone :-)
weblog.co.at/minerva-super-bab

EGM Weblog - Archive Edition · Minerva Super Baby Hörstation: An original 1950s Minerva radio as a modernized listening station at the Bezirksmuseum Neubau, from March 9, 2025. Discover it now!

@erebion @inaruck I vehemently disagree, because it is naive to place all responsibility in the hands of a single provider.

Nothing can replace teaching #ITsec, #InfoSec, #OpSec & #ComSec, and everyone who naively believes that @signalapp / #Signal will save their ass will probably end up looking just as dumbfounded as the victims of #MINERVA / #RUBIKON aka #CryptoLeaks.

Replied in thread

@dalias @lauren
@pixelschubsi

Also, the blatant dismissal of absolutely basic #OpSec & #ComSec is just flabbergasting.

Only #decentralized, #OpenSource & #OpenStandards can actually survive long-term and remain #secure.

It's the same reason we use #PGP/MIME & #SSH and not #X400 & #X25!

IOW: Think "How can you weaponize Signal?" and see what you can do just by holding key people in contempt...

The less #info a provider has, the less they can be forced to snitch upon customers.

"#JustUseSignal!" is a form of dangerous "#TechPopulism" aimed at bamboozling #TechIlliterates who don't know better, abusing information asymmetry to pull rank instead of investing the time and effort to explain "how" and "why" this is indeed a good or bad idea.

The only ones that have a chance to beat that are @delta / #deltaChat but that's just #PGP/MIME #eMail in a nice UI...

  • You may now laugh at me and think my "#TinfoilHat sits too tight", but I'm sure that sooner or later I'll be proven correct...
Hachyderm.io · Cassandrich (@dalias@hachyderm.io): @kkarhan@infosec.space @signalapp@mastodon.world @monocles@monocles.social @lauren@mastodon.laurenweinstein.org Very few systems promoted as Signal alternatives match the cryptographic privacy properties (see: ratcheting, etc.) of Signal. The claims about "located in the USA" and "Cloud Act" are all nonsense because the only threat to Signal users from this is availability (seizure and shutdown of the server infrastructure), not undetected breakage of privacy properties. There are presently no systems with superior privacy properties to Signal *and* level of functionality on par with what general public expects. There are a lot (like the XMPP stuff, *sigh*, and Matrix) that are worse in both regards. If you're happy with reduced functionality, Cwtch (and possibly some other similar Tor-based systems) or VeilidChat are stronger, but it's gonna be a while before you convince normies to use them, and in the mean time they're still going to be on insecure shit like WhatsApp, FB Messenger, Telegram, etc...
Replied in thread

@sylv_a personally, I'd recommend #XMPP+#OMEMO (and #PGP/MIME - encrypted #eMail) for real #E2EE with #SelfCustody of Keys as well as actual #decentralization.

Cuz no one's gonna risk jail time for (non-paying!) users, if at all…

In fact, I'd call U.S. MIL/INTEL "criminally incompetent" if they didn't manage to plant multiple people inside @signalapp / #Signal or any other single-vendor / single-provider messenger.

Personally, solutions like Signal & #Threema have a stench like #CryptoAG / #MINERVA / #Rubikon and #ANØM / #OperationIronside / #OperationTrøjanShield.

By contrast: #OpenStandards like XMPP+OMEMO & PGP/MIME are independently verifiable and not dependent on a single individual/organization for maintenance/survival/implementation/development.

Personally I'd still recommend @monocles / #monocles with #monoclesChat & #gajim...

Twitter · thaddeus e. grugq on Twitter: “I’m gonna tell you a secret about “logless VPNs” - they don’t exist. Noone is going to risk jail for your $5/mo https://t.co/Q2aOQJkG4g”