Literature Review: PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
This work addresses a persistent limitation in current Large Multimodal Models (LMMs): while they can recognize a “car” or a “plane,” they struggle significantly with part-level understanding (e.g., distinguishing a “fuselage” from a “cockpit”). The authors introduce PARTONOMY, a benchmark dataset explicitly designed for “Explanatory Part Segmentation,” which requires models not only to identify parts but also to reason about their relationships (intersection, difference, part-to-whole). They also propose PLUM (Part-Level Understanding LMM), an architecture that uses text span tagging in place of special segmentation tokens and introduces a visual feedback loop to condition future segmentations on past predictions.
Key Insights
- Critique of Special Segmentation Tokens: The paper criticizes the use of special segmentation tokens that were absent from the LLM’s pretraining vocabulary, arguing that they cause a distribution shift which degrades the model’s general reasoning capabilities. PLUM circumvents this with a Span Extractor, a bidirectional self-attention module that performs BIO (Beginning, Inside, Outside) tagging directly on the text embeddings (see the sketch after this list). This keeps the LLM operating within its pre-trained manifold, preserving its reasoning while enabling pixel-level grounding.
- Visual Feedback Loops via FiLM: Most segmenting LMMs operate in a “fire-and-forget” mode, discarding generated masks once produced. PLUM instead implements a recursive mechanism in which previous mask predictions are encoded and injected back into the SAM (Segment Anything Model) decoder through Feature-wise Linear Modulation (FiLM) layers (a sketch follows the example walkthrough below). This lets the model maintain spatial consistency and context for tasks like whole-to-part reasoning, where the sum of the parts must conceptually align with the object label.
- The Explanatory Part Segmentation Task: The authors argue that simple segmentation is insufficient for high-level reasoning and propose a hierarchy of tasks:
- Part Identification: Grounding specific named parts.
- Part Comparison: Logical operations on parts (intersection and difference) between two objects (e.g., “What parts does this banana boat have that a fishing boat does not?”); a toy illustration follows this list.
- Part-Whole Reasoning: Deducing the object identity from its constituent parts and vice versa.
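To make the span-tagging idea concrete, here is a minimal sketch of a BIO tagging head over LLM token embeddings. The class name, layer count, and dimensions are illustrative assumptions, not the paper’s exact Span Extractor:

```python
import torch
import torch.nn as nn

class SpanExtractor(nn.Module):
    """Hypothetical sketch: tags each LLM token embedding as B(egin),
    I(nside), or O(utside) of a segmentable part span."""

    def __init__(self, hidden_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        # Bidirectional self-attention over the generated token embeddings,
        # unlike the causal attention used inside the LLM itself.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden_dim, 3)  # logits for B, I, O

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden_dim)
        contextualized = self.encoder(token_embeddings)
        return self.classifier(contextualized)  # (batch, seq_len, 3)

# Usage: tag a generated sentence, then group consecutive B/I tokens into
# part spans whose embeddings are handed to the mask decoder.
extractor = SpanExtractor()
hidden_states = torch.randn(1, 16, 4096)      # stand-in for LLM outputs
tags = extractor(hidden_states).argmax(-1)    # 0=B, 1=I, 2=O per token
```

Because no new vocabulary items are introduced, the LLM’s output distribution is untouched; the tagging head only reads the hidden states.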
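The comparison tasks reduce naturally to set operations over part labels. A toy illustration (the part names here are hypothetical, not drawn from the benchmark’s taxonomy):

```python
# Hypothetical part inventories for two objects; names are illustrative.
banana_boat_parts = {"hull", "inflatable_pontoon", "tow_ring"}
fishing_boat_parts = {"hull", "rod_holder", "outboard_motor"}

# Difference: parts the banana boat has that the fishing boat lacks.
difference = banana_boat_parts - fishing_boat_parts
print(difference)    # {'inflatable_pontoon', 'tow_ring'}

# Intersection: parts both objects share.
intersection = banana_boat_parts & fishing_boat_parts
print(intersection)  # {'hull'}
```

The hard part for an LMM is, of course, producing the grounded part inventories that feed this algebra, not the algebra itself.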
Example
Consider an image of an agricultural airplane. In a standard LMM interaction, a user might ask, “What is this?” and the model replies “A plane.” In the PARTONOMY framework, the interaction is much deeper. The user asks, “What visible parts does this agricultural airplane have?”
- Span Extraction: The LLM generates the text: “The agricultural airplane has **fixed landing gear**, a **propulsion component**, and a **spraying rig**.” The Span Extractor tags the bolded spans as distinct part entities.
- Segmentation: The PLUM module takes the embeddings for “fixed landing gear,” projects them, and the modified SAM decoder generates the mask.
- Feedback: When generating the mask for the next part (“propulsion component”), the model is conditioned on the location of the landing gear, helping it disambiguate spatially and semantically related components.
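To make the feedback step concrete, here is a minimal sketch of a FiLM layer conditioning decoder features on a previously predicted mask. The mask-history encoder, module names, and shapes are my illustrative assumptions, not PLUM’s actual implementation:

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scale and shift each channel of a
    feature map using a conditioning vector derived from prior masks."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # A single linear layer predicts both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (batch, C, H, W); cond: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma[:, :, None, None] * feats + beta[:, :, None, None]

# Hypothetical encoder that summarizes an earlier mask into a vector.
mask_history_encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
)

film = FiLMLayer(cond_dim=64, num_channels=256)
prev_mask = torch.rand(1, 1, 64, 64)         # landing-gear mask from step 1
decoder_feats = torch.randn(1, 256, 64, 64)  # stand-in for SAM decoder features
conditioned = film(decoder_feats, mask_history_encoder(prev_mask))
# `conditioned` now carries information about where the landing gear was,
# biasing the decoder when it predicts the propulsion component's mask.
```

FiLM is a light-touch conditioning choice: it rescales the decoder’s features without restructuring its architecture, which is consistent with modifying, rather than replacing, a pre-trained SAM decoder.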
Ratings
Novelty: 3.5/5. The shift from special segmentation tokens to span tagging is well motivated, but it largely adapts established NLP fundamentals (sequence tagging) to the LMM setting.
Clarity: 4/5. The paper does an excellent job of isolating exactly why current models fail (distribution shift and lack of prediction history), providing a clear motivation for the proposed solution.
Personal Perspective
While this paper presents a solid dataset and a well-justified methodology, one has to wonder whether we are over-engineering the “capability” side of the Large Language Model. My main question concerns the integration of segmentation: is this truly a reasoning capability being learned by the LLM, or is the LLM acting as a sophisticated router for a frozen, pre-trained SAM decoder? The paper essentially trains the LLM to output text that aligns with what a segmentation decoder can process. Beyond that, there is a valid debate over whether segmentation should be an intrinsic, parameter-level capability of an LMM or simply an external tool call. LLMs are, fundamentally, statistical models predicting the next token; by forcing them to manage pixel-level distributions through internal embeddings, we may be asking them to do too much heavy lifting in a modality they are not optimized for.
Furthermore, generalization to out-of-distribution (OOD) data remains an open question in my opinion. The model performs well on the PARTONOMY dataset, but is it learning the abstract skill of decomposing objects into parts, or is it memorizing the topology of the airplanes and cars present in the training data? If presented with a completely alien object, would the feedback loop and span extractor actually help it decompose the object, or would it fail because it lacks semantic priors for those specific shapes? The reliance on supervised fine-tuning suggests the latter, which would make OOD generalization a significant hurdle.