What visual patterns transfer between infographic and UI tasks when trained jointly?
This explores what a model learns in common when it is trained on infographics (charts, posters, diagrams) and UI screens together, and which visual skills carry over between the two. The clearest answer in the corpus is ScreenAI ("Can one model understand both UIs and infographics equally well?"), which treats both under a single schema and pretrains on a screen-annotation task: identify what each element is and where it sits on the screen. That works because infographics and UIs pose the same underlying problem: both are dense 2D layouts where meaning lives in the spatial arrangement of labeled regions (a button next to text, a legend next to a bar). Learning to parse that layout once transfers to both, which is how a relatively small 5B-parameter model reaches state-of-the-art results on multiple benchmarks. The shared signal is *spatial-semantic structure*, not surface appearance.
The transferable unit, then, is roughly "this is a typed region at this location". The corpus keeps confirming that this layout signal is the load-bearing one by showing what breaks when it's missing. OmniParser ("Why do vision-only GUI agents struggle with screen interpretation?") finds that GPT-4V fails on raw screenshots when forced to identify element meaning *and* predict an action at once; pre-parsing screens into structured semantic elements rescues it. DocLLM ("Can bounding boxes replace image encoders for document understanding?") pushes this further, showing that bounding-box coordinates plus disentangled attention capture text-spatial alignment well enough to skip the image encoder entirely. The common thread across infographics, documents, and UIs: the reusable representation is the geometry of labeled boxes, and models do better when that structure is handed to them rather than inferred end-to-end.
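To make the "typed region at this location" idea concrete, here is a minimal Python sketch of one schema that could describe both a UI screen and an infographic, plus a text serialization a model could be handed instead of raw pixels. The names `Region`, `normalize`, and `serialize`, the element-type strings, and the 0..999 coordinate binning are all illustrative assumptions; this is not the actual annotation format of ScreenAI, OmniParser, or DocLLM.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Region:
    """One labeled box. The same record fits a UI widget or a chart part."""
    kind: str   # hypothetical type label, e.g. "BUTTON", "TEXT", "LEGEND", "BAR"
    text: str   # visible label or OCR text; may be empty
    box: Box    # coordinates normalized to the range 0..999


def normalize(box_px: Box, width: int, height: int, bins: int = 1000) -> Box:
    """Map pixel coordinates to resolution-independent integer bins."""
    def q(value: int, size: int) -> int:
        # Clamp so a coordinate on the far edge still lands in the last bin.
        return min(bins - 1, int(value * bins / size))

    x0, y0, x1, y1 = box_px
    return (q(x0, width), q(y0, height), q(x1, width), q(y1, height))


def serialize(regions: Iterable[Region]) -> str:
    """Render regions as one line each, in top-to-bottom, left-to-right order."""
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    return "\n".join(
        f"{r.kind} {r.text!r} {r.box[0]} {r.box[1]} {r.box[2]} {r.box[3]}"
        for r in ordered
    )


# A phone screen (1080x1920) and a chart (800x600) share the same schema.
ui_screen = [
    Region("BUTTON", "Sign in", normalize((40, 1700, 1040, 1820), 1080, 1920)),
    Region("TEXT", "Welcome back", normalize((40, 200, 700, 280), 1080, 1920)),
]
infographic = [
    Region("LEGEND", "2023 revenue", normalize((600, 40, 780, 80), 800, 600)),
    Region("BAR", "Q1", normalize((100, 300, 160, 560), 800, 600)),
]
```

Because coordinates are normalized, a phone screenshot and a poster end up in the same coordinate vocabulary, which is the property that lets a single parser cover both domains.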
The interesting catch is what *doesn't* transfer the way you'd hope. The benefit of joint training is bounded by what was actually in pretraining: across 34 models, multimodal zero-shot performance tracks how often a concept appeared in the pretraining data rather than reflecting genuine generalization, and linear gains require exponentially more data ("Does multimodal zero-shot performance actually generalize or interpolate?"). So "transfer between infographic and UI" is likely strongest for the layout primitives both domains share heavily and weakest for rare element types that either domain sees infrequently. The shared schema helps most where the visual vocabulary overlaps.
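The frequency dependence described above has a simple shape: if linear gains need exponentially more data, performance is roughly linear in the logarithm of how often a concept was seen. The Python sketch below illustrates that shape only. The function names and the coefficients `intercept` and `gain_per_decade` are made-up values for illustration, not numbers fitted in the cited study.

```python
import math


def predicted_accuracy(concept_count: float,
                       intercept: float = 0.10,
                       gain_per_decade: float = 0.12) -> float:
    """Log-linear trend: each 10x increase in concept frequency adds a
    constant increment. Coefficients are hypothetical, and the value is not
    clamped, so it is only meaningful where such a fit would hold."""
    return intercept + gain_per_decade * math.log10(concept_count)


def count_needed(target_accuracy: float,
                 intercept: float = 0.10,
                 gain_per_decade: float = 0.12) -> float:
    """Invert the trend: how many occurrences reach a target accuracy."""
    return 10 ** ((target_accuracy - intercept) / gain_per_decade)


# Going from 100 to 1,000 occurrences buys the same gain as 1,000 to 10,000.
gain_low = predicted_accuracy(1_000) - predicted_accuracy(100)
gain_high = predicted_accuracy(10_000) - predicted_accuracy(1_000)
```

Under this shape, an element type that appears ten times less often than a common one sits a fixed step lower in expected accuracy, regardless of which domain it came from. That is why rare element types are the weak spot for transfer.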
There's also a perception-vs-reasoning split worth knowing. Longer verbal rationales help reasoning but *hurt* these fine-grained visual tasks: the real bottleneck is where the model looks, not how much it explains ("Does verbose chain-of-thought actually help multimodal perception tasks?"). That reframes joint training: what transfers between infographics and UIs is a *perceptual* skill (attend to the right region), and Agent S ("Can structured interfaces help language models control GUIs better?") gets traction precisely by separating that grounding step from planning rather than fusing them.
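The separation of grounding from planning can be sketched as two functions with a narrow interface between them: the planner only picks an element ID from a pre-parsed list, and the grounder only turns that ID into coordinates. This is a minimal Python illustration under stated assumptions, not Agent S's implementation. `Element`, `ground`, `act`, and `keyword_planner` are hypothetical names, and the keyword-overlap planner is a stand-in for a language model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Element:
    element_id: int
    kind: str
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


Planner = Callable[[str, List[Element]], int]


def ground(element: Element) -> Tuple[int, int]:
    """Grounding step: turn a chosen element into a click point (its center)."""
    x0, y0, x1, y1 = element.box
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def act(goal: str, elements: List[Element], planner: Planner) -> Tuple[int, int]:
    """Planning picks *which* element; grounding decides *where* to click."""
    by_id = {e.element_id: e for e in elements}
    chosen_id = planner(goal, elements)
    if chosen_id not in by_id:
        raise ValueError(f"planner returned unknown element id {chosen_id}")
    return ground(by_id[chosen_id])


def keyword_planner(goal: str, elements: List[Element]) -> int:
    """Stand-in planner: pick the element whose text overlaps the goal most."""
    goal_words = set(goal.lower().split())

    def overlap(e: Element) -> int:
        return len(goal_words & set(e.text.lower().split()))

    return max(elements, key=overlap).element_id


screen = [
    Element(0, "TEXT", "Welcome back", (40, 100, 500, 160)),
    Element(1, "BUTTON", "Sign in", (40, 1700, 1040, 1820)),
]
click_point = act("sign in to the account", screen, keyword_planner)
```

The planner never emits pixel coordinates, so a planning mistake and a localization mistake fail in different places and can be fixed independently.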
So the less obvious takeaway: the transfer here isn't "the model learns charts and reuses them on buttons." Both infographics and UIs reduce to the same low-level task, which is to locate each region on the screen and assign it a type, and much of what improves either domain in these notes comes from making that shared spatial-parsing step explicit instead of asking one network to do it all at once.
Sources (6 notes)
ScreenAI unifies UIs and infographics under one schema, using screen-annotation pretraining to identify UI element types and locations. These annotations auto-generate QA and navigation data, enabling a 5B-parameter model to achieve state-of-the-art performance on multiple benchmarks.
OmniParser demonstrates that GPT-4V fails when forced to simultaneously identify icon meanings and predict actions from raw screenshots. Pre-parsing screenshots into structured semantic elements with descriptions lets the model focus solely on action prediction, removing the composite-task bottleneck.
DocLLM shows that bounding-box spatial information combined with decomposed transformer attention can capture text-spatial alignment in documents without pixel-based visual encoding. Pretraining on text-infilling objectives suited to irregular layouts achieves this at substantially lower computational cost than multimodal LLMs using image encoders.
Across 34 models and 5 datasets, multimodal models require exponentially more pretraining data for linear performance gains on downstream tasks. Performance correlates with how often test concepts appeared during pretraining, not genuine generalization ability.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement in success rate over the baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.