🤔 We identify several limitations in coordinate-generation-based methods (i.e., methods that output screen positions as text tokens x=..., y=...) for GUI grounding, including ...
Abstract: The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual ...