As enterprises around the world double down on their AI projects, access to high-quality training data has become a major obstacle. With the public web largely exhausted as a data source, major players such as OpenAI and Google are securing exclusive partnerships to expand their proprietary datasets, further restricting access for others.
To address the growing issue, Salesforce has taken a big step in the field of visual training data. The company has just introduced ProVision, a novel framework that programmatically generates visual instruction data. These datasets are synthesized systematically to enable the training of high-performance multimodal language models (MLMs) that can answer questions about images.
The company has already released the ProVision-10M dataset built with this approach and is using it to boost the performance and accuracy of various multimodal AI models.
For data professionals, this framework represents a major breakthrough. By programmatically generating high-quality visual instruction data, ProVision reduces the reliance on limited or inconsistent datasets, a common challenge in training multimodal systems.
In addition, the ability to synthesize datasets systematically ensures better control, scalability and consistency, enabling faster iteration cycles and reducing the cost of domain-specific data acquisition. The work contributes to ongoing research in the field of synthetic data generation and comes just a day after Nvidia launched Cosmos, a suite of purpose-built world foundation models for generating physics-based videos from a variety of inputs, such as text, images and video, for physical AI training.
Visual instruction data: a key ingredient for multimodal AI
Today, instruction datasets are central to AI pre-training and fine-tuning. These specialized datasets help models follow and respond effectively to specific instructions or queries. In the case of multimodal AI, models gain the ability to analyze content such as images after learning from a large number of different image data points, paired with question-answer pairs, or visual instruction data, that describe them.
Now, here's the thing: producing these visual instruction datasets is really hard. If an enterprise manually creates the data for each training image, it ends up spending a lot of time and human resources to complete the project. If it instead chooses to use proprietary language models for the task, it has to deal with high computational costs and the risk of hallucinations, where the quality and accuracy of the question-answer pairs may not be good enough.
Furthermore, relying on proprietary models is also a black-box approach, as it makes it difficult to understand and control the data generation process or to customize outputs in detail.
Enter Salesforce ProVision
To address these gaps, Salesforce's AI research team has created ProVision, a framework that employs scene graphs in conjunction with human-written programs to systematically synthesize vision-centric instruction data.
At its core, a scene graph can be described as a structured representation of image semantics, where the objects in the content are represented as nodes. The attributes of each object, such as color or size, are assigned directly to their respective nodes, while the relationships between these objects are depicted as directed edges connecting the corresponding nodes. These representations can be sourced from manually annotated datasets such as Visual Genome, or they can be generated with the help of a scene graph generation pipeline that combines several state-of-the-art vision models covering different aspects of image semantics, from object and attribute detection to depth estimation.
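To make the idea concrete, here is a minimal sketch of what such a graph might look like in code. The schema and field names below are illustrative assumptions, not ProVision's actual format, which follows Visual Genome-style annotations.

```python
# Illustrative scene graph for one image; the schema is hypothetical.
scene_graph = {
    # Objects are nodes; attributes attach directly to their node.
    "objects": {
        "obj1": {"name": "car", "attributes": ["red", "parked"]},
        "obj2": {"name": "pedestrian", "attributes": ["walking"]},
        "obj3": {"name": "building", "attributes": ["red", "tall"]},
    },
    # Relationships are directed edges: (subject, predicate, object).
    "relations": [
        ("obj2", "next to", "obj1"),
        ("obj1", "in front of", "obj3"),
    ],
    # Optional annotations from auxiliary models, e.g. a depth estimator.
    "depth": {"obj1": 4.0, "obj2": 3.2, "obj3": 12.5},
}
```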
Once the scene graphs are ready, they feed programs written with Python and text templates that act as full-fledged data generators capable of creating question-answer pairs for AI training pipelines.
“Each (data) generator uses hundreds of predefined templates, which systematically integrate these annotations to produce diverse instruction data. These generators are designed to … compare, retrieve, and reason about basic visual concepts of objects, attributes, and relationships based on the detailed information encoded in each scene graph,” the researchers behind the framework wrote in a paper.
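As a hedged illustration of how one such template-driven generator could work, building on the hypothetical scene graph sketched above (this is not ProVision's actual code, and the single template here stands in for the hundreds the paper describes):

```python
def relation_qa_generator(scene_graph):
    """Illustrative generator: turn each relation edge of a scene graph
    into a question-answer pair using one fixed text template."""
    objects = scene_graph["objects"]
    qa_pairs = []
    for subj, predicate, obj in scene_graph["relations"]:
        subj_name = objects[subj]["name"]
        obj_name = objects[obj]["name"]
        question = f"What is the relationship between the {subj_name} and the {obj_name}?"
        answer = f"The {subj_name} is {predicate} the {obj_name}."
        qa_pairs.append({"question": question, "answer": answer})
    return qa_pairs
```

Run against the example graph above, this would yield pairs such as the pedestrian-and-car question the researchers quote below.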

ProVision-10M dataset for AI training
In its work, Salesforce used both approaches, augmenting manually annotated scene graphs and generating scene graphs from scratch, to establish the scene graphs powering 24 single-image data generators and 14 multi-image generators.
“With these data generators, we can automatically synthesize questions and answers given an image's scene graph. For example, given an image of a busy street, ProVision can generate questions such as, ‘What is the relationship between the pedestrian and the car?' or ‘Which object is closer to the red building, (the) car or a pedestrian?'” lead researchers Jieyu Zhang and Le Xue noted in a blog post.
The data generators following the first approach, which augments Visual Genome's scene graphs with depth and segmentation annotations from Depth Anything V2 and SAM-2, helped create 1.5 million single-image instruction data points and 4.2 million multi-image instruction data points. Meanwhile, the second, using 120,000 high-resolution images from the DataComp dataset and models such as YOLO-World, CoCa, LLaVA-1.5 and Osprey, generated 2.3 million single-image instruction data points and 4.2 million multi-image instruction data points.
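The from-scratch pipeline can be pictured roughly as follows. Every function below is a placeholder stub standing in for one of the models named above; the names, signatures and return values are assumptions made for illustration, not real model APIs or Salesforce's pipeline code.

```python
def detect_objects(image):
    # Stub standing in for an open-vocabulary detector (e.g., YOLO-World).
    return [{"id": "obj1", "name": "car", "bbox": (10, 20, 120, 80)}]

def describe_attributes(image, bbox):
    # Stub standing in for a region-level attribute model (e.g., Osprey).
    return ["red", "parked"]

def estimate_depth(image):
    # Stub standing in for a monocular depth estimator.
    return {"obj1": 4.0}

def build_scene_graph(image):
    """Hypothetical assembly step: merge per-model outputs into one graph."""
    objects = detect_objects(image)
    for obj in objects:
        obj["attributes"] = describe_attributes(image, obj["bbox"])
    # Relations could come from a captioner or MLM prompted about object pairs.
    relations = []
    return {"objects": objects, "relations": relations, "depth": estimate_depth(image)}
```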
In total, the four splits together make up ProVision-10M, a dataset with more than 10 million unique instruction data points. It is now available on Hugging Face and has already proven highly effective in AI training pipelines.
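For teams that want to experiment, the dataset can presumably be pulled straight from the Hugging Face hub with the datasets library; the repository ID and split name below are assumptions to verify on the dataset's hub page.

```python
from datasets import load_dataset

# Dataset ID and split are assumptions; check the Hugging Face hub page.
ds = load_dataset("Salesforce/ProVision-10M", split="train", streaming=True)
print(next(iter(ds)))  # inspect a single instruction data point
```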
In particular, when the company incorporated ProVision-10M into multimodal AI fine-tuning recipes (LLaVA-1.5 for single-image instruction data and Mantis-SigLIP-8B for multi-image instruction data), it saw notable improvements, with the models' average performance ending up higher than when fine-tuned without ProVision data.
“When adopted in the instruction tuning stage, our single-image instruction data yields up to a 7% improvement on the 2D split and 8% on the 3D split of CVBench, along with a 3% increase in performance on QBench2, RealWorldQA, and MMMU. Our multi-image instruction data leads to an 8% improvement on Mantis-Eval,” the researchers said in the paper.

Synthetic data is here to stay
While there are several tools and platforms, including Nvidia's new Cosmos world foundation models, for generating different modalities of data (from images to videos) that can be used to train multimodal AI, only a handful have tackled the problem of creating the instruction datasets that pair with that data.
Salesforce is addressing that bottleneck with ProVision, giving enterprises a way to go beyond manual labeling or black-box language models. Its approach of generating instruction data programmatically lets teams define and control the generation process and scale it efficiently while maintaining factual accuracy.
In the long run, the company hopes researchers will build on this work to enhance scene graph generation pipelines and create more data generators covering new types of instruction data, such as those for videos.