Gemini Fine-Tuning for Concise Structured Output - Training Data Prep Tips

In the discussion below we consider a multimodal (vision + language) problem where the input is an image of a nutrition label plus a question, and the expected answer is short and focused on nutritional values.

 

Example

Text input "What is the sugar content of this food?

Nutrition Label image

 

OUTPUT (ground truth): 0.0 g per 100 g

 

 

 

PROBLEM: the generated output from a lightly tuned model is:

“Based on the Nutrition Facts label:

  • Total Carb. is 0g.

  • The label also states at the bottom: 'Not a significant source of... total sugars, added sugars.'

Therefore, the sugar content of this food is 0g per serving.”

 

QUESTION: Is there a way to get answers that are simpler and more closely aligned with the ground truth in structure?


1) Choose your answer style & be consistent

This matters most when you want concise answers in a simple or structured format for your GENERATED content (the prediction). For your nutrition VQA I'd use:

Answer with the minimum text necessary: for quantities return "value unit" (e.g., 0 g). No sentences, no reasoning, no restating the question.  
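To make that concrete, one option is to generate every ground-truth target through the same tiny formatter so values and units never drift. The helper below is only an illustration, not part of any Vertex AI API:

def format_quantity(value: float, unit: str) -> str:
    """Render a nutrition quantity as 'value unit', e.g. '0 g' or '15 mg'."""
    # %g-style formatting drops a trailing .0, so 0.0 -> '0' but 0.5 stays '0.5'.
    return f"{value:g} {unit}"

# Every ground-truth answer in the dataset should come out of the same path:
targets = [format_quantity(0.0, "g"),   # "0 g"
           format_quantity(15, "mg"),   # "15 mg"
           format_quantity(2.5, "g")]   # "2.5 g"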

2) Add system instruction to every training example

Vertex AI SFT supports a systemInstruction field per example. It's optional, but when present it conditions the model for that example; the docs show the field and its purpose explicitly. (Google Cloud)

Before (one JSONL line)

 

Your original JSONL line might look like this:

{"contents":[{"role":"user",
               "parts":[{"fileData":{"mimeType":"image/jpeg","fileUri":"gs://.../train/images/3.jpg"}},
                        {"text":"What is the sugar content per serving?"}]},
             {"role":"model","parts":[{"text":"0 g"}]}]}  

After (system instruction added)

{"systemInstruction": {"parts": [{"text": "Answer with the minimum text necessary: for quantities return \"value unit\" (e.g., 0 g). No sentences, no reasoning, no restating the question."}]},
 "contents":[{"role":"user",
              "parts":[{"fileData":{"mimeType":"image/jpeg","fileUri":"gs://.../train/images/3.jpg"}},
                       {"text":"What is the sugar content per serving?"}]},
             {"role":"model","parts":[{"text":"0 g"}]}]
}
  • role inside systemInstruction is ignored; only the text matters. (Google Cloud)

  • Keep your targets consistently terse (e.g., "0 g", never sentences).
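If your training JSONL already exists without the field, a short script can inject the same instruction into every line. The file names are placeholders; the structure written out matches the "After" example above:

import json

SRC = "train.jsonl"                     # existing dataset (placeholder name)
DST = "train_with_system.jsonl"         # output with systemInstruction added

SYSTEM_TEXT = ("Answer with the minimum text necessary: for quantities return "
               '"value unit" (e.g., 0 g). No sentences, no reasoning, '
               "no restating the question.")

with open(SRC) as fin, open(DST, "w") as fout:
    for line in fin:
        example = json.loads(line)
        # Per-example system instruction in the format Vertex AI SFT expects.
        example["systemInstruction"] = {"parts": [{"text": SYSTEM_TEXT}]}
        fout.write(json.dumps(example) + "\n")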


 

3) Also set the inference system instruction

Even after SFT, repeat the rule at inference to keep responses short. (Warning: a low max_output_tokens can cut off or suppress the output, so use it carefully or not at all.)

Vertex AI SDK (Python):

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel(
    model_name="projects/…/locations/…/endpoints/…/tunedModels/YOUR_TUNED_MODEL",
    system_instruction=("Answer with the minimum text necessary: for quantities return "
                        '"value unit" (e.g., 0 g). No sentences, no reasoning, '
                        "no restating the question."))

resp = model.generate_content(
    [Part.from_uri("gs://.../test.jpg", mime_type="image/jpeg"),
     "Q: Sugar per serving? Output: value + unit."],
    generation_config={"temperature": 0.1, "max_output_tokens": 8})

print(resp.text)

System instructions apply to the whole request; Vertex docs recommend them for steering style. (Google Cloud Documentation)
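Once the tuned endpoint is responding, a rough exact-match check over a few held-out examples shows whether predictions now mirror the terse ground truth. The eval-file layout (image_uri, question, answer keys) is an assumption, and the snippet reuses model and Part from the code above:

import json

def normalize(s: str) -> str:
    return " ".join(s.strip().lower().split())

matches, total = 0, 0
with open("eval.jsonl") as f:           # assumed held-out file
    for line in f:
        ex = json.loads(line)
        resp = model.generate_content(
            [Part.from_uri(ex["image_uri"], mime_type="image/jpeg"), ex["question"]],
            generation_config={"temperature": 0.1})
        total += 1
        matches += normalize(resp.text) == normalize(ex["answer"])

print(f"exact match: {matches}/{total}")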


Sanity checklist

  • Every JSONL line now has systemInstruction.

  • All targets are terse (e.g., 0 g, 15 mg, Yes).

  • Units and spacing are consistent across the dataset.

  • Inference uses the same style + small max_output_tokens (≈8–15) and low temperature.

Finally, add a tiny cell that verifies every target is short and normalized (unit, spacing) before you launch the job.
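A sketch of that check (the regex, length cap, and file name are assumptions to adapt to your dataset):

import json
import re

# Accept terse targets like "0 g", "15 mg", "2.5 g", plus "Yes"/"No".
PATTERN = re.compile(r"^(\d+(\.\d+)?\s?(g|mg|mcg|kcal|%)|Yes|No)$")

bad = []
with open("train_with_system.jsonl") as f:      # assumed dataset file
    for i, line in enumerate(f, 1):
        example = json.loads(line)
        # The final model turn holds the ground-truth target.
        target = example["contents"][-1]["parts"][0]["text"].strip()
        if len(target) > 20 or not PATTERN.match(target):
            bad.append((i, target))

print(f"{len(bad)} suspicious targets")
for i, t in bad[:10]:
    print(f"  line {i}: {t!r}")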
