FoodExtract-Vision: Fine-tuned SmolVLM2-500M
- Base model: https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct
- Fine-tuning dataset: https://huggingface.co/datasets/berkeruveyik/vlm-food-4k-not-food-dataset
- Fine-tuned model: https://huggingface.co/berkeruveyik/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v3
Overview
This demo compares a base vision-language model with its fine-tuned version, which is specialized in extracting food and drink items from images as JSON. The comparison shows what fine-tuning contributes to structured output generation.
The base model often fails to follow the required output structure, producing inconsistent or unstructured responses. The fine-tuned model reliably generates valid JSON outputs matching the specified schema.
Task Description
Both models receive the same prompt, which requests food/drink classification and extraction:
Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
Only return valid JSON in the following form:
```json
{
'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)
'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present
'food_items': [], # list[str] - list of visible edible food item nouns
'drink_items': [] # list[str] - list of visible edible drink item nouns
}
```
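To compare the two models on whether they follow the schema, each raw response can be checked programmatically. The following is a minimal sketch, not the demo's actual evaluation code: `parse_food_output` and `EXPECTED_TYPES` are hypothetical names, and the fallback to Python-style dict literals is an assumption based on the single-quoted template in the prompt.

```python
import ast
import json

# Field names and types taken from the prompt template above.
EXPECTED_TYPES = {
    "is_food": int,
    "image_title": str,
    "food_items": list,
    "drink_items": list,
}


def parse_food_output(text):
    """Return the parsed dict if `text` matches the schema, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: the prompt template uses single quotes, so a model may
        # answer with a Python-style dict literal instead of strict JSON.
        try:
            data = ast.literal_eval(text.strip())
        except (ValueError, SyntaxError):
            return None

    # Require exactly the four keys named in the prompt.
    if not isinstance(data, dict) or set(data) != set(EXPECTED_TYPES):
        return None
    for key, expected in EXPECTED_TYPES.items():
        if not isinstance(data[key], expected):
            return None

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(data["is_food"], bool) or data["is_food"] not in (0, 1):
        return None

    # Both item lists must contain only strings.
    items = data["food_items"] + data["drink_items"]
    if not all(isinstance(item, str) for item in items):
        return None
    return data
```

A response that returns `None` here counts as a schema failure, which is the behavior the base model is described as showing more often than the fine-tuned one.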
Training Details
The fine-tuned model was trained on 3,698 images from the vlm-food-4k-not-food-dataset:
- Food images: Multiple categories from the Food270 dataset including various cuisines, ingredients, and prepared dishes
- Non-food images: Random internet images to teach the model to correctly identify non-food content
- Each image is labeled with structured JSON outputs including classification, titles, and extracted food/drink items
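As an illustration of the label format described above, the sketch below builds one label string per image. It assumes the labels use the same four fields as the prompt schema; `make_label` is a hypothetical helper, the example values are invented, and the dataset's actual column layout is not shown here.

```python
import json


def make_label(image_title="", food_items=(), drink_items=()):
    """Build a JSON label string with the four fields from the prompt schema."""
    food_items = list(food_items)
    drink_items = list(drink_items)
    # An image counts as food if any food or drink item is visible.
    has_food = bool(food_items or drink_items)
    return json.dumps({
        "is_food": int(has_food),
        # Per the prompt, the title is left blank when no food is present.
        "image_title": image_title if has_food else "",
        "food_items": food_items,
        "drink_items": drink_items,
    })


# Invented examples: one food image and one non-food image.
food_label = make_label("Burger with fries and cola", ["burger", "fries"], ["cola"])
not_food_label = make_label()
```

Deriving `is_food` from the item lists keeps the classification and the extracted items consistent within each label.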
Examples