PreferThinker: Reasoning-Based Personalized Image Preference Assessment

Reasoning-based Personalized Assessment

Watch PreferThinker analyze and reason step-by-step

[Interactive demo: preferred and non-preferred reference images, candidate images A and B, and PreferThinker's step-by-step reasoning output.]

Personalized Preference Assessment

The practical yet challenging task of personalized image preference assessment aims to align generative models with individual user tastes. Its core challenges are the limited amount of personalized data available per user and the complex, ill-defined nature of personal preferences (e.g., style, color), which make it difficult for models trained on general preferences to serve individual users effectively.

Existing methods, whether CLIP-based or MLLM-based, struggle with personalization: they either lack interpretability or fail to explicitly leverage the critical prior information contained in a user's reference images. Merely feeding references to a model (as in ViPer) uses them only implicitly, without interpretable reasoning steps that explain how the final assessment follows from the individual's specific preferences. A reasoning-based approach is therefore needed to make the assessment process transparent, grounded in the user's data, and multi-dimensional.

We propose PreferThinker, a reasoning-based system built around preference profile prediction. Its core insight is to bridge different users through a common set of visual preference elements (e.g., color, style), so that any individual's taste can be expressed as a profile over these shared elements.

Illustration of the motivation of personalized preference assessment.

Personalized Preference Dataset

Illustration of the proposed dataset PreferImg-CoT.

To address the difficulty of acquiring personalized data with annotated preference profiles, we constructed PreferImg-CoT, a large-scale dataset with 80K simulated users, each having a unique preference profile:

  • Preference Profile Generation: Five visual preference elements are randomly sampled to assign each user's preference/non-preference profiles; 20% of users (16K) are assigned multiple profiles to simulate real-world multi-preference behavior (see the sampling sketch after this list).
  • Image Generation: Profiles and initial prompts are fed into text-to-image models (e.g., Stable Diffusion) to generate reference images (preferred/non-preferred pairs) and candidate images (for evaluation); 190K initial prompts were selected from Lexica, DiffusionDB, and COCO to ensure diversity.
  • Dataset Scale: The final annotated dataset contains 80K user profiles and 1.36 million images, laying the foundation for CoT-style dataset construction.
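
Below is a minimal sketch of how such simulated user profiles could be sampled. The element pool, element names, and helpers (ELEMENT_POOL, sample_profile, sample_user) are hypothetical stand-ins for illustration, not the paper's actual taxonomy or procedure.

```python
import random

# Hypothetical pools of visual preference elements; the actual element
# taxonomy used to build PreferImg-CoT is not specified here.
ELEMENT_POOL = {
    "color":       ["warm tones", "cool tones", "monochrome", "pastel"],
    "style":       ["watercolor", "photorealistic", "anime", "oil painting"],
    "lighting":    ["soft light", "dramatic shadows", "neon glow"],
    "composition": ["minimalist", "densely detailed", "symmetric"],
    "subject":     ["landscapes", "portraits", "architecture", "animals"],
}

def sample_profile(num_elements=5, rng=random):
    """Sample one preference / non-preference profile pair by drawing
    contrasting values for `num_elements` visual elements (a sketch,
    not the paper's exact procedure)."""
    dims = rng.sample(list(ELEMENT_POOL), k=num_elements)
    preferred, non_preferred = {}, {}
    for dim in dims:
        pref, non_pref = rng.sample(ELEMENT_POOL[dim], k=2)
        preferred[dim] = pref
        non_preferred[dim] = non_pref
    return preferred, non_preferred

def sample_user(rng=random):
    """Simulate one user; roughly 20% of users receive multiple profiles
    to mimic real-world multi-preference behavior."""
    n_profiles = rng.choice([2, 3]) if rng.random() < 0.2 else 1
    return [sample_profile(rng=rng) for _ in range(n_profiles)]

users = [sample_user() for _ in range(80_000)]
```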

Training Strategy and Proposed Prediction Reward

Illustration of the training strategy and the proposed prediction reward.

We employ a two-stage training strategy to elicit and incentivize the model’s structured reasoning capabilities.

Stage 1: Supervised Fine-tuning for Cold-start Initialization.

In Stage 1, the model Qwen2.5-VL-7B is initialized via supervised fine-tuning (SFT) using the PreferImg-CoT dataset. It learns to predict user preference profiles from reference images and uses them to generate interpretable scores for candidate images. Training employs autoregressive language modeling with token-level cross-entropy loss.
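
The objective is standard next-token prediction over the CoT target. The following is a minimal PyTorch sketch of the token-level cross-entropy, assuming the prompt region (reference images and instruction) is masked out so the loss covers only the reasoning and scoring tokens; tensor shapes and the masking convention are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Token-level cross-entropy over the CoT target only.

    logits:     (B, T, V) model outputs for the full sequence.
    input_ids:  (B, T) token ids of that sequence.
    prompt_len: (B,) number of prompt tokens per sample; prompt tokens
                are masked so the loss covers only generated tokens.
    """
    # Standard next-token shift: predict token t+1 from position t.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask the prompt region with -100 so it is ignored by the loss.
    positions = torch.arange(shift_labels.size(1), device=logits.device)
    prompt_mask = positions[None, :] < (prompt_len[:, None] - 1)
    shift_labels[prompt_mask] = -100

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```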

Stage 2: GRPO-based Reinforcement Learning for Post-training.

After SFT, Group Relative Policy Optimization (GRPO) is applied for reinforcement learning-based post-training. This method enhances generalization by sampling diverse reasoning paths without a critic model. It computes group-relative advantages from the rewards of multiple sampled responses and optimizes the policy under a KL-divergence constraint to balance exploration and stability.
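
A schematic sketch of the GRPO computation is shown below, assuming the common formulation of group-normalized advantages combined with a clipped ratio and a KL penalty toward the reference (SFT) policy; hyperparameter values and the per-token aggregation details are illustrative, not the paper's exact settings.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each response's reward by the
    mean and std of its group (the G responses sampled for one prompt).

    rewards: (num_prompts, G) scalar rewards for G sampled reasoning paths.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logprobs, old_logprobs, ref_logprobs, advantages,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped policy-gradient objective with a KL penalty toward the
    reference (SFT) policy. Inputs are per-response quantities here;
    a full implementation aggregates per-token statistics."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Unbiased KL estimate between the current and reference policies.
    delta = ref_logprobs - logprobs
    kl = delta.exp() - delta - 1
    return policy_loss + kl_coef * kl.mean()
```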

Similarity-aware Prediction Reward Design.

To enhance the accuracy of the model’s prediction of user preference profiles, a dual-modality reward mechanism is proposed. It combines text similarity (computed via SBERT) between predicted and ground-truth preference/non-preference profiles, and image similarity (computed via DreamSim) between images generated from predicted and ground-truth profiles using a T2I model. The final reward is a weighted sum of both similarities.
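
A hedged sketch of how such a reward could be assembled is given below, assuming SBERT cosine similarity for the text branch and an externally supplied image-similarity callable (e.g., one minus a DreamSim distance) for the image branch. The checkpoint name, the weighting alpha, and the function signatures are assumptions for illustration, not the paper's exact implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Text-similarity branch (SBERT); this particular checkpoint is an assumption.
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def text_similarity(pred_profile: str, gt_profile: str) -> float:
    """Cosine similarity between SBERT embeddings of the predicted and
    ground-truth profile descriptions."""
    emb = sbert.encode([pred_profile, gt_profile], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def prediction_reward(pred_profiles, gt_profiles,
                      pred_image, gt_image,
                      image_similarity, alpha=0.5):
    """Weighted sum of the text and image reward branches.

    pred_profiles / gt_profiles: (preference, non-preference) text pairs.
    pred_image / gt_image:       images generated by a T2I model from the
                                 predicted and ground-truth profiles.
    image_similarity:            callable returning a similarity in [0, 1]
                                 (e.g., 1 - DreamSim distance); passed in
                                 because the exact DreamSim call is not
                                 reproduced here.
    alpha:                       text/image weighting (illustrative value).
    """
    r_text = 0.5 * (text_similarity(pred_profiles[0], gt_profiles[0]) +
                    text_similarity(pred_profiles[1], gt_profiles[1]))
    r_image = image_similarity(pred_image, gt_image)
    return alpha * r_text + (1 - alpha) * r_image
```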

Together, these two stages and the proposed reward result in a model that is not only accurate, but also interpretable and trustworthy in its reasoning process.

More Examples

Explore additional examples of PreferThinker's reasoning capabilities across diverse scenarios.

[Four additional example sets, each with six preferred reference images, six non-preferred reference images, and candidate images A and B.]

Acknowledgments

We would like to express our gratitude to the open source community for their valuable contributions and inspiration.

This website design is inspired by and adapted from the Rex-Thinker project website, which provides an excellent template for academic paper presentations with interactive demonstrations and modern web design.

We also thank the developers and contributors of the following open-source projects and resources:

  • Font Awesome for providing excellent icons
  • GitHub Pages for free hosting
  • The open source community for inspiration and tools