Background and Motivation
A few months ago, Apple’s research team released a study on training AI to generate functional UI code that compiles and matches user prompts. Building on that work, the same group has published a follow‑up paper, “Improving User Interface Generation Models from Designer Feedback,” which examines the limits of traditional Reinforcement Learning from Human Feedback (RLHF) for UI design.
Why Conventional RLHF Falls Short
Standard RLHF relies on simple thumbs‑up/down or ranking signals that do not reflect how professional designers actually work. Designers typically use comments, sketches, and hands‑on edits to justify and improve UI decisions, and traditional RLHF ignores all of that information.
Designer‑Native Feedback Loop
The researchers recruited 30+ designers with 2‑30+ years of experience across UI/UX, product, and service design. Participants reported conducting design reviews anywhere from a few times a month to several times a week. In the study, they gave feedback in three ways:
- Critiquing model‑generated interfaces with written comments.
- Creating sketch annotations that illustrate desired improvements.
- Directly editing the UI (e.g., moving components, changing colors).
These before‑and‑after changes were transformed into 1,460 paired “preference” examples, which served as training data for a reward model.
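The conversion from designer revisions to preference pairs can be sketched roughly as follows. This is a minimal illustration, not the paper's data pipeline: the record layout and the names DesignerRevision, PreferencePair, and to_preference_pairs are hypothetical, and the only idea taken from the text is that the revised ("after") design is treated as preferred over the original ("before") design.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DesignerRevision:
    """One designer revision of a model-generated UI (hypothetical schema)."""
    prompt: str         # natural-language description of the target UI
    original_html: str  # UI code as generated by the model ("before")
    revised_html: str   # UI code after the designer's feedback was applied ("after")
    modality: str       # e.g. "comment", "sketch", or "edit"


@dataclass(frozen=True)
class PreferencePair:
    """A training example: 'chosen' is preferred over 'rejected' for 'prompt'."""
    prompt: str
    chosen: str
    rejected: str
    modality: str


def to_preference_pairs(revisions: List[DesignerRevision]) -> List[PreferencePair]:
    """Turn before/after revisions into (chosen, rejected) preference pairs."""
    pairs = []
    for rev in revisions:
        # Skip revisions that left the UI unchanged: they carry no preference signal.
        if rev.revised_html.strip() == rev.original_html.strip():
            continue
        pairs.append(
            PreferencePair(
                prompt=rev.prompt,
                chosen=rev.revised_html,
                rejected=rev.original_html,
                modality=rev.modality,
            )
        )
    return pairs
```

Keeping the modality on each pair makes it possible to train and compare reward models on sketches, edits, and comments separately, which is how the results below are reported.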
Reward Model Architecture
The reward model takes two inputs: (i) a rendered screenshot of the UI and (ii) a natural‑language description of the target UI. It outputs a scalar score calibrated so that higher‑quality designs receive larger values. To evaluate HTML code, the pipeline renders the code into screenshots via browser automation before scoring.
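The summary above does not say which training objective the reward model uses. A common choice for learning a scalar score from paired preferences is the Bradley‑Terry pairwise loss, sketched here under that assumption. The function name pairwise_preference_loss is hypothetical, and the two arguments stand in for the scores the reward model would assign to the preferred and the rejected screenshot for the same description.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(chosen - rejected)).

    The loss is small when the preferred design scores well above the
    rejected one and grows when the ordering is wrong.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal scores mean the model cannot tell the designs apart: loss is log(2).
tie_loss = pairwise_preference_loss(0.0, 0.0)
# A correct ordering is penalized less than an incorrect one.
good_loss = pairwise_preference_loss(2.0, 0.0)
bad_loss = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many pairs pushes the scalar output in the direction the text describes, with higher‑quality designs receiving larger values.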
Fine‑Tuning the Generator
Apple used Qwen2.5‑Coder as the primary base UI generator and applied the designer‑trained reward model to fine‑tune it. Smaller Qwen variants, along with the newer Qwen3‑Coder, were also fine‑tuned to test how well the approach carries over to other model sizes and versions. Although the pipeline resembles a classic RLHF loop, the learning signal originates from designer‑native feedback rather than binary ratings.
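The exact optimization algorithm is not described here, so the sketch below swaps in a simpler, well‑known stand‑in: best‑of‑n selection, where the generator samples several candidates per prompt and the reward model keeps the highest‑scoring one as a fine‑tuning target. The names select_best_candidates, generate, and reward are hypothetical placeholders; in a real pipeline, generate would call the UI generator and reward would render the HTML to a screenshot and score it.

```python
from typing import Callable, List, Tuple


def select_best_candidates(
    prompts: List[str],
    generate: Callable[[str], str],       # assumed: prompt -> candidate HTML
    reward: Callable[[str, str], float],  # assumed: (prompt, HTML) -> scalar score
    n_samples: int = 4,
) -> List[Tuple[str, str]]:
    """For each prompt, sample n candidates and keep the one the reward model prefers."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda html: reward(prompt, html))
        selected.append((prompt, best))
    return selected
```

The resulting (prompt, best candidate) pairs could then be used as supervised fine‑tuning data. The point of the sketch is only that the reward model, rather than a human rater, decides which generations the generator learns from.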
Results: Quality Gains and Surprising Wins
Key findings include:
- Models fine‑tuned with designer feedback (especially sketches and direct edits) consistently outperformed the base models and those trained with conventional ranking data.
- The best model, Qwen3‑Coder fine‑tuned with sketch feedback, surpassed GPT‑5 on UI generation benchmarks, despite using only 181 sketch annotations.
- High‑quality expert feedback enabled smaller models to exceed larger proprietary LLMs.
The research team agreed with designers’ judgments 63.6% of the time for sketch‑based feedback and 76.1% of the time for direct edits, which suggests that the more concrete the feedback modality, the clearer the signal it provides.
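An agreement rate of this kind is simply the share of items on which two parties reach the same judgment. The helper below is an illustrative calculation on toy data, not the study's evaluation code; the name agreement_rate and the boolean encoding (True when the researchers agreed with the designer on an item) are assumptions.

```python
from typing import Sequence


def agreement_rate(judgments: Sequence[bool]) -> float:
    """Return the percentage of items on which the two parties agreed."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return 100.0 * sum(judgments) / len(judgments)


# Toy data: agreement on 3 of 4 items.
toy_rate = agreement_rate([True, True, False, True])  # 75.0
```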
Challenges: Subjectivity and Variance
The study acknowledges that UI quality is inherently subjective, leading to high variance in responses. Traditional ranking feedback struggles with this variance, whereas designer‑native signals provide clearer, more actionable guidance.
Implications and Future Directions
This work suggests that a modest amount of expert, design‑focused feedback can dramatically improve UI generation, making AI‑assisted design tools more reliable and aligned with professional workflows. Future research may explore scaling the approach, integrating multimodal feedback (e.g., video walkthroughs), and extending it to other design domains.