Apple Researchers Boost UI Generation with Designer Feedback

Apple’s new paper demonstrates that fine‑tuning UI generators with designer comments, sketches, and edits yields higher‑quality interfaces, even outperforming larger models like GPT‑5.

6 February 2026 by TechStora Editorial Board

Background and Motivation

A few months ago, Apple’s research team released a study on training AI to generate functional UI code that compiles and matches user prompts. Building on that work, the same group has published a new paper, “Improving User Interface Generation Models from Designer Feedback,” which examines the limits of traditional Reinforcement Learning from Human Feedback (RLHF) for UI design.

Why Conventional RLHF Falls Short

Standard RLHF relies on simple thumbs‑up/down or ranking signals that do not reflect the nuanced workflows of professional designers. Designers typically use comments, sketches, and hands‑on edits to justify and improve UI decisions, and traditional RLHF ignores that information.

Designer‑Native Feedback Loop

The researchers recruited 30+ designers with between 2 and 30+ years of experience across UI/UX, product, and service design. How often participants took part in design reviews in their regular work ranged from a few times a month to several times a week. In the study, their workflow involved:

Critiquing model‑generated interfaces with written comments.
Creating sketch annotations that illustrate desired improvements.
Directly editing the UI (e.g., moving components, changing colors).

These before‑and‑after changes were transformed into 1,460 paired “preference” examples, which served as training data for a reward model.
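As a rough illustration of how before‑and‑after changes can become paired preference examples, the sketch below treats the original model output as the "rejected" item and the designer‑improved version as the "chosen" item. The record fields, the PreferencePair class, and the build_pairs helper are hypothetical names chosen for this example; the paper's actual data format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one paired preference example."""
    prompt: str    # natural-language description of the target UI
    rejected: str  # model-generated UI before the designer's change
    chosen: str    # UI after the designer's feedback was applied
    modality: str  # "comment", "sketch", or "edit"


def build_pairs(records):
    """Turn before/after records into preference pairs, skipping no-op changes."""
    pairs = []
    for r in records:
        if r["before"] == r["after"]:
            continue  # an unchanged UI carries no preference signal
        pairs.append(
            PreferencePair(r["prompt"], r["before"], r["after"], r["modality"])
        )
    return pairs


# Made-up records for illustration only (not data from the study).
records = [
    {"prompt": "login screen", "before": "<button>Go</button>",
     "after": "<button class='primary'>Sign in</button>", "modality": "edit"},
    {"prompt": "settings page", "before": "<ul></ul>",
     "after": "<ul></ul>", "modality": "sketch"},
]
pairs = build_pairs(records)
```

Here the second record is dropped because nothing changed, so only the direct edit yields a training pair.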

Reward Model Architecture

The reward model takes two inputs: (i) a rendered screenshot of the UI and (ii) a natural‑language description of the target UI. It outputs a scalar score calibrated so that higher‑quality designs receive larger values. To evaluate HTML code, the pipeline renders the code into screenshots via browser automation before scoring.
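One common way to train a scalar reward model on paired preferences is a Bradley‑Terry style loss, which pushes the score of the preferred design above the score of the rejected one. The article does not state which loss the researchers used, so the following is a minimal sketch under that assumption; pairwise_loss and preference_accuracy are illustrative names, and the scores are plain floats standing in for the model's output on a screenshot and description.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_accuracy(scored_pairs) -> float:
    """Fraction of (chosen, rejected) score pairs where chosen scores higher."""
    hits = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return hits / len(scored_pairs)
```

The loss is small when the designer‑preferred UI already scores well above the alternative and grows roughly linearly when the ordering is wrong, which is what makes higher scores track higher‑quality designs.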

Fine‑Tuning the Generator

Apple used Qwen2.5‑Coder as the base UI generator and applied the designer‑trained reward model to fine‑tune it. Other Qwen variants, including smaller ones, were also fine‑tuned to test scalability. Although the pipeline resembles a classic RLHF loop, the learning signal originates from designer‑native feedback rather than binary ratings.
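The article does not specify the exact fine‑tuning algorithm, so the sketch below shows only one generic way a reward model can steer a generator: sample several candidates per prompt, score each, and keep the best and worst as a new training pair. The select_training_pair function and the toy_generate and toy_score stand‑ins are hypothetical; in a real pipeline the scoring step would render the HTML to a screenshot and call the trained reward model.

```python
from typing import Callable, List, Tuple


def select_training_pair(
    prompt: str,
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, str]:
    """Sample n candidates and return (best, worst) under the reward function."""
    candidates = generate(prompt, n)
    ranked = sorted(candidates, key=lambda html: score(prompt, html))
    return ranked[-1], ranked[0]


# Toy stand-ins so the sketch runs without a real model or browser.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"<main>{prompt}</main>" + "<p>detail</p>" * i for i in range(n)]


def toy_score(prompt: str, html: str) -> float:
    # Pretend that layouts with more content blocks score higher.
    return float(html.count("<p>"))


best, worst = select_training_pair("weather app", toy_generate, toy_score)
```

Passing the generator and scorer in as callables keeps the selection logic independent of any particular model, so the same loop could in principle be reused across model sizes.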

Results: Quality Gains and Surprising Wins

Key findings include:

Models fine‑tuned with designer feedback (especially sketches and direct edits) consistently outperformed the base models and those trained with conventional ranking data.
The best model, Qwen3‑Coder fine‑tuned with sketch feedback, surpassed GPT‑5 on UI generation benchmarks, despite using only 181 sketch annotations.
High‑quality expert feedback enabled smaller models to exceed larger proprietary LLMs.

The research team agreed with designers’ improvements 63.6% of the time for sketch‑based feedback and 76.1% of the time for direct edits, which suggests that richer feedback modalities produce clearer signals.

Challenges: Subjectivity and Variance

The study acknowledges that UI quality is inherently subjective, leading to high variance in responses. Traditional ranking feedback struggles with this variance, whereas designer‑native signals provide clearer, more actionable guidance.

Implications and Future Directions

This work suggests that a modest amount of expert, design‑focused feedback can dramatically improve UI generation, making AI‑assisted design tools more reliable and aligned with professional workflows. Future research may explore scaling the approach, integrating multimodal feedback (e.g., video walkthroughs), and extending it to other design domains.