Term Project Presentation and Poster Session

  • Date: 2026/06/10 (Wednesday)
  • Time: 3:30PM ~ 5:00PM
  • Location: Room 106 and 1st Floor, Building 301
  • Schedule
  • 3:30PM ~ 3:55PM : Short video presentation (Room 106, Building 301)
  • 4:00PM ~ 5:00PM : Poster Session (1st Floor, Building 301)
  • Poster Session Page
  1. Algorithmic Foundations of Robot Learning
  • Rollout-Free Demonstration Curation for Imitation and Offline Reinforcement Learning in Robotic Manipulation (Kunze Leonhard) - Best Poster Award!

  • Adaptive Expectile Scheduling for Goal-Conditioned Offline Reinforcement Learning (Taewook Kang)

  • Latent Predictive Representation Learning for Robot Control: An Empirical Study on LIBERO (Woohyeon Park)

  • Learning Composable Effect-Conditioned Skills via Successor Feature Control (Hochan Bang)

  • Model-Agnostic Sim-to-Real Adaptation via Trajectory Alignment with Iterative Sim-Real Refinement (Hyeondal Son)

  • Flow Matching Policy with Parametric Curve Actions (Minchang Song)
  • Target Bridge Hindsight Relabeling for Sparse Goal Conditioned Reinforcement Learning (Seoyeon Ahn)
  • Evaluating Geometric Horizon Models for Image-based Offline Reinforcement Learning (Jaeseok Yang)
  • Leveraged GAIL (L-GAIL): Adversarial Imitation from Positive and Negative Demonstrations via a Constrained MDP (Younghwan Lee)
  • Counterfactual Near-Miss Risk Augmentation for Safe Reinforcement Learning (Seoyoung Lim)
  2. Learning for Robotics & AI Systems
  • Cost-Effective Adaptive Curriculum Learning for Distributional RLVR Training of Vision Language Models (Dohyung Kim)

  • Goal-Conditioned Reinforcement Learning for Planar Visual Servoing: A Compressed State-Space Proxy Study (Byungju Kim)
  • Action-Conditioned 3D Gaussian World Models via Point-Cloud Flow Prediction (Jooyoung Kim)
  • One-Shot VLA Adaptation under Environmental Shifts via Weight-Space Analogy Operations (Taeheon Kim)
  • CarryMimic: Stable Whole-Body Humanoid Object Carrying via Force-Closure Reward (Kyungrok Rho) - Best Poster Award!
  • Active Deep Descriptor and Keypoint Seeding for Local-Trajectory-Aware Visual Odometry (Wisoo Song)
  • Context-Aware Physics-Informed World Models for Non-Stationary Robotic Manipulation (Sungkwon On)
  • Beyond Accuracy: Diagnosing Reward-Induced Behavioral Changes in Mathematical Reasoning (Yongha Lee)
  • Proprioceptive Motion Memory for Efficient Receding-Horizon VLA Control (Hosung Lee)
  • Leveraging Single-Robot Action Models for Multi-Robot Manipulation via Learned Residual Coordination (Yunseok Han)
  • Test-Time Action Optimization over Learned Dynamics for Robust Non-Prehensile Pushing (Jeonghwa Heo)

Algorithmic Foundations for Robot Learning

  • Rollout-Free Demonstration Curation for Imitation and Offline Reinforcement Learning in Robotic Manipulation (Kunze Leonhard)

Offline robot learning often depends on demonstration datasets with mixed-quality behavior, where not all trajectories are equally useful for policy learning. This project studies whether simple rollout-free data curation can improve learning from such datasets without collecting additional online interaction. Trajectory-level scoring methods based on task reward, trajectory length, action statistics, and motion regularity are used to construct curated training subsets for robotic manipulation. The approach is evaluated on low-dimensional robomimic manipulation data using both recurrent behavior cloning and offline reinforcement learning methods, with a particular focus on Implicit Q-Learning. The experiments compare full-data training, random or metadata-based subsets, and several curated subset variants. Results show that curated datasets can substantially improve imitation learning performance, while offline reinforcement learning performance is highly sensitive to the selected data distribution and training recipe. The project highlights rollout-free curation as a practical and compute-efficient tool for robot learning, while also showing that trajectory-quality heuristics must be validated empirically rather than assumed to transfer across algorithms.
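
A minimal sketch of trajectory-level curation, assuming each trajectory is a dictionary with hypothetical `reward` and `actions` fields. The scoring rule and its 0.001 length penalty are illustrative assumptions, not the heuristics used in the project.

```python
def length_and_reward_score(traj):
    # Hypothetical heuristic: favor high task reward, mildly penalize long trajectories.
    return traj["reward"] - 0.001 * len(traj["actions"])


def curate(trajectories, score_fn, keep_fraction=0.5):
    """Return the top-scoring fraction of trajectories without running any rollouts."""
    ranked = sorted(trajectories, key=score_fn, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy demonstrations (made-up data, for illustration only).
demos = [
    {"name": "short_success", "reward": 1.0, "actions": [0] * 50},
    {"name": "long_success", "reward": 1.0, "actions": [0] * 200},
    {"name": "failure", "reward": 0.0, "actions": [0] * 80},
    {"name": "slow_failure", "reward": 0.0, "actions": [0] * 300},
]
subset = curate(demos, length_and_reward_score, keep_fraction=0.5)
```

The curated `subset` would then replace the full dataset when training the behavior cloning or offline RL learner. As the abstract cautions, a score that helps one algorithm may not help another.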

  • Adaptive Expectile Scheduling for Goal-Conditioned Offline Reinforcement Learning (Taewook Kang)

Goal-conditioned offline reinforcement learning aims to learn goal-reaching policies from pre-collected data by estimating goal-conditioned value functions. Many recent approaches rely on expectile regression for value estimation, where a single global expectile controls the emphasis placed on high-return transitions. However, a fixed expectile is overly rigid: distant state–goal pairs require conservative updates to avoid overestimation under sparse successes, while near-goal pairs benefit from more exploitative learning. We propose Adaptive Expectile Scheduling (Adaptile), which replaces the fixed expectile with a state–goal–dependent expectile computed from an estimated temporal distance between states and goals. Adaptile is a drop-in modification to the value regression step and can be implemented via a heuristic schedule or a lightweight learned scheduler. On long-horizon tasks in OGBench, Adaptile outperforms existing baselines, improving average success rates by up to 13 percentage points and accelerating training.
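
A minimal sketch of how a distance-dependent expectile could enter the value regression step. The linear schedule and the constants `tau_near`, `tau_far`, and `scale` are hypothetical choices for illustration. The abstract does not specify Adaptile's actual schedule.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2, with diff = target - V."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def scheduled_expectile(temporal_distance, tau_near=0.9, tau_far=0.6, scale=50.0):
    """Hypothetical heuristic: interpolate linearly from tau_near (near the goal,
    more exploitative) to tau_far (far from the goal, more conservative)."""
    frac = min(max(temporal_distance / scale, 0.0), 1.0)
    return tau_near + frac * (tau_far - tau_near)


def adaptive_value_loss(target, value, temporal_distance):
    # The only change from fixed-expectile regression: tau depends on the
    # estimated temporal distance between the state and the goal.
    tau = scheduled_expectile(temporal_distance)
    return expectile_loss(target - value, tau)
```

With these assumed constants, a positive value error is weighted by 0.9 for a near-goal pair and by 0.6 for a distant one, so optimistic targets pull the value estimate up less aggressively far from the goal.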

  • Latent Predictive Representation Learning for Robot Control: An Empirical Study on LIBERO (Woohyeon Park)

Self-supervised visual pretraining is a natural way to reduce the amount of labeled demonstration data needed for robot manipulation, but the best pretraining signal for control remains unclear. This paper studies whether action-conditioned latent prediction, implemented in the style of a Joint Embedding Predictive Architecture (JEPA), provides a better initialization for behavior cloning than pixel reconstruction or contrastive pretraining. Using the LIBERO-Spatial benchmark, we compare four transfer settings: training from scratch, masked pixel reconstruction, contrastive temporal prediction, and JEPA-style latent prediction. The main result is negative: after fine-tuning, latent prediction reaches only 0.115 mean success rate, compared with 0.300 for contrastive pretraining, 0.265 for pixel reconstruction, and 0.135 for training from scratch. Frozen-transfer performance is even more revealing: the JEPA encoder obtains 0.005 mean success rate, indicating that the learned representation is not directly usable by a policy head. We further evaluate loss, target-momentum, horizon, variance-regularization, and hybrid reconstruction ablations, plus a data-scale study with 10, 25, and 50 demonstrations per task. The results show that short-horizon latent prediction improves the JEPA family to 0.210 success, but still does not match the strongest baselines; pixel reconstruction is more reliable at 10 and 25 demonstrations, while contrastive pretraining benefits most from the full 50-demonstration regime. Overall, low self-supervised latent prediction loss does not guarantee control-relevant state representations in this small-data setting, motivating stronger anti-collapse constraints and denser auxiliary supervision for latent-predictive robot pretraining.

  • Learning Composable Effect-Conditioned Skills via Successor Feature Control (Hochan Bang)

Unsupervised skill discovery aims to acquire reusable behaviors without task-specific rewards, but many existing methods encourage diversity primarily through distinguishable state visitation patterns. While effective for exploration, such skills are not necessarily composable or aligned with downstream control objectives, especially in continuous robotic domains where useful behaviors often correspond to controllable effects rather than isolated states. We propose an effect-conditioned successor feature framework for learning composable skills in continuous control. Instead of using an arbitrary latent skill variable, we condition a universal low-level policy on a semantic weight vector (w), where each dimension corresponds to a reward-relevant effect feature. The policy is trained to maximize the scalarized successor feature objective induced by (w), encouraging each skill direction to control a distinct future effect while allowing mixed weights to represent compatible objective compositions. This formulation connects unsupervised skill learning with successor feature-based multi-objective control: (w) serves both as a task descriptor and as a skill-conditioning variable. We further discuss regularization strategies for preserving axis-wise controllability and mixed-weight co-activation, preventing the learned policy from collapsing into disconnected latent modes. Experiments in continuous control environments evaluate whether the proposed method learns skills that are not only diverse, but also reusable and composable under novel objective combinations. The proposed framework provides a step toward task-agnostic pretraining of control primitives that can be recombined through interpretable effect-level objectives.
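
For reference, the scalarized objective can be written in standard successor-feature notation. The symbols below (effect features φ, successor features ψ, discount γ) follow the usual textbook definitions and are assumptions here. The abstract does not give the exact form used in the project.

```latex
r_w(s_t) = w^{\top} \phi(s_t), \qquad
\psi^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t) \,\middle|\, s_0 = s,\ a_0 = a\right], \qquad
Q^{\pi}_{w}(s, a) = w^{\top} \psi^{\pi}(s, a)
```

Under this reading, a one-hot (w) asks the policy to control a single effect dimension, while a mixed (w) asks for a weighted combination of effects.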

  • Model-Agnostic Sim-to-Real Adaptation via Trajectory Alignment with Iterative Sim-Real Refinement (Hyeondal Son)

Deploying simulation-trained robot policies to physical hardware is impeded by dynamics mismatches in contact, friction, and actuation. Residual reinforcement learning addresses this by adding corrective actions to a frozen base policy, but additive corrections in the ambient action space are insufficient to capture the nonlinear geometric discrepancy between the joint state-action trajectory manifolds of simulation and reality. We propose a model-agnostic framework that learns geometric mappings between these manifolds without modifying the base policy or requiring real-world reward signals. Three lightweight networks are trained externally: a State Steerer aligns observation spaces via a dynamics consistency objective; a state-conditioned Forward Adapter translates simulation action chunks to real-world counterparts using a Gromov-Wasserstein loss over joint state-action trajectory distributions; and a state-conditioned Backward Projector, trained with simulator rewards evaluated on back-projected real trajectories, enables automatic quality assessment of real-world behavior without reward instrumentation. Exploration in the real environment is confined to the tangent space of the estimated real-world trajectory manifold to reduce physical constraint violations. The framework is evaluated on Sim-to-Sim transfer between MuJoCo MJX and Isaac Sim on tabletop manipulation and humanoid locomotion and manipulation tasks with controlled dynamics gap magnitudes, and compared against residual reinforcement learning baselines on task success rate and trajectory alignment error.

  • Flow Matching Policy with Parametric Curve Actions (Minchang Song)

This proposal presents a flow matching policy that represents robot action chunks as smooth parametric curves rather than discrete-time sequences of target actions. While recent diffusion- and flow-based visuomotor policies have shown strong performance by predicting short future action chunks, their discrete action representation can produce discontinuities between consecutive chunks and offer limited robustness when the robot deviates from demonstrated states. To address these limitations, the proposed method learns a conditional generative policy over curve parameters, allowing each predicted action to define a continuous trajectory in Euclidean space or SE(3). By separating boundary conditions from shape parameters, the policy can enforce smoother transitions, preserve the geometric path under temporal rescaling, and generate recovery-oriented actions from perturbed states. The method first converts demonstrated action chunks into parametric curve representations through boundary extraction and residual fitting, then trains a conditional flow matching model to generate curve parameters from robot states and observations. At execution time, the generated curve can be sampled at the controller frequency, connected to neighboring chunks through via-point modulation, and executed at different speeds without altering the underlying path. The proposed approach will be evaluated through controlled synthetic tasks and the LIBERO simulation benchmark. The expected contribution is a structured action representation for generative visuomotor policies that maintains task success and multimodal behavior while improving smoothness and robustness to perturbations.
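
A minimal sketch of a curve-valued action chunk in Euclidean space. A cubic Bezier curve is used here purely as an assumed stand-in for the curve family: the two endpoints play the role of boundary conditions and the two interior control points play the role of shape parameters. The proposal does not name its parameterization, and the SE(3) case is not covered.

```python
def bezier_point(start, end, shape, s):
    """Evaluate a cubic Bezier curve at phase s in [0, 1].

    start, end: boundary conditions (tuples of floats).
    shape: two interior control points (c1, c2) acting as shape parameters.
    """
    c1, c2 = shape
    b0 = (1 - s) ** 3
    b1 = 3 * (1 - s) ** 2 * s
    b2 = 3 * (1 - s) * s ** 2
    b3 = s ** 3
    return tuple(
        b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3
        for p0, p1, p2, p3 in zip(start, c1, c2, end)
    )


def sample_chunk(start, end, shape, n_steps):
    """Sample the curve at n_steps >= 2 evenly spaced phases (controller ticks)."""
    return [bezier_point(start, end, shape, i / (n_steps - 1)) for i in range(n_steps)]


start, end = (0.0, 0.0), (1.0, 0.0)
shape = ((0.25, 0.5), (0.75, 0.5))
fast = sample_chunk(start, end, shape, 5)  # fewer ticks: faster execution
slow = sample_chunk(start, end, shape, 9)  # more ticks: slower execution
```

Both samplings lie on the same geometric path. Only the number of controller ticks changes, which is the sense in which a curve representation can be executed at different speeds without altering the path.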

  • Target Bridge Hindsight Relabeling for Sparse Goal Conditioned Reinforcement Learning (Seoyeon Ahn)

Sparse reward goal conditioned reinforcement learning is difficult because agents receive useful reward feedback only after reaching the desired goal. Hindsight Experience Replay (HER) mitigates this issue by relabeling failed trajectories with goals that were achieved later in the same trajectory, but standard HER typically samples such hindsight goals without considering their relevance to the original task. This project studies Target Bridge Hindsight Relabeling (TBH), a relabeling strategy that prioritizes future achieved goals that can serve as useful intermediate targets for reaching the original desired goal. For each sampled transition, TBH scores future achieved goals using two criteria: progress toward the commanded target and temporal reachability from the current state. The progress term favors goals that better align with the original task, while the temporal term favors goals that can serve as practical bridges rather than goals that are either too immediate or too far along the trajectory. The method modifies only the HER replay buffer goal sampling rule and keeps the base off policy reinforcement learning algorithm unchanged. Experiments on sparse PointMaze environments compare vanilla HER, Progress HER (PH), which uses target progress alone, and TBH, which combines target progress with a reachability based preference. The project also considers a learned temporal geometry extension inspired by Hilbert representations, where Euclidean target progress can be replaced by a reachability aware distance estimate. The study examines whether target aware and reachability calibrated hindsight relabeling improves sample efficiency, and how the mixture ratio and distance metric of target biased relabeling affect the balance between directed learning and replay diversity.
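
A minimal sketch of how future achieved goals might be scored for relabeling. The product of clipped Euclidean progress and a Gaussian preference over the time gap is an assumed form, and `preferred_gap` and `width` are hypothetical parameters. The abstract states only that progress and temporal reachability are combined.

```python
import math


def bridge_scores(achieved_goals, t, desired_goal, preferred_gap=10.0, width=5.0):
    """Score each future achieved goal as a candidate hindsight goal for step t.

    Returns a list of (future_index, score) pairs.
    """
    base = math.dist(achieved_goals[t], desired_goal)
    scores = []
    for k in range(t + 1, len(achieved_goals)):
        # Progress term: how much closer this future goal is to the commanded target.
        progress = max(base - math.dist(achieved_goals[k], desired_goal), 0.0)
        # Temporal term: prefer goals neither too immediate nor too far along.
        gap = k - t
        reach = math.exp(-((gap - preferred_gap) ** 2) / (2.0 * width ** 2))
        scores.append((k, progress * reach))
    return scores


# Toy trajectory moving straight toward the commanded goal (made-up data).
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
scores = bridge_scores(trajectory, 0, (4.0, 0.0), preferred_gap=2.0, width=1.0)
best_index = max(scores, key=lambda pair: pair[1])[0]
```

In this toy case the final state makes the most progress, but the temporal term shifts the preference to an intermediate state. A replay buffer could sample hindsight goals in proportion to these scores and mix them with uniform HER sampling to retain diversity.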

  • Evaluating Geometric Horizon Models for Image-based Offline Reinforcement Learning (Jaeseok Yang)

World models for offline reinforcement learning suffer from compounding errors when long-horizon futures are predicted through repeated one-step rollouts. Geometric Horizon Models (GHM) address this problem by directly sampling discounted future states, but applying them to image-based environments requires handling high-dimensional observations, leading to substantial computational cost. To make this tractable, we train a TD-Flow-based GHM in compact latent spaces extracted from pretrained visual encoders and evaluate its integration with ReBRAC for image-based offline reinforcement learning. Although the trained GHM generates visually plausible future states, its predictions become blurry and inaccurate for objects with diverse dynamics, such as robotic arms. Consequently, GHM-augmented ReBRAC performs substantially worse than baseline ReBRAC without GHM, suggesting that inaccurate sampled future states produce unreliable critic targets and destabilize actor-critic learning. Motivated by this limitation, we further evaluate an existing Universal Horizon Model (UHM) framework for controllable N-step future-state sampling. UHM-generated samples show clearer future states with reduced blurriness across specified horizons, indicating that horizon-conditioned future prediction may provide more reliable critic targets for future image-based offline RL pipelines.

  • Leveraged GAIL (L-GAIL): Adversarial Imitation from Positive and Negative Demonstrations via a Constrained MDP (Younghwan Lee)

Imitation learning (IL) typically assumes access only to expert demonstrations, discarding the abundant non-expert and failure data that arise naturally in human teleoperation. We propose Leveraged GAIL (L-GAIL), a deep adversarial imitation framework that exploits both positive and negative demonstrations. Building on the leverage concept of Leveraged Inverse Reinforcement Learning (LGPIRL) — where positive demonstrations encode "what to do" and negative ones "what not to do" — L-GAIL reinterprets negative demonstrations as a soft safety cost and casts imitation as a constrained Markov decision process (CMDP). A primary discriminator supplies GAIL's imitation reward, while a second learned discriminator defines an avoidance cost that is held below a leverage budget d and enforced with a PID-Lagrangian. Unlike prior mixed-quality approaches that rely on fixed state-distance heuristics and reward shaping, L-GAIL learns the negative signal in a scalable, state–action form and admits a principled reduction to vanilla GAIL when negative data are absent. We hypothesize that the induced saddle point minimizes a signed Jensen–Shannon divergence. We will evaluate L-GAIL on Safety-Gymnasium and MuJoCo against BC, GAIL, AIRL, and MixGAIL, measuring task return, sample efficiency, and constraint-violation rate.
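
A minimal sketch of a PID-controlled Lagrange multiplier for holding the avoidance cost below the budget d. The gains, the class name, and the reward-shaping function are hypothetical, and the sketch follows the general PID-Lagrangian idea rather than L-GAIL's specific implementation.

```python
class PIDLagrangian:
    """Adjust a non-negative multiplier from the gap between measured cost and budget."""

    def __init__(self, budget, kp=0.1, ki=0.01, kd=0.0):
        self.budget = budget
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost = 0.0

    def update(self, avg_cost):
        error = avg_cost - self.budget
        # Integral term is clipped at zero so it cannot wind up negatively.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to increasing cost.
        derivative = max(0.0, avg_cost - self.prev_cost)
        self.prev_cost = avg_cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)


def shaped_reward(imitation_reward, avoidance_cost, multiplier):
    # Imitation reward from the primary discriminator, penalized by the
    # avoidance cost from the second discriminator.
    return imitation_reward - multiplier * avoidance_cost


controller = PIDLagrangian(budget=0.1)
lam_high = controller.update(0.5)  # cost above budget: multiplier becomes positive
lam_low = controller.update(0.0)   # cost below budget: multiplier relaxes to zero
```

When no negative demonstrations are available, the avoidance cost is zero and the shaped reward is just the imitation reward, which matches the reduction to vanilla GAIL described in the abstract.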

  • Counterfactual Near-Miss Risk Augmentation for Safe Reinforcement Learning (Seoyoung Lim)

Safe reinforcement learning is important for deploying autonomous agents in real-world robotic and navigation systems, where unsafe exploration can cause physical damage or constraint violations. Existing methods often rely on cost functions to guide policy learning, but these costs are commonly represented as sparse binary indicators that become informative only after unsafe events occur. As a result, agents may need to experience unsafe interactions before learning meaningful safety-aware behavior. To address this limitation, we propose a counterfactual near-miss risk augmentation method. The proposed method focuses on time steps near potential safety violations, restores the corresponding simulator states, applies local perturbations around the policy action, and evaluates short-horizon rollout outcomes to estimate whether nearby actions would have approached or caused constraint violations. This provides an action-conditioned safety signal for critical states where the original binary cost alone may be too sparse or delayed. The estimated risk is incorporated into learning as an auxiliary safety signal, either by adding a reward penalty for risky decisions or by augmenting the cost used in constrained policy optimization. We examine both variants on a goal-reaching navigation task, compare them with unconstrained and Lagrangian baselines, and analyze their effects on the safety-return trade-off.
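
A minimal sketch of the near-miss risk estimate, assuming a scalar action and hypothetical `step_fn` and `is_unsafe` callables that stand in for simulator state restoration and the constraint check. Holding the perturbed action fixed over the short horizon is a simplification made here.

```python
import random


def near_miss_risk(state, action, step_fn, is_unsafe,
                   n_samples=16, horizon=5, noise=0.1, seed=0):
    """Fraction of locally perturbed actions whose short rollout hits a violation."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_samples):
        s = state  # "restore" the simulator state before each counterfactual rollout
        a = action + rng.gauss(0.0, noise)  # local perturbation around the policy action
        for _ in range(horizon):
            s = step_fn(s, a)
            if is_unsafe(s):
                violations += 1
                break
    return violations / n_samples


# Toy 1-D world (made up): position advances by the action, and x >= 1 is unsafe.
def toy_step(x, a):
    return x + a


def toy_unsafe(x):
    return x >= 1.0


risk_far = near_miss_risk(0.0, 0.0, toy_step, toy_unsafe, noise=0.01)
risk_near = near_miss_risk(0.9, 0.5, toy_step, toy_unsafe, noise=0.01)
```

The estimated risk could then be subtracted from the reward with a penalty weight, or added to the cost passed to a constrained optimizer, corresponding to the two variants named in the abstract.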

Learning for Robotics & AI Systems

  • Cost-Effective Adaptive Curriculum Learning for Distributional RLVR Training of Vision Language Models (Dohyung Kim)

Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but often lead to reduced generation diversity. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. PACED-RL further enhances the training dynamics by showing that the partition function learned during GFlowNet training can be used for more than normalization. Specifically, the learned partition function can be interpreted as a per-prompt expected-reward estimate, providing an online signal of prompt difficulty or accuracy. This signal enables training to prioritize prompts near the frontier of learnability, where the model is neither consistently correct nor consistently incorrect. In addition, PACED-RL uses the error in these accuracy estimates to guide prioritized replay, allowing training to revisit prompts whose predicted difficulty is poorly calibrated. By reusing information already produced during GFlowNet optimization, PACED-RL improves sample efficiency without requiring additional rollout-time supervision or separate reward estimation models. In this work, we examine whether the PACED-RL framework can be extended to the multi-modal reasoning domain using VLMs. Empirical results indicate that this is a promising direction toward a more sample-efficient post-training framework for VLMs.

  • Goal-Conditioned Reinforcement Learning for Planar Visual Servoing: A Compressed State-Space Proxy Study (Byungju Kim)

Image-Based Visual Servoing (IBVS) steers a camera toward a desired view through the analytic image Jacobian, but is known to stall in local minima, lose features under large displacements, and exhibit the camera-retreat problem under large rotations. Learned pixel-to-velocity policies remove the Jacobian, yet are typically trained for a single fixed goal view and do not transfer to new targets without retraining. We ask whether goal conditioning removes this limitation. As a compressed proxy for a goal-image conditioned policy, we cast planar visual servoing as an SE(2) pose-reaching Markov decision process: the agent emits a continuous body-frame twist (v_x, v_y, ω_z) to align its pose with a goal pose, observing pose features rather than 64×64 image pairs. We train a Soft Actor-Critic (SAC) policy that conditions on the goal and compare it against (i) an analytic proportional / IBVS-like controller and (ii) a fixed-goal SAC agent that shares the architecture but never observes the goal. On 200 randomly sampled start–goal pairs, the goal-conditioned policy attains 90% success versus 3% for the fixed-goal baseline, and retains 72% on a held-out out-of-distribution rotation band — indicating that goal conditioning, not added capacity, drives generalization. We present this state-space result as a controlled stepping stone toward the full image-based system in the original proposal.
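
A minimal sketch of the SE(2) pose-reaching dynamics together with a simple proportional controller of the kind used as an analytic baseline. Forward-Euler integration, the gain, and the time step are assumptions made here. The abstract does not state the environment's integrator or the controller's exact form.

```python
import math


def wrap_angle(theta):
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def step_pose(pose, twist, dt):
    """Advance an SE(2) pose (x, y, theta) by a body-frame twist (v_x, v_y, w_z)."""
    x, y, theta = pose
    vx, vy, wz = twist
    # Rotate the body-frame linear velocity into the world frame.
    x += (vx * math.cos(theta) - vy * math.sin(theta)) * dt
    y += (vx * math.sin(theta) + vy * math.cos(theta)) * dt
    return (x, y, wrap_angle(theta + wz * dt))


def proportional_twist(pose, goal, gain=1.0):
    """Body-frame twist proportional to the pose error relative to the goal."""
    x, y, theta = pose
    gx, gy, gtheta = goal
    dx, dy = gx - x, gy - y
    # Express the world-frame position error in the body frame.
    vx = gain * (math.cos(theta) * dx + math.sin(theta) * dy)
    vy = gain * (-math.sin(theta) * dx + math.cos(theta) * dy)
    wz = gain * wrap_angle(gtheta - theta)
    return (vx, vy, wz)


pose, goal = (0.0, 0.0, 0.0), (1.0, -0.5, 2.0)
for _ in range(200):
    pose = step_pose(pose, proportional_twist(pose, goal), dt=0.1)
```

In this obstacle-free, unconstrained setting the proportional controller converges. A goal-conditioned policy would take the same (pose, goal) input in place of `proportional_twist`.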

  • Action-Conditioned 3D Gaussian World Models via Point-Cloud Flow Prediction (Jooyoung Kim)

3D scene understanding is crucial for robots to perceive, reason, and act in unstructured environments. However, robots observe the 3D world through 2D camera projections, making it difficult to infer dynamic 3D structure during interaction. Recent advances in Gaussian splatting have enabled high-quality 3D and 4D scene reconstruction from visual observations, but existing 4D Gaussian world models typically assume dense static views or smoothly moving cameras, which do not match common robotic operation settings with sparse, weakly overlapping cameras. To address this problem, we propose an action-conditioned 3D Gaussian world model grounded in point-cloud flow prediction. We first fine-tune a pretrained point-cloud world model, PointWorld, to predict future scene point clouds conditioned on robot actions. We then anchor Gaussian centers to the predicted point cloud and optimize the representation through image reconstruction losses using differentiable Gaussian splatting. This design combines the structural grounding of point-cloud prediction with the renderability of Gaussian splatting, enabling explicit 3D prediction of scene dynamics while allowing the predicted representation to be rasterized back into 2D camera views for image-level supervision. We evaluate our method on object interaction data collected with a UR5 robot under a five-static-camera setup. We assess reconstruction quality using standard novel-view synthesis metrics such as PSNR, SSIM, and LPIPS, and further evaluate future video prediction quality using FVD, frame-level FID, and temporal consistency measures. Our results show that this approach improves action-conditioned dynamic scene reconstruction under sparse robotic camera setups, enabling robots to predict future 3D scene changes in a more interpretable and visually grounded manner.

  • One-Shot VLA Adaptation under Environmental Shifts via Weight-Space Analogy Operations (Taeheon Kim)

Vision-Language-Action (VLA) models often fail to perform the same learned tasks under environmental shifts, such as changes in camera pose and shifts to a different but similar robot (e.g., from Panda to UR5e). Adapting these models to the shifted environment (i.e., target domain) often requires training on multiple demonstrations for each task, which are costly to collect. To reduce the burden of data curation and training, we propose a simple analogy-based method that adapts VLA models under environmental shifts through weight vector arithmetic with domain-specific information addition, named Domain ARiThmetic (DART). Unlike prior approaches, DART requires collecting only a single demonstration, enabling efficient adaptation. To accurately isolate domain-specific information for addition, DART performs subspace alignment between singular components in weight vectors to filter out noisy components. In both simulated and real-world experiments, DART outperforms existing VLA adaptation methods in one-shot scenarios across diverse visual and embodiment shifts.

  • CarryMimic: Stable Whole-Body Humanoid Object Carrying via Force-Closure Reward (Kyungrok Rho)

Can a humanoid carry objects dynamically without dropping them? Imitation-based methods have made rapid progress in whole-body humanoid control, recently extending it to object carrying. But this extension relies on (1) manual contact annotation of reference motions, (2) additional sensors, or (3) multi-stage training, since kinematic tracking alone provides little signal on how to stably grasp the object. We introduce CarryMimic, which augments standard motion imitation with a single reward term grounded in force-closure theory: the classical criterion for whether a contact configuration can resist arbitrary external wrenches. The force-closure reward alone raises the carrying success rate from 39.8% to 62.3% in simulation and enables highly agile real-world behaviors, including running with a box held overhead.

  • Active Deep Descriptor and Keypoint Seeding for Local-Trajectory-Aware Visual Odometry (Wisoo Song)

Deep keypoint detectors such as XFeat produce richer and more robust feature correspondences than classical pipelines, yet visual odometry systems continue to extract keypoints passively. This strategy wastes computation on features that will exit the field of view before contributing a geometric constraint or lie on image regions with unstable appearance. We propose a learned active seeding policy that modulates the XFeat keypoint confidence heatmap with a low-resolution spatial grid mask and simultaneously adapts the Non-Maximum Suppression (NMS) threshold τ, concentrating keypoint extraction on image regions that remain co-visible and pose-invariant across the local trajectory. The policy is formulated as a Markov decision process (MDP) whose state encodes XFeat’s keypoint confidence and reliability maps, a crude local trajectory, the current tracker occupancy, per-keypoint stereo disparity, and the local bundle-adjustment pose residual. Leveraging expert scoremaps, we train the policy via Diffusion-Reward Adversarial Imitation Learning (DRAIL), which replaces hand-crafted parameters related to feature selection; the Actor network of the resulting Actor-Critic provides dynamic adaptation of both the spatial mask and the NMS threshold. We benchmark our system (XFeat+DRAIL+RTAB-Map) against a passive XFeat+RTAB-Map baseline on the EuRoC MAV dataset, measuring ATE and RPE to isolate the gain from imitation learning.

  • Context-Aware Physics-Informed World Models for Non-Stationary Robotic Manipulation (Sungkwon On)

Real-world robotic manipulation is inherently non-stationary: hidden factors such as object mass, friction, contact compliance, and actuator behavior can change the resulting dynamics even under identical observable states and actions. Standard world models that learn a fixed transition function tend to absorb these variations as noise, limiting their reliability under hidden context shifts. This project studies how latent context inference and physics-structured dynamics can improve world-model prediction for the Franka Panda Lift task in NVIDIA Isaac Lab. The study compares a family of world models along two complementary axes. The first axis concerns context representation, ranging from a no-context MLP baseline to VAE-based latent context models using FiLM modulation or input concatenation, as well as a structured Context-DeLaN model that infers a latent context from state-action history. The second axis concerns physics structure, ranging from black-box MLP dynamics to DeLaN-based models with learned positive-definite inertia, Coriolis, and gravity terms. In addition, the project introduces an extension of DeLaN with explicit joint damping, smoothed Coulomb friction, and a residual dynamics head designed to capture actuator- and discretization-induced modeling bias. All variants are trained on PPO-policy rollouts collected under randomized object and robot physical parameters. They are evaluated using multi-step joint-state prediction error and end-effector/object-pose rollout error, with separate analysis before and after contact. By comparing context mechanisms and physics structures under the same manipulation setting, this project aims to clarify which design ingredients best support reliable context-dependent world modeling for contact-rich robotic manipulation.
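
A minimal sketch of a per-joint dissipative torque combining viscous damping with smoothed Coulomb friction. The tanh smoothing and the parameter names are assumptions, chosen because tanh is a common differentiable replacement for the sign function. The abstract does not state which smoothing the project uses.

```python
import math


def friction_torque(qdot, damping, coulomb, eps=0.01):
    """Viscous damping plus Coulomb friction with a smooth transition through zero.

    qdot: joint velocity; damping: viscous coefficient; coulomb: Coulomb level;
    eps: velocity scale of the smoothing (smaller means closer to sign(qdot)).
    """
    return damping * qdot + coulomb * math.tanh(qdot / eps)


tau_rest = friction_torque(0.0, damping=0.5, coulomb=0.2)
tau_fast = friction_torque(1.0, damping=0.5, coulomb=0.2)
```

In a DeLaN-style model this term would be subtracted from the commanded torque before solving for acceleration, with `damping` and `coulomb` learned and constrained to be non-negative.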

  • Beyond Accuracy: Diagnosing Reward-Induced Behavioral Changes in Mathematical Reasoning (Yongha Lee)

Reinforcement learning is increasingly used to improve large language models' reasoning capabilities, but it is usually evaluated by final-answer accuracy, which does not reveal what a reward has reinforced. We study this question in mathematical reasoning by training Qwen2.5-Math-1.5B under a controlled setup: the prompt distribution, output format, RL algorithm, rollout budget, and training size remain fixed while only the reward is varied. We construct the RL training dataset from the MATH training split by converting source problems into normal solving, verification, error-detection, and correction tasks, exposing all reward conditions to the same range of reasoning behaviors and contexts. We compare outcome-only reward with rewards for reasoning-action attempts, metacognitive form, and superficial reasoning features, then decompose the surface effect through length, step-marker, math-line, and grounding ablations. The central finding is that reward-induced gains decompose into separable behavioral components. On a controlled evaluation, superficial reward nearly matches outcome-only reward. On MATH-500, superficial reward outperforms outcome-only reward, reaching 0.364 versus 0.344, while surface-length and reasoning-action rewards improve further to 0.380 and 0.372. Object-math-line reward gives the strongest transfer to GSM8K. These gains are not explained by generic verbosity or step formatting alone. Win/loss decompositions show that successful changes add problem-grounded mathematical work and final-derivation structure, whereas unsuccessful changes often add work that is less connected to the problem or final answer. 
Expanded computation also does not reliably improve validity judgment: on MATH-500-derived validity traps, where polished reasoning traces may contain hidden errors, superficial reward adds nearly 99 tokens on average without improving accuracy, indicating that models remain biased toward accepting polished valid reasoning rather than rejecting polished invalid reasoning. Metacognitive-form reward performs best on this small trap set, which suggests a possible effect on specific aspects of math reasoning like validity judgment, although more evidence is needed. Overall, small-model math RL can benefit from superficially specified rewards when they induce grounded derivational work, but ordinary problem solving and robust verification of external reasoning traces remain distinct behavioral capabilities.

  • Proprioceptive Motion Memory for Efficient Receding-Horizon VLA Control (Hosung Lee)

Vision-language-action policies have emerged as a promising framework for robot manipulation, and recent action-chunking variants generate continuous action sequences from visual observations, language instructions, and proprioceptive state. However, action chunk execution introduces a fundamental trade-off: executing long chunks improves inference efficiency but reduces feedback and allows execution errors to accumulate, while frequent replanning improves closed-loop responsiveness at higher computational cost. In this work, we study how proprioceptive history can make this trade-off more favorable for action-chunk-generating VLA policies, using SmolVLA as a representative model. We propose a proprioceptive motion memory that encodes recent robot-state history into a compact motion-aware latent representation and conditions the policy during short-horizon action execution. Across LIBERO suites, we find that intermediate-cadence replanning substantially improves success over long open-loop execution, and that low-compute flow inference further improves the efficiency-success trade-off. Under this practical setting, proprioceptive motion memory improves policy stability over standard state conditioning, achieving the best full-suite success while maintaining efficient action throughput. The gains are especially pronounced on harder long-horizon tasks and under simulated proprioceptive delay and noise corruptions. These results suggest that proprioceptive history is most useful not simply as additional state input, but as a lightweight motion memory that stabilizes efficient receding-horizon VLA control.
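The chunk-length trade-off described in this abstract can be illustrated with a minimal sketch. It is not the SmolVLA pipeline or the proposed motion memory: `toy_chunk_policy`, `toy_step`, `DRIFT`, and `run_receding_horizon` are hypothetical names for a 1-D toy problem in which the policy plans a chunk under a drift-free model while the environment adds a small unmodeled drift at every step.

```python
import numpy as np

DRIFT = 0.05  # hypothetical unmodeled disturbance added at every step


def toy_chunk_policy(x, chunk_len=10):
    """Plan a chunk assuming x_{k+1} = x_k + a_k with a_k = -0.5 * x_k (no drift)."""
    return -0.5 * x * 0.5 ** np.arange(chunk_len)


def toy_step(x, action):
    """True toy dynamics: the planned motion plus the unmodeled drift."""
    return x + action + DRIFT


def run_receding_horizon(policy, step, x0, replan_every, total_steps):
    """Execute only the first `replan_every` actions of each chunk, then replan."""
    if replan_every < 1:
        raise ValueError("replan_every must be at least 1")
    x, t, policy_calls = x0, 0, 0
    while t < total_steps:
        chunk = policy(x)  # one (expensive) policy inference
        policy_calls += 1
        for action in chunk[:replan_every]:
            if t >= total_steps:
                break
            x = step(x, action)
            t += 1
    return x, policy_calls


if __name__ == "__main__":
    for k in (10, 5, 1):
        x_final, calls = run_receding_horizon(toy_chunk_policy, toy_step, 1.0, k, 40)
        print(f"replan_every={k:2d}  policy_calls={calls:2d}  |x_final|={abs(x_final):.3f}")
```

In this toy setting, shorter execution horizons cost more policy calls but leave a smaller residual error, because each replan sees the drift accumulated so far. An intermediate cadence sits between the two extremes on both axes.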

  • Leveraging Single-Robot Action Models for Multi-Robot Manipulation via Learned Residual Coordination (Yunseok Han)

Many manipulation tasks (cooperative transport, bimanual assembly, handover) require several robots to act as a coupled system, yet most learned manipulation policies remain single-robot. We ask whether a frozen single-robot policy $\pi_\theta$ can be reused per robot while a small learned module supplies only the missing coordination. Our contribution is a falsifiable use-vs-replace diagnostic suite: semantic base controls, a grasp-matrix wrench decomposition, and a base-ablation/perturbation sweep. In a controlled cooperative planar-transport testbed (a center-of-mass force is underactuated for planar pose, so rotation is the intrinsically multi-robot component), the suite gives a clear result: the meaningful frozen base is genuinely reused. The executed policy keeps $\pi_\theta$'s translation (wrench cosine 0.96) while the residual supplies 91% of the rotation, yielding more reliable, sample-efficient learning than from-scratch, random, fine-tuned, and misleading bases. We report the strongest comparator explicitly: a warm-started joint-observation monolith matches reliability and final success, so the residual's distinct value is modularity, an interpretable decomposition, and sample efficiency. The verdict is invariant to the prior's policy class (MLP, a diffusion policy, and a broadly trained domain-randomized prior all show the same signature). Across progressively less idealized stress tests, namely Coulomb friction and force limits, a 3-D SE(3) task under gravity (a genuine non-planar closed-loop success, gap 0.01 → 0.87), arm-level Jacobian/torque feasibility, and heterogeneous teams, the same reuse signature persists, with an analytic minimum-norm allocation layer that composes with the residual to substantially reduce internal (squeeze) force.
The contribution is the measurement methodology and the finding that, where a competence/coordination split is diagnosable, single-robot models are a useful, interpretable prior whose reuse can be verified rather than assumed.
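The grasp-matrix wrench decomposition and minimum-norm allocation named in this abstract can be sketched for the planar case. This is the textbook construction, not necessarily the poster's exact formulation: the function names `planar_grasp_matrix` and `split_forces` are hypothetical, and point contacts with forces expressed in the object frame are assumed.

```python
import numpy as np


def planar_grasp_matrix(contact_points):
    """Build the 3 x 2n map from stacked contact forces to the wrench (fx, fy, tau).

    contact_points: iterable of (rx, ry) positions relative to the object's
    center of mass (assumed frame for this sketch).
    """
    blocks = [np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [-ry, rx]])  # torque row: tau = rx * fy - ry * fx
              for rx, ry in contact_points]
    return np.hstack(blocks)


def split_forces(G, forces):
    """Split stacked forces into a wrench-producing part and an internal part."""
    wrench = G @ forces
    min_norm = np.linalg.pinv(G) @ wrench  # smallest-norm forces with the same wrench
    internal = forces - min_norm           # lies in null(G): zero net wrench (squeeze)
    return wrench, min_norm, internal


if __name__ == "__main__":
    # Two robots on opposite sides of the object, both pushing along x.
    G = planar_grasp_matrix([(-1.0, 0.0), (1.0, 0.0)])
    wrench, min_norm, internal = split_forces(G, np.array([3.0, 0.0, -1.0, 0.0]))
    print("net wrench:     ", wrench)
    print("min-norm forces:", min_norm)
    print("squeeze forces: ", internal)
```

In the example, the two robots push against each other with unequal force: the net wrench is a pure translation, and the remainder is an equal-and-opposite pair that only squeezes the object. Replacing the commanded forces with the minimum-norm solution keeps the wrench and removes that squeeze component.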

  • Test-Time Action Optimization over Learned Dynamics for Robust Non-Prehensile Pushing (Jeonghwa Heo)

Non-prehensile pushing of a heavy puck along a prescribed curved path is a contact-rich control problem whose dynamics—governed by uncertain, randomized friction—are difficult to model analytically. We investigate how reinforcement learning can achieve robust pushing under domain-randomized contact parameters. Beginning from a classical geometric prior—a contact-pushing heuristic that positions the end-effector behind the puck along a composite straight-line-plus-quartic-Bézier reference—we compare two approaches to augmenting this prior with learning. A model-free residual Soft Actor-Critic learns an entropy-regularized correction in a single forward pass, whereas our approach additionally learns a dynamics model and a reward model alongside the twin-critic and performs test-time action optimization: at every control step it refines the heuristic action via a short, trust-region-bounded gradient ascent on a one-step value objective evaluated through the learned dynamics. Sharing an identical off-policy actor-critic backbone, the two methods isolate the effect of replacing a compiled reactive policy with inference-time optimization. On a GPU-parallelized Isaac Lab simulator, both learned methods substantially outperform the open-loop heuristic in-distribution, and test-time optimization exhibits smaller performance degradation under out-of-distribution friction in a sweep evaluation, since it recomputes actions from the observed state rather than evaluating a fixed state-to-action mapping. These results situate the method within the model-based, planning-augmented reinforcement learning family and characterize the conditions under which inference-time optimization is warranted.
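The test-time refinement step described in this abstract (a short, trust-region-bounded gradient ascent that starts from the heuristic action) can be sketched as follows. Several pieces are assumptions for illustration: the objective form r(s, a) + γ V(f(s, a)), the box-shaped trust region, the finite-difference gradient (an implementation with learned networks would more likely use automatic differentiation), and all names, including `toy_objective`, `refine_action`, and `GOAL`.

```python
import numpy as np

GOAL = np.array([1.0, 0.0])  # hypothetical target state for the toy value function


def toy_objective(state, action, gamma=0.99):
    """Stand-in for a one-step objective r(s, a) + gamma * V(f(s, a))."""
    next_state = state + action                 # stand-in for a learned dynamics model
    reward = -0.1 * np.sum(action ** 2)         # stand-in for a learned reward model
    value = -np.sum((next_state - GOAL) ** 2)   # stand-in for a learned critic
    return reward + gamma * value


def numerical_grad(fn, a, eps=1e-5):
    """Central-difference gradient of fn at a."""
    grad = np.zeros_like(a)
    for i in range(a.size):
        d = np.zeros_like(a)
        d[i] = eps
        grad[i] = (fn(a + d) - fn(a - d)) / (2.0 * eps)
    return grad


def refine_action(state, a_heuristic, objective, radius=0.2, lr=0.1, n_steps=5):
    """Gradient ascent on the objective, projected onto a box around the heuristic action."""
    a = a_heuristic.copy()
    for _ in range(n_steps):
        a = a + lr * numerical_grad(lambda u: objective(state, u), a)
        # Trust region: stay within `radius` of the heuristic action per dimension.
        a = a_heuristic + np.clip(a - a_heuristic, -radius, radius)
    return a


if __name__ == "__main__":
    state = np.zeros(2)
    a_heur = np.array([0.5, 0.3])
    a_ref = refine_action(state, a_heur, toy_objective)
    print("heuristic:", a_heur, "objective:", toy_objective(state, a_heur))
    print("refined:  ", a_ref, "objective:", toy_objective(state, a_ref))
```

Because the refinement is recomputed from the current state at every control step, it can react to dynamics the heuristic did not anticipate, while the trust region limits how far a possibly inaccurate learned model can pull the action away from the prior.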