Welcome to the SSU Reality Lab!
Kanggeon Kim, KIST
Custom Built Research Platform
Governor's Award - Toddlerbot Project
Built with Gyeonggi Youth Gap Year Grant
Grand Prize - Spartan SW Camp
Association for the Advancement of Artificial Intelligence
Poster Session Participation
Industry Collaboration and Research
Lab Activities and Team Bonding
Lab Celebration and Appreciation
Sebin Lee, ByungKwan Chae, Youngjae Choi
Dowon Kim, Sebin Lee, Yeonji Kim
Sooyoung Choi
Youngjae Choi, Hyunsuh Kho, Hojae Jeong, Byungkwan Chae
Click on a publication to view detailed information
Sebin Lee, Heewon Kim
ByungKwan Chae*, Youngjae Choi*, Heewon Kim
Sangmin Lee*, Sungyong Park*, Heewon Kim
Sooyoung Choi*†, Sungyong Park*, Heewon Kim
Hyunsuh Kho*, Seunghyun Oh*, Jungyun Jang*, Heewon Kim
Youngjae Choi*, Hyunsuh Kho*, Hojae Jeong*, Byungkwan Chae*, Sungyong Park, Heewon Kim
Reality Lab was founded in 2023 under the guidance of Professor Heewon Kim of the School of Global Media at Soongsil University. Our lab explores the interface between artificial intelligence and the physical world. We develop intelligent systems that, like humans, learn by perceiving and interacting with the real world, with research spanning Computer Vision, Deep Learning, Robotics, Embodied AI, and Multimodal Learning. We publish at the world's top venues, including CVPR, AAAI, and ECCV.
Winter Conference on Applications of Computer Vision (WACV)
Existing eyeglass removal methods can handle frames and shadows but fail to correct lens-induced geometric distortions, as public datasets lack the necessary supervision. To address this, we introduce the HiGlass Dataset, the first large-scale synthetic dataset providing explicit flow-based supervision for refractive warping. We also propose HiGlassRM, a novel pipeline whose core is a network that explicitly estimates a displacement flowmap to de-warp distorted facial geometry. Experiments on both synthetic and real images show that this flowmap-centric approach, trained on our data, significantly improves identity preservation and perceptual quality over existing methods. Our work demonstrates that explicitly modeling and correcting geometric distortion via flowmap estimation, enabled by targeted supervision, is key to faithful eyeglass removal.
[C26] Sebin Lee and Heewon Kim, "HiGlassRM: Learning to Remove High-prescription Glasses via Synthetic Dataset Generation," Proc. Winter Conference on Applications of Computer Vision (WACV), 2026 (accepted)
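The flowmap-centric de-warping step can be pictured as resampling the distorted face along a predicted displacement field. Below is a minimal PyTorch sketch of that resampling, assuming a flowmap given in pixel offsets; the function name and the zero-flow toy input are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dewarp_with_flowmap(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Resample `image` (B,C,H,W) along a displacement `flow` (B,2,H,W).

    flow[:, 0] and flow[:, 1] hold per-pixel x/y offsets (in pixels) pointing
    from each output location to its source in the distorted input.
    """
    b, _, h, w = image.shape
    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=image.device),
        torch.linspace(-1, 1, w, device=image.device), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
    # Convert pixel offsets to normalized offsets and displace the grid.
    norm = torch.stack(
        (flow[:, 0] * 2 / (w - 1), flow[:, 1] * 2 / (h - 1)), dim=-1)
    return F.grid_sample(image, base + norm, align_corners=True)

# Toy usage: a zero flowmap returns the input (up to interpolation).
img = torch.rand(1, 3, 64, 64)
out = dewarp_with_flowmap(img, torch.zeros(1, 2, 64, 64))
```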
Winter Conference on Applications of Computer Vision (WACV)
Personalizing text-to-image diffusion models from only a few images per class remains challenging, especially for fine-grained categories where class-defining cues are subtle. We introduce Metric-Based Textual Inversion (MBTI), a plug-and-play objective that augments textual inversion by explicitly enlarging inter-class margins in the text-embedding space while preserving denoising fidelity. MBTI (i) selects the most confusable previously learned classes as support embeddings via cosine similarity (SES), (ii) contrasts predicted noises under the same diffusion state (timestep and noise) between the supports and the current token (DMM), and (iii) stabilizes training with a time-weighted regularizer aligned to the diffusion objective. Across Flowers102, NABirds, FGVC-Aircraft, Stanford Cars, COCO, and PASCAL VOC, MBTI produces more representative samples and improves few-shot classification under augmentation-for-classification. MBTI requires no model fine-tuning and integrates seamlessly with existing personalization pipelines.
[C25] ByungKwan Chae*, Youngjae Choi*, and Heewon Kim, "MBTI: Metric-Based Textual Inversion for Fine-Grained Image Generation," Proc. Winter Conference on Applications of Computer Vision (WACV), 2026 (accepted)
* Equal contribution
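The support-embedding selection (SES) step can be sketched as a cosine-similarity lookup over previously learned class tokens. A minimal illustration under that reading; the embedding bank, dimensions, and names here are hypothetical, not the paper's code.

```python
import torch
import torch.nn.functional as F

def select_support_embeddings(current: torch.Tensor,
                              learned: torch.Tensor,
                              k: int = 2) -> torch.Tensor:
    """Pick the k previously learned class tokens most confusable with `current`.

    current: (D,) embedding of the token being optimized.
    learned: (N, D) bank of already-personalized class token embeddings.
    """
    sims = F.cosine_similarity(learned, current.unsqueeze(0), dim=1)  # (N,)
    return learned[sims.topk(k).indices]

# Toy usage with random vectors standing in for learned tokens (D=768 assumed).
torch.manual_seed(0)
supports = select_support_embeddings(torch.randn(768), torch.randn(10, 768), k=2)
```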
Robotic manipulation in embodied AI critically depends on large-scale, high-quality datasets that reflect realistic object interactions and physical dynamics. However, existing data collection pipelines are often slow, expensive, and heavily reliant on manual effort. We present DynScene, a diffusion-based framework for generating dynamic robotic manipulation scenes directly from textual instructions. Unlike prior methods that focus solely on static environments or isolated robot actions, DynScene decomposes generation into two phases, static scene synthesis and action trajectory generation, allowing fine-grained control and diversity. Our model enhances realism and physical feasibility through scene refinement (layout sampling, quaternion quantization) and leverages a residual action representation to enable action augmentation, generating multiple diverse trajectories from a single static configuration.
Experiments show DynScene achieves 26.8× faster generation, 1.84× higher accuracy, and 28% greater action diversity than human-crafted data. Furthermore, agents trained with DynScene exhibit up to 19.4% higher success rates across complex manipulation tasks. Our approach paves the way for scalable, automated dataset generation in robot learning.
[C22] Sangmin Lee*, Sungyong Park*, and Heewon Kim, "DynScene: Scalable Generation of Dynamic Robotic Manipulation Scenes for Embodied AI," Proc. Computer Vision and Pattern Recognition (CVPR), 2025.
* Equal contribution
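The residual action representation behind DynScene's action augmentation can be pictured as adding smooth residuals on top of a reference trajectory sampled for one static scene. The toy sketch below illustrates only that idea; the paper's trajectories come from a diffusion model, and the smoothing and noise scale here are assumptions.

```python
import torch
import torch.nn.functional as F

def augment_actions(base_traj: torch.Tensor, n_aug: int,
                    noise_scale: float = 0.02) -> torch.Tensor:
    """Generate diverse trajectories from one reference trajectory.

    base_traj: (T, A) reference actions (T steps, A action dimensions).
    Each augmented trajectory is the base plus a temporally smoothed random
    residual, mimicking a residual action representation.
    """
    t, a = base_traj.shape
    residuals = noise_scale * torch.randn(n_aug, t, a)
    # Moving-average smoothing along time keeps perturbations physically gentle.
    kernel = torch.ones(a, 1, 5) / 5.0
    smooth = F.conv1d(residuals.transpose(1, 2), kernel,
                      padding=2, groups=a).transpose(1, 2)
    return base_traj.unsqueeze(0) + smooth

trajs = augment_actions(torch.zeros(50, 7), n_aug=4)  # e.g. a 7-DoF arm
```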
Drag the slider to compare the contaminated and original images
Smartphone cameras are ubiquitous in daily life, yet their performance can be severely impacted by dirty lenses, leading to degraded image quality. This issue is often overlooked in image restoration research, which assumes ideal or controlled lens conditions. To address this gap, we introduce SIDL (Smartphone Images with Dirty Lenses), a novel dataset for restoring images captured through contaminated smartphone lenses. SIDL contains diverse real-world images taken under various lighting conditions and environments. These images feature a wide range of lens contaminants, including water drops, fingerprints, and dust. Each contaminated image is paired with a clean reference image, enabling supervised learning approaches for restoration tasks.
To evaluate the challenge posed by SIDL, various state-of-the-art restoration models were trained and compared on this dataset. These models achieved some level of restoration but did not adequately handle the diverse, realistic lens contaminants in SIDL. This gap highlights the need for more robust and adaptable techniques for restoring images captured with dirty lenses.
[C21] Sooyoung Choi*†, Sungyong Park*, and Heewon Kim, "SIDL: A Real-World Dataset for Restoring Smartphone Images with Dirty Lenses," AAAI Conference on Artificial Intelligence (AAAI), 2025.
* Equal contribution
† Undergraduate student
Photorealistic style transfer in neural radiance fields (NeRF) aims to modify the color characteristics of a 3D scene without altering its underlying geometry. Although recent approaches have achieved promising results, they often suffer from limited style diversity, focusing primarily on global color shifts. In contrast, artistic style transfer methods offer richer stylization but usually distort scene geometry, thereby reducing realism.
In this work, we present Intrinsic-guided Photorealistic Style Transfer (IPRF), a novel framework that leverages intrinsic image decomposition to decouple a scene into albedo and shading components. By introducing tailored loss functions in both domains, IPRF aligns the texture and color of the content scene to those of a style image while faithfully preserving geometric structure and lighting.
Furthermore, we propose Tuning-assisted Style Interpolation (TSI), a real-time technique for exploring the trade-off between photorealism and artistic expression through a weighted combination of albedo-oriented and shading-oriented radiance fields. Experimental results demonstrate that IPRF achieves a superior balance between naturalism and artistic expression compared to state-of-the-art methods, offering a versatile solution for 3D content creation in various fields, including digital art, virtual reality, and game design.
[C24] Hyunsuh Kho*, Seunghyun Oh*, Jungyun Jang*, and Heewon Kim, "Intrinsic-Guided Photorealistic Style Transfer for Radiance Fields," Proc. International Workshop on Application-driven Point Cloud Processing and 3D Vision (APP3DV, ACM MM Workshop), 2025 (accepted)
* Equal contribution
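Tuning-assisted Style Interpolation boils down to a weighted combination of albedo-oriented and shading-oriented radiance fields. The sketch below blends the two rendered outputs for a single viewpoint as a stand-in for blending the fields themselves; the weighting convention is an assumption.

```python
import torch

def tsi_blend(albedo_rgb: torch.Tensor, shading_rgb: torch.Tensor,
              alpha: float) -> torch.Tensor:
    """Interpolate between albedo-oriented and shading-oriented renderings.

    alpha = 0 keeps the shading-oriented output (more photorealistic);
    alpha = 1 keeps the albedo-oriented output (more artistic).
    Both inputs are (H, W, 3) renderings of the same viewpoint.
    """
    return (1.0 - alpha) * shading_rgb + alpha * albedo_rgb

albedo, shading = torch.rand(256, 256, 3), torch.rand(256, 256, 3)
sweep = [tsi_blend(albedo, shading, a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
```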
Diffusion models achieve impressive image synthesis, yet unsupervised methods for exploring their latent space remain limited for fine-grained class translation, often producing low-diversity outputs within parent classes or inconsistent child-class mappings across images. We propose UDT (Unsupervised Discovery of Transformations), a framework that incorporates hierarchical structure into unsupervised direction discovery. UDT leverages parent-class prompts to decompose predicted noise into class-general and class-specific components, ensuring translations remain within the parent domain while enabling disentangled child-class transformations. A hierarchy-aware contrastive loss further enforces consistency, with each direction corresponding to a distinct child class. Experiments on dogs, cats, birds, and flowers show that UDT outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, UDT supports controllable interpolation, allowing for the smooth generation of intermediate classes (e.g., mixed breeds). These results demonstrate UDT as a general and effective solution for fine-grained image translation.
[C23] Youngjae Choi*, Hyunsuh Kho*, Hojae Jeong*, Byungkwan Chae*, Sungyong Park, and Heewon Kim, "UDT: Unsupervised Discovery of Transformations between Fine-Grained Classes in Diffusion Models," Proc. British Machine Vision Conference (BMVC), 2025 (accepted)
* Equal contribution
Deep neural networks have achieved significant performance breakthroughs across a range of tasks. For diagnosing depression, there has been increasing attention on estimating depression status from personal medical data. However, the neural networks often act as black boxes, making it difficult to discern the individual effects of each input component. To alleviate this problem, we proposed a deep-learning-based generalized additive model called DeepGAM to improve the interpretability of depression diagnosis. We utilized the baseline cross-sectional data from the Heart and Soul Study to achieve our study's aim. DeepGAM incorporates additive functions based on a neural network that learns to discern the positive and negative impacts of the values of individual components. The network architecture and the objective function are designed to constrain and regularize the output values for interpretability.
Moreover, we used a straight-through estimator (STE) to select important features via gradient descent. The STE enables machine learning models to maintain their performance with only a few features while providing interpretable function visualizations. DeepGAM achieved the highest AUC (0.600) and F1-score (0.387), outperforming neural networks and IGANN. The five features selected via STE performed comparably to 99 features and surpassed traditional methods such as Lasso and Boruta. Additionally, analyses highlighted DeepGAM's interpretability and performance on public datasets. In conclusion, DeepGAM with STE demonstrated accurate and interpretable performance in predicting depression compared to existing machine learning methods.
[J12] Chiyoung Lee*, Yeri Kim*†, Seoyoung Kim*†, Mary Whooley, and Heewon Kim, "DeepGAM: An Interpretable Deep Neural Network Using Generalized Additive Model for Depression Diagnosis: Data From The Heart and Soul Study," PLOS ONE, 2025 (accepted)
* Equal contribution
† Undergraduate student
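The straight-through estimator at the heart of the feature selection can be sketched generically: hard 0/1 gates in the forward pass, with gradients routed through continuous logits in the backward pass. This is a generic PyTorch gate, not DeepGAM's exact selection module; the threshold and initialization are assumptions.

```python
import torch

class STEGate(torch.nn.Module):
    """Binary feature-selection gate trained with a straight-through estimator.

    The forward pass applies hard 0/1 gates to the input features; the
    backward pass routes gradients through the continuous logits, letting
    gradient descent learn which features to keep.
    """
    def __init__(self, n_features: int):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        return x * (hard + soft - soft.detach())  # straight-through trick

gate = STEGate(99)                 # e.g. 99 candidate features
y = gate(torch.randn(32, 99))      # hard-gated features
y.sum().backward()                 # gradients still reach gate.logits
```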
We propose a diagnostic system for identifying bronchial diseases by analyzing dog cough sounds. We collected a dataset consisting of 124 healthy dog cough sounds obtained from open sources and 94 cough sounds of dogs with bronchial diseases obtained from YouTube, for a total of 218 recordings. These cough sounds were segmented into 423 separate cough samples to improve the detail and accuracy of the analysis. Additionally, data augmentation techniques such as noise addition, pitch shifting, time stretching, and volume scaling were applied, increasing the dataset size sevenfold. This resulted in 1,526 training and testing samples for multiple coughs and 2,961 samples for single coughs.
The disease prediction system leverages three different neural network models, multilayer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN), to evaluate their effectiveness in detecting bronchial diseases. In our experiments, we found that the single cough dataset outperformed the multiple cough dataset, with the CNN achieving the highest accuracy, precision, AUC, and F1 scores compared to the RNN and MLP. The study highlights the potential of machine learning in improving diagnostic accuracy for veterinary medicine, suggesting that integrating different models could enhance diagnostic tools, thereby contributing to better health outcomes for dogs.
[J11] Do-Ye Kwon*†, Yeon-Ju Oh*†, and Heewon Kim, "Dog Cough Sound Classification Using Neural Networks for Diagnosing Bronchial Diseases," Journal of Electrical Engineering & Technology (JEET), 2025 (accepted)
* Equal contribution
† Undergraduate student
Transcranial direct current stimulation (tDCS) has been recognized as a safe and effective intervention for treating knee osteoarthritis (KOA) pain; however, research has suggested the heterogeneity of treatment effects across participants. This study aimed to identify the sociodemographic and clinical predictors of such heterogeneity in older adults with symptomatic KOA undergoing tDCS, thereby enhancing personalized treatment strategies. Specifically, we analyzed active and sham tDCS groups separately to account for placebo or sham effects. This study entailed secondary data analysis of a double-blind, randomized, sham-controlled, phase II, parallel-group pilot clinical trial involving 120 participants with KOA pain. These participants were assigned to 15 daily telehealth-delivered sessions of either active 2-mA tDCS (n=60) for 20 min or sham stimulation (n=60) over 3 weeks. The primary outcome was the change in Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) pain subscale scores, measured from baseline to after the 15 tDCS sessions in both the active and sham groups.
Predictive modeling using random forest (RF) and artificial neural network (ANN) algorithms was utilized, with model performance assessed based on R-squared values. The impact of predictive features on treatment outcomes was examined using several feature selection methods, including Lasso, BorutaSHAP, Chi2, F-regression, and R-regression. The RF and ANN models both effectively predicted treatment effects, indicating the potential of machine learning to enhance patient-specific treatment strategies. In the active group, the predominant features included age, average heat pain tolerance at the knee at baseline, baseline WOMAC functional score, and the duration of KOA. In the sham group, the major features comprised the duration of KOA, Kellgren–Lawrence scale score of the affected knee, baseline pain catastrophizing score, average heat pain tolerance at the knee at baseline, and baseline WOMAC functional score. Characterizing these predictive factors can inform personalized tDCS protocols, potentially improving treatment effects.
[C20&J9] Chiyoung Lee, Heewon Kim, Yeri Kim†, Seoyoung Kim†, Kent Kwoh, Juyoung Park, Hyochol Ahn, "Predictors of the treatment effects of transcranial direct current stimulation on knee osteoarthritis pain: a machine-learning approach," International Brain Stimulation Conference, 2025 & Brain Stimulation, vol. 18, issue 1, pp. 456-457, Feb. 2025.
CVPR 2025 Embodied AI Workshop
1st Place Winner at CVPR 2025 Embodied AI Workshop
Chronic pain is a major public health problem affecting approximately 100 million Americans and United States military Veterans, who constitute a particularly vulnerable group. While pain research in Veterans is actively underway, information on the longitudinal course of pain in this population is limited. This study aimed to 1) identify the various longitudinal pain status trajectories among older Veterans over a 10-year period and 2) detect factors predicting membership in the worsening trajectory of chronic pain. We analyzed data from 619 Veterans (mean age: 58.5 years) participating in the Mind Your Heart Study, an ongoing prospective cohort study examining diverse health outcomes among Veterans. Initially, we employed a generalized mixture model to identify pain trajectory classes using Brief Pain Inventory (BPI) pain intensity subscale score collected at 2-, 5-, and 10-year intervals. Two distinct trajectories were identified—low and high—both of which remained relatively stable.
Subsequently, several feature selection methods extracted the predominant features from participants' baseline characteristics that predicted membership in the high vs. low pain trajectory. These included: prior arthritis diagnosis; prior post-traumatic stress disorder (PTSD) diagnosis; depression symptoms; PTSD symptoms of avoidance, hyperarousal, and negative mood alterations; physical functioning; sleep quality; and overall health. The scikit-learn RandomForestClassifier, utilizing the refined feature set, achieved a classification accuracy of 0.79, yielding results nearly identical to those obtained using all 261 features. These findings are clinically informative and pertinent, highlighting potential intervention targets warranting intensive pain care plans based on probable long-term prognosis and discussing early treatment strategies among older Veterans.
[C17&J8] Chiyoung Lee, Yeri Kim†, Seoyoung Kim†, Beth Cohen, Kent Kwoh, Hyochol Ahn, Juyoung Park, Heewon Kim, "Trajectories of chronic pain among older Veterans: Identifying pain-worsening predictors via machine learning," Gerontological Society of America (GSA) Annual Scientific Meeting, 2024 & Innovation in Aging, vol. 8, issue suppl. 1, pp. 1221, Dec. 2024.
Ischemic stroke is a major cause of mortality worldwide. Proper etiological subtyping of ischemic stroke is crucial for tailoring treatment strategies. This study explored the utility of circulating microRNAs encapsulated in extracellular vesicles (EV-miRNAs) to distinguish the following ischemic stroke subtypes: large artery atherosclerosis (LAA), cardioembolic stroke (CES), and small artery occlusion (SAO). Using next-generation sequencing (NGS) and machine-learning techniques, we identified differentially expressed miRNAs (DEMs) associated with each subtype. Through patient selection and diagnostic evaluation, a cohort of 70 patients with acute ischemic stroke was classified: 24 in the LAA group, 24 in the SAO group, and 22 in the CES group.
Our findings revealed distinct EV-miRNA profiles among the groups, suggesting their potential as diagnostic markers. Machine-learning models, particularly logistic regression models, exhibited a high diagnostic accuracy of 92% for subtype discrimination. The collective influence of multiple miRNAs was more crucial than that of individual miRNAs. Additionally, bioinformatics analyses have elucidated the functional implications of DEMs in stroke pathophysiology, offering insights into the underlying mechanisms. Despite limitations like sample size constraints and retrospective design, our study underscores the promise of EV-miRNAs coupled with machine learning for ischemic stroke subtype classification. Further investigations are warranted to validate the clinical utility of the identified EV-miRNA biomarkers in stroke patients.
[J7] Ji Hoon Bang†, Eun Hee Kim, Hyung Jun Kim, Jong-Won Chung, Woo-Keun Seo, Gyeong-Moon Kim, Dong-Ho Lee, Heewon Kim*, and Oh Young Bang*, "Machine Learning-Based Etiologic Subtyping of Ischemic Stroke Using Circulating Exosomal microRNAs," International Journal of Molecular Sciences (IJMS), vol. 25, no. 12, pp. 1-14, Jun. 2024.
* Equal contribution
† First author
We aim to train accurate denoising networks for smartphone/digital cameras from single noisy images. Downscaling is commonly used as a practical denoiser for low-resolution images. Building on this process, we find that the pixel variance of natural images is more robust to downscaling than the pixel variance of camera noise. Intuitively, downscaling removes high-frequency noise more easily than natural textures. Exploiting this property, we can synthesize noisy/clean image pairs at low resolution to train camera denoisers. On this basis, we propose a new solution pipeline -- NERDS -- that estimates camera noise and synthesizes noisy-clean image pairs from only noisy images. In particular, it first models the noise in raw-sensor images as a Poisson-Gaussian distribution, then estimates the noise parameters from the difference of pixel variances under downscaling. We formulate the noise estimation as a gradient-descent-based optimization problem through a reparametrization trick.
We further introduce a new Image Signal Processor (ISP) estimation method that enables denoiser training in a human-readable RGB space by transforming the synthetic raw images to the style of a given RGB noisy image. The noise and ISP estimations utilize rich augmentation to synthesize image pairs for denoiser training. Experiments show that our NERDS can accurately train CNN-based denoisers (e.g., DnCNN, ResNet-style network) outperforming previous noise-synthesis-based and self-supervision-based denoisers in real datasets.
[C15] Heewon Kim and Kyoung Mu Lee, "NERDS: A General Framework to Train Camera Denoisers from Raw-RGB Noisy Image Pairs," International Conference on Learning Representations (ICLR), 2023.
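The Poisson-Gaussian model that NERDS fits treats raw-sensor noise as signal-dependent shot noise with variance a·x plus signal-independent read noise with variance b. The sketch below shows only the synthesis side, with arbitrary (a, b); NERDS itself estimates these parameters from the difference of pixel variances under downscaling.

```python
import torch

def poisson_gaussian_noise(clean_raw: torch.Tensor, a: float, b: float):
    """Synthesize raw-sensor noise so that Var[y | x] = a * x + b.

    The signal-dependent (shot) term is Poisson with variance a * x and the
    signal-independent (read) term is Gaussian with variance b.
    """
    shot = torch.poisson(clean_raw / a) * a        # mean x, variance a * x
    read = torch.randn_like(clean_raw) * b ** 0.5  # zero mean, variance b
    return shot + read

clean = torch.rand(1, 4, 128, 128) * 0.9 + 0.05    # packed RGGB raw in [0, 1]
noisy = poisson_gaussian_noise(clean, a=0.01, b=1e-4)
```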
Editing flat-looking images into stunning photographs requires skill and time. Automated image enhancement algorithms have attracted increased interest by generating high-quality images without user interaction. However, the quality assessment of a photograph is subjective. Even for tone and color adjustments, a single auto-enhanced photograph can hardly fit user preferences, which are subtle and changeable. To address this problem, we present a semi-automatic image enhancement algorithm that can generate high-quality images in multiple styles by controlling a few parameters.
We first disentangle photo retouching skills from high-quality images and build an efficient enhancement system for each skill. Specifically, an encoder-decoder framework encodes the retouching skills into latent codes and decodes them into the parameters of image signal processing (ISP) functions. The ISP functions are computationally efficient and consist of only 19 parameters. Although our approach requires multiple inferences to obtain the desired result, experiments show that the proposed method achieves state-of-the-art performance on the benchmark dataset in both image quality and model efficiency.
[J4] Heewon Kim and Kyoung Mu Lee, "Learning Controllable ISP for Image Enhancement," IEEE Trans. Image Processing (TIP), vol. 33, no. 1, pp. 867-880, Aug. 2023.
[J3] Sungyong Baik, Myungsub Choi, Janghoon Choi, Heewon Kim, and Kyoung Mu Lee, "Learning to Learn Task-Adaptive Hyperparameters for Few-Shot Learning," IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), vol. 46, no. 3, pp. 1441-1454, Mar. 2023.
The goal of filter pruning is to search for unimportant filters to remove in order to make convolutional neural networks (CNNs) efficient without sacrificing performance in the process. The challenge lies in finding information that can help determine how important or relevant each filter is with respect to the final output of the network. In this work, we share our observation that the batch normalization (BN) parameters of pre-trained CNNs can be used to estimate the feature distribution of activation outputs, without processing training data.
Based on this observation, we propose a simple yet effective filter pruning method that evaluates the importance of each filter from the BN parameters of pre-trained CNNs. The experimental results on CIFAR-10 and ImageNet demonstrate that the proposed method achieves outstanding performance, with and without fine-tuning, in terms of the trade-off between the accuracy drop and the reduction in computational complexity and number of parameters of the pruned networks.
[C11] Junghun Oh, Heewon Kim, Seungjun Nah, Cheeun Hong, Jonghyun Choi, and Kyoung Mu Lee, "Batch Normalization Tells You Which Filter is Important," Winter Conference on Applications of Computer Vision (WACV), 2022.
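One plausible way to read filter importance off BN parameters alone: treat each post-BN pre-activation channel as Gaussian with mean beta and standard deviation |gamma|, and rank filters by their expected post-ReLU response. The closed form below follows from standard Gaussian identities; whether this matches the paper's exact criterion is an assumption.

```python
import torch

def bn_filter_importance(gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Score filters from BatchNorm affine parameters, with no training data.

    Per channel, post-BN pre-activations are modeled as N(beta, gamma^2), so
    the expected post-ReLU response has the closed form
    E[ReLU(z)] = sigma * pdf(beta / sigma) + beta * cdf(beta / sigma).
    Filters with near-zero expected response are pruning candidates.
    """
    std = gamma.abs() + 1e-12
    t = beta / std
    normal = torch.distributions.Normal(0.0, 1.0)
    return std * torch.exp(normal.log_prob(t)) + beta * normal.cdf(t)

bn = torch.nn.BatchNorm2d(64)      # stands in for a pretrained layer's BN
scores = bn_filter_importance(bn.weight.data, bn.bias.data)
prune_idx = scores.argsort()[:16]  # e.g. drop the 16 least important filters
```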
We aimed to develop prediction models for depression among U.S. adults with hypertension using various machine learning (ML) approaches. Moreover, we analyzed the mechanisms of the developed models. This cross-sectional study included 8,628 adults with hypertension (11.3% with depression) from the National Health and Nutrition Examination Survey (2011–2020). We selected several significant features using feature selection methods to build the models. Data imbalance was managed with random down-sampling. Six different ML classification methods implemented in the R package caret—artificial neural network, random forest, AdaBoost, stochastic gradient boosting, XGBoost, and support vector machine—were employed with 10-fold cross-validation for predictions. Model performance was assessed by examining the area under the receiver operating characteristic curve (AUC), accuracy, precision, sensitivity, specificity, and F1-score. For an interpretable algorithm, we used the variable importance evaluation function in caret.
Of all classification models, artificial neural network trained with selected features (n = 30) achieved the highest AUC (0.813) and specificity (0.780) in predicting depression. Support vector machine predicted depression with the highest accuracy (0.771), precision (0.969), sensitivity (0.774), and F1-score (0.860). The most frequent and important features contributing to the models included the ratio of family income to poverty, triglyceride level, white blood cell count, age, sleep disorder status, the presence of arthritis, hemoglobin level, marital status, and education level. In conclusion, ML algorithms performed comparably in predicting depression among hypertensive populations. Furthermore, the developed models shed light on variables' relative importance, paving the way for further clinical research.
[J1] Chiyoung Lee and Heewon Kim, "Machine learning-based predictive modeling of depression in hypertensive populations," PLOS ONE, vol. 17, no. 7, pp. 1-17, Jul. 2022.
[J2] Heewon Kim, Seokil Hong, Bohyung Han, Heesoo Myeong, and Kyoung Mu Lee, "Fine-Grained Neural Architecture Search for Image Super-Resolution," Journal of Visual Communication and Image Representation (JVCI), vol. 89, no. 1, pp. 1-9, Nov. 2022.
The ability to quickly learn and generalize from only a few examples is an essential goal of few-shot learning. Gradient-based meta-learning algorithms effectively tackle the problem by learning how to learn novel tasks. In particular, model-agnostic meta-learning (MAML) encodes prior knowledge into a trainable initialization, which allows for fast adaptation to few examples. Despite its popularity, several recent works question the effectiveness of the MAML initialization, especially when test tasks differ from training tasks, and suggest various methodologies to improve it.
Instead of searching for a better initialization, we focus on a complementary factor in the MAML framework: the inner-loop optimization (or fast adaptation). Consequently, we propose a new weight update rule that greatly enhances the fast adaptation process. Specifically, we introduce a small meta-network that adaptively generates per-step hyperparameters: learning rate and weight decay coefficients. The experimental results validate that Adaptive Learning of hyperparameters for Fast Adaptation (ALFA) is an equally important ingredient that has often been neglected in recent few-shot learning approaches. Surprisingly, fast adaptation from random initialization with ALFA can already outperform MAML.
[C6] Sungyong Baik, Myungsub Choi, Janghoon Choi, Heewon Kim, and Kyoung Mu Lee, "Meta-Learning with Adaptive Hyperparameters," Advances in Neural Information Processing Systems (NeurIPS), 2020.
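ALFA's key move, a small meta-network generating a per-step learning rate and weight decay for the inner loop, can be sketched as follows. The two-number adaptation-state summary fed to the meta-network here is purely illustrative, not ALFA's actual conditioning.

```python
import torch

class HyperparamGenerator(torch.nn.Module):
    """Tiny meta-network producing a per-step learning rate and weight decay."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))

    def forward(self, grads, params):
        # Illustrative 2-number summary of the current adaptation state.
        state = torch.stack([
            torch.cat([g.flatten() for g in grads]).abs().mean(),
            torch.cat([p.flatten() for p in params]).abs().mean()])
        lr, wd = torch.nn.functional.softplus(self.net(state))  # positive scalars
        return lr, wd

def inner_step(params, grads, gen):
    """One fast-adaptation step: theta <- theta - lr * grad - wd * theta."""
    lr, wd = gen(grads, params)
    return [p - lr * g - wd * p for p, g in zip(params, grads)]

w = [torch.randn(5, 5, requires_grad=True)]
g = torch.autograd.grad((w[0] ** 2).sum(), w)
w_fast = inner_step(w, g, HyperparamGenerator())
```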
[C14] Cheeun Hong, Sungyong Baik, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, "CADyQ: Content-Aware Dynamic Quantization for Image Super Resolution," Proc. European Conference on Computer Vision (ECCV), 2022.
Since the resurgence of deep neural networks (DNNs), image super-resolution (SR) has recently seen huge progress in improving the quality of low-resolution images, though at a great cost in computation and resources. Recently, there have been several efforts to make DNNs more efficient via quantization. However, since SR demands pixel-level accuracy, it is more difficult to perform quantization without significantly sacrificing SR performance.
To this end, we introduce a new ultra-low precision yet effective quantization approach specifically designed for SR. In particular, we observe that in recent SR networks, each channel has different distribution characteristics. Thus we propose a channel-wise distribution-aware quantization scheme. Experimental results demonstrate that our proposed quantization, dubbed Distribution-Aware Quantization (DAQ), manages to greatly reduce the computational and resource costs without the significant sacrifice in SR performance, compared to other quantization methods.
[C12] Cheeun Hong*, Heewon Kim*, Sungyong Baik, Junghun Oh, and Kyoung Mu Lee, "DAQ: Channel-Wise Distribution-Aware Quantization for Deep Image Super-Resolution Networks," Winter Conference on Applications of Computer Vision (WACV), 2022.
* Equal contribution
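Channel-wise distribution-aware quantization can be pictured as standardizing each channel with its own statistics so all channels share one low-bit uniform grid, then de-normalizing. A simplified sketch; the clipping range, rounding scheme, and bit width below are assumptions.

```python
import torch

def daq_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Channel-wise distribution-aware quantization of features (B,C,H,W).

    Each channel is standardized with its own statistics before uniform
    quantization, so channels with very different ranges can share one
    low-bit grid; the result is de-normalized back afterwards.
    """
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    std = x.std(dim=(0, 2, 3), keepdim=True) + 1e-8
    z = ((x - mean) / std).clamp(-3, 3)            # per-channel standardization
    levels = 2 ** bits - 1
    q = torch.round((z + 3) / 6 * levels) / levels * 6 - 3  # uniform low-bit grid
    return q * std + mean

feat = torch.randn(2, 64, 32, 32) * torch.rand(1, 64, 1, 1) * 10  # uneven channels
qfeat = daq_quantize(feat, bits=4)
```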
[C13] Junghun Oh, Heewon Kim, Seungjun Nah, Cheeun Hong, Jonghyun Choi, and Kyoung Mu Lee, "Attentive Fine-Grained Structured Sparsity for Image Restoration," Proc. Computer Vision and Pattern Recognition (CVPR), 2022.
In few-shot learning scenarios, the challenge is to generalize and perform well on new unseen examples when only very few labeled examples are available for each task. Model-agnostic meta-learning (MAML) has gained popularity as one of the representative few-shot learning methods for its flexibility and applicability to diverse problems. However, MAML and its variants often resort to a simple loss function without any auxiliary loss function or regularization terms that can help achieve better generalization. The problem is that each application and task may require a different auxiliary loss function, especially when tasks are diverse and distinct.
Instead of attempting to hand-design an auxiliary loss function for each application and task, we introduce a new meta-learning framework with a loss function that adapts to each task. Our proposed framework, named Meta-Learning with Task-Adaptive Loss Function (MeTAL), demonstrates the effectiveness and the flexibility across various domains, such as few-shot classification and few-shot regression.
[C10] Sungyong Baik, Janghoon Choi, Heewon Kim, Dohee Cho, Jaesik Min, and Kyoung Mu Lee, "Meta-Learning with Task-Adaptive Loss Function for Few-Shot Learning," Proc. International Conference on Computer Vision (ICCV), 2021. (ORAL presentation)
ORAL presentation
We present a novel framework for controllable image restoration that can effectively restore multiple types and levels of degradation of a corrupted image. The proposed model, named TASNet, is automatically determined by our neural architecture search algorithm, which optimizes the efficiency-accuracy trade-off of the candidate model architectures. Specifically, we allow TASNet to share the early layers across different restoration tasks and adaptively adjust the remaining layers with respect to each task.
The shared task-agnostic layers greatly improve efficiency while the task-specific layers are optimized for restoration quality, and our search algorithm seeks the best balance between the two. We also propose a new data sampling strategy to further improve the overall restoration performance. As a result, TASNet achieves significantly lower GPU latency and fewer FLOPs than existing state-of-the-art models, while also producing visually more pleasing outputs.
[C8] Heewon Kim, Sungyong Baik, Myungsub Choi, Janghoon Choi, and Kyoung Mu Lee, "Searching for Controllable Image Restoration Networks," Proc. International Conference on Computer Vision (ICCV), 2021.
Video super-resolution has recently become one of the most important mobile-related problems due to the rise of video communication and streaming services. While many solutions have been proposed for this task, the majority of them are too computationally expensive to run on portable devices with limited hardware resources. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop end-to-end deep-learning-based video super-resolution solutions that can achieve real-time performance on mobile GPUs.
The participants were provided with the REDS dataset and trained their models to do an efficient 4X video upscaling. The runtime of all models was evaluated on the OPPO Find X2 smartphone with the Snapdragon 865 SoC capable of accelerating floating-point networks on its Adreno GPU. The proposed solutions are fully compatible with any mobile GPU and can upscale videos to HD resolution at up to 80 FPS while demonstrating high fidelity results. A detailed description of all models developed in the challenge is provided in this paper.
[C7] Andrey Ignatov, Andres Romero, Heewon Kim, and Radu Timofte, "Real-time video super-resolution on smartphones with deep learning, mobile ai 2021 challenge: Report," Proc. Computer Vision and Pattern Recognition Workshops (CVPRW), 2021.
Video frame interpolation aims to synthesize accurate intermediate frames given a low-frame-rate video. While the quality of the generated frames keeps improving, state-of-the-art models have become increasingly computationally expensive. However, local regions with small or no motion can be easily interpolated with simple models and do not require such heavy compute, whereas some regions may not be correct even after inference through a large model.
Thus, we propose an effective framework that assigns varying amounts of computation for different regions. Our dynamic architecture first calculates the approximate motion magnitude to use as a proxy for the difficulty levels for each region, and decides the depth of the model and the scale of the input. Experimental results show that static regions pass through a smaller number of layers, while the regions with larger motion are downscaled for better motion reasoning. In doing so, we demonstrate that the proposed framework can significantly reduce the computation cost (FLOPs) while maintaining the performance, often up to 50% when interpolating a 2K resolution video.
[C9] Myungsub Choi, Suyoung Lee, Heewon Kim, and Kyoung Mu Lee, "Motion-Aware Dynamic Architecture for Efficient Frame Interpolation," Proc. International Conference on Computer Vision (ICCV), 2021.
Prevailing video frame interpolation techniques rely heavily on optical flow estimation, which adds model complexity and computational cost and is susceptible to error propagation in challenging scenarios with large motion and heavy occlusion. To alleviate this limitation, we propose a simple but effective deep neural network for video frame interpolation that is end-to-end trainable and free from a motion estimation component.
Our algorithm employs a special feature reshaping operation, referred to as PixelShuffle, with channel attention, which replaces the optical flow computation module. The main idea behind the design is to distribute the information in a feature map across multiple channels and extract motion information by attending to the channels for pixel-level frame synthesis. The model given by this principle turns out to be effective in the presence of challenging motion and occlusion. We construct a comprehensive evaluation benchmark and demonstrate that the proposed approach achieves outstanding performance compared to existing models with a component for optical flow computation.
[C5] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee, "Channel Attention Is All You Need for Video Frame Interpolation," AAAI Conference on Artificial Intelligence, 2020.
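The flow-free design can be pictured as space-to-depth reshaping (the inverse of PixelShuffle) that spreads inter-frame displacement across channels, followed by channel attention that weights motion-relevant channels. The toy wiring below is far smaller than the actual CAIN model, and the final synthesis step is a placeholder.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style attention over channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # (B, C) per-channel weights
        return x * w[:, :, None, None]

down, up = nn.PixelUnshuffle(4), nn.PixelShuffle(4)
frames = torch.cat([torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)], dim=1)
feat = down(frames)                  # (1, 96, 16, 16): motion spread over channels
feat = ChannelAttention(96)(feat)    # attend to motion-relevant channels
mid = up(feat[:, :48])               # placeholder synthesis of the middle frame
```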
Videos contain various types and strengths of motion that may look unnaturally discontinuous in time when the recorded frame rate is low. This paper reviews the first AIM challenge on video temporal super-resolution (frame interpolation) with a focus on the proposed solutions and results. From low-frame-rate (15 fps) video sequences, the challenge participants are asked to submit higher-frame-rate (60 fps) video sequences by estimating temporally intermediate frames. We employ the REDS VTSR dataset, derived from diverse videos captured with hand-held cameras, for training and evaluation.
The competition had 62 registered participants, and a total of 8 teams competed in the final testing phase. The challenge winning methods achieve the state-of-the-art in video temporal super-resolution.
[C4] Seungjun Nah, Sanghyun Son, Radu Timofte, Kyoung Mu Lee, Li Siyao, Ziwei Pan, Xiangyu Xu, Wenxiu Sun, Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, Renjie Liao, Alex Schwing, Yuhua Chen, Chen Chen, Kai Zhang, Wangmeng Zuo, Tong Lu, Ran Duan, Tong Lu, Lihe Zhang, Woonsook Park, Manhcuot Tran, George Pisha, Eyal Naor, Lilie Zhang, Kai Zhang, Wangmeng Zuo, Zhimin Tang, Linkai Luo, Shaochui Li, Min Fu, Lei Cai, Wen Heng, Giang Bui, Truc Le, Ye Deng, Sifei Liu, Hyun Young Jung, Tianlong Guo, Hongchang Gao, Jiejun Xu, Ruofan Fang, Yunsong Li, Anmur Mehta, Vishal Monga, "AIM 2019 challenge on video temporal super-resolution: Methods and results," International Conference on Computer Vision Workshop (ICCVW), 2019.
Image downscaling is one of the most classical problems in computer vision that aims to preserve the visual appearance of the original image when it is resized to a smaller scale. Upscaling a small image back to its original size is a difficult and ill-posed problem due to information loss that arises in the downscaling process. In this paper, we present a novel technique called task-aware image downscaling to support an upscaling task.
We propose an auto-encoder-based framework that enables joint learning of the downscaling network and the upscaling network to maximize the restoration performance. Our framework is efficient, and it can be generalized to handle an arbitrary image resizing operation. Experimental results show that our task-aware downscaled images greatly improve the performance of the existing state-of-the-art super-resolution methods. In addition, realistic images can be recovered by recursively applying our scaling model up to an extreme scaling factor of x128. We also validate our model's generalization capability by applying it to the task of image colorization.
[C3] Heewon Kim, Myungsub Choi, Bee Lim, and Kyoung Mu Lee, "Task-Aware Image Downscaling," Proc. European Conference on Computer Vision (ECCV), 2018.
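The joint learning can be pictured as an autoencoder whose encoder is the learned downscaler and whose decoder is the upscaler, with a guidance term keeping the low-resolution output visually close to a bicubic reference. A minimal training-step sketch with placeholder architectures; the guidance weight and network shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scale = 2
# Placeholder networks: a learned downscaler ("encoder") and a lightweight
# upscaler ("decoder"), trained jointly so the LR image best supports upscaling.
downscaler = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, stride=scale, padding=1))
upscaler = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3 * scale ** 2, 3, padding=1), nn.PixelShuffle(scale))
opt = torch.optim.Adam([*downscaler.parameters(), *upscaler.parameters()], lr=1e-4)

hr = torch.rand(4, 3, 64, 64)                      # batch of HR training crops
lr_img = downscaler(hr)                            # task-aware LR image
loss = F.l1_loss(upscaler(lr_img), hr)             # restoration objective
# Guidance term: keep the LR image close to a bicubic reference so the
# downscaled output still looks natural (0.1 weight is an assumption).
loss = loss + 0.1 * F.l1_loss(
    lr_img, F.interpolate(hr, scale_factor=1 / scale, mode="bicubic"))
opt.zero_grad(); loss.backward(); opt.step()
```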
This paper reviews the first challenge on single-image super-resolution (restoration of rich details in a low-resolution image) with a focus on the proposed solutions and results. A new DIVerse 2K resolution image dataset (DIV2K) was employed. The challenge had 6 competitions divided into 2 tracks with 3 magnification factors each. Track 1 employed the standard bicubic downscaling setup, while Track 2 had unknown downscaling operators (blur kernel and decimation) learnable from pairs of low- and high-resolution training images. Each competition had ~100 registered participants, and 20 teams competed in the final testing phase. The results gauge the state of the art in single-image super-resolution.
[C2] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, Lei Zhang, Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee, Xingtao Wang, Yapeng Tian, Ke Yu, Yulun Zhang, Shixiang Wu, Chao Dong, Liang Lin, Yu Qiao, Chen Change Loy, Wen Heng, Giang Bui, Truc Le, Ye Duan, Dacheng Tao, Ruxin Wang, Xu Lin, Jianwei Pang, Jian Xu, Yue Zhao, Xiangyu Xu, Jun Pan, Deqing Sun, Yujin Zhang, Jian Song, Yong Qin, Yukang Li, Jianwei Li, Yujin Chen, Kai Zhang, Wangmeng Zuo, Zhimin Tang, Linkai Luo, Shaohui Li, Min Fu, Lei Cao, Wen Heng, Giang Bui, Truc Le, Ye Duan, and Qi Guo, "NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results," NTIRE 2017, New Trends in Image Restoration and Enhancement workshop and challenge on super-resolution in conjunction with CVPR 2017.
Recent research on super-resolution has progressed with the development of deep convolutional neural networks (DCNNs). In particular, residual learning techniques exhibit improved performance. In this paper, we develop an enhanced deep super-resolution network (EDSR) with performance exceeding that of current state-of-the-art SR methods. The significant performance improvement of our model is due to optimization by removing unnecessary modules from conventional residual networks.
The performance is further improved by expanding the model size while stabilizing the training procedure. We also propose a new multi-scale deep super-resolution system (MDSR) and training method that can reconstruct high-resolution images at different upscaling factors in a single model. The proposed methods show superior performance over the state-of-the-art methods on benchmark datasets and proved their excellence by winning the NTIRE 2017 Super-Resolution Challenge.
[C1] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, "Enhanced Deep Residual Networks for Single Image Super-Resolution," 2nd NTIRE: New Trends in Image Restoration and Enhancement workshop and challenge on image super-resolution in conjunction with CVPR 2017. (Challenge Winners, Best Paper Award of Workshop)
Challenge Winners, Best Paper Award of Workshop
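The module removed from conventional residual networks is batch normalization; the resulting EDSR-style block is just conv-ReLU-conv with a skip connection, optionally scaled for stable training of very wide models. A simplified sketch:

```python
import torch
import torch.nn as nn

class EDSRBlock(nn.Module):
    """Residual block with no batch normalization, as in EDSR.

    res_scale (e.g. 0.1) stabilizes training when the model is very wide.
    """
    def __init__(self, channels: int = 64, res_scale: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)

y = EDSRBlock()(torch.rand(1, 64, 48, 48))
```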
Human pose estimation (HPE) is challenging due to the need to accurately capture rapid and occluded body movements, often resulting in uncertain predictions. In the context of fast sports actions like baseball swings, existing HPE methods insufficiently leverage domain-specific prior knowledge about these movements. To address this gap, we propose the Baseball Player Pose Corrector (BPPC), an optimization framework that utilizes high-quality 3D standard motion data to refine 2D keypoints in baseball swing videos. BPPC operates in two stages: first, it aligns the 3D standard motion to test swing videos through action recognition, offset learning, and 3D-to-2D projection. Next, it applies movement-aware optimization to refine the keypoints, ensuring robustness to variations in swing patterns.
Notably, BPPC does not rely on additional datasets; it only requires manually annotated 3D standard motion data for baseball swings. Experimental results demonstrate that BPPC improves keypoint estimation accuracy by up to 2.4% on a baseball swing dataset, particularly enhancing keypoints with confidence scores below 0.5. Qualitative analysis further highlights BPPC's ability to correct rapidly moving joints, such as elbows and wrists.
[J10] Seunghyun Oh† and Heewon Kim, "Accurate Baseball Player Pose Refinement Using Motion Prior Guidance," ICT Express, 2025 (accepted)
† Undergraduate student
Model: GPT-OSS 120B (Open-source LLM)
Hardware: NVIDIA RTX 4090 GPU
Connection: Cloudflare Tunnel (Public access enabled)
Operating Hours: 8 AM to 4 AM the next day (KST, Korea Standard Time)
Rest Time: 4 AM to 8 AM daily (system maintenance)
Auto-restart: Daily at 8 AM & 6 PM (Tunnel refresh)
Updates: Auto-update every Sunday 2 AM KST
• ✓ : Answer based on researcher-verified information
Leave your question here; a lab member will review it and add an answer.
Downloading the model...
Report any bugs you find. An anonymous GitHub issue will be created.