Details of a Researcher - SUGIURA, Komei

Papers - SUGIURA, Komei

Mobile Manipulation Instruction Generation From Multiple Images With Automatic Metric Enhancement

K Katsumata, M Kambara, D Yashima, R Korekata, K Sugiura

IEEE Robotics and Automation Letters 2025

DM2RM: dual-mode multimodal ranking for target objects and receptacles based on open-vocabulary instructions

R Korekata, K Kaneda, S Nagashima, Y Imai, K Sugiura

Advanced Robotics, 1-16 (arXiv:2408.07910) 2025

Pre-manipulation alignment prediction with parallel deep state-space and transformer models

M Kambara, K Sugiura

Advanced Robotics 39 (13), 806-816 2025

Pre-Manipulation Alignment Prediction for Open-Vocabulary Object Manipulation Based on End-Effector Trajectories

M Kambara, K Sugiura

2025 19th International Conference on Machine Vision and Applications (MVA), 1-5 2025

NaiLIA: Multimodal Retrieval of Nail Designs Based on a Relaxed Loss

雨宮佳音，小松拓実，八島大地，是方諒介，勝又圭，杉浦孔明

Proceedings of the 39th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI2025), 2Win555 2025

We focus on the task of retrieving nail design images based on dense intent descriptions, which represent long and multi-layered user intent for nail designs. This is challenging because such descriptions specify flexibly created paintings and pre-manufactured embellishments, as well as visual characteristics, spatial relationships, higher-order themes, and overall impressions. Existing vision-and-language foundation models often struggle to capture the interplay between paintings and embellishments, failing to incorporate multi-layered intent descriptions. To address this, we propose NaiLIA, a method that enables the retrieval of nail design images that comprehensively align with descriptions with dense user intent. Our approach estimates confidence scores for images that align with a given description and can be considered as positive examples but are not explicitly labeled (unlabeled positives), and incorporates this score into the loss function. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that the proposed method outperforms standard methods by 20.9 points in terms of recall@10.
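The abstract above reports results in terms of recall@10. As a minimal sketch of how such a metric is commonly computed, and not the authors' evaluation code, the following assumes each query has a ranked list of image IDs and a set of labeled positive IDs; the function names and the "fraction of positives retrieved" definition are illustrative assumptions.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking.

    Assumption: this is the "fraction of positives retrieved" definition;
    some benchmarks instead report whether at least one positive is in the top-k.
    """
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    hits = len(top_k & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(
    rankings: Sequence[Sequence[str]],
    relevants: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Average recall@k over all queries."""
    if len(rankings) != len(relevants) or not rankings:
        raise ValueError("rankings and relevants must be non-empty and aligned")
    scores = [recall_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)
```

Under this definition, images that match a description but were never labeled as positives count as misses, which is the kind of case the unlabeled-positive confidence score in the abstract is meant to address during training.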

Interactive robot action replanning using multimodal LLM trained from human demonstration videos

C Hori, M Kambara, K Sugiura, K Ota, S Khurana, S Jain, R Corcodel, ...

ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and … 1-5 2025

Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images

S Nagashima, K Sugiura

Proceedings of the IEEE/CVF International Conference on Computer Vision … (arXiv:2508.07847) 2025

Everyday Object Retrieval from Images with Text Based on Crosslingual Visual Prompts

戸倉健登，是方諒介，小松拓実，今井悠人，杉浦孔明

Proceedings of the 39th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI2025), 1Win452 2025

This study explores a task where a robot searches for images containing target objects based on user language queries from a large set of images captured in diverse indoor and outdoor environments. Both images with and without scene text are considered. For example, when searching with the query, "Pass me the red container of Sun-Maid raisins on the kitchen counter," the model ranks images containing a container labeled "Sun-Maid raisins" on the kitchen counter higher. However, linking visual semantics with scene text is challenging. Additionally, multimodal search requires large-scale, high-speed inference, making it impractical to rely solely on a multimodal large language model (MLLM). To address this, we introduce a Scene Text Visual Encoder, integrating an Aligned Representation with a narrative representation obtained using an MLLM based on Crosslingual Visual Prompting. Incorporating OCR results into the prompt further reduces hallucination. Experiments show that the proposed method outperforms multimodal foundation models across multiple benchmarks in standard evaluation metrics for ranking-based learning.

Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Takayuki Nishimura, Katsuyuki Kuyo, Motonari Kambara, Komei Sugiura

IROS, 9549-9556 2024

Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine

Kanta Kaneda, Shunya Nagashima, Ryosuke Korekata, Motonari Kambara, Komei Sugiura

9 (3), 2088-2095 2024

Trimodal Navigable Region Segmentation Model: Grounding Navigation Instructions in Urban Areas

N Hosomi, S Hatanaka, Y Iioka, W Yang, K Kuyo, T Misu, K Yamada, ...

IEEE Robotics and Automation Letters 9 (5), 4162-4169 2024

Nearest neighbor future captioning: generating descriptions for possible collisions in object placement tasks

T Komatsu, M Kambara, S Hatanaka, H Matsuo, T Hirakawa, T Yamashita, ...

Advanced Robotics 38 (18), 1265-1276 2024

Co-scale cross-attentional transformer for rearrangement target detection

H Matsuo, S Ishikawa, K Sugiura

Advanced Robotics 38 (18), 1277-1286 2024

Cooperative Control of Multiple CAs

T Nagai, T Nakamura, K Sugiura, T Taniguchi, Y Suzuki, M Hirata

Cybernetic Avatar, 151-207 2024

Deneb: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning

K Matsuda, Y Wada, K Sugiura

Proceedings of the Asian Conference on Computer Vision, 3570-3586 2024

Layer-Wise Relevance Propagation with Conservation Property for ResNet

S Otsuki, T Iida, F Doublet, T Hirakawa, T Yamashita, H Fujiyoshi, ...

European Conference on Computer Vision, 349-364 2024

Mask-Attention A3C: Visual Explanation of Action-State Value in Deep Reinforcement Learning

H Itaya, T Hirakawa, T Yamashita, H Fujiyoshi, K Sugiura

IEEE Access 12, 86553-86571 2024

Multimodal Target Localization with Landmark-Aware Positioning for Urban Mobility

N Hosomi, Y Iioka, S Hatanaka, T Misu, K Yamada, N Tsukamoto, ...

IEEE Robotics and Automation Letters 10 (1), 716-723 2024

Alternative Adapter Model: Visual Explanation Generation for Vision-Language Foundation Models

平野愼之助，飯田紡，杉浦孔明

Proceedings of the 38th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI2024), 1D3GS705 2024

In the modern era where deep learning is applied across a wide range of fields, the explainability of models is of paramount importance. However, existing methods are not optimized for vision-language foundation models, leading to lower explanation quality for such models. Therefore, this study proposes the Alternative Adapter Model, an explanation generation model tailored to vision-language foundation models. By introducing a Side Branch Network connected to the vision-language foundation model, the proposed method extracts features suitable for explanation generation. Furthermore, by implementing the Alternative Epoch Architecture, which dynamically changes the outputs of modules and the layers to be frozen, we address the issue of overly narrow focus areas. To evaluate the proposed method, experiments were conducted using the CUB-200-2011 dataset. The results demonstrate that the proposed method surpasses existing methods in mean IoU, Insertion Score, Deletion Score, and Insertion-Deletion Score, which are standard metrics for visual explanation generation tasks.
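Mean IoU, one of the metrics named in the abstract above, can be illustrated with a short sketch. This is not the authors' evaluation code: it assumes the explanation is a 2D saliency map with values in [0, 1], binarized at a fixed threshold and compared against a binary ground-truth mask. The threshold value and all function names are hypothetical.

```python
from typing import List, Sequence

Mask = List[List[int]]


def binarize(saliency: Sequence[Sequence[float]], threshold: float = 0.5) -> Mask:
    """Turn a saliency map into a binary mask (threshold is an assumed value)."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average IoU over a set of (prediction, ground truth) mask pairs."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and aligned")
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

An explanation that highlights only a small part of the object yields a small intersection relative to the union, so overly narrow focus areas are penalized by this metric.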

Classification of Intracranial EEG and Construction of a BMI Based on the Retention Mechanism

長嶋隼矢，兼田寛大，飯田紡，田口美紗，平田雅之，杉浦孔明

Proceedings of the 38th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI2024), 4N1GS101 2024

Speech impairments from conditions like Amyotrophic Lateral Sclerosis and muscular dystrophy severely restrict patient communication, affecting daily and social life. Decoding technology based on Electrocorticography (ECoG) is essential for supporting these patients' communication. In this study, we propose a novel architecture combining a specialized convolutional layer for electrode feature extraction and a retentive network for ECoG signal classification of motor imagery, outperforming all baselines in accuracy.
