Related Work

We propose GOTPR, a robust place recognition method designed for outdoor environments where GPS signals are unavailable. Unlike existing approaches that use point cloud maps, which are large and difficult to store, GOTPR leverages scene graphs generated from text descriptions and maps for place recognition. This method improves scalability by replacing point clouds with compact data structures, allowing robots to efficiently store and utilize extensive map data. Additionally, GOTPR eliminates the need for custom map creation by using publicly available OpenStreetMap data, which provides global spatial information. We evaluated its performance using the KITTI360Pose dataset with corresponding OpenStreetMap data, comparing it to existing point cloud-based place recognition methods. The results show that GOTPR achieves comparable accuracy while significantly reducing storage requirements. In city-scale tests, it completed processing within a few seconds, making it highly practical for real-world robotics applications.
Process of GOTPR. The method consists of three sequential steps: 1) scene graph generation, 2) scene graph candidate extraction, and 3) scene graph retrieval and scene selection. The inputs are a query text description and OSM data; the output is the ID of the matching scene.
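The three steps above can be sketched as follows. This is a minimal, illustrative outline only: the function names, the toy graph encoding, and the label-overlap matching are assumptions made for exposition, not the actual GOTPR implementation (which retrieves graphs with a learned joint embedding model).

```python
# Illustrative sketch of the three-step GOTPR pipeline (not the real implementation).

def generate_scene_graph(text_description):
    """Step 1: turn a text description into a toy scene graph (nodes + edges)."""
    # Toy parser: every alphabetic word becomes a node label;
    # edges fully connect the nodes, standing in for spatial relations.
    nodes = sorted({w for w in text_description.lower().split() if w.isalpha()})
    edges = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
    return {"nodes": nodes, "edges": edges}

def extract_candidates(osm_scene_graphs, query_graph, k=3):
    """Step 2: keep the k OSM scene graphs sharing the most node labels."""
    def overlap(graph):
        return len(set(graph["nodes"]) & set(query_graph["nodes"]))
    ranked = sorted(osm_scene_graphs, key=lambda item: overlap(item[1]), reverse=True)
    return ranked[:k]

def select_scene(candidates, query_graph):
    """Step 3: pick the best-matching candidate and return its scene ID."""
    best_id, _ = max(
        candidates,
        key=lambda item: len(set(item[1]["nodes"]) & set(query_graph["nodes"])),
    )
    return best_id


query = generate_scene_graph("a church beside a road near trees")
osm = [
    ("scene_001", {"nodes": ["church", "road", "trees"], "edges": []}),
    ("scene_002", {"nodes": ["river", "bridge"], "edges": []}),
]
candidates = extract_candidates(osm, query, k=2)
print(select_scene(candidates, query))  # -> scene_001
```

The scene IDs and node labels here are invented for the example; in GOTPR the candidates come from OpenStreetMap data.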
Network for scene graph retrieval. The inputs are the query text scene graph and the extracted OSM scene graph candidates; the output is the list of top-k scene IDs. The joint embedding model consists of multiple GPS convolution layers with self- and cross-modules.
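Once the joint embedding model has mapped the query scene graph and the OSM candidates into a shared vector space, selecting the top-k scenes reduces to a nearest-neighbor lookup. The sketch below assumes fixed-length embedding vectors and cosine similarity; the vectors shown are stand-ins, not model output.

```python
# Illustrative top-k retrieval over a shared embedding space (stand-in vectors).
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_scenes(query_embedding, candidate_embeddings, k=3):
    """Return the IDs of the k candidates most similar to the query."""
    ranked = sorted(
        candidate_embeddings.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [scene_id for scene_id, _ in ranked[:k]]


query = [1.0, 0.0]  # embedding of the query text scene graph (stand-in)
candidates = {  # embeddings of OSM scene graph candidates (stand-ins)
    "s1": [0.9, 0.1],
    "s2": [0.0, 1.0],
    "s3": [0.7, 0.7],
}
print(top_k_scenes(query, candidates, k=2))  # -> ['s1', 's3']
```

In practice such a lookup is done with a vector index rather than a full sort, which is consistent with the city-scale runtimes of a few seconds reported above.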
Examples of the experimental data. The data were generated from GPS coordinates (48.964117, 8.472481). Panels (a) and (b) are provided to facilitate comparison and understanding rather than being actual data used by GOTPR. The segmentations in (b) and (c) are included solely for visualization purposes to aid understanding, and the gray area in (f) represents the region overlapping with (e).
GPS samples collected in Toronto (⬤: included GPS, ⬤: excluded GPS).
(a) Street-view image (360 degree).
(b) OSM image (used for directional reference).
(c) Top-1 OSM (G.T.).
(d) Top-1 OSM scene graph (G.T.).
(e) Top-3 OSM.
(f) Top-3 OSM scene graph.
(g) Top-5 OSM.
(h) Top-5 OSM scene graph.
An example of place recognition results from GOTPR. The OSM scene graph that best matches the query text scene graph serves as the ground truth; in the example above it corresponds to the Top-1 OSM scene graph with the highest similarity. The results were generated from GPS coordinates (43.7594129828112, -79.46708294624517). The suffixes _n1,2 attached to some node labels in (a) are identifiers used to distinguish nodes sharing the same label. These identifiers are removed during the text scene graph generation process, resulting in the text scene graph shown in (b).
Street-view image (360 degree).
OSM image (used for directional reference).
User instruction | User-generated text description | Prompt for LLM (user-generated text to GOTPR format) | LLM-generated text description
@article{jung2025gotloc,
title={GOTLoc: General Outdoor Text-based Localization Using Scene Graph Retrieval with OpenStreetMap},
author={Jung, Donghwi and Kim, Keonwoo and Kim, Seong-Woo},
journal={arXiv preprint arXiv:2501.08575},
year={2025}
}