WAFFLE: Multimodal Floorplan Understanding in the Wild
https://tau-vailab.github.io/WAFFLE
arXiv:2412.00955v2 [cs.CV] 3 Dec 2024
Abstract

A wealth of information is encoded by floorplans for various applications, such as 3D reconstruction [25] and floorplan-guided building navigation [27, 39]. However, prior data-driven techniques operating on floorplans mostly focus on extremely limited semantic domains (e.g. apartments) and geographical locations (often a single country), failing to cover the diversity needed for automatic understanding of floorplans in an unconstrained setting.

In this work, we introduce WAFFLE (WikipediA-Fueled FLoorplan Ensemble), a multimodal floorplan understanding dataset comprising diverse imagery spanning a variety of building types, geographical regions, historical eras, and data formats (as illustrated in Figure 1), along with comprehensive textual data. WAFFLE is derived from freely-available Internet images and metadata from the Wikimedia Commons platform. To turn noisy Internet data into this curated dataset with rich semantic annotations, we leverage state-of-the-art foundation models, using large language models (LLMs) and vision-language models (VLMs) to perform curation tasks with little or no supervision. This includes decomposing floorplans into visual elements, and structuring textual metadata, code, and OCR detections with LLMs. By combining these powerful tools, we build a new dataset for floorplan understanding with rich and diverse semantics.

In addition to serving as a challenging benchmark for prior work, we show the utility of this data for various building understanding tasks that were not feasible with previous datasets. By using high-level and localized semantic labels along with floorplan images in WAFFLE, we learn to predict building semantics and use them to generate floorplan images with the correct building type, along with optional conditioning on structural configurations. Grounded labels within images also provide supervision to segment areas corresponding to domain-specific architectural terms. As shown by these applications, WAFFLE opens the door for semantic understanding and generation of buildings in a diverse, real-world setting.

                 Rent3D++   CubiCasa5K   RPLAN    SD       R2V      FloorPlanCAD   WAFFLE
Building types   Resid.     Resid.       Resid.   Resid.   Resid.   ~4 (diamond)   >1K
Countries        UK         Fin.         Asia     Switz.   Japan    ?†             >100
#Categories*     ~15        ~80          ~13      91       22       35             unbounded‡
Real images?     yes        no           no       yes      yes      yes            yes

Table 1. A comparison between WAFFLE and other floorplan datasets. SD above stands for Swiss Dwellings. We can see that, in contrast to our proposed WAFFLE dataset, most existing datasets focus on a single building type in a specific area in the world, and consider a small, closed list of annotation values.
* Number of unique annotation values for labeled grounded regions or objects.
† Unspecified data source.
‡ Free text, on a subset of images.
(diamond) Contains floorplans of 100 buildings spanning residential buildings, schools, hospitals, and shopping malls.
2. Related Works

Floorplans in Computer Vision. Floorplans are a fundamental element of architectural design; as such, automatic understanding and generation of floorplans has drawn significant interest from the research community.

Several works aim to reconstruct floorplans, either from 3D scans [21, 47], RGB panoramas [2, 33, 46], room layouts [15] or combined modalities, such as sparse views and room-connectivity graphs [12]. Prior works also investigate the problem of alignment between floorplans and 3D point clouds depicting scenes [19]. Martin et al. [25] leverage floorplans of large-scale scenes to produce a unified reconstruction from disconnected 3D point clouds. Floorplans have also been utilized for navigation tasks. Several works predict position over a given floorplan, for a single image [39] or video sequences [6] depicting regions of the environment. Narasimhan et al. [27] train an agent to navigate in new environments by predicting corresponding labeled floorplans.

Some works specifically target recognition of semantic elements over both rasterized [9, 48] and vectorized [45] floorplan representations, as well as applying this to perform raster-to-vector conversion [18, 22, 24]. In our work, we are interested in understanding Internet imagery of diverse data types such as raster graphics and photographs or scans of real floorplans. In contrast to prior work that mostly focuses on a fixed set of semantic elements in residential apartments, such as walls, bathrooms, closets, and so on, we are interested in acquiring higher-level reasoning over a wide array of building types.

The problem of synthesizing novel floorplans, and other types of 2D layouts such as documents [30, 50], has also received considerable interest (see the recent survey by Weber et al. [41] for a comprehensive review). Earlier works generate floorplans from high-level constraints, such as room adjacencies [17, 26]. Later works are able to generate novel floorplans in more challenging settings, e.g. only given their boundaries [16, 43]. In our work, we show that SOTA text-to-image generation tools can be fine-tuned for generating floorplans of diverse building types, not only residential buildings, as explored by prior methods.

Floorplan Datasets. Prior datasets containing floorplan data are limited in structural and semantic diversity, typically restricted to residential building types such as apartments from specific geographic locations, often mined from real estate listings. For example, Rent3D++ [38] contains floorplans of 215 apartments located in London, and CubiCasa5K [18] contains floorplans of 5K Finnish apartments. The RPLAN [43] dataset contains 80K floorplans of apartments in Asia, further limited by various size and structural requirements (e.g., having only 3–9 rooms with specific proportions relative to the living room). The Swiss Dwellings dataset [34] includes floorplan data for 42K Swiss apartments, and the Modified Swiss Dwellings [37] dataset provides a filtered subset of this data with additional access graph information for floorplan auto-completion learning. The R2V [22] dataset introduces 815 Japanese residential building floorplans.

Additionally, WAFFLE differs substantially from prior works with regards to the sourcing and curation of data. Datasets of real floorplans, such as those previously mentioned, are constructed with tedious manual annotation. For example, specialists spent over 1K hours in the construction of FloorPlanCAD [11] to provide annotations of 30 categories (such as door, window, bed, etc.). Annotations may also derive from other input types rather than being direct annotations of floorplans; for instance, the Zillow Indoor Dataset [7] generates floorplans with user assistance from 360° panoramas, yielding plans for 1,524 homes after over 1.5K hours of manual annotation labor. To bypass such manual procedures, other works generate synthetic floorplans using predefined constraints [8]. By contrast, WAFFLE contains diverse Internet imagery of floorplans, including both original digital images and scans captured in the wild, and is curated with a fully automatic pipeline. See Table 1 for a comparison of the most related datasets with our proposed WAFFLE dataset.

Finally, there are also large-scale datasets of landmark-centric image collections, such as Google Landmarks [29, 42] and WikiScenes [44]. Along with photographs and similar imagery of these landmarks, such collections may include schematic data such as floorplans. While prior works focus on the natural imagery in these collections for tasks such as image recognition, retrieval, and 3D reconstruction, we specifically leverage the schematic diagrams found in such collections for layout generation and understanding.
3. WAFFLE: Internet Floorplans Dataset

In this section, we introduce WAFFLE (WikipediA-Fueled FLoorplan Ensemble), a new dataset of 18,556 floorplans, derived from Wikimedia Commons* and associated textual descriptions available on Wikipedia. WAFFLE contains floorplan images with paired structured metadata containing overall semantic information and spatially-grounded legends. Samples from our dataset are provided in Figure 2. We provide an interactive viewer of samples from the WAFFLE dataset, and additional details and statistics of our dataset, in the supplementary material. We proceed to describe the curation process and contents of WAFFLE.

Figure 2. Samples from WAFFLE. Above, we show images paired with their structured data, including the building name and type, country of origin, and their grounded architectural features. We also visualize the detected layout components (floorplan, legend, compass, and scale, as relevant) overlaid on top of the images. The two examples shown are: (1) Building name: St. Paul's Episcopal Church; Building type: church; Country: United States of America; Grounded architectural features: {organ room, rector, chapel, vestry, altar, tower, nave, apse}. (2) Building name: Château de Blois; Building type: castle; Country: France; Grounded legend: {1: barn, 2: chapel, 3: saint savior church, 6: prior's house, 8: residence of counts, 9: new home, 10: orchard in moat, 11: bretonry garden, 12: booth hall}.

* https://commons.wikimedia.org

3.1. Data Collection

Images and metadata in Wikimedia Commons are organized by hierarchical categories (WikiCategories). To find relevant data, we recursively scrape the WikiCategories Floor plans and Architectural drawings, extracting images and metadata from Wikimedia Commons and the text of linked Wikipedia articles. As many images contain valuable textual information (e.g. hints to the location of origin, legend labels, etc.), we also extract text from the images using the Google Vision API* for optical character recognition (OCR). Finally, we decompose images into constituent items by fine-tuning the detection model DETR [3] on a small subset of labeled examples to predict bounding boxes for common layout components (floorplans, legend boxes, compass, and scale icons).

The raw data includes a significant amount of noise along with floorplans, including similar topics such as maps and cross-sectional blueprints as well as other unrelated data. Therefore, we filter this data as follows:

Text-based filtering (LLM). We perform an initial text-only filtering stage by processing our images' textual metadata with an LLM to extract structured information. We provide the LLM with a prompt containing image metadata and ask it to categorize the image in multiple-choice format, providing it with a closed set of possible categories. These include positive categories such as floorplan and building as well as some negative categories (not floorplans) such as map and city.

* https://cloud.google.com/vision?hl=en
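To make this step concrete, the sketch below shows one way such a multiple-choice categorization could be assembled and parsed. The category list mirrors the "Image Category" prompt shown in the supplementary material (Figure 25), but the exact wording and the `query_llm` helper are illustrative assumptions rather than the precise pipeline code.

```python
import re

# Closed set of categories; the boolean marks categories treated as positive (relevant).
CATEGORIES = {
    "A": ("A floorplan", True),
    "B": ("A building", True),
    "C": ("A cross section of a building", False),
    "D": ("A garden/park", False),
    "E": ("A map", False),
    "F": ("A city/town", False),
    "G": ("A physics/mathematics topic", False),
    "H": ("I don't know", False),
}

def build_category_prompt(metadata: dict) -> str:
    """Assemble a multiple-choice prompt from an image's textual metadata."""
    options = "\n".join(f"({k}) {name}" for k, (name, _) in CATEGORIES.items())
    return (
        "Please read the following information extracted from Wikipedia related to an image:\n"
        f"* Entity category: {metadata.get('category', '')}\n"
        f"* Entity description: {metadata.get('description', '')}\n"
        f"* Image filename: {metadata.get('filename', '')}\n"
        f"* Texts that appear in the image (extracted with OCR): {metadata.get('ocr_text', '')}\n\n"
        "What is this file most likely a depiction of?\n"
        f"{options}\n"
        "Please choose one answer."
    )

def is_probable_floorplan(metadata: dict, query_llm) -> bool:
    """query_llm is any callable mapping a prompt string to the model's text answer."""
    answer = query_llm(build_category_prompt(metadata))
    match = re.search(r"\(([A-H])\)", answer) or re.search(r"\b([A-H])\b", answer)
    if match is None:
        return False  # unparseable answers are treated as negatives
    return CATEGORIES.get(match.group(1), ("", False))[1]
```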
Image-based filtering (CLIP). We use CLIP [31] image embeddings to filter for images likely to be floorplans. Firstly, as the WikiCategory Architectural drawings contains many non-floorplan images, we train a linear classifier on a balanced sample of items from the two WikiCategories and select images that are closer to those in the Floor plans WikiCategory. Moreover, we filter all images by comparing them with CLIP text prompt embeddings, following the use of CLIP for zero-shot classification. We compare to multiple prompts such as "A map", "A picture of people", and "A floorplan", aggregating scores for positive and negative classes and filtering out images with low scores. Finally, we train a binary classifier using high-scoring images and negative examples to adjust the zero-shot CLIP classifications for increased recall.
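The following minimal sketch illustrates the zero-shot prompt-comparison stage described above, using the Hugging Face transformers CLIP implementation. The specific checkpoint, prompt lists, and threshold are stand-ins; the full prefix/suffix prompt lists used for WAFFLE are given in the appendix.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative positive/negative prompt sets (subsets of the lists in the appendix).
POSITIVE = ["a floor plan", "an architectural layout", "a blueprint of a building"]
NEGATIVE = ["a map", "a picture of people", "an aerial view", "a 3d rendering of a building"]

@torch.no_grad()
def floorplan_score(image: Image.Image) -> float:
    """Softmax mass assigned to positive prompts minus the mass on negative prompts."""
    texts = POSITIVE + NEGATIVE
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    pos = probs[: len(POSITIVE)].sum().item()
    neg = probs[len(POSITIVE):].sum().item()
    return pos - neg

def keep_image(image: Image.Image, threshold: float = 0.0) -> bool:
    # Images whose aggregated positive score does not exceed the negative score are dropped.
    return floorplan_score(image) > threshold
```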
This step results in a final dataset of nearly 20K images. Each image is accompanied by the following raw data extracted from its Wikimedia Commons page and linked pages: the image file name, its Wikimedia Commons page content (including a textual description), a list of linked WikiCategories, the contents of linked Wikipedia pages (if present), OCR detections in the image, and bounding boxes of constituent layout components.

3.2. LLM-Driven Structured pGT Generation

Our raw data contains significant grounded information about each image in diverse formats, which we wish to systematically organize and structure for use in downstream tasks. To this aim, we harness the capabilities of large language models (LLMs) for distilling essential information from diverse textual data. In particular, we extract the following information (also illustrated in Figure 2) by prompting Llama-2 [36] with an instruction and relevant metadata fields: building name, building type (e.g. church, hotel, museum, etc.), location information (country, state, city), and a list of architectural features that are grounded in the image.

In general, the raw metadata contains considerable and diverse noise, involving multilingual content and multiple written representations of identical entities (e.g. Notre Dame Cathedral vs. Notre-Dame de Paris). To control for the source language, we employ prompts that instruct the LLM to respond in English and request translations when necessary. For linking representations of identical entities (also known as record linkage), we employ LinkTransformer [1] clustering along with various textual heuristics. We provide additional details, including prompts used, in the supplementary material, and proceed to describe our method for grounding architectural features in floorplans.

Architectural Feature Extraction and Grounding. Many floorplan images indicate architectural information either directly with text on the relevant region, or indirectly using a legend. To identify legends and architectural information marked directly on the floorplan, we examine the bounding boxes of floorplan and legend detections (using the model described in Section 3.1) and select OCR detections within these areas. We also extract additional legend information from image metadata by prompting the LLM with an instruction including page content from the image's Wikimedia Commons page or the code surrounding the image in its linked Wikipedia pages (as legends often appear in these locations). We further structure the legend outputs using regular expressions to identify key-value pairs. Finally, we link the legend keys and architectural features to the regions in the floorplan images coinciding with OCR detections, thus providing grounding for the semantic values of the image. See Figure 3 for an example.

Figure 3. We automatically extract legends and architectural features from the image raw data (illustrated on the left, either the image metadata or OCR detections) by prompting LLMs. We associate the keys with text detected in the image, yielding grounded regions associated with semantics.
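A minimal sketch of this key-value structuring and grounding step is shown below. The regular expression and the exact matching rule (case-insensitive comparison of OCR text against legend keys) are simplifying assumptions, not the precise heuristics used for WAFFLE.

```python
import re

LEGEND_ENTRY = re.compile(r"(\w[\w .'-]*?)\s*[:=]\s*([^,;\n]+)")

def parse_legend(text: str) -> dict:
    """Extract key-value pairs such as '1: chapel, 2: nave' from LLM or metadata output."""
    return {k.strip(): v.strip() for k, v in LEGEND_ENTRY.findall(text)}

def ground_legend(legend: dict, ocr_detections: list) -> list:
    """Attach each legend key to OCR boxes whose detected text matches that key.

    ocr_detections is a list of (text, bounding_box) pairs, where bounding_box
    is (x0, y0, x1, y1) in image coordinates.
    """
    grounded = []
    for key, value in legend.items():
        for text, box in ocr_detections:
            if text.strip().lower() == key.lower():
                grounded.append({"label": value, "key": key, "box": box})
    return grounded

# Example: keys detected in the image are linked to their semantic values.
legend = parse_legend("1: barn, 2: chapel, 6: prior's house")
ocr = [("1", (120, 80, 135, 95)), ("2", (300, 210, 315, 225))]
print(ground_legend(legend, ocr))
```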
3.3. Dataset Statistics

Our dataset contains nearly 20K images with accompanying metadata, in a range of formats. In particular, we note that our dataset contains over 1K vectorized floorplans. Additionally, our dataset contains more than 1K building types spread over more than 100 countries across the world, and over 11K different Grounded Architectural Features (GAFs) across almost 3K grounded images. We split into train and test sets (18,259 and 297 images respectively) by selecting according to country (train: 50 countries; test: 57 countries), thus ensuring disjointedness with regards to buildings and preventing data leakage.

Data Quality Validation. We manually inspect the test set images, removing images that do not contain a valid floorplan. Based on this validation, we find that 89% are indeed relevant floorplan images. We find this level of noise acceptable for training models on in-the-wild data, while the manual filtering assures a clean test set for evaluation. In addition, we manually inspect the quality of our generated pGTs. We find that 89% of the building names, 85% of the building types and 96% of the countries of origin are accurately labeled (considering 100 random data samples).
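The country-disjoint split described above can be implemented with a few lines; the sketch below is an illustration under the assumption that each image carries a country pseudo-label, and the set of test countries is a placeholder for the 57 countries actually reserved for testing.

```python
from collections import defaultdict

def split_by_country(items, test_countries):
    """Assign every image to train or test according to its country pseudo-label,
    so that no country (and hence no building) appears in both splits."""
    splits = defaultdict(list)
    for item in items:
        split = "test" if item.get("country") in test_countries else "train"
        splits[split].append(item["id"])
    return splits

example = [
    {"id": "blois.jpg", "country": "France"},
    {"id": "st_pauls.png", "country": "United States of America"},
]
print(split_by_country(example, test_countries={"France"}))
```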
4. Experiments

In this section, we perform several experiments applying our dataset to both discriminative and generative building understanding tasks. For all tasks, we use the train-test split outlined in Section 3.3. Please refer to the supplementary material for further training details.

4.1. Building Type Understanding

Task description. We test the ability to predict building-level semantics from a floorplan, similarly to a human who might look at a floorplan and make an educated guess as to what type of building it depicts. To learn this understanding, we fine-tune CLIP with a contrastive objective on paired images and building type pseudo-labels from WAFFLE. Our fine-tuned model (CLIP_FT) is expected to adjust CLIP to assign floorplan image embeddings close to those of relevant building types, allowing for subsequent retrieval or classification with floorplan images as input. We test the extent to which this understanding has been learned in practice with standard retrieval metrics, evaluating Recall@k for k in {1, 5, 8, 16} and Mean Reciprocal Rank (MRR).

           R@1      R@5      R@8      R@16     MRR
CLIP       1.5%     7.6%     10.3%    19.7%    0.07
CLIP_FT    11.8%    34.1%    40.0%    52.9%    0.23

Table 2. Results on CLIP retrieval of building types, for CLIP before and after fine-tuning on our dataset. We report Recall@k (R@k) for k in {1, 5, 8, 16} and Mean Reciprocal Rank (MRR) for these models, evaluated on our test set. As seen above, fine-tuning on WAFFLE significantly improves retrieval metrics.

Results. Results for fine-tuning CLIP for building type understanding are shown in Table 2. As is seen there, CLIP_FT significantly outperforms the base model in retrieving the correct building type pseudo-labels, hence showing a better understanding of their global semantics.
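The retrieval metrics reported above are standard; the short sketch below shows how Recall@k and MRR could be computed from image and building-type embeddings. It assumes L2-normalized embeddings and is not tied to any particular CLIP implementation.

```python
import numpy as np

def retrieval_metrics(image_embs, type_embs, gt_indices, ks=(1, 5, 8, 16)):
    """Compute Recall@k and MRR for building-type retrieval.

    image_embs: (N, D) L2-normalized floorplan image embeddings.
    type_embs:  (C, D) L2-normalized text embeddings of the candidate building types.
    gt_indices: (N,) index of the correct building type for each image.
    """
    sims = image_embs @ type_embs.T                      # (N, C) cosine similarities
    order = np.argsort(-sims, axis=1)                    # candidates sorted best-first
    ranks = np.array([np.where(order[i] == gt_indices[i])[0][0] + 1
                      for i in range(len(gt_indices))])  # 1-based rank of the correct type
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in ks}
    mrr = float(np.mean(1.0 / ranks))
    return recall_at_k, mrr
```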
4.2. Open-Vocabulary Floorplan Segmentation

Task description. To model localized semantics within floorplans, we use the GAFs in WAFFLE to fine-tune a text-driven segmentation model. We adopt the open-vocabulary text-guided segmentation model CLIPSeg [23] and perform fine-tuning on the subset of these grounded images. To provide supervision, we use the values of the GAFs as input text prompts for the segmentation model and the OCR bounding box regions of the associated grounded values as segmentation targets. This yields partial ground truth supervision; for a text query, we use OCR bounding box regions corresponding to text labels that semantically match the query (implemented via text embedding similarity) as positive targets and the remaining bounding box regions as negative targets. To prevent leakage from the written text in the images, we perform inpainting with Stable Diffusion [32] to replace the contents of the OCR bounding boxes. As our inpainting process may cause artifacts, for evaluation purposes we manually select images that do not contain GAFs. We follow prior work [23] and report mean Intersection over Union (mIoU) and Average Precision (AP). The mIoU metric requires a threshold, which we empirically set to 0.25. AP is a threshold-agnostic metric that measures the area under the recall-precision curve, quantifying to what extent it can discriminate between correct and erroneous matches. In addition to comparing against the pretrained CLIPSeg model, we compare against the closed-vocabulary segmentation model provided by CubiCasa5K (CC5K) [18] over a subset of residential buildings in our test set (evaluating semantic regions which this model was trained on).

        CC5K*    CLIPSeg   Ours
AP      0.138    0.157     0.226
mIoU    0.057    0.066     0.131

Table 3. Open-Vocabulary Floorplan Segmentation Evaluation. We compare against a pretrained CLIPSeg model and against a closed-vocabulary segmentation model (CC5K). As illustrated above, our method improves localization across all evaluation metrics. * Evaluated only over a subset of residential buildings.

Figure 4. Comparison of open-vocabulary segmentation probability map results (rows show the queries Nave, Court, and Kitchen; columns show GT, CC5K*, CLIPSeg, and Ours). We show the input images in the first column, with the corresponding GT regions in red. * Note that CC5K is a closed-vocabulary model designed for residential floorplan understanding, and therefore we cannot compare to it over additional building types (such as castles and cathedrals illustrated above). In addition to improving on the base CLIPSeg segmentation model, we outperform the strongly-supervised CC5K, suggesting that this model cannot generalize well beyond its training set distribution.

Results. Quantitative results are reported in Table 3, showing a clear boost in performance across both metrics. This is further reflected in our qualitative results in Figure 4. In addition, the results on residential buildings of the strongly-supervised residential floorplan understanding model [18] are inferior, likely because the latter model uses supervision from a specific geographical region and style alone (a limitation of existing datasets, as we describe in Section 2). Overall, both metrics show that there is much room for improvements with future techniques leveraging our data for segmentation-related tasks.
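The sketch below illustrates how the partial ground truth described above could be assembled for a single text query from grounded OCR boxes, using sentence-transformers for the text-embedding match. The embedding model name and similarity threshold are illustrative assumptions, not the exact values used for WAFFLE.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def partial_gt_mask(query, ocr_boxes, image_hw, sim_threshold=0.7):
    """Build positive/negative supervision masks for one text query.

    ocr_boxes: list of (label_text, (x0, y0, x1, y1)) grounded OCR detections.
    Returns a target mask (1 = positive, 0 = negative) and a validity mask
    marking the box regions that provide supervision at all.
    """
    h, w = image_hw
    target = np.zeros((h, w), dtype=np.float32)
    valid = np.zeros((h, w), dtype=bool)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    for text, (x0, y0, x1, y1) in ocr_boxes:
        sim = util.cos_sim(query_emb, encoder.encode(text, convert_to_tensor=True)).item()
        valid[y0:y1, x0:x1] = True
        if sim >= sim_threshold:          # label semantically matches the query
            target[y0:y1, x0:x1] = 1.0
    return target, valid
```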
4.3. Benchmark for Semantic Segmentation

Following prior work [18, 22, 43], we consider segmentation of rasterized floorplan images into fine-grained localized categories, as locating elements such as walls has applications to various downstream tasks. To provide a new benchmark for performance on the diverse floorplans in WAFFLE, we manually annotate pixel-level segmentation maps for more than a hundred images over categories applicable to most building types: wall, door, window, interior and background. As our dataset contains a variety of data types, we annotate SVG-formatted images, which can be easily manually annotated by region.

We illustrate the utility of this benchmark by evaluating a standard existing model, namely the supervised segmentation model provided by CC5K [18]. We also evaluate a modern diffusion-based architecture trained with the same supervised data to predict wall locations as black-and-white images, to explore whether architectural modifications can yield improved performance. Further details of these models are provided in the supplementary material.

            Walls    Doors    Windows   Interior   BG
Precision   0.737    0.201    0.339     0.799      0.697
Recall      0.590    0.163    0.334     0.521      0.912
IoU         0.488    0.099    0.202     0.461      0.653

Table 4. Benchmark for Semantic Segmentation Evaluation. We benchmark prior work, reporting performance over the CubiCasa5K [18] segmentation model, on common grounded categories. Note that background is denoted as BG above. As illustrated, WAFFLE serves as a challenging benchmark for existing work.

Results. Table 4 includes a quantitative evaluation of the existing model provided by CC5K on our benchmark. As illustrated in the table, our dataset provides a challenging benchmark for existing models, yielding low performance, particularly for more fine-grained categories, such as doors and windows. In addition to these results, we find that the modern diffusion architecture shows significantly better performance at localization of walls, generating binary maps with higher quantitative metric values (+1.2% in precision, +36.4% in recall and +29.5% in IoU, in comparison to the values obtained on the wall category in Table 4). This additional experiment shows promise in using stronger architectures for improving localized knowledge on weakly supervised in-the-wild data to ultimately approach the goal of pixel-level localization within diverse floorplans. Qualitative results from both models, along with ground truth segmentations, are provided in the supplementary material.
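For reference, the per-category numbers above follow the standard definitions of precision, recall, and IoU over binary masks; a minimal computation is sketched below (not tied to any specific benchmark code).

```python
import numpy as np

def binary_mask_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Precision, recall and IoU between boolean prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    iou = tp / (np.logical_or(pred, gt).sum() + eps)
    return float(precision), float(recall), float(iou)
```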
4.4. Text-Conditioned Floorplan Generation

Task description. Inspired by the rich literature on automatic floorplan generation, we fine-tune a text-to-image generation model on paired images and pGT textual data from WAFFLE for text-guided generation of floorplan images. We adopt the latent diffusion model Stable Diffusion [32] (SD), using prompts of the form "A floor plan of a <building type>" which use the building type pseudo-labels from our LLM-extracted data. We balance training samples across building names and types to avoid overfitting on common categories. We evaluate the realism of these generations using Fréchet Inception Distance (FID) [14] as well as Kernel Maximum Mean Discrepancy (KMMD), since FID can be unstable on small datasets [5]. Similar to prior work [4, 28, 40], we measure KMMD on Inception features [35]. To measure semantic correctness, we measure CLIP similarity (using pretrained CLIP) between generations and prompts. All metrics were calculated per building type, averaging over the most common 15 types.

        FID (lower is better)   KMMD (lower is better)   CLIP Sim. (higher is better)
SD      194.8                   0.10                     24.9
SD_FT   145.3                   0.07                     25.6

Table 5. Results on generated images, using a base and fine-tuned Stable Diffusion (SD) model. We compare the quality of the generated images (FID, KMMD) and the similarity to the given prompt (CLIP Sim.). As illustrated above, SD fine-tuning improves both realism and semantic correctness of image generations.

Results. Table 5 summarizes quantitative metrics, comparing floorplan generation using the base SD model with our fine-tuned version. These provide evidence that our model generates more realistic floorplan images that better adhere to the given prompts. Supporting this, we provide examples of such generated floorplans in Figure 5, observing the diversity and semantic layouts predicted by our model for various prompts. We note that our model correctly predicts the distinctive elements of each building type, such as the peripheral towers of castles and numerous side rooms for patient examinations in hospitals. Such distinctive elements are mostly not observed in the pretrained SD model, which generally struggles at generating floorplans. To further illustrate that our generated floorplans better convey the building type specified within the target text prompt, we conducted a user study. Given a pair of images, one generated with the pretrained model and one with our fine-tuned model, users were asked to select the image that best conveys the target text prompt. We find that 70.42% of the time users prefer our generated images, in comparison to the generations of the pretrained model. Additional details regarding this study are provided in the supplementary.
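Once a fine-tuned checkpoint is available, sampling with the building-type prompts reduces to a standard diffusers call, as in the sketch below. The checkpoint path "waffle/sd-floorplans" is a hypothetical placeholder for the fine-tuned SD_FT weights, and the sampling settings are generic defaults rather than the exact inference configuration used for the paper's figures.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "waffle/sd-floorplans",       # placeholder path for the fine-tuned SD_FT weights
    torch_dtype=torch.float16,
).to("cuda")

for building_type in ["castle", "hospital", "library"]:
    prompt = f"A floor plan of a {building_type}"
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(f"floorplan_{building_type}.png")
```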
Figure 5. Examples for generated floorplans for various building types, using the prompt "A floor plan of a <building type>" (corresponding types are shown on top: School, Palace, Church, Castle, Hospital, Hotel, Library). The first row shows samples from the pretrained SD model, and the bottom three show results from the model fine-tuned on WAFFLE. As seen above, pretrained SD struggles at generating floorplans in general and often yields results that do not structurally resemble real floorplans. By contrast, our fine-tuned model can correctly generate fine-grained architectural structures, such as towers in castles or long corridors in libraries.

Figure 6. Boundary-conditioned generation. The first column shows images in WAFFLE, the second column shows automatically-extracted boundary masks, and the following columns show floorplan image generations conditioned on these boundary masks, with diverse building types provided as prompts.

Figure 7. Structure-conditioned generation. For each image pair, the first image displays a building layout condition, taken from the existing CubiCasa5K dataset, which defines foreground (white) and background (black) regions, walls (red), doors (blue), and windows (cyan). The second image shows a generation conditioned on this layout, using the ControlNet-based model described in Section 4.5. Our image data and metadata enable the generation of diverse building types with structural constraints, without requiring any pixel-level annotations of images in WAFFLE. Notably, this succeeds even when the constraint is highly unusual for the corresponding building type, such as the condition above for cathedral (as cathedrals are usually constructed in a cross shape). (Conditions and prompts shown: School, Castle, Library, Cathedral.)
4.5. Structure-Conditioned Floorplan Generation

Task description. Structural conditions for floorplan generation have attracted particular interest, as architects may wish to design floorplans given a fixed building shape or desired room configuration [16, 17, 26, 43]. Unlike existing works that consider residential buildings exclusively, we operate on the diverse set of building types and configurations found in WAFFLE, providing conditioning to our generative model SD_FT by fine-tuning ControlNet [49] for conditional generation, combined with applying text prompts reflecting various building types.

We are challenged by the fact that data in WAFFLE, captured in-the-wild, does not contain localized structural annotations (locations of walls, doors, windows or other features) such as those painstakingly annotated in some existing datasets. Therefore, we leverage our data in an unsupervised manner to achieve conditioning. To condition on the desired building boundary, we approximate the outer shape for all images in the training set via an edge detection algorithm. To condition on more complex internal structures, we instead train ControlNet on image-condition pairs derived from the existing annotated CubiCasa5K dataset [18]. By using SD_FT as the backbone for this ControlNet, reducing the conditioning scale (assigning less weight to the conditioning input during inference) and using a relatively high classifier-free guidance scale factor (assigning higher weight to the prompt condition), we combine the ability of SD_FT to generate diverse building types with the structural constraints derived from external annotated data of residential buildings.
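The inference-time combination of a reduced conditioning scale with a high classifier-free guidance scale maps directly onto standard diffusers ControlNet arguments, as sketched below. The conditioning scale of 0.5 and CFG scale of 15.0 follow the values reported in the supplementary material; the model paths are hypothetical placeholders for the fine-tuned SD_FT backbone and the layout-conditioned ControlNet.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "waffle/controlnet-structure", torch_dtype=torch.float16  # placeholder path
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "waffle/sd-floorplans",        # placeholder path for the SD_FT backbone
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

condition = load_image("layout_condition.png")   # walls/doors/windows layout map
image = pipe(
    "A floor plan of a cathedral",
    image=condition,
    controlnet_conditioning_scale=0.5,   # reduced weight on the structural condition
    guidance_scale=15.0,                 # relatively high classifier-free guidance
    num_inference_steps=20,
).images[0]
image.save("cathedral_structure_conditioned.png")
```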
Results. We provide structurally-conditioned generations in Figures 6–7 for various building types. For boundary conditioning, the condition shape is extracted from existing images in our dataset. For structure conditioning, the conditions are derived from the annotations in the external CubiCasa5K dataset, using categories relevant to the diverse buildings in WAFFLE. These examples illustrate that our model is able to control the contents and style of the building according to the text prompt while adhering to the overall layout of the condition. This again demonstrates that the model has learned the distinct characteristics of each building type. In addition, we note that this succeeds even when the structural constraint is highly unusual for the paired building type, such as the cathedral in Figure 7 which deviates from the typical layout of a cathedral (usually constructed in a cross shape) in order to obey the condition.

5. Conclusion

We have presented the WAFFLE dataset of diverse floorplans in the wild, curated from Internet data spanning diverse building types, geographic locations, and architectural features. To construct a large dataset of images with rich, structured metadata, we leverage SOTA LLMs and multimodal representations for filtering and extracting structure from noisy metadata, reflecting both the global semantics of buildings and localized semantics grounded in regions within floorplans. We show that this data can be used to train models for building understanding tasks, enabling progress on both discriminative and generative tasks which were previously not feasible.

While our dataset expands the scope of floorplan understanding to new unexplored tasks, it still has limitations. As we collect diverse images in-the-wild, our data naturally contains noise (mitigated by our data collection and cleaning pipeline) which could affect downstream performance. In addition, while our dataset covers a diverse set of building types, it leans towards historic and religious buildings, possibly introducing bias towards these semantic domains. We focus on 2D floorplan images, though we see promise in our approach and data for spurring further research in adjacent domains, such as 3D building generation and architectural diagram understanding in general. In particular, although our work does not consider the 3D structure of buildings, we see promise in the use of our floorplans for aligning in-the-wild 3D point clouds or producing 3D-consistent room and building layouts. Finally, our work could provide a basis for navigation tasks which require indoor spatial understanding, such as indoor and household robotics. We envision that future architectural understanding models enabled by datasets such as WAFFLE will explore new challenging tasks such as visual question answering for floorplans, which could be enabled by our textual metadata and open-vocabulary architectural features.
References

[1] Abhishek Arora and Melissa Dell. LinkTransformer: A unified package for record linkage with transformer language models. arXiv preprint arXiv:2309.00789, 2023.
[2] Ricardo Cabral and Yasutaka Furukawa. Piecewise planar and compact floorplan reconstruction from images. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 628–635. IEEE, 2014.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[4] Eric Ming Chen, Jin Sun, Apoorv Khandelwal, Dani Lischinski, Noah Snavely, and Hadar Averbuch-Elor. What's in a decade? Transforming faces through time. In Computer Graphics Forum, volume 42, pages 281–291. Wiley Online Library, 2023.
[5] Min Jin Chong and David Forsyth. Effectively unbiased FID and Inception Score and where to find them, 2020.
[6] Hang Chu, Dong Ki Kim, and Tsuhan Chen. You are here: Mimicking the human thinking process in reading floorplans. In Proceedings of the IEEE International Conference on Computer Vision, pages 2210–2218, 2015.
[7] Steve Cruz, Will Hutchcroft, Yuguang Li, Naji Khosravan, Ivaylo Boyadzhiev, and Sing Bing Kang. Zillow Indoor Dataset: Annotated floor plans with 360° panoramas and 3D room layouts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2133–2143, 2021.
[8] Mathieu Delalandre, Ernest Valveny, Tony Pridmore, and Dimosthenis Karatzas. Generation of synthetic documents for performance evaluation of symbol recognition & spotting systems. International Journal on Document Analysis and Recognition (IJDAR), 13(3):187–207, 2010.
[9] Samuel Dodge, Jiu Xu, and Björn Stenger. Parsing floor plan images. In 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), pages 358–361. IEEE, 2017.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] Zhiwen Fan, Lingjie Zhu, Honghua Li, Siyu Zhu, and Ping Tan. FloorPlanCAD: A large-scale CAD drawing dataset for panoptic symbol spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021.
[12] Arnaud Gueze, Matthieu Ospici, Damien Rohmer, and Marie-Paule Cani. Floor plan reconstruction from sparse views: Combining graph neural network with constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1583–1592, 2023.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018.
[15] Sepidehsadat Hosseini, Mohammad Amin Shabani, Saghar Irandoust, and Yasutaka Furukawa. PuzzleFusion: Unleashing the power of diffusion models for spatial puzzle solving. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[16] Ruizhen Hu, Zeyu Huang, Yuhan Tang, Oliver Van Kaick, Hao Zhang, and Hui Huang. Graph2Plan: Learning floorplan generation from layout graphs. ACM Transactions on Graphics (TOG), 39(4):118–1, 2020.
[17] Hao Hua. Irregular architectural layout synthesis with graphical inputs. Automation in Construction, 72:388–396, 2016.
[18] Ahti Kalervo, Juha Ylioinas, Markus Häikiö, Antti Karhu, and Juho Kannala. CubiCasa5K: A dataset and an improved multi-task model for floorplan image analysis. In Image Analysis: 21st Scandinavian Conference, SCIA 2019, Norrköping, Sweden, June 11–13, 2019, Proceedings 21, pages 28–40. Springer, 2019.
[19] Ryan S Kaminsky, Noah Snavely, Steven M Seitz, and Richard Szeliski. Alignment of 3D point clouds to overhead images. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 63–70. IEEE, 2009.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Chen Liu, Jiaye Wu, and Yasutaka Furukawa. FloorNet: A unified framework for floorplan reconstruction from 3D scans. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–217, 2018.
[22] Chen Liu, Jiajun Wu, Pushmeet Kohli, and Yasutaka Furukawa. Raster-to-vector: Revisiting floorplan transformation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2195–2203, 2017.
[23] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7086–7096, 2022.
[24] Xiaolei Lv, Shengchu Zhao, Xinyang Yu, and Binqiang Zhao. Residential floor plan recognition and reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16717–16726, 2021.
[25] Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3D jigsaw puzzle: Mapping large indoor spaces. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13, pages 1–16. Springer, 2014.
[26] Paul Merrell, Eric Schkufza, and Vladlen Koltun. Computer-generated residential building layouts. In ACM SIGGRAPH Asia 2010 papers, pages 1–12. 2010.
[27] Medhini Narasimhan, Erik Wijmans, Xinlei Chen, Trevor Darrell, Dhruv Batra, Devi Parikh, and Amanpreet Singh. Seeing the un-scene: Learning amodal semantic maps for room navigation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 513–529. Springer, 2020.
[28] Atsuhiro Noguchi and Tatsuya Harada. Image generation from small datasets via batch statistics adaptation, 2019.
[29] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465, 2017.
[30] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar Averbuch-Elor. READ: Recursive autoencoders for document layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 544–545, 2020.
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[33] Mohammad Amin Shabani, Weilian Song, Makoto Odamaki, Hirochika Fujiki, and Yasutaka Furukawa. Extreme structure from motion for indoor panoramas without visual overlaps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5703–5711, 2021.
[34] Matthias Standfest, Michael Franzen, Yvonne Schröder, Luis Gonzales Medina, Yarilo Villanueva Hernandez, Jan Hendrik Buck, Yen-Ling Tan, Milena Niedzwiecka, and Rachele Colmegna. Swiss Dwellings: a large dataset of apartment models including aggregated geolocation-based simulation results covering viewshed, natural light, traffic noise, centrality and geometric analysis, 2022.
[35] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[36] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[37] Casper van Engelenburg, Seyran Khademi, Fatemeh Mostafavi, Matthias Standfest, and Michael Franzen. Modified Swiss Dwellings: a machine learning-ready dataset for floor plan auto-completion at scale, 2023.
[38] Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X Chang, and Manolis Savva. Plan2Scene: Converting floorplans to 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10733–10742, 2021.
[39] Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Lost shopping! Monocular localization in large indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision, pages 2695–2703, 2015.
[40] Yaxing Wang, Abel Gonzalez-Garcia, David Berga, Luis Herranz, Fahad Shahbaz Khan, and Joost van de Weijer. MineGAN: Effective knowledge transfer from GANs to target domains with few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9332–9341, 2020.
[41] Ramon Elias Weber, Caitlin Mueller, and Christoph Reinhart. Automated floorplan generation in architectural design: A review of methods and applications. Automation in Construction, 140:104385, 2022.
[42] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584, 2020.
[43] Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019.
[44] Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, and Noah Snavely. Towers of Babel: Combining images, language, and 3D geometry for learning multimodal vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 428–437, 2021.
[45] Bingchen Yang, Haiyong Jiang, Hao Pan, and Jun Xiao. VectorFloorSeg: Two-stream graph attention network for vectorized roughcast floorplan segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1358–1367, 2023.
[46] Shang-Ta Yang, Fu-En Wang, Chi-Han Peng, Peter Wonka, Min Sun, and Hung-Kuo Chu. DuLa-Net: A dual-projection network for estimating room layouts from a single RGB panorama. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3363–3372, 2019.
[47] Yuanwen Yue, Theodora Kontogianni, Konrad Schindler, and Francis Engelmann. Connecting the dots: Floorplan reconstruction using two-level queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 845–854, 2023.
[48] Zhiliang Zeng, Xianzhi Li, Ying Kin Yu, and Chi-Wing Fu. Deep floor plan recognition using a multi-task network with room-boundary-guided attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9096–9104, 2019.
[49] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[50] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH Lau. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4):1–15, 2019.
Appendix

A. Interactive Visualization Tool

Please see the attached HTML file (waffle.html) for an interactive visualization of data from the WAFFLE dataset.

B. Additional Dataset Details

We proceed to describe the creation of our WAFFLE dataset in the sections below, including details on curating, filtering, and generating pseudo-ground truth labels.

Prefixes: "an illustration of ", "a drawing of ", "a sketch of ", "a picture of ", "a photo of ", "a document of ", "an image of ", "a visual representation of ", "a graphic of ", "a rendering of ", "a diagram of ", ""
Negative Prefixes: "a 3d simulation of ", "a 3d model of ", "a 3d rendering of "
Positive Suffixes: "a floor plan", "an architectural layout", "a blueprint of a building", "a layout design"
Negative Suffixes: "a map", "a building", "people", "an aerial view", "a cityscape", "a landscape"
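The sketch below illustrates how such prefix/suffix lists could be crossed to form the positive and negative prompt sets fed to CLIP in the filtering stage of Section 3.1. The exact combination rule (in particular, how the 3D-specific negative prefixes interact with the suffixes) is an assumption for illustration.

```python
from itertools import product

PREFIXES = ["an illustration of ", "a drawing of ", "a photo of ", ""]  # subset shown
NEGATIVE_PREFIXES = ["a 3d simulation of ", "a 3d model of ", "a 3d rendering of "]
POSITIVE_SUFFIXES = ["a floor plan", "an architectural layout",
                     "a blueprint of a building", "a layout design"]
NEGATIVE_SUFFIXES = ["a map", "a building", "people", "an aerial view",
                     "a cityscape", "a landscape"]

def build_prompt_sets():
    """Cross prefixes with suffixes to obtain positive and negative CLIP prompts."""
    positive = [p + s for p, s in product(PREFIXES, POSITIVE_SUFFIXES)]
    negative = [p + s for p, s in product(PREFIXES, NEGATIVE_SUFFIXES)]
    # 3D-specific prefixes are treated as negatives even with positive suffixes.
    negative += [p + s for p, s in product(NEGATIVE_PREFIXES, POSITIVE_SUFFIXES)]
    return positive, negative

positive_prompts, negative_prompts = build_prompt_sets()
print(len(positive_prompts), len(negative_prompts))
```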
Figure 11. Distribution of the grounded architectural features (log scale), among almost 3K grounded images, 25K instances grounded, and 11K unique features.

Evaluation. We manually annotate 95 images for evaluation with 27 common GAFs. The most common building types in our evaluation set are churches, castles and residential buildings.

* en/training/text2image
* https://docs.opencv.org/4.x/da/d22/tutorial_py_canny.html
* https://docs.opencv.org/4.x/d9/d8b/tutorial_py_contours_hierarchy.html
* https://huggingface.co/docs/diffusers/v0.18.2/en/training/controlnet
* https://www.sbert.net/
* https://huggingface.co/blog/controlnet
Columns (building types, left to right): Residential building, Historic building, Monastery, Cathedral, Museum, Building, Hospital, Theater, Church, Temple, School, Palace, House, Castle, Hotel.

FID (lower is better)
SD      284.6  159.6  198.4  188.3  285.1  199.2  194.3  212.2  159.2  180.6  139.6  224.0  165.9  141.6  189.5
SD_FT   148.3  146.1  156.8  142.0  147.7  114.5  100.9  158.0  122.3  137.6  119.1  168.6  176.9  102.0  238.4

KMMD (lower is better)
SD      0.13  0.06  0.11  0.09  0.16  0.09  0.13  0.12  0.08  0.07  0.05  0.13  0.07  0.06  0.08
SD_FT   0.07  0.05  0.06  0.04  0.11  0.05  0.05  0.08  0.03  0.05  0.04  0.11  0.09  0.03  0.17

CLIP Sim. (higher is better)
SD      25.3  25.2  24.4  24.2  25.3  24.0  24.2  25.8  25.9  25.2  25.5  25.1  24.7  24.6  24.6
SD_FT   25.9  25.6  25.6  25.4  25.3  24.5  25.1  26.7  26.5  26.1  25.5  26.0  25.6  25.8  24.5

Table 6. Quantitative results on floorplan image generation split by building type, comparing the quality of images generated with the pretrained model and our fine-tuned model.
model's UNet weights. Images and conditions are resized using the method described above. We train the model for 20K iterations with a batch size of 4, a learning rate of 10^-5 and the Adam optimizer, on one A5000 GPU. During inference we use CFG scale 15.0 and condition scale 0.5, to fuse the style of the input prompt (learned from floorplans in WAFFLE) and the structure condition.

User study. Each study contained 36 randomly-generated image pairs, with text prompts mentioning various building types that were sampled from the 100 most common types. Overall, thirty-one users participated in the study, resulting in a total of 1,116 image pairs (one generated from the pretrained model, and the other generated from the fine-tuned model) that were averaged for obtaining the final results reported in the main paper. Participants were provided with the following instructions: In this user study you will be presented with a series

We created a benchmark of 110 SVG images, containing wall, window and door annotations. We included SVG images from our test set. To obtain additional SVG images, we also searched for SVGs that were filtered during the dataset filtering step. Then, we used Inkscape*, which allowed us to easily annotate full SVG components at once instead of doing it pixel-wise. This made the manual annotation process less tedious and more accurate.

* https://inkscape.org/

C.5. Wall Segmentation with a Diffusion Model

We apply a diffusion-based architecture to wall segmentation by training ControlNet [49] using CubiCasa5K [18] (CC5K) layout maps as the target image to generate and input images as conditions. In particular, we convert CC5K annotations into binary images by denoting walls with black pixels and use these as supervision for binary wall segmentation. We initialize with the CompVis/stable-diffusion-v1-4 checkpoint and train for 200K iterations on train items in the CC5K dataset, which provides us with 4,200 pairs of images. Other training hyperparameters are the same as those used for ControlNet applied to other tasks as described above. During inference, we input an image (resized to the correct resolution) as a control and generate 25 images with random seeds (with guidance scale 1.0, CFG scale 7.5, and 20 inference steps using the default PNDM scheduler). We discretize output pixel values to the closest valid layout color and then use pixel-wise mode values, thus reducing noise from the random nature of each individual generation.
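The post-processing step described above (snapping generated pixels to the layout palette and taking a pixel-wise mode over the 25 generations) can be sketched as follows. The two-color black/white palette reflects the binary wall maps used in Section C.5; with only two colors the pixel-wise mode reduces to a majority vote, and the code below is an illustration rather than the exact pipeline implementation.

```python
import numpy as np

PALETTE = np.array([[0, 0, 0], [255, 255, 255]], dtype=np.float32)  # wall (black) / non-wall (white)

def snap_to_palette(img: np.ndarray) -> np.ndarray:
    """Map every pixel of an (H, W, 3) uint8 generation to the nearest palette index."""
    dists = np.linalg.norm(img[..., None, :].astype(np.float32) - PALETTE, axis=-1)
    return dists.argmin(axis=-1)              # (H, W) array of palette indices

def aggregate_generations(generations: list) -> np.ndarray:
    """Reduce several random-seed generations to one wall mask via a pixel-wise mode."""
    stacked = np.stack([snap_to_palette(g) for g in generations])   # (N, H, W)
    wall_votes = (stacked == 0).sum(axis=0)    # how many generations call each pixel a wall
    return wall_votes > (len(generations) / 2)  # boolean wall mask (majority vote)

# e.g. aggregate_generations([np.array(im) for im in generated_images]) with 25 generations
```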
D. Additional Results and Visualizations

D.1. Semantic Segmentation Results

Figures 16 and 17 contain examples of test images and annotations from our benchmark for semantic segmentation, and the results of the existing CC5K [18] model on them. These figures demonstrate how the model is challenged in detecting segments it wasn't exposed to during training, like pillars or curved walls. In Figure 18 we show qualitative examples of wall detection using different model architectures: the existing ResNet-152 [13] based model, and the Diffusion-based model discussed in Section C.5. As illustrated in the figure, using a more advanced model architecture allows for obtaining significantly cleaner wall segmentations. Table 7 contains a quantitative analysis of the two models on the wall segmentation prediction task.

Figure 16. Benchmark for semantic segmentation (over the walls, doors, windows, interior and background categories) on images from WAFFLE using the strongly-supervised CC5K [18] pretrained model. We can see that our data serves as a challenging benchmark as the model struggles with more diverse and complex floorplans.

            ResNet   Diffusion
Precision   0.737    0.746
Recall      0.590    0.805
IoU         0.488    0.632

Table 7. A comparison between an existing ResNet-based wall detection model (introduced in CC5K [18]) and a Diffusion-model based one (detailed further in Section C.5), evaluated on our benchmark. We can see the Diffusion-based model outperforms the ResNet-based model across all metrics, suggesting that newer architectures show promise in improving localized knowledge of in-the-wild data, such as the floorplans found in WAFFLE.

D.2. Additional Open-Vocabulary Floorplan Segmentation Results

In Figure 19, we show additional examples of text-driven floorplan image segmentation before and after fine-tuning on our data. We see that the baseline model struggles to localize concepts inside floorplan images while our fine-tuning better concentrates probabilities in relevant regions, approaching the GT regions indicated in orange rectangles. In Figure 20 we visually compare the segmentation results to those of CC5K and CLIPSeg on residential buildings. We observe that the supervised CC5K model (trained on Finnish residential floorplans alone) fails to generalize to the diverse image appearances and building styles in WAFFLE, even when they are residential buildings, while our model shows a more general understanding of semantics in such images.

D.3. Additional Generation Results

We show additional results for the generation task in Figure 21 and for the spatially-conditioned generation in Figure 22. We provide multiple examples for various building types, showing that a model trained on our data learns the distinctive structural details of each building type. For example, castles have towers, libraries have long aisles, museums, hospitals, and hotels have numerous small rooms, churches have a typical cross shape, and theaters are characterized by having rows of seats facing a stage. The differences between the various types and their unique details are further shown in Figure 24, where we illustrate examples from our training set of various types.

In Table 6 we show a breakdown of the metrics for the generated images according to the most common building types in the dataset. The table compares our fine-tuned model with a base SD model, showing that for the vast majority of building types, our fine-tuned model generates images that are both more realistic and also semantically closer to the target prompts.
Figure 17. Additional results on the semantic segmentation benchmark. Images are annotated according to walls, doors, windows, interior and background categories. On the right we show results obtained with CC5K [18].

Figure 19. In each column: segmentation results on samples of our test set before (center) and after (right) fine-tuning on our data. Queries shown include Narthex, Tower, Court, Choir, Nave, and Entrance.

Figure 20. Additional comparisons of our segmentation probability map results on residential buildings with the strongly-supervised CubiCasa5K (CC5K) model [18]. Queries shown include Living room, Kitchen, Bedroom, and Porch.

Figure 21. Additional generated floorplans, showing diverse building types (provided on the left: Cathedral, Theater, Residential Building, Palace, Museum, Hospital). The left-most column shows samples from the pretrained SD model and the rest of the columns showcase the results from our fine-tuned model.

Figure 22. Additional results for boundary-conditioned generation, showing a variety of shapes (shown on the left) and building types (shown on top: House, Castle, Library).

Figure 23. Additional results for structure-conditioned generation, showing the effect of changing condition scale (CS) and CFG scales during inference (with a fixed seed), for the prompts School, Cathedral, and Mausoleum. The condition scale controls the trade-off between adherence to the structure condition and avoiding leakage of the CubiCasa5K style which ControlNet was exposed to in fine-tuning. We also find a relatively high CFG value to improve image quality. Chosen values for inference are in bold.

Figure 24. Examples of images from our dataset with their building types (shown on the left), including Museum, Residential Building, Library, Theater, Church, House, School, Hotel, Castle, and Hospital.
Image Category:
[INST]
Please read the following (truncated) information extracted from Wikipedia related to an image:
--- START WIKI INFO ---
* Entity category: {category}
* Entity description: {description}
* Image filename: {fn}
* Texts that appear in the image (extracted with OCR): {ocr texts}
--- END WIKI INFO ---
Now answer the following question in English: What is this file most likely a depiction of?
(A) A floorplan
(B) A building
(C) A cross section of a building
(D) A garden/park
(E) A Map
(F) A city/town
(G) A physics/mathematics topic
(H) I don't know
Please choose one answer (A/B/C/D/E/F/G/H)
[/INST]

Building Name:
[INST]
Please read the following (truncated) information extracted from Wikipedia related to an image of a building:
--- START WIKI INFO ---
* Entity category: {category}
* Entity description: {description}
* Image filename: {fn}
* Wiki page summary: {wiki shows}
* Texts that appear in the image (extracted with OCR): {ocr texts}
--- END WIKI INFO ---
What is the name of the building depicted above? Write it in English, surrounded by brackets < >
[/INST]
The name of the building discussed by the article is <

Building Type:
[INST]
Please read the following (truncated) information extracted from Wikipedia related to an image of the building {building name}:
--- START WIKI INFO ---
* Entity category: {category}
* Entity description: {description}
* Image filename: {fn}
* Wiki page summary: {wiki shows}
--- END WIKI INFO ---
What type or category of building is {building name}? Write your answer in English, surrounded by brackets < >
[/INST]
The building {building name} is a <

Location Information:
[INST]
Please read the following (truncated) information extracted from Wikipedia related to an image of the building {building name}:
--- START WIKI INFO ---
* Entity category: {category}
* Entity description: {description}
* Image filename: {fn}
* Wiki page summary: {wiki shows}
--- END WIKI INFO ---
Where is {building name} located? Write the country, state (if exists) and city surrounded by brackets < > and separate between them with a semi colon, for example: <City; State; Country>. If one of them is unknown write 'Unknown', for example: <City; Unknown; Country>, <Unknown; State; Country> etc.
[/INST]
{building name} is located in <
Figure 25. The prompts used for LLM-based extraction of pGTs. Each {...} placeholder is replaced with the respective image data. At
first we only have raw data (as seen in the “Image Category” prompt), but once we gather pGTs we may use them in other prompts, for
example {building name} as used in “Building Type” and “Location Information”. We ask the LLM to return a semi-structured response
(choosing an answer from a closed set; wrapping the answer in brackets etc.) so that we can easily extract the answer of interest. From
left to right: The "Image Category" prompt is used for the initial text-based filtering, where categories (A) and (B) are positive and the
rest are negative. The “Building Name” and “Building Type” prompts are used for setting the building name and type respectively. The
“Location Information” prompt extracts the country, state, and/or city (whichever of these exist). Note that the country is subsequently
used for defining our test-train split.
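As an illustration of how these semi-structured responses could be parsed, here is a minimal sketch using regular expressions; the function names and exact patterns are assumptions, not our actual parsing code.

```python
import re
from typing import List, Optional

def parse_multiple_choice(response: str) -> Optional[str]:
    """Pull the chosen letter out of an (A)-(H) multiple-choice answer."""
    m = re.search(r"\(?([A-H])\)?", response.strip())
    return m.group(1) if m else None

def parse_bracketed(response: str) -> Optional[str]:
    """Pull an answer wrapped in < > brackets, e.g. '<Notre-Dame de Paris>'."""
    m = re.search(r"<([^<>]+)>", response)
    return m.group(1).strip() if m else None

def parse_location(response: str) -> Optional[List[Optional[str]]]:
    """Split a '<City; State; Country>' answer, mapping 'Unknown' to None."""
    inner = parse_bracketed(response)
    if inner is None:
        return None
    parts = [p.strip() for p in inner.split(";")]
    return [None if p.lower() == "unknown" else p for p in parts]
```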
Legend Existence (Wikipedia):
The image "{fn}" is a plan of the building {building name} and it contains the following texts, detected by an OCR model:
{ocr texts}
Please read the following excerpt from an article about the building which contains this image:
--- START EXCERPT ---
{snippet}
--- END EXCERPT ---
Now answer the following question about the excerpt: Does the excerpt contain a legend for the image "{fn}", i.e. an itemized list corresponding to regions marked by OCR labels in the image, explaining what each label signifies? Answer yes/no/unsure.
[/INST]

Legend Content (Wikipedia):
The image "{fn}" is a plan of the building {building name} and it contains the following texts, detected by an OCR model:
{ocr texts}
Please read the following excerpt from an article about the building which contains this image:
--- START EXCERPT ---
{snippet}
--- END EXCERPT ---
The excerpt contains a legend, i.e. an itemized list corresponding to regions marked by labels in the image. Reproduce the legend below.
[/INST]
Sure! Here is the legend:
*

Legend Existence (caption):
The image "{fn}" is a plan of the building {building name} and it has the following description:
--- START IMAGE DESC. ---
{description}
--- END IMAGE DESC. ---
Does the description above look like it contains a legend for the image, i.e. an itemized list corresponding to regions marked by labels in the image, explaining what each label signifies? Write yes/no/not sure in English, surrounded by brackets < >
[/INST]
<

Legend Content (caption):
The image "{fn}" is a plan of the building {building name} and it has the following description:
--- START IMAGE DESC. ---
{description}
--- END IMAGE DESC. ---
Does the discussed image contain a legend (as in a key/table/code for understanding the image)? If so, what are the legend's contents? Answer with a bulleted list in English of the legend contents. Include only full items and not just labels (for example, '1. nave' should be included, but '1.' alone shouldn't)
[/INST]
Answer: The legend contains:
Figure 26. The prompts used for extracting the legend contents. The two left prompts are used for extracting data from Wikipedia, and the two right ones for the image caption. In both cases this is a two-step extraction: we first ask the LLM whether the text contains a legend, and only if it answers yes do we ask for its contents. This reduces hallucinations and keeps the answers accurate.
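A schematic sketch of this two-step flow is given below; the `llm` callable and the template arguments are placeholders standing in for whichever LLM interface and prompt templates from Figure 26 are used.

```python
def extract_legend(fn, building_name, ocr_texts, snippet, llm,
                   existence_template, content_template):
    """Two-step legend extraction: ask whether a legend exists, then ask for its contents.

    `llm` is a placeholder for a chat-completion call returning a string;
    the two templates stand in for the prompts shown in Figure 26.
    """
    fields = dict(fn=fn, building_name=building_name,
                  ocr_texts=ocr_texts, snippet=snippet)

    answer = llm(existence_template.format(**fields)).strip().lower()
    if "yes" not in answer:
        return None   # skip the content step, avoiding hallucinated legends

    return llm(content_template.format(**fields))
```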
Legend simplification
[INST]
{legend}
Do:
* Include all legend parts from the list above.
* Keep it simple and short: summarize each row in one/two words
* Keep the original legend keys
* In case the features have distinct names (e.g. Chapel of the
Ascension) treat their type only (e.g. a chapel) and disregard
any specific name.
Don’t:
* Don’t invent new information
* Don’t include specific names (use their type instead)
* Don’t skip any of the legend lines above
[/INST]
Legend Existence (OCR):
The image "{fn}" is a plan of the building {building name} and it contains the following texts, detected by an OCR model:
--- START IMAGE TEXTS ---
{ocr legend candidate}
--- END IMAGE TEXTS ---
Do the OCR detections above look like they contain a legend for the image, i.e. an itemized list corresponding to regions marked by OCR labels in the image, explaining what each label signifies? Write yes/no/not sure in English, surrounded by brackets < >
[/INST]
<

Legend Content (OCR):
The image "{fn}" is a plan of the building {building name} and it contains the following texts, detected by an OCR model:
--- START IMAGE TEXTS ---
{ocr legend candidate}
--- END IMAGE TEXTS ---
The above texts may contain, among other things, the content of an image legend (as in a key/table/code for understanding the image). Can you extract the legend contents from the above texts? Answer with a bulleted list of the legend contents. Include only full items and not just keys/labels (for example, '1. nave' can be included, but '1.' or 'nave' alone shouldn't). Disregard text that doesn't seem like it's part of the legend. Include the original keys/labels and don't invent new ones. If you can't deduce a legend return "I don't know".
[/INST]
Sure! Here are the legend contents:

Feature Existence (OCR):
The image "{fn}" is a plan of the building {building name} and it contains the following texts, detected by an OCR model:
--- START IMAGE TEXTS ---
{ocr legend candidate}
--- END IMAGE TEXTS ---
Do the OCR detections above look like they contain words that represent architectural feature labels? Disregard anything that looks like a symbol or a key (like numbers), and any words that represent direction (e.g. north, east, etc.) Write yes/no/not sure in English, surrounded by brackets < >
[/INST]
<

Feature Content (OCR):
The image "{fn}" is a plan of the building {building name} and it contains the following texts, detected by an OCR model:
--- START IMAGE TEXTS ---
{ocr legend candidate}
--- END IMAGE TEXTS ---
The above texts may contain, among other things, architectural features marked on the floorplan. Out of the texts above, can you extract those that represent architectural features? Like room types, halls, porches, etc. Don't include anything that looks like a symbol or a key (like numbers). Don't include any words that represent direction (e.g. north, east, etc.). Disregard text that isn't related to architectural features. Answer with a bulleted list of the architectural features. Use the original text, do not modify, translate, or add extensions to the text you chose to add. If you can't deduce any architectural features return "I don't know".
[/INST]
Figure 28. The prompts used for extracting legends and architectural features from OCR detections. The two left prompts are used for extracting legends, and the two right ones for architectural feature labels. In both cases this is a two-step extraction: we first ask the LLM whether the text contains the relevant content (a legend or architectural feature labels), and only if it answers yes do we ask for its contents. This reduces hallucinations and keeps the answers accurate.