
WAFFLE: Multimodal Floorplan Understanding in the Wild

Keren Ganon∗1 Morris Alper∗1 Rachel Mikulinsky1 Hadar Averbuch-Elor1,2


1 Tel Aviv University    2 Cornell University

https://tau-vailab.github.io/WAFFLE
arXiv:2412.00955v2 [cs.CV] 3 Dec 2024

Abstract

Buildings are a central feature of human culture and re-


quire significant work to design, build, and maintain. As
such, the fundamental element defining their structure – the
floorplan – has increasingly become an object of computa-
tional analysis. Existing works on automatic floorplan un-
derstanding are extremely limited in scope, often focusing
on a single semantic category and region (e.g. apartments
from a single country). This contrasts with the wide vari-
ety of shapes and sizes of real-world buildings which reflect
their diverse purposes. In this work, we introduce WAF-
FLE, a novel multimodal floorplan understanding dataset of
nearly 20K floorplan images and metadata curated from In-
ternet data spanning diverse building types, locations, and
data formats. By using a large language model and mul-
timodal foundation models, we curate and extract seman-
tic information from these images and their accompanying
noisy metadata. We show that WAFFLE serves as a chal-
lenging benchmark for prior computational methods, while
enabling progress on new floorplan understanding tasks. We will publicly release WAFFLE along with our code and trained models, providing the research community with a new foundation for learning the semantics of buildings.

Figure 1. What can we understand from looking at these images? For instance, do we have a sense of what type of buildings these floorplans depict? Floorplans provide multimodal cues over the semantics and structure of buildings; however, they are often opaque for non-professionals, particularly for images lacking textual descriptions (such as the bottom images). We propose WAFFLE, a new multimodal dataset depicting floorplan images associated with rich textual descriptions. Our dataset allows for understanding in-the-wild floorplan imagery illustrating a wide array of building types. For example, a vision-and-language model finetuned on our data can correctly predict the building types for the examples depicted above (answers are provided below*).

1. Introduction

"Life is chaotic, dangerous, and surprising. Buildings should reflect that."
—Frank Gehry

Buildings come in all shapes and sizes, from the tiny cottages dotting the English countryside to the imposing galleries of the temple of Angkor Wat. The diverse architectural designs of buildings have been influenced by their purposes, geographical locations, and changing trends throughout history. Recent years have seen a growing interest in the development of computational tools for architecture, which promise to aid experts engaged in the design and maintenance of buildings. Of particular interest is the automatic analysis of floorplans, the most fundamental element defining the structure of buildings which communicate rich schematic and layout information.

* Denotes equal contribution
* From left to right: castle, temple, residential building. These samples (depicting the Penrhyn Castle in Wales, the Forum at Timgad in Algeria and a house floorplan in Bosnia and Herzegovina, respectively) were taken from the WAFFLE test set.
Prior works have tapped into the vast visual knowledge encoded by floorplans for various applications, such as 3D reconstruction [25] and floorplan-guided building navigation [27, 39]. However, prior data-driven techniques operating on floorplans mostly focus on extremely limited semantic domains (e.g. apartments) and geographical locations (often a single country), failing to cover the diversity needed for automatic understanding of floorplans in an unconstrained setting.

In this work, we introduce WAFFLE (WikipediA-Fueled FLoorplan Ensemble), a multimodal floorplan understanding dataset comprised of diverse imagery spanning a variety of building types, geographical regions, historical eras, and data formats (as illustrated in Figure 1), along with comprehensive textual data. WAFFLE is derived from freely-available Internet images and metadata from the Wikimedia Commons platform. To turn noisy Internet data into this curated dataset with rich semantic annotations, we leverage state-of-the-art foundation models, using large language models (LLMs) and vision-language models (VLMs) to perform curation tasks with little or no supervision. This includes a decomposition of floorplans into visual elements, and structuring textual metadata, code and OCR detections with LLMs. By combining these powerful tools, we build a new dataset for floorplan understanding with rich and diverse semantics.

In addition to serving as a challenging benchmark for prior work, we show the utility of this data for various building understanding tasks that were not feasible with previous datasets. By using high-level and localized semantic labels along with floorplan images in WAFFLE, we learn to predict building semantics and use them to generate floorplan images with the correct building type, along with optional conditioning on structural configurations. Grounded labels within images also provide supervision to segment areas corresponding to domain-specific architectural terms. As shown by these applications, WAFFLE opens the door for semantic understanding and generation of buildings in a diverse, real-world setting.

                 Rent3D++   CubiCasa5K   RPLAN    SD       R2V     FloorPlanCAD   WAFFLE
Building types   ———————— Residential buildings ————————           ∼4⋄            >1K
Countries        UK         Fin.         Asia     Switz.   Japan   ?†             >100
#Categories∗     ∼15        ∼80          ∼13      91       22      35             ∞‡
Real images?     ✓          ×            ×        ✓        ✓       ✓              ✓

∗ Number of unique annotation values for labeled grounded regions or objects.
† Unspecified data source.
‡ Free text, on a subset of images.
⋄ Contains floorplans of 100 buildings spanning residential buildings, schools, hospitals, and shopping malls.

Table 1. A comparison between WAFFLE and other floorplan datasets. SD above stands for Swiss Dwellings. We can see that, in contrast to our proposed WAFFLE dataset, most existing datasets focus on a single building type in a specific area in the world, and consider a small, closed list of annotation values.

2. Related Works

Floorplans in Computer Vision. Floorplans are a fundamental element of architectural design; as such, automatic understanding and generation of floorplans has drawn significant interest from the research community.

Several works aim to reconstruct floorplans, either from 3D scans [21, 47], RGB panoramas [2, 33, 46], room layouts [15] or combined modalities, such as sparse views and room-connectivity graphs [12]. Prior works also investigate the problem of alignment between floorplans and 3D point clouds depicting scenes [19]. Martin et al. [25] leverage floorplans of large-scale scenes to produce a unified reconstruction from disconnected 3D point clouds. Floorplans have also been utilized for navigation tasks. Several works predict position over a given floorplan, for a single image [39] or video sequences [6] depicting regions of the environment. Narasimhan et al. [27] train an agent to navigate in new environments by predicting corresponding labeled floorplans.

Some works specifically target recognition of semantic elements over both rasterized [9, 48] and vectorized [45] floorplan representations, as well as applying this to perform raster-to-vector conversion [18, 22, 24]. In our work, we are interested in understanding Internet imagery of diverse data types such as raster graphics and photographs or scans of real floorplans. In contrast to prior work that mostly focuses on a fixed set of semantic elements in residential apartments, such as walls, bathrooms, closets, and so on, we are interested in acquiring higher-level reasoning over a wide array of building types.

The problem of synthesizing novel floorplans, and other types of 2D layouts such as documents [30, 50], has also received considerable interest (see the recent survey by Weber et al. [41] for a comprehensive review). Earlier works generate floorplans from high-level constraints, such as room adjacencies [17, 26]. Later works are able to generate novel floorplans in more challenging settings, e.g. only given their boundaries [16, 43]. In our work, we show that SOTA text-to-image generation tools can be fine-tuned for generating floorplans of diverse building types, not only residential buildings, as explored by prior methods.

Floorplan Datasets. Prior datasets containing floorplan data are limited in structural and semantic diversity, typically being limited to residential building types such as apartments from specific geographic locations, often mined from real estate listings. For example, Rent3D++ [38] contains floorplans of 215 apartments located in London, and CubiCasa5K [18] contains floorplans of 5K Finnish apartments. The RPLAN [43] dataset contains 80K floorplans
of apartments in Asia, further limited by various size and structural requirements (e.g., having only 3–9 rooms with specific proportions relative to the living room). The Swiss Dwellings [34] includes floorplan data for 42K Swiss apartments, and the Modified Swiss Dwellings [37] dataset provides a filtered subset of this data with additional access graph information for floorplan auto-completion learning. The R2V [22] dataset introduces 815 Japanese residential building floorplans.

Additionally, WAFFLE differs substantially from prior works with regards to the sourcing and curation of data. Datasets of real floorplans, such as those previously mentioned, are constructed with tedious manual annotation. For example, specialists spent over 1K hours in the construction of FloorPlanCAD [11] to provide annotations of 30 categories (such as door, window, bed, etc.). Annotations may also derive from other input types rather than being direct annotations of floorplans; for instance, the Zillow Indoor Dataset [7] generates floorplans with user assistance from 360° panoramas, yielding plans for 1,524 homes after over 1.5K hours of manual annotation labor. To bypass such manual procedures, other works generate synthetic floorplans using predefined constraints [8]. By contrast, WAFFLE contains diverse Internet imagery of floorplans, including both original digital images and scans captured in the wild, and is curated with a fully automatic pipeline. See Table 1 for a comparison of the most related datasets with our proposed WAFFLE dataset.

Finally, there are also large-scale datasets of landmark-centric image collections, such as Google Landmarks [29, 42] and WikiScenes [44]. Along with photographs and similar imagery of these landmarks, such collections may include schematic data such as floorplans. While prior works focus on the natural imagery in these collections for tasks such as image recognition, retrieval, and 3D reconstruction, we specifically leverage the schematic diagrams found in such collections for layout generation and understanding.

3. WAFFLE: Internet Floorplans Dataset

In this section, we introduce WAFFLE (WikipediA-Fueled FLoorplan Ensemble), a new dataset of 18,556 floorplans, derived from Wikimedia Commons* and associated textual descriptions available on Wikipedia. WAFFLE contains floorplan images with paired structured metadata containing overall semantic information and spatially-grounded legends. Samples from our dataset are provided in Figure 2. We provide an interactive viewer of samples from the WAFFLE dataset, and additional details and statistics of our dataset, in the supplementary material. We proceed to describe the curation process and contents of WAFFLE.

* https://commons.wikimedia.org

[Figure 2 sample data — left: Building name: St. Paul's Episcopal Church; Building type: church; Country: United States of America; Grounded architectural features: {organ room, rector, chapel, vestry, altar, tower, nave, apse}. Right: Building name: Château de Blois; Building type: castle; Country: France; Grounded legend: {1: barn, 2: chapel, 3: saint savior church, 6: prior's house, 8: residence of counts, 9: new home, 10: orchard in moat, 11: bretonry garden, 12: booth hall}.]

Figure 2. Samples from WAFFLE. Above, we show images paired with their structured data, including the building name and type, country of origin, and their grounded architectural features. We also visualize the detected layout components (floorplan, legend, compass, and scale, as relevant) overlaid on top of the images.

3.1. Data Collection

Images and metadata in Wikimedia Commons data are ordered by hierarchical categories (WikiCategories). To find relevant data, we recursively scrape the WikiCategories Floor plans and Architectural drawings, extracting images and metadata from Wikimedia Commons and the text of linked Wikipedia articles. As many images contain valuable textual information (e.g. hints to the location of origin, legend labels, etc.), we also extract text from the images using the Google Vision API* for optical character recognition (OCR). Finally, we decompose images into constituent items by fine-tuning the detection model DETR [3] on a small subset of labeled examples to predict bounding boxes for common layout components (floorplans, legend boxes, compass, and scale icons).

* https://cloud.google.com/vision?hl=en
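As a rough illustration of the recursive WikiCategory traversal described above, the following minimal sketch queries the public MediaWiki API of Wikimedia Commons for the members of a category and recurses into subcategories. The starting category, depth limit, and the idea of collecting only file titles are illustrative assumptions, not the exact crawler used to build WAFFLE.

```python
import requests

API = "https://commons.wikimedia.org/w/api.php"

def category_members(category, member_type):
    """Yield titles of members of a Wikimedia Commons category ('file' or 'subcat')."""
    params = {
        "action": "query", "format": "json", "list": "categorymembers",
        "cmtitle": f"Category:{category}", "cmtype": member_type, "cmlimit": "500",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:  # no more result pages
            break
        params.update(data["continue"])  # follow the continuation token

def crawl(category, max_depth=3, seen=None):
    """Recursively collect file titles under a WikiCategory (depth-limited)."""
    seen = set() if seen is None else seen
    if max_depth < 0 or category in seen:
        return []
    seen.add(category)
    files = list(category_members(category, "file"))
    for subcat in category_members(category, "subcat"):
        files += crawl(subcat.removeprefix("Category:"), max_depth - 1, seen)
    return files

if __name__ == "__main__":
    titles = crawl("Floor plans", max_depth=1)
    print(len(titles), "candidate images")
```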
The raw data includes a significant amount of noise along with floorplans, including similar topics such as maps and cross-sectional blueprints as well as other unrelated data. Therefore, we filter this data as follows:

Text-based filtering (LLM). We perform an initial text-only filtering stage by processing our images' textual metadata with an LLM to extract structured information. We provide the LLM with a prompt containing image metadata and ask it to categorize the image in multiple-choice format, providing it with a closed set of possible categories. These include positive categories such as floorplan and building as well as some negative categories (not floorplans) such as map and city.
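A minimal sketch of this multiple-choice filtering step is shown below, assuming a Hugging Face text-generation pipeline over the Llama-2 chat model named in the supplementary material; the category list and prompt wording here are illustrative placeholders, not the exact prompt used for WAFFLE.

```python
from transformers import pipeline

# Hypothetical closed category set; options A/B are treated as positive, the rest as negative.
CATEGORIES = {"A": "floorplan", "B": "building", "C": "map", "D": "city"}

generator = pipeline("text-generation", model="meta-llama/Llama-2-13b-chat-hf")

def classify_metadata(metadata_text: str) -> bool:
    """Return True if the LLM assigns the image metadata to a positive (floorplan-like) category."""
    options = "\n".join(f"{k}. {v}" for k, v in CATEGORIES.items())
    prompt = (
        "Based on the following image metadata, what is the image most likely a depiction of?\n"
        f"{metadata_text}\n\nAnswer with a single letter.\n{options}\nAnswer:"
    )
    answer = generator(prompt, max_new_tokens=5, return_full_text=False)[0]["generated_text"]
    return answer.strip()[:1].upper() in {"A", "B"}
```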
Image-based filtering (CLIP). We use CLIP [31] image embeddings to filter for images likely to be floorplans. Firstly, as the WikiCategory Architectural drawings contains many non-floorplan images, we train a linear classifier on a balanced sample of items from the two WikiCategories and select images that are closer to those in the Floor plans WikiCategory. Moreover, we filter all images by comparing them with CLIP text prompt embeddings, following the use of CLIP for zero-shot classification. We compare to multiple prompts such as A map, A picture of people, and A floorplan, aggregating scores for positive and negative classes and filtering out images with low scores. Finally, we train a binary classifier using high-scoring images and negative examples to adjust the zero-shot CLIP classifications for increased recall.
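The zero-shot scoring step can be sketched as follows with the openai/clip-vit-base-patch32 checkpoint named in the supplementary material; the short prompt lists and the 0.5 threshold are illustrative stand-ins for the fuller prompt ensemble and filtering rules described in the appendix.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

POSITIVE = ["a floor plan", "an architectural layout"]
NEGATIVE = ["a map", "a picture of people", "a satellite image"]

def floorplan_score(image: Image.Image) -> float:
    """Softmax-normalized CLIP score mass assigned to the positive prompts."""
    inputs = processor(text=POSITIVE + NEGATIVE, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return probs[: len(POSITIVE)].sum().item()

if __name__ == "__main__":
    img = Image.open("candidate.png").convert("RGB")
    print("keep" if floorplan_score(img) > 0.5 else "discard")  # illustrative threshold
```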
This step results in a final dataset of nearly 20K images. Each image is accompanied by the following raw data extracted from its Wikimedia Commons page and linked pages: the image file name, its Wikimedia Commons page content (including a textual description), a list of linked WikiCategories, the contents of linked Wikipedia pages (if present), OCR detections in the image, and bounding boxes of constituent layout components.

3.2. LLM-Driven Structured pGT Generation

Our raw data contains significant grounded information about each image in diverse formats, which we wish to systematically organize and structure for use in downstream tasks. To this aim, we harness the capabilities of large language models (LLMs) for distilling essential information from diverse textual data. In particular, we extract the following information (also illustrated in Figure 2) by prompting Llama-2 [36] with an instruction and relevant metadata fields: building name, building type (i.e. church, hotel, museum etc.), location information (country, state, city), and a list of architectural features that are grounded in the image.

In general, the raw metadata contains considerable and diverse noise, involving multilingual content and multiple written representations of identical entities (e.g. Notre Dame Cathedral vs. Notre-Dame de Paris). To control for the source language, we employ prompts that instruct the LLM to respond in English and request translations when necessary. For linking representations of identical entities (also known as record linkage), we employ LinkTransformer [1] clustering along with various textual heuristics. We provide additional details, including prompts used, in the supplementary material, and proceed to describe our method for grounding architectural features in floorplans.
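To give a concrete flavor of this extraction step, the sketch below assembles an instruction from a few metadata fields and parses the reply into a small record; the field names, prompt wording, and JSON-style output format are our own illustrative assumptions (the actual prompts appear in the supplementary material).

```python
import json
from transformers import pipeline

llm = pipeline("text-generation", model="meta-llama/Llama-2-13b-chat-hf")

FIELDS = ["building_name", "building_type", "country", "state", "city"]

def extract_pgt(metadata: dict) -> dict:
    """Prompt the LLM with raw metadata and parse a structured pseudo-ground-truth record."""
    prompt = (
        "You are given noisy metadata about an image of a building floorplan.\n"
        f"File name: {metadata.get('file_name', '')}\n"
        f"Description: {metadata.get('description', '')}\n"
        f"Categories: {', '.join(metadata.get('wikicategories', []))}\n"
        "Respond in English only, translating where necessary, as a JSON object with the keys "
        + ", ".join(FIELDS) + ". Use null for unknown values.\nJSON:"
    )
    reply = llm(prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    try:
        record = json.loads(reply[reply.find("{"): reply.rfind("}") + 1])
    except ValueError:  # malformed or missing JSON in the reply
        record = {}
    return {field: record.get(field) for field in FIELDS}
```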
Architectural Feature Extraction and Grounding. Many floorplan images indicate architectural information either directly with text on the relevant region, or indirectly using a legend. To identify legends and architectural information marked directly on the floorplan, we examine the bounding boxes of floorplan and legend detections (using the model described in Section 3.1) and select OCR detections within these areas. We also extract additional legend information from image metadata by prompting the LLM with an instruction including page content from the image's Wikimedia Commons page or the code surrounding the image in its linked Wikipedia pages (as legends often appear in these locations). We further structure the legend outputs using regular expressions to identify key-value pairs. Finally, we link the legend keys and architectural features to the regions in the floorplan images coinciding with OCR detections, thus providing grounding for the semantic values of the image. See Figure 3 for an example.

Figure 3 (panels: Raw Data, Legend, Image). We automatically extract legends and architectural features from the image raw data (illustrated on the left, either the image metadata or OCR detections) by prompting LLMs. We associate the keys with text detected in the image, yielding grounded regions associated with semantics.

3.3. Dataset Statistics

Our dataset contains nearly 20K images with accompanying metadata, in a range of formats. In particular, we note that our dataset contains over 1K vectorized floorplans. Additionally, our dataset contains more than 1K building types spread over more than 100 countries across the world, and over 11K different Grounded Architectural Features (GAFs) across almost 3K grounded images. We split into train and test sets (18,259 and 297 images respectively) by selecting according to country (train: 50 countries; test: 57 countries), thus ensuring disjointedness with regards to buildings and preventing data leakage.

Data Quality Validation. We manually inspect the test set images, removing images that do not contain a valid floorplan. Based on this validation, we find that 89% are indeed relevant floorplan images. We find this level of noise acceptable for training models on in-the-wild data, while the manual filtering assures a clean test set for evaluation. In addition, we manually inspect the quality of our generated pGTs. We find that 89% of the building names, 85% of the building types and 96% of the countries of origin are accurately labeled (considering 100 random data samples).

4. Experiments
In this section, we perform several experiments applying our dataset to both discriminative and generative building understanding tasks. For all tasks, we use the train-test split outlined in Section 3.3. Please refer to the supplementary material for further training details.

          R@1    R@5    R@8    R@16   MRR
CLIP      1.5%   7.6%   10.3%  19.7%  0.07
CLIP_FT   11.8%  34.1%  40.0%  52.9%  0.23

Table 2. Results on CLIP retrieval of building types, for CLIP before and after fine-tuning on our dataset. We report Recall@k (R@k) for k ∈ {1, 5, 8, 16} and Mean Reciprocal Rank (MRR) for these models, evaluated on our test set. As seen above, fine-tuning on WAFFLE significantly improves retrieval metrics.

        CC5K∗   CLIPSeg   Ours
AP      0.138   0.157     0.226
mIoU    0.057   0.066     0.131

Table 3. Open-Vocabulary Floorplan Segmentation Evaluation. We compare against a pretrained CLIPSeg model and against a closed-vocabulary segmentation model (CC5K). As illustrated above, our method improves localization across all evaluation metrics. ∗ Evaluated only over a subset of residential buildings.

Figure 4. Comparison of open-vocabulary segmentation probability map results (rows: Nave, Court, Kitchen; columns: GT, CC5K∗, CLIPSeg, Ours). We show the input images in the first column, with the corresponding GT regions in red. ∗ Note that CC5K is a closed-vocabulary model designed for residential floorplan understanding, and therefore we cannot compare to it over additional building types (such as castles and cathedrals illustrated above). In addition to improving on the base CLIPSeg segmentation model, we outperform the strongly-supervised CC5K, suggesting that this model cannot generalize well beyond its training set distribution.

4.1. Building Type Understanding

Task description. We test the ability to predict building-level semantics from a floorplan, similarly to a human who might look at a floorplan and make an educated guess as to what type of building it depicts. To learn this understanding, we fine-tune CLIP with a contrastive objective on paired images and building type pseudo-labels from WAFFLE. Our fine-tuned model (CLIP_FT) is expected to adjust CLIP to assign floorplan image embeddings close to those of relevant building types, allowing for subsequent retrieval or classification with floorplan images as input. We test the extent to which this understanding has been learned in practice with standard retrieval metrics, evaluating Recall@k for k ∈ {1, 5, 8, 16} and Mean Reciprocal Rank (MRR).

Results. Results for fine-tuning CLIP for building type understanding are shown in Table 2. As is seen there, CLIP_FT significantly outperforms the base model in retrieving the correct building type pseudo-labels, hence showing a better understanding of their global semantics.
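For reference, the retrieval metrics above can be computed from image-to-text similarity scores as in the following sketch; the random similarity matrix stands in for CLIP or CLIP_FT embeddings of test floorplans and building-type prompts.

```python
import numpy as np

def retrieval_metrics(similarity: np.ndarray, gt_index: np.ndarray, ks=(1, 5, 8, 16)):
    """Recall@k and MRR given an (images x building types) similarity matrix
    and the index of the correct building type for each image."""
    order = np.argsort(-similarity, axis=1)  # ranking of types per image (0 = best)
    ranks = np.array([np.where(order[i] == gt_index[i])[0][0] for i in range(len(gt_index))])
    recalls = {f"R@{k}": float((ranks < k).mean()) for k in ks}
    mrr = float((1.0 / (ranks + 1)).mean())
    return recalls, mrr

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(297, 100))    # e.g. 297 test images, 100 candidate types
    gt = rng.integers(0, 100, size=297)  # pseudo-label index per image
    print(retrieval_metrics(sim, gt))
```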
4.2. Open-Vocabulary Floorplan Segmentation

Task description. To model localized semantics within floorplans, we use the GAFs in WAFFLE to fine-tune a text-driven segmentation model. We adopt the open-vocabulary text-guided segmentation model CLIPSeg [23] and perform fine-tuning on the subset of these grounded images.

To provide supervision, we use the values of the GAFs as input text prompts for the segmentation model and the OCR bounding box regions of the associated grounded values as segmentation targets. This yields partial ground truth supervision; for a text query, we use OCR bounding box regions corresponding to text labels that semantically match the query (implemented via text embedding similarity) as positive targets and the remaining bounding box regions as negative targets. To prevent leakage from the written text in the images, we perform inpainting with Stable Diffusion [32] to replace the contents of the OCR bounding boxes. As our inpainting process may cause artifacts, for evaluation purposes we manually select images that do not contain GAFs. We follow prior work [23] and report mean Intersection over Union (mIoU) and Average Precision (AP). The mIoU metric requires a threshold, which we empirically set to 0.25. AP is a threshold-agnostic metric that measures the area under the recall-precision curve, quantifying to what extent it can discriminate between correct and erroneous matches. In addition to comparing against the pretrained CLIPSeg model, we compare against the closed-vocabulary segmentation model provided by CubiCasa5K (CC5K) [18] over a subset of residential buildings in our test set (evaluating semantic regions which this model was trained on).

Results. Quantitative results are reported in Table 3, showing a clear boost in performance across both metrics. This is further reflected in our qualitative results in Figure 4. In addition, the results on residential buildings of the strongly-supervised residential floorplan understanding model [18] yield inferior performance, likely because the latter model uses supervision from a specific geographical region and style alone (a limitation of existing datasets, as we describe in Section 2). Overall, both metrics show that there is much room for improvements with future techniques leveraging our data for segmentation-related tasks.
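The weakly supervised targets described in the task description above can be sketched as binary maps rasterized from grounded OCR boxes; the sentence-transformers text matcher and the 0.7 threshold are illustrative placeholders for the embedding-similarity matching step, not the exact implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative text embedding model

def supervision_maps(query, ocr_boxes, image_size, match_threshold=0.7):
    """Build positive/negative target maps for a text query from grounded OCR boxes.
    ocr_boxes: list of (label, (x0, y0, x1, y1)) in pixel coordinates."""
    height, width = image_size
    positive = np.zeros((height, width), dtype=np.float32)
    negative = np.zeros((height, width), dtype=np.float32)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    for label, (x0, y0, x1, y1) in ocr_boxes:
        label_emb = encoder.encode(label, convert_to_tensor=True)
        target = positive if util.cos_sim(query_emb, label_emb).item() > match_threshold else negative
        target[y0:y1, x0:x1] = 1.0  # mark the grounded region
    return positive, negative
```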
4.3. Benchmark for Semantic Segmentation

Following prior work [18, 22, 43], we consider segmentation of rasterized floorplan images into fine-grained localized categories, as locating elements such as walls has applications to various downstream tasks. To provide a new benchmark for performance on the diverse floorplans in WAFFLE, we manually annotate pixel-level segmentation maps for more than a hundred images over categories applicable to most building types: wall, door, window, interior and background. As our dataset contains a variety of data types, we annotate SVG-formatted images, which can be easily manually annotated by region.

We illustrate the utility of this benchmark by evaluating a standard existing model, namely the supervised segmentation model provided by CC5K [18]. We also evaluate a modern diffusion-based architecture trained with the same supervised data to predict wall locations as black-and-white images, to explore whether architectural modifications can yield improved performance. Further details of these models are provided in the supplementary material.

            Walls   Doors   Windows   Interior   BG
Precision   0.737   0.201   0.339     0.799      0.697
Recall      0.590   0.163   0.334     0.521      0.912
IoU         0.488   0.099   0.202     0.461      0.653

Table 4. Benchmark for Semantic Segmentation Evaluation. We benchmark prior work, reporting performance over the CubiCasa5K [18] segmentation model, on common grounded categories. Note that background is denoted as BG above. As illustrated, WAFFLE serves as a challenging benchmark for existing work.

Results. Table 4 includes a quantitative evaluation of the existing model provided by CC5K on our benchmark. As illustrated in the table, our dataset provides a challenging benchmark for existing models, yielding low performance, particularly for more fine-grained categories, such as doors and windows. In addition to these results, we find that the modern diffusion architecture shows significantly better performance at localization of walls, generating binary maps with higher quantitative metric values (+1.2% in precision, +36.4% in recall and +29.5% in IoU, in comparison to the values obtained on the wall category in Table 4). This additional experiment shows promise in using stronger architectures for improving localized knowledge on weakly supervised in-the-wild data to ultimately approach the goal of pixel-level localization within diverse floorplans. Qualitative results from both models, along with ground truth segmentations, are provided in the supplementary material.

4.4. Text-Conditioned Floorplan Generation

Task description. Inspired by the rich literature on automatic floorplan generation, we fine-tune a text-to-image generation model on paired images and pGT textual data from WAFFLE for text-guided generation of floorplan images. We adopt the latent diffusion model Stable Diffusion [32] (SD), using prompts of the form "A floor plan of a <building type>" which use the building type pseudo-labels from our LLM-extracted data. We balance training samples across building names and types to avoid overfitting on common categories. We evaluate the realism of these generations using Fréchet Inception Distance (FID) [14] as well as Kernel Maximum Mean Discrepancy (KMMD), since FID can be unstable on small datasets [5]. Similar to prior work [4, 28, 40], we measure KMMD on Inception features [35]. To measure semantic correctness, we measure CLIP similarity (using pretrained CLIP) between generations and prompts. All metrics were calculated per building type, averaging over the most common 15 types.

         FID ↓   KMMD ↓   CLIP Sim. ↑
SD       194.8   0.10     24.9
SD_FT    145.3   0.07     25.6

Table 5. Results on generated images, using a base and fine-tuned Stable Diffusion (SD) model. We compare the quality of the generated images (FID, KMMD) and the similarity to the given prompt (CLIP Sim.). As illustrated above, SD fine-tuning improves both realism and semantic correctness of image generations.

Results. Table 5 summarizes quantitative metrics, comparing floorplan generation using the base SD model with our fine-tuned version. These provide evidence that our model generates more realistic floorplan images that better adhere to the given prompts. Supporting this, we provide examples of such generated floorplans in Figure 5, observing the diversity and semantic layouts predicted by our model for various prompts. We note that our model correctly predicts the distinctive elements of each building type, such as the peripheral towers of castles and numerous side rooms for patient examinations in hospitals. Such distinctive elements are mostly not observed in the pretrained SD model, which generally struggles at generating floorplans. To further illustrate that our generated floorplans better convey the building type specified within the target text prompt, we conducted a user study. Given a pair of images, one generated with the pretrained model and one with our finetuned model, users were asked to select the image that best conveys the target text prompt. We find that 70.42% of the time users prefer our generated images, in comparison to the generations of the pretrained model. Additional details regarding this study are provided in the supplementary.
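A sketch of the prompt-conditioned generation and CLIP-similarity scoring described above is given below using the diffusers and transformers libraries; the fine-tuned checkpoint path is a placeholder, and only the prompt template follows the paper.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

# Placeholder path for a Stable Diffusion checkpoint fine-tuned on WAFFLE (SD_FT).
pipe = StableDiffusionPipeline.from_pretrained("path/to/sd-finetuned-on-waffle").to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def generate_and_score(building_type: str):
    """Generate a floorplan for the given building type and score prompt adherence with CLIP."""
    prompt = f"A floor plan of a {building_type}"
    image = pipe(prompt).images[0]
    inputs = clip_processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        similarity = clip(**inputs).logits_per_image[0, 0].item()  # scaled image-text similarity
    return image, similarity

if __name__ == "__main__":
    img, score = generate_and_score("castle")
    img.save("castle_floorplan.png")
    print("CLIP similarity:", score)
```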
Figure 5. Examples of generated floorplans for various building types, using the prompt "A floor plan of a <building type>" (corresponding types are shown on top: School, Palace, Church, Castle, Hospital, Hotel, Library). The first row shows samples from the pretrained SD model, and the bottom three show results from the model fine-tuned on WAFFLE. As seen above, pretrained SD struggles at generating floorplans in general and often yields results that do not structurally resemble real floorplans. By contrast, our fine-tuned model can correctly generate fine-grained architectural structures, such as towers in castles or long corridors in libraries.

Figure 6. Boundary-conditioned generation (columns: Input, Mask, Museum, Theater, School, Hotel). The first column shows images in WAFFLE, the second column shows automatically-extracted boundary masks, and the following columns show floorplan image generations conditioned on this boundary with diverse building types provided as prompts.

Figure 7. Structure-conditioned generation (conditions paired with the prompts School, Castle, Library, Cathedral). For each image pair, the first image displays a building layout condition, taken from the existing CubiCasa5K dataset, which defines foreground (white) and background (black) regions, walls (red), doors (blue), and windows (cyan). The second image shows a generation conditioned on this layout, using the ControlNet-based model described in Section 4.5. Our image data and metadata enable the generation of diverse building types with structural constraints, without requiring any pixel-level annotations of images in WAFFLE. Notably, this succeeds even when the constraint is highly unusual for the corresponding building type, such as the condition above for cathedral (as cathedrals are usually constructed in a cross shape).

4.5. Structure-Conditioned Floorplan Generation

Task description. Structural conditions for floorplan generation have attracted particular interest, as architects may wish to design floorplans given a fixed building shape or desired room configuration [16, 17, 26, 43]. Unlike existing works that consider residential buildings exclusively, we operate on the diverse set of building types and configurations found in WAFFLE, providing conditioning to our generative model SD_FT by fine-tuning ControlNet [49] for conditional generation, combined with applying text prompts reflecting various building types.

We are challenged by the fact that data in WAFFLE, captured in-the-wild, does not contain localized structural annotations (locations of walls, doors, windows or other features) such as those painstakingly annotated in some existing datasets. Therefore, we leverage our data in an unsupervised manner to achieve conditioning. To condition on the desired building boundary, we approximate the outer shape for all images in the training set via an edge detection algorithm. To condition on more complex internal structures, we instead train ControlNet on image-condition pairs derived from the existing annotated CubiCasa5K dataset [18]. By using SD_FT as the backbone for this ControlNet, reducing the conditioning scale (assigning less weight to the conditioning input during inference) and using a relatively high classifier-free guidance scale factor (assigning higher weight to the prompt condition), we fuse the ability of SD_FT to generate diverse building types while incorporating the structural constraints derived from external annotated data of residential buildings.
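The inference-time fusion described above can be sketched with the diffusers ControlNet pipeline; the checkpoint paths are placeholders, and the specific conditioning and guidance scale values are illustrative rather than the exact values used in our experiments.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Placeholder paths: SD_FT backbone fine-tuned on WAFFLE, and a ControlNet trained on
# image-condition pairs derived from CubiCasa5K annotations.
controlnet = ControlNetModel.from_pretrained("path/to/controlnet-cubicasa")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/sd-finetuned-on-waffle", controlnet=controlnet
).to("cuda")

condition = load_image("layout_condition.png")  # walls/doors/windows rendered as a color map
result = pipe(
    prompt="A floor plan of a cathedral",
    image=condition,
    controlnet_conditioning_scale=0.5,  # reduced weight on the structural condition (illustrative)
    guidance_scale=12.0,                # relatively high classifier-free guidance (illustrative)
    num_inference_steps=50,
).images[0]
result.save("cathedral_conditioned.png")
```

Lowering the conditioning scale lets the fine-tuned backbone deviate from the residential layout statistics of the condition, while the high guidance scale keeps the generation faithful to the building-type prompt.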
Results. We provide structurally-conditioned generations in Figures 6–7 for various building types. For boundary conditioning, the condition shape is extracted from existing images in our dataset. For structure conditioning, the conditions are derived from the annotations in the external CubiCasa5K dataset, using categories relevant to the diverse buildings in WAFFLE. These examples illustrate that our model is able to control the contents and style of the building according to the text prompt while adhering to the overall layout of the condition. This again demonstrates that the model has learned the distinct characteristics of each building type. In addition, we note that this succeeds even when the structural constraint is highly unusual for the paired building type, such as the cathedral in Figure 7 which deviates from the typical layout of a cathedral (usually constructed in a cross shape) in order to obey the condition.

5. Conclusion

We have presented the WAFFLE dataset of diverse floorplans in the wild, curated from Internet data spanning diverse building types, geographic locations, and architectural features. To construct a large dataset of images with rich, structured metadata, we leverage SOTA LLMs and multimodal representations for filtering and extracting structure from noisy metadata, reflecting both the global semantics of buildings and localized semantics grounded in regions within floorplans. We show that this data can be used to train models for building understanding tasks, enabling progress on both discriminative and generative tasks which were previously not feasible.

While our dataset expands the scope of floorplan understanding to new unexplored tasks, it still has limitations. As we collect diverse images in-the-wild, our data naturally contains noise (mitigated by our data collection and cleaning pipeline) which could affect downstream performance. In addition, while our dataset covers a diverse set of building types, it leans towards historic and religious buildings, possibly introducing bias towards these semantic domains. We focus on 2D floorplan images, though we see promise in our approach and data for spurring further research in adjacent domains, such as 3D building generation and architectural diagram understanding in general. In particular, although our work does not consider the 3D structure of buildings, we see promise in the use of our floorplans for aligning in-the-wild 3D point clouds or producing 3D-consistent room and building layouts. Finally, our work could provide a basis for navigation tasks which require indoor spatial understanding, such as indoor and household robotics. We envision that future architectural understanding models enabled by datasets such as WAFFLE will explore new challenging tasks such as visual question answering for floorplans, which could be enabled by our textual metadata and open-vocabulary architectural features.
References

[1] Abhishek Arora and Melissa Dell. LinkTransformer: A unified package for record linkage with transformer language models. arXiv preprint arXiv:2309.00789, 2023. 4
[2] Ricardo Cabral and Yasutaka Furukawa. Piecewise planar and compact floorplan reconstruction from images. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 628–635. IEEE, 2014. 2
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020. 3, 1
[4] Eric Ming Chen, Jin Sun, Apoorv Khandelwal, Dani Lischinski, Noah Snavely, and Hadar Averbuch-Elor. What's in a decade? Transforming faces through time. In Computer Graphics Forum, volume 42, pages 281–291. Wiley Online Library, 2023. 6
[5] Min Jin Chong and David Forsyth. Effectively unbiased FID and Inception Score and where to find them, 2020. 6
[6] Hang Chu, Dong Ki Kim, and Tsuhan Chen. You are here: Mimicking the human thinking process in reading floorplans. In Proceedings of the IEEE International Conference on Computer Vision, pages 2210–2218, 2015. 2
[7] Steve Cruz, Will Hutchcroft, Yuguang Li, Naji Khosravan, Ivaylo Boyadzhiev, and Sing Bing Kang. Zillow Indoor Dataset: Annotated floor plans with 360deg panoramas and 3D room layouts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2133–2143, 2021. 3
[8] Mathieu Delalandre, Ernest Valveny, Tony Pridmore, and Dimosthenis Karatzas. Generation of synthetic documents for performance evaluation of symbol recognition & spotting systems. International Journal on Document Analysis and Recognition (IJDAR), 13(3):187–207, 2010. 3
[9] Samuel Dodge, Jiu Xu, and Björn Stenger. Parsing floor plan images. In 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), pages 358–361. IEEE, 2017. 2
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1
[11] Zhiwen Fan, Lingjie Zhu, Honghua Li, Siyu Zhu, and Ping Tan. FloorPlanCAD: A large-scale CAD drawing dataset for panoptic symbol spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021. 3
[12] Arnaud Gueze, Matthieu Ospici, Damien Rohmer, and Marie-Paule Cani. Floor plan reconstruction from sparse views: Combining graph neural network with constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1583–1592, 2023. 2
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 6
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018. 6
[15] Sepidehsadat Hosseini, Mohammad Amin Shabani, Saghar Irandoust, and Yasutaka Furukawa. PuzzleFusion: Unleashing the power of diffusion models for spatial puzzle solving. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 2
[16] Ruizhen Hu, Zeyu Huang, Yuhan Tang, Oliver Van Kaick, Hao Zhang, and Hui Huang. Graph2Plan: Learning floorplan generation from layout graphs. ACM Transactions on Graphics (TOG), 39(4):118–1, 2020. 2, 8
[17] Hao Hua. Irregular architectural layout synthesis with graphical inputs. Automation in Construction, 72:388–396, 2016. 2, 8
[18] Ahti Kalervo, Juha Ylioinas, Markus Häikiö, Antti Karhu, and Juho Kannala. CubiCasa5K: A dataset and an improved multi-task model for floorplan image analysis. In Image Analysis: 21st Scandinavian Conference, SCIA 2019, Norrköping, Sweden, June 11–13, 2019, Proceedings 21, pages 28–40. Springer, 2019. 2, 3, 5, 6, 8, 4, 7, 9
[19] Ryan S Kaminsky, Noah Snavely, Steven M Seitz, and Richard Szeliski. Alignment of 3D point clouds to overhead images. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 63–70. IEEE, 2009. 2
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 4
[21] Chen Liu, Jiaye Wu, and Yasutaka Furukawa. FloorNet: A unified framework for floorplan reconstruction from 3D scans. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–217, 2018. 2
[22] Chen Liu, Jiajun Wu, Pushmeet Kohli, and Yasutaka Furukawa. Raster-to-vector: Revisiting floorplan transformation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2195–2203, 2017. 2, 3, 6
[23] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7086–7096, 2022. 5, 3, 9
[24] Xiaolei Lv, Shengchu Zhao, Xinyang Yu, and Binqiang Zhao. Residential floor plan recognition and reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16717–16726, 2021. 2
[25] Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3D jigsaw puzzle: Mapping large indoor spaces. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13, pages 1–16. Springer, 2014. 2
[26] Paul Merrell, Eric Schkufza, and Vladlen Koltun. Computer-generated residential building layouts. In ACM SIGGRAPH Asia 2010 Papers, pages 1–12. 2010. 2, 8
[27] Medhini Narasimhan, Erik Wijmans, Xinlei Chen, Trevor Darrell, Dhruv Batra, Devi Parikh, and Amanpreet Singh. Seeing the un-scene: Learning amodal semantic maps for room navigation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 513–529. Springer, 2020. 2
[28] Atsuhiro Noguchi and Tatsuya Harada. Image generation from small datasets via batch statistics adaptation, 2019. 6
[29] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465, 2017. 3
[30] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar Averbuch-Elor. READ: Recursive autoencoders for document layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 544–545, 2020. 2
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 4, 1
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. 5, 6, 4
[33] Mohammad Amin Shabani, Weilian Song, Makoto Odamaki, Hirochika Fujiki, and Yasutaka Furukawa. Extreme structure from motion for indoor panoramas without visual overlaps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5703–5711, 2021. 2
[34] Matthias Standfest, Michael Franzen, Yvonne Schröder, Luis Gonzales Medina, Yarilo Villanueva Hernandez, Jan Hendrik Buck, Yen-Ling Tan, Milena Niedzwiecka, and Rachele Colmegna. Swiss Dwellings: a large dataset of apartment models including aggregated geolocation-based simulation results covering viewshed, natural light, traffic noise, centrality and geometric analysis, 2022. 3
[35] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016. 6
[36] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 4, 1
[37] Casper van Engelenburg, Seyran Khademi, Fatemeh Mostafavi, Matthias Standfest, and Michael Franzen. Modified Swiss Dwellings: a machine learning-ready dataset for floor plan auto-completion at scale, 2023. 3
[38] Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X Chang, and Manolis Savva. Plan2Scene: Converting floorplans to 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10733–10742, 2021. 3
[39] Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Lost shopping! Monocular localization in large indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision, pages 2695–2703, 2015. 2
[40] Yaxing Wang, Abel Gonzalez-Garcia, David Berga, Luis Herranz, Fahad Shahbaz Khan, and Joost van de Weijer. MineGAN: Effective knowledge transfer from GANs to target domains with few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9332–9341, 2020. 6
[41] Ramon Elias Weber, Caitlin Mueller, and Christoph Reinhart. Automated floorplan generation in architectural design: A review of methods and applications. Automation in Construction, 140:104385, 2022. 2
[42] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2 – a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584, 2020. 3
[43] Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019. 2, 3, 6, 8
[44] Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, and Noah Snavely. Towers of Babel: Combining images, language, and 3D geometry for learning multimodal vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 428–437, 2021. 3
[45] Bingchen Yang, Haiyong Jiang, Hao Pan, and Jun Xiao. VectorFloorSeg: Two-stream graph attention network for vectorized roughcast floorplan segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1358–1367, 2023. 2
[46] Shang-Ta Yang, Fu-En Wang, Chi-Han Peng, Peter Wonka, Min Sun, and Hung-Kuo Chu. DuLa-Net: A dual-projection network for estimating room layouts from a single RGB panorama. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3363–3372, 2019. 2
[47] Yuanwen Yue, Theodora Kontogianni, Konrad Schindler, and Francis Engelmann. Connecting the dots: Floorplan reconstruction using two-level queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 845–854, 2023. 2
[48] Zhiliang Zeng, Xianzhi Li, Ying Kin Yu, and Chi-Wing Fu. Deep floor plan recognition using a multi-task network with room-boundary-guided attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9096–9104, 2019. 2
[49] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 8, 4, 5
[50] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH Lau. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4):1–15, 2019. 2
Appendix

A. Interactive Visualization Tool

Please see the attached HTML file (waffle.html) for an interactive visualization of data from the WAFFLE dataset.

B. Additional Dataset Details

We proceed to describe the creation of our WAFFLE dataset in the sections below, including details on curating, filtering, and generating pseudo-ground truth labels.

B.1. Model Checkpoints and Settings

We use Llama-2 [36] for text-related tasks, and CLIP [31] for image-related tasks. For most text-related tasks we use the meta-llama/Llama-2-13b-chat-hf model, and for legend extraction we use the meta-llama/Llama-2-70b-chat-hf model. In both cases, we use the default sampling settings defined by the Hugging Face API. For CLIP, we use the openai/clip-vit-base-patch32 model.

B.2. Layout Component Detection

As part of the data collection, we train a DETR [3] object detection model to identify common floorplan layout components which will later on be used for the pGT extraction and in the segmentation experiment. We use the checkpoint TahaDouaji/detr-doc-table-detection* as the base model, and fine-tune it on 200 manually annotated images with augmentations, using the following labels: floorplan, legend, scale, compass. We fine-tune for 1,300 iterations, a batch size of 4, and a 10^-4 learning rate on one A5000 GPU, splitting our data into 80% training images and 20% test images.

* https://huggingface.co/TahaDouaji/detr-doc-table-detection
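A minimal sketch of initializing the detector for this fine-tuning is shown below; the label set follows the paragraph above, while the dataset loading and training loop (AdamW, learning rate 10^-4, batch size 4, ~1,300 iterations) are omitted for brevity and would follow a standard object-detection recipe.

```python
from transformers import DetrForObjectDetection, DetrImageProcessor

LABELS = ["floorplan", "legend", "scale", "compass"]

processor = DetrImageProcessor.from_pretrained("TahaDouaji/detr-doc-table-detection")
model = DetrForObjectDetection.from_pretrained(
    "TahaDouaji/detr-doc-table-detection",
    num_labels=len(LABELS),
    id2label={i: label for i, label in enumerate(LABELS)},
    label2id={label: i for i, label in enumerate(LABELS)},
    ignore_mismatched_sizes=True,  # replace the classification head for the new label set
)
# Fine-tuning then proceeds with a standard detection training loop, feeding images
# through `processor` and supervising bounding boxes of the four layout components.
```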
B.3. Data Filtering

As described in the paper, we first scrape a set of images and metadata from Wikimedia Commons and proceed to filter to only select images of floorplans using a two-stage process: text-based filtering with an LLM, and image-based filtering with CLIP. All models used for these stages are described in our main paper.

B.3.1 Text-based filtering (LLM)

First, we query an LLM in order to obtain an initial categorization of our raw data. We ask it to choose what the image is most likely a depiction of out of a closed set of categories (i.e. multiple choice question format), marked positive (e.g. floorplan or building) or negative (e.g. map or park), and we filter out images categorized as negative categories. The full prompt is shown on the leftmost column of Figure 25, where options A and B are treated as positive and the rest are negative.

Prefixes: "an illustration of ", "a drawing of ", "a sketch of ", "a picture of ", "a photo of ", "a document of ", "an image of ", "a visual representation of ", "a graphic of ", "a rendering of ", "a diagram of ", ""
Negative Prefixes: "a 3d simulation of ", "a 3d model of ", "a 3d rendering of "
Positive Suffixes: "a floor plan", "an architectural layout", "a blueprint of a building", "a layout design"
Negative Suffixes: "a map", "a building", "people", "an aerial view", "a cityscape", "a landscape", "a topographic representation", "a satellite image", "geographical features", "a mechanical design", "an engineering sketch", "an abstract pattern", "wallpaper", "a Window plan", "a staircase plan"

Figure 8. The prompts used for gathering CLIP scores. (Prefixes × Positive Suffixes) are used as positive prompts, and ((Prefixes + Negative Prefixes) × Negative Suffixes) are negative prompts.

B.3.2 Image-based filtering (CLIP)

We proceed to use image-based filtering to yield our final dataset. This is composed of two sub-stages: first, we generate a smaller set of highly accurate images (a seed); we then extend this seed to produce an enlarged dataset. These sub-stages are described below.

Seed generation. We start by creating a highly accurate seed of images (i.e. containing floorplans) by aggressively filtering according to the CLIP normalized scores extracted over positive and negative prompts. We list the prompts used in Figure 8. As illustrated in the figure, negative prompts correspond to images that depict categories similar to floorplans, such as maps or satellite images. We sort the prompts by score, and add an image to the set if it passes the following two tests: (i) all top five prompts contain floor plan, and (ii) the sum over all prompts containing floor plan is over 0.5. We empirically find that these tests allow for creating a highly accurate seed of 3,402 images.

Dataset extension. Next, we use this seed to bootstrap an image classifier, in order to enlarge the dataset. We first use this seed to train a vision transformer binary classifier. We take 1K images from the seed as positive samples, and 1K images that were categorized as a negative category in the text-based filtering step as negative samples. We fine-tune a ViT model [10] for 5 epochs with a batch size of 4 and a 2·10^-4 learning rate on one A5000 GPU, splitting our data into 1,400 training images, 300 validation and 300 test images.

To create our final dataset, we select images that pass the following two tests: (i) a classifier threshold selected to filter out 10% of the data, and (ii) the sum of normalized scores on all positive prompts (described in Figure 8) is over 0.5.

Altogether this leads us to the final dataset of ∼19K floorplans. In addition to the manual validation over 100 randomly sampled images in the dataset, we also manually inspect the entire test set, and remove all images that do not contain a valid floorplan. Based on this validation, we estimate that 89% of images from our full dataset are indeed floorplans.
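The two seed-selection tests can be expressed directly over the per-prompt normalized scores, as in the following sketch; `scores` is assumed to map each full prompt string to its softmax-normalized CLIP score for one image.

```python
def passes_seed_tests(scores: dict[str, float]) -> bool:
    """Seed-selection rule: (i) the five highest-scoring prompts all mention 'floor plan',
    and (ii) the total score mass on prompts mentioning 'floor plan' exceeds 0.5."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_five_are_floorplan = all("floor plan" in prompt for prompt in ranked[:5])
    floorplan_mass = sum(score for prompt, score in scores.items() if "floor plan" in prompt)
    return top_five_are_floorplan and floorplan_mass > 0.5
```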
random sampled images in the dataset, we also manualy in-
pect the entire test set, and remove all image that do not
contain a valid floorplan. Based on this validation, we es-
timate that 89% of images from our full dataset are indeed
floorplans.

B.4. LLM-Driven Structured pGT Generation


Figure 25 contains the prompts used for extracting the
pseudo-ground truth (pGT) labels for our dataset. Note that
some prompts use previously extracted pGTs as inputs, such
as those for “Building Type” and “Location Information”.
The architectural feature grounding process is split into
two: legend extraction from the image metadata, and archi-
tectural information extraction from the image.
Legend Structuring from Metadata. We divide the task of legend structuring into four sub-tasks: (i) legend raw content extraction (the raw text containing key–value pairs), extracted using the prompts in Figure 26; (ii) key–value identification (raw text structurization) using regular expressions on the raw text legend; (iii) legend content simplification, using the prompt in Figure 27; and (iv) grounding the architectural features in the legend to the image, by marking the keys of the legend in the image and mapping between the legend values and the key locations. The last sub-task is obtained by searching for the keys in the image's OCR-detected texts. Images can sometimes contain the full key–value legend itself in addition to the key markings (as seen in Figure 9, for example). To avoid marking these as well, we leverage the multiple text granularities returned by the Google Vision API, filter out identified keys that are part of a sentence/paragraph, and exclude areas detected as 'legend' by our layout component detector.

Figure 9. An example of an image which contains legend text, seen as rasterized text underneath the floorplan. Our legend key-grounding correctly detects the keys in the image and successfully avoids incorrect groundings, such as to the legend depicted below.

Architectural Information from the Image. As mentioned in the main paper, we use the OCR detections within the relevant layout components as candidates containing interesting architectural information – legends in legend areas, and architectural labels in floorplan areas. Next, we send these candidates to the LLM (similarly to the metadata legend extraction process) using the prompts in Figure 28 to obtain a raw legend/list of architectural labels. The legends are formatted and grounded similarly to those extracted from the metadata. The architectural features are translated if they are not in English using the Google Translate API, while maintaining the original text representation which is used to ground them to the image.
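A simplified sketch of sub-tasks (ii) and (iv) described above: parsing key–value pairs from the raw legend text with a regular expression, and grounding each key to an OCR word detection. The regex and the OCR record format are assumptions made for illustration.

import re

KEY_VALUE_RE = re.compile(r"^\s*([A-Za-z0-9]{1,3})\s*[.:\-]\s*(.+?)\s*$")

def parse_legend(raw_legend_text):
    # Turn raw legend text (one item per line, e.g. "1. nave") into {key: value}.
    legend = {}
    for line in raw_legend_text.splitlines():
        match = KEY_VALUE_RE.match(line)
        if match:
            key, value = match.groups()
            legend[key] = value
    return legend

def ground_legend(legend, ocr_words):
    # `ocr_words` is assumed to be a list of dicts like
    # {"text": "1", "box": (x0, y0, x1, y1), "in_paragraph": False},
    # derived from the OCR word/paragraph granularities. Keys that are part of
    # a longer sentence/paragraph are skipped to avoid grounding into a
    # rendered legend instead of the floorplan itself.
    grounded = {}
    for word in ocr_words:
        if word["in_paragraph"]:
            continue
        key = word["text"].strip(".:")
        if key in legend:
            grounded.setdefault(legend[key], []).append(word["box"])
    return grounded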
B.5. Dataset statistics

Figure 12 shows a visualization of the different countries in our dataset; Figure 10 shows a histogram of common building types in the dataset; Figure 13 shows a visualization of the distribution of words in images detected using OCR; and Figure 11 shows a histogram of the common architectural features that are grounded in WAFFLE images.

C. Experimental Details

C.1. Building Type Understanding Task

For the building type understanding task, we fine-tune CLIP on our training set. We train it for 5 epochs with a batch size of 256, a learning rate of 10−3, and the Adam optimizer, on one A5000 GPU.
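A rough sketch of this fine-tuning, pairing floorplan images with texts built from the building-type pGT; the checkpoint name and the prompt template are assumptions, and the contrastive objective is the standard CLIP loss exposed by the library.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(images, building_types, device="cuda"):
    # images: list of PIL floorplan images; building_types: their pGT labels.
    texts = [f"a floorplan of a {t}" for t in building_types]   # assumed template
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True).to(device)
    model.to(device).train()
    # return_loss=True yields the standard CLIP contrastive loss over the batch.
    loss = model(**inputs, return_loss=True).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()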
Figure 10. Distribution of common building types extracted automatically (log scale), illustrating the rich semantics captured in WAFFLE.

Figure 11. Distribution of the grounded architectural features (log scale), among almost 3K grounded images, 25K instances grounded, and 11K unique features.

Figure 12. Number of samples per country (log scale) in WAFFLE, showing the diversity of our dataset for both training and test splits. Blue: training data; Orange: test data.

Figure 13. OCR word statistics. The bar plot depicts a histogram of the number of words detected in an image; the word map on the top right shows the most common words detected in our dataset. As illustrated above, the raw OCR data contains semantic information and also significant levels of noise, and thus it is challenging to operate over this data directly; hence motivating the need for extracting data from sources external to the images (e.g. linked Wikipedia pages).

C.2. Open-Vocabulary Floorplan Segmentation Task

We learn image segmentation by fine-tuning CLIPSeg [23] (using base checkpoint CIDAS/clipseg-rd64-refined) on images with corresponding positive and negative segmentation maps.
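For reference, a minimal sketch of loading this checkpoint and querying it with an open-vocabulary text prompt; the query string is illustrative.

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def segment(image_path, query="nave"):
    # Returns a per-pixel probability map for the text query
    # (low-resolution logits for this checkpoint, upsampled as needed).
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[query], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.sigmoid(logits).squeeze()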
Training details. Training data for segmentation is created from images with grounded architectural features (GAFs), using those that occur over 10 times in our dataset in order to filter out noise. To avoid leakage of information from rendered text, OCR detections are removed via in-painting, as seen in Figure 14. During training, we also apply augmentations such as cropping, resizing and noising to enlarge our training dataset, applied to both images and target segmentation maps as needed.

Figure 14. An example of in-painting to preprocess data for the segmentation task. On the left, we show the original image, which contains text indicating architectural features, including the Nave, Side Chapels and Aisles. On the right, we show the in-painted version of the image, which succeeds in removing these texts to prevent leakage. We observe that this in-painting process may slightly modify the appearance of the image, but the floorplan's structure is mostly preserved.

To identify positive and negative targets for a given image and GAF text, we use the text embedding model paraphrase-multilingual-mpnet-base-v2 from SentenceTransformers*, measuring semantic similarity between GAF texts via embedding cosine similarity. Pairs of features with high (> 0.7) similarity scores are marked as positive and those with low (< 0.4) similarity are marked as negative; the loss is calculated on these areas alone.
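A short sketch of this positive/negative assignment with sentence-transformers; the thresholds follow the text above, while the helper structure is an illustrative assumption.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def label_targets(query_gaf, image_gafs, pos_thresh=0.7, neg_thresh=0.4):
    # Split an image's GAFs into positive/negative targets for `query_gaf`.
    embeddings = encoder.encode([query_gaf] + image_gafs, convert_to_tensor=True)
    sims = util.cos_sim(embeddings[0:1], embeddings[1:])[0]   # cosine similarities
    positives = [g for g, s in zip(image_gafs, sims) if s > pos_thresh]
    negatives = [g for g, s in zip(image_gafs, sims) if s < neg_thresh]
    return positives, negatives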
Our overall loss consists of the weighted sum of three losses: cross-entropy over the masked positive and negative areas (Lce), an L1 regularization loss (LL1), and an entropy loss (the mean of binary entropies of pixel intensities over the whole image) (Le). Our total loss is Ltotal = 1/2 Lce + 1/2 LL1 + Le.
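A sketch of this objective in PyTorch, assuming `pred` holds per-pixel probabilities in [0, 1] and `pos_mask`/`neg_mask` mark the grounded positive and negative regions; the exact masking and reduction details are assumptions.

import torch
import torch.nn.functional as F

def segmentation_loss(pred, pos_mask, neg_mask, eps=1e-6):
    # pred, pos_mask, neg_mask: float tensors of shape (H, W); masks are 0/1.
    target = pos_mask                           # 1 inside positive regions
    mask = (pos_mask + neg_mask).clamp(max=1)   # loss only on labeled areas
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    l_ce = (bce * mask).sum() / (mask.sum() + eps)
    l_l1 = pred.abs().mean()                    # L1 regularization on predictions
    # Mean binary entropy of pixel intensities over the whole image.
    l_e = -(pred * (pred + eps).log() + (1 - pred) * (1 - pred + eps).log()).mean()
    return 0.5 * l_ce + 0.5 * l_l1 + l_e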
We fine-tune with the following settings: 20 epochs; batch size 1; on one A5000 GPU; with a 10−4 learning rate; with an Adam [20] optimizer.

Evaluation. We manually annotate 95 images for evaluation with 27 common GAFs. The most common building types in our evaluation set are churches, castles and residential buildings.

* https://www.sbert.net/

C.3. Floorplan Generation Task

For the generation task, we adopt the text-to-image example provided in Hugging Face*, fine-tuning the Stable Diffusion (SD) [32] model CompVis/stable-diffusion-v1-4. We add a custom sampler to avoid over-sampling the same building; in particular, in each epoch we use only one sample out of all those corresponding to a given <building type> and <building name>. In addition, we resize the images to 512×512, keeping the original proportions of the image and adding padding as needed. We train our model for 20K iterations with a batch size of 4, a learning rate of 10−5 and the Adam optimizer, on one A5000 GPU.

* https://huggingface.co/docs/diffusers/v0.18.2/en/training/text2image
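A sketch of the custom sampling idea described above: each epoch draws at most one image per (building type, building name) group. The grouping helper below is an illustrative stand-in, not the actual dataloader code.

import random
from collections import defaultdict

def epoch_indices(samples, seed=None):
    # samples: list of dicts with "building_type" and "building_name" keys.
    # Returns indices with one randomly chosen sample per building, so that
    # buildings with many floorplan images do not dominate an epoch.
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, s in enumerate(samples):
        groups[(s["building_type"], s["building_name"])].append(idx)
    return [rng.choice(idxs) for idxs in groups.values()]

# Example: a fresh subset is drawn at the start of every epoch.
# for epoch in range(num_epochs):
#     subset = epoch_indices(dataset_metadata, seed=epoch)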
For the boundary-conditioned generation task (conditioned on the outer contour of the building), we first extract the outer edges for all images in the training set. We use the Canny edge detection algorithm as implemented in the OpenCV library*, extract only external contours*, and remove contours with small areas. Samples where edge detection fails (returns an empty mask) are excluded. We then fine-tune* ControlNet [49]. We initialize the SD part of the architecture with our fine-tuned SD from the previous paragraph and the shape-condition part with a pre-trained model trained on Canny edge masks (as this condition is similar to our task) (lllyasviel/sd-controlnet-canny)*. We use the same custom sampler and resizing as described above. We train the model for 15K iterations with a batch size of 4, a learning rate of 10−5 and the Adam optimizer, on one A5000 GPU.

For structure-conditioned generation (conditioned on building layouts), we use our fine-tuned SD model from above and fine-tune ControlNet on top of it. Unlike the boundary-conditioned task, we fine-tune ControlNet using external data from the CubiCasa5K (CC5K) [18] train set. As conditions, we convert the structured CC5K SVG data into images with pixel values representing the subset of categories relevant to our dataset: foreground (white), background (black), walls (red), doors (blue), and windows (cyan). The foreground category comprises all CC5K room categories that are not background; doors and windows are taken from the CC5K icon categories and overlaid on top of foreground/background/walls to produce the condition image (rather than being independent layers). We initialize the SD part of the architecture with our fine-tuned SD from the previous paragraph and the shape-condition part with the model's UNet weights. Images and conditions are resized using the method described above. We train the model for 20K iterations with a batch size of 4, a learning rate of 10−5 and the Adam optimizer, on one A5000 GPU. During inference we use a CFG scale of 15.0 and a condition scale of 0.5, to fuse the style of the input prompt (learned from floorplans in WAFFLE) and the structure condition.

* https://docs.opencv.org/4.x/da/d22/tutorial_py_canny.html
* https://docs.opencv.org/4.x/d9/d8b/tutorial_py_contours_hierarchy.html
* https://huggingface.co/docs/diffusers/v0.18.2/en/training/controlnet
* https://huggingface.co/blog/controlnet
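A sketch of the boundary-mask extraction described above with OpenCV (Canny edges, external contours only, dropping small-area contours); the thresholds and minimum area are illustrative.

import cv2
import numpy as np

def boundary_mask(image_path, min_area=1000):
    # Extract the outer contour of a floorplan as a binary mask (None on failure).
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 100, 200)                         # illustrative thresholds
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,  # external contours only
                                   cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if cv2.contourArea(c) >= min_area]
    if not contours:                                          # edge detection failed
        return None
    mask = np.zeros_like(gray)
    cv2.drawContours(mask, contours, -1, color=255, thickness=2)
    return mask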
Building type           FID ↓ (SD / SDFT)    KMMD ↓ (SD / SDFT)   CLIP Sim. ↑ (SD / SDFT)
Residential building    284.6 / 148.3        0.13 / 0.07          25.3 / 25.9
Historic building       159.6 / 146.1        0.06 / 0.05          25.2 / 25.6
Monastery               198.4 / 156.8        0.11 / 0.06          24.4 / 25.6
Cathedral               188.3 / 142.0        0.09 / 0.04          24.2 / 25.4
Museum                  285.1 / 147.7        0.16 / 0.11          25.3 / 25.3
Building                199.2 / 114.5        0.09 / 0.05          24.0 / 24.5
Hospital                194.3 / 100.9        0.13 / 0.05          24.2 / 25.1
Theater                 212.2 / 158.0        0.12 / 0.08          25.8 / 26.7
Church                  159.2 / 122.3        0.08 / 0.03          25.9 / 26.5
Temple                  180.6 / 137.6        0.07 / 0.05          25.2 / 26.1
School                  139.6 / 119.1        0.05 / 0.04          25.5 / 25.5
Palace                  224.0 / 168.6        0.13 / 0.11          25.1 / 26.0
House                   165.9 / 176.9        0.07 / 0.09          24.7 / 25.6
Castle                  141.6 / 102.0        0.06 / 0.03          24.6 / 25.8
Hotel                   189.5 / 238.4        0.08 / 0.17          24.6 / 24.5

Table 6. Quantitative results on floorplan image generation split by building type, comparing the quality of images generated with the pretrained SD model and our fine-tuned model (SDFT).

User study. Each study contained 36 randomly-generated image pairs, with text prompts mentioning various building types that were sampled from the 100 most common types. Overall, thirty-one users participated in the study, resulting in a total of 1,116 image pairs (one generated from the pretrained model, and the other generated from the fine-tuned model) that were averaged for obtaining the final results reported in the main paper.

Participants were provided with the following instructions: In this user study you will be presented with a series of pairs of images, generated by the prompt: "a floorplan of a <BUILDING TYPE>". For example, "a floorplan of a cathedral". For each pair, please select the image that best conveys the text prompt (i.e., both looks like a floorplan diagram, and also looks like a plan of the specific building type mentioned in the prompt). If you are unsure, please make an educated guess to the best of your ability. Thank you for participating!

A sample question from our user study is illustrated in Figure 15. All of the questions were forced-choice, and participants could only submit after answering all of the questions.

Figure 15. A sample question from our user study on text-conditioned floorplan generation.

C.4. Benchmark for Semantic Segmentation

We created a benchmark of 110 SVG images, containing wall, window and door annotations. We included SVG images from our test set. To obtain additional SVG images, we also searched for SVGs that were filtered out during the dataset filtering step. Then, we used Inkscape*, which allowed us to easily annotate full SVG components at once instead of doing it pixel-wise. This made the manual annotation process less tedious and more accurate.

* https://inkscape.org/

C.5. Wall Segmentation with a Diffusion Model

We apply a diffusion-based architecture to wall segmentation by training ControlNet [49] using CubiCasa5K [18] (CC5K) layout maps as the target image to generate and input images as conditions. In particular, we convert CC5K annotations into binary images by denoting walls with black pixels and use these as supervision for binary wall segmentation. We initialize with the CompVis/stable-diffusion-v1-4 checkpoint and train for 200K iterations on train items in the CC5K dataset, which provides us with 4,200 pairs of images. Other training hyperparameters are the same as those used for ControlNet applied to other tasks as described above. During inference, we input an image (resized to the correct resolution) as a control and generate 25 images with random seeds (guidance scale 1.0, CFG scale 7.5, 20 inference steps using the default PNDM scheduler). We discretize output pixel values to the closest valid layout color and then use pixel-wise mode values, thus reducing noise from the random nature of each individual generation.
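A sketch of this inference-time aggregation: each of the 25 generations is snapped to the nearest valid layout color, and the per-pixel mode across generations gives the final map. The color palette and array shapes below are assumptions (here only wall vs. non-wall).

import numpy as np

# Assumed palette of valid layout colors (RGB): wall (black) vs. non-wall (white).
PALETTE = np.array([[0, 0, 0], [255, 255, 255]], dtype=np.float32)

def aggregate_generations(generations):
    # generations: array of shape (N, H, W, 3) holding N generated RGB images.
    # Returns an (H, W) map of palette indices chosen by per-pixel majority vote.
    gens = np.asarray(generations, dtype=np.float32)                 # (N, H, W, 3)
    dists = np.linalg.norm(gens[..., None, :] - PALETTE, axis=-1)    # (N, H, W, P)
    labels = dists.argmin(axis=-1)                                   # snap to nearest color
    votes = np.stack([(labels == p).sum(axis=0) for p in range(len(PALETTE))])
    return votes.argmax(axis=0)                                      # (H, W)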
D. Additional Results and Visualizations

D.1. Semantic Segmentation Results

Figures 16 and 17 contain examples of test images and annotations from our benchmark for semantic segmentation, and the results of the existing CC5K [18] model on them. These figures demonstrate how the model is challenged in detecting segments it wasn't exposed to during training, like pillars or curved walls. In Figure 18 we show qualitative examples of wall detection using different model architectures – the existing ResNet-152 [13] based model, and the Diffusion-based model discussed in Section C.5. As illustrated in the figure, using a more advanced model architecture allows for obtaining significantly cleaner wall segmentations. Table 7 contains a quantitative analysis of the two models on the wall segmentation prediction task.

Figure 16. Benchmark for semantic segmentation (over the walls, doors, windows, interior and background categories) on images from WAFFLE using the strongly supervised CC5K [18] pretrained model. We can see that our data serves as a challenging benchmark, as the model struggles with more diverse and complex floorplans.

            ResNet   Diffusion
Precision   0.737    0.746
Recall      0.590    0.805
IoU         0.488    0.632

Table 7. A comparison between an existing ResNet-based wall detection model (introduced in CC5K [18]) and a Diffusion-model based one (detailed further in Section C.5), evaluated on our benchmark. We can see that the Diffusion-based model outperforms the ResNet-based model across all metrics, suggesting that newer architectures show promise in improving localized knowledge of in-the-wild data, such as the floorplans found in WAFFLE.

D.2. Additional Open-Vocabulary Floorplan Segmentation Results

In Figure 19, we show additional examples of text-driven floorplan image segmentation before and after fine-tuning on our data. We see that the baseline model struggles to localize concepts inside floorplan images while our fine-tuning better concentrates probabilities in relevant regions, approaching the GT regions indicated in orange rectangles. In Figure 20 we visually compare the segmentation results to those of CC5K and CLIPSeg on residential buildings. We observe that the supervised CC5K model (trained on Finnish residential floorplans alone) fails to generalize to the diverse image appearances and building styles in WAFFLE, even when they are residential buildings, while our model shows a more general understanding of semantics in such images.

D.3. Additional Generation Results

We show additional results for the generation task in Figure 21 and for the spatially-conditioned generation in Figure 22. We provide multiple examples for various building types, showing that a model trained on our data learns the distinctive structural details of each building type. For example, castles have towers, libraries have long aisles, museums, hospitals, and hotels have numerous small rooms, churches have a typical cross shape, and theaters are characterized by having rows of seats facing a stage. The differences between the various types and their unique details are further shown in Figure 24, where we illustrate examples from our training set of various types.

In Table 6 we show a breakdown of the metrics for the generated images according to the most common building types in the dataset. The table compares our fine-tuned model with a base SD model, showing that for the vast majority of building types, our fine-tuned model generates images that are both more realistic and also semantically closer
to the target prompt.

For structure-conditioned generation, we show additional results in Figure 23, where input conditions are derived from the CC5K dataset annotations as described above. In the figure, we show the effect of changing the condition and CFG scales during inference, illustrating the significance of these settings. In particular, we see that the condition scale controls the trade-off between fidelity to the layout condition and matching the building type in the prompt (rather than exclusively outputting images in the style of the CC5K fine-tuning data).

Figure 17. Additional results on the semantic segmentation benchmark. Images are annotated according to walls, doors, windows, interior and background categories. On the right we show results obtained with CC5K [18].

Figure 18. Wall segmentation results, comparing the CubiCasa5K (CC5K) [18] baseline segmentation model that uses a ResNet backbone to a modified architecture that uses a Diffusion model, as described in Section C.5. The Diffusion-based model yields refined wall predictions, as illustrated by the examples shown above.
[Figure 19 shows rows of example GAF queries such as Nave, Choir, Court, Transept, Narthex, Tower, and Entrance; columns show GT, CLIPSeg, and Ours.]

Figure 19. In each column: segmentation results on samples of our test set before (center) and after (right) fine-tuning on our data.
[Figure 20 shows rows for residential GAFs including Living room, Kitchen, Porch, and Bedroom; columns show the image with GT, CC5K [18], CLIPSeg [23], and Ours.]

* Our porch results correspond to the CubiCasa5K outdoor category.

Figure 20. Additional comparisons of our segmentation probability map results on residential buildings with the strongly-supervised CubiCasa5K (CC5K) model [18].
[Figure 21 shows the building types Cathedral, Theater, Residential Building, Palace, Museum, and Hospital.]

Figure 21. Additional generated floorplans, showing diverse building types (provided on the left). The left-most column shows samples from the pretrained SD model and the rest of the columns showcase the results from our fine-tuned model.
[Figure 22 shows input images, their extracted masks, and generations for the building types House, Castle, Library, Museum, Hospital, and Hotel.]

Figure 22. Additional results for boundary-conditioned generation, showing a variety of shapes (shown on the left) and building types (shown on top).
[Figure 23 shows conditions and generations for School, Cathedral, and Mausoleum across condition scales (CS 0.0, 0.25, 0.5, 1.0), and for Apartment, Castle, and Factory across CFG scales (7.5, 10, 15, 25).]

Figure 23. Additional results for structure-conditioned generation, showing the effect of changing condition scale (CS) and CFG scales during inference (with a fixed seed). The condition scale controls the trade-off between adherence to the structure condition and avoiding leakage of the CubiCasa5K style which ControlNet was exposed to in fine-tuning. We also find that a relatively high CFG value improves image quality. Chosen values for inference are in bold.
[Figure 24 shows training examples for Museum, Residential Building, Library, Theater, Church, House, School, Hotel, Castle, and Hospital.]

Figure 24. Examples of images from our dataset with their building types (shown on the left).
[Figure 25 shows the four prompt templates "Image Category", "Building Name", "Building Type", and "Location Information"; each fills in Wikipedia-derived fields (entity category, description, filename, OCR texts or page summary) and requests a bracketed answer.]

Figure 25. The prompts used for LLM-based extraction of pGTs. Each {...} placeholder is replaced with the respective image data. At first we only have raw data (as seen in the "Image Category" prompt), but once we gather pGTs we may use them in other prompts, for example {building name} as used in "Building Type" and "Location Information". We ask the LLM to return a semi-structured response (choosing an answer from a closed set; wrapping the answer in brackets etc.) so that we can easily extract the answer of interest. From left to right: The "Image Category" prompt is used for the initial text-based filtering, where categories (A) and (B) are positive and the rest are negative. The "Building Name" and "Building Type" prompts are used for setting the building name and type respectively. The "Location Information" prompt extracts the country, state, and/or city (whichever of these exist). Note that the country is subsequently used for defining our test-train split.
[Figure 26 shows the four prompt templates "Legend Existence (Wikipedia)", "Legend Content (Wikipedia)", "Legend Existence (caption)", and "Legend Content (caption)", which first ask whether the given text contains a legend for the image and then ask for its contents as a bulleted list of full items.]

Figure 26. The prompts used for extracting the legend contents. The two left prompts are used for extracting data from Wikipedia, and the two right ones for the image caption. In both cases this is a two-step extraction: first we ask the LLM if it thinks the text contains a legend. Only if it answers yes, we ask it for its content. This reduces hallucinations and keeps the answers accurate.
Legend simplification

[INST]
The following texts contain a legend of a {building type} floor plan in a key:value format:

--- START LEGEND ---
{legend}
--- END LEGEND ---

Please rewrite the legend using simple and generic words.

Do:
* Include all legend parts from the list above.
* Keep it simple and short: summarize each row in one/two words
* Keep the original legend keys
* In case the features have distinct names (e.g. Chapel of the Ascension) treat their type only (e.g. a chapel) and disregard any specific name.

Don't:
* Don't invent new information
* Don't include specific names (use their type instead)
* Don't skip any of the legend lines above

Write your answer in English, translating any non-English terms.
[/INST]

Sure, here's a simplified and generalized version of the legend:
*

Figure 27. The prompt used for legend simplification, serving to clean up the original legends for the image grounding process. The goal is to obtain a list of keys to architectural features. We aim to shorten long descriptions, remove names, and translate any non-English text.
[Figure 28 shows the four prompt templates "Legend Existence", "Legend Content", "Arc-Feats Existence", and "Arc-Feats Content", which ask whether the OCR detections contain a legend or architectural feature labels and, if so, request their contents as a bulleted list, excluding symbols, keys, and direction words.]

Figure 28. The prompts used for extracting legends and architectural features from OCR detections. The two left prompts are used for extracting legends, and the two right ones for architectural feature labels. In both cases this is a two-step extraction: first we ask the LLM if it thinks the text contains a legend. Only if it answers yes, we ask it for its content. This reduces hallucinations and keeps the answers accurate.
