Neural Floorplan

Raster floorplan / color-coded sketch → semantic segmentation → (planned) classified vector CAD
Deep Learning  ·  Semantic Segmentation  ·  Transfer Learning  ·  CAD Automation

Stack:
Python 3.11  ·  PyTorch  ·  HuggingFace Transformers (SegFormer)
OpenCV  ·  Shapely  ·  pytest  ·  ruff

Overview

Architectural drawings exist in two very different forms. Vector-based formats (DWG, SVG) encode walls, doors, and rooms as typed geometry — queryable, editable, machine-readable. Raster formats (scanned blueprints, photo-captured sketches, rendered PNGs) are just pixels, stripped of semantic identity. The problem is that the raster form is far more common in practice: architects sketch on paper, older documents were never digitized properly, and real-world handoffs often produce images rather than source files.
This project builds a supervised learning pipeline that converts raster floorplans — including color-coded hand-drawn sketches — into per-pixel semantic segmentation masks, and (as a planned next step) into clean classified vector geometry exported as JSON. The segmentation stage is complete; the vectorization stage is under development.
The problem is genuinely hard. Raster floorplans are visually inconsistent: line weights vary, scans introduce noise, rooms with different functions are drawn identically in greyscale, and the same plan may appear at wildly different resolutions. Producing pixel-accurate class labels from this input requires a model that generalizes across these variations rather than memorizing individual drawings.
Interview framing: this is structurally the same problem as turning any messy real-world 2D/3D engineering representation into a structured, typed downstream format. The same class of approach — supervised segmentation on carefully constructed raster/label pairs, transfer-learned backbone, explicit data-quality strategy — applies broadly to engineering-drawing understanding, point-cloud labeling, or scan-to-BIM workflows.

Pipeline

The full pipeline has eight stages. Stages 1–6 are complete; stages 7–8 are planned. The diagram below uses color to distinguish them.

Input Data

Dataset — CubiCasa5K

The primary dataset is CubiCasa5K, a collection of ~5,000 residential floor plan images paired with SVG vector annotations. Each SVG encodes walls, openings (doors/windows), room polygons, and icons as typed geometry. The high_quality_architectural subset was used throughout.

Data-quality strategy

The original raster images in CubiCasa5K are scraped from real estate listings. They are often misaligned with the SVG annotations, inconsistently scaled, or visually noisy. Using them directly as training inputs would introduce label noise — the pixel-level class boundary in the mask would not accurately correspond to what the raster image shows.
To address this, the dataset was built in two parts:
(a) Clean-subset rasters. A hand-selected subset of original raster images where the visual content is clearly aligned with the SVG annotation. These are kept as-is because their raster↔mask correspondence is trustworthy.
(b) SVG-rasterized images (spec_v003). For the rest of the dataset, the raster is generated by rendering the SVG directly — walls, rooms, and openings are drawn onto a white background as a model_clean.png. Because the raster and the mask both originate from the same SVG source, their alignment is exact by construction.
The rationale: reliable supervised segmentation requires clean, exactly-aligned input↔label pairs. Messy or misaligned originals inject label noise that degrades training signal and ultimately hurts generalization.
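The "aligned by construction" idea in (b) can be illustrated with a small sketch that draws the same polygons twice, once as a clean image and once as a class-id mask. This is not the project's spec_v003 renderer: the `shapes` structure, the colors, and the draw order are assumptions made for illustration, and real SVG parsing is left out.

```python
import numpy as np
from PIL import Image, ImageDraw

# Class ids follow the legend in this document; the RGB colors are illustrative.
RENDER_COLORS = {1: (0, 0, 0), 2: (200, 60, 60), 3: (235, 235, 235), 4: (60, 120, 200)}
DRAW_ORDER = (3, 1, 2, 4)  # assumed order: rooms first, then walls, openings, icons


def render_pair(shapes, size=(512, 512)):
    """Render a clean RGB image and a class-id mask from the same polygons.

    `shapes` is a hypothetical list of (class_id, [(x, y), ...]) tuples that
    stands in for geometry parsed from the SVG annotation. `size` is (width, height).
    """
    image = Image.new("RGB", size, (255, 255, 255))  # white background
    mask = Image.new("L", size, 0)                   # 0 = background
    draw_image, draw_mask = ImageDraw.Draw(image), ImageDraw.Draw(mask)
    for class_id in DRAW_ORDER:
        for cid, points in shapes:
            if cid == class_id:
                # Same polygon, same rasterizer: image and mask cannot drift apart.
                draw_image.polygon(points, fill=RENDER_COLORS[cid])
                draw_mask.polygon(points, fill=cid)
    return np.asarray(image), np.asarray(mask)


room = (3, [(10, 10), (100, 10), (100, 100), (10, 100)])
wall = (1, [(10, 10), (100, 10), (100, 14), (10, 14)])
clean_image, class_mask = render_pair([room, wall], size=(128, 128))
```

Because both outputs come from one geometry source, every non-white pixel in the image carries a non-background label in the mask.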
Figures: original raster (CubiCasa5K)  ·  SVG-rendered clean raster (model_clean.png)  ·  semantic class map (ground-truth mask)
Class legend
0 — background
1 — wall
2 — opening (door/window)
3 — room
4 — icon (furniture/fixtures)
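The legend maps directly onto a small enum plus a lookup palette for visualizing masks. The sketch below is illustrative only: the names `FloorplanClass`, `PALETTE`, and `colorize`, and the colors themselves, are assumptions rather than identifiers from the project.

```python
from enum import IntEnum

import numpy as np


class FloorplanClass(IntEnum):
    BACKGROUND = 0
    WALL = 1
    OPENING = 2  # door or window
    ROOM = 3
    ICON = 4     # furniture or fixtures


# Illustrative RGB palette (an assumption), indexed by class id.
PALETTE = np.array(
    [[255, 255, 255], [0, 0, 0], [220, 50, 50], [180, 210, 240], [60, 160, 90]],
    dtype=np.uint8,
)


def colorize(mask):
    """Map an (H, W) array of class ids to an (H, W, 3) RGB image."""
    mask = np.asarray(mask)
    if mask.min() < 0 or mask.max() >= len(PALETTE):
        raise ValueError("mask contains an unknown class id")
    return PALETTE[mask]  # fancy indexing looks up one color per pixel


preview = colorize([[FloorplanClass.BACKGROUND, FloorplanClass.WALL],
                    [FloorplanClass.ROOM, FloorplanClass.ICON]])
```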

Data Augmentation

Sketch-style augmentation (spec_v004) is applied offline before training. Each spatial transform is applied identically to the input image and all mask files to preserve pixel-level alignment. Pixel-level variations (blur, brightness) are applied to the image only, never to the semantic masks: altering mask values would corrupt the class labels, most visibly at class boundaries.
The goal is twofold: (1) make the model robust to the variation seen in real, hand-drawn, or inconsistently-scanned floorplans; (2) multiply the effective training set size from a limited corpus of clean, aligned pairs.
Augmentation pipeline (verified from source)

horizontal flip (50% prob)   — spatial symmetry; floor plans are equally valid mirrored
vertical flip (50% prob)     — same rationale as horizontal
90° rotation (k ∈ {0,1,2,3}) — plans presented at any cardinal orientation
translation ±10 px         — slight positional shift simulating imprecise scans
Gaussian blur r ∈ [0.3, 1.0] — simulates scan softness, low-resolution input
brightness × [0.85, 1.15]   — exposure variation in scanned/photographed drawings
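A minimal sketch of this pairing rule follows, using the parameter ranges listed above. It is not the project's spec_v004 code: the function names, the white/background padding for translation, and the use of Pillow for blur and brightness are assumptions. It also assumes square inputs, since a 90° rotation would otherwise change the array shape.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter


def shift(arr, dx, dy, fill):
    """Translate a 2-D or 3-D array by (dx, dy) pixels, padding with `fill`."""
    out = np.full_like(arr, fill)
    h, w = arr.shape[:2]
    src = arr[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0], max(0, dx):max(0, dx) + src.shape[1]] = src
    return out


def augment_pair(image, mask, rng):
    """Apply one random augmentation to an (H, W, 3) uint8 image and its (H, W) mask."""
    # Spatial transforms: identical parameters for image and mask.
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        image, mask = image[::-1], mask[::-1]         # vertical flip
    k = int(rng.integers(0, 4))                       # number of 90 degree turns
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    dx, dy = (int(v) for v in rng.integers(-10, 11, size=2))
    image = shift(image, dx, dy, fill=255)            # pad with white paper
    mask = shift(mask, dx, dy, fill=0)                # pad with background

    # Pixel-level transforms: image only, the mask is left untouched.
    pil = Image.fromarray(np.ascontiguousarray(image))
    pil = pil.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.3, 1.0)))
    pil = ImageEnhance.Brightness(pil).enhance(rng.uniform(0.85, 1.15))
    return np.asarray(pil), np.ascontiguousarray(mask)


rng = np.random.default_rng(42)
demo_image = np.full((64, 64, 3), 255, dtype=np.uint8)
demo_mask = np.zeros((64, 64), dtype=np.uint8)
demo_image[20:40, 20:40] = 0
demo_mask[20:40, 20:40] = 1
aug_image, aug_mask = augment_pair(demo_image, demo_mask, rng)
```

Flips, rotations, and shifts only move pixels, so the mask keeps its exact class ids; no interpolation ever touches the labels.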
Figures: original  ·  horizontal flip  ·  vertical flip  ·  90° rotation  ·  Gaussian blur (r = 2.5)  ·  brightness × 0.78

Model Architecture

SegFormer backbone (frozen)

SegFormer is a transformer-based semantic segmentation architecture. Its encoder is a hierarchical (multi-scale) Mix Transformer (MiT) that produces feature maps at four resolutions — 1/4, 1/8, 1/16, and 1/32 of the input. Rather than fixed positional encodings, it uses overlapping patch embeddings and Mix-FFN layers, which gives it robustness to varying input resolutions. This matters for floorplans, which come in a wide range of pixel dimensions.
This project uses SegFormer-B0 (nvidia/mit-b0), the lightest variant, with encoder stage channel widths of [32, 64, 160, 256]. The backbone is loaded with ImageNet-pretrained weights and then fully frozen, so no gradient flows through it during training.
Frozen-backbone transfer learning. Because the encoder never changes, its features are extracted once per image and cached to disk; the encoder forward pass is not invoked during training epochs, and only the custom decoder head is optimized. This strategy fits the setting because the supervised dataset is small relative to the model capacity: fine-tuning the full encoder would risk overfitting, while the cached-feature approach reduces per-epoch compute to the cost of the decoder alone.
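The freeze-then-cache pattern might look roughly like the sketch below. The helper names and the one-file-per-sample layout are assumptions, not the project's actual code. The `encoder` argument is any module that returns a sequence of feature maps; in this project it would presumably wrap the Hugging Face SegFormer encoder (for example `SegformerModel.from_pretrained("nvidia/mit-b0")` called with `output_hidden_states=True`), which is not loaded here so the sketch stays offline.

```python
from pathlib import Path

import torch
from torch import nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so the encoder stays fixed."""
    for param in module.parameters():
        param.requires_grad_(False)
    return module.eval()


@torch.no_grad()
def cache_features(encoder, images, cache_dir):
    """Run the frozen encoder once per image and save its feature maps.

    `images` maps a sample id to a [3, H, W] tensor (hypothetical layout).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for sample_id, image in images.items():
        path = cache_dir / f"{sample_id}.pt"
        if path.exists():  # already cached by an earlier run
            continue
        features = encoder(image.unsqueeze(0))
        torch.save([f.squeeze(0).cpu() for f in features], path)


def load_features(cache_dir, sample_id):
    """Load the cached feature maps for one sample."""
    return torch.load(Path(cache_dir) / f"{sample_id}.pt")
```

During training, a dataset would then return `load_features(...)` together with the mask, so each epoch only pays for the decoder.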

FloorplanDecoder (trainable)

The decoder receives the four frozen multi-scale feature maps, fuses them into a single representation at H/4 × W/4 resolution, refines with two convolutional hidden layers, and upsamples to the full 512 × 512 output. All parameters are trained from scratch.
FloorplanDecoder — layer-by-layer (verified from models.py)

IN    Backbone hidden states: 4 tensors, ch widths [32, 64, 160, 256] (B0)
     Projection: 1×1 Conv per stage → 256 ch each
     Upsample all to stage-0 res (H/4 × W/4), element-wise sum
     Conv 3×3 → 256 ch · BatchNorm2d · GELU · Dropout2d(0.1)  hidden 1
     Conv 3×3 → 128 ch · BatchNorm2d · GELU · Dropout2d(0.1)  hidden 2
     1×1 Conv → 5 classes (logits)
OUT  Bilinear upsample → [B, 5, 512, 512]
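For readers who prefer code to a layer list, here is a hedged PyTorch reconstruction of the table above. It is not copied from models.py: the class name, `padding=1` on the 3×3 convolutions, and bilinear interpolation for the fusion step are assumptions, since the list does not state them.

```python
import torch
import torch.nn.functional as F
from torch import nn


class FloorplanDecoderSketch(nn.Module):
    """Illustrative re-creation of the listed layers, not the project's class."""

    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256,
                 num_classes=5, out_size=(512, 512)):
        super().__init__()
        self.out_size = out_size
        # One 1x1 projection per backbone stage, all mapped to embed_dim channels.
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels])
        self.refine = nn.Sequential(
            nn.Conv2d(embed_dim, 256, 3, padding=1), nn.BatchNorm2d(256), nn.GELU(), nn.Dropout2d(0.1),
            nn.Conv2d(256, 128, 3, padding=1), nn.BatchNorm2d(128), nn.GELU(), nn.Dropout2d(0.1),
        )
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, features):
        target = features[0].shape[-2:]  # stage-0 resolution (H/4 x W/4)
        fused = sum(
            F.interpolate(proj(f), size=target, mode="bilinear", align_corners=False)
            for proj, f in zip(self.proj, features)
        )
        logits = self.classifier(self.refine(fused))
        return F.interpolate(logits, size=self.out_size, mode="bilinear", align_corners=False)


# Shapes for a 64 x 64 input, kept small so the example runs quickly.
decoder = FloorplanDecoderSketch(out_size=(64, 64)).eval()
dummy = [torch.rand(2, c, s, s) for c, s in zip((32, 64, 160, 256), (16, 8, 4, 2))]
with torch.no_grad():
    logits = decoder(dummy)
```

Summing the projected maps rather than concatenating them keeps the fused tensor at 256 channels regardless of how many stages are used.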

Training configuration (verified)

Variant: SegFormer-B0 (nvidia/mit-b0)
Input size: 512 × 512 px
Classes: 5 (background, wall, opening, room, icon)
Loss: CrossEntropy (use_dice: false)
Optimizer: AdamW
Scheduler: CosineAnnealingLR (T_max = epochs)
Learning rate: 6 × 10⁻⁵
Weight decay: 0.01
Batch size: 4
Epochs: 50
Mixed precision: enabled (torch.amp)
Metrics logged: loss and mIoU (per-class IoU available)
Checkpointing: best (by val mIoU) + latest saved each epoch
Seed: 42

Results — Segmentation Output

Preview images are saved every 5 epochs. The examples below are from epoch 50 (end of training). Each row shows four panels for one sample: the input raster, the model's predicted mask, a color overlay of the prediction on the input, and the ground-truth target mask.
Quantitative mIoU: [TODO: insert final validation mIoU from checkpoint metadata]
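For reference, mIoU is usually derived from a confusion matrix accumulated over the validation set. The sketch below shows one common formulation. How the project handles classes that never appear (skipped here via NaN) is an assumption, and the function names are illustrative.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes=5):
    """Count pixels per (target, prediction) pair; rows are ground truth."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    index = target * num_classes + pred
    return np.bincount(index, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def iou_per_class(cm):
    """IoU = TP / (TP + FP + FN); NaN for classes absent from both arrays."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)


def mean_iou(cm):
    """Average IoU over the classes that actually occur."""
    return float(np.nanmean(iou_per_class(cm)))


cm = confusion_matrix(pred=[[0, 1], [1, 1]], target=[[0, 1], [0, 1]])
print(iou_per_class(cm)[:2], mean_iou(cm))
```

In the toy example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean over the two present classes is 7/12.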
Sample 03, epoch 50: input  ·  prediction  ·  overlay  ·  target
Sample 04, epoch 50: input  ·  prediction  ·  overlay  ·  target
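The overlay panel can be produced with a simple alpha blend of class colors over the input. This is a sketch under assumptions: the function name, the 0.5 blend factor, and leaving background pixels unblended are illustrative choices, not details taken from the project.

```python
import numpy as np


def overlay(image, mask, palette, alpha=0.5):
    """Blend class colors over an RGB image; background pixels stay untouched.

    image: (H, W, 3) uint8, mask: (H, W) class ids, palette: (num_classes, 3) RGB.
    """
    image = np.asarray(image, dtype=np.float32)
    mask = np.asarray(mask)
    colors = np.asarray(palette, dtype=np.float32)[mask]  # per-pixel class color
    blended = (1.0 - alpha) * image + alpha * colors
    out = np.where((mask > 0)[..., None], blended, image)
    return out.round().clip(0, 255).astype(np.uint8)


gray = np.full((4, 4, 3), 200, dtype=np.uint8)
labels = np.zeros((4, 4), dtype=np.int64)
labels[1, 1] = 1  # a single wall pixel
toy_palette = [[255, 255, 255], [0, 0, 0], [220, 50, 50], [180, 210, 240], [60, 160, 90]]
panel = overlay(gray, labels, toy_palette)
```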

Roadmap

The pipeline is complete through semantic segmentation and evaluation. The remaining two stages convert pixel-level masks into structured, classified vector geometry — the "to CAD" half of the original goal.
Stage 7 will extract contours from the predicted masks (a raster step, for example with OpenCV), then use Shapely to simplify the resulting polylines and snap geometry to an orthogonal grid. Stage 8 will produce a structured JSON file classifying each element by type: walls with centerlines and thickness, openings attached to host walls, and room polygons with inferred type labels. This output is the structured, machine-readable form that a downstream CAD or BIM tool could consume directly.

Technical Stack

Language: Python 3.11
Deep learning: PyTorch (mixed-precision training via torch.amp)
Model: Hugging Face Transformers, SegFormer (nvidia/mit-b0)
Raster processing: OpenCV, Pillow
Vector geometry: Shapely (planned, stages 7–8)
Testing / lint: pytest  ·  ruff (format + check)
Environment: conda env floorplan-cad
Workflow: Spec-driven development with versioned specs in /specs, one spec at a time, feature branches, commit only after test+lint pass