Daegeun Kim

Street-Block Urbanity Prediction

Github: Linear Regression & Logistic Regression
Github: Graph Neural Network

Linear Regression
Logistic Regression
Graph Neural Network

Tools:
QGIS
PostgreSQL Database
PyTorch

Abstract

This project presents a quantitative geometric analysis of street blocks across the top ten U.S. metropolitan areas by GDP: New York, Los Angeles, Chicago, Dallas, Washington D.C., San Francisco Bay Area, Boston, Houston, Atlanta, and Seattle. Focusing purely on physical geometry, four shape-based metrics — Area, Isoperimetric Quotient, Rectangularity, and Solidity — were derived from each block to characterize its form and compactness. An exponential regression model was developed to evaluate which of these four geometric indicators best predicts population density, and to assess the independent usefulness of each metric as a predictor. In parallel, a logistic regression model was trained to classify the urbanity of each block using the dataset's original UR20 label, which defines urban versus non-urban areas based on criteria such as population concentration, adjacency of developed blocks, and built environment continuity. Model reliability and generalization were validated through 10-fold cross-validation, ensuring robustness against sample bias. This study aims to isolate geometric form as an explanatory variable in urban analysis, demonstrating that the morphology of street blocks alone can provide meaningful insight into patterns of density and urbanization across major U.S. cities.

Data Sources

Original Data Source: United States Census Bureau
Study Scope: Top 10 U.S. Metropolitan Areas by GDP
https://www2.census.gov/geo/tiger/TIGER2025/TABBLOCK20/
Spatial Extent: Blocks within a 30-mile radius from each city center

Workflow

QGIS: Loaded shapefiles, clipped 30 mi radius, calculated four geometric features.
PostgreSQL/PostGIS: Stored processed data and linked to Python environment.
Python Notebook: Performed statistical and machine-learning analyses (linear/logistic regression, bootstrapping, 10-fold CV).

Projection

City-specific CRS applied to minimize geometric distortion.

New York City	EPSG: 2263	NAD83 / New York Long Island (ftUS)
Los Angeles	EPSG: 2229	NAD83 / California zone V (ftUS)
Dallas	EPSG: 2277	NAD83 / Texas North Central (ftUS)
Houston	EPSG: 32140	NAD83 / Texas South Central (ftUS)
Chicago	EPSG: 3435	NAD83 / Illinois East (ftUS)
San Francisco Bay Area	EPSG: 26910	NAD83 / UTM zone 10N (m)
Washington D.C.	EPSG: 26918	NAD83 / UTM zone 18N (m)
Boston	EPSG: 26986	NAD83 / Massachusetts Mainland (ftUS)
Atlanta	EPSG: 26916	NAD83 / Georgia West (ftUS)
Seattle	EPSG: 2285	NAD83 / Washington North (ftUS)

*All spatial analyses, metric derivations, and model inputs were computed within these localized projections to minimize distortion in area and perimeter calculations.

Isoperimetric Quotient

Definition:
The ratio of the curve area to the area of a circle with the same perimeter as the curve

Formula:
(4π × Area) / (Perimeter²)

Range: 0 – 1

Explanation:
The Isoperimetric Quotient (IQ) is a numerical measure of how compact a shape is. A perfect circle has an IQ of 1.0, representing the most compact possible shape for a given perimeter. As a shape becomes more elongated, irregular, or fragmented, its perimeter increases relative to its area, causing the IQ value to decrease. In urban geometry analysis, a higher IQ indicates more efficient, compact street blocks, while lower values suggest irregular or stretched urban forms.

Rectangularity

Definition:
The ratio between the area of the shape and the area of its minimum bounding rectangle (MBR)

Formula:
Area / Smallest Fit Rectangle

Range: 0 – 1

Explanation:
A value of 1.0 indicates a perfect rectangle, while lower values signify shapes that deviate from rectangular form, such as irregular, curved, or indented geometries. A triangle has exactly 0.5 rectangularity. In urban geometry analysis, higher rectangularity values correspond to well-aligned grid blocks, while lower values often reflect organic or historical street patterns with more complex boundaries.

Solidity

Definition:
The ratio between the area of the shape and the area of its convex hull

Formula:
Area / Convex Hull Area

Range: 0 – 1

Explanation:
A value of 1.0 indicates that the shape is perfectly convex, meaning all of its boundary points lie on or outside a convex hull with no inward curves. As the value decreases, the shape becomes increasingly concave or fragmented, showing deeper indentations, gaps, or irregular protrusions along its perimeter. In the context of urban geometry, high solidity values correspond to compact, uniform blocks that efficiently occupy space within their boundaries.

Overview: Top 10 U.S. Cities

Top 10 Cities Population Density Heatmap — by GDP

New York City
GDP: 2,298,868 M
population: 19,940,274

Los Angeles
GDP: 1,295,361 M
population: 12,927,614

Chicago
GDP: 894,862 M
population: 9,408,576

Bay Area
GDP: 778,878 M
population: 4,648,486

Dallas
GDP: 744,654 M
population: 8,344,032

Washington DC
GDP: 714,685 M
population: 6,436,489

Houston
GDP: 696,999 M
population: 7,796,182

Boston
GDP: 610,486 M
population: 5,025,517

Atlanta
GDP: 570,663 M
population: 6,411,149

Seattle
GDP: 566,742 M
population: 4,145,494

0 people/sq mi 50,000+ people/sq mi

Area

Isoperimetric Quotient

Rectangularity

Solidity

New York City

Los Angeles

Chicago

Bay Area

Dallas

Washington DC

Houston

Boston

Atlanta

Seattle

010+ sq mi

sample streetblocks

Linear Regression: Geometry as Predictor of Population

Log-transformed linear regression was applied to evaluate whether each of the four geometric features is a meaningful predictor of population density. The analysis was conducted with all 10 cities combined, totaling 746,049 street blocks.

The model was fitted using a logarithmic transformation on the dependent variable, allowing linear regression to capture the exponential relationship between block geometry and density. Scikit-learn's Linear Regression (Ordinary Least Squares, no regularization) was used.

population density = e^{(β₀ + β₁ · log(area))}

population density = e^{(β₀ + β₁ · feature)} (Isoperimetric Quotient, Rectangularity, Solidity)

Model applied to each feature independently:

Area
MSE = 0.889 · R² = 0.436 (Moderate)

Isoperimetric Quotient
MSE = 1.499 · R² = 0.048 (Very Weak)

Rectangularity
MSE = 1.382 · R² = 0.123 (Weak)

Solidity
MSE = 1.419 · R² = 0.099 (Weak)

Combined Model

population density = e^{(β₀ + β₁·log(area) + β₂·IQ + β₃·Rectangularity + β₄·Solidity)}

Area Coefficient	−631.634
Isoperimetric Quotient Coefficient	−59.216
Rectangularity Coefficient	163.857
Solidity Coefficient	−91.977
Mean Squared Error	0.873
R²	0.446

Geometric variables alone explain approximately 44.6% of the variance in log-transformed population density (R² = 0.446). While this is limited relative to the full complexity of urban density — shaped by socioeconomic, infrastructural, and historical factors — it confirms that block geometry captures a statistically meaningful portion of urban form variation correlated with density.

The p-values for all four variables are near zero, confirming that the observed relationships are statistically significant — a finding made reliable by the large sample size of 746,049 street blocks.

Logistic Regression: Geometry as Classifier of Urbanity

The original dataset includes an urbanity indicator (UR20) that classifies each street block as Urban or Rural. This classification, defined by the U.S. Census Bureau, is based on factors such as population density and spatial continuity — not on any geometric attributes. The objective of this analysis is to evaluate whether the four geometric features (Area, Isoperimetric Quotient, Rectangularity, and Solidity) can effectively predict this label.

A logistic regression model was trained on data from all ten metropolitan areas, using bootstrapping and 10-fold cross-validation to ensure robustness and generalization. The trained model was subsequently tested on an unseen city (Philadelphia) to evaluate transferability.

log P(UR20 = 1)1 − P(UR20 = 1) = β₀ + β₁·area + β₂·IQ + β₃·rectangularity + β₄·solidity

Intercept (β₀)	3.940
Area (β₁)	−1.023
Isoperimetric Quotient (β₂)	−0.028
Rectangularity (β₃)	−0.330
Solidity (β₄)	0.988

Error Rate (10 training cities)	0.0299 (97.01% accuracy)
Error Rate (Philadelphia, unseen)	0.0626 (93.74% accuracy)

The model achieved 97.01% accuracy across the ten training cities and 93.74% accuracy on the unseen city of Philadelphia. These results demonstrate that street-block geometry alone provides a strong predictive basis for distinguishing urban from rural classifications, and that the learned relationships generalize well beyond the training data.

Philadelphia: Predicted Urban vs. Rural Blocks

Philadelphia: Urban vs. Rural Blocks (Ground Truth)

Graph Neural Network

Abstract

This project extends the Street-Block Machine Learning study under the same fundamental premise: how much urban structure can be inferred solely from street-block geometry, independent of socioeconomic or land-use data. Using street-block data from the United States Census Bureau, the study focuses on the top 10 U.S. metropolitan areas by GDP. The ground-truth label is the UR20 attribute, framing the task as a binary classification problem (urban vs. rural).

While the previous ML approach relied on four handcrafted geometric features, this project asks a more fundamental question: can a model learn urbanity directly from raw geometry? Street blocks are represented as graphs derived from polygon topology, and a Graph Neural Network (GNN) is trained to classify urban versus rural blocks based purely on learned geometric representations. By replacing feature engineering with end-to-end geometric learning, the project evaluates the expressive limits of deep learning for urban form inference and clarifies what information is intrinsically embedded in street-block geometry alone.

Workflow

QGIS: Loaded shapefiles, clipped to a 30-mile radius per city, calculated base geometric attributes, and exported processed layers.
PostgreSQL/PostGIS: Stored city-wise processed street-block geometries and UR20 labels, enabling consistent querying and reproducible splits.
Python + PyTorch Geometric: Loaded all 10 cities, converted blocks into graph representations, trained a GNN on 9 cities with Houston as a fixed validation set, and tested generalization on an unseen city: Philadelphia.

Street Block to Graph

Street blocks can be modeled either as rasterized images or as graphs. An image representation forces a grid-based approximation and often loses precise corner-to-corner structure, while a graph preserves the true topology of the polygon — capturing explicit relationships between corners (nodes) and edges (connections).

Each street block is converted into a planar graph where polygon vertices become nodes and consecutive boundary segments become edges. Node features are stored in an N × 5 matrix, where each of the N vertices contains five attributes: (x, y) coordinates, turning angle θ at the vertex, and the lengths of the two adjacent boundary edges (l₁, l₂).

The conversion pipeline standardizes vertex ordering, removes duplicate and near-duplicate points, computes angles and edge lengths from the cleaned polygon boundary, builds the edge index from sequential vertex connectivity (including wrap-around from last to first vertex), and outputs a graph object: x ∈ ℝ^N×5 and edge_index ∈ ℕ^2×E.

Image vs. Graph representation of a street block

Sample graph structure (x, y, θ, l₁, l₂)

Neural Network Iterations

The deep learning model is built around a GNN that learns hierarchical geometric representations of street-block graphs. Two architectural configurations were tested to evaluate the effect of model capacity on generalization.

Iteration 1 — Node embeddings are projected through a graph convolution layer into a 64-dimensional latent space, followed by global pooling and a multilayer perceptron (MLP) for binary classification:

Graph Conv → N × 64 → Global Pooling → MLP: 64 → 128 → 256 → 1

Iteration 2 — Model capacity is increased to test whether richer geometric abstractions improve generalization. The convolution output is expanded to 128 dimensions, and the MLP uses a deeper but uniform architecture:

Graph Conv → N × 128 → Global Pooling → MLP: 128 → 128 → 128 → 128 → 1

Across both iterations, the network is trained end-to-end using supervised learning with a binary classification objective. This iterative setup enables a direct comparison between compact and high-capacity architectures, clarifying the trade-off between expressiveness and overfitting when learning urban form purely from geometry.

Neural Network Architecture

Treating Imbalanced Data

The dataset exhibits severe class imbalance: urban blocks account for approximately 98% of all street blocks by count, while rural blocks represent a small minority. However, when evaluated by total area, the imbalance is less extreme — urban blocks cover roughly twice the area of rural blocks, reflecting the much larger spatial footprint of rural street blocks. Block count alone is therefore a misleading indicator of spatial dominance.

Synthetic oversampling methods such as SMOTE were intentionally not used. Street-block geometry is highly sensitive to small perturbations, and generating synthetic samples risks producing geometries that are topologically invalid or physically implausible — corrupting the learning signal in a graph-based model.

Instead, class imbalance is addressed through area-based loss weighting, so that misclassification penalties are proportional to the spatial footprint of each block rather than raw sample counts. This aligns the optimization objective with spatial reality, reduces bias toward over-predicting the majority urban class, and preserves the integrity of original geometries. By decoupling data representation from error weighting, the model maintains consistent graph inputs while correcting imbalance at the learning stage — enabling more meaningful generalization to unseen cities.

Conclusion

The deep learning approach achieves 94.55% accuracy, only marginally improving over the 93.74% accuracy obtained from the logistic regression model. A closer examination of the confusion matrices reveals where this difference originates. Logistic regression strongly favors the majority class, predicting urban blocks far more aggressively due to the overwhelming count imbalance. This results in very high urban recall but poor rural recognition, with many rural blocks incorrectly classified as urban.

Logistic Regression Confusion Matrix

Neural Network (GNN) Confusion Matrix

The GNN demonstrates a more balanced error distribution. While overall accuracy gains are modest, the network improves both precision and recall for the minority rural class — reducing false urban predictions and yielding a cleaner separation between urban and rural blocks. This indicates that the GNN is better at extracting subtle geometric signals that distinguish large rural blocks from dense urban fabrics, even under severe class imbalance.

Despite this improvement, the results expose a clear limitation. Across 50 training epochs and both architectural iterations, the loss and error rate plateau after approximately 20 epochs without further meaningful improvement. This suggests that street-block geometry alone has bounded explanatory power for urban classification. While the GNN is more expressive than logistic regression, it is also less efficient and more computationally expensive, delivering only incremental gains. The outcome implies that deep learning does not fundamentally overcome the information ceiling imposed by geometry-only inputs — rather, it slightly refines classification within those limits. This reinforces the conclusion that urbanity is only partially encoded in block shape, and that richer contextual or semantic data would be required for significant performance gains beyond this threshold.

Ambiguous street block geometric patterns: Suburb New Jersey

Ambiguous street block geometric patterns: Bay Area