Building a Real-World Benchmark for 3D Reconstruction

Stack: PyTorch, Python · GitHub link · For a condensed version of this post see the README.

The Problem
Workflow Overview
Key Design Decisions & Techniques
Results & Reflections
Explicit vs Implicit 3D Representations: A Practitioner’s Perspective
From Research to Production: What I’d Do Differently on Day One

The Problem

Most benchmarks for Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) use curated datasets. They’re either synthetic scenes or captures made with controlled lighting and professional-grade multi-camera rigs. The results look impressive, but they don’t answer the real-world deployment question: how do these methods perform with imperfect data, consumer hardware, and a custom pipeline you built yourself?

I didn’t set out to find out whether NeRF or Gaussian Splatting is “better.” I set out to understand what it takes to go from raw photographs to 3D reconstruction. I wanted to scrutinize every stage, decision, and failure. Real captures have exposure inconsistencies, clipped data, motion in the scene, sensor resolution limits, and challenging surfaces. Pipelines have dependency conflicts, undocumented build failures, and tools that crash on your hardware. This project works under those conditions to answer:

How do NeRF and 3DGS compare when the input data is imperfect?
Can you build a rigorous evaluation pipeline that works on all datasets?
What are the practical trade-offs you face when going from captures to usable 3D?

Using radiance field methods on real-world data involves a chain of problems that compound. Capture strategy directly affects Structure-from-Motion (SfM) quality. That in turn determines the camera poses that NeRF and 3DGS train on. Sensor limitations (dynamic range, sensor size and resolution) restrict what any method can reconstruct. The toolchain itself may not build on your hardware.

Each of these stages has its own failure modes, and weaknesses propagate downstream. To figure out whether a problem lives in the data, the method, or the evaluation, you can’t look only at the output.

3DGS at 30k steps

NeRF at 30k steps

[VISUAL: Rendered held-out evaluations of nerfacto and splatfacto (Ground truth on the left)]

Workflow Overview

Capture images of an indoor space using a digital stills camera
Post-process the images with a custom tool for consistent exposure, color, and sharpness
Process the images using COLMAP to produce a sparse 3D point cloud
Train separate NeRF & Gaussian Splat (3DGS) models using nerfstudio to generate 3D reconstructions
Render the NeRF & 3DGS models via the training cameras
Export Poisson meshes from the NeRF & COLMAP point clouds
Evaluate & compare the results across multiple metrics with custom benchmarking tools

Key Design Decisions & Techniques

Manual Captures

I used a Fujifilm X-series mirrorless camera with a 16-55mm f/2.8 zoom lens to capture ~100 images of a single indoor scene. I chose manual stills over video extraction or turnkey solutions like RealityScan or SplatKing. I selected this camera because its color science produces accurate, consistent renditions. It also supports exposure bracketing for HDR captures. The zoom lens provides sharp images across a range of useful focal lengths for interior spaces.

Why stills over video? Video frames are convenient but come with potential challenges. Movement while recording introduces motion blur and rolling shutter artifacts. Storage and device constraints can result in lower per-frame resolution. Stills on the other hand give you full control over exposure, focus, and white balance per shot. This directly impacts SfM feature matching and downstream reconstruction quality.

Why manual over automated? The goal was to understand the full pipeline, not abstract it away. I planned out capture positions and overlap by hand. This helped develop intuition for what makes a good capture set. I focused on having enough baseline between views and consistent lighting. Coverage of textureless regions and specular surfaces was secondary in attention, but not in importance. They matter when designing capture protocols for production systems.

I captured the images by following a circular path through the room. I used three heights and an inward-facing orientation. This pattern provides good baseline coverage and loop closure for SfM. However, it introduces specific challenges that show up in the results. The hardwood floor has fine grain detail that exceeds what the camera sensor resolves at capture distance. The ceiling is visible, but the skylight is blown out (white pixels). This is due to the dynamic range limitations of a single-exposure capture. Unfortunately, the models can’t reconstruct detail that doesn’t exist in the input pixels.
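To make the capture pattern concrete, here is a minimal sketch of ring-shaped camera placement with inward-facing views. The radius, heights, and view count are hypothetical numbers chosen only to land near 100 images; they are not measurements from the actual capture, and the function name is illustrative.

```python
import numpy as np

def ring_capture_positions(radius, heights, views_per_ring, target):
    """Return (positions, forward) for cameras on horizontal rings facing `target`."""
    target = np.asarray(target, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, views_per_ring, endpoint=False)
    positions = np.array([
        [target[0] + radius * np.cos(a), target[1] + radius * np.sin(a), h]
        for h in heights
        for a in angles
    ])
    # Inward-facing: each viewing direction is the unit vector toward the target.
    forward = target - positions
    forward /= np.linalg.norm(forward, axis=1, keepdims=True)
    return positions, forward

# Hypothetical values: 2 m ring radius, three heights, 33 views per ring.
positions, forward = ring_capture_positions(
    2.0, [0.8, 1.4, 2.0], 33, target=[0.0, 0.0, 1.4]
)
# Baseline between two neighbouring views on the same ring (a chord of the circle).
baseline = np.linalg.norm(positions[1] - positions[0])
print(f"{len(positions)} views, neighbour baseline {baseline:.2f} m")
```

Printing the neighbour baseline for a planned ring is a quick way to check that adjacent views are neither too close (weak triangulation) nor too far apart (poor feature overlap) before shooting.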

I post-processed each image using a custom image-match tool. The tool normalized exposure, color balance, and sharpness across the set. This was partly a mitigation for the dynamic range problem. The images facing the skylight were significantly brighter than those facing interior walls. Evening out exposure and color across the images improves SfM feature matching consistency. It gives the models a more uniform training signal. But it’s not a substitute for true HDR capture. You can’t recover clipped highlights in post. Input data quality is a first-class concern (garbage in, garbage out!). You have to do what you can to reduce the damage from exposure variations.
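The idea of evening out exposure across a set can be sketched as a per-image gain that matches each image's mean luminance to the set median. This is an illustration only, not the image-match implementation: the function names, the Rec. 709 luma weights, and the clip level are assumptions, and it expects float RGB arrays in [0, 1].

```python
import numpy as np

def luminance(img):
    """Rec. 709 luma of a float RGB image in [0, 1], shape (H, W, 3)."""
    return img @ np.array([0.2126, 0.7152, 0.0722])

def match_exposure(images, clip_level=0.99):
    """Scale each image so its mean unclipped luminance matches the set median."""
    means = []
    for img in images:
        luma = luminance(img)
        valid = luma < clip_level  # ignore blown-out pixels when measuring
        means.append(luma[valid].mean() if valid.any() else luma.mean())
    target = float(np.median(means))
    normalized = []
    for img, mean in zip(images, means):
        gain = target / max(mean, 1e-6)
        # Clipped highlights stay featureless: a gain only rescales them.
        normalized.append(np.clip(img * gain, 0.0, 1.0))
    return normalized
```

A gain like this reduces brightness differences between skylight-facing and wall-facing images, but a pixel that was recorded as pure white carries no texture to rescale, which is why this kind of step cannot replace bracketed capture.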

Nerfstudio as the Unified Platform

I used nerfstudio to train both the nerfacto (NeRF) and splatfacto (3DGS) models. Using a single framework for both was a deliberate choice to reduce variables. The two methods share the same data loader, train/eval split, coordinate system, and export pipeline.

The more important decision was what to change from the defaults, which were too conservative and under-represented detail. I immediately adjusted the train/eval split ratio to ensure the held-out set covered the full camera trajectory. Nerfstudio exposes a huge number of parameters, which makes methodical tuning possible. I experimented with nerfacto’s hash grid resolution and number of levels, both of which are crucial for improving detail. For splatfacto, the key parameters were the culling and densification thresholds and the learning rate.
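One way to get a held-out set that covers the whole camera trajectory is to hold out every N-th image in capture order. The sketch below shows that idea in plain Python; the function name and the interval of 8 are assumptions, since the post does not state the split settings that were actually used.

```python
def interval_split(num_images, eval_every):
    """Hold out every `eval_every`-th image (in capture order) for evaluation."""
    eval_idx = [i for i in range(num_images) if i % eval_every == 0]
    train_idx = [i for i in range(num_images) if i % eval_every != 0]
    return train_idx, eval_idx

# Hypothetical example: ~100 captures, every 8th image held out.
train_idx, eval_idx = interval_split(100, 8)
print(len(train_idx), "train /", len(eval_idx), "eval")
```

Because the held-out indices are evenly spaced, no stretch of the circular path is evaluated only through interpolation from distant views.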

The framework also provides infrastructure that made the evaluation pipeline possible:

ns-viewer — for real-time training visualization
Tensorboard/W&B integration — for loss curves and training dynamics
ns-export — for extracting point clouds and Poisson meshes
ns-eval — for held-out set evaluation
Scriptability — the entire training pipeline is CLI-driven. Adding GPU profiling and structured logging becomes a scripting task.

These helped me to run experiments and develop a feel for how each method responds to its key parameters. The training scripts I used to facilitate this are available here.

External Evaluation

Nerfstudio’s built-in ns-eval computes PSNR, SSIM, and LPIPS on the held-out test set. This is standard practice, but it only tells half the story. A model that scores well on training views but poorly on held-out images has a different problem than one that scores poorly on both.

I built recon-bench, a custom benchmarking tool, to fill the gaps:

Training set evaluation: Compute image quality metrics (PSNR, SSIM, LPIPS) on training views. Then compare against held-out (eval set) performance. A large gap between training and test metrics is a strong overfitting signal.
Geometric evaluation: Using COLMAP’s dense reconstruction as a reference point cloud, I computed Chamfer distance, Hausdorff distance, and F-score against the NeRF and 3DGS exported point clouds. These metrics quantify geometric accuracy independent of rendering quality.
Unified reporting: All metrics (image quality, geometric, timing, memory) are collected into structured reports. It’s straightforward to see the full picture.
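The training-versus-held-out comparison can be sketched with PSNR alone. This is a minimal illustration, not recon-bench's code: the function names are hypothetical, and images are assumed to be float arrays scaled to [0, max_val].

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def train_eval_gap(train_scores, eval_scores):
    """Mean training score minus mean held-out score (positive = better on train)."""
    return float(np.mean(train_scores) - np.mean(eval_scores))
```

For a higher-is-better metric such as PSNR, a large positive gap means the model reproduces its training views much better than unseen ones, which is the overfitting signal described above.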

This separation of rendering quality from geometric accuracy is important. A method can produce photo-realistic novel views but with poor geometry, or vice versa. For applications that need accurate 3D models (not just pretty renders), geometric metrics are essential.
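The three geometric metrics can be written down compactly for small point clouds. Definitions vary between papers, so this sketch states its own assumptions: Chamfer is the sum of the two mean nearest-neighbour distances, Hausdorff is the larger of the two directed maxima, and F-score is the harmonic mean of precision and recall at a distance threshold, returned as a fraction in [0, 1]. It is a brute-force illustration, not recon-bench's implementation; real clouds with hundreds of thousands of points need a spatial index such as a k-d tree.

```python
import numpy as np

def nn_distances(src, dst):
    """Distance from each point in `src` to its nearest neighbour in `dst`."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def geometric_metrics(pred, ref, threshold):
    """Chamfer, Hausdorff, and F-score between predicted and reference clouds."""
    d_pred = nn_distances(pred, ref)  # accuracy: how far predictions are from the reference
    d_ref = nn_distances(ref, pred)   # completeness: how well the reference is covered
    precision = float((d_pred < threshold).mean())
    recall = float((d_ref < threshold).mean())
    if precision + recall == 0:
        fscore = 0.0
    else:
        fscore = 2 * precision * recall / (precision + recall)
    return {
        "chamfer": float(d_pred.mean() + d_ref.mean()),
        "hausdorff": float(max(d_pred.max(), d_ref.max())),
        "fscore": fscore,
    }
```

Chamfer averages over all points, Hausdorff is driven by the single worst outlier, and F-score depends on the chosen threshold, which is why the three can rank methods differently and are worth reporting together.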

Building Everything from Source

The entire software stack (COLMAP, nerfstudio, Open3D, tiny-cuda-nn) was installed and compiled from source. This wasn’t by preference. It was by necessity.

The compute platform was an NVIDIA GB10 (DGX Spark). It runs an ARM64 Grace CPU paired with a Blackwell GPU using CUDA 13.x. This combination is bleeding-edge. At the time of this project, no pre-built wheels existed for PyTorch + CUDA 13 on ARM64. Every dependency in the chain needed to be built against this specific toolchain.

What this required:

Resolving Application Binary Interface (ABI) compatibility across COLMAP (C++), PyTorch (Python/C++), tiny-cuda-nn (CUDA), and Open3D (C++/Python)
Debugging build failures from libraries that assumed x86_64 or older CUDA versions
Managing CUDA compute capability flags for the Blackwell architecture

This was not a quick detour. Open3D took 1–2 days of compilation work. COLMAP swallowed another 2–3 days. Two episodes in particular illustrate the kind of debugging the platform demanded.

War story 1 — Open3D and the impossible NVCC flag

After the usual setup tax (Python interpreter hints for CMake, a broken /usr/bin/nvcc symlink, exporting torch.utils.cmake_prefix_path so FindPytorch.cmake could locate Torch) there was a single line in the configure output that shouldn’t have been possible:

-- Added CUDA NVCC flags for: -gencode;arch=compute_20,code=sm_121

sm_121 is Blackwell. compute_20 is Fermi (a virtual architecture from 2010). These two cannot coexist in a valid gencode pair. Something in the build was fabricating an architecture tuple out of mismatched parts.

Following my first instinct to export TORCH_CUDA_ARCH_LIST="12.1" changed nothing. That was a useful signal. The bad value was not being read from the environment; it was being derived upstream and injected into PyTorch’s flag generator. So the question shifted from “what do I set?” to “what is writing this value, and when?”

Reading the CMake output made the handoff clear:

Open3D’s top-level CMake sees CMAKE_CUDA_ARCHITECTURES. Then, under its native/auto path, derives a TORCH_CUDA_ARCH_LIST from it.
PyTorch’s Caffe2 CMake then dismisses CMAKE_CUDA_ARCHITECTURES: "pytorch is not compatible… will ignore its value" and sets it OFF.
But Open3D’s derived value has already been handed across the boundary, and PyTorch’s torch_cuda_get_nvcc_gencode_flag(...) emits the malformed pair.

The fix followed from the diagnosis: bypass Open3D’s auto path by clearing the cache and pinning both the CMake and PyTorch architecture variables explicitly:

rm -f CMakeCache.txt
cmake -S .. -B . \
  -DBUILD_CUDA_MODULE=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DTORCH_CUDA_ARCH_LIST=12.1

Result: -- Added CUDA NVCC flags for: -gencode;arch=compute_121,code=sm_121. Clean pair, correct architecture.
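A check like this can be automated by scanning the configure output for gencode pairs whose virtual (compute_XX) and real (sm_XX) numbers differ. The helper below is a hypothetical sketch, not part of Open3D or PyTorch. It is also only a heuristic: some differing pairs are legitimate, so a hit means "look at this line", not "this is wrong".

```python
import re

# Matches e.g. "arch=compute_20,code=sm_121" inside a longer log line.
GENCODE_RE = re.compile(r"arch=compute_(\d+)[a-z]?,code=(?:sm|compute)_(\d+)[a-z]?")

def mismatched_gencode_pairs(configure_output):
    """Return (virtual, real) architecture pairs whose numbers differ."""
    pairs = GENCODE_RE.findall(configure_output)
    return [(virtual, real) for virtual, real in pairs if virtual != real]

bad = "-- Added CUDA NVCC flags for: -gencode;arch=compute_20,code=sm_121"
good = "-- Added CUDA NVCC flags for: -gencode;arch=compute_121,code=sm_121"
print(mismatched_gencode_pairs(bad))
print(mismatched_gencode_pairs(good))
```

Running it over a saved configure log turns "one suspicious line among hundreds" into a list that can be reviewed before starting a multi-hour build.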

What remained was a long tail of smaller obstacles. It’s worth naming them because they illustrate how bleeding-edge builds fail.

A make run whose error scrolled off the terminal too quickly (resolved by make -j1 VERBOSE=1 2>&1 | tee build.log and grepping upward for error:, undefined reference, nvcc fatal, ld:)
An Open3D-ML submodule failing to clone because CMake had rewritten https:// to https:/ and git was falling back to SSH on port 22 (resolved by cloning manually and pointing the build at the local path)
The pip package build failing because yapf, wheel, and setuptools were missing from the build venv
uv add on the built wheel failing because requires-python = ">=3.12" opened splits for 3.13/3.14 on win32 that had no matching wheel (resolved by pinning to 3.12.*)

The Open3D build was less about individual errors and more about reading a layered CMake handoff as a causal chain instead of a pile of warnings.

War story 2 — COLMAP stereo fusion: zero fused points

COLMAP itself was a gentler build. The onnxruntime package shipped CMake targets pointing at /usr/local/lib64/libonnxruntime.so.1.24.1 while the actual library lived in lib/. I patched the target file and added a lib64 → lib symlink:

sed -i 's#/lib64/#/lib/#g' _deps/onnxruntime-build/share/onnxruntime/cmake/onnxruntimeTargets.cmake
(cd _deps/onnxruntime-build && ln -s lib lib64)

Qt6 needed the usual multi-package apt incantation. Sparse reconstruction then ran cleanly. The interesting failure came later, during dense reconstruction:

fusion.cc:331] Could not fuse any points. This is likely caused by incorrect settings —
               filtering must be enabled for the last call to patch match stereo.
fusion.cc:337] Number of fused points: 0

The error message is a red herring. It names one possible cause, not the actual one. I worked through the obvious suspects:

Dev branch instability — checked out and rebuilt tagged 3.13.0. No change.
Filtering flag — the message’s literal suggestion. Tried --PatchMatchStereo.filter true and 1. No change.
VRAM pressure from 4K images — capped with --PatchMatchStereo.max_image_size 2000. No change.
Two-pass patch match — ran photometric (geom_consistency 0) then geometric (geom_consistency 1) separately. No change.
Over-strict geometric consistency — relaxed filter_min_num_consistent 1, filter_geom_consistency_max_cost 3, write_consistency_graph 1. No change.
Bad camera poses upstream — plotted camera positions with a small plot_poses.py utility. Poses were fine.

Each attempt eliminated a hypothesis rather than fixing the problem. Frustrating, but still a kind of progress. At that point I stopped tweaking parameters and started inspecting artifacts. I built inspect_depth_map.py to render COLMAP’s binary depth maps as images. Then I used hexdump on the file headers to confirm the dimensions and data regions had values. Depth maps from the first patch-match pass: full of data. Normal maps and consistency graphs: also fine.
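For readers who want to do the same kind of inspection, here is a minimal reader for COLMAP's dense binary arrays. It is not the inspect_depth_map.py mentioned above. It assumes the layout used by COLMAP's reference Python script: an ASCII header of the form width&height&channels& followed by float32 values in column-major order. The function names are illustrative.

```python
import numpy as np

def read_colmap_array(path):
    """Read a COLMAP dense array: 'width&height&channels&' header + float32 data."""
    with open(path, "rb") as fid:
        header = b""
        while header.count(b"&") < 3:
            byte = fid.read(1)
            if not byte:
                raise ValueError("truncated header")
            header += byte
        width, height, channels = (int(v) for v in header.split(b"&")[:3])
        data = np.fromfile(fid, dtype=np.float32)
    if data.size != width * height * channels:
        raise ValueError("data size does not match header dimensions")
    # Stored column-major as (width, height, channels); return (height, width[, channels]).
    array = data.reshape((width, height, channels), order="F")
    return np.transpose(array, (1, 0, 2)).squeeze()

def depth_stats(depth):
    """Summarize how much of a depth map actually contains data."""
    valid = depth > 0
    return {
        "shape": depth.shape,
        "valid_fraction": float(valid.mean()),
        "max": float(depth.max()),
    }
```

A valid_fraction of exactly 0.0 across every depth map is the signature of the failure described here: structurally correct files that contain no depth values.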

Then I decomposed the orchestration script and ran each stage manually. The depth maps produced during the run that called fusion were all zeros. But the identical patch-match invocation, run standalone, produced valid depths! Same binary, same inputs, same flags, different outputs. That narrowed the cause to something in the pipeline’s runtime state rather than CLI parameters. With CUDA 13 running on a brand-new GPU architecture, that left one plausible suspect: the CUDA driver / Blackwell kernel path silently producing zero-valued writes.

I decided to stop debugging the symptom on the suspect hardware, and verify on a known-good platform. I rsynced the full project to a machine with an older RTX 3080 and started rebuilding.

That move introduced a detour of its own. The 3080 machine needed its NVIDIA drivers and toolkit bumped to 13.2 to match the CUDA toolchain. Upgrading broke the Linux desktop environment. Login would drop straight back to the greeter. Thankfully this is a commonly known failure: the open-source nouveau driver races the proprietary one at boot. I rebuilt the graphics stack from a TTY. I started by blacklisting nouveau via /etc/modprobe.d/. Then I forced the affected GUI apps into a no-GPU path so I could at least get a working session.

It wasn’t a COLMAP problem, but it’s the kind of collateral damage that’s routine when chasing driver compatibility. It’s a reminder that “switch hardware” is never as simple as it sounds.

With the 3080 box usable again, I rebuilt COLMAP from source and reran the exact scripts. Fusion completed without errors and the depth maps had data. I transferred the dense point cloud back to the Spark and resumed the Poisson meshing stage.

The COLMAP failure was not fixed by a flag. By refusing to trust any one layer, I isolated the one variable (hardware) that every previous attempt had held constant.

What this enabled:

A working pipeline on hardware that no one else had packaged for yet
Direct comparison of training performance with and without tiny-cuda-nn acceleration
Understanding the full dependency graph (valuable when debugging silent failures in reconstruction pipelines)

See [build notes] for full reproduction instructions.

Results & Reflections

[VISUAL: side-by-side interpolated flythrough of nerfacto vs splatfacto using ns-render interpolate along the circular training camera path]

What Was Achieved

A complete, reproducible pipeline from raw photos to quantitative multi-method comparison. Several aspects of this evaluation go beyond standard practice in the field:

Training-vs-held-out evaluation for both methods. Most benchmarks only report held-out metrics. Computing PSNR, SSIM, and LPIPS on the training set as well lets this pipeline detect overfitting, which manifests as poor novel view synthesis. It’s a critical diagnostic that is typically invisible in published results.
Geometric evaluation against a dense reference. Standard NeRF/3DGS evaluations focus on rendering quality. This project also measures geometric accuracy via Chamfer distance, Hausdorff distance, and F-score against a COLMAP dense point cloud. Rendering quality and geometric accuracy can diverge. A method can ace one and fail the other.
Controlled comparison on real-world data. Both methods were trained with the same framework (nerfstudio), on the same data split, with similar compute budgets. This reduces the confounding variables that make results difficult to compare.
Custom open-source tooling. Two new tools support the pipeline: image-match for input normalization and recon-bench for evaluation. These are reusable beyond this project.
Full compute profiling. Training time, peak GPU memory, and iteration throughput were logged for both methods, with and without tiny-cuda-nn acceleration.

3DGS Outperformed NeRF on Image Quality

[VISUAL: side-by-side renders]

For both training and held-out views, splatfacto scored higher on PSNR and SSIM than nerfacto. LPIPS was also lower (better perceptual quality). On held-out views the difference was not dramatic (about 1.9 dB PSNR and 0.08 SSIM). Both methods produced recognizable reconstructions of the scene. The 3DGS renders were slightly sharper and had fewer color artifacts.

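The gap between training and eval views was small for nerfacto (about 1.8 dB PSNR). For splatfacto it was larger on PSNR (about 5.9 dB) but small on SSIM (0.8745 vs 0.8666). This suggests that neither was intensely overfitting to the training views, although splatfacto fits them more tightly. With a well-distributed capture set (~100 images), both methods generalized reasonably.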

[PSNR/SSIM/LPIPS comparison table — training vs held-out for both methods]

Metric	NeRF (train)	3DGS (train)	NeRF (eval)	3DGS (eval)
PSNR	23.5383	29.5054	21.7424	23.6512
SSIM	0.8332	0.8745	0.7901	0.8666
LPIPS	0.4241	0.3477	0.4931	0.3887

3DGS Also Won on Geometry

Against the COLMAP dense point cloud reference, splatfacto’s exported point cloud had lower Chamfer distance, lower Hausdorff distance, and higher F-score than nerfacto’s. This was somewhat expected. 3DGS explicitly represents the scene as a set of positioned Gaussians, while nerfacto’s geometry must be extracted via marching cubes or Poisson reconstruction from a density field, which introduces approximation.

The Poisson meshes told a similar story. The nerfacto-derived mesh had more holes and surface noise, while the splatfacto-derived mesh was more complete but still had issues in textureless regions (walls, ceiling).

[Chamfer/Hausdorff/F-score comparison table: nerfacto vs splatfacto by voxel size and distance threshold]

method	voxel_size	threshold	ref_points	pred_points	chamfer	hausdorff	fscore
nerf	0.01	0.01	962740	264752	5.717	18.558	0.030
nerf	0.01	0.02	962740	264752	5.717	18.558	0.120
nerf	0.01	0.04	962740	264752	5.717	18.558	0.371
3dgs	0.01	0.01	962740	410207	2.815	13.672	0.082
3dgs	0.01	0.02	962740	410207	2.815	13.672	0.313
3dgs	0.01	0.04	962740	410207	2.815	13.672	0.872
===	===	===	===	===	===	===	===
nerf	0.02	0.02	427045	145625	7.894	18.558	0.129
nerf	0.02	0.04	427045	145625	7.894	18.558	0.453
nerf	0.02	0.08	427045	145625	7.894	18.558	1.132
3dgs	0.02	0.02	427045	228586	3.182	13.672	0.350
3dgs	0.02	0.04	427045	228586	3.182	13.672	1.148
3dgs	0.02	0.08	427045	228586	3.182	13.672	2.816
===	===	===	===	===	===	===	===
nerf	0.05	0.05	110505	85588	10.947	18.558	0.474
nerf	0.05	0.1	110505	85588	10.947	18.558	1.174
nerf	0.05	0.2	110505	85588	10.947	18.558	2.434
3dgs	0.05	0.05	110505	84524	3.632	13.672	1.935
3dgs	0.05	0.1	110505	84524	3.632	13.672	5.077
3dgs	0.05	0.2	110505	84524	3.632	13.672	10.404

[VISUAL: point cloud visualizations]

Colmap point clouds

3DGS point clouds

NeRF point clouds

[Video (optional) — rotating point cloud comparison: COLMAP dense vs nerfacto vs splatfacto exports, showing density and coverage differences]

Training Efficiency

3DGS trained slower than nerfacto in wall-clock time (6.2h vs 2.5h for 30,000 iterations each), but reached comparable quality in fewer total iterations. 3DGS’s higher memory usage (18GB vs 15.5GB) was due to the explicit storage of Gaussian parameters (position, covariance, color, opacity for each splat). On the GB10, this wasn’t a bottleneck, but on consumer GPUs with less VRAM it would be a consideration.

With tiny-cuda-nn enabled, nerfacto’s per-iteration speed improved 14x. The hash grid encoding was the main beneficiary of the speedups.

[Training time and memory comparison chart]

Method	Training Time	GPU Memory	Iterations
nerfacto	2.5h	15.5GB	30,000
splatfacto	6.2h	18GB	30,000

The Brutal Truth

Neither method produced output that was convincingly “real,” especially in novel views. The interesting bit isn’t that they fell short, but why they did, and whether the failure modes are fundamental or fixable.

Blurriness in high-detail regions. Both methods displayed blurriness on wood grain, patterned fabrics, and small objects. This has many contributing causes:

For splatfacto: Each Gaussian has a minimum effective size determined by its covariance parameters. When scene detail is finer than the smallest Gaussians the model converges to, that detail gets averaged out. Densification (splitting large Gaussians into smaller ones) attempts to address this but is bounded by a threshold and total Gaussian budget.
For nerfacto: The hash grid resolution sets a hard ceiling on representable spatial frequency. Detail finer than the grid spacing is simply not encodable. Increasing the grid resolution helps but costs memory and training time.
For both: Even at 4K resolution, the input images may not contain enough detail in the first place. The camera sensor resolves detail up to the Nyquist limit for its pixel pitch at the capture distance. If the hardwood grain or fabric texture is finer than that limit, no reconstruction method can recover it. The information was never in the training data.
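The hash grid ceiling can be made concrete with the geometric level spacing used by Instant-NGP style encodings, where each level's resolution grows by a constant factor between a base and a maximum resolution. The sketch below is illustrative: the numeric values are example inputs rather than the settings used in this project, and mapping grid cells to metres is an approximation because nerfacto also warps the scene before encoding.

```python
import math

def hash_grid_resolutions(base_res, max_res, num_levels):
    """Per-level grid resolutions with geometric growth between base and max."""
    if num_levels == 1:
        return [base_res]
    growth = math.exp((math.log(max_res) - math.log(base_res)) / (num_levels - 1))
    # The small epsilon guards against floating-point error just below an integer.
    return [math.floor(base_res * growth ** level + 1e-9) for level in range(num_levels)]

def finest_cell_size(scene_extent_m, max_res):
    """Approximate edge length of the finest grid cell for a scene of given extent."""
    return scene_extent_m / max_res

# Example inputs (assumed): 16 -> 2048 over 8 levels, 5 m scene extent.
levels = hash_grid_resolutions(16, 2048, 8)
print(levels, f"{finest_cell_size(5.0, 2048) * 1000:.2f} mm per finest cell")
```

Detail smaller than the finest cell has nowhere to be stored, which is why raising the maximum resolution or the number of levels is the first lever for sharper nerfacto results.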

The overall scene structure was accurate. The accurate walls, furniture placement, and room geometry show that both methods handle low-frequency content well. The failure is in high-frequency texture. This points to resolution limits at multiple stages of the pipeline. It’s not a fundamental flaw in either method.
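The sensor-side resolution limit mentioned above can be estimated with a pinhole model: one pixel covers pixel_pitch × distance / focal_length on a surface facing the camera, and a repeating detail needs at least two pixels per cycle. The helper names and example numbers below are assumptions for illustration, not the specifications of the camera used here.

```python
def pixel_footprint(pixel_pitch_um, focal_length_mm, distance_m):
    """Size in metres that one pixel covers on a fronto-parallel surface."""
    return (pixel_pitch_um * 1e-6) * distance_m / (focal_length_mm * 1e-3)

def smallest_resolvable_period(pixel_pitch_um, focal_length_mm, distance_m):
    """Nyquist limit: a periodic detail needs at least two pixels per cycle."""
    return 2.0 * pixel_footprint(pixel_pitch_um, focal_length_mm, distance_m)

# Hypothetical values: 4 um pixels, 16 mm focal length, floor 4 m away.
period_m = smallest_resolvable_period(4.0, 16.0, 4.0)
print(f"finest resolvable period: {period_m * 1000:.1f} mm")
```

If the wood grain repeats at a finer period than this estimate, the images cannot contain it, and no choice of model or hyperparameters will bring it back.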

The real cost isn’t training — it’s iterating without signal. Modern radiance field methods are not uniformly “slow.” Per-run cost has plummeted with newer implementations. Even so, iterating on hyperparameters can still be costly. Tweaking initial Gaussian count, hash grid resolution, and densification thresholds takes multiple runs. This is where ns-eval falls short (and where recon-bench shines). You need a method that can distinguish a better result from one that scores “better” on the wrong metric. Training-vs-held-out gap detection catches overfitting. Geometric metrics catch the case where renders look good but the underlying 3D is poor. Good tooling turns “run it again with new settings and hope” into a measurable decision.

The Iteration Problem

Long training runs put a premium on understanding how a radiance field forms during training. With ns-viewer, TensorBoard, or Weights & Biases you can watch the reconstruction as it trains. This real-time feedback is incredibly valuable. It enables you to actually see the 3D structure forming. It’s the complete opposite of training standard ML models, where you’re usually watching loss curves and hoping for results.

Despite immediate feedback, you may not be able to see the effects of parameters until well into training. This makes each experiment a time-consuming commitment. The viewer is real-time, but the feedback loop for parameter selection is slow.

The COLMAP Problem

I used COLMAP’s dense reconstruction as the geometric reference, which was problematic. The dense stereo step repeatedly failed on the NVIDIA GB10: inside the pipeline it produced all-zero depth maps, so fusion yielded no points. This was likely due to compute capability (CC) 12.x compatibility issues in the PatchMatch implementation. I had to run the dense reconstruction on a separate machine with an older GPU (an RTX 3080, CC 8.6; see War story 2 above).

This introduced an inconsistency in the pipeline. The sparse reconstruction (used to initialize both NeRF and 3DGS training) ran on the GB10. But the dense point cloud was computed elsewhere. I wasn’t able to test it, but I’d want to see if there are differences between PatchMatch on different GPUs.

Not having a real “ground truth” shaped the entire project. I don’t have access to a LiDAR scanner. Some phone apps (Polycam, Scaniverse) can use the iPhone’s LiDAR sensor to supplement the capture. That sensor is a low-resolution dToF sensor with limited range. Using it as “ground truth” would lead to measuring agreement with a noisy reference instead of geometric accuracy.

Without a proper ground truth, COLMAP dense was the best available option. It’s computed with established and well-documented algorithms. Image quality metrics (where the input images are ground truth) become more important. Using multiple geometric metrics instead of one is no longer optional.

A rigorous evaluation protocol for geometric accuracy would look something like:

LiDAR scan as the primary geometric reference. A true sensor-based measurement, independent of any photogrammetric reconstruction
ICP alignment between the LiDAR reference and each method’s output. This handles coordinate system differences
Per-region metrics instead of scene-wide totals. This shows where each method has trouble, for example on textureless walls, detailed surfaces, or shiny objects
Multiple scenes to separate method-level trends from scene-specific artifacts

This project uses an approximation of that protocol.

The NeRF Cleanup Problem

Nerfacto produced a large number of points outside the scene bounds. It’s a known issue with NeRF methods that model the entire volume, including empty space. The exported point cloud required a lot of manual cropping and filtering. Only then was it ready for comparison with the COLMAP reference.
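The simplest form of this cleanup is an axis-aligned crop to the known room bounds. The sketch below shows that one step with NumPy; the function name and bounds are hypothetical, and the actual cleanup also involved manual filtering that is not reproduced here.

```python
import numpy as np

def crop_to_bounds(points, lower, upper):
    """Keep points inside the axis-aligned box [lower, upper]; also return the mask."""
    points = np.asarray(points, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    mask = np.all((points >= lower) & (points <= upper), axis=1)
    return points[mask], mask

# Hypothetical example: one point inside a 4 x 4 x 3 m room, one far outside it.
cloud = np.array([[1.0, 1.0, 1.0], [25.0, -3.0, 0.5]])
kept, mask = crop_to_bounds(cloud, lower=[0, 0, 0], upper=[4, 4, 3])
print(f"kept {len(kept)} of {len(cloud)} points")
```

Returning the mask alongside the points makes it easy to apply the same crop to per-point colors or normals, and to report how many points were discarded.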

This is not a cosmetic issue. For any workflow that goes from NeRF to mesh (e.g., physics simulation, CAD integration, or 3D printing), cleanup adds significant effort. 3DGS uses explicit point-based representations. This leads to cleaner exports and fewer stray points.

Future Improvements

Better capture strategy: The current capture set was ad-hoc. A more systematic approach would reduce variability and improve image quality. Structured grid positions, controlled lighting, and fixed overlap ratios reduce the number of unusable images. More uniform captures also make it easier to investigate the relationship between input views and reconstruction quality.

HDR / exposure bracketing: The single-exposure capture was the biggest limitation of the input data. Using image-match for post-processing helped to normalize exposure. However, information that was never recorded by the sensor can’t be recovered. Exposure bracketing at each camera position would overcome that limitation. Merging the bracketed captures into HDR images preserves detail across the entire luminance range. Areas like the blown-out skylight and dark interior corners would no longer be problematic, and model training would no longer be limited by clipped image data.

Newer methods: The field moves fast. New methods appear constantly, such as 2D Gaussian Splatting, PatchNeRF, and various 3DGS extensions (anti-aliasing, triangle primitive variants). Re-running the same evaluation pipeline with newer methods would be straightforward.

Outdoor scenes: Indoor scenes have specific challenges (textureless walls, complex lighting, specular surfaces). Outdoor scenes present different challenges (sky modeling, varying illumination, scale). Using the pipeline on outdoor environments would test whether the relative ranking of methods holds across domains.

LiDAR ground truth: COLMAP dense reconstruction is still only an approximation. A LiDAR scan would provide a sensor-based geometric reference that is independent of any photogrammetric reconstruction.

Explicit vs Implicit 3D Representations: A Practitioner’s Perspective

My background is in traditional 3D: polygon meshes, NURBS surfaces, and explicit geometry that you can inspect vertex-by-vertex. I’ve also worked with differentiable mesh rendering via PyTorch3D. This project was my first serious work with implicit and hybrid representations, and the contrast is worth discussing.

The representation spectrum. These methods sit at different points on a spectrum:

Method	Representation	Geometry Access	Editability
Traditional mesh	Explicit triangles	Direct	Full
Differentiable mesh (PyTorch3D)	Explicit triangles, optimized via gradients	Direct	Full (post-optimization)
3D Gaussian Splatting	Semi-explicit (positioned primitives)	Extractable	Limited
NeRF	Implicit (neural density/color field)	Requires extraction (marching cubes)	Very limited

What implicit methods gain: NeRF’s strength is that it doesn’t commit to a surface representation during training. The network learns a continuous volumetric function. That means it can represent fuzzy boundaries, semi-transparent objects, and view-dependent effects naturally. You don’t need to decide the mesh topology upfront.

What implicit methods lose: The geometry is locked inside the network weights. Extracting a mesh means evaluating the density field on a grid and using marching cubes or Poisson reconstruction. Both introduce discretization artifacts and require choosing threshold parameters. The “extra points outside the scene” problem I encountered with nerfacto is a direct consequence. The density field doesn’t have a clean boundary, so extraction always requires cleanup.

Where 3DGS sits: Gaussian Splatting is an interesting middle ground. Each Gaussian is an explicit primitive with a position, covariance, color, and opacity. You can enumerate, filter, and export them. But they’re not a mesh. Converting to a triangle mesh still requires surface reconstruction. Gaussians don’t inherently define a surface normal or connectivity. It’s explicit, but not traditional 3D.

Practical takeaway: For applications that need a mesh (game engines, CAD, 3D printing, physics simulation), going from NeRF to geometry is simpler than from 3DGS. Current research shows that exporting usable meshes from Gaussians is still hard. The path from differentiable mesh optimization is the shortest and most direct. You start and end with triangles. The trade-off is that differentiable mesh methods need a good initial mesh and struggle with topology changes, whereas NeRF and 3DGS can reconstruct scenes from scratch.

From Research to Production: What I’d Do Differently on Day One

This project was scoped as only a research comparison. Working through the pipeline exposed where the gaps would be when scaling beyond a single scene or a single engineer. If I were doing this again I’d prioritize four areas.

Capture protocol, not capture art: The circular walkthrough worked this one time, but it wasn’t auditable. Production systems need repeatable capture protocols covering:

Camera positions and overlap requirements
Lighting specifications for specular and reflective surfaces
Sensor resolution matched to target output quality

Automated quality gates: Input data needs validation before training. Flag images with motion blur or exposure clipping. Verify sufficient overlap. Check COLMAP registration quality (reprojection error, number of registered images, point cloud density). Catching bad data early saves more time than model optimization and avoids costly re-captures.
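Two of these gates, exposure clipping and blur, can be sketched with NumPy on a grayscale image in [0, 1]. This is a minimal illustration under stated assumptions: the function names are hypothetical, the blur measure is the variance of a simple 4-neighbour Laplacian, and the thresholds are placeholders that would need tuning on real captures.

```python
import numpy as np

def clipped_fraction(gray, low=0.01, high=0.99):
    """Fraction of pixels at or beyond the shadow/highlight limits."""
    return float(np.mean((gray <= low) | (gray >= high)))

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; low values suggest a blurry image."""
    lap = (gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
           - 4.0 * gray[1:-1, 1:-1])
    return float(lap.var())

def passes_gate(gray, max_clipped=0.05, min_sharpness=1e-4):
    """True if the image is neither heavily clipped nor nearly featureless."""
    return (clipped_fraction(gray) <= max_clipped
            and laplacian_variance(gray) >= min_sharpness)
```

Running a check like this over the whole capture set before SfM gives a list of suspect images while re-shooting them is still cheap.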

Evaluation as CI, not as a post-hoc step: The recon-bench tool works for manual comparison. The next step is having it run automatically after every training job. The tooling is already scriptable and CLI driven.
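A CI-style gate could compare the metrics from a new training run against a stored baseline and fail the job on regressions. The sketch below is hypothetical: the function name, the dictionary layout, and the 1% tolerance are assumptions, not recon-bench's report format, and baseline values are assumed to be non-zero.

```python
def check_regressions(current, baseline, higher_is_better, tolerance=0.01):
    """Return metrics that got worse than baseline by more than `tolerance` (relative)."""
    failures = {}
    for name, base in baseline.items():
        value = current[name]
        sign = 1.0 if higher_is_better[name] else -1.0
        # Positive change = improvement, regardless of the metric's direction.
        change = sign * (value - base) / abs(base)
        if change < -tolerance:
            failures[name] = (base, value)
    return failures

# Hypothetical run: PSNR dropped by ~2.7%, LPIPS improved slightly.
failures = check_regressions(
    current={"psnr": 23.0, "lpips": 0.380},
    baseline={"psnr": 23.65, "lpips": 0.389},
    higher_is_better={"psnr": True, "lpips": False},
)
print(failures)
```

In a pipeline script, a non-empty result would translate into a non-zero exit code so the training job is marked as failed rather than silently accepted.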

Method selection as an engineering decision: A production system shouldn’t hard-code a single best method. The evaluation pipeline itself is the product. It lets you make decisions with data instead of intuition.
