May 06, 2026
28 min read

Benchmarking 3D Reconstruction

What does it take to go from raw photographs to 3D reconstruction?

Building a Real-World Benchmark for 3D Reconstruction

Stack: PyTorch, Python · GitHub link · For a condensed version of this post see the README.



The Problem

Most NeRF and 3DGS benchmarks use curated datasets. They’re either synthetic scenes, or captures with controlled lighting and professional-grade multi-camera rigs. The results look impressive, but they don’t answer the real-world deployment question: how do these methods perform with imperfect data, consumer hardware, and a custom pipeline you built yourself?

I didn’t set out to find out whether NeRF or Gaussian Splatting is “better.” I set out to understand what it takes to go from raw photographs to 3D reconstruction. I wanted to scrutinize every stage, decision, and failure. Real captures have exposure inconsistencies, clipped data, motion in the scene, sensor resolution limits, and challenging surfaces. Pipelines have dependency conflicts, undocumented build failures, and tools that crash on your hardware. This project works under those conditions to answer:

  • How do NeRF and 3DGS compare when the input data is imperfect?
  • Can you build a rigorous evaluation pipeline that works on all datasets?
  • What are the practical trade-offs you face when going from captures to usable 3D?

Using radiance field methods on real-world data involves a chain of problems that compound. Capture strategy directly affects Structure-from-Motion (SfM) quality. That in turn determines the camera poses that NeRF and 3DGS train on. Sensor limitations (dynamic range, sensor size and resolution) restrict what any method can reconstruct. The toolchain itself may not build on your hardware.

Each of these stages has its own failure modes, and weaknesses propagate downstream. To figure out whether a problem lives in the data, the method, or the evaluation, you can’t look only at the output.

3DGS at 30k steps

NeRF at 30k steps

[VISUAL: Rendered held-out evaluations of nerfacto and splatfacto (Ground truth on the left)]

Workflow Overview

  1. Capture images of an indoor space using a digital stills camera
  2. Post-process the images with a custom tool for consistent exposure, color, and sharpness
  3. Process the images using COLMAP to produce a sparse 3D point cloud
  4. Train separate NeRF & Gaussian Splat (3DGS) models using nerfstudio to generate 3D reconstructions
  5. Render the NeRF & 3DGS models via the training cameras
  6. Export Poisson meshes from the NeRF & COLMAP point clouds
  7. Evaluate & compare the results across multiple metrics with custom benchmarking tools

Key Design Decisions & Techniques

Manual Captures

I used a Fujifilm X-series mirrorless camera with a 16-55mm f/2.8 zoom lens to capture ~100 images of a single indoor scene, rather than extracting frames from video or using turnkey solutions like RealityScan or SplatKing. I selected this camera because its color science produces accurate, consistent renditions. It also supports exposure bracketing for HDR captures. The zoom lens provides sharp images across a range of useful focal lengths for interior spaces.

Why stills over video? Video frames are convenient but come with potential challenges. Movement while recording introduces motion blur and rolling shutter artifacts. Storage and device constraints can result in lower per-frame resolution. Stills, on the other hand, give you full control over exposure, focus, and white balance per shot. This directly impacts SfM feature matching and downstream reconstruction quality.

Why manual over automated? The goal was to understand the full pipeline, not abstract it away. I planned out capture positions and overlap by hand. This helped develop intuition for what makes a good capture set. I focused on having enough baseline between views and consistent lighting. Coverage of textureless regions and specular surfaces got less of my attention, but it is no less important: it matters when designing capture protocols for production systems.

I captured the images by following a circular path through the room. I used three heights and an inward-facing orientation. This pattern provides good baseline coverage and loop closure for SfM. However, it introduces specific challenges that show up in the results. The hardwood floor has fine grain detail that exceeds what the camera sensor resolves at capture distance. The ceiling is visible, but the skylight is blown out (white pixels). This is due to the dynamic range limitations of a single-exposure capture. Unfortunately, the models can’t reconstruct detail that doesn’t exist in the input pixels.

I post-processed each image using a custom image-match tool. The tool normalized exposure, color balance, and sharpness across the set. This was partly a mitigation for the dynamic range problem. The images facing the skylight were significantly brighter than those facing interior walls. Evening out exposure and color across the images improves SfM feature matching consistency. It gives the models a more uniform training signal. But it’s not a substitute for true HDR capture. You can’t recover clipped highlights in post. Input data quality is a first-class concern (garbage in, garbage out!). You have to do what you can to reduce the damage from exposure variations.
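The image-match tool itself isn’t reproduced here. As a rough illustration of the exposure part of the idea, the sketch below equalizes median luminance across a set of images in linear light. It is an assumption-laden stand-in, not the author’s tool: it assumes sRGB-encoded float images in [0, 1], uses Rec. 709 luminance weights, and applies only one global gain per image (no color balance or sharpness matching).

```python
import numpy as np

# Rec. 709 luminance weights for linear RGB
LUMA = np.array([0.2126, 0.7152, 0.0722])


def srgb_to_linear(c):
    """Undo the sRGB transfer curve for values in [0, 1]."""
    c = np.asarray(c, dtype=np.float64)
    return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)


def linear_to_srgb(c):
    """Re-apply the sRGB transfer curve for values in [0, 1]."""
    c = np.clip(np.asarray(c, dtype=np.float64), 0.0, 1.0)
    return np.where(c <= 0.0031308, 12.92 * c, 1.055 * c ** (1 / 2.4) - 0.055)


def normalize_exposure(images):
    """Give every image (H, W, 3 sRGB floats) the set-wide median linear luminance."""
    linear = [srgb_to_linear(img) for img in images]
    medians = np.array([np.median(lin @ LUMA) for lin in linear])
    target = np.median(medians)
    normalized = []
    for lin, med in zip(linear, medians):
        gain = target / max(med, 1e-6)  # guard against all-black frames
        # Clip after applying the gain: highlights that were clipped stay clipped
        normalized.append(linear_to_srgb(np.clip(lin * gain, 0.0, 1.0)))
    return normalized
```

Working in linear light keeps the gain physically meaningful (doubling a value doubles the light). It also makes the limitation explicit: a frame with a blown-out skylight stays blown out, because a gain can only rescale what the sensor actually recorded.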

Nerfstudio as the Unified Platform

I used nerfstudio to train both the nerfacto (NeRF) and splatfacto (3DGS) models. Using a single framework for both was a deliberate choice to reduce variables. The two methods share the same data loader, train/eval split, coordinate system, and export pipeline.

The more important decision was what to change from the defaults, which were too conservative and under-represented fine detail. I immediately adjusted the train/eval split ratio so the held-out set covered the full camera trajectory. Nerfstudio exposes a huge number of parameters and allows methodical tuning, which made it possible to experiment with nerfacto’s hash grid resolution and number of levels. These are crucial for recovering detail. For splatfacto, the key parameters were the cull and densification thresholds and the learning rate.
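One simple way to make a held-out set span the whole camera path is to hold out every Nth image in capture order instead of a random subset. The helper below is a hypothetical sketch, not nerfstudio’s own split logic, and it assumes filenames sort in capture order (as sequential camera numbering such as DSCF0001.JPG usually does). Recent nerfstudio versions appear to offer a comparable interval-based evaluation mode in the nerfstudio-data parser; check ns-train nerfacto nerfstudio-data --help for the exact option names in your version.

```python
def interval_split(image_names, eval_interval=8):
    """Hold out every `eval_interval`-th image in capture order.

    Assumes filenames sort in capture order (e.g. sequential camera numbering),
    so the held-out views are spread evenly along the camera path.
    """
    ordered = sorted(image_names)
    eval_names = ordered[::eval_interval]
    train_names = [name for i, name in enumerate(ordered) if i % eval_interval != 0]
    return train_names, eval_names
```

With 100 images and an interval of 8, this holds out 13 views and trains on the remaining 87.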

The framework also provides infrastructure that made the evaluation pipeline possible:

  • ns-viewer — for real-time training visualization
  • Tensorboard/W&B integration — for loss curves and training dynamics
  • ns-export — for extracting point clouds and Poisson meshes
  • ns-eval — for held-out set evaluation
  • Scriptability — the entire training pipeline is CLI-driven. Adding GPU profiling and structured logging becomes a scripting task.

These helped me to run experiments and develop a feel for how each method responds to its key parameters. The training scripts I used to facilitate this are available here.
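As an illustration of what the scriptability point can look like in practice, here is a minimal wrapper that runs any CLI command, times it, and appends one JSON record per run to a log file. The function name and record fields are hypothetical and not part of nerfstudio. GPU memory sampling (for example, polling nvidia-smi from a background thread) would slot into the same record but is left out to keep the sketch portable.

```python
import json
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_logged(cmd, log_path=None):
    """Run a command, let its output stream as usual, and record timing metadata."""
    start = time.perf_counter()
    result = subprocess.run(cmd)  # no capture: training logs still reach the terminal
    record = {
        "cmd": cmd,
        "returncode": result.returncode,
        "wall_seconds": round(time.perf_counter() - start, 3),
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
    }
    if log_path is not None:
        # JSON Lines: one self-contained record per run, easy to append and parse later
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
    return record
```

For example, run_logged(["ns-train", "nerfacto", "--data", "data/room"], "runs.jsonl") trains as usual and leaves a JSON line behind for later comparison. The data path is a placeholder.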

External Evaluation

Nerfstudio’s built-in ns-eval computes PSNR, SSIM, and LPIPS on the held-out test set. This is standard practice, but it only tells half the story. A model that scores well on training views but poorly on held-out images has a different problem from one that scores poorly on both.
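The train-vs-held-out comparison comes down to computing the same metric on both splits and taking the difference. A minimal sketch for PSNR, using the standard definition PSNR = 10·log10(MAX² / MSE), with an explicit sign convention for the gap. The function names are illustrative; recon-bench’s actual implementation may differ, and SSIM and LPIPS would normally come from an existing library rather than hand-written code.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val**2 / MSE)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val**2 / mse))


def generalization_gap(train_score, eval_score, higher_is_better=True):
    """How much better the model does on training views than on held-out views.

    Positive means train beats eval. Pass higher_is_better=False for
    LPIPS-style metrics, where lower values are better.
    """
    gap = train_score - eval_score
    return gap if higher_is_better else -gap
```

Plugging in the PSNR values from the results table, generalization_gap(23.5383, 21.7424) gives about 1.8 dB for nerfacto and generalization_gap(29.5054, 23.6512) about 5.9 dB for splatfacto.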

I built recon-bench, a custom benchmarking tool, to fill the gaps:

  • Training set evaluation: Compute image quality metrics (PSNR, SSIM, LPIPS) on training views. Then compare against held-out (eval set) performance. A large gap between training and test metrics is a strong overfitting signal.
  • Geometric evaluation: Using COLMAP’s dense reconstruction as a reference point cloud, I computed Chamfer distance, Hausdorff distance, and F-score against the NeRF and 3DGS exported point clouds. These metrics quantify geometric accuracy independent of rendering quality.
  • Unified reporting: All metrics (image quality, geometric, timing, memory) are collected into structured reports. It’s straightforward to see the full picture.

This separation of rendering quality from geometric accuracy is important. A method can produce photo-realistic novel views but with poor geometry, or vice versa. For applications that need accurate 3D models (not just pretty renders), geometric metrics are essential.
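For reference, the geometric metrics can be sketched as nearest-neighbor distances between two point clouds. Conventions vary between tools (Chamfer is sometimes a sum rather than a mean, sometimes squared), so this sketch illustrates one common definition and is not necessarily what recon-bench computes. Brute-force distances keep it dependency-free; for clouds with hundreds of thousands of points, a KD-tree such as scipy.spatial.cKDTree or Open3D’s compute_point_cloud_distance is the practical choice.

```python
import numpy as np


def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst.

    Brute force with O(len(src) * len(dst)) memory: fine for small clouds only.
    """
    diffs = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)


def geometry_metrics(pred, ref, threshold):
    """Chamfer (mean of both directions), Hausdorff, and F-score at `threshold`."""
    d_pred = nearest_distances(pred, ref)  # accuracy: prediction -> reference
    d_ref = nearest_distances(ref, pred)   # completeness: reference -> prediction
    precision = float(np.mean(d_pred < threshold))
    recall = float(np.mean(d_ref < threshold))
    fscore = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {
        "chamfer": float(d_pred.mean() + d_ref.mean()) / 2,
        "hausdorff": float(max(d_pred.max(), d_ref.max())),
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }
```

With this definition, precision, recall, and F-score all lie in [0, 1]; some tools report them as percentages instead. Chamfer rewards being close on average, while Hausdorff is driven entirely by the single worst point, which is why stray points outside the scene inflate it.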

Building Everything from Source

The entire software stack (COLMAP, nerfstudio, Open3D, tiny-cuda-nn) was installed and compiled from source. This wasn’t by preference. It was by necessity.

The compute platform was an NVIDIA GB10 (DGX Spark). It runs an ARM64 Grace CPU paired with a Blackwell GPU using CUDA 13.x. This combination is bleeding-edge. At the time of this project, no pre-built wheels existed for PyTorch + CUDA 13 on ARM64. Every dependency in the chain needed to be built against this specific toolchain.

What this required:

  • Resolving Application Binary Interface (ABI) compatibility across COLMAP (C++), PyTorch (Python/C++), tiny-cuda-nn (CUDA), and Open3D (C++/Python)
  • Debugging build failures from libraries that assumed x86_64 or older CUDA versions
  • Managing CUDA compute capability flags for the Blackwell architecture

This was not a quick detour. Open3D took 1–2 days of compilation work. COLMAP swallowed another 2–3 days. Two episodes in particular illustrate the kind of debugging the platform demanded.

War story 1 — Open3D and the impossible NVCC flag

After the usual setup tax (Python interpreter hints for CMake, a broken /usr/bin/nvcc symlink, exporting torch.utils.cmake_prefix_path so FindPytorch.cmake could locate Torch), there was a single line in the configure output that shouldn’t have been possible:

-- Added CUDA NVCC flags for: -gencode;arch=compute_20,code=sm_121

sm_121 is Blackwell. compute_20 is Fermi (a virtual architecture from 2010). These two cannot coexist in a valid gencode pair. Something in the build was fabricating an architecture tuple out of mismatched parts.

My first instinct, exporting TORCH_CUDA_ARCH_LIST="12.1", changed nothing. That was a useful signal: the bad value was not being read from the environment; it was being derived upstream and injected into PyTorch’s flag generator. So the question shifted from “what do I set?” to “what is writing this value, and when?”

Reading the CMake output made the handoff clear:

  1. Open3D’s top-level CMake reads CMAKE_CUDA_ARCHITECTURES and, under its native/auto path, derives a TORCH_CUDA_ARCH_LIST from it.
  2. PyTorch’s Caffe2 CMake then dismisses CMAKE_CUDA_ARCHITECTURES: "pytorch is not compatible… will ignore its value" and sets it OFF.
  3. But Open3D’s derived value has already been handed across the boundary, and PyTorch’s torch_cuda_get_nvcc_gencode_flag(...) emits the malformed pair.

The fix followed from the diagnosis: bypass Open3D’s auto path by clearing the cache and pinning both the Open3D and PyTorch architecture variables:

rm -f CMakeCache.txt
cmake -S .. -B . \
  -DBUILD_CUDA_MODULE=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DTORCH_CUDA_ARCH_LIST=12.1

Result: -- Added CUDA NVCC flags for: -gencode;arch=compute_121,code=sm_121. Clean pair, correct architecture.
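Because the bad value only surfaced as a single configure line, a quick scan of the configure output can catch it early. The sketch below is a hypothetical helper, not part of Open3D or PyTorch. It extracts every arch=compute_XX,code=sm_YY pair and flags pairs where the virtual architecture is newer than the real one (which nvcc does not allow) or, for a build pinned to a single GPU target, where the two simply differ.

```python
import re

# Matches pairs such as "arch=compute_121,code=sm_121" in CMake/nvcc output
GENCODE_PAIR = re.compile(r"arch=compute_(\d+),code=sm_(\d+)")


def suspicious_gencode_pairs(configure_output, single_arch=True):
    """Return (virtual, real) architecture pairs that look wrong.

    A virtual arch newer than the real arch is never valid; with
    single_arch=True (one pinned GPU target), any mismatch is also flagged.
    """
    flagged = []
    for virt, real in GENCODE_PAIR.findall(configure_output):
        virt, real = int(virt), int(real)
        if virt > real or (single_arch and virt != real):
            flagged.append((virt, real))
    return flagged
```

Run against a saved copy of the configure output, it flags the compute_20/sm_121 pair and returns an empty list for the fixed compute_121/sm_121 output.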

What remained was a long tail of smaller obstacles. It’s worth naming them because they illustrate how bleeding-edge builds fail.

  • A make run whose error scrolled off the terminal too quickly (resolved by make -j1 VERBOSE=1 2>&1 | tee build.log and grepping upward for error:, undefined reference, nvcc fatal, ld:)
  • An Open3D-ML submodule failing to clone because CMake had rewritten https:// to https:/ and git was falling back to SSH on port 22 (resolved by cloning manually and pointing the build at the local path)
  • Missing yapf, wheel, and setuptools in the build venv caused the pip package build to fail.
  • uv add on the built wheel failing because requires-python = ">=3.12" opened splits for 3.13/3.14 on win32 that had no matching wheel (resolved by pinning to 3.12.*)

The Open3D build was less about individual errors and more about reading a layered CMake handoff as a causal chain instead of a pile of warnings.

War story 2 — COLMAP stereo fusion: zero fused points

COLMAP itself was a gentler build. The onnxruntime package shipped CMake targets pointing at /usr/local/lib64/libonnxruntime.so.1.24.1 while the actual library lived in lib/. I patched the target file and added a lib64 → lib symlink:

sed -i 's#/lib64/#/lib/#g' _deps/onnxruntime-build/share/onnxruntime/cmake/onnxruntimeTargets.cmake
(cd _deps/onnxruntime-build && ln -s lib lib64)

Qt6 needed the usual multi-package apt incantation. Sparse reconstruction then ran cleanly. The interesting failure came later, during dense reconstruction:

fusion.cc:331] Could not fuse any points. This is likely caused by incorrect settings —
               filtering must be enabled for the last call to patch match stereo.
fusion.cc:337] Number of fused points: 0

The error message is a red herring. It names one possible cause, not the actual one. I worked through the obvious suspects:

  • Dev branch instability — checked out and rebuilt tagged 3.13.0. No change.
  • Filtering flag — the message’s literal suggestion. Tried --PatchMatchStereo.filter true and 1. No change.
  • VRAM pressure from 4K images — capped with --PatchMatchStereo.max_image_size 2000. No change.
  • Two-pass patch match — ran photometric (geom_consistency 0) then geometric (geom_consistency 1) separately. No change.
  • Over-strict geometric consistency — relaxed filter_min_num_consistent 1, filter_geom_consistency_max_cost 3, write_consistency_graph 1. No change.
  • Bad camera poses upstream — plotted camera positions with a small plot_poses.py utility. Poses were fine.

Each attempt eliminated a hypothesis rather than fixing the problem. Frustrating, but still a kind of progress. At that point I stopped tweaking parameters and started inspecting artifacts. I built inspect_depth_map.py to render COLMAP’s binary depth maps as images. Then I used hexdump on the file headers to confirm the dimensions and check that the data regions held values. Depth maps from the first patch-match pass: full of data. Normal maps and consistency graphs: also fine.

Then I decomposed the orchestration script and ran each stage manually. The depth maps produced during the run that called fusion were all zeros. But the identical patch-match invocation, run standalone, produced valid depths! Same binary, same inputs, same flags, different outputs. That narrowed the cause to something in the pipeline’s runtime state rather than the CLI parameters. Given the rest of the stack (CUDA 13 on a brand-new GPU architecture), one plausible suspect remained: the CUDA driver / Blackwell kernel path was silently producing zero-valued writes.
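The checks in the last two paragraphs depend on reading COLMAP’s binary depth maps directly. The author’s inspect_depth_map.py isn’t reproduced here, but a reader is short. This sketch follows the layout that COLMAP’s own Python helper (read_write_dense.py) reads: an ASCII header of the form width&height&channels& followed by float32 values stored column-major. The valid_fraction helper is a hypothetical addition that reduces a map to one number, so an all-zero map like the ones described above stands out immediately.

```python
import numpy as np


def parse_colmap_array(buf):
    """Decode a COLMAP dense array (depth map, normal map, ...) from raw bytes.

    Layout: ASCII header "width&height&channels&" followed by float32 values
    stored in column-major order as (width, height, channels).
    """
    pos = 0
    for _ in range(3):  # the header ends after the third '&'
        pos = buf.index(b"&", pos) + 1
    width, height, channels = (int(v) for v in buf[:pos].split(b"&")[:3])
    data = np.frombuffer(buf[pos:], dtype="<f4")
    expected = width * height * channels
    if data.size != expected:
        raise ValueError(f"expected {expected} values, found {data.size}")
    array = data.reshape((width, height, channels), order="F")
    return np.transpose(array, (1, 0, 2)).squeeze()  # -> (height, width[, channels])


def read_colmap_array(path):
    with open(path, "rb") as fh:
        return parse_colmap_array(fh.read())


def valid_fraction(depth):
    """Share of pixels with a positive depth; an all-zero map returns 0.0."""
    depth = np.asarray(depth)
    return float(np.count_nonzero(depth > 0)) / depth.size
```

Looping valid_fraction(read_colmap_array(path)) over the photometric and geometric depth maps gives a per-image sanity check that takes seconds, compared with waiting for fusion to report zero points.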

I decided to stop debugging the symptom on the suspect hardware, and verify on a known-good platform. I rsynced the full project to a machine with an older RTX 3080 and started rebuilding.

That move introduced a detour of its own. The 3080 machine needed its NVIDIA drivers and toolkit bumped to 13.2 to match the CUDA toolchain. Upgrading broke the (Linux) desktop environment. Login would drop straight back to the greeter. Thankfully this is a commonly known failure: the open-source nouveau driver races the proprietary one at boot. I rebuilt the graphics stack from a TTY. I started by blacklisting nouveau via /etc/modprobe.d/. Then I forced the affected GUI apps into a no-GPU path so I could at least get a working session.

It wasn’t a COLMAP problem, but it’s the kind of collateral damage that’s routine when chasing driver compatibility. It’s a reminder that “switch hardware” is never as simple as it sounds.

With the 3080 box usable again, I rebuilt COLMAP from source and reran the exact scripts. Fusion completed without errors and the depth maps had data. I transferred the dense point cloud back to the Spark and resumed the Poisson meshing stage.

The COLMAP failure was not fixed by a flag. By refusing to trust any one layer, I isolated the one variable (hardware) that every previous attempt had held constant.

What this enabled:

  • A working pipeline on hardware that no one else had packaged for yet
  • Direct comparison of training performance with and without tiny-cuda-nn acceleration
  • Understanding the full dependency graph (valuable when debugging silent failures in reconstruction pipelines)

See [build notes] for full reproduction instructions.

Results & Reflections

[VISUAL: side-by-side interpolated flythrough of nerfacto vs splatfacto using ns-render interpolate along the circular training camera path] Training camera renders

What Was Achieved

A complete, reproducible pipeline from raw photos to quantitative multi-method comparison. Several aspects of this evaluation go beyond standard practice in the field:

  • Training-vs-held-out evaluation for both methods. Most benchmarks only report held-out metrics. Computing PSNR, SSIM, and LPIPS on the training set lets this pipeline detect overfitting, which shows up as strong training-view scores paired with poor novel view synthesis. It’s a critical diagnostic that is typically invisible in published results.
  • Geometric evaluation against a dense reference. Standard NeRF/3DGS evaluations focus on rendering quality. This project also measures geometric accuracy via Chamfer distance, Hausdorff distance, and F-score against a COLMAP dense point cloud. Rendering quality and geometric accuracy can diverge. A method can ace one and fail the other.
  • Controlled comparison on real-world data. Both methods were trained with the same framework (nerfstudio), on the same data split, with similar compute budgets. This removes many of the confounding variables that make results difficult to compare.
  • Custom open-source tooling. I built two new tools to support the pipeline: image-match for input normalization and recon-bench for evaluation. Both are reusable beyond this project.
  • Full compute profiling. Training time, peak GPU memory, and iteration throughput were logged for both methods, and nerfacto was profiled with and without tiny-cuda-nn acceleration.

3DGS Outperformed NeRF on Image Quality

[VISUAL: side-by-side renders, four panels]

For both training and held-out views, splatfacto scored higher than nerfacto on PSNR and SSIM, and lower (better perceptual quality) on LPIPS. On held-out views the difference was not dramatic, and both methods produced recognizable reconstructions of the scene. The 3DGS renders were slightly sharper and had fewer color artifacts.

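The gap between training and eval views was modest for nerfacto (about 1.8 dB PSNR). Splatfacto’s PSNR gap was larger (about 5.9 dB), but its SSIM and LPIPS gaps were smaller than nerfacto’s. This suggests that neither method was severely overfitting to the training views. With a well-distributed capture set (~100 images), both methods generalized reasonably.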

[PSNR/SSIM/LPIPS comparison table — training vs held-out for both methods]

| Metric | NeRF (train) | 3DGS (train) | NeRF (eval) | 3DGS (eval) |
|---|---|---|---|---|
| PSNR | 23.5383 | 29.5054 | 21.7424 | 23.6512 |
| SSIM | 0.8332 | 0.8745 | 0.7901 | 0.8666 |
| LPIPS | 0.4241 | 0.3477 | 0.4931 | 0.3887 |

3DGS Also Won on Geometry

Against the COLMAP dense point cloud reference, splatfacto’s exported point cloud had lower Chamfer distance, lower Hausdorff distance, and higher F-score than nerfacto’s. This was somewhat expected. 3DGS explicitly represents the scene as a set of positioned Gaussians, while nerfacto’s geometry must be extracted via marching cubes or Poisson reconstruction from a density field, which introduces approximation.

The Poisson meshes told a similar story. The nerfacto-derived mesh had more holes and surface noise, while the splatfacto-derived mesh was more complete but still had issues in textureless regions (walls, ceiling).

[Chamfer/Hausdorff/F-score comparison table: nerfacto vs splatfacto against the COLMAP dense reference, at three voxel sizes and distance thresholds]

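| method | voxel_size | threshold | ref_points | pred_points | chamfer | hausdorff | fscore |
|---|---|---|---|---|---|---|---|
| nerf | 0.01 | 0.01 | 962740 | 264752 | 5.717 | 18.558 | 0.030 |
| nerf | 0.01 | 0.02 | 962740 | 264752 | 5.717 | 18.558 | 0.120 |
| nerf | 0.01 | 0.04 | 962740 | 264752 | 5.717 | 18.558 | 0.371 |
| 3dgs | 0.01 | 0.01 | 962740 | 410207 | 2.815 | 13.672 | 0.082 |
| 3dgs | 0.01 | 0.02 | 962740 | 410207 | 2.815 | 13.672 | 0.313 |
| 3dgs | 0.01 | 0.04 | 962740 | 410207 | 2.815 | 13.672 | 0.872 |
| nerf | 0.02 | 0.02 | 427045 | 145625 | 7.894 | 18.558 | 0.129 |
| nerf | 0.02 | 0.04 | 427045 | 145625 | 7.894 | 18.558 | 0.453 |
| nerf | 0.02 | 0.08 | 427045 | 145625 | 7.894 | 18.558 | 1.132 |
| 3dgs | 0.02 | 0.02 | 427045 | 228586 | 3.182 | 13.672 | 0.350 |
| 3dgs | 0.02 | 0.04 | 427045 | 228586 | 3.182 | 13.672 | 1.148 |
| 3dgs | 0.02 | 0.08 | 427045 | 228586 | 3.182 | 13.672 | 2.816 |
| nerf | 0.05 | 0.05 | 110505 | 85588 | 10.947 | 18.558 | 0.474 |
| nerf | 0.05 | 0.1 | 110505 | 85588 | 10.947 | 18.558 | 1.174 |
| nerf | 0.05 | 0.2 | 110505 | 85588 | 10.947 | 18.558 | 2.434 |
| 3dgs | 0.05 | 0.05 | 110505 | 84524 | 3.632 | 13.672 | 1.935 |
| 3dgs | 0.05 | 0.1 | 110505 | 84524 | 3.632 | 13.672 | 5.077 |
| 3dgs | 0.05 | 0.2 | 110505 | 84524 | 3.632 | 13.672 | 10.404 |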

[VISUAL: point cloud visualizations]

COLMAP point clouds

3DGS point clouds

NeRF (nerfacto) point clouds

[Video (optional) — rotating point cloud comparison: COLMAP dense vs nerfacto vs splatfacto exports, showing density and coverage differences]

Training Efficiency

3DGS trained slower than nerfacto but reached comparable quality in fewer iterations (both runs were carried to 30,000). 3DGS’s higher memory usage was due to the explicit storage of Gaussian parameters (position, covariance, color, and opacity for each splat). On the GB10 this wasn’t a bottleneck, but on consumer GPUs with less VRAM it would be a consideration.
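To see why explicit storage adds up, a back-of-the-envelope estimate helps. The sketch assumes the common 3DGS parameterization (3 position, 3 scale, 4 rotation-quaternion, and 1 opacity value, plus degree-3 spherical-harmonic color, i.e. 48 color values), fp32 storage, and Adam, which keeps a gradient plus two moment buffers per parameter. These are assumptions about a typical setup, not measurements of splatfacto.

```python
def gaussian_memory_bytes(num_gaussians, sh_degree=3, bytes_per_float=4, extra_copies=3):
    """Estimate (parameter_bytes, bytes_including_training_state) for 3DGS.

    Assumes position (3), scale (3), rotation quaternion (4), opacity (1) and
    RGB spherical-harmonic coefficients, all fp32. extra_copies=3 models one
    gradient buffer plus Adam's two moment estimates per parameter.
    """
    sh_coeffs_per_channel = (sh_degree + 1) ** 2  # 16 at degree 3
    floats_per_gaussian = 3 + 3 + 4 + 1 + 3 * sh_coeffs_per_channel
    param_bytes = num_gaussians * floats_per_gaussian * bytes_per_float
    return param_bytes, param_bytes * (1 + extra_copies)
```

At one million Gaussians this gives about 236 MB of parameters and roughly 0.94 GB including optimizer state, growing linearly with the Gaussian count. Rasterization buffers, cached training images, and densification statistics come on top, so this is a lower bound rather than a model of the peak usage measured here.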

With tiny-cuda-nn enabled, nerfacto’s per-iteration speed improved 14x. The hash grid encoding was the main beneficiary of the speedups.

[Training time and memory comparison chart]

| Method | Training Time | GPU Memory | Iterations |
|---|---|---|---|
| nerfacto | 2.5 h | 15.5 GB | 30,000 |
| splatfacto | 6.2 h | 18 GB | 30,000 |

The Brutal Truth

Neither method produced output that was convincingly “real,” especially in novel views. The interesting question isn’t whether they fell short but why, and whether the failure modes are fundamental or fixable.

Blurriness in high-detail regions. Both methods displayed blurriness on wood grain, patterned fabrics, and small objects. This has many contributing causes:

  • For splatfacto: Each Gaussian has a minimum effective size determined by its covariance parameters. When scene detail is finer than the smallest Gaussians the model converges to, that detail gets averaged out. Densification (splitting large Gaussians into smaller ones) attempts to address this but is bounded by the densification threshold and the total Gaussian budget.
  • For nerfacto: The hash grid resolution sets a hard ceiling on representable spatial frequency. Detail finer than the grid spacing is simply not encodable. Increasing the grid resolution helps but costs memory and training time.
  • For both: Even at 4K resolution, the input images may not contain enough detail in the first place. The camera sensor resolves detail up to the Nyquist limit for its pixel pitch at the capture distance. If the hardwood grain or fabric texture is below that limit, no reconstruction method can recover it. The information was never in the training data.
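Both ceilings can be put in numbers. The sketch below compares the sensor side (ground sample distance under a pinhole approximation, doubled for the Nyquist limit) with the hash-grid side (per-level resolutions using the geometric growth factor from the Instant-NGP paper). The pixel pitch, distance, and focal length in the example are hypothetical. How a grid resolution maps to a metric cell size in nerfacto depends on its scene normalization and contraction, so treat the grid numbers as relative rather than metric.

```python
import math


def ground_sample_distance_mm(pixel_pitch_um, distance_m, focal_length_mm, downscale=1.0):
    """Width of the scene patch imaged by one pixel (pinhole model, distance >> focal length).

    downscale > 1 models training images resized below the sensor's native resolution.
    """
    pitch_mm = pixel_pitch_um * 1e-3 * downscale
    return pitch_mm * (distance_m * 1e3) / focal_length_mm


def nyquist_period_mm(gsd_mm):
    """Finest repeating pattern the samples can represent: two pixels per cycle."""
    return 2.0 * gsd_mm


def hash_grid_resolutions(n_min, n_max, num_levels):
    """Per-level grid resolutions with Instant-NGP's geometric growth factor."""
    if num_levels == 1:
        return [n_min]
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [math.floor(n_min * growth**level) for level in range(num_levels)]
```

With a hypothetical 3.76 µm pixel pitch, a 3 m capture distance, and the lens at 16 mm, one pixel covers about 0.7 mm of the floor, so grain with a period finer than about 1.4 mm cannot appear in the photographs at all. Training on images downscaled by 2× doubles both numbers. On the grid side, the finest level’s cell is the extent of the region the grid covers divided by the last value in hash_grid_resolutions, which is why raising the maximum resolution is the lever for finer detail.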

The overall scene structure was accurate: walls, furniture placement, and room geometry show that both methods handle low-frequency content well. The failure is in high-frequency texture. This points to resolution limits at multiple stages of the pipeline, not a fundamental flaw in either method.

The real cost isn’t training — it’s iterating without signal. Modern radiance field methods are not uniformly “slow.” Per-run cost has plummeted with newer implementations. Even so, iterating on hyperparameters can still be costly. Tweaking initial Gaussian count, hash grid resolution, and densification thresholds takes multiple runs. This is where ns-eval falls short (and where recon-bench shines). You need a method that can distinguish a better result from one that scores “better” on the wrong metric. Training-vs-held-out gap detection catches overfitting. Geometric metrics catch the case where renders look good but the underlying 3D is poor. Good tooling turns “run it again with new settings and hope” into a measurable decision.

The Iteration Problem

Long training runs reward a good understanding of how a radiance field takes shape during optimization. With ns-viewer, TensorBoard, or Weights & Biases you can watch the reconstruction as it trains. This real-time feedback is incredibly valuable: you can actually see the 3D structure forming. It’s the complete opposite of training standard ML models, where you’re usually watching loss curves and hoping for results.

Despite the immediate visual feedback, the effects of a parameter change may not show up until well into training. This makes each experiment a time-consuming commitment. The visual feedback is real-time, but the feedback loop for parameter selection is slow.

The COLMAP Problem

I used COLMAP’s dense reconstruction as the geometric reference, which was problematic. The dense stereo step repeatedly failed on the NVIDIA GB10, ending in zero fused points. This was likely due to compute capability (CC) 12.x compatibility issues in the PatchMatch implementation. I had to run the dense reconstruction on a separate machine with an older CC 8.6 GPU (see War story 2 above).

This introduced an inconsistency in the pipeline: the sparse reconstruction (used to initialize both NeRF and 3DGS training) ran on the GB10, but the dense point cloud was computed elsewhere. I wasn’t able to test whether PatchMatch produces different results on different GPUs, but it’s something I’d want to check.

Not having a real “ground truth” shaped the entire project. I don’t have access to a LiDAR scanner. Some phone apps (Polycam, Scaniverse) can use the iPhone’s LiDAR sensor to supplement the capture. That sensor is a low-resolution dToF sensor with limited range. Using it as “ground truth” would lead to measuring agreement with a noisy reference instead of geometric accuracy.

Without a proper ground truth, COLMAP dense was the best available option. It’s computed with established and well-documented algorithms. Image quality metrics (where the input images are ground truth) become more important. Using multiple geometric metrics instead of one is no longer optional.

A rigorous evaluation protocol for geometric accuracy would look something like:

  • LiDAR scan as the primary geometric reference. A true sensor-based measurement, independent of any photogrammetric reconstruction
  • ICP alignment between the LiDAR reference and each method’s output. This handles coordinate system differences
  • Per-region metrics instead of scene-wide totals, to show where each method has trouble, for example on textureless walls, detailed surfaces, or shiny objects
  • Multiple scenes to separate method-level trends from scene-specific artifacts

This project uses an approximation of that protocol.
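The ICP step in the protocol above can be sketched in a few lines: match each point to its nearest neighbor, solve for the best rigid transform with the Kabsch (SVD) method, apply it, and repeat. This is a minimal point-to-point version with brute-force matching, for illustration only; Open3D’s registration_icp is the practical tool. One caveat matters here: COLMAP (and therefore nerfstudio) reconstructions are only defined up to scale, so aligning them to a metric LiDAR scan also needs a scale estimate (for example, Umeyama’s similarity transform), which rigid ICP alone does not provide.

```python
import numpy as np


def kabsch(src, dst):
    """Rotation r and translation t minimizing sum ||r @ src_i + t - dst_i||^2."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(h)
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0  # avoid reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return r, dst_c - r @ src_c


def icp(source, target, iterations=50, tol=1e-12):
    """Point-to-point ICP; returns accumulated rotation, translation, aligned points."""
    r_total, t_total = np.eye(3), np.zeros(3)
    current = source.copy()
    prev_err = np.inf
    for _ in range(iterations):
        # Brute-force nearest neighbours (use a KD-tree for real scans)
        d2 = ((current[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
        nearest = target[d2.argmin(axis=1)]
        err = float(np.sqrt(d2.min(axis=1)).mean())
        if prev_err - err < tol:  # stop once the mean residual stops improving
            break
        prev_err = err
        r, t = kabsch(current, nearest)
        current = current @ r.T + t
        r_total, t_total = r @ r_total, r @ t_total + t
    return r_total, t_total, current
```

ICP only converges from a reasonable starting pose, so a coarse manual or feature-based alignment normally comes first.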

The NeRF Cleanup Problem

Nerfacto produced a large number of points outside the scene bounds. It’s a known issue with NeRF methods that model the entire volume, including empty space. The exported point cloud required a lot of manual cropping and filtering. Only then was it ready for comparison with the COLMAP reference.

This is not a cosmetic issue. For any workflow that goes from NeRF to mesh (e.g., physics simulation, CAD integration, or 3D printing), cleanup adds significant effort. 3DGS uses explicit point-based representations. This leads to cleaner exports and fewer stray points.
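The cropping and filtering applied to the nerfacto export can be sketched as an axis-aligned crop followed by statistical outlier removal. The functions below are illustrative numpy versions; the box bounds and the k and std_ratio values are hypothetical and scene-dependent. Open3D provides the same operations at scale through PointCloud.crop and PointCloud.remove_statistical_outlier.

```python
import numpy as np


def crop_to_box(points, lo, hi):
    """Keep only points inside the axis-aligned box [lo, hi]."""
    lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    return points[inside]


def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is more
    than std_ratio standard deviations above the cloud-wide average."""
    dists = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
    dists.sort(axis=1)
    mean_knn = dists[:, 1 : k + 1].mean(axis=1)  # column 0 is each point's zero self-distance
    keep = mean_knn <= mean_knn.mean() + std_ratio * mean_knn.std()
    return points[keep]
```

The crop removes the floaters far outside the room, and the statistical filter catches isolated points that survive inside the box. The brute-force distance matrix is only practical for small clouds.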

Future Improvements

Better capture strategy The current capture set was ad-hoc. A more systematic approach would reduce variability and improve image quality. Structured grid positions, controlled lighting, and defined overlap ratios reduce the number of unusable images. More uniform captures also make it easier to investigate the relationship between input views and reconstruction quality.

HDR / exposure bracketing The single-exposure capture was the biggest limitation of the input data. Using image-match for post-processing helped to normalize exposure, but information that was never recorded by the sensor can’t be recovered. Exposure bracketing at each camera position would overcome that limitation. Merging the brackets into HDR images preserves detail across the entire luminance range, so areas like the blown-out skylight and dark interior corners would keep usable detail. With HDR, model training is no longer limited by clipped image data.
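A simplified bracket merge shows why bracketing recovers what a single exposure clips. The sketch follows the weighted-average idea from Debevec and Malik’s HDR radiance-map work, assuming linear pixel values in [0, 1] (for example, from RAW files) and known exposure times; JPEGs would first need their camera response curve inverted. The triangle weight trusts mid-tones and gives values near 0 or 1 almost no say.

```python
import numpy as np


def hat_weight(z):
    """Trust mid-tones most; near-black and near-clipped pixels get ~0 weight."""
    return 1.0 - np.abs(2.0 * z - 1.0)


def merge_exposures(images, exposure_times, eps=1e-8):
    """Weighted radiance estimate from linear images in [0, 1]."""
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(images[0], dtype=np.float64)
    for z, t in zip(images, exposure_times):
        w = hat_weight(z)
        num += w * z / t  # each exposure's radiance estimate, weighted by trust
        den += w
    return num / (den + eps)
```

In a two-shot example where a bright pixel clips in the long exposure, the short exposure supplies its radiance on its own. A pixel that is clipped in every bracket still gets no valid estimate (the sketch returns roughly zero there), so the bracket range has to span the brightest highlight.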

Newer methods The field moves fast, and new methods appear constantly: 2D Gaussian Splatting, PatchNeRF, and various 3DGS extensions (anti-aliasing, triangle primitive variants). Re-running the same evaluation pipeline with newer methods would be straightforward.

Outdoor scenes Indoor scenes have specific challenges (textureless walls, complex lighting, specular surfaces). Outdoor scenes present different challenges (sky modeling, varying illumination, scale). Using the pipeline on outdoor environments would test if the relative ranking of methods holds across domains.

LiDAR ground truth COLMAP dense reconstruction is still only an approximation. LiDAR scanning is the most accurate method currently available.

Explicit vs Implicit 3D Representations: A Practitioner’s Perspective

My background is in traditional 3D: polygon meshes, NURBS surfaces, explicit geometry that you can inspect vertex-by-vertex. I’ve also worked with differentiable mesh rendering via PyTorch3D. This project was my first serious work with implicit and hybrid representations, and the contrast is worth discussing.

The representation spectrum. These methods sit at different points on a spectrum:

| Method | Representation | Geometry Access | Editability |
|---|---|---|---|
| Traditional mesh | Explicit triangles | Direct | Full |
| Differentiable mesh (PyTorch3D) | Explicit triangles, optimized via gradients | Direct | Full (post-optimization) |
| 3D Gaussian Splatting | Semi-explicit (positioned primitives) | Extractable | Limited |
| NeRF | Implicit (neural density/color field) | Requires extraction (marching cubes) | Very limited |

What implicit methods gain NeRF’s strength is that it doesn’t commit to a surface representation during training. The network learns a continuous volumetric function. That means it can represent fuzzy boundaries, semi-transparent objects, and view-dependent effects naturally. You don’t need to decide the mesh topology upfront.

What implicit methods lose The geometry is locked inside the network weights. Extracting a mesh means evaluating the density field on a grid and using marching cubes or Poisson reconstruction. Both introduce discretization artifacts and require choosing threshold parameters. The “extra points outside the scene” problem I encountered with nerfacto is a direct consequence. The density field doesn’t have a clean boundary, so extraction always requires cleanup.

Where 3DGS sits Gaussian Splatting is an interesting middle ground. Each Gaussian is an explicit primitive with a position, covariance, color, and opacity. You can enumerate, filter, and export them. But they’re not a mesh. Converting to a triangle mesh still requires surface reconstruction. Gaussians don’t inherently define a surface normal or connectivity. It’s explicit, but not traditional 3D.

Practical takeaway For applications that need a mesh (game engines, CAD, 3D printing, physics simulation), going from NeRF to geometry is simpler than from 3DGS. Current research shows that exporting usable meshes from Gaussians is still hard. The path from differentiable mesh optimization is the shortest and most direct. You start and end with triangles. The trade-off is that differentiable methods need a good initial mesh and struggle with topology changes. NeRF and 3DGS can reconstruct scenes from scratch.

From Research to Production: What I’d Do Differently on Day One

This project was scoped only as a research comparison, but working through the pipeline exposed where the gaps would be when scaling beyond a single scene or a single engineer. If I were doing this again, I’d prioritize four areas.

Capture protocol, not capture art The circular walkthrough worked this one time, but it wasn’t auditable. Production systems need repeatable capture protocols covering:

  • Camera positions and overlap requirements
  • Lighting specifications for specular and reflective surfaces
  • Sensor resolution matched to target output quality

Automated quality gates Input data needs validation before training. Flag images with motion blur or exposure clipping. Verify sufficient overlap. Check COLMAP registration quality (reprojection error, number of registered images, point cloud density). Catching bad data early saves more time than model optimization and avoids costly re-captures.
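Two of those checks are cheap to sketch per image: variance of the Laplacian as a blur heuristic and the share of clipped pixels. The thresholds below are hypothetical placeholders that would need tuning per camera and scene, and the input is assumed to be a grayscale float image in [0, 1]. Checks on COLMAP registration quality would sit alongside these but depend on parsing its model files, which is left out here.

```python
import numpy as np


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; low values suggest a blurry image."""
    lap = (gray[1:-1, :-2] + gray[1:-1, 2:] + gray[:-2, 1:-1] + gray[2:, 1:-1]
           - 4.0 * gray[1:-1, 1:-1])
    return float(lap.var())


def clipped_fraction(gray, high=0.99, low=0.01):
    """Share of pixels at or beyond either end of the [0, 1] range."""
    return float(np.mean((gray >= high) | (gray <= low)))


def quality_flags(gray, blur_threshold=1e-4, clip_threshold=0.05):
    """Return a list of issues for one image; thresholds are hypothetical."""
    flags = []
    if laplacian_variance(gray) < blur_threshold:
        flags.append("blurry")
    if clipped_fraction(gray) > clip_threshold:
        flags.append("clipped")
    return flags
```

Both checks run in milliseconds per image, so they can gate a capture set before COLMAP ever sees it.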

Evaluation as CI, not as a post-hoc step The recon-bench tool works for manual comparison. The next step is having it run automatically after every training job. The tooling is already scriptable and CLI driven.
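A CI step mostly needs a rule for what counts as a regression. The sketch below compares a run’s metrics against a stored baseline, with the direction of “better” declared per metric. The metric names and the single shared tolerance are hypothetical simplifications, not recon-bench’s schema; a real gate would likely use per-metric tolerances.

```python
# True when a higher value is better; names are illustrative, not recon-bench's schema
HIGHER_IS_BETTER = {
    "psnr": True, "ssim": True, "fscore": True,
    "lpips": False, "chamfer": False, "hausdorff": False,
}


def regressions(current, baseline, tolerance=0.0):
    """Return {metric: (baseline, current)} for metrics that got worse by more than tolerance."""
    worse = {}
    for name, base in baseline.items():
        if name not in current or name not in HIGHER_IS_BETTER:
            continue  # skip metrics missing from this run or with an unknown direction
        improvement = current[name] - base
        if not HIGHER_IS_BETTER[name]:
            improvement = -improvement
        if improvement < -tolerance:
            worse[name] = (base, current[name])
    return worse
```

In a CI job, exiting with a non-zero status whenever regressions(...) returns a non-empty result would fail the build on a quality drop.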

Method selection as an engineering decision A production system shouldn’t hard-code the best methods. The evaluation pipeline itself is the product. It lets you make decisions with data instead of intuition.