By prompting a text-to-image model we can generate images of a wide variety of objects. With clever prompting, it’s also possible to synthesize different perspectives of a specific object. To generate a 3D object from text, we start with a caption and a randomly sampled camera viewpoint, tweaking the prompt according to that viewpoint by appending a phrase such as “front view”, “top view”, or “side view”. We then render an (initially untrained) NeRF model from the same camera position and angle to produce an image of the object. Noise is added to this render, and our pre-trained text-to-image model, such as Imagen, denoises it while guided by the view-augmented caption, yielding a higher-quality version of the image. The difference between this improved image and the original render is then used as the training signal for the NeRF, so that the model learns only from the parts of the image the diffusion model improved. The process is repeated from many random viewpoints until the 3D model is satisfying enough, after which it can be exported as a mesh using the grid ray-cast + marching cubes procedure.
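
To make the loop above concrete, here is a minimal sketch of the training step in PyTorch. It is not the actual Imagen/NeRF implementation: `TinyNeRF`, `FakeDiffusionPrior`, the `view_prompt` rule, and the simplified noising schedule are all stand-in assumptions so the script runs end to end, but the structure (random camera, view-dependent prompt, render, noise, frozen-prior denoising, gradient pushed back into the NeRF) mirrors the process described above.

```python
import torch
import torch.nn as nn

def view_prompt(caption: str, elevation_deg: float) -> str:
    """Append a view-dependent suffix to the caption (hypothetical rule)."""
    if elevation_deg > 60:
        return f"{caption}, top view"
    return f"{caption}, front view" if elevation_deg < 30 else f"{caption}, side view"

class TinyNeRF(nn.Module):
    """Stand-in 'NeRF': maps a camera direction to a small RGB image."""
    def __init__(self, res: int = 16):
        super().__init__()
        self.res = res
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3 * res * res))

    def render(self, cam_dir: torch.Tensor) -> torch.Tensor:
        rgb = torch.sigmoid(self.net(cam_dir))
        return rgb.view(3, self.res, self.res)

class FakeDiffusionPrior(nn.Module):
    """Stand-in for the frozen text-to-image model's noise predictor."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def predict_noise(self, noisy_image: torch.Tensor, prompt: str, t: float) -> torch.Tensor:
        # A real prior would condition on the prompt embedding and timestep.
        return self.conv(noisy_image.unsqueeze(0)).squeeze(0)

nerf = TinyNeRF()
prior = FakeDiffusionPrior()
for p in prior.parameters():        # the text-to-image prior stays frozen
    p.requires_grad_(False)
opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)

caption = "a photo of a small wooden chair"
for step in range(1000):
    # 1. Sample a random camera and build the view-dependent prompt.
    cam_dir = torch.randn(3)
    elevation = float(torch.rand(()) * 90)
    prompt = view_prompt(caption, elevation)

    # 2. Render the current NeRF from that viewpoint.
    render = nerf.render(cam_dir)

    # 3. Add noise at a random strength (simplified schedule, an assumption).
    t = float(torch.rand(()).clamp(0.02, 0.98))
    noise = torch.randn_like(render)
    noisy = (1 - t) * render + t * noise

    # 4. Ask the frozen prior what noise it thinks was added, given the prompt.
    pred_noise = prior.predict_noise(noisy, prompt, t)

    # 5. Use the gap between predicted and added noise as the training signal:
    #    the gradient flows into the NeRF but skips the frozen prior.
    grad = (pred_noise - noise).detach()
    loss = (grad * render).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design choice is that the diffusion model is never updated: it only scores how the render could look better, and that signal alone trains the NeRF.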
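
The export step could plausibly look like the sketch below: evaluate the trained model's density on a regular 3D grid, then run marching cubes to extract a triangle mesh. The `query_density` helper is hypothetical, standing in for however the pipeline samples the trained NeRF (ray-cast or direct density query); here it returns a simple sphere so the script runs without a trained model.

```python
import numpy as np
from skimage import measure

def query_density(points: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: return a density value per 3D point."""
    # A sphere of radius 0.5, just so the example is self-contained.
    return (0.5 - np.linalg.norm(points, axis=-1)).clip(min=0.0)

# Build a regular grid over the scene bounds and evaluate densities.
n = 64
xs = np.linspace(-1, 1, n)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (n, n, n, 3)
density = query_density(grid.reshape(-1, 3)).reshape(n, n, n)

# Marching cubes turns the density field into a triangle mesh.
verts, faces, normals, _ = measure.marching_cubes(density, level=0.1)
print(f"mesh: {len(verts)} vertices, {len(faces)} faces")
```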