Not quite. How it does it is even cooler and more mind-blowingly insane. It generates the images from scratch based on what it's previously learned from billions of other images.
A slightly deeper description: the application (the "model") was previously trained on ginormous sets of images that have been annotated with text labels describing what's in them. A bunch of math is done during that training to "learn" which visual attributes correspond to which text. It's very abstract at this point - the learned stuff is stored as a mish-mash of overlapping/shared numbers. Those numbers can then be used (with a ton more math) to score how well an image matches some arbitrary text you give it. That's not the crazy thing.
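To make that concrete, here's a toy sketch of the scoring idea in Python. Everything in it (embed_image, embed_text, score, the random weights) is a made-up stand-in, not DALL-E's actual code - the real encoders are big neural networks trained on those billions of captioned images. It just shows the shape of the idea: images and text get mapped into one shared number space, and the "score" is how close they land to each other.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64                       # real models use hundreds of dimensions
IMG_SHAPE = (16, 16, 3)              # tiny "image" to keep the toy fast

# Stand-in "learned" weights; in a real model these come out of training.
W_IMG = rng.standard_normal((EMBED_DIM, int(np.prod(IMG_SHAPE))))

def embed_image(pixels):
    """Pretend image encoder: project pixels into the shared embedding space."""
    v = W_IMG @ pixels.ravel()
    return v / np.linalg.norm(v)

def embed_text(prompt):
    """Pretend text encoder: map a caption into the same space."""
    text_rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = text_rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def score(pixels, prompt):
    """Cosine similarity: how well does this image 'match' this text?"""
    return float(embed_image(pixels) @ embed_text(prompt))
```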
The crazy thing is that when DALL-E starts creating an image from the text you give it, it begins with a random smattering of pixels - basically random noise. The score for that noise is very low (noise looks nothing like "Land Cruiser 200 on a boulder"). It then adjusts the noise, step by step, in whatever way increases its score. Eventually the pixels settle into patterns/groupings that score well, and you end up with the image it created.
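Continuing the same toy sketch, the generation loop looks roughly like this: start from pure noise and keep any small random tweak that raises the score. (The real system uses learned denoising/diffusion steps rather than blind trial and error, so treat this as the flavor of the idea, not the actual algorithm.)

```python
# Continuing the sketch above: nudge random noise toward a higher score.
prompt = "Land Cruiser 200 on a boulder"
pixels = rng.standard_normal(IMG_SHAPE)            # start: pure noise
best = score(pixels, prompt)                       # low - noise matches nothing

for step in range(2000):
    tweak = 0.05 * rng.standard_normal(IMG_SHAPE)  # small random change
    candidate = pixels + tweak
    s = score(candidate, prompt)
    if s > best:                                   # keep only changes that score better
        pixels, best = candidate, s

print(f"score after refinement: {best:.3f}")       # noticeably higher than the start
```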
You can be as descriptive and strange as you want in the text you give it, and it will try to create an image that it thinks matches the text. You can describe almost unlimited objects, locations, styles, etc. It will create something different each time, even if you give it the same text.
For example, here's what it generated when I gave it the absurd prompt "A macro close-up of a land cruiser driving across a coin".
[Attached image: DALL-E's result for that prompt]