1. In a digital medium environment to create multimodal image editing requests, a method implemented by a computing device, the method comprising:
displaying a pair of images including a first image and a second image that includes an edit of the first image;
presenting an option to skip the pair of images, the option including that the first image and the second image are too similar;
recording a multimodal image editing request that includes a user gesture and a voice command that describe the edit of the first image;
receiving a user transcription of the voice command;
generating a data object in a searchable format, the data object including the voice command, the user gesture, and the user transcription; and
training a neural network to recognize the multimodal image editing requests using the data object and the first image as inputs of the neural network and the second image as an output of the neural network.
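
By way of illustration only, and without limiting the claim, the data object in a searchable format recited above might be sketched as follows. The claim does not specify a serialization, so JSON is an assumption here, and every name in the sketch (`EditRequestRecord`, `to_searchable`, `search_by_transcription`, the field names, and the file paths) is hypothetical rather than taken from the source.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class EditRequestRecord:
    """Hypothetical container for one recorded multimodal editing request."""
    first_image: str                   # identifier of the unedited image
    second_image: str                  # identifier of the edited image
    voice_command: str                 # path to the recorded audio (assumed storage)
    user_gesture: List[List[float]]    # gesture trace as [x, y] points (assumed encoding)
    user_transcription: str            # text the user typed for the voice command


def to_searchable(record: EditRequestRecord) -> str:
    """Serialize the record to JSON, one possible searchable format."""
    return json.dumps(asdict(record), sort_keys=True)


def search_by_transcription(records: List[EditRequestRecord],
                            keyword: str) -> List[EditRequestRecord]:
    """Return records whose transcription contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [r for r in records if needle in r.user_transcription.lower()]


example = EditRequestRecord(
    first_image="pair_001_a.png",
    second_image="pair_001_b.png",
    voice_command="pair_001.wav",
    user_gesture=[[0.10, 0.20], [0.35, 0.22], [0.60, 0.25]],
    user_transcription="make the sky bluer",
)
serialized = to_searchable(example)
```

In this sketch the transcription is what makes a record text-searchable, while the first image and the request fields would serve as training inputs and the second image as the training target, mirroring the final step of the claim.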