A developer is building an application that needs to answer questions about product images uploaded by users. They have experience with text-only LLMs and ask: "How does GPT-4V 'see' an image? Does it describe the image with another model first and then pass the description to the LLM?" What is the actual mechanism by which vision-language models like GPT-4V process image inputs?
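
In short: no captioning model is involved. GPT-4V's exact architecture is not public, but openly documented vision-language models (LLaVA, BLIP-2, Flamingo) all follow the same basic mechanism: a vision encoder (typically a ViT, often CLIP-pretrained) turns the image into a grid of patch embeddings, a small learned projection maps those embeddings into the LLM's token-embedding space, and the result is concatenated with the text-token embeddings so the transformer attends over image "tokens" and text tokens in one sequence. The sketch below is a minimal, assumption-laden PyTorch illustration of that LLaVA-style pipeline; all module names and dimensions (`VisionToTokens`, `vision_dim=1024`, `llm_dim=4096`, 14-pixel patches) are illustrative choices, not GPT-4V's actual values, and the single `Conv2d` patchifier stands in for a full vision encoder.

```python
import torch
import torch.nn as nn

class VisionToTokens(nn.Module):
    """Turn an image into a sequence of embeddings an LLM can consume."""

    def __init__(self, patch=14, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # 1) ViT-style patch embedding: split the image into patches and
        #    linearly project each one (a stand-in for a full CLIP encoder).
        self.patchify = nn.Conv2d(3, vision_dim, kernel_size=patch, stride=patch)
        # 2) The "connector": a projection from vision-feature space into
        #    the LLM's token-embedding space.
        self.project = nn.Linear(vision_dim, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> (batch, n_patches, vision_dim)
        feats = self.patchify(image).flatten(2).transpose(1, 2)
        # -> (batch, n_patches, llm_dim): each patch is now a "soft token"
        return self.project(feats)

vision = VisionToTokens()
image = torch.randn(1, 3, 224, 224)                 # one RGB image
image_tokens = vision(image)                        # (1, 256, 4096)
text_tokens = torch.randn(1, 12, 4096)              # embedded prompt (stand-in)
# The LLM sees one unified sequence; its self-attention lets text tokens
# attend directly to image patches, with no intermediate text description.
llm_input = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 268, 4096)
print(llm_input.shape)
```

The key design point for the developer: because the image enters as continuous embeddings rather than a generated caption, no visual detail is lost to an intermediate text bottleneck, which is why such models can answer fine-grained questions (reading small text, comparing colors, counting objects) that a caption-then-ask pipeline would miss.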