We use CLEVER to evaluate several state-of-the-art LLMs prompted in a few-shot manner and show that they can solve at most 1/161 end-to-end verified code generation problems, establishing CLEVER as a challenging frontier benchmark for program synthesis and formal reasoning. In summary, our contributions include: 1.
Membership inference and memorization are key challenges for diffusion models. Mitigating such vulnerabilities is hence an important topic. The idea of using an ensemble of models is clever.