Review of Automated Evals

#ai #review #discovery #draft

# Overview

https://www.youtube.com/watch?v=N-qAOv_PNPc

https://maven.com/parlance-labs/evals?promoCode=evals-info-url

Teresa Torres built an app to automate feedback on discovery calls:
* https://www.producttalk.org/2020/03/interview-customers-together/
* https://www.producttalk.org/2021/08/product-discovery/

## Major takeaways & concepts

* Evals, failure modes
* Tools: Airtable for eval tracking, Jupyter notebooks for eval analysis
* Used Jupyter notebooks to quickly evaluate responses because they made responses easy to compare. Really embraced an engineering mindset. Rolled her own eval tool.
* Judges
* Traces
* Context engineering

# Failure modes

The automated coach takes in a transcript from a customer discovery call and gives you feedback. She ran into a couple of failure modes in its recommendations:
* Leading questions: where the question implies the answer
* General questions: "tell me about your morning routine" => "tell me about your morning routine **this morning**"

![[Pasted image 20250828113406.png]]

# Coding

Uses Claude Code inside of VS Code extensively, but doesn't "vibe code": she doesn't simply let Claude write code blindly.

Teresa:
* "I'm really good at describing what I want."
* "I just demand things from Claude Code."
* "Every time I give Claude Code a longer leash, it goes wrong for me."
* "I'm really terrified of ending up with a product that I can't maintain myself."

![[Pasted image 20250828114810.png]]

Has transitioned to using a notebook for analysis, visualization, and individual transcript details.

![[Pasted image 20250828115335.png]]

# Toward Production

Teresa plans to integrate the automated coach into Vistaly: https://www.vistaly.com/

![[Pasted image 20250828115937.png]]
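
As a rough illustration of the notebook-style eval analysis these notes describe, here is a minimal sketch of tallying how often each failure mode shows up across labeled traces. It is not Teresa's actual tool: the record layout, the `failure_rates` helper, the label names, and the data are all hypothetical, made up only to show the shape of the analysis.

```python
from collections import Counter

# Hypothetical labeled traces: one record per coach response, listing the
# failure modes a reviewer (or a judge) flagged. The data is made up.
labeled_traces = [
    {"trace_id": "t1", "failure_modes": ["leading_question"]},
    {"trace_id": "t2", "failure_modes": []},
    {"trace_id": "t3", "failure_modes": ["general_question", "leading_question"]},
    {"trace_id": "t4", "failure_modes": ["general_question"]},
]


def failure_rates(traces):
    """Return the fraction of traces flagged with each failure mode."""
    if not traces:
        return {}
    counts = Counter()
    for trace in traces:
        # Count each mode at most once per trace.
        counts.update(set(trace["failure_modes"]))
    return {mode: n / len(traces) for mode, n in counts.items()}


rates = failure_rates(labeled_traces)
for mode, rate in sorted(rates.items()):
    print(f"{mode}: {rate:.0%}")
```

In a notebook, the same per-mode rates could be recomputed after each prompt change to see whether a given failure mode is getting rarer.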