The idea here is that 10 frames from each of several batches should be representative of the different image characteristics the system encounters. If I save those frames just before the end of each batch, I can see how different types of images compare against the expected value produced by a benchtop analyzer. Because all 10 frames are captured at nearly the same point in the roast, their expected values should be very close to each other and to the post-roast measurement.
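A minimal sketch of what that capture step might look like, assuming frames arrive as numpy arrays from an existing acquisition loop; names like save_end_of_batch_frames, the on-disk layout, and the frame iterator are placeholders rather than the real system's API:

```python
# Sketch: keep a rolling buffer of the most recent frames and, once the batch
# ends, write the last 10 to disk next to the benchtop reference measurement.
from collections import deque
from pathlib import Path
import json

import numpy as np

FRAMES_TO_KEEP = 10  # the 10 frames saved just before the end of the batch


def save_end_of_batch_frames(frame_iter, batch_id, reference_value, out_dir="frames"):
    """Buffer incoming frames and persist the final FRAMES_TO_KEEP of the batch
    along with the expected value from the benchtop analyzer."""
    buffer = deque(maxlen=FRAMES_TO_KEEP)
    for frame in frame_iter:  # frames from the existing acquisition loop (assumed)
        buffer.append(frame)

    batch_dir = Path(out_dir) / f"batch_{batch_id}"
    batch_dir.mkdir(parents=True, exist_ok=True)
    for i, frame in enumerate(buffer):
        np.save(batch_dir / f"frame_{i:02d}.npy", frame)

    # Record the post-roast benchtop value so every frame in the batch can
    # later be compared against the same expected measurement.
    (batch_dir / "reference.json").write_text(
        json.dumps({"batch_id": batch_id, "expected_value": reference_value})
    )
```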
From there, I can look for an easy way to automatically classify the images that perform well (I have some thoughts on this, but I need a data set to test those ideas against), and check whether the worse-performing images share anything reliable that might point to a different approach to calculating the measurement and a better-quality result.
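One hedged sketch of how that data set could be scored once it exists, assuming a measure_frame() function that turns a saved frame into the same units as the benchtop value; the function name and the 5% tolerance are illustrative placeholders, not a description of the actual classification ideas:

```python
# Sketch: compare each saved frame's measurement to the batch's benchtop
# reference and split the frames into well- and poorly-performing sets.
from pathlib import Path
import json

import numpy as np


def score_batch(batch_dir, measure_frame, tolerance=0.05):
    """Return (good, bad) lists of (frame name, measured value, relative error)."""
    batch_dir = Path(batch_dir)
    expected = json.loads((batch_dir / "reference.json").read_text())["expected_value"]

    good, bad = [], []
    for frame_path in sorted(batch_dir.glob("frame_*.npy")):
        measured = measure_frame(np.load(frame_path))
        relative_error = abs(measured - expected) / expected
        (good if relative_error <= tolerance else bad).append(
            (frame_path.name, measured, relative_error)
        )
    return good, bad
```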