Well, Rust has a lot of string flavors, and I like utf-8 being the norm, but there are a bunch of cases where enforcing utf-8 is a nuisance, so getting string features without the aggro enforcement is nice.
There’s probably some fruity way to make this a security issue, but I care about ASCII printables and nothing else. It’s a nice trade-off: the technical parts are en-US UTF-8, the rest is very liberal.
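To make that concrete, here is a rough std-only sketch of the idea: the "technical" part (a key) has to be ASCII printables, and the rest is passed through as raw bytes with no UTF-8 check. The function names and the `key=value` layout are made up for illustration; they are not from any particular crate.

```rust
/// Returns true if the byte is an ASCII printable (space through tilde).
fn is_ascii_printable(b: u8) -> bool {
    b == b' ' || b.is_ascii_graphic()
}

/// Splits a `key=value` line on the first `=` without requiring valid UTF-8.
/// The key must be non-empty ASCII printables; the value may be any bytes.
/// (Hypothetical helper, only here to illustrate the trade-off.)
fn split_key_value(line: &[u8]) -> Option<(&str, &[u8])> {
    let eq = line.iter().position(|&b| b == b'=')?;
    let (key, rest) = line.split_at(eq);
    if key.is_empty() || !key.iter().all(|&b| is_ascii_printable(b)) {
        return None;
    }
    // ASCII is always valid UTF-8, so this conversion cannot fail here.
    let key = std::str::from_utf8(key).ok()?;
    // `rest` starts with the `=` we found, so skip that one byte.
    Some((key, &rest[1..]))
}

fn main() {
    // The value contains 0xFF, which is never valid in UTF-8.
    let line: &[u8] = b"name=caf\xFF";
    let (key, value) = split_key_value(line).expect("well-formed line");
    // Lossy conversion is only for display; the raw bytes are untouched.
    println!("{key} -> {}", String::from_utf8_lossy(value));
}
```

The strict check only applies to the part you actually parse; everything else stays as `&[u8]` and only gets a lossy conversion when you want to print it.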
Sounds reasonable, but a lot of recent advances come from being able to let the machine train against itself, or a twin / opponent without human involvement.
As an example of just running the thing itself, consider a neural network with a narrow layer in the middle, given the objective of re-creating its own input (an autoencoder). The bottleneck forces it to learn a more compact description of the feature space (e.g. age/sex/race/facing left or right/whatever).
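A toy version of that narrow-layer setup, as a sketch only: a linear network with 2 inputs, 1 unit in the middle, and 2 outputs, trained by plain gradient descent to reproduce its input. The data (points on the line y = 2x), the starting weights, the learning rate and all the names are assumptions picked so the example stays small; real autoencoders use nonlinear layers and a proper ML library.

```rust
/// Tiny linear autoencoder: 2 inputs -> 1 bottleneck unit -> 2 outputs.
struct AutoEncoder {
    enc: [f64; 2], // encoder weights (input -> bottleneck)
    dec: [f64; 2], // decoder weights (bottleneck -> output)
}

impl AutoEncoder {
    fn encode(&self, x: [f64; 2]) -> f64 {
        self.enc[0] * x[0] + self.enc[1] * x[1]
    }

    fn decode(&self, z: f64) -> [f64; 2] {
        [self.dec[0] * z, self.dec[1] * z]
    }

    /// Mean squared reconstruction error over the data set.
    fn loss(&self, data: &[[f64; 2]]) -> f64 {
        let total: f64 = data
            .iter()
            .map(|&x| {
                let y = self.decode(self.encode(x));
                (y[0] - x[0]).powi(2) + (y[1] - x[1]).powi(2)
            })
            .sum();
        total / data.len() as f64
    }

    /// One full-batch gradient descent step on the reconstruction error.
    fn step(&mut self, data: &[[f64; 2]], lr: f64) {
        let mut g_enc = [0.0_f64; 2];
        let mut g_dec = [0.0_f64; 2];
        for &x in data {
            let z = self.encode(x);
            let y = self.decode(z);
            let err = [y[0] - x[0], y[1] - x[1]];
            // Chain rule back through the decoder to the bottleneck value.
            let dz = 2.0 * (err[0] * self.dec[0] + err[1] * self.dec[1]);
            for i in 0..2 {
                g_dec[i] += 2.0 * err[i] * z;
                g_enc[i] += dz * x[i];
            }
        }
        let n = data.len() as f64;
        for i in 0..2 {
            self.dec[i] -= lr * g_dec[i] / n;
            self.enc[i] -= lr * g_enc[i] / n;
        }
    }
}

/// Trains on points along y = 2x and returns (loss before, loss after).
fn train(steps: usize) -> (f64, f64) {
    let data: Vec<[f64; 2]> = [-1.0, -0.5, 0.0, 0.5, 1.0]
        .iter()
        .map(|&t| [t, 2.0 * t])
        .collect();
    let mut net = AutoEncoder { enc: [0.1, 0.1], dec: [0.1, 0.1] };
    let before = net.loss(&data);
    for _ in 0..steps {
        net.step(&data, 0.02);
    }
    (before, net.loss(&data))
}

fn main() {
    let (before, after) = train(2000);
    println!("reconstruction error: {before:.4} -> {after:.2e}");
}
```

Nobody labels anything here: the only training signal is the input itself, and the single middle value ends up encoding "how far along the line is this point", which is the one feature this data has.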
Another is a GAN, where you run a faker (the generator) against a spot-the-fake (the discriminator) until both get good.