AudioCraft relies on what Meta calls the "EnCodec neural audio codec," which processes audio as discrete tokens, much the way chatbots like ChatGPT or Bard tokenize text. Judging by the samples Meta has shared so far, you can dictate the kind of tones you want and the sound sources (a musical instrument, or anything else from a bird to a bus) to generate a sound clip from a text prompt.
Here is a sample text prompt: "Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves." It produces a 30-second clip that actually doesn't sound half bad, as you can hear in Meta's blog post. As convenient as that sounds, you won't get the granular control over the output that a real instrument in your hands or a professional synth would give you.
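For the curious, Meta publishes AudioCraft as an open-source Python package (`audiocraft`), and a prompt like the one above maps onto a few lines of its MusicGen API. A minimal sketch, assuming `pip install audiocraft`, its PyTorch dependencies, and enough hardware to run the model; the checkpoint name and output path are illustrative:

```python
def generate_clip(prompt: str, duration: int = 30, out_stem: str = "clip") -> str:
    """Generate a music clip from a text prompt with Meta's MusicGen.

    Sketch only: requires the `audiocraft` package; model weights are
    downloaded on first use, and generation is slow without a GPU.
    """
    # Lazy imports so the heavy dependency is only needed at call time.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained("facebook/musicgen-small")
    model.set_generation_params(duration=duration)  # clip length in seconds
    wav = model.generate([prompt])                  # batch of one prompt -> audio tensor
    # Write a loudness-normalized WAV file named <out_stem>.wav.
    audio_write(out_stem, wav[0].cpu(), model.sample_rate, strategy="loudness")
    return out_stem + ".wav"


if __name__ == "__main__":
    prompt = ("Earthy tones, environmentally conscious, ukulele-infused, "
              "harmonic, breezy, easygoing, organic instrumentation, gentle grooves")
    print(generate_clip(prompt))
```

Note that the prompt is the only creative control here, which illustrates the article's point: you steer the mood with adjectives, not notes.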
MusicGen, which Meta claims was "specifically tailored for music generation," was trained on roughly 400,000 recordings with metadata, amounting to 20,000 hours of music. Once again, though, the diversity of the training data is a problem, and Meta acknowledges as much. The training dataset is predominantly Western-style music, with the accompanying audio-text pairs written in English. Put simply, you'll have better luck generating a country-inspired tune than a Persian folk melody. Improving that diversity, it seems, is one of the key goals behind pushing the project into the open-source world.