Twelve Labs is building AI models that can understand videos at a deep level.

Image Credits: boonchai wedmakawand / Getty Images

Text-generating AI is one thing. But AI models that also understand images and video could open up a whole new set of possibilities.

Take, for example, Twelve Labs. The San Francisco-based startup trains AI models to, as co-founder and CEO Jae Lee puts it, solve hard video-language alignment problems.

“Twelve Labs was founded to build the infrastructure for multimodal video understanding, with the first endeavor being semantic search, or ‘CTRL+F for videos,’” Lee told TechCrunch in an email interview. “The vision of Twelve Labs is to help developers create programs that can see, hear and understand the world as we do.”

Twelve Labs’ models attempt to map natural language to what is happening inside a video, including actions, objects and background sounds, allowing developers to build apps that can search within videos, classify scenes, extract topics, automatically summarize clips and split them into chapters, and more.
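To make the “CTRL+F for videos” idea concrete, here is a minimal sketch of what a natural-language video search request against such a platform could look like. The base URL, endpoint, field names and response shape below are illustrative assumptions, not Twelve Labs’ documented API.

# Hypothetical sketch of semantic video search against a video-understanding API.
# The base URL, route and response keys are assumptions for illustration only;
# consult the vendor's documentation for the real interface.
import requests

API_KEY = "your-api-key"                          # placeholder credential
BASE_URL = "https://api.example-video-ai.com/v1"  # hypothetical endpoint

def search_videos(index_id: str, query: str, limit: int = 5) -> list:
    """Return clips whose visual and audio content matches a natural-language query."""
    resp = requests.post(
        f"{BASE_URL}/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"index_id": index_id, "query": query, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # Assume each hit carries a video id, a relevance score and a time range.
    return resp.json().get("results", [])

if __name__ == "__main__":
    for hit in search_videos("my-index", "goalkeeper saves a penalty kick"):
        print(hit["video_id"], hit["score"], hit["start"], hit["end"])

In this sketch a single text query is matched against everything the model indexed from the footage, visuals, audio and speech alike, which is what distinguishes semantic search from plain metadata or transcript search.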

According to Lee, Twelve Labs’ technology can drive use cases such as ad insertion and content moderation, for instance distinguishing whether a video showing a knife is violent or instructional. It can also be used for media analytics, he adds, and to automatically generate highlight reels, blog post headlines and tags from videos.

During my conversation with Lee, I asked him about the potential for bias in these models, given that it is well-established that models amplify the biases in the data on which they are trained. For example, training a video-understanding model mostly on clips of local news, which often covers crime in a sensationalized, racialized way, could cause the model to pick up racist and sexist patterns along the way.


Lee says Twelve Labs strives to meet internal bias and “fairness” metrics for its models before releasing them, and that the company plans to publish model-ethics-related benchmarks and datasets at some point. Beyond that, he had nothing to share.

“As for how our product differs from large language models such as ChatGPT, ours is specifically trained and built to process and understand video, holistically integrating visual, audio and speech components,” Lee said.

Google is developing a similar multimodal model, MUM, which it is using to power video recommendations across Google Search and YouTube. Beyond MUM, Google, along with Microsoft and Amazon, offers API-based, AI-powered services that recognize objects, places and actions in videos and extract rich metadata at the frame level.

But Lee argues that Twelve Labs is differentiated both by the quality of its models and by the platform’s fine-tuning features, which let customers refine the platform’s base model with their own data for more “domain-specific” video analysis.

Mock-up of the API for fine-tuning the model to work better with salad-related content.
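For a sense of what the fine-tuning workflow described above might involve, the hedged sketch below uploads labeled domain clips and then starts a tuning job against the base model. Every endpoint, dataset name and field here is hypothetical and is used only to illustrate the shape of such a flow.

# Hypothetical sketch of fine-tuning a platform's base video model on your own
# labeled clips for domain-specific analysis. Routes and payloads are invented
# for illustration; the real workflow may differ substantially.
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example-video-ai.com/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def upload_clip(path: str, label: str) -> str:
    """Upload one labeled training clip; returns a clip id (assumed response field)."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/datasets/salad-demo/clips",
            headers=HEADERS,
            files={"video": f},
            data={"label": label},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["clip_id"]

def start_fine_tune(dataset: str, base_model: str = "base-video-model") -> str:
    """Launch a fine-tuning job on the uploaded dataset; returns a job id to poll."""
    resp = requests.post(
        f"{BASE_URL}/fine-tunes",
        headers=HEADERS,
        json={"dataset": dataset, "base_model": base_model},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]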

On the model front, Twelve Labs is today unveiling Pegasus-1, a new multimodal model that understands a range of prompts related to whole-video analysis. Pegasus-1 can, for example, be prompted to generate a long, descriptive report about a video or just a few highlights with timestamps.
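As an illustration of the kind of prompting Lee describes, the sketch below sends a whole-video report prompt and a highlights prompt to a hypothetical generation endpoint. The route, parameters and response keys are assumptions, not the documented Pegasus-1 interface.

# Hypothetical sketch of prompting a video-language model for whole-video analysis.
# Endpoint and response keys are assumptions for illustration only.
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example-video-ai.com/v1"

def generate(video_id: str, prompt: str) -> str:
    """Ask the model to reason over an already-indexed video and return text."""
    resp = requests.post(
        f"{BASE_URL}/generate",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"video_id": video_id, "prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]

if __name__ == "__main__":
    # A long, descriptive report about the full video...
    print(generate("vid_123", "Write a detailed report describing this video."))
    # ...or just a few key moments with timestamps.
    print(generate("vid_123", "List the three most important moments with timestamps."))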

According to Lee, “Most business use cases require understanding video material in far more complex ways than the limited and simplistic capabilities of conventional video AI models can provide. The only way enterprise organizations can attain human-level understanding without analyzing video material manually is to leverage powerful multimodal video-understanding models.”

Since launching in private beta in early May, Twelve Labs has grown to more than 17,000 registered developers, Lee claims. The company is working with an undisclosed number of organizations across sectors including sports, media, e-learning, entertainment and security, and recently established a partnership with the NFL.

Twelve Labs is also still raising money, an essential ingredient for any startup. The company says it has just closed a $10 million strategic funding round from Nvidia, Intel and Samsung Next, bringing its total raised to $27 million.

According to Lee, the new investment is aimed at strategic partners that can accelerate the company in research (compute), product and distribution. “It is fuel for continued innovation in video understanding, based on our lab’s research, which allows us to bring the most powerful models to customers, whatever their use cases may be. We are pushing the industry forward at a time when enterprises are doing amazing things,” he says.
