Training neural networks takes a lot of time, even with the fastest and costliest accelerators on the market. It’s perhaps no surprise, then, that a number of startups are looking at how to speed things up at the software level and remove some of the current bottlenecks. Strong Compute, a Sydney, Australia-based startup that was recently accepted into Y Combinator’s Winter ’22 class, is focused on removing these inefficiencies. By doing so, the team argues, it can speed up the training process by 100x or more.
“PyTorch is beautiful and so is TensorFlow. These toolkits are amazing, but the simplicity they have — and the ease of implementation they have — comes at the cost of things being inefficient under the hood,” said Strong Compute CEO and founder Ben Sand, who previously co-founded AR company Meta (before Facebook used that name).
While some companies focus on optimizing the models themselves, and Strong Compute will also do that if its customers request it, Sand noted that this may compromise the results. What the team focuses on instead is everything around the model. That may mean fixing a slow data pipeline or pre-computing many of the values before training begins. Sand also noted that the company has optimized some of the often-used libraries for data augmentation.
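To make the pre-computation idea concrete, here is a minimal Python sketch. It is not Strong Compute’s code, and the article does not describe the company’s actual implementation. The names `expensive_preprocess`, `NaiveDataset`, `PrecomputedDataset` and `run_epochs` are hypothetical, and the sketch assumes the preprocessing step is deterministic, so its results can safely be computed once and reused in every epoch.

```python
import math


def expensive_preprocess(sample):
    """Hypothetical stand-in for a costly, deterministic transform."""
    return [math.sqrt(value) for value in sample]


class NaiveDataset:
    """Recomputes the transform every time a sample is requested."""

    def __init__(self, raw):
        self.raw = raw
        self.transform_calls = 0

    def __len__(self):
        return len(self.raw)

    def __getitem__(self, index):
        self.transform_calls += 1
        return expensive_preprocess(self.raw[index])


class PrecomputedDataset:
    """Runs the transform once per sample, before training starts."""

    def __init__(self, raw):
        self.cache = [expensive_preprocess(sample) for sample in raw]
        self.transform_calls = len(raw)

    def __len__(self):
        return len(self.cache)

    def __getitem__(self, index):
        # Lookup only: no transform work happens inside the training loop.
        return self.cache[index]


def run_epochs(dataset, epochs):
    """Toy 'training loop' that just reads every sample in every epoch."""
    total = 0.0
    for _ in range(epochs):
        for index in range(len(dataset)):
            total += sum(dataset[index])
    return total
```

In this toy setup, the naive dataset runs the transform once per sample per epoch, while the pre-computed one runs it once per sample in total, and both feed identical values to the loop. The trade-off is memory or disk space for the cached results, and the approach does not apply to steps that must be random on every pass, such as most data augmentation.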
The company also recently hired Richard Pruss, a former Cisco principal engineer, to focus on removing networking bottlenecks in the training pipeline, which can quickly add up to a lot of latency. Hardware can make a big difference as well, of course, so Strong Compute also works with its customers to run models on the right platform.
“Strong Compute took our core algorithm training from thirty hours to five minutes, training hundreds of terabytes of data,” said Miles Penn, the CEO of MTailor, which specializes in creating custom clothes for its online clients. “Deep learning engineers are probably the most precious resource on this planet, and Strong Compute has enabled ours to be 10x more productive. Iteration and experimentation time is the most important lever for ML productivity, and we were lost without Strong Compute.”
Sand argues that the large cloud providers don’t really have any incentive to do what his company does, given that their business model relies on people using their machines for as long as possible. Y Combinator managing director Michael Seibel agrees. “Strong Compute is aimed at a serious incentive misalignment in cloud computing, where faster results that are valued by clients are less profitable for providers,” Seibel said.
Currently, the team still provides white-glove service to its customers, though developers shouldn’t notice too much of a difference since integrating its optimizations should not really change their workflow. The promise Strong Compute makes here is that it can “10x your dev cycles.” Looking ahead, the idea is to automate as much of the process as possible.
“AI companies can keep their focus on their customer, data and core algorithm, which is where their core IP and value lies, leaving all the configuration and operations work to Strong Compute,” said Sand. “This not only gives them the rapid iteration they need for success, it critically makes sure that their developers are only focused on work that is adding value for the company. Today they are spending up to two-thirds of their time on complex system administration work ‘ML Ops,’ which is largely generic across AI companies and often outside their area of expertise — it makes no sense for that to be in house.”
Bonus: Here’s a video of our own Lucas Matney trying out the Meta 2 AR headset from Sand’s last company back in 2016.