We are customer-centric and provide customized or one-stop full-stack solutions to empower all industries.
Drawing on its strong R&D capabilities and deep industry experience, PowerLeader provides one-stop solutions for machine/deep learning research, hands-on training, and teaching scenarios. Built around a modern design, its fully featured PLStack artificial intelligence management platform provides end-to-end process management across the complete AI life cycle, covering data annotation, algorithm development, model training, model management, model services, and more.
The PLStack platform is built on lightweight container virtualization, pooling GPU, CPU, memory, storage, and other infrastructure resources across multiple clusters and nodes. Orchestration and scheduling tools customized on top of Kubernetes enable efficient, flexible resource scheduling, while an enterprise-grade design gives the platform rich capabilities such as multi-tenant, multi-level user management, permission management, resource management, and vGPU support, fully meeting users' requirements for high availability, reliability, and stability in an AI development platform. By greatly easing the bottleneck of deep learning training, the platform unlocks new AI capabilities so that users are no longer deterred by the high cost of GPUs.
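To make the container-based pooling and scheduling concrete, the sketch below submits a GPU training pod with the official Kubernetes Python client. The pod name, image, namespace, and resource quantities are illustrative assumptions, not PLStack's actual configuration.

    from kubernetes import client, config

    # Assumes a reachable cluster with the NVIDIA device plugin installed;
    # all names and quantities below are illustrative.
    config.load_kube_config()
    core = client.CoreV1Api()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="train-demo", labels={"tenant": "team-a"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="trainer",
                image="pytorch/pytorch:latest",  # a framework image from the registry
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1", "cpu": "4", "memory": "16Gi"},
                ),
            )],
        ),
    )
    # The scheduler places the pod on a node with a free GPU; the quota of
    # the "team-a" namespace (assumed to already exist) caps total consumption.
    core.create_namespaced_pod(namespace="team-a", body=pod)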
The platform offers users a simple web interface with rich functionality and a diverse toolset. For example, the development module provides one-click environment creation and the online interactive development tool MLab; model training provides parameter tuning, distributed parallel training, and more; model services provide online model deployment, inference, and model service calls; and the platform also integrates data annotation tools, an image registry, and other components to deliver one-stop AI development.
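As an illustration of a model service call, the hedged sketch below posts an input to a deployed model over HTTP. The endpoint URL and payload schema are hypothetical placeholders, since PLStack's actual serving API is not documented here.

    import requests

    # Hypothetical endpoint and schema; substitute the URL the platform
    # shows after a model is deployed as a service.
    ENDPOINT = "http://plstack.example.com/api/v1/models/demo-model/predict"

    payload = {"instances": [[0.1, 0.2, 0.3]]}  # model-specific input format
    resp = requests.post(ENDPOINT, json=payload, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # e.g. {"predictions": [...]}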
At the same time, deep learning framework images plug into the system as modules, integrating a variety of frameworks commonly used in the industry, such as TensorFlow, PyTorch, Caffe, and MXNet, and supporting custom extensions, which greatly improves the scalability and maintainability of the overall system.
The PLStack AI platform is divided into three layers: infrastructure layer, resource scheduling layer, and platform function layer. The architecture is as follows:
The infrastructure layer mainly comprises physical machines, virtual machines, storage devices, network equipment, all-in-one appliances, and other resources that provide the underlying computing power for the business.
The resource scheduling layer uses the Docker engine for lightweight virtualization of CPU, GPU, memory, storage, and other resources, and implements flexible scheduling of tasks and resources through customized Kubernetes development, providing multi-tenant isolation and logical isolation of task resources. Combined with functional components such as highly reliable storage services and distributed parallel training services, it lays a solid foundation for the upper-level business function modules.
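One plausible way the multi-tenant isolation described above maps onto Kubernetes is a namespace per tenant plus a ResourceQuota, as in the sketch below (official Python client; all names and limits are assumptions rather than documented PLStack internals).

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # One namespace per tenant gives logical isolation of tasks and resources
    # (an assumed design, not a documented PLStack internal).
    core.create_namespace(client.V1Namespace(
        metadata=client.V1ObjectMeta(name="tenant-a")))

    # Cap the tenant's aggregate GPU, CPU, and memory consumption.
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="tenant-a-quota"),
        spec=client.V1ResourceQuotaSpec(hard={
            "requests.nvidia.com/gpu": "8",
            "limits.cpu": "64",
            "limits.memory": "256Gi",
        }),
    )
    core.create_namespaced_resource_quota(namespace="tenant-a", body=quota)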
The platform function layer provides end-to-end support for the AI research workflow. The user side includes the development module, AI frameworks, the training module, data management, model services, the image registry, work-order management, and more; the management side includes metering and billing, multi-tenant management, alarm and monitoring settings, platform operation and maintenance, and more.
The PLStack platform consists of two systems: the management side and the business side. The management side is the platform administrator's view and includes seven modules: resource overview, business management, product management, operation and maintenance management, configuration management, financial management, and security center. The business side is the general user's view and includes eight modules: account center, resource overview, development environment, model training, storage management, model service, model management, and image registry. As shown in the figure below.
1. Provides container and image management, supports full life-cycle management of containers from the web interface, and efficiently manages, schedules, and monitors heterogeneous resources;
2. Provides multi-data-center management: users can choose which data center's resources to use. It also offers a three-level organizational structure (administrators, organization administrators, and members) and sets resource quotas for organizations and users;
3. Integrates a variety of deep learning frameworks (such as TensorFlow, PyTorch, Caffe, and Keras) for model development and training, and supports custom framework extensions;
4. Offers multiple billing modes with complete metering and billing functions: administrators can set prices for GPU, CPU, memory, and other resources through the billing module and charge users by usage time (see the billing sketch after this list);
5. Supports unified management and allocation of GPU card resources across multiple physical regions; supports single-machine single-card, single-machine multi-card, multi-machine multi-card, and shared single-GPU multi-user allocation modes; allocates computing resources per task and reclaims them when the task completes;
6. Supports monitoring the operating status and resource usage of GPU servers and GPU cards in the cluster, including total/used GPU counts, average GPU core utilization, average GPU memory utilization, and more;
7. Supports setting up deep learning environments on demand, including the deep learning framework, network model, and GPU/CPU resources; destroys the running environment after training completes to release computing resources; supports rapid creation of deep learning environments, with applications and hardware resources isolated from one another and running independently;
8. Ships with hundreds of optimized AI algorithms built in to cover a variety of business scenarios, lowering the barrier to entry and improving AI development efficiency;
9. Provides efficient, collaborative web-based AI model development tools, integrates JupyterLab and Jupyter Notebook, and supports bringing AI-related data into Jupyter;
10. Supports submitting training tasks via the web interface or the shell, and lets users view the running results of their tasks in real time;
11. Supports visual job management, version management, and task cloning (parameter management); based on saved parameters, new tasks can be created quickly, improving the iteration efficiency of model training;
12. Supports multi-version parameter tuning based on common AI frameworks and prebuilt algorithms, optimizing and enhancing full life-cycle management of machine learning;
13. Supports distributed parallel training across the cluster (see the sketch after this list); the number of GPUs and nodes required can be requested dynamically, and the platform monitors the status of each node in real time;
14. Provides a local image registry with image group management and sharing; users can upload custom images, package their environments into the registry with one click, and set images as public or private;
15. Supports unified management of multiple model versions: all models produced by training can be managed centrally, locally developed models can be imported, and continuous model iteration and debugging are supported;
16. Provides storage functions such as SCP access, web access, capacity expansion, renaming, password changes, shared storage, and deletion; datasets uploaded by an individual user can then be accessed simultaneously by multiple users.
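Item 4's usage-time billing reduces to simple arithmetic: charge = unit price × quantity × duration, summed over resource types. A minimal sketch with hypothetical prices (the real rates are whatever the administrator configures in the billing module):

    from dataclasses import dataclass

    # Hypothetical per-hour unit prices; actual rates are set by the
    # administrator in the billing module.
    PRICE_PER_HOUR = {"gpu": 2.50, "cpu": 0.05, "memory_gb": 0.01}

    @dataclass
    class Usage:
        gpus: int
        cpus: int
        memory_gb: int
        hours: float

    def bill(usage: Usage) -> float:
        """Charge = sum over resources of quantity * unit price * duration."""
        return usage.hours * (
            usage.gpus * PRICE_PER_HOUR["gpu"]
            + usage.cpus * PRICE_PER_HOUR["cpu"]
            + usage.memory_gb * PRICE_PER_HOUR["memory_gb"]
        )

    # e.g. a 4-GPU, 16-vCPU, 64 GB job that ran for 10 hours
    print(bill(Usage(gpus=4, cpus=16, memory_gb=64, hours=10)))  # ≈ 114.4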
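For item 13, the worker-side pattern for multi-machine multi-card training with PyTorch (one of the integrated frameworks) typically looks like the sketch below; the platform would provision the nodes and launch one process per GPU, for example via torchrun, which sets the rank environment variables. The model and data here are toy placeholders.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # A launcher such as torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE
        # for every worker; placement is up to the cluster scheduler.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        device = f"cuda:{local_rank}"

        # Toy model; DDP keeps the replicas in sync across all workers.
        model = DDP(torch.nn.Linear(128, 10).to(device), device_ids=[local_rank])
        opt = torch.optim.SGD(model.parameters(), lr=0.01)

        for step in range(100):
            x = torch.randn(32, 128, device=device)  # placeholder batch
            loss = model(x).sum()
            opt.zero_grad()
            loss.backward()   # gradients are all-reduced across workers here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

An invocation such as torchrun --nnodes=2 --nproc_per_node=8 train.py on each node would start this sketch across two machines with eight GPUs apiece.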
By analyzing users' research directions and actual needs, PowerLeader's artificial intelligence cluster solution combines the PLStack AI management software platform with PowerLeader's latest generation of servers to create an AI cluster platform with strong computing power, high resource utilization, convenient management, and ultra-high security.
Heterogeneous hardware fusion and computing power optimization: supports a variety of CPU and GPU cards while integrating mainstream deep learning frameworks and interactive IDE development environments.
The AI platform, built with cloud computing technologies such as containers and Kubernetes and combined with a GPU cluster of formidable computing power, delivers high parallelism, high throughput, and low latency; in scientific computing, performance can be more than 50 times that of traditional architectures.
Elastic computing resources and computing power optimization management: vGPU technology, GPU sharing, multi-machine multi-card distributed parallel training, a multi-level organizational structure, and organization- and user-level resource quotas.
The platform comes pre-installed with multiple open-source deep learning frameworks such as TensorFlow, PyTorch, and Caffe, so training tasks can be submitted quickly with one click and no environment installation or configuration. At the same time, the interactive development tool MLab for data and model analysis supports adding code and datasets with one click.
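To picture what one-click submission might look like programmatically, here is a hypothetical REST call; the endpoint, field names, and values are placeholders invented for illustration, not PLStack's published API.

    import requests

    # Hypothetical job-submission endpoint and payload schema.
    API = "http://plstack.example.com/api/v1/training-jobs"

    job = {
        "name": "mnist-demo",
        "framework": "pytorch",          # one of the pre-installed framework images
        "command": "python train.py --epochs 10",
        "resources": {"gpus": 1, "cpus": 4, "memory": "16Gi"},
    }
    resp = requests.post(API, json=job, timeout=30)
    resp.raise_for_status()
    print("job id:", resp.json().get("id"))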