
Improving Giant Neural Network Performance with Innovative Parallel Programming Techniques

ABG-133483 Thesis topic
2025-09-19 Cifre
Huawei Technologies France SASU
Ile-de-France, France
  • Computer science
parallel programming, complexity and performance, distributed neural network, LLM

Topic description

  • Scientific Context and Motivation

The unprecedented success of Artificial Intelligence in recent years is largely due to the colossal number of parameters processed by deep neural network (DNN) models: from ResNet-50 with 26 million parameters, through BERT-Large with 340 million and GPT-3 with 175 billion, to today's DeepSeek with 671 billion parameters and GPT-4 with around a trillion.
Meanwhile, many AI accelerators and clusters have been proposed to accelerate these neural networks; three examples are Google's TPU Cloud, Nvidia's DGX GPU Pod, and Huawei's Atlas NPU Cluster. These machines consist of computing nodes connected via a dedicated (high-bandwidth, specific-topology) network, where each computing node is itself a set of processors and accelerators, and each accelerator is composed of multiple (possibly heterogeneous) cores dedicated to DNN workloads.
Reducing the cost of DNN computing by leveraging those AI machines with parallel programming is challenging. Data Parallelism (DP), which partitions the input along the batch dimension while replicating the model, Tensor Parallelism [KSH12] (TP), which partitions the parameters and tensors of the model, and Pipeline Parallelism [HCB+19] (PP), which partitions the whole model among its layers, were proposed as the three fundamental parallelisms of deep learning.
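
As an illustration (not part of the original topic description), the following minimal NumPy sketch shows how a single linear layer y = xW could be partitioned under each of the three fundamental parallelisms; the shapes, device counts, and layer stack are assumptions chosen only for readability.

    import numpy as np

    # Assumed shapes: one linear layer y = x @ W, batch B=8, input dim D=4,
    # output dim F=6, split across 2 "devices".
    B, D, F = 8, 4, 6
    x = np.random.randn(B, D)
    W = np.random.randn(D, F)
    y_ref = x @ W

    # Data Parallelism: each device holds the full W and half of the batch.
    x0, x1 = np.split(x, 2, axis=0)
    y_dp = np.concatenate([x0 @ W, x1 @ W], axis=0)

    # Tensor Parallelism: each device holds half of W's output columns;
    # partial results are concatenated (an all-gather in a real system).
    W0, W1 = np.split(W, 2, axis=1)
    y_tp = np.concatenate([x @ W0, x @ W1], axis=1)

    # Pipeline Parallelism: with several layers, device 0 runs the first
    # layers and streams activations to device 1, which runs the rest.
    layers = [np.random.randn(F if i else D, F) for i in range(4)]
    h = x
    for W_i in layers[:2]:   # stage 0
        h = h @ W_i
    for W_i in layers[2:]:   # stage 1 (receives h over the interconnect)
        h = h @ W_i

    assert np.allclose(y_dp, y_ref) and np.allclose(y_tp, y_ref)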
Later on, more parallelisms were proposed. Optimizer parallelism [RRRH20] alters DP by distributing the parameters, effectively reducing the memory footprint, but it requires gathering parameters before use. Expert parallelism [LLX+20], together with Mixture-of-Experts (MoE) models such as GPT-4 and Mixtral, executes different sub-models in parallel, inflating the parameter count to obtain better model quality. Sequence parallelism [LXB+21] is a specific case of TP that handles very long sequences by partitioning along the sequence dimension. 1F1B scheduling [NHP+19], interleaving [SPP+19], and graph pipeline parallelism [JWC+24] were also proposed to improve PP.
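
To make one of these improvements concrete, here is a minimal sketch (again an illustration, not the schedulers of the cited works) that enumerates the per-stage step order of a 1F1B pipeline schedule: warm-up forwards, a steady one-forward-one-backward phase, then a cool-down of remaining backwards. Stage and micro-batch counts are assumed values; a real implementation would also overlap communication with computation.

    def one_f_one_b_schedule(num_stages: int, num_microbatches: int, stage: int):
        """Ordered list of ('F'/'B', microbatch) steps for one stage under 1F1B."""
        warmup = min(num_microbatches, num_stages - stage - 1)
        steps, fwd, bwd = [], 0, 0
        for _ in range(warmup):              # warm-up: forwards only
            steps.append(("F", fwd)); fwd += 1
        while fwd < num_microbatches:        # steady state: alternate F and B
            steps.append(("F", fwd)); fwd += 1
            steps.append(("B", bwd)); bwd += 1
        while bwd < num_microbatches:        # cool-down: drain remaining backwards
            steps.append(("B", bwd)); bwd += 1
        return steps

    # Example: 4 pipeline stages, 8 micro-batches (assumed values).
    for s in range(4):
        print(f"stage {s}: {one_f_one_b_schedule(4, 8, s)}")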
However, the Model FLOPs Utilization (MFU) on the above AI clusters is still very low, typically around 30-40%, and it can drop to 10% in some special cases. This means that more than half of the computing power of an AI cluster is wasted, with both economic and ecological consequences: large DNNs remain expensive both in money and in CO2 emissions.
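
To make the metric concrete, MFU can be estimated as the useful model FLOP/s divided by the cluster's peak FLOP/s; the sketch below uses the common ~6 FLOPs-per-parameter-per-token approximation for dense Transformer training. All numbers are illustrative assumptions, not measurements of the systems cited above.

    def estimate_mfu(params: float, tokens_per_s: float,
                     num_devices: int, peak_flops_per_device: float) -> float:
        """Model FLOPs Utilization: useful model FLOP/s over peak FLOP/s.
        Uses the standard ~6 * params FLOPs-per-token approximation for the
        forward plus backward pass of dense Transformer training."""
        model_flops_per_s = 6.0 * params * tokens_per_s
        peak_flops = num_devices * peak_flops_per_device
        return model_flops_per_s / peak_flops

    # Assumed numbers: a 175e9-parameter model processing 1e5 tokens/s
    # on 1024 accelerators with ~3e14 peak FLOP/s each -> roughly 34% MFU.
    print(f"MFU ~ {estimate_mfu(175e9, 1.0e5, 1024, 3e14):.1%}")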

  • Technology Fields & Objectives

Different parallelisms have their pros and cons. To reach a good MFU, one cannot simply choose one parallelism but must combine several of them. The size of this combination depends on the number of devices used, and the bigger the model, the more devices are required. In 2018, GPT-2 [RWC+19] was trained with about 1 billion parameters on a dozen GPUs; in 2023, GPT-4 [AAA+23], with around a trillion parameters, required a cluster of about 25,000 GPUs. This increase in both parallelism dimensions and device count makes parallelism configuration a difficult problem today, and an even harder one tomorrow.
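
The size of the configuration space can be illustrated with a small sketch that enumerates hybrid (DP, TP, PP) degrees whose product matches the device count; real planners must additionally choose micro-batch sizes, sequence/expert/optimizer parallelism degrees, and device placement, so the true space is far larger. The device counts are arbitrary assumptions.

    from itertools import product

    def hybrid_configs(num_devices: int):
        """Enumerate (dp, tp, pp) degrees whose product equals num_devices."""
        divisors = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
        return [(dp, tp, pp)
                for dp, tp, pp in product(divisors, repeat=3)
                if dp * tp * pp == num_devices]

    # Even with only three parallelism dimensions the space grows quickly
    # (assumed device counts; adding more dimensions multiplies it further).
    for n in (16, 1024, 16384):
        print(f"{n:>6} devices: {len(hybrid_configs(n))} (dp, tp, pp) combinations")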
Many approaches [JLQA18, LYL+17, ZLZ+22] have tackled this complex problem; however, none seems satisfactory to us, for two reasons: 1) none covers all the parallelism dimensions introduced above; 2) most rely on profiling. Comprehensive profiling would require (a) spending as much time profiling as running, which is prohibitive. Moreover, using it exhaustively as a search method implies (b) profiling each of the N candidate configurations, i.e. running a program N times in order to optimize a single execution. This is intolerable, and it is usually addressed by profiling only a small representative subset of the computation for (a) and by pruning the search space with heuristics or Monte Carlo-like methods for (b). Although these techniques effectively reduce the time spent, they inevitably introduce an imprecision that grows with the time saved.
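
A profiling-free alternative, loosely in the spirit of the analytical modeling of [WTL+22] but greatly simplified here, is to rank configurations from hardware peak numbers instead of measurements; the formula, derating factor, and input values below are assumptions for illustration only.

    def estimated_step_time(flops: float, comm_bytes: float,
                            peak_flops: float, bandwidth: float,
                            efficiency: float = 0.4) -> float:
        """Rough analytical cost of one training step for a given parallel
        configuration: compute time at a derated peak plus (non-overlapped)
        communication time. All inputs are per-device estimates."""
        compute_s = flops / (peak_flops * efficiency)
        comm_s = comm_bytes / bandwidth
        return compute_s + comm_s

    # Compare two hypothetical configurations of the same model without profiling:
    # config A shards tensors more aggressively (less compute, more communication).
    a = estimated_step_time(flops=2.0e15, comm_bytes=8.0e9,
                            peak_flops=3e14, bandwidth=2e11)
    b = estimated_step_time(flops=4.0e15, comm_bytes=2.0e9,
                            peak_flops=3e14, bandwidth=2e11)
    print(f"config A: {a:.2f}s  config B: {b:.2f}s  -> pick {'A' if a < b else 'B'}")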
In our previous work [WLT+21], we proposed a technique to systematically combine DP and TP for medium-size DNNs such as BERT. With the results of [WTL+22], we demonstrated that further performance optimizations exist for bigger DNNs. We then optimized PP and achieved an end-to-end speedup of 1.2x on a cluster of 16,000 NPUs, as the world's first very-large-scale, industrial-level automatic solution [WLT+25]. However, leveraging all the different parallelisms to systematically generate a high-performance execution of gigantic DNNs on a massively parallel cluster with 10,000+ accelerators remains a major challenge today. Our first objective is therefore to study and propose a high-performance solution that generates an optimized hybrid parallelism strategy for very large-scale AI models and clusters.
Moreover, with the arrival of MoE models, multi-modal models, and large reinforcement-learning models, parallel techniques must be reinvented: their architectures are more complex than the linear stack of Transformer layers found in large language models (LLMs). In addition, newer AI cluster hardware is becoming more complex, such as the Nvidia Grace Hopper Superchip and the Huawei Atlas SuperPod, which provide very-high-bandwidth connections to form large-scale computing nodes. Meanwhile, many parallel techniques are designed around a specific DNN model architecture. For instance, in the Expert Parallelism Load Balancer (EPLB) of DeepSeek-V3 [ea25], a redundant-experts architecture and a group-limited expert-routing technique were introduced together to improve the accuracy and performance of the model. Our second objective is to innovate in DNN acceleration techniques by considering both the model architecture and the hardware architecture.
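
As a concrete illustration of this model/hardware co-design, the sketch below implements a simplified group-limited top-k expert routing: experts are arranged in groups (e.g. one group per node), only a few groups are considered per token, and the final top-k experts are chosen inside those groups. This is a much-simplified reading of the routing described for DeepSeek-V3; all sizes are assumed, and it omits redundant experts, capacity limits, and the gating network itself.

    import numpy as np

    def group_limited_topk(scores: np.ndarray, num_groups: int,
                           topk_groups: int, topk_experts: int) -> np.ndarray:
        """Pick top-k experts per token, restricted to the best-scoring groups.
        scores: (tokens, experts) gating scores; experts are split evenly
        into num_groups contiguous groups (e.g. one group per node)."""
        tokens, experts = scores.shape
        grouped = scores.reshape(tokens, num_groups, experts // num_groups)
        # Rank groups by their best expert score; keep only topk_groups of them.
        group_rank = np.argsort(grouped.max(axis=2), axis=1)[:, ::-1][:, :topk_groups]
        mask = np.zeros((tokens, num_groups), dtype=bool)
        np.put_along_axis(mask, group_rank, True, axis=1)
        masked = np.where(mask[:, :, None], grouped, -np.inf).reshape(tokens, experts)
        # Final top-k experts among the surviving groups.
        return np.argsort(masked, axis=1)[:, ::-1][:, :topk_experts]

    # Example with assumed sizes: 4 tokens, 32 experts in 8 groups,
    # route each token to 4 experts drawn from its 2 best groups.
    routing = group_limited_topk(np.random.randn(4, 32), 8, 2, 4)
    print(routing)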

  • References

[AAA+23] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[ea25] DeepSeek-AI et al. DeepSeek-V3 technical report, 2025.

[HCB+19] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019.

[JLQA18] Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, pages 2279-2288, 2018.

[JWC+24] Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, et al. GraphPipe: Improving performance and scalability of DNN training with graph pipeline parallelism. arXiv preprint arXiv:2406.17145, 2024.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.

[LLX+20] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

[LXB+21] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. arXiv preprint arXiv:2105.13120, 2021.

[LYL+17] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 553-564. IEEE, 2017.

[NHP+19] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 1-15, New York, NY, USA, 2019. Association for Computing Machinery.

[RRRH20] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. IEEE, 2020.

[RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[SPP+19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[WLT+21] Haoran Wang, Chong Li, Thibaut Tachon, Hongxing Wang, Sheng Yang, Sébastien Limet, and Sophie Robert. Efficient and systematic partitioning of large and deep neural networks for parallelization. In Euro-Par 2021: Parallel Processing, 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1-3, 2021, Proceedings, pages 201-216, Berlin, Heidelberg, 2021. Springer-Verlag.

[WLT+25] Ruiwen Wang, Chong Li, Thibaut Tachon, Raja Appuswamy, and Teng Su. BMPipe: Bubble-memory co-optimization strategy planner for very-large DNN training. In 2025 IEEE International Conference on Cluster Computing, to appear, 2025.

[WTL+22] Haoran Wang, Thibaut Tachon, Chong Li, Sophie Robert, and Sébastien Limet. SMSG: Profiling-free parallelism modeling for distributed training of DNN. International Journal of Parallel Programming, 51(2-3):109-127, December 2022.

[ZLZ+22] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559-578, 2022.

Funding category

Cifre

Funding further details

Presentation of host institution and host laboratory

Huawei Technologies France SASU

Through our dedication to customer-centric innovation and strong partnerships, we have established end-to-end capabilities and strengths across the carrier networks, enterprise, consumer, and cloud computing fields. We are committed to creating maximum value for telecom carriers, enterprises, and consumers by providing competitive ICT solutions and services. Our products and solutions, ranging from processors and servers to mobile phones, have been deployed in over 170 countries, serving more than one third of the world's population.

In 2014, Huawei launched the France Research Center, which focuses on mathematics and algorithmic science and counts more than 80 researchers. Its Distributed and Parallel Software Lab gathers a high-talent team in Paris that develops algorithms and software for massively parallel big-data applications, high-performance AI and machine learning, and heterogeneous, distributed, and cloud technologies.

PhD title

PhD in computer science (Doctorat en informatique)

Country where you obtained your PhD

France

Institution awarding doctoral degree

Sorbonne Université

Graduate school

Informatique, télécommunications et électronique de Paris

Candidate's profile

- Master's degree in fundamental computer science

- Strong skills in parallel programming, compilation, and complexity analysis

- Knowledge of computer architecture and systems

- Knowledge of PyTorch/TensorFlow and deep learning

- Having resided in France
