Driven by hardware constraints, today's computing landscape is more heterogeneous than ever before. Accelerators differ in their memory hierarchies and even contain their own mini-accelerators in the form of specialized hardware units (e.g., the tensor/matrix cores found in modern GPUs). To use such systems effectively, scientists must reformulate their applications beyond single-loop optimizations and compiler directives. In this talk, we will discuss local (i.e., peephole) optimizations, global program analysis through data-centric representations, and how orchestrating data movement can capture the optimization space needed to achieve portability for large scientific applications. As a case study, we will review the FV3 climate model and the porting adaptations that were necessary to run its full dynamical core on GPU supercomputers for the first time.