After surveying large-scale AI and HPC workloads, we have identified a set of principles and best practices that are widely used to partition large sparse tensors and graphs for parallel execution. We present these in the context of a roadmap for standardizing the offload of both data and compute from costly, power-hungry memory and compute tiers to near-data processing (NDP) devices. Our work on standardizing partitioning operators will improve the portability and performance of the programming interface and protocol between AI framework backends, which emit large tensor operations, and the most power-efficient computational memory and storage devices, which implement processing elements as low as the subarray level and can handle only smaller chunks of data.