Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and that of off-chip memory bandwidth. Efficient use of memory bandwidth must therefore become a first-class design consideration if highly concurrent processing platforms such as FPGAs are to be fully utilized. In this paper, we present key aspects of this challenge in developing FPGA-based implementations of the two-dimensional fast Fourier transform (2D-FFT), where the large datasets must reside off-chip in DRAM. Our scalable implementations address the memory bandwidth bottleneck through both (1) algorithm design that enables efficient DRAM access patterns and (2) datapath design that extracts the maximum compute throughput for a given level of memory bandwidth. We present evaluations based on double-precision 2D-FFT up to size 2,048-by-2,048. On an Altera DE4 platform based on the Stratix IV EP4SGX530 FPGA, our implementation of the 2,048-by-2,048 2D-FFT achieves over 19.2 Gflop/s from the 12 GB/s maximum DRAM bandwidth available. The evaluations also show that our FPGA-based implementations of 2D-FFT are more efficient than 2D-FFT running on state-of-the-art CPUs and GPUs on two measures: the ratio of achieved performance to available memory bandwidth, and the ratio of achieved performance to power dissipation.
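To put the reported figures in context, the following back-of-the-envelope sketch estimates the throughput ceiling that DRAM bandwidth alone would impose on a 2,048-by-2,048 double-precision 2D-FFT. It rests on assumptions that are not stated in the abstract: the nominal 5 N log2 N flop count commonly used for FFT benchmarks, a two-pass (row, then column) schedule in which each pass reads and writes the whole array exactly once, 16-byte complex elements, and 12 GB/s taken as 12e9 bytes per second. The function names are hypothetical, and this is only one way to read the numbers, not necessarily the model used in the paper.

```python
import math


def fft2d_flops(n):
    """Nominal flop count for an n-by-n complex 2D-FFT.

    Assumption: the conventional 5 * N * log2(N) estimate with N = n * n.
    """
    points = n * n
    return 5 * points * math.log2(points)


def two_pass_dram_bytes(n, bytes_per_element=16):
    """DRAM traffic for a row pass plus a column pass.

    Assumption: each pass reads and writes every element exactly once,
    so the array crosses the DRAM interface four times in total.
    A double-precision complex element occupies 16 bytes.
    """
    return 4 * n * n * bytes_per_element


def bandwidth_bound_gflops(n, bandwidth_bytes_per_s):
    """Upper bound on Gflop/s if DRAM bandwidth is the only limit."""
    flops_per_byte = fft2d_flops(n) / two_pass_dram_bytes(n)
    return flops_per_byte * bandwidth_bytes_per_s / 1e9


if __name__ == "__main__":
    n = 2048
    bandwidth = 12e9  # assumption: 12 GB/s interpreted as 12e9 bytes/s
    print("dataset size (MiB):", n * n * 16 / 2**20)
    print("bandwidth-bound Gflop/s:", bandwidth_bound_gflops(n, bandwidth))
    print("reported Gflop/s per GB/s:", 19.2 / 12)
```

Under these assumptions the ceiling works out to 20.625 Gflop/s, so the reported figure of over 19.2 Gflop/s would sit close to what the available bandwidth permits. A different flop convention or access schedule would shift the bound.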