As memory accesses increasingly limit the overall performance of reconfigurable accelerators, it is desirable for high-level synthesis (HLS) flows to adopt a systematic and portable way to utilize the abundant distributed block RAMs and high aggregate memory bandwidth found in modern FPGA devices. Such an approach is especially critical for algorithmically complex, memory-intensive embedded applications that were originally designed and coded for general-purpose processors. Because these applications often contain complex and unpredictable memory access patterns whose dependencies cannot be determined statically, only limited opportunities for parallelizing memory accesses can be exposed. These difficulties motivated us to develop 1) a framework in which memory-level parallelism can be effectively discovered and provided to an HLS flow, and 2) a novel multi-accelerator/multi-cache architecture that effectively exploits the memory-level parallelism found in the binary traces of target applications. To make our study concrete, we implemented both a baseline platform and a prototype CPU+accelerator hybrid machine on a Xilinx Virtex-5 FPGA. Our experimental results show that, for 10 accelerators generated from 9 benchmark applications, circuits using our proposed memory structure achieve on average 51% better performance than accelerators using a traditional memory interface. We believe our study represents a solid advance toward memory-parallel embedded computing on hybrid CPU+FPGA platforms.