Towards performance and cost-efficiency for data-intensive applications in distributed data processing systems