.. This file is part of the OpenDSA eTextbook project. See
.. http://opendsa.org for more details.
.. Copyright (c) 2012-2020 by the OpenDSA Project Contributors, and
.. distributed under an MIT open source license.

.. avmetadata::
   :author: Cliff Shaffer

=========
Project 3
=========

Project 3: External Sort
------------------------

.. revealjs-slide::

* You will implement something similar to the External Sort shown in
  OpenDSA Module 9.6.

  * Important simplification: You should use a standard Heapsort
    instead of Replacement Selection.
  * This makes the runs be a consistent size, which makes it much
    easier to implement the multiway merge phase.

* The records consiste of two 32-bit fields, each storing integer
  values between 1 and 2,000,000,000.

  * The first field is the key value (what you sort on) and the
    second is data.

* Note that the act of sorting modifies the input file.


Sorting the File
----------------

.. revealjs-slide::

* You use a "working memory" of 50,000 bytes.

  * You need to handle files that are much larger than 50,000 bytes
    in size.

  * Files are guaranteed to be multiples of 4096 bytes.
      
* You read data from the file into the working memory, and out again
  to a second file as you need to.

  * You can't put records from the file anywhere else.
    
  * You may use a small number (< 5) of integer variables to store
    values temporarily while working.

* You should use ``RandomAccessFile`` to interact with the file.

* You should use ``ByteBuffer`` and ``IntBuffer`` to interact with the
  working memory.


Managing Your Memory
--------------------

.. revealjs-slide::

* Within these constraints, you can manage the data within the memory
  pool any way that you like to do the sort.

  * You can decide where to read data to, and where to write data from
    (within the working memory).

  * You can decide the relative sizes of the heap and the output buffer
    (within the working memory) during the Heapsort phase.

  * You can decide the size and number of the buffers for the multiway
    merge phases (they are all actually different places in working
    memory).

  * Read about buffer pools.
    You don't want to actually implement your memory as a standard
    buffer pool, but many of the ideas apply.

* While you **may NOT** put data from the file anywhere other than the
  working memory, you **may** have other data structures (arrays or
  lists, for example) to keep track of what information is where in
  the working memory.

  * Example: Track the current position in each run.



Development Stages
------------------

.. revealjs-slide::

Suggested development stages:

#. Get it to work on files that fit into the memory pool (only one run
   is produced by the Heapsort phase). This is Milestone 1.

#. Get it to work for one multiway merge pass (the number of runs
   produced during the Heapsort phase is no more than the number of
   run buffers available for the multiway merge phase).

#. Get it to work for more than one multiway merge phase.


Project 3 Special Considerations
--------------------------------

This project is pretty different from the other three this semester.

* Three week lifecycle.

* No formal testing requirement.

* Only one milestone.

* Empirical performance is being checked.

  * There is a massive time difference between reading 4096 bytes at
    once vs. a series of 1024 integer (4-byte) reads.

  * If you don't get that right, then Web-CAT reference tests will
    time out.
