And if you want a real headache, just take a look at the comments on the Lance routine and the analysis by Paulo Simoes.
The Lance routine is incredibly optimised. For example, replay frequencies are not calculated directly: each one is built by combining a small set of fixed read speeds, so the player can use pre-generated code with fixed increments.
Here is the analysis I found:
; Hacking Lance
; by Paulo Simoes February 2013
; 1 Introduction
; The only purpose of this "hacking" is to try to find out a way to have
; a 25 KHz replay at a better % CPU than the existing 50 KHz version.
; I was informed about this routine by Leonard from Oxygene in 2004 but
; as i took a superficial look, i found out that the tricks used were
; really specific and could not be ported to my core program used in
; Hextracker and YM50K that i built in 1991 and 1992.
; So the main hacking was done these last few days.
; The text that follows reflects that hacking and my experience with this
; Soundtracker business.
; 2 The Soundtracker challenge
; Soundtracker music was in the old days one of the main arguments for
; Amiga owners to nag the ST owners.
; Let's face it. The Paula Amiga soundchip is really powerful.
; So let's see what is the best it can do.
; It uses the Amiga master clock to read the samples in a controllable
; rate by means of a divider that can be set from $000 to $3FF.
; Loops and end of sample data are controlled by the HW.
; Those 8 bit signed samples will be then volumed by a 64 volume register
; which means a signed multiplication giving a 14 bit result.
; Those 4 14-bit values, one per real digital voice, will be mixed into
; two 15 bit values and sent at around 28 KHz(A500) to two DACs that will
; produce the Left and Right stereo analog signals.
; Finally, a low pass filter can be activated to reduce noise sent to the
; output.
; All this stuff is done by hardware with plenty of DMA channels to read
; from the memory costing almost 0 to the CPU ...
; And what does the Atari ST have to do this kind of job ?
; Well for the pre-STE models, known mostly as STFs, the Atari ST has
; nothing except SW and an old YM2149 soundchip.
; The Atari STE has a DMA that can read from memory 1 sample in mono or 2
; interleaved samples in stereo that will be sent to 8 bit DACs to
; produce the mono analog signal or the Left and Right stereo analog
; signals at 6.25 KHz, 12.5 KHz, 25 KHz and 50 KHz.
; It is then easy to understand that SW would have to play an important
; part in porting the Soundtracker music to the Atari ST, including the
; Atari STE.
; 3 Splitting the challenge in small parts
; One of the keys to solve any big problem is to divide it in smaller
; problems without losing sight of the main picture.
; The same applies here.
; Let's start from the end to the beginning with the Paula soundchip
; At the end, we have low pass filters.
; As we have no hardware to do that on Atari ST and as the cost to do
; this in terms of SW is terribly high thinking about KHz rates, this is
; the first feature to be dropped as we assume the ST will not do that.
; Before that, we have the 15 bit DACs in stereo or the DAC in mono.
; Well, on STF, we have no DAC, so we emulate one by using combinations
; of 4 bit volume levels (registers 8, 9 and 10) on the YM2149
; soundchip with or without tones active (register 7) (Quartet method
; and ST Replay method). The quality of the table will define the quality
; of the DAC emulation. That depends on the number of YM2149 voices used
; (1, 2 or 3) and on the selected combinations for each corresponding
; digital level. Stereo is impossible on STF with the base HW so we drop
; the stereo case for STF.
; On STE, we have 8 bit DACs so we should use them and we can do stereo.
; Paula sends data at around 28 KHz(A500) to the DACs.
; The STF has no DMA to send data to its emulated DAC. So that part has
; to be done by SW. Interrupts is the most common solution used to read
; the mixed data and send it via the DAC emulation table to the YM2149.
; One can also do it in a timed way, updating the YM2149 every XXX cycles
; but that is not compatible with better CPUs or better clock speeds.
; The STE has a DMA to do this job. We just have to store the mixed data
; in the way the DMA wants it to be read and sent to the DACs. In case
; of mono this means a single buffer with the set of 8 bit values to be
; sent to the DAC. In case of stereo, we should have a single buffer with
; interleaved data: 8 bit for the Left DAC followed by 8 bit to the Right
; DAC followed by Left, Right, Left and so on ...
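; A minimal sketch of the interleaved stereo layout described above,
; written in Python for readability (the real routine is 68000 code).
; The function name interleave_stereo is hypothetical.

```python
def interleave_stereo(left, right):
    """Build an STE-style stereo buffer: Left byte, Right byte, Left, Right, ..."""
    if len(left) != len(right):
        raise ValueError("left and right must have the same length")
    buf = bytearray()
    for l, r in zip(left, right):
        buf.append(l & 0xFF)  # signed 8 bit value stored as a raw byte
        buf.append(r & 0xFF)
    return bytes(buf)


frame = interleave_stereo([1, 2, -1], [10, 20, -128])
```

; At 50 KHz one VBL of stereo data is 1000 such Left/Right pairs, so
; 2000 bytes.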
; The next part is the mixer ...
; This is where the job starts to be nearly identical both for STFs and STEs
; as the job is to mix 4 voices data to a buffer respecting the STE DMA
; read constraints or the self established rules for STF.
; From this moment on, i will forget the general case and focus only on
; Lance's challenge: 50 KHz replay in stereo on STE.
; Stereo means that the mixer SW will do two times the job to mix two
; voices into one 8 bit value that will be interleaved as the STE DMA
; expects.
; Mixing means mainly adding signed values. To get an 8 bit mixing result
; with 2 voices there are 2 solutions: a conversion via a table, where a
; previous mixing result of any bit size is converted into 8 bit, or the
; speedy solution: 7 bit + 7 bit = 8 bit ...
; It is easy to guess which one Lance chose and which one most of the
; ST Soundtracker players choose.
; But the first one is much more accurate to emulate Paula: the 15 bit
; mixing result is converted into 8 bit data to send to the DACs.
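; The two mixing options can be compared with a small Python sketch.
; Both function names are hypothetical, and the shift in mix_scaled is
; only a stand-in for the conversion table a real player would use.

```python
def mix_7bit(a, b):
    """Speedy solution: two signed 7 bit values summed into a signed 8 bit value."""
    assert -64 <= a <= 63 and -64 <= b <= 63
    return a + b  # always within -128..126, so a plain add cannot overflow


def mix_scaled(a14, b14):
    """Accurate solution: add two signed 14 bit values (15 bit result),
    then reduce the result to 8 bits."""
    assert -8192 <= a14 <= 8191 and -8192 <= b14 <= 8191
    return (a14 + b14) >> 7  # arithmetic shift, 15 bits down to 8
```

; The speedy version throws away one bit of each sample before the add;
; the accurate version keeps full precision until the final conversion.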
; We are now at the point where one should discuss the individual voices.
; Before it is mixed, each voice has to volume the sample data respecting
; the volume set to the sample data.
; Again, this is normally achieved via a lookup table where the sample
; data is converted into volumed sample data.
; We then have the variable speed data reading.
; No HW to do that so it has to be done via SW with memory reading when
; needed via specific addressing modes or pointer increments.
; Finally, we have the loop and end of sample controls. This is the easy
; part as SW can "add" data at the end of the sample to emulate the loop
; with the size of that data corresponding to the maximum data that can
; be read before the pointers are checked which is normally 1 VBL.
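; A minimal Python sketch of the padding idea, assuming the loop is
; given as a start offset and a length (names are hypothetical). The
; default of 640 bytes is the padding size reserved per sample in
; Lance's source.

```python
def pad_sample(data, loop_start=None, loop_len=0, pad=640):
    """Append `pad` bytes so reads past the sample end stay valid for one VBL.
    Looped samples continue with repeated loop data, others with zeros."""
    data = bytes(data)
    loop = b"" if loop_start is None else data[loop_start:loop_start + loop_len]
    if not loop:
        return data + bytes(pad)  # no loop: pad with silence
    tail = bytes(loop[i % len(loop)] for i in range(pad))
    return data[:loop_start + loop_len] + tail
```

; With this padding the mixer never has to test for the sample end
; inside its inner loop; the pointers are only checked once per VBL.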
; 4 Lance's solutions to each small problem
; Now we will visit again each problem in the reverse order.
; The first choice is that the BPM feature is not emulated by this
; version. This means that the MOD control is done at every VBL.
; With a replay rate of 50 KHz, this means that we have to update
; the DMA buffer with 50000 / 50 VBLs = 1000 blocks of Left and Right
; data at every VBL. For 25 KHz we would have only 500.
; The dividers for variable speed reading found in the Protracker tables
; go from 108 to 907 (mt_periodtable).
; Considering the European PAL Amiga clock rate of 7.09379 MHz (and not
; simply 7.09 MHz found in this source), this means reading from memory
; at rates from: 7093790/(2x108) = 32841.6 Hz to 7093790/(2x907) = 3910.6
; Hz. Considering the VBLs, this means reading from 32841.6 / 50 = 656.8
; bytes per VBL down to 3910.6 / 50 = 78.2 bytes per VBL for each voice.
; So we have to produce 1000 8 bit mixing values with a maximum of 657
; reads per VBL for each voice. One can see that the maximum read speed
; compared to the mixing speed is lower than 1: 657 / 1000 = 0.657
; This means that we can use the simplest addressing mode for reading:
; move.b (An)+,... or add.b (An)+,...
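; The arithmetic above can be checked with a few lines of Python. The
; function names are hypothetical; the constants come from the text.

```python
PAL_CLOCK_HZ = 7_093_790  # PAL Amiga clock rate quoted above
VBL_HZ = 50


def bytes_per_vbl(period):
    """Sample bytes one voice consumes per VBL at a given divider."""
    return PAL_CLOCK_HZ / (2 * period) / VBL_HZ


def read_ratio(period, replay_hz):
    """Sample bytes consumed per mixed output value.
    A ratio of at most 1 means a plain (An)+ read is always enough."""
    return bytes_per_vbl(period) / (replay_hz / VBL_HZ)


fastest = bytes_per_vbl(108)  # about 656.8 bytes per VBL
slowest = bytes_per_vbl(907)  # about 78.2 bytes per VBL
```

; read_ratio(108, 50000) stays below 1, while read_ratio(108, 25000)
; lands between 1 and 2.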
; This is where we have the first problem at 25 KHz. As we have only to
; produce 500 mixing results per VBL, we have 657 / 500 = 1.314 which is
; bigger than 1 but lower than 2. This means that for a part of the
; dividers one can not apply the simplest addressing mode: one has to
; read 2 times or correct the pointer: move.b (An)+,... move.b (An)+,...
; or addq #1,An move.b (An)+,... This is slower ...
; One can also limit the MOD to use dividers up to the case where we get
; to the limit: 7093790 / 500 updates / 50 VBLs / 2 = 141.9. This means
; dealing with 2.67 octaves instead of 3.
; tuning 0, normal
; dc.w 856,808,762,720,678,640,604,570,538,508,480,453
; dc.w 428,404,381,360,339,320,302,285,269,254,240,226
; dc.w 214,202,190,180,170,160,151,143,135,127,120,113
; The last 4 values are not usable without addq #1,An inserted ...
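; A short Python check of which tuning 0 dividers exceed 500 reads per
; VBL at 25 KHz. The helper name needs_skip is hypothetical.

```python
TUNING0 = [
    856, 808, 762, 720, 678, 640, 604, 570, 538, 508, 480, 453,
    428, 404, 381, 360, 339, 320, 302, 285, 269, 254, 240, 226,
    214, 202, 190, 180, 170, 160, 151, 143, 135, 127, 120, 113,
]


def needs_skip(period, updates_per_vbl=500, clock=7_093_790, vbl_hz=50):
    """True when the voice must advance more than 1 byte per update on average."""
    return clock / (2 * period) / vbl_hz > updates_per_vbl


skipping = [p for p in TUNING0 if needs_skip(p)]
limit_period = 7_093_790 / 500 / 50 / 2
```

; Only 135, 127, 120 and 113 end up in skipping, which matches the
; limit divider of about 141.9.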
; Now back to 50 KHz replay, one has to mix two streams read at variable
; speed from memory. How to do that ?
; Lance's solution is to divide the VBL in 25 parts where we produce 40
; mixing results (25 x 40 = 1000). At 25 KHz we would have to produce
; only 20 mixing results.
; For each of those 25 blocks, the program will read at two different
; speeds from two samples to mix them.
; To allow that, 23 different reading speeds are allowed per block.
; As we have 2 voices mixing, this means that we have 23 possible read
; speeds for voice 0 and 23 possible read speeds for voice 1: 23 x 23 =
; 529 combinations.
; So 529 different code combinations are generated to handle each one of
; the 529 cases (mt_make_mixcode).
; But you will say, we have almost a thousand different reading speeds.
; That's right, but we have 25 blocks per VBL. So if we do block 0 at
; speed 13 and block 1 at speed 12 and block 2 at speed 13 and block 3 at
; speed 12 and so on we will get an average speed of 12.5 and the
; listener will not notice it. That is a first compromise needed for
; memory space reasons: all those code combinations consume memory.
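; One way to picture the averaging is the Python sketch below. It
; spreads a per-VBL byte count over 25 blocks so that each block uses
; one of two neighbouring speeds. This is an illustration only: it
; treats "speed" as bytes read per block, and the order Lance's own
; table builder produces may differ. The name block_speeds is
; hypothetical.

```python
def block_speeds(bytes_per_vbl, blocks=25):
    """Split a per-VBL byte count into per-block counts that differ by at most 1."""
    return [
        (i + 1) * bytes_per_vbl // blocks - i * bytes_per_vbl // blocks
        for i in range(blocks)
    ]


speeds = block_speeds(312)  # average of 12.48 bytes per block
```

; Every entry of speeds is 12 or 13 and the entries add up to 312, so
; the pitch is right on average over the VBL.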
; Now to volume control ...
; Lance chose to take advantage of the Microwire volume control.
; On STE one can set the volume of both Left and Right stereo signals in
; an independent way.
; So the idea is to volume only 1 of the 2 samples we are mixing.
; Let's do an example: voice 0 has $30 volume and voice 1 has a $20
; volume. You can set the global Microwire volume to the equivalent of
; $30 for a maximum of $40 and volume the voice 1 sample data with the
; relative volume between the two samples: $20/$30 = 0.6667 or 43 in 64.
; The important thing is to always volume the voice with the lowest volume.
; This is why this routine does not work on Falcon and why it is difficult
; to control the global replay volume. After you have called Lance's routine,
; you can change the Microwire volume but you have to respect the set
; relationship between the values found at Left and Right volume.
; If you find 100% for Left and 50% for Right, you can change to 80/40
; or 60/30 or any other 2:1 ratio values.
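; The volume split from the example can be written as a small Python
; sketch. The function name and the rounding rule are assumptions; the
; exact rounding used by mt_make_divtab is not shown here.

```python
def split_volumes(vol0, vol1):
    """Return (global_volume, louder_voice, table_volume) on Paula's 0..64 scale.
    The louder voice is left untouched; the quieter one is scaled via a table."""
    louder = 0 if vol0 >= vol1 else 1
    hi, lo = max(vol0, vol1), min(vol0, vol1)
    relative = 0 if hi == 0 else round(lo * 64 / hi)
    return hi, louder, relative


example = split_volumes(0x30, 0x20)
```

; For the $30/$20 case this gives a global volume of $30, voice 0 left
; as it is, and voice 1 volumed at 43 out of 64.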
; This Microwire solution has another compromise: the number of volume
; levels available at the Microwire is much less than the 65 Paula levels
; and they are not linear. The conversion table can be found here:
; dc.w 0
; dc.w 2,5,7,8,9,10,10,11,11,12,12,13,13,13,14,14
; dc.w 14,14,15,15,15,15,16,16,16,16,16,16,17,17,17,17
; dc.w 17,17,17,18,18,18,18,18,18,18,18,18,18,19,19,19
; dc.w 19,19,19,19,19,19,19,19,19,20,20,20,20,20,20,20
; The relative volume between the two mixing samples is obtained via a div
; table built at start (mt_make_divtab) with 64x64 = 4096 combinations.
; So Lance only has to read at 2 variable speeds from 2 sources and the
; data from 1 source is volumed (goes via a table) into another value.
; The typical worst case scenario (for time) is the following:
; move.b (a0)+,d2 voice 0 data
; move.b (a1)+,d1 voice 1 data that goes to D1
; move.l d1,a2 that points to the volume table $xxxxxx00
; add.b (a2)+,d2 mixing with volumed data
; move.b d2,(sp)+
; So here we have the remaining Lance solutions.
; Mixing is done by simple adds so 7 bit samples are used: 7bit + 7bit =
; 8 bit. The reduction to 7 bit can be found at .mt_shift_down.
; STE stereo buffer interleaving problem is solved using the 68000 SP
; protection mechanism that increments the pointer by 2 in case of byte
; access: this was unknown to me until i looked first at this routine in
; 2004 ...
; The volume table is located at a 256 byte even boundary that allows to
; get the volumed converted value in a simple and speedy way. I have a
; similar solution in Hextracker except for real 8 bit samples replay.
; The difference is that i get a word as a result and so the sample bytes
; have bit 0 set to 0 (reduction to 7 bits) instead of a signed right
; shift like it is done here by Lance.
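; The 256 byte boundary trick can be modelled in Python as below. The
; function names are hypothetical, and the scaling inside
; build_volume_table is an assumption, since the contents produced by
; mt_make_voltab are not shown here.

```python
def volume_lookup_address(table_base, sample_byte):
    """Model of `move.l d1,a2`: the upper 24 bits of d1 hold the table base
    ($xxxxxx00) and the low byte holds the sample, so d1 is the entry address."""
    if table_base & 0xFF:
        raise ValueError("table must start on a 256 byte boundary")
    return table_base | (sample_byte & 0xFF)


def build_volume_table(volume):
    """256 entries: index = raw sample byte, value = sample scaled by volume/64."""
    table = bytearray(256)
    for raw in range(256):
        signed = raw - 256 if raw >= 128 else raw
        table[raw] = ((signed * volume) >> 6) & 0xFF
    return bytes(table)
```

; Because the low byte of the base address is zero, writing the sample
; into the low byte of d1 costs nothing extra: no add and no index
; register are needed to reach the table entry.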
; For the cases where no read is required then the previous mixed value
; is sent to the buffer or only 1 read is done: this is the job of the
; generated code to take care of each of those cases (mt_make_mixcode).
; All this is compatible with 25 KHz replay except the need to insert a
; addq #1,An for steps bigger than 1 and to reduce the buffer updates to
; 20 per block instead of 40.
; The loop control is done in the general way: 640 bytes are added to
; each sample at the end with the looped sample data in case of loops
; or zeros. This is done at mtloop3 and space for that is reserved here:
;mt_data incbin "modules\*.mod"
; ds.w 31*640/2 ;These zeroes are necessary!
; This means that 640 is the maximum number of bytes that Lance expects
; to have to read per VBL. But our calculations point to 657 ...
; Maybe this was done before Finetune was included. If we do not
; include Finetune, the minimum divider is 113 (mt_periodtable tuning 0).
; 7093790 / (2x113) / 50 VBLs = 627.8 bytes
; The Portamento effects also limit the divider to 113.
; mt_make_tables only starts at 113 keeping the same reading pace for
; dividers below. So if a 108 divider is set by the Finetune, 113 will be
; used instead.
; Bug or not in the implementation of Finetune, this is not our concern
; now ...
; So we know almost everything now except the core: how is the generated
; code built ? What are the rules ?
; One that is obvious is that registers d0, d1 and d2 are used in their
; byte parts and that the rest of d1 points to the volume table generated
; by mt_make_voltab. a0 and a1 point to the sample data and are
; incremented at each needed read. a2 is used for the volume conversion
; and the data is sent to (sp)+ via a move.b from d0, d1 or d2.
; What remains is the most complex part: when do we need to read from 0,
; from 1, when can we use previous data and so on ...
; This is where our analysis of:
; - mt_make_freq
; - mt_make_frame_f
; - mt_make_mixcode
; will be crucial to find out what to change to have a 25 KHz solution
; Let's start with mt_make_frame_f.
; That proc is used to fill two very important arrays.
; mt_frame_freq_p points to a table where one finds 551 pointers, one for
; each handled read increment in a VBL from 75 bytes to 625 bytes with both
; values included. Depending on the amount of data to be read from a
; sample during a VBL, so depending on the divider (from 113 on), the
; code will detect how many bytes it needs to read and gets from this
; table the corresponding pointer.
; That pointer will point to a location inside the table pointed by
; mt_frame_freq.
; That table contains a series of 25 words (50 bytes) for each pointer
; from the table pointed by mt_frame_freq_p. So its length should be
; 551 x 25 = 13775 words or 27550 bytes.
; At this stage i do not understand why Lance reserves 27500 words for it
; here: mt_frame_freq ds.w 27500
; Anyway this data only reflects the pace at which data is read for one
; voice giving us for each of the 25 VBL blocks the selected speed from
; the 23 available from 0 to 22.
; So one word per block, 25 words per VBL and each word with this format:
; [PPPPPPPPPSSSSS00] where [PPPPPPPPP] = [SSSSS] x 23 and SSSSS is 0...22
; Inside the macro mt_channel, for each of the 25 blocks, the value
; for the corresponding frequency divider for one of the voices will be
; "mixed" with the value corresponding to the frequency divider for the
; other voice giving us a pointer to one of the 529 (23 x 23) available
; generated code sub routines to where the program jumps like shown here:
; lea .mt_return,a6 points to 1st block
; move.w (a3)+,d3
; move.w (a4)+,d4
; and.w d5,d4 isolate 000000000SSSSS00
; and.w d6,d3 isolate PPPPPPPPP0000000
; lsr.w #5,d3 00000PPPPPPPPP00
; add.w d3,d4 + 000000000SSSSS00
; move.l (a5,d4.w),a2 0000xxxxxxxxxx00 0...528 = 529 ptrs
; jmp (a2)
; rept 24
; lea $16(a6),a6 points to next block
; move.w (a3)+,d3
; move.w (a4)+,d4
; and.w d5,d4
; and.w d6,d3
; lsr.w #5,d3
; add.w d3,d4
; move.l (a5,d4.w),a2
; jmp (a2)
; The program comes back from the sub routine with a jmp (a6) where a6
; points to the next block or to the end of the 25 blocks for the last
; one.
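; The index calculation in mt_channel can be followed step by step in
; Python. The mask values stand for the contents of d5 and d6 and are
; assumptions derived from the bit layout in the comments; the function
; names are hypothetical.

```python
SPEEDS = 23
MASK_S = 0x007C  # 000000000SSSSS00 (assumed content of d5)
MASK_P = 0xFF80  # PPPPPPPPP0000000 (assumed content of d6)


def frame_word(speed):
    """Pack a speed index 0..22 as [PPPPPPPPPSSSSS00] with P = S x 23."""
    assert 0 <= speed < SPEEDS
    return ((speed * SPEEDS) << 7) | (speed << 2)


def code_table_offset(word_a3, word_a4):
    """Byte offset into the table of 529 long pointers."""
    d3 = (word_a3 & MASK_P) >> 5  # 00000PPPPPPPPP00 = 4 x (23 x speed)
    d4 = word_a4 & MASK_S         # 000000000SSSSS00 = 4 x speed
    return d3 + d4


offsets = {
    code_table_offset(frame_word(a), frame_word(b))
    for a in range(SPEEDS)
    for b in range(SPEEDS)
}
```

; Each word carries the speed twice, once already multiplied by 23, so
; the pair (speed A, speed B) turns into the table index 23 x A + B
; without any multiplication at replay time.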
; As this is only handling number of reads and read speeds, changing to
; 25 KHz replay should have no impact here.
; We are then left with:
; - mt_make_freq
; - mt_make_mixcode
; Before we go any further, we know enough about Lance's routine to safely
; estimate how many CPU cycles per VBL can be gained in going down to 25
; KHz replay.
; Let's divide the main part in 3:
; - reading part;
; - mixing part;
; - storing part;
; As we have seen, if we reduce to 25 KHz replay, then we can only do 500
; reads per VBL. But there are 4 out of the 36 octave notes that require
; more than 500 reads to avoid skipping data: B-3, A#3, A-3 and G#3 with
; dividers 113, 120, 127 and 135 respectively. In fact they would require
; 7093790 / (2 x divider) / 50 VBLs = 628, 591, 559 and 525 reads per VBL
; respectively. As we can only do 500 at 25 KHz, we will skip 128, 91, 59
; and 25 bytes respectively having to insert in the generated code as many
; addq #1,An. The average bytes to skip per note is then: (128+91+59+25)
; / 36 notes = 303 / 36 = 8.42. The clock cycles spent for reading vary
; from 8 (simple read (voice 0)) to 20 (read and volume the data via
; table (voice 1)) and the average is: (8+20)/2 = 14 cycles. The addq
; costs 8 cycles. So for each read not done, we will save 14-8 = 6 cycles.
; As we do 8.42 fewer reads on average, we will save 6 * 8.42 = 50.52
; cycles on average.
; In the mixing part, we want to know how many adds we can save. Let's do
; an example that can be considered an average case: a C-2 on one of the
; voices and a C-3 on the other one. A C-2 means approximately 166.666
; reads per VBL. The correct amount is 7093790 / (2 x 428) / 50 = 165.7.
; A C-3 means twice that amount. To simplify i will use the 166.666 value.
; 166.666 reads out of 1000 updates at 50 KHz means 1 read every 6
; updates. For the C-3, we will have 1 read every 3 updates. At 25 KHz,
; we are down to 1 read every 3 updates and 1 read every 1.5 updates.
; Doing the respective combinations we will get in average at 50 KHz:
; 55.55 simultaneous reads on the two voices, 388.88 reads on one of the
; two voices and 555.55 cases where we just update the buffer with the
; previous mixing result. The same at 25 KHz will result in: 111.11
; simultaneous reads, 277.78 single reads and 111.11 cases where we just
; update the buffer with the previous mixing result. In both cases we
; keep a total of 166.66 + 333.33 = 500 reads. Comparing the two cases,
; this means that going from 50 KHz to 25 KHz increases the simultaneous
; reads by 55.55 and reduces by 111.11 the single reads. So on average,
; we have: +55.55 - 111.11 = -55.55 which means 55.55 cases of mixing
; less than before. Each mixing costs a maximum of 8 cycles: move.b
; from the voice register (d0 or d1) to d2 and add.b new one to d2. So
; in total this saves us on average a maximum of 55.55 x 8 = 444.44
; cycles.
; Finally, in the storing part, counting is easy. We have 1000 updates
; at 50 KHz and 500 at 25 KHz so we save 500 updates like this one:
; move.b d2,(sp)+. I assume updating via (sp) with .b would cost the
; same as using any other An despite that specific behaviour that gives
; us the interleaved buffer update. If so, each update costs 8 cycles.
; We save then 500 x 8 = 4000 cycles.
; Adding the 3 parts, we can save up to 50.52 + 444.44 + 4000 = around
; 4495 cycles per mixing. As we have two mixings for Left and Right, we
; can save a maximum of 4495 + 4495 = 8990 cycles by going down to 25
; KHz replay. That represents a maximum of 5.6% of the CPU time.
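; The whole estimate can be reproduced in Python. The figure of 160256
; CPU cycles per VBL (8 MHz ST, PAL) is an assumption used only to turn
; the saving into a percentage; all other numbers come from the text.

```python
CLOCK = 7_093_790
VBL_HZ = 50
CPU_CYCLES_PER_VBL = 160_256  # assumption: 8 MHz 68000, 50 Hz PAL frame


def reads_per_vbl(period):
    return round(CLOCK / (2 * period) / VBL_HZ)


# Reading part: four notes need more than 500 reads per VBL.
skipped = [reads_per_vbl(p) - 500 for p in (113, 120, 127, 135)]
avg_skipped = round(sum(skipped) / 36, 2)  # averaged over the 36 notes
read_saving = (14 - 8) * avg_skipped       # average read 14 cycles, addq 8


# Mixing part: p_a and p_b are the chances that a voice reads on an update.
def combos(updates, p_a, p_b):
    both = updates * p_a * p_b
    single = updates * (p_a * (1 - p_b) + (1 - p_a) * p_b)
    none = updates * (1 - p_a) * (1 - p_b)
    return both, single, none


both50, single50, none50 = combos(1000, 1 / 6, 1 / 3)
both25, single25, none25 = combos(500, 1 / 3, 2 / 3)
mix_saving = ((both50 + single50) - (both25 + single25)) * 8

# Storing part: 500 fewer move.b d2,(sp)+ at 8 cycles each.
store_saving = (1000 - 500) * 8

per_side = read_saving + mix_saving + store_saving
total = 2 * per_side  # Left and Right
percent = 100 * total / CPU_CYCLES_PER_VBL
```

; The storing part dominates: about 4000 of the roughly 4495 cycles
; saved per side come from writing half as many bytes to the buffer.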
; Still interested ?
; If we want to continue, we will have to go down to the real deal.
; Let's look at mt_make_freq.
; This procedure fills a table pointed by mt_freq_list with a series of
; words with 0 and 1. The table size is 23 x 40 words. For each of the
; 23 available read speeds, it will fill 40 words with 0 or 1. These
; 40 words correspond to the 40 digital buffer updates at 50 KHz for each
; of the 25 VBL code blocks. If a 1 is found then a sample read has to be
; done for that digital buffer update for that voice.
; Looking at the code, one can see that the 23 different speeds relate to
; 23 different increments from 3 to 25 both included. So the minimum read
; bytes from a sample is 3 x 25 VBL blocks = 75 bytes. The maximum read
; bytes is 25 x 25 VBL blocks = 625 bytes for the minimum divider case.
; So first the read step is calculated dividing D0 (number of bytes to
; read) by 40 digital buffer updates. A result below 1 is expected as the
; maximum value for D0 (25) is smaller than 40. The result is then
; rounded if the division remainder is bigger than 20 (half of 40). Having now
; a long value with the reading pace per digital buffer update, we add it
; for 40 updates and for each one a check is made to see if a new integer
; part was reached or if the fractional part is bigger than 0.5:
; moveq #39,d7 40 updates to the digital buffer
; .mt_make_freq
; add.w d2,d1 adds the pace in D2 to the counter in D1
; negx.w d4 D4 contains the integer part of the counter
; neg.w d4 update D4 with X if the add overflowed
; move.w d4,d5 copy result integer part to D5
; move.w d1,d6 copy result fractional part to D6
; add.w d6,d6 if result fractional part bigger than 0.5 ($8000) this
; negx.w d5 previous add will overflow
; neg.w d5 and so we correct the result integer part with X
; cmp.w d3,d5 if the new value is lower or equal to the
; ble.s .mt_set_zero previous one then we SET a ZERO in table
; move.w d5,d3 otherwise we keep the new value in D3 for
; moveq #1,d5 next time and we SET a ONE in the table.
; move.w d5,(a0)+ (A0) = 1
; dbra d7,.mt_make_freq 40 times
; addq.w #1,d0 from 3 to 25 values
; cmp.w #26,d0
; bne.s .mt_maker
; .mt_set_zero
; moveq #0,d5
; move.w d5,(a0)+ (A0) = 0
; dbra d7,.mt_make_freq 40 times
; addq.w #1,d0 from 3 to 25 values
; cmp.w #26,d0
; bne.s .mt_maker
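; The effect of this loop can be modelled in Python as below. This is a
; sketch, not a port: it uses exact integer arithmetic instead of the
; 16 bit fixed point pace, and it assumes the counters start at zero,
; so the position of individual ones may differ from the table the real
; code builds. The function name is hypothetical.

```python
def make_freq_rows(updates=40, first=3, last=25):
    """For each per-block byte count n, return `updates` flags
    (1 = read a new sample byte on this buffer update, 0 = no read)."""
    rows = {}
    for n in range(first, last + 1):
        flags, prev = [], 0
        for k in range(1, updates + 1):
            # Read position after k updates, rounded to nearest
            # (a fractional part of 0.5 rounds up).
            pos = (2 * k * n + updates) // (2 * updates)
            flags.append(1 if pos > prev else 0)
            prev = pos
        rows[n] = flags
    return rows


rows = make_freq_rows()
```

; Each of the 23 rows holds 40 flags and contains exactly n ones, spread
; as evenly as possible over the block.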
; Here we have obviously impacts in trying to go down to 25 KHz.
; First one is to reduce everything that is 40 to 20. The division has to
; be done by 20, the round compare value is 10 and the D7 register would
; get 19(20-1) instead of 39(40-1). But that is not all. As we divide by
; 20, the division result can be bigger or equal to 1 for read paces of
; 20, 21, 22, 23, 24 and 25. These are 6 of the 23 read paces. For those
; cases the code must be updated to allow steps bigger than 1 and maybe
; also to store a different value in memory other than 0 or 1 for the
; code generation routine to insert the necessary addq #1,An to skip some
; reads. So this routine has to be completely re-written but only after
; the analysis of the code generation routine: mt_make_mixcode.
; At last we have mt_make_mixcode that is of course the most complex
; part of this whole code.
; It does not surprise anyone to find here a 40 times dbf loop inside
; two 23 times dbf loops. For each combination of read speeds (23 x 23), we
; have code to generate to update 40 times the digi buffer. For each
; case two values are read from the table generated by mt_make_freq and
; so we can have the following combinations:
; 00 just update digi buffer
; 01 update digi buffer but read from voice 0
; 10 update digi buffer but read from voice 1
; 11 update digi buffer but read from both voices
; On top of that, there is an optimization scheme that compares the
; current combination different from 0 with the next one different from 0
; in order to save CPU time related to registers switches. This includes
; updating the digi buffer with d0 or d1 instead of d2 and other stuff
; not important in the goal to go down to 25 KHz, i think.
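; What one generated block does can be emulated in Python, ignoring the
; register optimizations. The function name and the prev0/prev1
; parameters (values carried over from the previous block) are
; hypothetical.

```python
def run_block(flags0, flags1, voice0, voice1, voltab, prev0=0, prev1=0):
    """Produce one output value per update, re-reading a voice only when
    its flag is 1. voice0 and voice1 hold signed samples; voltab maps a
    raw byte (0..255) to the volumed signed value."""
    out, cur0, cur1 = [], prev0, prev1
    i0 = i1 = 0
    for f0, f1 in zip(flags0, flags1):
        if f0:  # combinations 01 and 11: read voice 0
            cur0 = voice0[i0]
            i0 += 1
        if f1:  # combinations 10 and 11: read voice 1 and volume it
            cur1 = voltab[voice1[i1] & 0xFF]
            i1 += 1
        out.append(cur0 + cur1)  # combination 00 repeats the previous mix
    return out, i0, i1


identity = [b - 256 if b >= 128 else b for b in range(256)]
out, used0, used1 = run_block([1, 0, 1, 0], [1, 1, 0, 0], [10, 20], [1, 2], identity)
```

; The generated code reaches the same result without any tests at
; replay time: the flags are resolved once, when the code is built.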
; So now that we have analysed this last procedure, we can identify the full
; impacts both for this procedure and for mt_make_freq in going down to
; 25 KHz.
; mt_make_freq must generate values 0, 1 and 2 instead of 0 and 1.
; 0 will mean no read just as it does now.
; 1 will mean read with 1 byte increment as it does now.
; 2 will mean read with 2 bytes increment so that we add a addq #1,An.
; mt_make_mixcode will have to be changed in order to handle not only the
; 00, 01, 10 and 11 combinations but now 00, 01, 02, 10, 11, 12, 20, 21
; and 22 combinations.
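; A Python sketch of the 0/1/2 table a 25 KHz version would need, under
; the same assumptions as any simplified model of mt_make_freq: exact
; integer arithmetic and counters starting at zero. The function name
; is hypothetical.

```python
def make_freq_rows_25k(updates=20, first=3, last=25):
    """For each per-block byte count n, return `updates` steps:
    0 = no read, 1 = read, 2 = read with one extra addq #1,An."""
    rows = {}
    for n in range(first, last + 1):
        steps, prev = [], 0
        for k in range(1, updates + 1):
            pos = (2 * k * n + updates) // (2 * updates)  # rounded position
            steps.append(pos - prev)
            prev = pos
        rows[n] = steps
    return rows


rows_25k = make_freq_rows_25k()
with_skips = sorted(n for n, r in rows_25k.items() if 2 in r)
```

; In this model a pace of 20 gives a step of exactly 1 on every update,
; so only the paces 21 to 25 actually produce a 2.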
; Every loop in both procs dealing with 40 digi updates will be reduced
; to 20. The divider rounding value for compare will be reduced from 20
; to 10. The division in mt_make_freq must be adapted to handle results
; above 1. The control table size will be reduced in half as well as
; the leas size to go to the next block in mt_make_mixcode.
; Last but not least mt_frequency dc.w $0003 has to be changed to $0002.