Blitter execution times

From Atari Wiki
Revision as of 20:35, 18 September 2025 by Uko (talk | contribs)

This article documents the execution times of blitter operations.

It presents execution times from a software developer's point of view, i.e. the number of bus cycles the CPU has to wait before it regains control.

It does not present execution times from a hardware point of view, i.e. internal timings. For example, a logical shift (SKEW) is treated here as costing no bus cycle, even though internally it might take a small fraction of one.

Moreover, it focuses on the STe and Mega STe architectures. Its applicability to the Falcon should be treated with caution, since timings seem to differ [1].

Legacy Execution Times Table

The following well-known table has been used as reference for many years.

It can be used directly for many “simple cases”: very roughly, when source and destination are aligned on a word boundary (no shifting) and the masks are set to $FFFF. Moreover, the table does not include the bus arbitration cycles at the beginning and at the end of a blit.

For other cases, please refer to the sections below.

All timing figures are given in nops or bus cycles per word of transfer (1 nop is equivalent to one bus cycle, and a standard 68000 bus cycle takes four clock cycles). So a value of 2 would take the equivalent time of 2 nops to transfer 1 word of data.

LOP \ HOP   0   1   2   3
0           1   1   1   1
1           2   2   3   3
2           2   2   3   3
3           1   1   2   2
4           2   2   3   3
5           2   2   2   2
6           2   2   3   3
7           2   2   3   3
8           2   2   3   3
9           2   2   3   3
10          2   2   2   2
11          2   2   3   3
12          1   1   2   2
13          2   2   3   3
14          2   2   3   3
15          1   1   1   1

HOP = Halftone Operation

LOP = Logical Operation (referred to as OP in the rest of this article)

Execution Times Computation Logic

For each word of transfer, the blitter has to perform the following bus accesses:

  • Read a word from the source address (optional)
  • Read a word from the destination address (optional)
  • Write a word to the destination address (mandatory)


Each of these reading or writing operations takes 1 bus cycle. So a minimum of 1 bus cycle and a maximum of 3 bus cycles are necessary for a word transfer.

Access to HALFTONE is free, i.e. it does not require any additional bus access.

The blitter execution time for a transfer of one word is exactly the sum of the bus cycles used for reading and writing.


The reading of a word from the source is optional. It depends mainly (but not only) on the OP and HOP operations: for OP = 0 (all zeros), OP = 15 (all ones), OP = 5 (destination), OP = 10 (NOT destination), HOP = 0 (all ones) or HOP = 1 (halftone), the blitter has no need to read a word from the source address.


The reading of a word from the destination is also optional. To explain this point, it is necessary to understand that the blitter is not able to perform logical operations directly onto the destination.

For example, with OP = 1 (source AND destination), the blitter has to read the source, read the destination, perform an AND between the two and write the result to the destination.

It cannot directly AND the source onto the destination.

So for all such logical operations it will have to read the destination before writing, requiring a bus cycle.


The write operation is always performed (strictly speaking, the extra iteration added by FXSR is an exception, which we'll address later; let's consider the statement true for now). The important thing to remember is that the value in the XCOUNT register always indicates exactly the number of write operations that will be done for a line. We'll get back to this when we consider the usage of FXSR and NFSR.


So, if we take some examples:

Case 1:

  • OP = 3 (source)
  • HOP = 2 (source)

The blitter will perform one read access from source address, and one write access to destination address, so the execution time will be 2 bus cycles or nops.


Case 2:

  • OP = 4 (NOT source AND destination)
  • HOP = 3 (source AND HALFTONE)

The blitter will perform one read access from the source address (the AND with HALFTONE has no impact), one read access from the destination (because of the logical OP) and one write access to the destination address, so the execution time will be 3 bus cycles or nops.


You can carry out this computation exhaustively for all combinations of OP and HOP, and you will recover the results of the “legacy” table.
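The exhaustive computation can be sketched in a few lines of Python. This is only an illustration of the rule described above, not a model of the hardware: the function name `cycles_per_word` and the three sets are hypothetical helpers. The sets of OP values that ignore the source or the destination are deduced from the legacy table itself.

```python
# OP values whose result does not depend on the source word
# (0 = all zeros, 5 = destination, 10 = NOT destination, 15 = all ones).
OPS_WITHOUT_SOURCE = {0, 5, 10, 15}
# OP values whose result does not depend on the destination word
# (0 = all zeros, 3 = source, 12 = NOT source, 15 = all ones).
OPS_WITHOUT_DEST = {0, 3, 12, 15}
# HOP values that use the source word (2 = source, 3 = source AND halftone).
HOPS_WITH_SOURCE = {2, 3}


def cycles_per_word(op, hop):
    """Bus cycles for one word: optional source read, optional
    destination read, and the mandatory write."""
    source_read = 1 if hop in HOPS_WITH_SOURCE and op not in OPS_WITHOUT_SOURCE else 0
    dest_read = 0 if op in OPS_WITHOUT_DEST else 1
    write = 1
    return source_read + dest_read + write


# The legacy table, one row per OP (0..15), one column per HOP (0..3).
LEGACY_TABLE = [
    [1, 1, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3], [1, 1, 2, 2],
    [2, 2, 3, 3], [2, 2, 2, 2], [2, 2, 3, 3], [2, 2, 3, 3],
    [2, 2, 3, 3], [2, 2, 3, 3], [2, 2, 2, 2], [2, 2, 3, 3],
    [1, 1, 2, 2], [2, 2, 3, 3], [2, 2, 3, 3], [1, 1, 1, 1],
]

computed = [[cycles_per_word(op, hop) for hop in range(4)] for op in range(16)]
print(computed == LEGACY_TABLE)
```

Case 1 above corresponds to `cycles_per_word(3, 2)` and Case 2 to `cycles_per_word(4, 3)`.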

Using Masks

When the value set in one of the ENDMASK 1, 2, 3 registers is different from $FFFF, the bits of the destination word which correspond to zeros in that ENDMASK register remain unchanged.

This means that the blitter will perform a logical operation between the source and the destination.

And we’ve seen above that the blitter is not able to perform logical operations directly onto the destination.

So for all such logical operations it will have to read the destination before writing, requiring a specific bus access.


Therefore, if a blit operation would not normally require reading the destination before writing, e.g. OP = 3 (source), but the mask value is different from $FFFF, then the blitter will nevertheless read the destination before writing, requiring an additional bus access.

Fortunately, if the blit operation already requires such a destination read access, the mask value has no additional impact.


If we go back to the previous examples, but with an ENDMASK value different from $FFFF:

Case 1:

  • OP = 3 (source)
  • HOP = 2 (source)

The blitter will perform one read access from source address, one destination read access because of the mask, and one write access to destination address, so the execution time will be 3 bus cycles or nops instead of 2 with mask = $FFFF.


Case 2:

  • OP = 4 (NOT source AND destination)
  • HOP = 3 (source AND HALFTONE)

Since OP value already implies one read access from the destination address, the execution time is not impacted by the mask value and remains 3 bus cycles or nops.


So the legacy table is no longer applicable directly when using ENDMASK.
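The effect of the mask on a single word can be sketched as follows. This is a minimal illustration of the rule above and not a hardware model: the function name `cycles_per_word_masked` and its parameters are hypothetical, and `endmask` stands for whichever ENDMASK register applies to the word being written.

```python
def cycles_per_word_masked(op, hop, endmask=0xFFFF):
    """Bus cycles for one word, taking the applicable ENDMASK into account."""
    # The source is read only if HOP uses it and OP depends on it.
    source_read = 1 if hop in (2, 3) and op not in (0, 5, 10, 15) else 0
    # The destination is read if OP depends on it, or if the mask
    # preserves some destination bits (mask different from $FFFF).
    op_needs_dest = op not in (0, 3, 12, 15)
    dest_read = 1 if op_needs_dest or endmask != 0xFFFF else 0
    write = 1
    return source_read + dest_read + write


print(cycles_per_word_masked(3, 2))          # Case 1, mask = $FFFF
print(cycles_per_word_masked(3, 2, 0x00FF))  # Case 1, mask != $FFFF
print(cycles_per_word_masked(4, 3, 0x00FF))  # Case 2, mask != $FFFF
```

The three calls give 2, 3 and 3 bus cycles respectively, matching the cases discussed above.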

Using SKEW

The SKEW register is used to shift the source data. It gives the number of bits by which the data in the source data latch is shifted right before being combined with the halftone mask and the destination data.

Using a SKEW value different from zero has no impact in itself on the blitter execution time as seen up to now.

It is the use of NFSR and FXSR, which generally accompanies SKEW, that has an impact.

Using NFSR

When one wants to copy a source aligned on the beginning of a word (i.e. Xsource = 0) to a destination position shifted into the middle of a word (i.e. Xdest > 0), one has to set the NFSR bit (in addition to the SKEW value).

Let's take the example of a 32-pixel-wide (so 2-word) graphics area.

Because the destination is shifted, we have to copy 2 words of source data onto a 3-word destination area.

The blitter XCOUNT register has to be set to 3 (number of words to write, see previous sections), and we therefore have to tell the blitter that we do not want to read the source when writing the last word.

This is done through the NFSR (which stands for No Final Source Read) bit. When this bit is set the last source read of each line is not performed.


If we consider OP = 3 (source) and HOP = 2 (source), the blitter will perform, for the complete line, two read accesses from source address, and three write accesses to destination address, so the execution time will be 5 bus cycles or nops.


So the legacy table is no longer applicable directly when using NFSR (or only as an intermediate result to be corrected by the NFSR impact).


Warning #1: if XCOUNT is set to one, the Blitter doesn't fully honor the NFSR bit and performs a final source read regardless.


Warning #2: when using NFSR with SKEW > 0, it is necessary to set an ENDMASK value different from $FFFF (corresponding to the SKEW value) to prevent garbage bits from being displayed, because the data latch register is not filled with zeros as one might expect. And of course, as explained above, setting a mask value different from $FFFF has an impact on the number of destination read accesses!
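The per-line accounting with NFSR, including the two warnings, can be sketched as below for OP = 3 and HOP = 2. The function name `nfsr_line_cycles` and the `masked_words` parameter (number of words on the line written through a mask different from $FFFF) are hypothetical. The figure with masks is an extrapolation from the "Using Masks" section, not a measured value.

```python
def nfsr_line_cycles(xcount, masked_words=0):
    """Bus cycles for one line with OP = 3, HOP = 2 and NFSR set."""
    # NFSR drops the last source read of the line, except when
    # XCOUNT = 1, where the read is performed anyway (Warning #1).
    source_reads = xcount - 1 if xcount > 1 else xcount
    # Each word written through a mask different from $FFFF costs one
    # destination read (Warning #2).
    dest_reads = masked_words
    # XCOUNT is exactly the number of writes per line.
    writes = xcount
    return source_reads + dest_reads + writes


print(nfsr_line_cycles(3))                  # 2 reads + 3 writes
print(nfsr_line_cycles(3, masked_words=2))  # first and last words masked
print(nfsr_line_cycles(1))                  # NFSR not honored
```

For the 3-word example this gives 5 bus cycles without masks, and 7 if the first and last words are masked as Warning #2 recommends.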

Using FXSR

Now the opposite problem: when one wants to copy a source position shifted into the middle of a word (i.e. Xsource > 0) to a destination position aligned on the beginning of a word (i.e. Xdest = 0), one has to set the FXSR bit (in addition to the SKEW value).


Let's take the example of a 32-pixel-wide (so 2-word) graphics area.

Because the source is shifted, we have to copy 3 words of source data onto a 2-word destination area.

The blitter XCOUNT register has to be set to 2 (number of words to write, see previous sections), and we therefore have to indicate to the blitter that we want to have an initial iteration for reading the source without writing.

This is done through the FXSR (which stands for Force eXtra Source Read) bit. When this bit is set, one extra source read is performed at the start of each line to initialize the remainder portion of the source data latch.


If we consider OP = 3 (source) and HOP = 2 (source), the blitter will perform, for the complete line, three read accesses from source address, and two write accesses to destination address, so the execution time will be 5 bus cycles or nops.


So the legacy table is no longer applicable directly when using FXSR (or only as an intermediate result to be corrected by the FXSR impact).
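The same accounting for FXSR, again for OP = 3 and HOP = 2 with all masks at $FFFF, can be sketched as follows. The function name `fxsr_line_cycles` is a hypothetical helper used only for illustration.

```python
def fxsr_line_cycles(xcount, fxsr=True):
    """Bus cycles for one line with OP = 3, HOP = 2 and masks = $FFFF."""
    # One source read per written word, plus the extra read at the
    # start of the line when FXSR is set.
    source_reads = xcount + (1 if fxsr else 0)
    # XCOUNT is exactly the number of writes per line.
    writes = xcount
    return source_reads + writes


print(fxsr_line_cycles(2))              # 3 reads + 2 writes
print(fxsr_line_cycles(2, fxsr=False))  # legacy table: 2 words * 2 cycles
```

With XCOUNT = 2 this gives 5 bus cycles with FXSR, against the 4 that the legacy table alone would predict.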

Blitter Starting Time

In addition to all previously mentioned execution times, we must also take into account the time to start the blitter.

On STe there are 4 clock cycles (one bus cycle or a nop) of bus arbitration at the blitter start, and 4 additional clock cycles of bus arbitration at the end of blitting.

On Mega STe the blitter takes 4 extra clock cycles to start.
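Putting it together for the STe, the total time of a blit can be estimated as sketched below. Several assumptions are made here: the function name `ste_blit_clock_cycles` is hypothetical, every line is assumed to cost the same number of bus cycles, the blit is assumed to run without interruption (a single arbitration at the start and at the end), and the Mega STe start-up time is not modeled.

```python
CLOCKS_PER_BUS_CYCLE = 4  # a standard 68000 bus cycle takes four clock cycles


def ste_blit_clock_cycles(cycles_per_line, ycount):
    """Estimated clock cycles for a whole blit on STe."""
    start_arbitration = 1  # one bus cycle at the blitter start
    end_arbitration = 1    # one bus cycle at the end of blitting
    bus_cycles = start_arbitration + cycles_per_line * ycount + end_arbitration
    return bus_cycles * CLOCKS_PER_BUS_CYCLE


# Example: 16 lines at 5 bus cycles per line (as in the NFSR and FXSR
# examples above): 1 + 16 * 5 + 1 = 82 bus cycles.
print(ste_blit_clock_cycles(5, 16))
```

For this example the estimate is 82 bus cycles (nops), i.e. 328 clock cycles.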

References

This page is mainly sourced from the following discussion threads on atari-forum. A general thanks goes out to everyone who has ever written anything on the subject.

  • Blitter Execution Times [2]
  • STE Blitter & No Final Source Read [3]
  • Blitter startup time [4]