SSE2 SIMD implementation of Huffman encoding

Full-color compression speedups relative to libjpeg-turbo 1.4.2:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  2.2-18% (avg. 9.5%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  10-25% (avg. 17%)

2.3 GHz AMD A10-4600M APU, Linux, 64-bit:  4.9-17% (avg. 11%)
2.3 GHz AMD A10-4600M APU, Linux, 32-bit:  8.8-19% (avg. 15%)

3.0 GHz Intel Core i7, OS X, 64-bit:  3.5-16% (avg. 10%)
3.0 GHz Intel Core i7, OS X, 32-bit:  4.8-14% (avg. 11%)

2.6 GHz AMD Athlon 64 X2 5050e:
Performance-neutral (give or take a few percent)

Full-color compression speedups relative to IPP:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  4.8-34% (avg. 19%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  -19% to 7.0% (avg. -7.0%)

Refer to #42 for discussion.  Numerous other approaches were attempted,
but this one proved to be the most performant across all platforms.

This commit also fixes #3 (more precisely, works around it: the clang-compiled
version of jchuff.c still performs 20% worse than its GCC-compiled counterpart,
but that code is now bypassed by the new SSE2 Huffman algorithm.)

Based on:
2cb4d41330
36c94e050d
This commit is contained in:
DRC
2016-01-07 00:19:43 -06:00
parent eb59b6e72d
commit f3a8684cd1
18 changed files with 5157 additions and 84 deletions

View File

@@ -38,19 +38,7 @@ Build Requirements
NOTE: the NASM build will fail if texinfo is not installed.
- GCC v4.1 or later recommended for best performance
* Beginning with Xcode 4, Apple stopped distributing GCC and switched to
the LLVM compiler. Xcode v4.0 through v4.6 provides a GCC front end
called LLVM-GCC. Unfortunately, as of this writing, neither LLVM-GCC nor
the LLVM (clang) compiler produces optimal performance with libjpeg-turbo.
Building libjpeg-turbo with LLVM-GCC v4.2 results in a 10% performance
degradation when compressing using 64-bit code, relative to building
libjpeg-turbo with GCC v4.2. Building libjpeg-turbo with LLVM (clang)
results in a 20% performance degradation when compressing using 64-bit
code, relative to building libjpeg-turbo with GCC v4.2. If you are
running Snow Leopard or earlier, it is suggested that you continue to use
Xcode v3.2.6, which provides GCC v4.2. If you are using Lion or later, it
is suggested that you install Apple GCC v4.2 or GCC v5 through MacPorts.
- GCC v4.1 (or later) or clang recommended for best performance
- If building the TurboJPEG Java wrapper, JDK or OpenJDK 1.5 or later is
required. Some systems, such as Solaris 10 and later and Red Hat Enterprise
@@ -89,38 +77,38 @@ for 64-bit build instructions.)
This will generate the following files under .libs/:
**libjpeg.a**
**libjpeg.a**
Static link library for the libjpeg API
**libjpeg.so.{version}** (Linux, Unix)
**libjpeg.{version}.dylib** (OS X)
**cygjpeg-{version}.dll** (Cygwin)
**libjpeg.so.{version}** (Linux, Unix)
**libjpeg.{version}.dylib** (OS X)
**cygjpeg-{version}.dll** (Cygwin)
Shared library for the libjpeg API
By default, *{version}* is 62.1.0, 7.1.0, or 8.0.2, depending on whether
libjpeg v6b (default), v7, or v8 emulation is enabled. If using Cygwin,
*{version}* is 62, 7, or 8.
**libjpeg.so** (Linux, Unix)
**libjpeg.dylib** (OS X)
**libjpeg.so** (Linux, Unix)
**libjpeg.dylib** (OS X)
Development symlink for the libjpeg API
**libjpeg.dll.a** (Cygwin)
**libjpeg.dll.a** (Cygwin)
Import library for the libjpeg API
**libturbojpeg.a**
**libturbojpeg.a**
Static link library for the TurboJPEG API
**libturbojpeg.so.0.1.0** (Linux, Unix)
**libturbojpeg.0.1.0.dylib** (OS X)
**cygturbojpeg-0.dll** (Cygwin)
**libturbojpeg.so.0.1.0** (Linux, Unix)
**libturbojpeg.0.1.0.dylib** (OS X)
**cygturbojpeg-0.dll** (Cygwin)
Shared library for the TurboJPEG API
**libturbojpeg.so** (Linux, Unix)
**libturbojpeg.dylib** (OS X)
**libturbojpeg.so** (Linux, Unix)
**libturbojpeg.dylib** (OS X)
Development symlink for the TurboJPEG API
**libturbojpeg.dll.a** (Cygwin)
**libturbojpeg.dll.a** (Cygwin)
Import library for the TurboJPEG API
@@ -333,16 +321,16 @@ Set the following shell variables for simplicity:
IOS_SYSROOT=$IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk
IOS_GCC=$IOS_PLATFORMDIR/Developer/usr/bin/arm-apple-darwin10-llvm-gcc-4.2
*ARMv6 (code will run on all iOS devices, not SIMD-accelerated)*
*ARMv6 (code will run on all iOS devices, not SIMD-accelerated)*
[NOTE: Requires Xcode 4.4.x or earlier]
IOS_CFLAGS="-march=armv6 -mcpu=arm1176jzf-s -mfpu=vfp"
*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
IOS_CFLAGS="-march=armv7 -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon"
*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
[NOTE: Requires Xcode 4.5 or later]
IOS_CFLAGS="-march=armv7s -mcpu=swift -mtune=swift -mfpu=neon"
@@ -365,11 +353,11 @@ Set the following shell variables for simplicity:
IOS_SYSROOT=$IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk
IOS_GCC=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang
*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
IOS_CFLAGS="-arch armv7"
*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
IOS_CFLAGS="-arch armv7s"
@@ -527,22 +515,22 @@ on which version of cl.exe is in the `PATH`.
The following files will be generated under *{build_directory}*:
**jpeg-static.lib**
**jpeg-static.lib**
Static link library for the libjpeg API
**sharedlib/jpeg{version}.dll**
**sharedlib/jpeg{version}.dll**
DLL for the libjpeg API
**sharedlib/jpeg.lib**
**sharedlib/jpeg.lib**
Import library for the libjpeg API
**turbojpeg-static.lib**
**turbojpeg-static.lib**
Static link library for the TurboJPEG API
**turbojpeg.dll**
**turbojpeg.dll**
DLL for the TurboJPEG API
**turbojpeg.lib**
**turbojpeg.lib**
Import library for the TurboJPEG API
*{version}* is 62, 7, or 8, depending on whether libjpeg v6b (default), v7, or
@@ -569,22 +557,22 @@ build of libjpeg-turbo.
This will generate the following files under *{build_directory}*:
**{configuration}/jpeg-static.lib**
**{configuration}/jpeg-static.lib**
Static link library for the libjpeg API
**sharedlib/{configuration}/jpeg{version}.dll**
**sharedlib/{configuration}/jpeg{version}.dll**
DLL for the libjpeg API
**sharedlib/{configuration}/jpeg.lib**
**sharedlib/{configuration}/jpeg.lib**
Import library for the libjpeg API
**{configuration}/turbojpeg-static.lib**
**{configuration}/turbojpeg-static.lib**
Static link library for the TurboJPEG API
**{configuration}/turbojpeg.dll**
**{configuration}/turbojpeg.dll**
DLL for the TurboJPEG API
**{configuration}/turbojpeg.lib**
**{configuration}/turbojpeg.lib**
Import library for the TurboJPEG API
*{configuration}* is Debug, Release, RelWithDebInfo, or MinSizeRel, depending
@@ -603,22 +591,22 @@ cross-compiling on a Linux/Unix machine, then see "Build Recipes" below.
This will generate the following files under *{build_directory}*:
**libjpeg.a**
**libjpeg.a**
Static link library for the libjpeg API
**sharedlib/libjpeg-{version}.dll**
**sharedlib/libjpeg-{version}.dll**
DLL for the libjpeg API
**sharedlib/libjpeg.dll.a**
**sharedlib/libjpeg.dll.a**
Import library for the libjpeg API
**libturbojpeg.a**
**libturbojpeg.a**
Static link library for the TurboJPEG API
**libturbojpeg.dll**
**libturbojpeg.dll**
DLL for the TurboJPEG API
**libturbojpeg.dll.a**
**libturbojpeg.dll.a**
Import library for the TurboJPEG API
*{version}* is 62, 7, or 8, depending on whether libjpeg v6b (default), v7, or

View File

@@ -57,6 +57,16 @@ benchmark from outputting any images. This removes any potential operating
system overhead that might be caused by lazy writes to disk and thus improves
the consistency of the performance measurements.
[12] Added SIMD acceleration for Huffman encoding on SSE2-capable x86 and
x86-64 platforms. This speeds up the compression of full-color JPEGs by about
10-15% on average (relative to libjpeg-turbo 1.4.x) when using modern Intel and
AMD CPUs. Additionally, this works around an issue in the clang optimizer that
prevents it (as of this writing) from achieving the same performance as GCC
when compiling the C version of the Huffman encoder
(https://llvm.org/bugs/show_bug.cgi?id=16035). For the purposes of benchmarking
or regression testing, SIMD-accelerated Huffman encoding can be disabled by
setting the JSIMD_NOHUFFENC environment variable to 1.
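The JSIMD_NOHUFFENC switch described in this entry could be honored roughly as follows. This is a minimal sketch, not the code from this commit: the helper name simd_huffman_enabled and its cpu_has_sse2 parameter are hypothetical, and the sketch assumes that only the exact value "1" disables the SIMD path.

```c
#include <string.h>

/* Hypothetical helper (not part of libjpeg-turbo): decide whether the SIMD
 * Huffman encoder should be used.
 *   env_value    - result of getenv("JSIMD_NOHUFFENC"), may be NULL
 *   cpu_has_sse2 - nonzero if CPU detection reported SSE2 support
 * Assumption: only the exact string "1" disables the SIMD path. */
static int simd_huffman_enabled(const char *env_value, int cpu_has_sse2)
{
  if (env_value != NULL && strcmp(env_value, "1") == 0)
    return 0;                    /* explicitly disabled by the user */
  return cpu_has_sse2 ? 1 : 0;   /* otherwise follow CPU capability */
}
```

A caller would pass getenv("JSIMD_NOHUFFENC") as the first argument once, at start-up, so that the per-block encoding path never touches the environment.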
1.4.2
=====

View File

@@ -5,6 +5,7 @@
* Copyright (C) 1991-1997, Thomas G. Lane.
* libjpeg-turbo Modifications:
* Copyright (C) 2009-2011, 2014-2016 D. R. Commander.
* Copyright (C) 2015 Matthieu Darbois.
* For conditions of distribution and use, see the accompanying README.ijg
* file.
*
@@ -20,7 +21,7 @@
#define JPEG_INTERNALS
#include "jinclude.h"
#include "jpeglib.h"
#include "jchuff.h" /* Declarations shared with jcphuff.c */
#include "jsimd.h"
#include "jconfigint.h"
#include <limits.h>
@@ -108,6 +109,8 @@ typedef struct {
long * dc_count_ptrs[NUM_HUFF_TBLS];
long * ac_count_ptrs[NUM_HUFF_TBLS];
#endif
int simd;
} huff_entropy_encoder;
typedef huff_entropy_encoder * huff_entropy_ptr;
@@ -159,6 +162,8 @@ start_pass_huff (j_compress_ptr cinfo, boolean gather_statistics)
entropy->pub.finish_pass = finish_pass_huff;
}
entropy->simd = jsimd_can_huff_encode_one_block();
for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
compptr = cinfo->cur_comp_info[ci];
dctbl = compptr->dc_tbl_no;
@@ -480,6 +485,23 @@ flush_bits (working_state * state)
/* Encode a single block's worth of coefficients */
LOCAL(boolean)
encode_one_block_simd (working_state * state, JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl, c_derived_tbl *actbl)
{
JOCTET _buffer[BUFSIZE], *buffer;
size_t bytes, bytestocopy; int localbuf = 0;
LOAD_BUFFER()
buffer = jsimd_huff_encode_one_block(state, buffer, block, last_dc_val,
dctbl, actbl);
STORE_BUFFER()
return TRUE;
}
LOCAL(boolean)
encode_one_block (working_state * state, JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl, c_derived_tbl *actbl)
@@ -640,16 +662,30 @@ encode_mcu_huff (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
}
/* Encode the MCU data blocks */
for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
ci = cinfo->MCU_membership[blkn];
compptr = cinfo->cur_comp_info[ci];
if (! encode_one_block(&state,
MCU_data[blkn][0], state.cur.last_dc_val[ci],
entropy->dc_derived_tbls[compptr->dc_tbl_no],
entropy->ac_derived_tbls[compptr->ac_tbl_no]))
return FALSE;
/* Update last_dc_val */
state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
if (entropy->simd) {
for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
ci = cinfo->MCU_membership[blkn];
compptr = cinfo->cur_comp_info[ci];
if (! encode_one_block_simd(&state,
MCU_data[blkn][0], state.cur.last_dc_val[ci],
entropy->dc_derived_tbls[compptr->dc_tbl_no],
entropy->ac_derived_tbls[compptr->ac_tbl_no]))
return FALSE;
/* Update last_dc_val */
state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
}
} else {
for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
ci = cinfo->MCU_membership[blkn];
compptr = cinfo->cur_comp_info[ci];
if (! encode_one_block(&state,
MCU_data[blkn][0], state.cur.last_dc_val[ci],
entropy->dc_derived_tbls[compptr->dc_tbl_no],
entropy->ac_derived_tbls[compptr->ac_tbl_no]))
return FALSE;
/* Update last_dc_val */
state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
}
}
/* Completed MCU, so update state */
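The SIMD path bypasses emit_bits() in this file and uses the EMIT_BYTE, CHECKBUF47, and EMIT_BITS macros from the 64-bit assembly file in this commit instead. Those macros let the bit buffer fill until at least 48 bits are pending and only then write six bytes at once. The following C sketch illustrates that idea under stated assumptions: the bit_writer struct and both function names are hypothetical, and only the put_buffer/put_bits naming is borrowed from the source.

```c
#include <stdint.h>

/* Hypothetical bit writer; illustrative only. */
typedef struct {
  uint64_t put_buffer;   /* pending bits, most recent in the low-order end */
  int put_bits;          /* number of valid pending bits */
  unsigned char *out;    /* next output byte */
} bit_writer;

/* Write the oldest pending byte; JPEG requires a stuffed 0x00 after 0xFF. */
static void emit_byte(bit_writer *w)
{
  unsigned char c;

  w->put_bits -= 8;
  c = (unsigned char)(w->put_buffer >> w->put_bits);
  *w->out++ = c;
  if (c == 0xFF)
    *w->out++ = 0;
}

/* Append "size" bits of "code" (size <= 16).  The buffer is drained only
 * when more than 47 bits are pending, so at most 47 + 16 = 63 bits are ever
 * held in the 64-bit buffer. */
static void emit_bits(bit_writer *w, uint32_t code, int size)
{
  if (w->put_bits > 47) {
    int i;

    for (i = 0; i < 6; i++)
      emit_byte(w);
  }
  w->put_buffer = (w->put_buffer << size) | code;
  w->put_bits += size;
}
```

Checking the fill level before appending, rather than after, mirrors the order used by the 64-bit EMIT_BITS macro (CHECKBUF47 first, then PUT_BITS). A real encoder would also need a final flush step, which this sketch omits.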

View File

@@ -3,6 +3,7 @@
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2011, 2014 D. R. Commander
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -10,6 +11,8 @@
*
*/
#include "jchuff.h" /* Declarations shared with jcphuff.c */
EXTERN(int) jsimd_can_rgb_ycc (void);
EXTERN(int) jsimd_can_rgb_gray (void);
EXTERN(int) jsimd_can_ycc_rgb (void);
@@ -82,3 +85,9 @@ EXTERN(void) jsimd_h2v2_merged_upsample
EXTERN(void) jsimd_h2v1_merged_upsample
(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf);
EXTERN(int) jsimd_can_huff_encode_one_block (void);
EXTERN(JOCTET*) jsimd_huff_encode_one_block
(void * state, JOCTET *buffer, JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl, c_derived_tbl *actbl);

View File

@@ -3,6 +3,7 @@
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2009-2011, 2014 D. R. Commander
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -387,3 +388,16 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
{
}
GLOBAL(int)
jsimd_can_huff_encode_one_block (void)
{
return 0;
}
GLOBAL(JOCTET*)
jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
int last_dc_val, c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
return NULL;
}

View File

@@ -32,6 +32,7 @@
"Copyright (C) 2009-2016 D. R. Commander\n" \
"Copyright (C) 2009-2011 Nokia Corporation and/or its subsidiary(-ies)\n" \
"Copyright (C) 2013-2014 MIPS Technologies, Inc.\n" \
"Copyright (C) 2013 Linaro Limited"
"Copyright (C) 2013 Linaro Limited\n" \
"Copyright (C) 2015 Matthieu Darbois"
#define JCOPYRIGHT_SHORT "Copyright (C) 1991-2016 The libjpeg-turbo Project and many others"

View File

@@ -22,17 +22,19 @@ endif()
if(SIMD_X86_64)
set(SIMD_BASENAMES jfdctflt-sse-64 jccolor-sse2-64 jcgray-sse2-64
jcsample-sse2-64 jdcolor-sse2-64 jdmerge-sse2-64 jdsample-sse2-64
jfdctfst-sse2-64 jfdctint-sse2-64 jidctflt-sse2-64 jidctfst-sse2-64
jidctint-sse2-64 jidctred-sse2-64 jquantf-sse2-64 jquanti-sse2-64)
jchuff-sse2-64 jcsample-sse2-64 jdcolor-sse2-64 jdmerge-sse2-64
jdsample-sse2-64 jfdctfst-sse2-64 jfdctint-sse2-64 jidctflt-sse2-64
jidctfst-sse2-64 jidctint-sse2-64 jidctred-sse2-64 jquantf-sse2-64
jquanti-sse2-64)
message(STATUS "Building x86_64 SIMD extensions")
else()
set(SIMD_BASENAMES jsimdcpu jfdctflt-3dn jidctflt-3dn jquant-3dn jccolor-mmx
jcgray-mmx jcsample-mmx jdcolor-mmx jdmerge-mmx jdsample-mmx jfdctfst-mmx
jfdctint-mmx jidctfst-mmx jidctint-mmx jidctred-mmx jquant-mmx jfdctflt-sse
jidctflt-sse jquant-sse jccolor-sse2 jcgray-sse2 jcsample-sse2 jdcolor-sse2
jdmerge-sse2 jdsample-sse2 jfdctfst-sse2 jfdctint-sse2 jidctflt-sse2
jidctfst-sse2 jidctint-sse2 jidctred-sse2 jquantf-sse2 jquanti-sse2)
jidctflt-sse jquant-sse jccolor-sse2 jcgray-sse2 jchuff-sse2 jcsample-sse2
jdcolor-sse2 jdmerge-sse2 jdsample-sse2 jfdctfst-sse2 jfdctint-sse2
jidctflt-sse2 jidctfst-sse2 jidctint-sse2 jidctred-sse2 jquantf-sse2
jquanti-sse2)
message(STATUS "Building i386 SIMD extensions")
endif()

View File

@@ -13,11 +13,11 @@ if SIMD_X86_64
libsimd_la_SOURCES = jsimd_x86_64.c jsimd.h jsimdcfg.inc.h jsimdext.inc \
jcolsamp.inc jdct.inc jfdctflt-sse-64.asm \
jccolor-sse2-64.asm jcgray-sse2-64.asm jcsample-sse2-64.asm \
jdcolor-sse2-64.asm jdmerge-sse2-64.asm jdsample-sse2-64.asm \
jfdctfst-sse2-64.asm jfdctint-sse2-64.asm jidctflt-sse2-64.asm \
jidctfst-sse2-64.asm jidctint-sse2-64.asm jidctred-sse2-64.asm \
jquantf-sse2-64.asm jquanti-sse2-64.asm
jccolor-sse2-64.asm jcgray-sse2-64.asm jchuff-sse2-64.asm \
jcsample-sse2-64.asm jdcolor-sse2-64.asm jdmerge-sse2-64.asm \
jdsample-sse2-64.asm jfdctfst-sse2-64.asm jfdctint-sse2-64.asm \
jidctflt-sse2-64.asm jidctfst-sse2-64.asm jidctint-sse2-64.asm \
jidctred-sse2-64.asm jquantf-sse2-64.asm jquanti-sse2-64.asm
jccolor-sse2-64.lo: jccolext-sse2-64.asm
jcgray-sse2-64.lo: jcgryext-sse2-64.asm
@@ -36,11 +36,11 @@ libsimd_la_SOURCES = jsimd_i386.c jsimd.h jsimdcfg.inc.h jsimdext.inc \
jfdctfst-mmx.asm jfdctint-mmx.asm jidctfst-mmx.asm \
jidctint-mmx.asm jidctred-mmx.asm jquant-mmx.asm \
jfdctflt-sse.asm jidctflt-sse.asm jquant-sse.asm \
jccolor-sse2.asm jcgray-sse2.asm jcsample-sse2.asm \
jdcolor-sse2.asm jdmerge-sse2.asm jdsample-sse2.asm \
jfdctfst-sse2.asm jfdctint-sse2.asm jidctflt-sse2.asm \
jidctfst-sse2.asm jidctint-sse2.asm jidctred-sse2.asm \
jquantf-sse2.asm jquanti-sse2.asm
jccolor-sse2.asm jcgray-sse2.asm jchuff-sse2.asm \
jcsample-sse2.asm jdcolor-sse2.asm jdmerge-sse2.asm \
jdsample-sse2.asm jfdctfst-sse2.asm jfdctint-sse2.asm \
jidctflt-sse2.asm jidctfst-sse2.asm jidctint-sse2.asm \
jidctred-sse2.asm jquantf-sse2.asm jquanti-sse2.asm
jccolor-mmx.lo: jccolext-mmx.asm
jcgray-mmx.lo: jcgryext-mmx.asm

361
simd/jchuff-sse2-64.asm Normal file
View File

@@ -0,0 +1,361 @@
;
; jchuff-sse2-64.asm - Huffman entropy encoding (64-bit SSE2)
;
; Copyright 2009-2011, 2014-2016 D. R. Commander.
; Copyright 2015 Matthieu Darbois
;
; Based on
; x86 SIMD extension for IJG JPEG library
; Copyright (C) 1999-2006, MIYASAKA Masaru.
; For conditions of distribution and use, see copyright notice in jsimdext.inc
;
; This file should be assembled with NASM (Netwide Assembler); it
; can *not* be assembled with Microsoft's MASM or any compatible
; assembler (including Borland's Turbo Assembler).
; NASM is available from http://nasm.sourceforge.net/ or
; http://sourceforge.net/project/showfiles.php?group_id=6208
;
; This file contains an SSE2 implementation for Huffman coding of one block.
; The following code is based directly on jchuff.c; see jchuff.c for more
; details.
;
; [TAB8]
%include "jsimdext.inc"
; --------------------------------------------------------------------------
SECTION SEG_CONST
alignz 16
global EXTN(jconst_huff_encode_one_block)
EXTN(jconst_huff_encode_one_block):
%include "jpeg_nbits_table.inc"
alignz 16
; --------------------------------------------------------------------------
SECTION SEG_TEXT
BITS 64
; These macros perform the same task as the emit_bits() function in the
; original libjpeg code. In addition to reducing overhead by explicitly
; inlining the code, additional performance is achieved by taking into
; account the size of the bit buffer and waiting until it is almost full
; before emptying it. This mostly benefits 64-bit platforms, since 6
; bytes can be stored in a 64-bit bit buffer before it has to be emptied.
%macro EMIT_BYTE 0
sub put_bits, 8 ; put_bits -= 8;
mov rdx, put_buffer
mov ecx, put_bits
shr rdx, cl ; c = (JOCTET)GETJOCTET(put_buffer >> put_bits);
mov byte [buffer], dl ; *buffer++ = c;
add buffer, 1
cmp dl, 0xFF ; need to stuff a zero byte?
jne %%.EMIT_BYTE_END
mov byte [buffer], 0 ; *buffer++ = 0;
add buffer, 1
%%.EMIT_BYTE_END:
%endmacro
%macro PUT_BITS 1
add put_bits, ecx ; put_bits += size;
shl put_buffer, cl ; put_buffer = (put_buffer << size);
or put_buffer, %1
%endmacro
%macro CHECKBUF31 0
cmp put_bits, 32 ; if (put_bits > 31) {
jl %%.CHECKBUF31_END
EMIT_BYTE
EMIT_BYTE
EMIT_BYTE
EMIT_BYTE
%%.CHECKBUF31_END:
%endmacro
%macro CHECKBUF47 0
cmp put_bits, 48 ; if (put_bits > 47) {
jl %%.CHECKBUF47_END
EMIT_BYTE
EMIT_BYTE
EMIT_BYTE
EMIT_BYTE
EMIT_BYTE
EMIT_BYTE
%%.CHECKBUF47_END:
%endmacro
%macro EMIT_BITS 2
CHECKBUF47
mov ecx, %2
PUT_BITS %1
%endmacro
%macro kloop_prepare 37 ;(ko, jno0, ..., jno31, xmm0, xmm1, xmm2, xmm3)
pxor xmm8, xmm8 ; __m128i neg = _mm_setzero_si128();
pxor xmm9, xmm9 ; __m128i neg = _mm_setzero_si128();
pxor xmm10, xmm10 ; __m128i neg = _mm_setzero_si128();
pxor xmm11, xmm11 ; __m128i neg = _mm_setzero_si128();
pinsrw %34, word [r12 + %2 * SIZEOF_WORD], 0 ; xmm_shadow[0] = block[jno0];
pinsrw %35, word [r12 + %10 * SIZEOF_WORD], 0 ; xmm_shadow[8] = block[jno8];
pinsrw %36, word [r12 + %18 * SIZEOF_WORD], 0 ; xmm_shadow[16] = block[jno16];
pinsrw %37, word [r12 + %26 * SIZEOF_WORD], 0 ; xmm_shadow[24] = block[jno24];
pinsrw %34, word [r12 + %3 * SIZEOF_WORD], 1 ; xmm_shadow[1] = block[jno1];
pinsrw %35, word [r12 + %11 * SIZEOF_WORD], 1 ; xmm_shadow[9] = block[jno9];
pinsrw %36, word [r12 + %19 * SIZEOF_WORD], 1 ; xmm_shadow[17] = block[jno17];
pinsrw %37, word [r12 + %27 * SIZEOF_WORD], 1 ; xmm_shadow[25] = block[jno25];
pinsrw %34, word [r12 + %4 * SIZEOF_WORD], 2 ; xmm_shadow[2] = block[jno2];
pinsrw %35, word [r12 + %12 * SIZEOF_WORD], 2 ; xmm_shadow[10] = block[jno10];
pinsrw %36, word [r12 + %20 * SIZEOF_WORD], 2 ; xmm_shadow[18] = block[jno18];
pinsrw %37, word [r12 + %28 * SIZEOF_WORD], 2 ; xmm_shadow[26] = block[jno26];
pinsrw %34, word [r12 + %5 * SIZEOF_WORD], 3 ; xmm_shadow[3] = block[jno3];
pinsrw %35, word [r12 + %13 * SIZEOF_WORD], 3 ; xmm_shadow[11] = block[jno11];
pinsrw %36, word [r12 + %21 * SIZEOF_WORD], 3 ; xmm_shadow[19] = block[jno19];
pinsrw %37, word [r12 + %29 * SIZEOF_WORD], 3 ; xmm_shadow[27] = block[jno27];
pinsrw %34, word [r12 + %6 * SIZEOF_WORD], 4 ; xmm_shadow[4] = block[jno4];
pinsrw %35, word [r12 + %14 * SIZEOF_WORD], 4 ; xmm_shadow[12] = block[jno12];
pinsrw %36, word [r12 + %22 * SIZEOF_WORD], 4 ; xmm_shadow[20] = block[jno20];
pinsrw %37, word [r12 + %30 * SIZEOF_WORD], 4 ; xmm_shadow[28] = block[jno28];
pinsrw %34, word [r12 + %7 * SIZEOF_WORD], 5 ; xmm_shadow[5] = block[jno5];
pinsrw %35, word [r12 + %15 * SIZEOF_WORD], 5 ; xmm_shadow[13] = block[jno13];
pinsrw %36, word [r12 + %23 * SIZEOF_WORD], 5 ; xmm_shadow[21] = block[jno21];
pinsrw %37, word [r12 + %31 * SIZEOF_WORD], 5 ; xmm_shadow[29] = block[jno29];
pinsrw %34, word [r12 + %8 * SIZEOF_WORD], 6 ; xmm_shadow[6] = block[jno6];
pinsrw %35, word [r12 + %16 * SIZEOF_WORD], 6 ; xmm_shadow[14] = block[jno14];
pinsrw %36, word [r12 + %24 * SIZEOF_WORD], 6 ; xmm_shadow[22] = block[jno22];
pinsrw %37, word [r12 + %32 * SIZEOF_WORD], 6 ; xmm_shadow[30] = block[jno30];
pinsrw %34, word [r12 + %9 * SIZEOF_WORD], 7 ; xmm_shadow[7] = block[jno7];
pinsrw %35, word [r12 + %17 * SIZEOF_WORD], 7 ; xmm_shadow[15] = block[jno15];
pinsrw %36, word [r12 + %25 * SIZEOF_WORD], 7 ; xmm_shadow[23] = block[jno23];
%if %1 != 32
pinsrw %37, word [r12 + %33 * SIZEOF_WORD], 7 ; xmm_shadow[31] = block[jno31];
%else
pinsrw %37, ebx, 7 ; xmm_shadow[31] = block[jno31];
%endif
pcmpgtw xmm8, %34 ; neg = _mm_cmpgt_epi16(neg, x1);
pcmpgtw xmm9, %35 ; neg = _mm_cmpgt_epi16(neg, x1);
pcmpgtw xmm10, %36 ; neg = _mm_cmpgt_epi16(neg, x1);
pcmpgtw xmm11, %37 ; neg = _mm_cmpgt_epi16(neg, x1);
paddw %34, xmm8 ; x1 = _mm_add_epi16(x1, neg);
paddw %35, xmm9 ; x1 = _mm_add_epi16(x1, neg);
paddw %36, xmm10 ; x1 = _mm_add_epi16(x1, neg);
paddw %37, xmm11 ; x1 = _mm_add_epi16(x1, neg);
pxor %34, xmm8 ; x1 = _mm_xor_si128(x1, neg);
pxor %35, xmm9 ; x1 = _mm_xor_si128(x1, neg);
pxor %36, xmm10 ; x1 = _mm_xor_si128(x1, neg);
pxor %37, xmm11 ; x1 = _mm_xor_si128(x1, neg);
pxor xmm8, %34 ; neg = _mm_xor_si128(neg, x1);
pxor xmm9, %35 ; neg = _mm_xor_si128(neg, x1);
pxor xmm10, %36 ; neg = _mm_xor_si128(neg, x1);
pxor xmm11, %37 ; neg = _mm_xor_si128(neg, x1);
movdqa XMMWORD [t1 + %1 * SIZEOF_WORD], %34 ; _mm_storeu_si128((__m128i *)(t1 + ko), x1);
movdqa XMMWORD [t1 + (%1 + 8) * SIZEOF_WORD], %35 ; _mm_storeu_si128((__m128i *)(t1 + ko + 8), x1);
movdqa XMMWORD [t1 + (%1 + 16) * SIZEOF_WORD], %36 ; _mm_storeu_si128((__m128i *)(t1 + ko + 16), x1);
movdqa XMMWORD [t1 + (%1 + 24) * SIZEOF_WORD], %37 ; _mm_storeu_si128((__m128i *)(t1 + ko + 24), x1);
movdqa XMMWORD [t2 + %1 * SIZEOF_WORD], xmm8 ; _mm_storeu_si128((__m128i *)(t2 + ko), neg);
movdqa XMMWORD [t2 + (%1 + 8) * SIZEOF_WORD], xmm9 ; _mm_storeu_si128((__m128i *)(t2 + ko + 8), neg);
movdqa XMMWORD [t2 + (%1 + 16) * SIZEOF_WORD], xmm10 ; _mm_storeu_si128((__m128i *)(t2 + ko + 16), neg);
movdqa XMMWORD [t2 + (%1 + 24) * SIZEOF_WORD], xmm11 ; _mm_storeu_si128((__m128i *)(t2 + ko + 24), neg);
%endmacro
;
; Encode a single block's worth of coefficients.
;
; GLOBAL(JOCTET*)
; jsimd_huff_encode_one_block_sse2 (working_state * state, JOCTET *buffer,
; JCOEFPTR block, int last_dc_val,
; c_derived_tbl *dctbl, c_derived_tbl *actbl)
;
; r10 = working_state *state
; r11 = JOCTET *buffer
; r12 = JCOEFPTR block
; r13 = int last_dc_val
; r14 = c_derived_tbl *dctbl
; r15 = c_derived_tbl *actbl
%define t1 rbp-(DCTSIZE2*SIZEOF_WORD)
%define t2 t1-(DCTSIZE2*SIZEOF_WORD)
%define put_buffer r8
%define put_bits r9d
%define buffer rax
align 16
global EXTN(jsimd_huff_encode_one_block_sse2)
EXTN(jsimd_huff_encode_one_block_sse2):
push rbp
mov rax,rsp ; rax = original rbp
sub rsp, byte 4
and rsp, byte (-SIZEOF_XMMWORD) ; align to 128 bits
mov [rsp],rax
mov rbp,rsp ; rbp = aligned rbp
lea rsp, [t2]
collect_args
%ifdef WIN64
sub rsp, 4*SIZEOF_XMMWORD
movaps XMMWORD [rsp-3*SIZEOF_XMMWORD], xmm8
movaps XMMWORD [rsp-2*SIZEOF_XMMWORD], xmm9
movaps XMMWORD [rsp-1*SIZEOF_XMMWORD], xmm10
movaps XMMWORD [rsp-0*SIZEOF_XMMWORD], xmm11
%endif
push rbx
mov buffer, r11 ; r11 is now scratch
mov put_buffer, MMWORD [r10+16] ; put_buffer = state->cur.put_buffer;
mov put_bits, DWORD [r10+24] ; put_bits = state->cur.put_bits;
push r10 ; r10 is now scratch
; Encode the DC coefficient difference per section F.1.2.1
movsx edi, word [r12] ; temp = temp2 = block[0] - last_dc_val;
sub edi, r13d ; r13 is not used anymore
mov ebx, edi
; This is a well-known technique for obtaining the absolute value
; without a branch. It is derived from an assembly language technique
; presented in "How to Optimize for the Pentium Processors",
; Copyright (c) 1996, 1997 by Agner Fog.
mov esi, edi
sar esi, 31 ; temp3 = temp >> (CHAR_BIT * sizeof(int) - 1);
xor edi, esi ; temp ^= temp3;
sub edi, esi ; temp -= temp3;
; For a negative input, want temp2 = bitwise complement of abs(input)
; This code assumes we are on a two's complement machine
add ebx, esi ; temp2 += temp3;
; Find the number of bits needed for the magnitude of the coefficient
lea r11, [rel jpeg_nbits_table]
movzx rdi, byte [r11 + rdi] ; nbits = JPEG_NBITS(temp);
; Emit the Huffman-coded symbol for the number of bits
mov r11d, INT [r14 + rdi * 4] ; code = dctbl->ehufco[nbits];
movzx esi, byte [r14 + rdi + 1024] ; size = dctbl->ehufsi[nbits];
EMIT_BITS r11, esi ; EMIT_BITS(code, size)
; Mask off any extra bits in code
mov esi, 1
mov ecx, edi
shl esi, cl
dec esi
and ebx, esi ; temp2 &= (((JLONG) 1)<<nbits) - 1;
; Emit that number of bits of the value, if positive,
; or the complement of its magnitude, if negative.
EMIT_BITS rbx, edi ; EMIT_BITS(temp2, nbits)
; Prepare data
xor ebx, ebx
kloop_prepare 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, \
18, 11, 4, 5, 12, 19, 26, 33, 40, 48, 41, 34, \
27, 20, 13, 6, 7, 14, 21, 28, 35, \
xmm0, xmm1, xmm2, xmm3
kloop_prepare 32, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, \
30, 37, 44, 51, 58, 59, 52, 45, 38, 31, 39, 46, \
53, 60, 61, 54, 47, 55, 62, 63, 63, \
xmm4, xmm5, xmm6, xmm7
pxor xmm8, xmm8
pcmpeqw xmm0, xmm8 ; tmp0 = _mm_cmpeq_epi16(tmp0, zero);
pcmpeqw xmm1, xmm8 ; tmp1 = _mm_cmpeq_epi16(tmp1, zero);
pcmpeqw xmm2, xmm8 ; tmp2 = _mm_cmpeq_epi16(tmp2, zero);
pcmpeqw xmm3, xmm8 ; tmp3 = _mm_cmpeq_epi16(tmp3, zero);
pcmpeqw xmm4, xmm8 ; tmp4 = _mm_cmpeq_epi16(tmp4, zero);
pcmpeqw xmm5, xmm8 ; tmp5 = _mm_cmpeq_epi16(tmp5, zero);
pcmpeqw xmm6, xmm8 ; tmp6 = _mm_cmpeq_epi16(tmp6, zero);
pcmpeqw xmm7, xmm8 ; tmp7 = _mm_cmpeq_epi16(tmp7, zero);
packsswb xmm0, xmm1 ; tmp0 = _mm_packs_epi16(tmp0, tmp1);
packsswb xmm2, xmm3 ; tmp2 = _mm_packs_epi16(tmp2, tmp3);
packsswb xmm4, xmm5 ; tmp4 = _mm_packs_epi16(tmp4, tmp5);
packsswb xmm6, xmm7 ; tmp6 = _mm_packs_epi16(tmp6, tmp7);
pmovmskb r11d, xmm0 ; index = ((uint64_t)_mm_movemask_epi8(tmp0)) << 0;
pmovmskb r12d, xmm2 ; index = ((uint64_t)_mm_movemask_epi8(tmp2)) << 16;
pmovmskb r13d, xmm4 ; index = ((uint64_t)_mm_movemask_epi8(tmp4)) << 32;
pmovmskb r14d, xmm6 ; index = ((uint64_t)_mm_movemask_epi8(tmp6)) << 48;
shl r12, 16
shl r14, 16
or r11, r12
or r13, r14
shl r13, 32
or r11, r13
not r11 ; index = ~index;
;mov MMWORD [ t1 + DCTSIZE2 * SIZEOF_WORD ], r11
;jmp .EFN
mov r13d, INT [r15 + 240 * 4] ; code_0xf0 = actbl->ehufco[0xf0];
movzx r14d, byte [r15 + 1024 + 240] ; size_0xf0 = actbl->ehufsi[0xf0];
lea rsi, [t1]
.BLOOP:
bsf r12, r11 ; r = __builtin_ctzl(index);
jz .ELOOP
mov rcx, r12
lea rsi, [rsi+r12*2] ; k += r;
shr r11, cl ; index >>= r;
movzx rdi, word [rsi] ; temp = t1[k];
lea rbx, [rel jpeg_nbits_table]
movzx rdi, byte [rbx + rdi] ; nbits = JPEG_NBITS(temp);
.BRLOOP:
cmp r12, 16 ; while (r > 15) {
jl .ERLOOP
EMIT_BITS r13, r14d ; EMIT_BITS(code_0xf0, size_0xf0)
sub r12, 16 ; r -= 16;
jmp .BRLOOP
.ERLOOP:
; Emit Huffman symbol for run length / number of bits
CHECKBUF31 ; uses rcx, rdx
shl r12, 4 ; temp3 = (r << 4) + nbits;
add r12, rdi
mov ebx, INT [r15 + r12 * 4] ; code = actbl->ehufco[temp3];
movzx ecx, byte [r15 + r12 + 1024] ; size = actbl->ehufsi[temp3];
PUT_BITS rbx
;EMIT_CODE(code, size)
movsx ebx, word [rsi-DCTSIZE2*2] ; temp2 = t2[k];
; Mask off any extra bits in code
mov rcx, rdi
mov rdx, 1
shl rdx, cl
dec rdx
and rbx, rdx ; temp2 &= (((JLONG) 1)<<nbits) - 1;
PUT_BITS rbx ; PUT_BITS(temp2, nbits)
shr r11, 1 ; index >>= 1;
add rsi, 2 ; ++k;
jmp .BLOOP
.ELOOP:
; If the last coef(s) were zero, emit an end-of-block code
lea rdi, [t1 + (DCTSIZE2-1) * 2] ; r = DCTSIZE2-1-k;
cmp rdi, rsi ; if (r > 0) {
je .EFN
mov ebx, INT [r15] ; code = actbl->ehufco[0];
movzx r12d, byte [r15 + 1024] ; size = actbl->ehufsi[0];
EMIT_BITS rbx, r12d
.EFN:
pop r10
; Save put_buffer & put_bits
mov MMWORD [r10+16], put_buffer ; state->cur.put_buffer = put_buffer;
mov DWORD [r10+24], put_bits ; state->cur.put_bits = put_bits;
pop rbx
%ifdef WIN64
movaps xmm8, XMMWORD [rsp-3*SIZEOF_XMMWORD]
movaps xmm9, XMMWORD [rsp-2*SIZEOF_XMMWORD]
movaps xmm10, XMMWORD [rsp-1*SIZEOF_XMMWORD]
movaps xmm11, XMMWORD [rsp-0*SIZEOF_XMMWORD]
add rsp, 4*SIZEOF_XMMWORD
%endif
uncollect_args
mov rsp,rbp ; rsp <- aligned rbp
pop rsp ; rsp <- original rbp
pop rbp
ret
; For some reason, the OS X linker does not honor the request to align the
; segment unless we do this.
align 16
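The AC loop in the assembly above (.BLOOP) avoids testing each coefficient for zero. It first builds a 64-bit mask with one bit per zigzag-ordered AC coefficient, then uses BSF to jump straight to the next nonzero one. The scalar C sketch below illustrates that scanning scheme only; it is not a translation of the assembly. The names ctz64 and scan_nonzero are hypothetical, the mask is built with a plain loop instead of pcmpeqw/packsswb/pmovmskb, and the input is assumed to be in zigzag order already.

```c
#include <stdint.h>

/* Count trailing zero bits; x must be nonzero.  Portable stand-in for the
 * BSF instruction used in the assembly. */
static int ctz64(uint64_t x)
{
  int n = 0;

  while ((x & 1) == 0) {
    x >>= 1;
    n++;
  }
  return n;
}

/* zz[0..63] holds coefficients in zigzag order (zz[0] is the DC term).
 * For each nonzero AC coefficient, record the length of the zero run that
 * precedes it in runs[] and its zigzag position in pos[].
 * Returns the number of nonzero AC coefficients found. */
static int scan_nonzero(const int16_t zz[64], int runs[63], int pos[63])
{
  uint64_t index = 0;
  int k, count = 0;

  /* Bit (k - 1) is set when AC coefficient k is nonzero. */
  for (k = 1; k < 64; k++)
    if (zz[k] != 0)
      index |= (uint64_t)1 << (k - 1);

  k = 1;
  while (index != 0) {
    int r = ctz64(index);   /* r = length of the zero run */

    k += r;
    runs[count] = r;
    pos[count] = k;
    count++;
    index >>= r;            /* skip the run ... */
    index >>= 1;            /* ... and the coefficient just handled */
    k++;
  }
  return count;
}
```

An encoder built on this would emit the 0xF0 (ZRL) symbol for every 16 zeros in a run before coding the coefficient itself, and an end-of-block code if the last nonzero position is below 63.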

427
simd/jchuff-sse2.asm Normal file
View File

@@ -0,0 +1,427 @@
;
; jchuff-sse2.asm - Huffman entropy encoding (SSE2)
;
; Copyright 2009-2011, 2014-2016 D. R. Commander.
; Copyright 2015 Matthieu Darbois
;
; Based on
; x86 SIMD extension for IJG JPEG library
; Copyright (C) 1999-2006, MIYASAKA Masaru.
; For conditions of distribution and use, see copyright notice in jsimdext.inc
;
; This file should be assembled with NASM (Netwide Assembler); it
; can *not* be assembled with Microsoft's MASM or any compatible
; assembler (including Borland's Turbo Assembler).
; NASM is available from http://nasm.sourceforge.net/ or
; http://sourceforge.net/project/showfiles.php?group_id=6208
;
; This file contains an SSE2 implementation for Huffman coding of one block.
; The following code is based directly on jchuff.c; see jchuff.c for more
; details.
;
; [TAB8]
%include "jsimdext.inc"
; --------------------------------------------------------------------------
SECTION SEG_CONST
alignz 16
global EXTN(jconst_huff_encode_one_block)
EXTN(jconst_huff_encode_one_block):
%include "jpeg_nbits_table.inc"
alignz 16
; --------------------------------------------------------------------------
SECTION SEG_TEXT
BITS 32
; These macros perform the same task as the emit_bits() function in the
; original libjpeg code. In addition to reducing overhead by explicitly
; inlining the code, additional performance is achieved by taking into
; account the size of the bit buffer and waiting until it is almost full
; before emptying it. This mostly benefits 64-bit platforms, since 6
; bytes can be stored in a 64-bit bit buffer before it has to be emptied.
%macro EMIT_BYTE 0
sub put_bits, 8 ; put_bits -= 8;
mov edx, put_buffer
mov ecx, put_bits
shr edx, cl ; c = (JOCTET)GETJOCTET(put_buffer >> put_bits);
mov byte [eax], dl ; *buffer++ = c;
add eax, 1
cmp dl, 0xFF ; need to stuff a zero byte?
jne %%.EMIT_BYTE_END
mov byte [eax], 0 ; *buffer++ = 0;
add eax, 1
%%.EMIT_BYTE_END:
%endmacro
%macro PUT_BITS 1
add put_bits, ecx ; put_bits += size;
shl put_buffer, cl ; put_buffer = (put_buffer << size);
or put_buffer, %1
%endmacro
%macro CHECKBUF15 0
cmp put_bits, 16 ; if (put_bits > 15) {
jl %%.CHECKBUF15_END
mov eax, POINTER [esp+buffer]
EMIT_BYTE
EMIT_BYTE
mov POINTER [esp+buffer], eax
%%.CHECKBUF15_END:
%endmacro
%macro EMIT_BITS 1
PUT_BITS %1
CHECKBUF15
%endmacro
%macro kloop_prepare 37 ;(ko, jno0, ..., jno31, xmm0, xmm1, xmm2, xmm3)
pxor xmm4, xmm4 ; __m128i neg = _mm_setzero_si128();
pxor xmm5, xmm5 ; __m128i neg = _mm_setzero_si128();
pxor xmm6, xmm6 ; __m128i neg = _mm_setzero_si128();
pxor xmm7, xmm7 ; __m128i neg = _mm_setzero_si128();
pinsrw %34, word [esi + %2 * SIZEOF_WORD], 0 ; xmm_shadow[0] = block[jno0];
pinsrw %35, word [esi + %10 * SIZEOF_WORD], 0 ; xmm_shadow[8] = block[jno8];
pinsrw %36, word [esi + %18 * SIZEOF_WORD], 0 ; xmm_shadow[16] = block[jno16];
pinsrw %37, word [esi + %26 * SIZEOF_WORD], 0 ; xmm_shadow[24] = block[jno24];
pinsrw %34, word [esi + %3 * SIZEOF_WORD], 1 ; xmm_shadow[1] = block[jno1];
pinsrw %35, word [esi + %11 * SIZEOF_WORD], 1 ; xmm_shadow[9] = block[jno9];
pinsrw %36, word [esi + %19 * SIZEOF_WORD], 1 ; xmm_shadow[17] = block[jno17];
pinsrw %37, word [esi + %27 * SIZEOF_WORD], 1 ; xmm_shadow[25] = block[jno25];
pinsrw %34, word [esi + %4 * SIZEOF_WORD], 2 ; xmm_shadow[2] = block[jno2];
pinsrw %35, word [esi + %12 * SIZEOF_WORD], 2 ; xmm_shadow[10] = block[jno10];
pinsrw %36, word [esi + %20 * SIZEOF_WORD], 2 ; xmm_shadow[18] = block[jno18];
pinsrw %37, word [esi + %28 * SIZEOF_WORD], 2 ; xmm_shadow[26] = block[jno26];
pinsrw %34, word [esi + %5 * SIZEOF_WORD], 3 ; xmm_shadow[3] = block[jno3];
pinsrw %35, word [esi + %13 * SIZEOF_WORD], 3 ; xmm_shadow[11] = block[jno11];
pinsrw %36, word [esi + %21 * SIZEOF_WORD], 3 ; xmm_shadow[19] = block[jno19];
pinsrw %37, word [esi + %29 * SIZEOF_WORD], 3 ; xmm_shadow[27] = block[jno27];
pinsrw %34, word [esi + %6 * SIZEOF_WORD], 4 ; xmm_shadow[4] = block[jno4];
pinsrw %35, word [esi + %14 * SIZEOF_WORD], 4 ; xmm_shadow[12] = block[jno12];
pinsrw %36, word [esi + %22 * SIZEOF_WORD], 4 ; xmm_shadow[20] = block[jno20];
pinsrw %37, word [esi + %30 * SIZEOF_WORD], 4 ; xmm_shadow[28] = block[jno28];
pinsrw %34, word [esi + %7 * SIZEOF_WORD], 5 ; xmm_shadow[5] = block[jno5];
pinsrw %35, word [esi + %15 * SIZEOF_WORD], 5 ; xmm_shadow[13] = block[jno13];
pinsrw %36, word [esi + %23 * SIZEOF_WORD], 5 ; xmm_shadow[21] = block[jno21];
pinsrw %37, word [esi + %31 * SIZEOF_WORD], 5 ; xmm_shadow[29] = block[jno29];
pinsrw %34, word [esi + %8 * SIZEOF_WORD], 6 ; xmm_shadow[6] = block[jno6];
pinsrw %35, word [esi + %16 * SIZEOF_WORD], 6 ; xmm_shadow[14] = block[jno14];
pinsrw %36, word [esi + %24 * SIZEOF_WORD], 6 ; xmm_shadow[22] = block[jno22];
pinsrw %37, word [esi + %32 * SIZEOF_WORD], 6 ; xmm_shadow[30] = block[jno30];
pinsrw %34, word [esi + %9 * SIZEOF_WORD], 7 ; xmm_shadow[7] = block[jno7];
pinsrw %35, word [esi + %17 * SIZEOF_WORD], 7 ; xmm_shadow[15] = block[jno15];
pinsrw %36, word [esi + %25 * SIZEOF_WORD], 7 ; xmm_shadow[23] = block[jno23];
%if %1 != 32
pinsrw %37, word [esi + %33 * SIZEOF_WORD], 7 ; xmm_shadow[31] = block[jno31];
%else
pinsrw %37, ecx, 7 ; xmm_shadow[31] = block[jno31];
%endif
pcmpgtw xmm4, %34 ; neg = _mm_cmpgt_epi16(neg, x1);
pcmpgtw xmm5, %35 ; neg = _mm_cmpgt_epi16(neg, x1);
pcmpgtw xmm6, %36 ; neg = _mm_cmpgt_epi16(neg, x1);
pcmpgtw xmm7, %37 ; neg = _mm_cmpgt_epi16(neg, x1);
paddw %34, xmm4 ; x1 = _mm_add_epi16(x1, neg);
paddw %35, xmm5 ; x1 = _mm_add_epi16(x1, neg);
paddw %36, xmm6 ; x1 = _mm_add_epi16(x1, neg);
paddw %37, xmm7 ; x1 = _mm_add_epi16(x1, neg);
pxor %34, xmm4 ; x1 = _mm_xor_si128(x1, neg);
pxor %35, xmm5 ; x1 = _mm_xor_si128(x1, neg);
pxor %36, xmm6 ; x1 = _mm_xor_si128(x1, neg);
pxor %37, xmm7 ; x1 = _mm_xor_si128(x1, neg);
pxor xmm4, %34 ; neg = _mm_xor_si128(neg, x1);
pxor xmm5, %35 ; neg = _mm_xor_si128(neg, x1);
pxor xmm6, %36 ; neg = _mm_xor_si128(neg, x1);
pxor xmm7, %37 ; neg = _mm_xor_si128(neg, x1);
movdqa XMMWORD [esp + t1 + %1 * SIZEOF_WORD], %34 ; _mm_storeu_si128((__m128i *)(t1 + ko), x1);
movdqa XMMWORD [esp + t1 + (%1 + 8) * SIZEOF_WORD], %35 ; _mm_storeu_si128((__m128i *)(t1 + ko + 8), x1);
movdqa XMMWORD [esp + t1 + (%1 + 16) * SIZEOF_WORD], %36 ; _mm_storeu_si128((__m128i *)(t1 + ko + 16), x1);
movdqa XMMWORD [esp + t1 + (%1 + 24) * SIZEOF_WORD], %37 ; _mm_storeu_si128((__m128i *)(t1 + ko + 24), x1);
movdqa XMMWORD [esp + t2 + %1 * SIZEOF_WORD], xmm4 ; _mm_storeu_si128((__m128i *)(t2 + ko), neg);
movdqa XMMWORD [esp + t2 + (%1 + 8) * SIZEOF_WORD], xmm5 ; _mm_storeu_si128((__m128i *)(t2 + ko + 8), neg);
movdqa XMMWORD [esp + t2 + (%1 + 16) * SIZEOF_WORD], xmm6 ; _mm_storeu_si128((__m128i *)(t2 + ko + 16), neg);
movdqa XMMWORD [esp + t2 + (%1 + 24) * SIZEOF_WORD], xmm7 ; _mm_storeu_si128((__m128i *)(t2 + ko + 24), neg);
%endmacro
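A scalar C sketch of what kloop_prepare computes may help when reading the SIMD version. It is not the library's code: the function name `prepare_ac` is hypothetical, and the index table is copied from the two kloop_prepare invocations later in this file. For each AC coefficient in zigzag order it stores the magnitude in t1 and, in t2, either the value itself (if non-negative) or the bitwise complement of the magnitude (if negative).

```c
#include <stdint.h>

/* Natural-order index of each AC coefficient, in zigzag order
 * (taken from the kloop_prepare argument lists in this file). */
static const uint8_t ac_zigzag[63] = {
   1,  8, 16,  9,  2,  3, 10, 17, 24, 32, 25, 18, 11,  4,  5, 12,
  19, 26, 33, 40, 48, 41, 34, 27, 20, 13,  6,  7, 14, 21, 28, 35,
  42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51, 58,
  59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63
};

/* Hypothetical scalar equivalent of the two kloop_prepare passes. */
static void prepare_ac(const int16_t block[64], int16_t t1[64],
                       int16_t t2[64])
{
  int k;

  for (k = 0; k < 63; k++) {
    int x = block[ac_zigzag[k]];
    int neg = -(x < 0);          /* 0, or -1 (all ones) if x is negative */
    int mag = (x + neg) ^ neg;   /* |x| without a branch */

    t1[k] = (int16_t)mag;
    t2[k] = (int16_t)(neg ^ mag); /* x if x >= 0, else ~|x| */
  }
  /* The last slot is padding; the assembly fills it with zero. */
  t1[63] = 0;
  t2[63] = 0;
}
```

The sketch assumes coefficient magnitudes stay well inside the 16-bit range, which is what the 16-bit SIMD arithmetic relies on as well.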
;
; Encode a single block's worth of coefficients.
;
; GLOBAL(JOCTET*)
; jsimd_huff_encode_one_block_sse2 (working_state * state, JOCTET *buffer,
; JCOEFPTR block, int last_dc_val,
; c_derived_tbl *dctbl, c_derived_tbl *actbl)
;
; eax + 8 = working_state *state
; eax + 12 = JOCTET *buffer
; eax + 16 = JCOEFPTR block
; eax + 20 = int last_dc_val
; eax + 24 = c_derived_tbl *dctbl
; eax + 28 = c_derived_tbl *actbl
%define pad 6*SIZEOF_DWORD ; Align to 16 bytes
%define t1 pad
%define t2 t1+(DCTSIZE2*SIZEOF_WORD)
%define block t2+(DCTSIZE2*SIZEOF_WORD)
%define actbl block+SIZEOF_DWORD
%define buffer actbl+SIZEOF_DWORD
%define temp buffer+SIZEOF_DWORD
%define temp2 temp+SIZEOF_DWORD
%define temp3 temp2+SIZEOF_DWORD
%define temp4 temp3+SIZEOF_DWORD
%define temp5 temp4+SIZEOF_DWORD
%define gotptr temp5+SIZEOF_DWORD ; void * gotptr
%define put_buffer ebx
%define put_bits edi
align 16
global EXTN(jsimd_huff_encode_one_block_sse2)
EXTN(jsimd_huff_encode_one_block_sse2):
push ebp
mov eax,esp ; eax = original esp (points at the saved ebp)
sub esp, byte 4
and esp, byte (-SIZEOF_XMMWORD) ; align to 128 bits
mov [esp],eax
mov ebp,esp ; ebp = aligned ebp
sub esp, temp5+9*SIZEOF_DWORD-pad
push ebx
push ecx
; push edx ; need not be preserved
push esi
push edi
push ebp
mov esi, POINTER [eax+8] ; (working_state *state)
mov put_buffer, DWORD [esi+8] ; put_buffer = state->cur.put_buffer;
mov put_bits, DWORD [esi+12] ; put_bits = state->cur.put_bits;
push esi ; esi is now scratch
get_GOT edx ; get GOT address
movpic POINTER [esp+gotptr], edx ; save GOT address
mov ecx, POINTER [eax+28]
mov edx, POINTER [eax+16]
mov esi, POINTER [eax+12]
mov POINTER [esp+actbl], ecx
mov POINTER [esp+block], edx
mov POINTER [esp+buffer], esi
; Encode the DC coefficient difference per section F.1.2.1
mov esi, POINTER [esp+block] ; block
movsx ecx, word [esi] ; temp = temp2 = block[0] - last_dc_val;
sub ecx, DWORD [eax+20]
mov esi, ecx
; This is a well-known technique for obtaining the absolute value
; without a branch. It is derived from an assembly language technique
; presented in "How to Optimize for the Pentium Processors",
; Copyright (c) 1996, 1997 by Agner Fog.
mov edx, ecx
sar edx, 31 ; temp3 = temp >> (CHAR_BIT * sizeof(int) - 1);
xor ecx, edx ; temp ^= temp3;
sub ecx, edx ; temp -= temp3;
; For a negative input, want temp2 = bitwise complement of abs(input)
; This code assumes we are on a two's complement machine
add esi, edx ; temp2 += temp3;
mov DWORD [esp+temp], esi ; backup temp2 in temp
; Find the number of bits needed for the magnitude of the coefficient
movpic ebp, POINTER [esp+gotptr] ; load GOT address (ebp)
movzx edx, byte [GOTOFF(ebp, jpeg_nbits_table + ecx)] ; nbits = JPEG_NBITS(temp);
mov DWORD [esp+temp2], edx ; backup nbits in temp2
; Emit the Huffman-coded symbol for the number of bits
mov ebp, POINTER [eax+24] ; After this point, arguments are not accessible anymore
mov eax, INT [ebp + edx * 4] ; code = dctbl->ehufco[nbits];
movzx ecx, byte [ebp + edx + 1024] ; size = dctbl->ehufsi[nbits];
EMIT_BITS eax ; EMIT_BITS(code, size)
mov ecx, DWORD [esp+temp2] ; restore nbits
; Mask off any extra bits in code
mov eax, 1
shl eax, cl
dec eax
and eax, DWORD [esp+temp] ; temp2 &= (((JLONG) 1)<<nbits) - 1;
; Emit that number of bits of the value, if positive,
; or the complement of its magnitude, if negative.
EMIT_BITS eax ; EMIT_BITS(temp2, nbits)
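The DC path above (branchless absolute value, complement for negative inputs, then masking to nbits) can be sketched in C as follows. The function names are hypothetical, and the sketch assumes a two's complement machine on which right-shifting a negative int is an arithmetic shift, the same assumption the comments above make.

```c
#include <limits.h>

/* Split a DC difference into its magnitude and the value whose low
 * bits are emitted. Assumes arithmetic right shift of negative ints. */
static void dc_split(int diff, int *mag, int *bits)
{
  int sign = diff >> (CHAR_BIT * sizeof(int) - 1);  /* 0 or -1 */

  *mag = (diff ^ sign) - sign;  /* abs(diff) */
  *bits = diff + sign;          /* diff if >= 0, else ~abs(diff) */
}

/* Keep only the low `nbits` bits (0 <= nbits <= 16). */
static unsigned low_bits(int value, int nbits)
{
  return (unsigned)value & ((1u << nbits) - 1u);
}
```

For a difference of -5, the magnitude is 5 (3 bits) and the emitted bits are `low_bits(-6, 3)`, i.e. binary 010, the complement of 101.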
; Prepare data
xor ecx, ecx
mov esi, POINTER [esp+block]
kloop_prepare 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, \
18, 11, 4, 5, 12, 19, 26, 33, 40, 48, 41, 34, \
27, 20, 13, 6, 7, 14, 21, 28, 35, \
xmm0, xmm1, xmm2, xmm3
kloop_prepare 32, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, \
30, 37, 44, 51, 58, 59, 52, 45, 38, 31, 39, 46, \
53, 60, 61, 54, 47, 55, 62, 63, 63, \
xmm0, xmm1, xmm2, xmm3
pxor xmm7, xmm7
movdqa xmm0, XMMWORD [esp + t1 + 0 * SIZEOF_WORD] ; __m128i tmp0 = _mm_loadu_si128((__m128i *)(t1 + 0));
movdqa xmm1, XMMWORD [esp + t1 + 8 * SIZEOF_WORD] ; __m128i tmp1 = _mm_loadu_si128((__m128i *)(t1 + 8));
movdqa xmm2, XMMWORD [esp + t1 + 16 * SIZEOF_WORD] ; __m128i tmp2 = _mm_loadu_si128((__m128i *)(t1 + 16));
movdqa xmm3, XMMWORD [esp + t1 + 24 * SIZEOF_WORD] ; __m128i tmp3 = _mm_loadu_si128((__m128i *)(t1 + 24));
pcmpeqw xmm0, xmm7 ; tmp0 = _mm_cmpeq_epi16(tmp0, zero);
pcmpeqw xmm1, xmm7 ; tmp1 = _mm_cmpeq_epi16(tmp1, zero);
pcmpeqw xmm2, xmm7 ; tmp2 = _mm_cmpeq_epi16(tmp2, zero);
pcmpeqw xmm3, xmm7 ; tmp3 = _mm_cmpeq_epi16(tmp3, zero);
packsswb xmm0, xmm1 ; tmp0 = _mm_packs_epi16(tmp0, tmp1);
packsswb xmm2, xmm3 ; tmp2 = _mm_packs_epi16(tmp2, tmp3);
pmovmskb edx, xmm0 ; index = ((uint64_t)_mm_movemask_epi8(tmp0)) << 0;
pmovmskb ecx, xmm2 ; index |= ((uint64_t)_mm_movemask_epi8(tmp2)) << 16;
shl ecx, 16
or edx, ecx
not edx ; index = ~index;
lea esi, [esp+t1]
mov ebp, POINTER [esp+actbl] ; ebp = actbl
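The compare/pack/movemask sequence above builds a 32-bit map with one bit per coefficient. A scalar C sketch of the same result, under the assumption that bit i corresponds to t1[i], is shown below; `nonzero_mask32` is a hypothetical name.

```c
#include <stdint.h>

/* Return a bitmap in which bit i is set when coefs[i] is nonzero.
 * Mirrors "collect the zero flags, then invert" from the assembly. */
static uint32_t nonzero_mask32(const int16_t *coefs)
{
  uint32_t zero = 0;
  int i;

  for (i = 0; i < 32; i++)
    zero |= (uint32_t)(coefs[i] == 0) << i;
  return ~zero;
}
```

With this map, the position of the next nonzero coefficient is the count of trailing zero bits, which is what the loop that follows obtains with bsf.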
.BLOOP:
bsf ecx, edx ; r = __builtin_ctzl(index);
jz .ELOOP
lea esi, [esi+ecx*2] ; k += r;
shr edx, cl ; index >>= r;
mov DWORD [esp+temp3], edx
.BRLOOP:
cmp ecx, 16 ; while (r > 15) {
jl .ERLOOP
sub ecx, 16 ; r -= 16;
mov DWORD [esp+temp], ecx
mov eax, INT [ebp + 240 * 4] ; code_0xf0 = actbl->ehufco[0xf0];
movzx ecx, byte [ebp + 1024 + 240] ; size_0xf0 = actbl->ehufsi[0xf0];
EMIT_BITS eax ; EMIT_BITS(code_0xf0, size_0xf0)
mov ecx, DWORD [esp+temp]
jmp .BRLOOP
.ERLOOP:
movsx eax, word [esi] ; temp = t1[k];
movpic edx, POINTER [esp+gotptr] ; load GOT address (edx)
movzx eax, byte [GOTOFF(edx, jpeg_nbits_table + eax)] ; nbits = JPEG_NBITS(temp);
mov DWORD [esp+temp2], eax
; Emit Huffman symbol for run length / number of bits
shl ecx, 4 ; temp3 = (r << 4) + nbits;
add ecx, eax
mov eax, INT [ebp + ecx * 4] ; code = actbl->ehufco[temp3];
movzx ecx, byte [ebp + ecx + 1024] ; size = actbl->ehufsi[temp3];
EMIT_BITS eax
movsx edx, word [esi+DCTSIZE2*2] ; temp2 = t2[k];
; Mask off any extra bits in code
mov ecx, DWORD [esp+temp2]
mov eax, 1
shl eax, cl
dec eax
and eax, edx ; temp2 &= (((JLONG) 1)<<nbits) - 1;
EMIT_BITS eax ; EMIT_BITS(temp2, nbits)
mov edx, DWORD [esp+temp3]
add esi, 2 ; ++k;
shr edx, 1 ; index >>= 1;
jmp .BLOOP
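The loop above walks the nonzero bitmap rather than testing every coefficient. The C sketch below reproduces its control flow for one 32-coefficient half and records the run/size symbols instead of emitting Huffman codes. All names are hypothetical, and the bit counting is done with plain loops so that the sketch does not depend on compiler builtins.

```c
#include <stdint.h>

/* Count trailing zero bits; v must be nonzero. */
static int count_trailing_zeros(uint32_t v)
{
  int n = 0;

  while (!(v & 1u)) {
    v >>= 1;
    n++;
  }
  return n;
}

/* Number of bits needed to represent mag. */
static int nbits_of(unsigned mag)
{
  int n = 0;

  while (mag) {
    n++;
    mag >>= 1;
  }
  return n;
}

/* Collect the symbols for one half; returns how many were written. */
static int ac_symbols(const int16_t *t1, uint32_t index, uint8_t *symbols)
{
  int n = 0, k = 0;

  while (index != 0) {
    int r = count_trailing_zeros(index);  /* zero run before next coef */

    k += r;
    index >>= r;
    while (r > 15) {        /* runs of 16 zeros use the 0xF0 symbol */
      symbols[n++] = 0xF0;
      r -= 16;
    }
    symbols[n++] = (uint8_t)((r << 4) + nbits_of((unsigned)t1[k]));
    k++;
    index >>= 1;
  }
  return n;
}
```

For magnitudes 3 at position 0 and 1 at position 20, the symbols are 0x02, 0xF0, 0x31: a run of 19 zeros becomes one 0xF0 followed by a run of 3.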
.ELOOP:
movdqa xmm0, XMMWORD [esp + t1 + 32 * SIZEOF_WORD] ; __m128i tmp0 = _mm_loadu_si128((__m128i *)(t1 + 32));
movdqa xmm1, XMMWORD [esp + t1 + 40 * SIZEOF_WORD] ; __m128i tmp1 = _mm_loadu_si128((__m128i *)(t1 + 40));
movdqa xmm2, XMMWORD [esp + t1 + 48 * SIZEOF_WORD] ; __m128i tmp2 = _mm_loadu_si128((__m128i *)(t1 + 48));
movdqa xmm3, XMMWORD [esp + t1 + 56 * SIZEOF_WORD] ; __m128i tmp3 = _mm_loadu_si128((__m128i *)(t1 + 56));
pcmpeqw xmm0, xmm7 ; tmp0 = _mm_cmpeq_epi16(tmp0, zero);
pcmpeqw xmm1, xmm7 ; tmp1 = _mm_cmpeq_epi16(tmp1, zero);
pcmpeqw xmm2, xmm7 ; tmp2 = _mm_cmpeq_epi16(tmp2, zero);
pcmpeqw xmm3, xmm7 ; tmp3 = _mm_cmpeq_epi16(tmp3, zero);
packsswb xmm0, xmm1 ; tmp0 = _mm_packs_epi16(tmp0, tmp1);
packsswb xmm2, xmm3 ; tmp2 = _mm_packs_epi16(tmp2, tmp3);
pmovmskb edx, xmm0 ; index = ((uint64_t)_mm_movemask_epi8(tmp0)) << 0;
pmovmskb ecx, xmm2 ; index |= ((uint64_t)_mm_movemask_epi8(tmp2)) << 16;
shl ecx, 16
or edx, ecx
not edx ; index = ~index;
lea eax, [esp + t1 + (DCTSIZE2/2) * 2]
sub eax, esi
shr eax, 1
bsf ecx, edx ; r = __builtin_ctzl(index);
jz .ELOOP2
shr edx, cl ; index >>= r;
add ecx, eax
lea esi, [esi+ecx*2] ; k += r;
mov DWORD [esp+temp3], edx
jmp .BRLOOP2
.BLOOP2:
bsf ecx, edx ; r = __builtin_ctzl(index);
jz .ELOOP2
lea esi, [esi+ecx*2] ; k += r;
shr edx, cl ; index >>= r;
mov DWORD [esp+temp3], edx
.BRLOOP2:
cmp ecx, 16 ; while (r > 15) {
jl .ERLOOP2
sub ecx, 16 ; r -= 16;
mov DWORD [esp+temp], ecx
mov eax, INT [ebp + 240 * 4] ; code_0xf0 = actbl->ehufco[0xf0];
movzx ecx, byte [ebp + 1024 + 240] ; size_0xf0 = actbl->ehufsi[0xf0];
EMIT_BITS eax ; EMIT_BITS(code_0xf0, size_0xf0)
mov ecx, DWORD [esp+temp]
jmp .BRLOOP2
.ERLOOP2:
movsx eax, word [esi] ; temp = t1[k];
bsr eax, eax ; nbits = 32 - __builtin_clz(temp);
inc eax
mov DWORD [esp+temp2], eax
; Emit Huffman symbol for run length / number of bits
shl ecx, 4 ; temp3 = (r << 4) + nbits;
add ecx, eax
mov eax, INT [ebp + ecx * 4] ; code = actbl->ehufco[temp3];
movzx ecx, byte [ebp + ecx + 1024] ; size = actbl->ehufsi[temp3];
EMIT_BITS eax
movsx edx, word [esi+DCTSIZE2*2] ; temp2 = t2[k];
; Mask off any extra bits in code
mov ecx, DWORD [esp+temp2]
mov eax, 1
shl eax, cl
dec eax
and eax, edx ; temp2 &= (((JLONG) 1)<<nbits) - 1;
EMIT_BITS eax ; EMIT_BITS(temp2, nbits)
mov edx, DWORD [esp+temp3]
add esi, 2 ; ++k;
shr edx, 1 ; index >>= 1;
jmp .BLOOP2
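The second loop above obtains nbits with bsr instead of a table lookup. For a nonzero magnitude, the two are expected to agree, since both give the bit length of the value. A C sketch of the bsr form, with a portable fallback, is given below; the function names are hypothetical, and `__builtin_clz` is only used when a GCC-compatible compiler is detected.

```c
#include <limits.h>

/* Portable reference: bit length of mag, computed by shifting. */
static int nbits_loop(unsigned mag)
{
  int n = 0;

  while (mag) {
    n++;
    mag >>= 1;
  }
  return n;
}

/* bsr-style bit length; mag must be nonzero. */
static int nbits_bsr(unsigned mag)
{
#if defined(__GNUC__) || defined(__clang__)
  return (int)(sizeof(unsigned) * CHAR_BIT) - __builtin_clz(mag);
#else
  return nbits_loop(mag);
#endif
}
```

The nonzero precondition holds in the loop because only positions flagged in the nonzero bitmap are visited.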
.ELOOP2:
; If the last coef(s) were zero, emit an end-of-block code
lea edx, [esp + t1 + (DCTSIZE2-1) * 2] ; r = DCTSIZE2-1-k;
cmp edx, esi ; if (r > 0) {
je .EFN
mov eax, INT [ebp] ; code = actbl->ehufco[0];
movzx ecx, byte [ebp + 1024] ; size = actbl->ehufsi[0];
EMIT_BITS eax
.EFN:
mov eax, [esp+buffer]
pop esi
; Save put_buffer & put_bits
mov DWORD [esi+8], put_buffer ; state->cur.put_buffer = put_buffer;
mov DWORD [esi+12], put_bits ; state->cur.put_bits = put_bits;
pop ebp
pop edi
pop esi
; pop edx ; need not be preserved
pop ecx
pop ebx
mov esp,ebp ; esp <- aligned ebp
pop esp ; esp <- original esp (points at the saved ebp)
pop ebp
ret
; For some reason, the OS X linker does not honor the request to align the
; segment unless we do this.
align 16

4097
simd/jpeg_nbits_table.inc Normal file

File diff suppressed because it is too large

@@ -2,9 +2,10 @@
* simd/jsimd.h
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright (C) 2011, 2014-2015 D. R. Commander
* Copyright (C) 2011, 2014-2016 D. R. Commander
* Copyright (C) 2013-2014, MIPS Technologies, Inc., California
* Copyright (C) 2014 Linaro Limited
* Copyright (C) 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -828,3 +829,9 @@ extern const int jconst_idct_float_sse2[];
EXTERN(void) jsimd_idct_float_sse2
(void * dct_table, JCOEFPTR coef_block, JSAMPARRAY output_buf,
JDIMENSION output_col);
/* Huffman coding */
extern const int jconst_huff_encode_one_block[];
EXTERN(JOCTET*) jsimd_huff_encode_one_block_sse2
(void * state, JOCTET *buffer, JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl, c_derived_tbl *actbl);

@@ -3,6 +3,7 @@
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2009-2011, 2013-2014 D. R. Commander
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -706,3 +707,17 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
JDIMENSION output_col)
{
}
GLOBAL(int)
jsimd_can_huff_encode_one_block (void)
{
return 0;
}
GLOBAL(JOCTET*)
jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
int last_dc_val, c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
return NULL;
}
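On platforms that only have these stubs, the capability check returns 0, so a caller that tests it first should never reach the stub that returns NULL. The sketch below illustrates that calling pattern with stand-in functions. Every name in it is hypothetical, and it is an assumption about how the caller behaves, not code from this commit.

```c
#include <stddef.h>

/* Stand-ins for the stub pair above: no SIMD support on this platform. */
static int stub_can_encode(void)
{
  return 0;
}

static unsigned char *stub_encode(unsigned char *buffer)
{
  (void)buffer;
  return NULL;  /* never meant to be called when the check says 0 */
}

/* Stand-in for the plain C encoder: writes one byte, returns new end. */
static unsigned char *c_encode(unsigned char *buffer)
{
  *buffer++ = 0x2A;
  return buffer;
}

/* Query the capability first; fall back to the C path otherwise. */
static unsigned char *encode_block(unsigned char *buffer)
{
  if (stub_can_encode())
    return stub_encode(buffer);
  return c_encode(buffer);
}
```

Because the check gates the call, the NULL return of the stub acts as a placeholder and not as an error path that callers must handle.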

@@ -3,6 +3,7 @@
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2009-2011, 2013-2014 D. R. Commander
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -33,10 +34,10 @@ static unsigned int simd_support = ~0;
* FIXME: This code is racy under a multi-threaded environment.
*/
/*
/*
* ARMv8 architectures support NEON extensions by default.
* It is no longer optional as it was with ARMv7.
*/
*/
LOCAL(void)
@@ -542,3 +543,17 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
JDIMENSION output_col)
{
}
GLOBAL(int)
jsimd_can_huff_encode_one_block (void)
{
return 0;
}
GLOBAL(JOCTET*)
jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
int last_dc_val, c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
return NULL;
}

@@ -2,7 +2,8 @@
* jsimd_i386.c
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2009-2011, 2013-2014 D. R. Commander
* Copyright 2009-2011, 2013-2014, 2016 D. R. Commander
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -30,6 +31,7 @@
#define IS_ALIGNED_SSE(ptr) (IS_ALIGNED(ptr, 4)) /* 16 byte alignment */
static unsigned int simd_support = ~0;
static unsigned int simd_huffman = 1;
/*
* Check what SIMD accelerations are supported.
@@ -62,6 +64,9 @@ init_simd (void)
env = getenv("JSIMD_FORCENONE");
if ((env != NULL) && (strcmp(env, "1") == 0))
simd_support = 0;
env = getenv("JSIMD_NOHUFFENC");
if ((env != NULL) && (strcmp(env, "1") == 0))
simd_huffman = 0;
}
GLOBAL(int)
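The hunk above adds a JSIMD_NOHUFFENC environment override that follows the same pattern as JSIMD_FORCENONE: the feature is switched off only when the variable is set to exactly "1". A small C sketch of that test, using a hypothetical helper name, looks like this:

```c
#include <stdlib.h>
#include <string.h>

/* Return 1 only when the variable exists and equals "1". */
static int env_flag_set(const char *name)
{
  const char *env = getenv(name);

  return (env != NULL) && (strcmp(env, "1") == 0);
}
```

Under this reading, running a program with `JSIMD_NOHUFFENC=1` in its environment would disable the SIMD Huffman encoder, while any other value (or no value) leaves it enabled.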
@@ -1059,3 +1064,28 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
output_col);
}
GLOBAL(int)
jsimd_can_huff_encode_one_block (void)
{
init_simd();
if (DCTSIZE != 8)
return 0;
if (sizeof(JCOEF) != 2)
return 0;
if ((simd_support & JSIMD_SSE2) && simd_huffman &&
IS_ALIGNED_SSE(jconst_huff_encode_one_block))
return 1;
return 0;
}
GLOBAL(JOCTET*)
jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
int last_dc_val, c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
return jsimd_huff_encode_one_block_sse2(state, buffer, block, last_dc_val,
dctbl, actbl);
}
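The capability check above combines three conditions: the SSE2 support bit, the simd_huffman switch, and 16-byte alignment of the constant table. The C sketch below restates that logic in isolation. The flag value and helper names are placeholders chosen for illustration; the real JSIMD_SSE2 and IS_ALIGNED definitions live in the library's own headers and may differ.

```c
#include <stdint.h>

/* Placeholder for the SSE2 capability bit; not the library's definition. */
#define SKETCH_JSIMD_SSE2 0x08u

/* True when the low `order` bits of the address are zero,
 * i.e. the pointer is aligned to 2^order bytes. */
static int is_aligned(const void *ptr, unsigned order)
{
  return ((uintptr_t)ptr & (((uintptr_t)1 << order) - 1)) == 0;
}

/* All three conditions must hold before the SSE2 encoder is used. */
static int can_use_sse2_huffman(unsigned simd_support,
                                unsigned simd_huffman, const void *consts)
{
  if ((simd_support & SKETCH_JSIMD_SSE2) && simd_huffman &&
      is_aligned(consts, 4))   /* order 4 means 16-byte alignment */
    return 1;
  return 0;
}
```

The order argument is a power-of-two exponent, which is why IS_ALIGNED_SSE passes 4 while its comment says 16 bytes.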

@@ -4,6 +4,7 @@
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2009-2011, 2014 D. R. Commander
* Copyright (C) 2013-2014, MIPS Technologies, Inc., California
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -1113,3 +1114,17 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
JDIMENSION output_col)
{
}
GLOBAL(int)
jsimd_can_huff_encode_one_block (void)
{
return 0;
}
GLOBAL(JOCTET*)
jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
int last_dc_val, c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
return NULL;
}

@@ -3,6 +3,7 @@
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2009-2011, 2014-2015 D. R. Commander
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -724,3 +725,17 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
JDIMENSION output_col)
{
}
GLOBAL(int)
jsimd_can_huff_encode_one_block (void)
{
return 0;
}
GLOBAL(JOCTET*)
jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
int last_dc_val, c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
return NULL;
}

@@ -2,7 +2,8 @@
* jsimd_x86_64.c
*
* Copyright 2009 Pierre Ossman <ossman@cendio.se> for Cendio AB
* Copyright 2009-2011, 2014 D. R. Commander
* Copyright 2009-2011, 2014, 2016 D. R. Commander
* Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -30,6 +31,7 @@
#define IS_ALIGNED_SSE(ptr) (IS_ALIGNED(ptr, 4)) /* 16 byte alignment */
static unsigned int simd_support = ~0;
static unsigned int simd_huffman = 1;
/*
* Check what SIMD accelerations are supported.
@@ -50,6 +52,9 @@ init_simd (void)
env = getenv("JSIMD_FORCENONE");
if ((env != NULL) && (strcmp(env, "1") == 0))
simd_support = 0;
env = getenv("JSIMD_NOHUFFENC");
if ((env != NULL) && (strcmp(env, "1") == 0))
simd_huffman = 0;
}
GLOBAL(int)
@@ -854,3 +859,29 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
jsimd_idct_float_sse2(compptr->dct_table, coef_block, output_buf,
output_col);
}
GLOBAL(int)
jsimd_can_huff_encode_one_block (void)
{
init_simd();
if (DCTSIZE != 8)
return 0;
if (sizeof(JCOEF) != 2)
return 0;
if ((simd_support & JSIMD_SSE2) && simd_huffman &&
IS_ALIGNED_SSE(jconst_huff_encode_one_block))
return 1;
return 0;
}
GLOBAL(JOCTET*)
jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
int last_dc_val, c_derived_tbl *dctbl,
c_derived_tbl *actbl)
{
return jsimd_huff_encode_one_block_sse2(state, buffer, block, last_dc_val,
dctbl, actbl);
}