From: Alexander Monakov
Date: Wed, 16 Nov 2016 17:17:00 +0000 (+0300)
Subject: nvptx backend prerequisites for OpenMP offloading
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=5012919d0bd344ac1888e8e531072f0ccbe24d2c;p=gcc.git

nvptx backend prerequisites for OpenMP offloading

gcc/
	* config/nvptx/mkoffload.c (main): Check that either OpenACC or OpenMP
	is selected. Pass -mgomp to offload compiler in OpenMP case.
	* config/nvptx/nvptx-protos.h (nvptx_shuffle_kind): Move enum
	declaration from nvptx.c.
	(nvptx_gen_shuffle): Declare.
	(nvptx_output_set_softstack): Declare.
	* config/nvptx/nvptx.c (nvptx_shuffle_kind): Move to nvptx-protos.h.
	(need_softstack_decl): New variable.
	(need_unisimt_decl): New variable.
	(diagnose_openacc_conflict): New. Use it...
	(nvptx_option_override): ...here. Handle TARGET_GOMP.
	(nvptx_encode_section_info): Handle "shared" attribute.
	(write_as_kernel): Restrict to OpenACC target regions.
	(init_softstack_frame): New.
	(nvptx_init_unisimt_predicate): New.
	(write_omp_entry): New. Use it...
	(nvptx_declare_function_name): ...here to emit OpenMP target region
	entrypoints. Handle TARGET_SOFT_STACK. Call
	nvptx_init_unisimt_predicate.
	(nvptx_output_set_softstack): New.
	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
	(nvptx_gen_shuffle): Export.
	(nvptx_output_call_insn): Handle COND_EXEC patterns. Emit instruction
	predicate.
	(nvptx_print_operand): Fix handling of instruction predicates.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New. Transform code for -muniform-simt.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	(nvptx_handle_shared_attribute): New. Use it...
	(nvptx_attribute_table): ... here (new entry).
	(nvptx_record_offload_symbol): Handle NULL attributes.
	(nvptx_file_end): Handle need_softstack_decl and need_unisimt_decl.
	(nvptx_simt_vf): New.
	(TARGET_SIMT_VF): Define.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_softstack or __nvptx_unisimt__ when -msoft-stack, or resp.
	-muniform-simt option is active.
	(STACK_SIZE_MODE): Define.
	(FIXED_REGISTERS): Adjust.
	(SOFTSTACK_SLOT_REGNUM): New.
	(SOFTSTACK_PREV_REGNUM): New.
	(REGISTER_NAMES): Adjust.
	(struct machine_function): New fields.
	* config/nvptx/nvptx.md (UNSPEC_SET_SOFTSTACK): New.
	(UNSPEC_VOTE_BALLOT): Ditto.
	(UNSPEC_LANEID): Ditto.
	(UNSPECV_NOUNROLL): Ditto.
	(atomic): New attribute.
	(predicable): New attribute. Generate predicated forms via
	define_cond_exec.
	(br_true): Mark as not predicable.
	(br_false): Ditto.
	(br_true_uni): Ditto.
	(br_false_uni): Ditto.
	(return): Ditto.
	(trap_if_true): Ditto.
	(trap_if_false): Ditto.
	(nvptx_fork): Ditto.
	(nvptx_forked): Ditto.
	(nvptx_joining): Ditto.
	(nvptx_join): Ditto.
	(nvptx_barsync): Ditto.
	(epilogue): Emit stack restore if TARGET_SOFT_STACK.
	(allocate_stack): Implement for TARGET_SOFT_STACK. Remove unused code.
	(allocate_stack_): Remove unused pattern.
	(set_softstack_insn): New pattern.
	(restore_stack_block): Handle for TARGET_SOFT_STACK.
	(nvptx_vote_ballot): New pattern.
	(omp_simt_lane): Ditto.
	(omp_simt_last_lane): Ditto.
	(omp_simt_ordered): Ditto.
	(omp_simt_vote_any): Ditto.
	(omp_simt_xchg_bfly): Ditto.
	(omp_simt_xchg_idx): Ditto.
	(nvptx_nounroll): Ditto.
	(atomic_compare_and_swap_1): Mark with atomic attribute.
	(atomic_exchange): Ditto.
	(atomic_fetch_add): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_): Ditto.
	* config/nvptx/nvptx.opt: (msoft-stack): New option.
	(muniform-simt): Ditto.
	(mgomp): Ditto.
	* config/nvptx/t-nvptx (MULTILIB_OPTIONS): New.
	* doc/extend.texi (Nvidia PTX Variable Attributes): New section.
	* doc/invoke.texi (msoft-stack): Document.
	(muniform-simt): Document
	(mgomp): Document.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: (TARGET_SIMT_VF): New hook.
	* target.def: Define it.
	* target-insns.def (omp_simt_lane): New.
	(omp_simt_last_lane): New.
	(omp_simt_ordered): New.
	(omp_simt_vote_any): New.
	(omp_simt_xchg_bfly): New.
	(omp_simt_xchg_idx): New.

libgcc/
	* config/nvptx/crt0.c (__main): Setup __nvptx_stacks and __nvptx_uni.
	* config/nvptx/mgomp.c: New file.
	* config/nvptx/t-nvptx: Add mgomp.c

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_alloca): Use a compile test.
	* gcc.target/nvptx/softstack.c: New test.
	* gcc.target/nvptx/decl-shared.c: New test.
	* gcc.target/nvptx/decl-shared-init.c: New test.

From-SVN: r242503
---

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6f46dd404f2..be573a3124b 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,106 @@
+2016-11-16 Alexander Monakov
+
+	* config/nvptx/mkoffload.c (main): Check that either OpenACC or OpenMP
+	is selected. Pass -mgomp to offload compiler in OpenMP case.
+	* config/nvptx/nvptx-protos.h (nvptx_shuffle_kind): Move enum
+	declaration from nvptx.c.
+	(nvptx_gen_shuffle): Declare.
+	(nvptx_output_set_softstack): Declare.
+	* config/nvptx/nvptx.c (nvptx_shuffle_kind): Move to nvptx-protos.h.
+	(need_softstack_decl): New variable.
+	(need_unisimt_decl): New variable.
+	(diagnose_openacc_conflict): New. Use it...
+	(nvptx_option_override): ...here. Handle TARGET_GOMP.
+	(nvptx_encode_section_info): Handle "shared" attribute.
+	(write_as_kernel): Restrict to OpenACC target regions.
+	(init_softstack_frame): New.
+	(nvptx_init_unisimt_predicate): New.
+	(write_omp_entry): New. Use it...
+	(nvptx_declare_function_name): ...here to emit OpenMP target region
+	entrypoints. Handle TARGET_SOFT_STACK. Call
+	nvptx_init_unisimt_predicate.
+	(nvptx_output_set_softstack): New.
+	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
+	(nvptx_gen_shuffle): Export.
+	(nvptx_output_call_insn): Handle COND_EXEC patterns. Emit instruction
+	predicate.
+	(nvptx_print_operand): Fix handling of instruction predicates.
+	(nvptx_get_unisimt_master): New helper function.
+	(nvptx_get_unisimt_predicate): Ditto.
+	(nvptx_call_insn_is_syscall_p): Ditto.
+	(nvptx_unisimt_handle_set): Ditto.
+	(nvptx_reorg_uniform_simt): New. Transform code for -muniform-simt.
+	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
+	(nvptx_handle_shared_attribute): New. Use it...
+	(nvptx_attribute_table): ... here (new entry).
+	(nvptx_record_offload_symbol): Handle NULL attributes.
+	(nvptx_file_end): Handle need_softstack_decl and need_unisimt_decl.
+	(nvptx_simt_vf): New.
+	(TARGET_SIMT_VF): Define.
+	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
+	__nvptx_softstack or __nvptx_unisimt__ when -msoft-stack, or resp.
+	-muniform-simt option is active.
+	(STACK_SIZE_MODE): Define.
+	(FIXED_REGISTERS): Adjust.
+	(SOFTSTACK_SLOT_REGNUM): New.
+	(SOFTSTACK_PREV_REGNUM): New.
+	(REGISTER_NAMES): Adjust.
+	(struct machine_function): New fields.
+	* config/nvptx/nvptx.md (UNSPEC_SET_SOFTSTACK): New.
+	(UNSPEC_VOTE_BALLOT): Ditto.
+	(UNSPEC_LANEID): Ditto.
+	(UNSPECV_NOUNROLL): Ditto.
+	(atomic): New attribute.
+	(predicable): New attribute. Generate predicated forms via
+	define_cond_exec.
+	(br_true): Mark as not predicable.
+	(br_false): Ditto.
+	(br_true_uni): Ditto.
+	(br_false_uni): Ditto.
+	(return): Ditto.
+	(trap_if_true): Ditto.
+	(trap_if_false): Ditto.
+	(nvptx_fork): Ditto.
+	(nvptx_forked): Ditto.
+	(nvptx_joining): Ditto.
+	(nvptx_join): Ditto.
+	(nvptx_barsync): Ditto.
+	(epilogue): Emit stack restore if TARGET_SOFT_STACK.
+	(allocate_stack): Implement for TARGET_SOFT_STACK. Remove unused code.
+	(allocate_stack_): Remove unused pattern.
+	(set_softstack_insn): New pattern.
+	(restore_stack_block): Handle for TARGET_SOFT_STACK.
+	(nvptx_vote_ballot): New pattern.
+	(omp_simt_lane): Ditto.
+	(omp_simt_last_lane): Ditto.
+	(omp_simt_ordered): Ditto.
+	(omp_simt_vote_any): Ditto.
+	(omp_simt_xchg_bfly): Ditto.
+	(omp_simt_xchg_idx): Ditto.
+	(nvptx_nounroll): Ditto.
+	(atomic_compare_and_swap_1): Mark with atomic attribute.
+	(atomic_exchange): Ditto.
+	(atomic_fetch_add): Ditto.
+	(atomic_fetch_addsf): Ditto.
+	(atomic_fetch_): Ditto.
+ * config/nvptx/nvptx.opt: (msoft-stack): New option. + (muniform-simt): Ditto. + (mgomp): Ditto. + * config/nvptx/t-nvptx (MULTILIB_OPTIONS): New. + * doc/extend.texi (Nvidia PTX Variable Attributes): New section. + * doc/invoke.texi (msoft-stack): Document. + (muniform-simt): Document + (mgomp): Document. + * doc/tm.texi: Regenerate. + * doc/tm.texi.in: (TARGET_SIMT_VF): New hook. + * target.def: Define it. + * target-insns.def (omp_simt_lane): New. + (omp_simt_last_lane): New. + (omp_simt_ordered): New. + (omp_simt_vote_any): New. + (omp_simt_xchg_bfly): New. + (omp_simt_xchg_idx): New. + 2016-11-16 Maciej W. Rozycki * config/mips/mips-protos.h (mips_set_text_contents_type): New diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c index c8eed451078..d876c7bc162 100644 --- a/gcc/config/nvptx/mkoffload.c +++ b/gcc/config/nvptx/mkoffload.c @@ -460,6 +460,7 @@ main (int argc, char **argv) /* Scan the argument vector. */ bool fopenmp = false; + bool fopenacc = false; for (int i = 1; i < argc; i++) { #define STR "-foffload-abi=" @@ -476,11 +477,15 @@ main (int argc, char **argv) #undef STR else if (strcmp (argv[i], "-fopenmp") == 0) fopenmp = true; + else if (strcmp (argv[i], "-fopenacc") == 0) + fopenacc = true; else if (strcmp (argv[i], "-save-temps") == 0) save_temps = true; else if (strcmp (argv[i], "-v") == 0) verbose = true; } + if (!(fopenacc ^ fopenmp)) + fatal_error (input_location, "either -fopenacc or -fopenmp must be set"); struct obstack argv_obstack; obstack_init (&argv_obstack); @@ -501,6 +506,8 @@ main (int argc, char **argv) default: gcc_unreachable (); } + if (fopenmp) + obstack_ptr_grow (&argv_obstack, "-mgomp"); for (int ix = 1; ix != argc; ix++) { diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h index ec4588e6dc0..331ec0af6bf 100644 --- a/gcc/config/nvptx/nvptx-protos.h +++ b/gcc/config/nvptx/nvptx-protos.h @@ -21,6 +21,16 @@ #ifndef GCC_NVPTX_PROTOS_H #define GCC_NVPTX_PROTOS_H +/* The kind of 
shuffe instruction. */ +enum nvptx_shuffle_kind +{ + SHUFFLE_UP, + SHUFFLE_DOWN, + SHUFFLE_BFLY, + SHUFFLE_IDX, + SHUFFLE_MAX +}; + extern void nvptx_declare_function_name (FILE *, const char *, const_tree decl); extern void nvptx_declare_object_name (FILE *file, const char *name, const_tree decl); @@ -36,10 +46,12 @@ extern void nvptx_register_pragmas (void); extern void nvptx_expand_oacc_fork (unsigned); extern void nvptx_expand_oacc_join (unsigned); extern void nvptx_expand_call (rtx, rtx); +extern rtx nvptx_gen_shuffle (rtx, rtx, rtx, nvptx_shuffle_kind); extern rtx nvptx_expand_compare (rtx); extern const char *nvptx_ptx_type_from_mode (machine_mode, bool); extern const char *nvptx_output_mov_insn (rtx, rtx); extern const char *nvptx_output_call_insn (rtx_insn *, rtx, rtx); extern const char *nvptx_output_return (void); +extern const char *nvptx_output_set_softstack (unsigned); #endif #endif diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 782bbdecb37..405a91b2604 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -72,16 +72,6 @@ /* This file should be included last. */ #include "target-def.h" -/* The kind of shuffe instruction. */ -enum nvptx_shuffle_kind -{ - SHUFFLE_UP, - SHUFFLE_DOWN, - SHUFFLE_BFLY, - SHUFFLE_IDX, - SHUFFLE_MAX -}; - /* The various PTX memory areas an object might reside in. */ enum nvptx_data_area { @@ -141,6 +131,12 @@ static GTY(()) rtx worker_red_sym; /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; +/* True if any function references __nvptx_stacks. */ +static bool need_softstack_decl; + +/* True if any function references __nvptx_uni. */ +static bool need_unisimt_decl; + /* Allocate a new, cleared machine_function structure. 
*/ static struct machine_function * @@ -151,6 +147,16 @@ nvptx_init_machine_status (void) return p; } +/* Issue a diagnostic when option OPTNAME is enabled (as indicated by OPTVAL) + and -fopenacc is also enabled. */ + +static void +diagnose_openacc_conflict (bool optval, const char *optname) +{ + if (flag_openacc && optval) + error ("option %s is not supported together with -fopenacc", optname); +} + /* Implement TARGET_OPTION_OVERRIDE. */ static void @@ -188,6 +194,13 @@ nvptx_option_override (void) worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red"); SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + + diagnose_openacc_conflict (TARGET_GOMP, "-mgomp"); + diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack"); + diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt"); + + if (TARGET_GOMP) + target_flags |= MASK_SOFT_STACK | MASK_UNIFORM_SIMT; } /* Return a ptx type for MODE. If PROMOTE, then use .u32 for QImode to @@ -238,9 +251,17 @@ nvptx_encode_section_info (tree decl, rtx rtl, int first) if (TREE_CONSTANT (decl)) area = DATA_AREA_CONST; else if (TREE_CODE (decl) == VAR_DECL) - /* TODO: This would be a good place to check for a .shared or - other section name. */ - area = TREE_READONLY (decl) ? DATA_AREA_CONST : DATA_AREA_GLOBAL; + { + if (lookup_attribute ("shared", DECL_ATTRIBUTES (decl))) + { + area = DATA_AREA_SHARED; + if (DECL_INITIAL (decl)) + error ("static initialization of variable %q+D in %<.shared%>" + " memory is not supported", decl); + } + else + area = TREE_READONLY (decl) ? 
DATA_AREA_CONST : DATA_AREA_GLOBAL; + } SET_SYMBOL_DATA_AREA (XEXP (rtl, 0), area); } @@ -718,7 +739,10 @@ static bool write_as_kernel (tree attrs) { return (lookup_attribute ("kernel", attrs) != NULL_TREE - || lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE); + || (lookup_attribute ("omp target entrypoint", attrs) != NULL_TREE + && lookup_attribute ("oacc function", attrs) != NULL_TREE)); + /* For OpenMP target regions, the corresponding kernel entry is emitted from + write_omp_entry as a separate function. */ } /* Emit a linker marker for a function decl or defn. */ @@ -973,6 +997,67 @@ init_frame (FILE *file, int regno, unsigned align, unsigned size) POINTER_SIZE, reg_names[regno], reg_names[regno]); } +/* Emit soft stack frame setup sequence. */ + +static void +init_softstack_frame (FILE *file, unsigned alignment, HOST_WIDE_INT size) +{ + /* Maintain 64-bit stack alignment. */ + unsigned keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT; + size = ROUND_UP (size, keep_align); + int bits = POINTER_SIZE; + const char *reg_stack = reg_names[STACK_POINTER_REGNUM]; + const char *reg_frame = reg_names[FRAME_POINTER_REGNUM]; + const char *reg_sspslot = reg_names[SOFTSTACK_SLOT_REGNUM]; + const char *reg_sspprev = reg_names[SOFTSTACK_PREV_REGNUM]; + fprintf (file, "\t.reg.u%d %s;\n", bits, reg_stack); + fprintf (file, "\t.reg.u%d %s;\n", bits, reg_frame); + fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspslot); + fprintf (file, "\t.reg.u%d %s;\n", bits, reg_sspprev); + fprintf (file, "\t{\n"); + fprintf (file, "\t\t.reg.u32 %%fstmp0;\n"); + fprintf (file, "\t\t.reg.u%d %%fstmp1;\n", bits); + fprintf (file, "\t\t.reg.u%d %%fstmp2;\n", bits); + fprintf (file, "\t\tmov.u32 %%fstmp0, %%tid.y;\n"); + fprintf (file, "\t\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n", + bits == 64 ? ".wide" : ".lo", bits / 8); + fprintf (file, "\t\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits); + + /* Initialize %sspslot = &__nvptx_stacks[tid.y]. 
*/ + fprintf (file, "\t\tadd.u%d %s, %%fstmp2, %%fstmp1;\n", bits, reg_sspslot); + + /* Initialize %sspprev = __nvptx_stacks[tid.y]. */ + fprintf (file, "\t\tld.shared.u%d %s, [%s];\n", + bits, reg_sspprev, reg_sspslot); + + /* Initialize %frame = %sspprev - size. */ + fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n", + bits, reg_frame, reg_sspprev, size); + + /* Apply alignment, if larger than 64. */ + if (alignment > keep_align) + fprintf (file, "\t\tand.b%d %s, %s, %d;\n", + bits, reg_frame, reg_frame, -alignment); + + size = crtl->outgoing_args_size; + gcc_assert (size % keep_align == 0); + + /* Initialize %stack. */ + fprintf (file, "\t\tsub.u%d %s, %s, " HOST_WIDE_INT_PRINT_DEC ";\n", + bits, reg_stack, reg_frame, size); + + /* Usually 'crtl->is_leaf' is computed during register allocator + initialization, which is not done on NVPTX. Compute it now. */ + gcc_assert (!crtl->is_leaf); + crtl->is_leaf = leaf_function_p (); + if (!crtl->is_leaf) + fprintf (file, "\t\tst.shared.u%d [%s], %s;\n", + bits, reg_sspslot, reg_stack); + fprintf (file, "\t}\n"); + cfun->machine->has_softstack = true; + need_softstack_decl = true; +} + /* Emit code to initialize the REGNO predicate register to indicate whether we are not lane zero on the NAME axis. */ @@ -986,6 +1071,97 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name) fprintf (file, "\t}\n"); } +/* Emit code to initialize predicate and master lane index registers for + -muniform-simt code generation variant. 
*/ + +static void +nvptx_init_unisimt_predicate (FILE *file) +{ + int bits = POINTER_SIZE; + int master = REGNO (cfun->machine->unisimt_master); + int pred = REGNO (cfun->machine->unisimt_predicate); + fprintf (file, "\t{\n"); + fprintf (file, "\t\t.reg.u32 %%ustmp0;\n"); + fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits); + fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits); + fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n"); + fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n", + bits == 64 ? ".wide" : ".lo"); + fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits); + fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits); + fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master); + fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n"); + /* Compute 'master lane index' as 'tid.x & __nvptx_uni[tid.y]'. */ + fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master); + /* Compute predicate as 'tid.x == master'. */ + fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master); + fprintf (file, "\t}\n"); + need_unisimt_decl = true; +} + +/* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region: + + extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg); + void __attribute__((kernel)) NAME (void *arg, char *stack, size_t stacksize) + { + __nvptx_stacks[tid.y] = stack + stacksize * (ctaid.x * ntid.y + tid.y + 1); + __nvptx_uni[tid.y] = 0; + gomp_nvptx_main (ORIG, arg); + } + ORIG itself should not be emitted as a PTX .entry function. 
*/ + +static void +write_omp_entry (FILE *file, const char *name, const char *orig) +{ + static bool gomp_nvptx_main_declared; + if (!gomp_nvptx_main_declared) + { + gomp_nvptx_main_declared = true; + write_fn_marker (func_decls, false, true, "gomp_nvptx_main"); + func_decls << ".extern .func gomp_nvptx_main (.param.u" << POINTER_SIZE + << " %in_ar1, .param.u" << POINTER_SIZE << " %in_ar2);\n"; + } +#define ENTRY_TEMPLATE(PS, PS_BYTES, MAD_PS_32) "\ + (.param.u" PS " %arg, .param.u" PS " %stack, .param.u" PS " %sz)\n\ +{\n\ + .reg.u32 %r<3>;\n\ + .reg.u" PS " %R<4>;\n\ + mov.u32 %r0, %tid.y;\n\ + mov.u32 %r1, %ntid.y;\n\ + mov.u32 %r2, %ctaid.x;\n\ + cvt.u" PS ".u32 %R1, %r0;\n\ + " MAD_PS_32 " %R1, %r1, %r2, %R1;\n\ + mov.u" PS " %R0, __nvptx_stacks;\n\ + " MAD_PS_32 " %R0, %r0, " PS_BYTES ", %R0;\n\ + ld.param.u" PS " %R2, [%stack];\n\ + ld.param.u" PS " %R3, [%sz];\n\ + add.u" PS " %R2, %R2, %R3;\n\ + mad.lo.u" PS " %R2, %R1, %R3, %R2;\n\ + st.shared.u" PS " [%R0], %R2;\n\ + mov.u" PS " %R0, __nvptx_uni;\n\ + " MAD_PS_32 " %R0, %r0, 4, %R0;\n\ + mov.u32 %r0, 0;\n\ + st.shared.u32 [%R0], %r0;\n\ + mov.u" PS " %R0, \0;\n\ + ld.param.u" PS " %R1, [%arg];\n\ + {\n\ + .param.u" PS " %P<2>;\n\ + st.param.u" PS " [%P0], %R0;\n\ + st.param.u" PS " [%P1], %R1;\n\ + call.uni gomp_nvptx_main, (%P0, %P1);\n\ + }\n\ + ret.uni;\n\ +}\n" + static const char entry64[] = ENTRY_TEMPLATE ("64", "8", "mad.wide.u32"); + static const char entry32[] = ENTRY_TEMPLATE ("32", "4", "mad.lo.u32 "); +#undef ENTRY_TEMPLATE + const char *entry_1 = TARGET_ABI64 ? entry64 : entry32; + /* Position ENTRY_2 after the embedded nul using strlen of the prefix. */ + const char *entry_2 = entry_1 + strlen (entry64) + 1; + fprintf (file, ".visible .entry %s%s%s%s", name, entry_1, orig, entry_2); + need_softstack_decl = need_unisimt_decl = true; +} + /* Implement ASM_DECLARE_FUNCTION_NAME. Writes the start of a ptx function, including local var decls and copies from the arguments to local regs. 
*/ @@ -997,6 +1173,14 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) tree result_type = TREE_TYPE (fntype); int argno = 0; + if (lookup_attribute ("omp target entrypoint", DECL_ATTRIBUTES (decl)) + && !lookup_attribute ("oacc function", DECL_ATTRIBUTES (decl))) + { + char *buf = (char *) alloca (strlen (name) + sizeof ("$impl")); + sprintf (buf, "%s$impl", name); + write_omp_entry (file, name, buf); + name = buf; + } /* We construct the initial part of the function into a string stream, in order to share the prototype writing code. */ std::stringstream s; @@ -1034,19 +1218,24 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) fprintf (file, "%s", s.str().c_str()); - /* Declare a local var for outgoing varargs. */ - if (cfun->machine->has_varadic) - init_frame (file, STACK_POINTER_REGNUM, - UNITS_PER_WORD, crtl->outgoing_args_size); - - /* Declare a local variable for the frame. Force its size to be - DImode-compatible. */ HOST_WIDE_INT sz = get_frame_size (); - if (sz || cfun->machine->has_chain) - init_frame (file, FRAME_POINTER_REGNUM, - crtl->stack_alignment_needed / BITS_PER_UNIT, - (sz + GET_MODE_SIZE (DImode) - 1) - & ~(HOST_WIDE_INT)(GET_MODE_SIZE (DImode) - 1)); + bool need_frameptr = sz || cfun->machine->has_chain; + int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT; + if (!TARGET_SOFT_STACK) + { + /* Declare a local var for outgoing varargs. */ + if (cfun->machine->has_varadic) + init_frame (file, STACK_POINTER_REGNUM, + UNITS_PER_WORD, crtl->outgoing_args_size); + + /* Declare a local variable for the frame. Force its size to be + DImode-compatible. */ + if (need_frameptr) + init_frame (file, FRAME_POINTER_REGNUM, alignment, + ROUND_UP (sz, GET_MODE_SIZE (DImode))); + } + else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca) + init_softstack_frame (file, alignment, sz); /* Declare the pseudos we have as ptx registers. 
*/ int maxregs = max_reg_num (); @@ -1072,8 +1261,25 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) if (cfun->machine->axis_predicate[1]) nvptx_init_axis_predicate (file, REGNO (cfun->machine->axis_predicate[1]), "x"); + if (cfun->machine->unisimt_predicate) + nvptx_init_unisimt_predicate (file); } +/* Output instruction that sets soft stack pointer in shared memory to the + value in register given by SRC_REGNO. */ + +const char * +nvptx_output_set_softstack (unsigned src_regno) +{ + if (cfun->machine->has_softstack && !crtl->is_leaf) + { + fprintf (asm_out_file, "\tst.shared.u%d\t[%s], ", + POINTER_SIZE, reg_names[SOFTSTACK_SLOT_REGNUM]); + output_reg (asm_out_file, src_regno, VOIDmode); + fprintf (asm_out_file, ";\n"); + } + return ""; +} /* Output a return instruction. Also copy the return value to its outgoing location. */ @@ -1113,6 +1319,8 @@ nvptx_function_ok_for_sibcall (tree, tree) static rtx nvptx_get_drap_rtx (void) { + if (TARGET_SOFT_STACK && stack_realign_drap) + return arg_pointer_rtx; return NULL_RTX; } @@ -1311,7 +1519,7 @@ nvptx_gen_pack (rtx dst, rtx src0, rtx src1) /* Generate an instruction or sequence to broadcast register REG across the vectors of a single warp. */ -static rtx +rtx nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, nvptx_shuffle_kind kind) { rtx res; @@ -1833,6 +2041,8 @@ nvptx_output_mov_insn (rtx dst, rtx src) return "%.\tcvt%t0%t1\t%0, %1;"; } +static void nvptx_print_operand (FILE *, rtx, int); + /* Output INSN, which is a call to CALLEE with result RESULT. For ptx, this involves writing .param declarations and in/out copies into them. For indirect calls, also write the .callprototype. 
*/ @@ -1844,6 +2054,8 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee) static int labelno; bool needs_tgt = register_operand (callee, Pmode); rtx pat = PATTERN (insn); + if (GET_CODE (pat) == COND_EXEC) + pat = COND_EXEC_CODE (pat); int arg_end = XVECLEN (pat, 0); tree decl = NULL_TREE; @@ -1888,6 +2100,8 @@ nvptx_output_call_insn (rtx_insn *insn, rtx result, rtx callee) fprintf (asm_out_file, ";\n"); } + /* The '.' stands for the call's predicate, if any. */ + nvptx_print_operand (asm_out_file, NULL_RTX, '.'); fprintf (asm_out_file, "\t\tcall "); if (result != NULL_RTX) fprintf (asm_out_file, "(%s_in), ", reg_names[NVPTX_RETURN_REGNUM]); @@ -1951,8 +2165,6 @@ nvptx_print_operand_punct_valid_p (unsigned char c) return c == '.' || c== '#'; } -static void nvptx_print_operand (FILE *, rtx, int); - /* Subroutine of nvptx_print_operand; used to print a memory reference X to FILE. */ static void @@ -2013,12 +2225,10 @@ nvptx_print_operand (FILE *file, rtx x, int code) x = current_insn_predicate; if (x) { - unsigned int regno = REGNO (XEXP (x, 0)); - fputs ("[", file); + fputs ("@", file); if (GET_CODE (x) == EQ) fputs ("!", file); - fputs (reg_names [regno], file); - fputs ("]", file); + output_reg (file, REGNO (XEXP (x, 0)), VOIDmode); } return; } @@ -2313,6 +2523,89 @@ nvptx_reorg_subreg (void) } } +/* Return a SImode "master lane index" register for uniform-simt, allocating on + first use. */ + +static rtx +nvptx_get_unisimt_master () +{ + rtx &master = cfun->machine->unisimt_master; + return master ? master : master = gen_reg_rtx (SImode); +} + +/* Return a BImode "predicate" register for uniform-simt, similar to above. */ + +static rtx +nvptx_get_unisimt_predicate () +{ + rtx &pred = cfun->machine->unisimt_predicate; + return pred ? pred : pred = gen_reg_rtx (BImode); +} + +/* Return true if given call insn references one of the functions provided by + the CUDA runtime: malloc, free, vprintf. 
*/ + +static bool +nvptx_call_insn_is_syscall_p (rtx_insn *insn) +{ + rtx pat = PATTERN (insn); + gcc_checking_assert (GET_CODE (pat) == PARALLEL); + pat = XVECEXP (pat, 0, 0); + if (GET_CODE (pat) == SET) + pat = SET_SRC (pat); + gcc_checking_assert (GET_CODE (pat) == CALL + && GET_CODE (XEXP (pat, 0)) == MEM); + rtx addr = XEXP (XEXP (pat, 0), 0); + if (GET_CODE (addr) != SYMBOL_REF) + return false; + const char *name = XSTR (addr, 0); + /* Ordinary malloc/free are redirected to __nvptx_{malloc,free), so only the + references with forced assembler name refer to PTX syscalls. For vprintf, + accept both normal and forced-assembler-name references. */ + return (!strcmp (name, "vprintf") || !strcmp (name, "*vprintf") + || !strcmp (name, "*malloc") + || !strcmp (name, "*free")); +} + +/* If SET subexpression of INSN sets a register, emit a shuffle instruction to + propagate its value from lane MASTER to current lane. */ + +static void +nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master) +{ + rtx reg; + if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set))) + emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn); +} + +/* Adjust code for uniform-simt code generation variant by making atomics and + "syscalls" conditionally executed, and inserting shuffle-based propagation + for registers being set. 
*/ + +static void +nvptx_reorg_uniform_simt () +{ + rtx_insn *insn, *next; + + for (insn = get_insns (); insn; insn = next) + { + next = NEXT_INSN (insn); + if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn)) + && !(NONJUMP_INSN_P (insn) + && GET_CODE (PATTERN (insn)) == PARALLEL + && get_attr_atomic (insn))) + continue; + rtx pat = PATTERN (insn); + rtx master = nvptx_get_unisimt_master (); + for (int i = 0; i < XVECLEN (pat, 0); i++) + nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master); + rtx pred = nvptx_get_unisimt_predicate (); + pred = gen_rtx_NE (BImode, pred, const0_rtx); + pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat); + validate_change (insn, &PATTERN (insn), pat, false); + } +} + /* Loop structure of the function. The entire function is described as a NULL loop. */ @@ -3829,6 +4122,9 @@ nvptx_reorg (void) /* Replace subregs. */ nvptx_reorg_subreg (); + if (TARGET_UNIFORM_SIMT) + nvptx_reorg_uniform_simt (); + regstat_free_n_sets_and_refs (); df_finish_pass (true); @@ -3857,12 +4153,36 @@ nvptx_handle_kernel_attribute (tree *node, tree name, tree ARG_UNUSED (args), return NULL_TREE; } +/* Handle a "shared" attribute; arguments as in + struct attribute_spec.handler. */ + +static tree +nvptx_handle_shared_attribute (tree *node, tree name, tree ARG_UNUSED (args), + int ARG_UNUSED (flags), bool *no_add_attrs) +{ + tree decl = *node; + + if (TREE_CODE (decl) != VAR_DECL) + { + error ("%qE attribute only applies to variables", name); + *no_add_attrs = true; + } + else if (!(TREE_PUBLIC (decl) || TREE_STATIC (decl))) + { + error ("%qE attribute not allowed with auto storage class", name); + *no_add_attrs = true; + } + + return NULL_TREE; +} + /* Table of valid machine attributes. 
*/ static const struct attribute_spec nvptx_attribute_table[] = { /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler, affects_type_identity } */ { "kernel", 0, 0, true, false, false, nvptx_handle_kernel_attribute, false }, + { "shared", 0, 0, true, false, false, nvptx_handle_shared_attribute, false }, { NULL, 0, 0, false, false, false, NULL, false } }; @@ -3924,13 +4244,13 @@ nvptx_record_offload_symbol (tree decl) case FUNCTION_DECL: { tree attr = get_oacc_fn_attrib (decl); - tree dims = TREE_VALUE (attr); - unsigned ix; + /* OpenMP offloading does not set this attribute. */ + tree dims = attr ? TREE_VALUE (attr) : NULL_TREE; fprintf (asm_out_file, "//:FUNC_MAP \"%s\"", IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl))); - for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims)) + for (; dims; dims = TREE_CHAIN (dims)) { int size = TREE_INT_CST_LOW (TREE_VALUE (dims)); @@ -3991,6 +4311,21 @@ nvptx_file_end (void) if (worker_red_size) write_worker_buffer (asm_out_file, worker_red_sym, worker_red_align, worker_red_size); + + if (need_softstack_decl) + { + write_var_marker (asm_out_file, false, true, "__nvptx_stacks"); + /* 32 is the maximum number of warps in a block. Even though it's an + external declaration, emit the array size explicitly; otherwise, it + may fail at PTX JIT time if the definition is later in link order. */ + fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n", + POINTER_SIZE); + } + if (need_unisimt_decl) + { + write_var_marker (asm_out_file, false, true, "__nvptx_uni"); + fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n"); + } } /* Expander for the shuffle builtins. */ @@ -4176,6 +4511,14 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget), #define PTX_WORKER_LENGTH 32 #define PTX_GANG_DEFAULT 0 /* Defer to runtime. */ +/* Implement TARGET_SIMT_VF target hook: number of threads in a warp. 
*/ + +static int +nvptx_simt_vf () +{ + return PTX_VECTOR_LENGTH; +} + /* Validate compute dimensions of an OpenACC offload or routine, fill in non-unity defaults. FN_LEVEL indicates the level at which a routine might spawn a loop. It is negative for non-routines. If @@ -4944,6 +5287,9 @@ nvptx_goacc_reduction (gcall *call) #undef TARGET_BUILTIN_DECL #define TARGET_BUILTIN_DECL nvptx_builtin_decl +#undef TARGET_SIMT_VF +#define TARGET_SIMT_VF nvptx_simt_vf + #undef TARGET_GOACC_VALIDATE_DIMS #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h index 381269e3bcc..1702178eeb9 100644 --- a/gcc/config/nvptx/nvptx.h +++ b/gcc/config/nvptx/nvptx.h @@ -31,6 +31,10 @@ builtin_assert ("machine=nvptx"); \ builtin_assert ("cpu=nvptx"); \ builtin_define ("__nvptx__"); \ + if (TARGET_SOFT_STACK) \ + builtin_define ("__nvptx_softstack__"); \ + if (TARGET_UNIFORM_SIMT) \ + builtin_define ("__nvptx_unisimt__"); \ } while (0) /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not @@ -79,13 +83,14 @@ #define POINTER_SIZE (TARGET_ABI64 ? 64 : 32) #define Pmode (TARGET_ABI64 ? DImode : SImode) +#define STACK_SIZE_MODE Pmode /* Registers. Since ptx is a virtual target, we just define a few hard registers for special purposes and leave pseudos unallocated. We have to have some available hard registers, to keep gcc setup happy. 
*/ #define FIRST_PSEUDO_REGISTER 16 -#define FIXED_REGISTERS { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 } +#define FIXED_REGISTERS { 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 } #define CALL_USED_REGISTERS { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } #define HARD_REGNO_NREGS(REG, MODE) \ @@ -133,10 +138,17 @@ enum reg_class { NO_REGS, ALL_REGS, LIM_REG_CLASSES }; #define FRAME_POINTER_REGNUM 2 #define ARG_POINTER_REGNUM 3 #define STATIC_CHAIN_REGNUM 4 +/* This register points to the shared memory location with the current warp's + soft stack pointer (__nvptx_stacks[tid.y]). */ +#define SOFTSTACK_SLOT_REGNUM 5 +/* This register is used to save the previous value of the soft stack pointer + in the prologue and restore it when returning. */ +#define SOFTSTACK_PREV_REGNUM 6 #define REGISTER_NAMES \ { \ - "%value", "%stack", "%frame", "%args", "%chain", "%hr5", "%hr6", "%hr7", \ + "%value", "%stack", "%frame", "%args", \ + "%chain", "%sspslot", "%sspprev", "%hr7", \ "%hr8", "%hr9", "%hr10", "%hr11", "%hr12", "%hr13", "%hr14", "%hr15" \ } @@ -200,10 +212,13 @@ struct GTY(()) machine_function bool is_varadic; /* This call is varadic */ bool has_varadic; /* Current function has a varadic call. */ bool has_chain; /* Current function has outgoing static chain. */ + bool has_softstack; /* Current function has a soft stack frame. */ int num_args; /* Number of args of current call. */ int return_mode; /* Return mode of current fn. (machine_mode not defined yet.) */ rtx axis_predicate[2]; /* Neutering predicates. */ + rtx unisimt_master; /* 'Master lane index' for -muniform-simt. */ + rtx unisimt_predicate; /* Predicate for -muniform-simt. 
*/ }; #endif diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index d117343c531..91d11290860 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -36,10 +36,16 @@ UNSPEC_ALLOCA + UNSPEC_SET_SOFTSTACK + UNSPEC_DIM_SIZE UNSPEC_BIT_CONV + UNSPEC_VOTE_BALLOT + + UNSPEC_LANEID + UNSPEC_SHUFFLE UNSPEC_BR_UNIFIED ]) @@ -55,11 +61,16 @@ UNSPECV_FORKED UNSPECV_JOINING UNSPECV_JOIN + + UNSPECV_NOUNROLL ]) (define_attr "subregs_ok" "false,true" (const_string "false")) +(define_attr "atomic" "false,true" + (const_string "false")) + ;; The nvptx operand predicates, in general, don't permit subregs and ;; only literal constants, which differ from the generic ones, which ;; permit subregs and symbolc constants (as appropriate) @@ -124,6 +135,17 @@ return true; }) +(define_attr "predicable" "false,true" + (const_string "true")) + +(define_cond_exec + [(match_operator 0 "predicate_operator" + [(match_operand:BI 1 "nvptx_register_operand" "") + (match_operand:BI 2 "const0_operand" "")])] + "" + "" + ) + (define_constraint "P0" "An integer with the value 0." 
(and (match_code "const_int") @@ -509,7 +531,8 @@ (label_ref (match_operand 1 "" "")) (pc)))] "" - "%j0\\tbra\\t%l1;") + "%j0\\tbra\\t%l1;" + [(set_attr "predicable" "false")]) (define_insn "br_false" [(set (pc) @@ -518,7 +541,8 @@ (label_ref (match_operand 1 "" "")) (pc)))] "" - "%J0\\tbra\\t%l1;") + "%J0\\tbra\\t%l1;" + [(set_attr "predicable" "false")]) ;; unified conditional branch (define_insn "br_true_uni" @@ -527,7 +551,8 @@ UNSPEC_BR_UNIFIED) (const_int 0)) (label_ref (match_operand 1 "" "")) (pc)))] "" - "%j0\\tbra.uni\\t%l1;") + "%j0\\tbra.uni\\t%l1;" + [(set_attr "predicable" "false")]) (define_insn "br_false_uni" [(set (pc) (if_then_else @@ -535,7 +560,8 @@ UNSPEC_BR_UNIFIED) (const_int 0)) (label_ref (match_operand 1 "" "")) (pc)))] "" - "%J0\\tbra.uni\\t%l1;") + "%J0\\tbra.uni\\t%l1;" + [(set_attr "predicable" "false")]) (define_expand "cbranch4" [(set (pc) @@ -938,12 +964,16 @@ "" { return nvptx_output_return (); -}) +} + [(set_attr "predicable" "false")]) (define_expand "epilogue" [(clobber (const_int 0))] "" { + if (TARGET_SOFT_STACK) + emit_insn (gen_set_softstack_insn (gen_rtx_REG (Pmode, + SOFTSTACK_PREV_REGNUM))); emit_jump_insn (gen_return ()); DONE; }) @@ -972,31 +1002,40 @@ (match_operand 1 "nvptx_register_operand")] "" { + if (TARGET_SOFT_STACK) + { + emit_move_insn (stack_pointer_rtx, + gen_rtx_MINUS (Pmode, stack_pointer_rtx, operands[1])); + emit_insn (gen_set_softstack_insn (stack_pointer_rtx)); + emit_move_insn (operands[0], virtual_stack_dynamic_rtx); + DONE; + } /* The ptx documentation specifies an alloca intrinsic (for 32 bit only) but notes it is not implemented. The assembler emits a confused error message. Issue a blunt one now instead. 
*/ sorry ("target cannot support alloca."); emit_insn (gen_nop ()); DONE; - if (TARGET_ABI64) - emit_insn (gen_allocate_stack_di (operands[0], operands[1])); - else - emit_insn (gen_allocate_stack_si (operands[0], operands[1])); - DONE; }) -(define_insn "allocate_stack_" - [(set (match_operand:P 0 "nvptx_register_operand" "=R") - (unspec:P [(match_operand:P 1 "nvptx_register_operand" "R")] - UNSPEC_ALLOCA))] - "" - "%.\\tcall (%0), %%alloca, (%1);") +(define_insn "set_softstack_insn" + [(unspec [(match_operand 0 "nvptx_register_operand" "R")] + UNSPEC_SET_SOFTSTACK)] + "TARGET_SOFT_STACK" +{ + return nvptx_output_set_softstack (REGNO (operands[0])); +}) (define_expand "restore_stack_block" [(match_operand 0 "register_operand" "") (match_operand 1 "register_operand" "")] "" { + if (TARGET_SOFT_STACK) + { + emit_move_insn (operands[0], operands[1]); + emit_insn (gen_set_softstack_insn (operands[0])); + } DONE; }) @@ -1018,14 +1057,16 @@ (const_int 0)) (const_int 0))] "" - "%j0 trap;") + "%j0 trap;" + [(set_attr "predicable" "false")]) (define_insn "trap_if_false" [(trap_if (eq (match_operand:BI 0 "nvptx_register_operand" "R") (const_int 0)) (const_int 0))] "" - "%J0 trap;") + "%J0 trap;" + [(set_attr "predicable" "false")]) (define_expand "ctrap4" [(trap_if (match_operator 0 "nvptx_comparison_operator" @@ -1074,28 +1115,28 @@ UNSPECV_FORK)] "" "// fork %0;" -) + [(set_attr "predicable" "false")]) (define_insn "nvptx_forked" [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_FORKED)] "" "// forked %0;" -) + [(set_attr "predicable" "false")]) (define_insn "nvptx_joining" [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_JOINING)] "" "// joining %0;" -) + [(set_attr "predicable" "false")]) (define_insn "nvptx_join" [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_JOIN)] "" "// join %0;" -) + [(set_attr "predicable" "false")]) (define_expand "oacc_fork" [(set (match_operand:SI 0 
"nvptx_nonmemory_operand" "") @@ -1134,6 +1175,88 @@ "" "%.\\tshfl%S3.b32\\t%0, %1, %2, 31;") +(define_insn "nvptx_vote_ballot" + [(set (match_operand:SI 0 "nvptx_register_operand" "=R") + (unspec:SI [(match_operand:BI 1 "nvptx_register_operand" "R")] + UNSPEC_VOTE_BALLOT))] + "" + "%.\\tvote.ballot.b32\\t%0, %1;") + +;; Patterns for OpenMP SIMD-via-SIMT lowering + +;; Implement IFN_GOMP_SIMT_LANE: set operand 0 to lane index +(define_insn "omp_simt_lane" + [(set (match_operand:SI 0 "nvptx_register_operand" "") + (unspec:SI [(const_int 0)] UNSPEC_LANEID))] + "" + "%.\\tmov.u32\\t%0, %%laneid;") + +;; Implement IFN_GOMP_SIMT_ORDERED: copy operand 1 to operand 0 and +;; place a compiler barrier to disallow unrolling/peeling the containing loop +(define_expand "omp_simt_ordered" + [(match_operand:SI 0 "nvptx_register_operand" "=R") + (match_operand:SI 1 "nvptx_register_operand" "R")] + "" +{ + emit_move_insn (operands[0], operands[1]); + emit_insn (gen_nvptx_nounroll ()); + DONE; +}) + +;; Implement IFN_GOMP_SIMT_XCHG_BFLY: perform a "butterfly" exchange +;; across lanes +(define_expand "omp_simt_xchg_bfly" + [(match_operand 0 "nvptx_register_operand" "=R") + (match_operand 1 "nvptx_register_operand" "R") + (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")] + "" +{ + emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2], + SHUFFLE_BFLY)); + DONE; +}) + +;; Implement IFN_GOMP_SIMT_XCHG_IDX: broadcast value in operand 1 +;; from lane given by index in operand 2 to operand 0 in all lanes +(define_expand "omp_simt_xchg_idx" + [(match_operand 0 "nvptx_register_operand" "=R") + (match_operand 1 "nvptx_register_operand" "R") + (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")] + "" +{ + emit_insn (nvptx_gen_shuffle (operands[0], operands[1], operands[2], + SHUFFLE_IDX)); + DONE; +}) + +;; Implement IFN_GOMP_SIMT_VOTE_ANY: +;; set operand 0 to zero iff all lanes supply zero in operand 1 +(define_expand "omp_simt_vote_any" + [(match_operand:SI 0 
"nvptx_register_operand" "=R") + (match_operand:SI 1 "nvptx_register_operand" "R")] + "" +{ + rtx pred = gen_reg_rtx (BImode); + emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx)); + emit_insn (gen_nvptx_vote_ballot (operands[0], pred)); + DONE; +}) + +;; Implement IFN_GOMP_SIMT_LAST_LANE: +;; set operand 0 to the lowest lane index that passed non-zero in operand 1 +(define_expand "omp_simt_last_lane" + [(match_operand:SI 0 "nvptx_register_operand" "=R") + (match_operand:SI 1 "nvptx_register_operand" "R")] + "" +{ + rtx pred = gen_reg_rtx (BImode); + rtx tmp = gen_reg_rtx (SImode); + emit_move_insn (pred, gen_rtx_NE (BImode, operands[1], const0_rtx)); + emit_insn (gen_nvptx_vote_ballot (tmp, pred)); + emit_insn (gen_ctzsi2 (operands[0], tmp)); + DONE; +}) + ;; extract parts of a 64 bit object into 2 32-bit ints (define_insn "unpacksi2" [(set (match_operand:SI 0 "nvptx_register_operand" "=R") @@ -1186,7 +1309,8 @@ (set (match_dup 1) (unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))] "" - "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;") + "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;" + [(set_attr "atomic" "true")]) (define_insn "atomic_exchange" [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R") ;; output @@ -1197,7 +1321,8 @@ (set (match_dup 1) (match_operand:SDIM 2 "nvptx_nonmemory_operand" "Ri"))] ;; input "" - "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;") + "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;" + [(set_attr "atomic" "true")]) (define_insn "atomic_fetch_add" [(set (match_operand:SDIM 1 "memory_operand" "+m") @@ -1209,7 +1334,8 @@ (set (match_operand:SDIM 0 "nvptx_register_operand" "=R") (match_dup 1))] "" - "%.\\tatom%A1.add%t0\\t%0, %1, %2;") + "%.\\tatom%A1.add%t0\\t%0, %1, %2;" + [(set_attr "atomic" "true")]) (define_insn "atomic_fetch_addsf" [(set (match_operand:SF 1 "memory_operand" "+m") @@ -1221,7 +1347,8 @@ (set (match_operand:SF 0 "nvptx_register_operand" "=R") (match_dup 1))] "" - "%.\\tatom%A1.add%t0\\t%0, %1, %2;") + 
"%.\\tatom%A1.add%t0\\t%0, %1, %2;" + [(set_attr "atomic" "true")]) (define_code_iterator any_logic [and ior xor]) (define_code_attr logic [(and "and") (ior "or") (xor "xor")]) @@ -1237,10 +1364,18 @@ (set (match_operand:SDIM 0 "nvptx_register_operand" "=R") (match_dup 1))] "0" - "%.\\tatom%A1.b%T0.\\t%0, %1, %2;") + "%.\\tatom%A1.b%T0.\\t%0, %1, %2;" + [(set_attr "atomic" "true")]) (define_insn "nvptx_barsync" [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)] "" - "\\tbar.sync\\t%0;") + "\\tbar.sync\\t%0;" + [(set_attr "predicable" "false")]) + +(define_insn "nvptx_nounroll" + [(unspec_volatile [(const_int 0)] UNSPECV_NOUNROLL)] + "" + "\\t.pragma \\\"nounroll\\\";" + [(set_attr "predicable" "false")]) diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt index 601cf124e36..cb6194da9dc 100644 --- a/gcc/config/nvptx/nvptx.opt +++ b/gcc/config/nvptx/nvptx.opt @@ -32,3 +32,15 @@ Link in code for a __main kernel. moptimize Target Report Var(nvptx_optimize) Init(-1) Optimize partition neutering. + +msoft-stack +Target Report Mask(SOFT_STACK) +Use custom stacks instead of local memory for automatic storage. + +muniform-simt +Target Report Mask(UNIFORM_SIMT) +Generate code that can keep local state uniform across all lanes. + +mgomp +Target Report Mask(GOMP) +Generate code for OpenMP offloading: enables -msoft-stack and -muniform-simt. 
diff --git a/gcc/config/nvptx/t-nvptx b/gcc/config/nvptx/t-nvptx index e2580c956e0..6c1010ddd66 100644 --- a/gcc/config/nvptx/t-nvptx +++ b/gcc/config/nvptx/t-nvptx @@ -8,3 +8,5 @@ ALL_HOST_OBJS += mkoffload.o mkoffload$(exeext): mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBDEPS) +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \ mkoffload.o collect-utils.o libcommon-target.a $(LIBIBERTY) $(LIBS) + +MULTILIB_OPTIONS = mgomp diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi index 0669f7999be..4dcc7f6b555 100644 --- a/gcc/doc/extend.texi +++ b/gcc/doc/extend.texi @@ -5576,6 +5576,7 @@ attributes. * MeP Variable Attributes:: * Microsoft Windows Variable Attributes:: * MSP430 Variable Attributes:: +* Nvidia PTX Variable Attributes:: * PowerPC Variable Attributes:: * RL78 Variable Attributes:: * SPU Variable Attributes:: @@ -6257,6 +6258,20 @@ same name (@pxref{MSP430 Function Attributes}). These attributes can be applied to both functions and variables. @end table +@node Nvidia PTX Variable Attributes +@subsection Nvidia PTX Variable Attributes + +These variable attributes are supported by the Nvidia PTX back end: + +@table @code +@item shared +@cindex @code{shared} attribute, Nvidia PTX +Use this attribute to place a variable in the @code{.shared} memory space. +This memory space is private to each cooperative thread array; only threads +within one thread block refer to the same instance of the variable. +The runtime does not initialize variables in this memory space. +@end table + @node PowerPC Variable Attributes @subsection PowerPC Variable Attributes diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 1d24b31a3f9..620225c37cf 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -20570,6 +20570,37 @@ offloading execution. Apply partitioned execution optimizations. This is the default when any level of optimization is selected. 
+@item -msoft-stack +@opindex msoft-stack +Generate code that does not use @code{.local} memory +directly for stack storage. Instead, a per-warp stack pointer is +maintained explicitly. This enables variable-length stack allocation (with +variable-length arrays or @code{alloca}), and when global memory is used for +underlying storage, makes it possible to access automatic variables from other +threads, or with atomic instructions. This code generation variant is used +for OpenMP offloading, but the option is exposed on its own for the purpose +of testing the compiler; to generate code suitable for linking into programs +using OpenMP offloading, use option @option{-mgomp}. + +@item -muniform-simt +@opindex muniform-simt +Switch to code generation variant that allows to execute all threads in each +warp, while maintaining memory state and side effects as if only one thread +in each warp was active outside of OpenMP SIMD regions. All atomic operations +and calls to runtime (malloc, free, vprintf) are conditionally executed (iff +current lane index equals the master lane index), and the register being +assigned is copied via a shuffle instruction from the master lane. Outside of +SIMD regions lane 0 is the master; inside, each thread sees itself as the +master. Shared memory array @code{int __nvptx_uni[]} stores all-zeros or +all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD +regions). Each thread can bitwise-and the bitmask at position @code{tid.y} +with current lane index to compute the master lane index. + +@item -mgomp +@opindex mgomp +Generate code for use in OpenMP offloading: enables @option{-msoft-stack} and +@option{-muniform-simt} options, and selects corresponding multilib variant. + @end table @node PDP-11 Options diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 85341aeeace..84bba07de27 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -5862,6 +5862,10 @@ usable. 
In that case, the smaller the number is, the more desirable it is to use it. @end deftypefn +@deftypefn {Target Hook} int TARGET_SIMT_VF (void) +Return number of threads in SIMT thread group on the target. +@end deftypefn + @deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level}) This hook should check the launch dimensions provided for an OpenACC compute region, or routine. Defaulted values are represented as -1 diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 400d574f3fa..9afd5daa65b 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4295,6 +4295,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_SIMD_CLONE_USABLE +@hook TARGET_SIMT_VF + @hook TARGET_GOACC_VALIDATE_DIMS @hook TARGET_GOACC_DIM_LIMIT diff --git a/gcc/target-insns.def b/gcc/target-insns.def index a6a040eb173..e011a5a7f26 100644 --- a/gcc/target-insns.def +++ b/gcc/target-insns.def @@ -68,6 +68,12 @@ DEF_TARGET_INSN (oacc_dim_pos, (rtx x0, rtx x1)) DEF_TARGET_INSN (oacc_dim_size, (rtx x0, rtx x1)) DEF_TARGET_INSN (oacc_fork, (rtx x0, rtx x1, rtx x2)) DEF_TARGET_INSN (oacc_join, (rtx x0, rtx x1, rtx x2)) +DEF_TARGET_INSN (omp_simt_lane, (rtx x0)) +DEF_TARGET_INSN (omp_simt_last_lane, (rtx x0, rtx x1)) +DEF_TARGET_INSN (omp_simt_ordered, (rtx x0, rtx x1)) +DEF_TARGET_INSN (omp_simt_vote_any, (rtx x0, rtx x1)) +DEF_TARGET_INSN (omp_simt_xchg_bfly, (rtx x0, rtx x1, rtx x2)) +DEF_TARGET_INSN (omp_simt_xchg_idx, (rtx x0, rtx x1, rtx x2)) DEF_TARGET_INSN (prefetch, (rtx x0, rtx x1, rtx x2)) DEF_TARGET_INSN (probe_stack, (rtx x0)) DEF_TARGET_INSN (probe_stack_address, (rtx x0)) diff --git a/gcc/target.def b/gcc/target.def index caeeff9c22a..c24b4cf5ee2 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1648,6 +1648,18 @@ int, (struct cgraph_node *), NULL) HOOK_VECTOR_END (simd_clone) +/* Functions relating to OpenMP SIMT vectorization transform. 
*/ +#undef HOOK_PREFIX +#define HOOK_PREFIX "TARGET_SIMT_" +HOOK_VECTOR (TARGET_SIMT, simt) + +DEFHOOK +(vf, +"Return number of threads in SIMT thread group on the target.", +int, (void), NULL) + +HOOK_VECTOR_END (simt) + /* Functions relating to openacc. */ #undef HOOK_PREFIX #define HOOK_PREFIX "TARGET_GOACC_" diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index a1a66c3775a..4496de29f35 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,11 @@ +2016-11-16 Alexander Monakov + + * lib/target-supports.exp (check_effective_target_alloca): Use a + compile test. + * gcc.target/nvptx/softstack.c: New test. + * gcc.target/nvptx/decl-shared.c: New test. + * gcc.target/nvptx/decl-shared-init.c: New test. + 2016-11-16 Maciej W. Rozycki * gcc.target/mips/data-sym-jump.c: New test case. diff --git a/gcc/testsuite/gcc.target/nvptx/decl-shared-init.c b/gcc/testsuite/gcc.target/nvptx/decl-shared-init.c new file mode 100644 index 00000000000..6a99b1c338a --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/decl-shared-init.c @@ -0,0 +1 @@ +int var __attribute__((shared)) = 0; /* { dg-error "static initialization .* not supported" } */ diff --git a/gcc/testsuite/gcc.target/nvptx/decl-shared.c b/gcc/testsuite/gcc.target/nvptx/decl-shared.c new file mode 100644 index 00000000000..367075cebe2 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/decl-shared.c @@ -0,0 +1,14 @@ +static int v_internal __attribute__((shared,used)); +int v_common __attribute__((shared)); +int v_extdef __attribute__((shared,nocommon)); +extern int v_extdecl __attribute__((shared)); + +int use() +{ + return v_extdecl; +} + +/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.shared \[^,\r\n\]*v_internal" } } */ +/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.weak .shared \[^,\r\n\]*v_common" } } */ +/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.visible .shared \[^,\r\n\]*v_extdef" } } */ +/* { dg-final { scan-assembler "\[\r\n\]\[\t \]*.extern .shared 
\[^,\r\n\]*v_extdecl" } } */ diff --git a/gcc/testsuite/gcc.target/nvptx/softstack.c b/gcc/testsuite/gcc.target/nvptx/softstack.c new file mode 100644 index 00000000000..73e60f282a7 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/softstack.c @@ -0,0 +1,23 @@ +/* { dg-options "-O2 -msoft-stack" } */ +/* { dg-do run } */ + +static __attribute__((noinline,noclone)) int f(int *p) +{ + return __sync_lock_test_and_set(p, 1); +} + +static __attribute__((noinline,noclone)) int g(int n) +{ + /* Check that variable-length stack allocation works. */ + int v[n]; + v[0] = 0; + /* Check that atomic operations can be applied to auto data. */ + return f(v) == 0 && v[0] == 1; +} + +int main() +{ + if (!g(1)) + __builtin_abort(); + return 0; +} diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index e62b7688798..ec99708f633 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -763,7 +763,10 @@ proc check_effective_target_untyped_assembly {} { proc check_effective_target_alloca {} { if { [istarget nvptx-*-*] } { - return 0 + return [check_no_compiler_messages alloca assembly { + void f (void*); + void g (int n) { f (__builtin_alloca (n)); } + }] } return 1 } diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog index cfa61115f6a..4daa0aa3409 100644 --- a/libgcc/ChangeLog +++ b/libgcc/ChangeLog @@ -1,3 +1,9 @@ +2016-11-16 Alexander Monakov + + * config/nvptx/crt0.c (__main): Setup __nvptx_stacks and __nvptx_uni. + * config/nvptx/mgomp.c: New file. 
+ * config/nvptx/t-nvptx: Add mgomp.c + 2016-11-16 Waldemar Brodkorb PR libgcc/68468 diff --git a/libgcc/config/nvptx/crt0.c b/libgcc/config/nvptx/crt0.c index 3b7382d43af..5a7dbf6f22f 100644 --- a/libgcc/config/nvptx/crt0.c +++ b/libgcc/config/nvptx/crt0.c @@ -24,6 +24,14 @@ int *__exitval_ptr; extern void __attribute__((noreturn)) exit (int status); extern int main (int, void **); +/* Always setup soft stacks to allow testing with -msoft-stack but without + -mgomp. 32 is the maximum number of warps in a CTA: the definition here + must match the external declaration emitted by the compiler. */ +void *__nvptx_stacks[32] __attribute__((shared,nocommon)); + +/* Likewise for -muniform-simt. */ +unsigned __nvptx_uni[32] __attribute__((shared,nocommon)); + void __attribute__((kernel)) __main (int *rval_ptr, int argc, void **argv) { @@ -33,5 +41,9 @@ __main (int *rval_ptr, int argc, void **argv) if (rval_ptr) *rval_ptr = 255; + static char stack[131072] __attribute__((aligned(8))); + __nvptx_stacks[0] = stack + sizeof stack; + __nvptx_uni[0] = 0; + exit (main (argc, argv)); } diff --git a/libgcc/config/nvptx/mgomp.c b/libgcc/config/nvptx/mgomp.c new file mode 100644 index 00000000000..d8ca5818314 --- /dev/null +++ b/libgcc/config/nvptx/mgomp.c @@ -0,0 +1,32 @@ +/* Define shared memory arrays for -msoft-stack and -muniform-simt. + + Copyright (C) 2015-2016 Free Software Foundation, Inc. + + This file is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by the + Free Software Foundation; either version 3, or (at your option) any + later version. + + This file is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. 
+ + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* OpenACC offloading does not use these symbols; thus, they are exposed + only for the -mgomp multilib. The same definitions are also provided + in crt0.c for the case of non-offloading compilation. 32 is the maximum + number of warps in a CTA. */ + +#if defined(__nvptx_softstack__) && defined(__nvptx_unisimt__) +void *__nvptx_stacks[32] __attribute__((shared,nocommon)); +unsigned __nvptx_uni[32] __attribute__((shared,nocommon)); +#endif diff --git a/libgcc/config/nvptx/t-nvptx b/libgcc/config/nvptx/t-nvptx index daf252f2472..c4d20c94cbb 100644 --- a/libgcc/config/nvptx/t-nvptx +++ b/libgcc/config/nvptx/t-nvptx @@ -1,4 +1,5 @@ -LIB2ADD=$(srcdir)/config/nvptx/reduction.c +LIB2ADD=$(srcdir)/config/nvptx/reduction.c \ + $(srcdir)/config/nvptx/mgomp.c LIB2ADDEH= LIB2FUNCS_EXCLUDE=__main
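Note (not part of the commit): the master-lane rule documented for -muniform-simt in the invoke.texi hunk, and the ballot-based expanders omp_simt_vote_any and omp_simt_last_lane in nvptx.md, can be modeled on the host. The following is a minimal sketch under stated assumptions, not the backend's implementation. The names master_lane, vote_ballot and lowest_set_lane are hypothetical, and the WARP_SIZE return value for an empty mask is a choice made only for this sketch.

```c
/* Host-side model of two mechanisms described in this patch.
   NOT part of the commit; all function names are hypothetical.  */
#include <stdint.h>

#define WARP_SIZE 32

/* -muniform-simt: __nvptx_uni[tid.y] holds 0 outside SIMD regions and
   all-ones inside.  ANDing it with the lane index yields the master
   lane: lane 0 outside SIMD regions, the thread's own lane inside.  */
static unsigned
master_lane (uint32_t uni_mask, unsigned laneid)
{
  return uni_mask & laneid;
}

/* Model of vote.ballot.b32: bit I of the result is set iff lane I
   supplies a non-zero value.  The result is zero iff every lane
   supplied zero, which is the property omp_simt_vote_any relies on.  */
static uint32_t
vote_ballot (const int vals[WARP_SIZE])
{
  uint32_t mask = 0;
  for (unsigned lane = 0; lane < WARP_SIZE; lane++)
    if (vals[lane] != 0)
      mask |= (uint32_t) 1 << lane;
  return mask;
}

/* Model of the ballot + count-trailing-zeros sequence used by
   omp_simt_last_lane: index of the lowest set bit.  Returning
   WARP_SIZE for an empty mask is an assumption of this sketch.  */
static unsigned
lowest_set_lane (uint32_t mask)
{
  unsigned lane = 0;
  if (mask == 0)
    return WARP_SIZE;
  while ((mask & 1) == 0)
    {
      mask >>= 1;
      lane++;
    }
  return lane;
}
```

With an all-zeros mask every lane resolves to master lane 0, so atomics and runtime calls are executed once per warp; with an all-ones mask each lane is its own master.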