NOTICE: This version of the NSF Unidata web site (archive.unidata.ucar.edu) is no longer being updated.
Current content can be found at unidata.ucar.edu.
To learn about what's going on, see About the Archive Site.
nco & netcdf aficionados. In preparing to optimize nco via MPI, I've been doing some profiling of the code to see where it's bogging down. Here are the benchmark-like runs done with oprofile, a free, very cool, easy-to-install, easy-to-use, stochastic profiler available for a variety of platforms (oprofile.sf.net) The first few have been done 2x to show the kind of reproducibility between runs (they're not identical b/c oprofile time-samples instead of using hardware counters or synthetic CPUS (like valgrind) - the upside is that there's no need to specially instrument the code (besides adding the -g CFLAG for compilation) and (unlike valgrind), the apps run at close to normal speed; the downside is that the process will miss a few counts). I've truncated most of the outputs, except the second below to show the entire range of counts. Some comments <---- are inserted. This approach has pointed out some targets for us to optimize, but it also points out some problems in general with aspects of netcdf files (some of which has probably been gone over in depth before). Most troubling for nco is the byteswapping bottleneck (swapn4b at netcdf-3.6.0/src/libsrc/ncx.c:119), but there are a number of similar byeswap fn()s. Some of the nco's spend >95% of the whole execution time byteswapping and many common operations spend >20% byteswapping. Charlie may have mentioned this in the past - I'd like to bring it up again. Would it be possible to allow pass-thru of those bytes that are not being changed (it seems silly to byteswap twice for no reason). Or, since a file read once and written on one architecture is most likely to be read and written again on the same architecture, can the byte order be written native for that arch with the order noted in the attributes? Does netcdf have any mechanisms for allowing this approach currently? One reason for revisiting this issue is that Charlie was buttonholed by a guy who was in charge of postprocessing IPCC files and the was pretty perturbed by the amount of time it took to do it. His requirements may not have been directly related, but even 30% speedup is significant for petabyte data sizes. I'm not sure yet how much time opendap spends doing this, but it wouldn't surprise me if it's similar. Methology: After starting the oprofile daemon: 'sudo opcontrol --no-vmlinux --start' to start the profiler, I ran: 'sudo opcontrol --reset' between each run to clear the profile counters to prevent previous runs from contaminating successive runs. Because of the way it works, it can accurately sample very long-running processes and your results will be confounded (as mine were) if different apps call the same functions or even (possibly) different functions in the same library. So make sure to reset oprofile before running another test. In each stanza below, the commandline shown was run after resetting oprofile and the resulting report was truncated in most instances at the 1-2% level. /usr/local/bin/ncap -h -O -s "nu_var1 [time,lat,lon,lev]=d4_01*d4_02*(d4_03**2)-(d4_05/d4_06)" -s "nu_var2 [lat,time,lev,lon]=(d4_13/d4_02)*((d4_03**2)-(d4_05/d4_06))" -s "nu_var3 [time,lat,lon]=(d3_08*3d_01)-(3d_05**3)-(3d_11*3d_16)" -s "nu_var4 [lon,lat,time]=(d3_08+3d_01)-(3d_05*3)-3d_11-17.33)" -p /data/hjm/nco_bm ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc (2 runs) CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 44010 96.8466 swapn4b <- 96% of time spent byteswapping!! 688 1.5140 px_get CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 44973 96.6995 swapn4b 699 1.5030 px_get 211 0.4537 ncio_px_get 115 0.2473 px_pgin 90 0.1935 getNCvx_float_float 90 0.1935 putNCvx_float_float 84 0.1806 ncx_howmany 54 0.1161 ncx_putn_float_float 47 0.1011 px_pgout 32 0.0688 ncx_getn_float_float 25 0.0538 ncio_px_rel 14 0.0301 anonymous symbol from section .plt 11 0.0237 px_rel 7 0.0151 NC_findvar 6 0.0129 nc_put_vara_float 6 0.0129 nco_put_vara 5 0.0108 NC_varoffset 4 0.0086 NCcoordck 4 0.0086 nc_get_vara_float 3 0.0065 NC_check_id 3 0.0065 getNCv_float 2 0.0043 NCvnrecs 2 0.0043 nco_var_val_cpy 2 0.0043 putNCv_float 2 0.0043 yyparse 1 0.0022 CreateNode 1 0.0022 NC_check_vlen 1 0.0022 NC_endef 1 0.0022 NC_lookupvar 1 0.0022 NCedgeck 1 0.0022 nc__create_mp 1 0.0022 nc_inq_att 1 0.0022 nco_free 1 0.0022 nco_get_vara 1 0.0022 nco_inq_dimid_flg 1 0.0022 nco_mss_val_get 1 0.0022 nco_typ_lng 1 0.0022 nco_var_fll 1 0.0022 new_NC 1 0.0022 odo1 1 0.0022 set_upper 1 0.0022 var_dfl_set ##################################### /usr/local/bin/ncrcat -h -O -n 4,2,1 -p /data/hjm/nco_bm skn_lgs_00.nc /data/hjm/nco_bm/foo.nc (2 runs for 4 files each) CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 162908 30.7513 px_get <--- looks like I/O bound..? 43256 8.1652 main 26857 5.0697 getNCvx_float_float 26640 5.0287 nco_var_get 18739 3.5373 nc_get_var1_float 18640 3.5186 ncio_px_get 16911 3.1922 nc_put_var1_float 16301 3.0771 NC_findattr 16017 3.0235 putNCvx_float_float 15633 2.9510 NCcoordck 13847 2.6138 NC_lookupattr 13791 2.6033 nco_pck_dsk_inq 13701 2.5863 getNCv_float 12746 2.4060 nco_get_var1 12226 2.3078 nco_inq_att_flg 9992 1.8861 NC_check_id 7214 1.3618 swapn4b CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 156526 30.6816 px_get 41796 8.1927 main 26904 5.2736 getNCvx_float_float 25713 5.0402 nco_var_get 18392 3.6051 nc_get_var1_float 18277 3.5826 ncio_px_get 16134 3.1625 nc_put_var1_float 15173 2.9742 NC_findattr 14905 2.9216 NCcoordck 14595 2.8609 putNCvx_float_float 13830 2.7109 getNCv_float 13303 2.6076 nco_pck_dsk_inq 12602 2.4702 nco_get_var1 12523 2.4547 NC_lookupattr 11285 2.2120 nco_inq_att_flg ##################################### /usr/local/bin/ncrcat -h -O -n 30,2,1 -p /data/hjm/nco_bm skn_lgs_00.nc /data/hjm/nco_bm/foo.nc (2 runs for 30 files each to see if there was any change in distribution (not really)) CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1188898 30.3816 px_get <--- I/O bound..? 329096 8.4098 main 194113 4.9604 getNCvx_float_float 193716 4.9503 nco_var_get 139195 3.5570 nc_get_var1_float 138361 3.5357 ncio_px_get 130332 3.3306 nc_put_var1_float 120096 3.0690 NC_findattr 117864 3.0119 putNCvx_float_float 115390 2.9487 NCcoordck 100581 2.5703 nco_pck_dsk_inq 100496 2.5681 NC_lookupattr 97406 2.4892 nco_get_var1 96558 2.4675 getNCv_float 85875 2.1945 nco_inq_att_flg 77161 1.9718 NC_check_id 52039 1.3298 swapn4b CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1143662 29.1803 px_get 330483 8.4322 main 209255 5.3391 getNCvx_float_float 202318 5.1621 nco_var_get 141829 3.6187 nc_get_var1_float 134400 3.4292 ncio_px_get 133916 3.4168 nc_put_var1_float 120232 3.0677 NCcoordck 117733 3.0039 NC_findattr 114406 2.9190 putNCvx_float_float 106404 2.7149 getNCv_float 102715 2.6208 nco_pck_dsk_inq 102116 2.6055 nco_get_var1 99158 2.5300 NC_lookupattr 90043 2.2974 nco_inq_att_flg 79626 2.0316 NC_check_id 52848 1.3484 swapn4b ##################################### /usr/local/bin/ncra -D2 -h -O -n 4,2,1 -p /data/hjm/nco_bm ipcc_dly_T85_00.nc /data/hjm/nco_bm/foo.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 227839 65.9413 nco_var_add_tll_ncra <---- optimization / MPI target! 109567 31.7109 swapn4b <---- again, byteswapping 5253 1.5203 nco_var_nrm 1200 0.3473 px_get ##################################### /usr/local/bin/ncpdq -D2 -h -O -a 'lon,time,lev,lat' -p /data/hjm/nco_bm ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1736463 96.9924 nco_var_dmn_rdr_val <---- optimization / MPI target! 49459 2.7626 swapn4b 2915 0.1628 anonymous symbol from section .plt 659 0.0368 px_get /usr/local/bin/ncflint -D2 -h -O -w '0.5' -p /data/hjm/nco_bm ipcc_dly_T85_00.nc ipcc_dly_T85_01.nc /data/hjm/nco_bm/foo.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 115001 29.4886 nco_var_mlt 113175 29.0204 nco_var_cnf_typ 82781 21.2268 nco_var_add_tll_ncflint 75683 19.4067 swapn4b <---- again, byteswapping 969 0.2485 px_get ##################################### /usr/local/bin/ncwa -D2 -h -O -y sqrt -a lat,lon /data/hjm/nco_bm/ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1535141 95.8639 nco_var_avg <---- optimization / MPI target! 37378 2.3341 nco_var_avg_reduce_ttl 26714 1.6682 swapn4b CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1539548 95.8061 nco_var_avg 38195 2.3769 nco_var_avg_reduce_ttl 26820 1.6690 swapn4b [with a different, smaller file, relatively more time is spent byteswapping] /usr/local/bin/ncwa -D2 -h -O -y sqrt -a lat,lon ~/data/nco_bm/sml_stl_28.nc ~/data/nco_bm/sml_stl_out.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 47241 89.9263 nco_var_avg 2607 4.9626 swapn4b 2450 4.6637 nco_var_avg_reduce_ttl ##################################### ##################################### /usr/local/bin/ncbo -D2 -h -O --op_typ='-' -p /data/hjm/nco_bm ipcc_dly_T85.nc ipcc_dly_T85_00.nc /data/hjm/nco_bm/foo.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 74162 69.2249 swapn4b <---- again, byteswapping 30484 28.4546 nco_var_sbt 1045 0.9754 px_get ##################################### /usr/local/bin/ncea -h -O -D2 -n 32,2,1 -p /data/hjm/nco_bm stl_5km_00.nc /data/hjm/nco_bm/foo.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 489123 65.1199 nco_var_add_tll_ncra <---- optimization / MPI target! 246314 32.7933 swapn4b <---- again, byteswapping 10816 1.4400 nco_var_nrm 2456 0.3270 px_get ##################################### /usr/local/bin/ncecat -D2 -h -O -n 32,2,1 -p /data/hjm/nco_bm skn_lgs_00.nc /data/hjm/nco_bm/foo.nc CPU: AMD64 processors, speed 1804.13 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 561993 54.5130 px_get <------ I/O bound..? 132219 12.8252 getNCvx_float_float 66122 6.4138 nc_get_vara_float 63292 6.1393 ncio_px_get 51907 5.0350 swapn4b 46320 4.4930 getNCv_float 27221 2.6404 NC_varoffset 23341 2.2641 ncx_getn_float_float 17055 1.6543 ncio_px_rel 13584 1.3176 ncx_howmany 13558 1.3151 px_rel 13390 1.2988 odo1 -- Cheers, Harry Harry J Mangalam - 949 856 2847 (vox; email for fax) - hjm@xxxxxxxxx <<plain text preferred>>
netcdfgroup
archives: