isthescalerproductofaandb,f(x)istheinputsignal,jistheanalysislevel,φandψarethescalingandwaveletfunctionrespectively.Manydifferentwaveletfamiliesex-istintheliterature,suchasHaar,Daubechies,andMexican-hat.WeuseDaubechies-6inourexperiments.Otherfamilieswehavetestedproduceasimilarresult.Onhigh-resolutionlevels,thepointswithhighwaveletcoefficientvaluessignalabruptchanges;there-foretheyarelikelyphasechangingpoints.Thewaveletfilteringtakesthereuse-distancetraceofeachdatasampleasasignal,thencomputesthelevel-1coefficientforeachaccessandremovesfromthetracetheaccesseswithalowwaveletcoefficientvalue.Anaccessiskeptonlyifitscoefficientω>m+3δ,wheremisthemeanandδisthestandarddeviation.Thedifferencebetweenthiscoefficientandothersisstatisticallysignifi-cant.Wehaveexperimentedwithcoefficientsofthenextfourlevelsandfoundthelevel-1coefficientadequate.
Figure2showsthewaveletfilteringfortheaccesstraceofadatasampleinMolDyn,amoleculardynamicssimulationprogram.Thefilteringremovesaccessesduringthegradualchangesbecausetheyhavelowcoefficients.Notethatitcorrectlyremovesaccessesthatcorrespondtolocalpeaks.Theremainingfouraccessesindicateglobalphasechanges.
Sherwoodetal.usedtheFouriertransformtofindperiodicpat-ternsinexecutiontrace[29].TheFouriertransformshowsthefre-quenciesappearedduringthewholesignal.Incomparison,waveletsgivesthetime-frequencyorthefrequenciesappearedovertime.Josephetal.usedwaveletstoanalyzethechangeofprocessorvoltageovertimeandtomakeon-linepredictionsusinganeffi-cientHaar-waveletimplementation[18].Weusewaveletssimilartotheiroff-lineanalysisbutatmuchfinergranularity(becauseofthenatureofourproblem).Insteadoffilteringtheaccesstraceofalldata,weanalyzethesub-traceforeachdataelement.Thisiscriticalbecauseagradualchangeinthesubtracemaybeseenasanabruptchangeinthewholetraceandcausefalsepositivesinthewaveletanalysis.WewillshowanexamplelaterinFigure3(b),wheremostabruptchangesseenfromthewholetracearenotphasechanges.
Afteritfiltersthesub-traceofeachdatasample,thefilteringsteprecompilestheremainingaccessesofalldatasamplesintheorderoflogicaltime.Thenewtraceiscalledafilteredtrace.Sincetheremainingaccessesofdifferentdataelementsmaysignalthesamephaseboundary,weuseoptimalphasepartitioningtofurtherremovetheseredundantindicators.
Figure2:Awavelettransformexample,wheregradualchangesarefilteredout
2.2.3OptimalPhasePartitioning
Ataphaseboundary,manydatachangetheiraccesspatterns.Sincethewaveletfilteringremovesreusesofthesamedatawithinaphase,theremainingismainlyaccessestodifferentdatasamplesclusteredatphaseboundaries.Thesetwopropertiessuggesttwoconditionsforagoodphasepartition.First,aphaseshouldincludeaccessestoasmanydatasamplesaspossible.Thisensuresthatwedonotartificiallycutaphaseintosmallerpieces.Second,aphaseshouldnotincludemultipleaccessesofthesamedatasample,sincedatareusesindicatephasechangesinthefilteredtrace.Thecomplication,however,comesfromtheimperfectfilteringbythewavelettransform.Notallreusesrepresentaphasechange.
Weconvertthefilteredtraceintoadirectedacyclicgraphwhereeachnodeisanaccessinthetrace.Eachnodehasadirectededgetoallsucceedingnodes.Eachedge(fromaccessatob)hasaweightdefinedasw=αr+1,where1≥α≥0,andristhenum-berofnoderecurrencesbetweenaandb.Forexample,thetraceaceefgefbdhastworecurrencesofeandonerecurrenceoffbe-tweencandb,sotheedgeweightbetweenthetwonodesis3α+1.Intuitively,theweightmeasureshowfitthesegmentfromatobisasaphase.Thetwofactorsintheweightpenalizetwotendencies.Thefirstistheinclusionofreuses,andthesecondisthecreationofnewphases.Theoptimalcaseisaminimalnumberofphaseswithleastreusesineachphase.Sincethetraceisnotperfect,theweightandthefactorαcontroltherelativepenaltyfortoolargeortoosmallphases.Ifαis1,weprohibitanyreusesinaphase.Wemayhaveasmanyphasesasthelengthofthefilteredtrace.Theresultwhenα≥1isthesameasα=1.Ifαis0,wegetonephase.Inexperiments,wefoundthatthephasepartitionsweresimilarwhenαisbetween0.2and0.8,suggestingthatthenoiseinthefilteredtracewasacceptable.Weusedα=0.5intheevaluation.
Onceαisdetermined,shortest-pathanalysisfindsaphaseparti-tionthatminimizesthetotalpenalty.Itaddstwonodes:asourcenodethathasdirectededgesflowingtoalloriginalnodes,andasinknodethathasdirectededgescomingfromalloriginalnodes.Anydirectedpathfromthesourcetothesinkgivesaphaseparti-tion.Theweightofthepathisitspenalty.Therefore,thebestphasepartitiongivestheleastpenalty,anditisgivenbytheshortestpathbetweenthesourceandthesink.
Summaryofoff-linephasedetectionTheprogramlocalityisaproductofallaccessestoallprogramdata.Thephasedetectionfirstpicksenoughsamplesintimeandspacetocapturethehigh-levelpattern.Thenituseswaveletstoremovethetemporalredun-dancyandphasepartitioningtoremovethespatialredundancy.Thenextchallengeismarkingthephasesinprogramcode.Thewaveletfilteringlosesaccuratetimeinformationbecausesamplesarecon-sideredapairatatime(tomeasurethedifference).Inaddition,thelocalitymaychangethroughatransitionperiodinsteadofatran-sitionpoint.Hencetheexacttimeofaphasechangeisdifficulttoattain.Weaddressthisprobleminthenextstep.
2.3PhaseMarkerSelection
Theinstructiontraceofanexecutionisrecordedatthegranular-ityofbasicblocks.Theresultisablocktrace,whereeachelementisthelabelofabasicblock.Thisstepfindsthebasicblocksinthecodethatuniquelymarkdetectedphases.Previousprogramanalysisconsideredonlyasubsetofcodelocations,forexamplefunctionandloopboundaries[17,21,22].Ouranalysisexaminesallinstructionblocks,whichisequivalenttoexaminingallprograminstructions.Thisisespeciallyimportantatthebinarylevel,wherethehighlevelprogramstructuremaybelostduetoaggressivecom-pilertransformationssuchasprocedurein-lining,softwarepipelin-ing,loopfusion,andcodecompression.
Asexplainedearlier,phasedetectionfindsthenumberofphasesbutcannotlocatetheprecisetimeofphasetransitions.Thepreci-sionisintheorderofhundredsofmemoryaccesseswhileatypicalbasicblockhasfewerthantenmemoryreferences.Moreover,thetransitionmaybegradual,anditisimpossibletolocateasinglepoint.Wesolvethisproblembyusingthefrequencyofthephasesinsteadofthetimeoftheirtransition.
Wedefinethefrequencyofaphasebythenumberofitsexecu-tionsinthetrainingrun.Giventhefrequencyfoundbythelaststep,wewanttoidentifyabasicblockthatisalwaysexecutedatthebe-ginningofaphase.Wecallitthemarkerblockforthisphase.Ifthefrequencyofaphaseisf,themarkerblockshouldappearnomorethanftimesintheblocktrace.Thefirststepofthemarkerselec-tionfilterstheblocktraceandkeepsonlyblockswhosefrequencyisnomorethanf.Ifaloopisaphase,thefilteringwillremovetheoccurrencesoftheloopbodyblockandkeeponlytheheaderandtheexitblocks.Ifasetofmutualrecursivefunctionsformsaphase,thefilteringwillremovethecodeofthefunctionsandkeeponlytheonesbeforeandaftertherootinvocation.Afterfiltering,theremainingblocksarecandidatemarkers.
Afterfrequency-basedfiltering,theremovedblocksleavelargeblankregionsbetweentheremainingblocks.Ifablankregionislargerthanathreshold,itisconsideredasaphaseexecution.Thethresholdisdeterminedbythelengthdistributionoftheblankre-gions,thefrequencyofphases,andtheexecutionlength.Sincethetrainingrunshadatleast3.5millionmemoryaccesses,wesimplyused10thousandinstructionsasthethreshold.Inotherwords,aphaseexecutionmustconsumeatleast0.3%ofthetotalexecutiontobeconsideredsignificant.Wecanuseasmallerthresholdtofindsub-phasesafterwefindlargephases.
Oncethephaseexecutionsareidentified,theanalysisconsiderstheblockthatcomesafteraregionasmarkersmarkingthebound-arybetweenthetwophases.Tworegionsareexecutionsofthesamephaseiftheyfollowthesamecodeblock.Theanalysispicksmarkersthatmarkmostifnotallexecutionsofthephasesinthetrainingrun.Wehaveconsideredseveralimprovementsthatcon-siderthelengthoftheregion,usemultiplemarkersforthesamephase,andcorrelatemarkerselectionacrossmultipleruns.How-ever,thisbasicschemesufficesforprogramswetested.
Requiringthemarkerfrequencytobenomorethanthephasefrequencyisnecessarybutnotsufficientforphasemarking.Aphasemaybefragmentedbyinfrequentlyexecutedcodeblocks.However,afalsemarkercannotdivideaphasemorethanftimes.Inaddition,thepartialphaseswillberegroupedinthenextstep,phase-hierarchyconstruction.
2.4MarkingThePhaseHierarchy
HierarchicalconstructionGiventhedetectedphases,wecon-structaphasehierarchyusinggrammarcompression.Thepur-poseistoidentifycompositephasesandincreasethegranularityofphaseprediction.Forexample,fortheTomcatvprogramshowedinFigure1,everyfivephaseexecutionsformatimestepthatre-peatsasacompositephase.Byconstructingthephasehierarchy,wefindphasesofthelargestgranularity.
WeuseSEQUITUR,alinear-timeandlinear-spacecompressionmethoddevelopedbyNevill-ManningandWitten[26].Itcom-pressesastringofsymbolsintoaContextFreeGrammar.Tobuildthephasehierarchy,wehavedevelopedanovelalgorithmthatex-tractsphaserepetitionsfromacompressedgrammarandrepresentsthemexplicitlyasaregularexpression.Thealgorithmrecursivelyconvertsnon-terminalsymbolsintoregularexpressions.Itremem-berspreviousresultssothatitconvertsthesamenon-terminalsym-bolonlyonce.Amergestepoccursforanon-terminalonceits
right-handsideisfullyconverted.Twoadjacentregularexpressionsaremergediftheyareequivalent(usingforexampletheequivalenttestdescribedbyHopcroftandUllman[16]).
SEQUITURwasusedbyLarustofindfrequentcodepaths[19]andbyChilimbitofindfrequentdata-accesssequences[5].TheirmethodsmodelthegrammarasaDAGandfindsfrequentsub-sequencesofagivenlength.Ourmethodtraversesthenon-terminalsymbolsinthesameorder,butinsteadoffindingsub-sequences,itproducesaregularexpression.
PhasemarkerinsertionThelaststepusesbinaryrewritingtoinsertmarkersintoaprogram.Thebasicphases(theleavesofthephasehierarchy)haveuniquemarkersintheprogram,sotheirpre-dictionistrivial.Topredictthecompositephases,weinsertapre-dictorintotheprogram.Basedonthephasehierarchy,thepredictormonitorstheprogramexecutionandmakespredictionsbasedontheon-linephasehistory.Sincethehierarchyisaregularexpres-sion,thepredictorusesafiniteautomatontorecognizethecurrentphaseinthephasehierarchy.Intheprogramswetestedsofar,thissimplemethodsuffices.Thecostofthemarkersandthepredictorisnegligiblebecausetheyareinvokedonceperphaseexecution,whichconsistsofonaveragemillionsofinstructionsasshownintheevaluation.
Table1:Benchmarks
Benchmark
Source
fastFouriertransformation
Spec2KFp
commonUNIXcompressionutility
Spec95Int
vectorizedmeshgeneration
Spec95Fp
anobject-orienteddatabase
AppluGccSwim
Mesh
moleculardynamicssimulation
CHAOS
3.EVALUATION
Weconductfourexperiments.Wefirstmeasurethegranularityandaccuracyofphaseprediction.Wethenuseitincacheresizingandmemoryremapping.Finally,wetestitagainstmanualphasemarking.Wecomparewithotherpredictiontechniquesinthefirsttwoexperiments.
OurtestsuiteisgiveninTable1.Wepickprogramsfromdiffer-entsetsofcommonlyusedbenchmarkstogetaninterestingmix.Theyrepresentcommoncomputationtasksinsignalprocessing,combinatorialoptimization,structuredandunstructuredmeshandN-bodysimulations,acompiler,andadatabase.FFTisaba-sicimplementationfromatextbook.ThenextsixprogramsarefromSPEC:threearefloating-pointandthreeareintegerprograms.ThreearefromSPEC95suite,onefromSPEC2K,andtwo(withsmallvariation)arefromboth.OriginallyfromtheCHAOSgroupatUniversityofMaryland,MolDynandMesharetwodynamicpro-gramswhosedataaccesspatterndependsonprograminputsandchangesduringexecution[6].Theyarecommonlystudiedindy-namicprogramoptimization[11,15,25,32].Thefloating-pointprogramsfromSPECarewritteninFortran,andtheintegerpro-gramsareinC.Ofthetwodynamicprograms,MolDynisinFor-tran,andMeshisinC.Wenotethatthechoiceofsource-levellan-guagesdoesnotmatterbecauseweanalyzeandtransformprogramsatthebinarylevel.
ForprogramsfromSPEC,weusethetestorthetraininputforphasedetectionandtherefinputforphaseprediction.Forthepre-dictionofMesh,weusedthesamemeshasthatinthetrainingrunbutwithsortededges.Forallotherprograms,thepredictionistestedonexecutionshundredstimeslongerthanthoseusedinphasedetection.
WeuseATOMtoinstrumentprogramstocollectthedataandinstructiontraceonaDigitalAlphamachine[31].AllprogramsarecompiledbytheAlphacompilerusing“-O5”flag.Afterphaseanalysis,weagainuseATOMtoinsertmarkersintoprograms.
3.1PhasePrediction
WepresentresultsforallprogramsexceptforGccandVortex,whichwediscussattheendofthissection.Wefirstmeasurethephaselengthandthenlookatthephaselocalityindetail.
Table2showstwosetsofresults.Theupperhalfshowsthe
Table2:Theaccuracyandcoverageofphaseprediction
BenchmarksAppluTomcatvMeshStrict100100100accuracy96.4192.3972.75
Accuracy(%)99.9699.9100Coverage(%)99.7099.7699.58
Average
96.47
13.49
86.1498.48
FFTCompressSwimMolDyn
14529159
Table3:Thenumberandthesizeofphasesindetectionandpredictionruns
DetectionPrediction
exe.len.avg.largestphaseleafavg.leafsize(Minst.)size(Minst.)phases(Minst.)
2.55730.4
254.33.29443775.5
0.66762418.4
175.034.952504.7
4.133334.9
5151.998.246911.1
0.20250988.1
863.6627.393317146.5
232.22712.037.031699.6
Tomcatv
Phase boundaries in sample trace
phaseboundaryCompress
Phase boundaries in sample trace
markerblock ID
Logical TimeLogical Time
25 billion inst., 5250 executions (crosses) of 7 phases
62 billion inst., 52 executions (crosses)
of 4 phases
phase IDshortest & longest exe. length in M inst.frequency (%)1225 billion inst., 2493 intervals (dots)9 of 38 BBV clusters (boxes) shown(12256KB cache miss rate (%)62 billion inst., 6242 intervals (dots)7 of 21 BBV clusters (boxes) shown256KB cache miss rate (%)108642002.972.9710812.596.73(e)cluster bounding boxfrequency (%)2.972.972.972.972.972.4526.08(f)6410.8623.54204.6336.412.96510152032KB cache miss rate (%)255101532KB cache miss rate (%)20Figure3:PredictionAccuracyforTomcatvandCompress.Part(a)and(b)showthephaseboundariesfoundbyoff-linephasedetection.Part(c)and(d)showthelocalityofthephasesfoundbyrun-timeprediction.Asacomparison,Part(e)and(f)showthelocalityoftenmillion-instructionintervalsandBBV(basic-blockvector)clusters.
16 1L14 4r)ew%12( o)10Py anwPhase 1o28 ePhase 2tBaK6R2 3s(4siM20051015202530Phase Occurance NumberFigure4:ThemissratesofCompressphasesonIBMPower4
variation.Infact,inthemajorityofcasesintheseprograms,BBVproducestightclusters.However,eveninbestcases,BBVclustersdonohaveperfectlystackedpointsaslocalityphasesdo.
Table4showstheinitialandnormalizedstandarddeviation.Thelocalityisan8-elementvectorthatcontainsthemissrateforcachesizesfrom32KBto256KBin32KBincrements.Thestandardde-viationiscalculatedforallexecutionsofthesamephaseandtheintervalsofeachBBVcluster.Thenthestandarddeviationofallphasesorclustersareaveraged(weightedbythephaseorclustersize)toproducethenumberfortheprogram.ThenumbersofBBVclusteringandprediction,shownbythelasttwocolumns,aresim-ilarlysmallasreportedbySherwoodetal.forIPC[30].Still,thenumbersforlocalityphasesaremuchsmaller—onetofiveordersofmagnitudesmallerthanthatofBBV-basedprediction.
Table4:StandarddeviationoflocalityphasesandBBVphases
standarddeviations
localityphaseBBVRLEMarkov
clustering
6.87E-80.00615.06E-70.000133.14E-60.000614.53E-70.00162.66E-80.000186.00E-60.000637.60E-50.00067
Sofarwemeasurethecachemissratethroughsimulation,whichdoesnotincludeallfactorsonrealmachinessuchasthatoftheoperatingsystem.WenowexaminetheL1missrateonanIBMPower4processorforthefirsttwophasesofCompress(theothertwophasesaretooinfrequenttobeinteresting).Figure4showsthemeasuredmissrateforeachexecutionofthetwophases.AllbutthefirstexecutionofPhase1havenearlyidenticalmissratesonthe32KB2-waydatacache.TheexecutionsofPhase2showmorevariation.TheeffectfromtheenvironmentismorevisibleinPhase2likelybecauseitsexecutionsareshorterandthemissratelowerthanthoseofthefirstphase.
Thecomparisonwithinterval-basedmethodsispartialbecauseweuseonlyprogramsthatareamenabletolocality-phasepredic-tion.Manydynamicprogramsdonothaveconsistentlocality.Fortheminterval-basedmethodscanstillexploitrun-timepatterns,whileourcurrentphasepredictionschemewouldnotworkbecauseitas-
Spec95/Gcc, compiling cccp.i, sampled 331 times1.0E5ec8.0E4natsid 6.0E4esuer d4.0E4elpmas2.0E40.0E00.0E02.0E74.0E76.0E78.0E71.0E8logical time of data accessSpec95/Vortex, test run, sampled 272 times8.0E5ecna6.0E5tsid esuer4.0E5 delpmas2.0E50.0E00.0E05.0E81.0E9 1.5E92.0E9 2.5E9logical time of data accessFigure5:SampledreusedistancetraceofGccandVortex.The
exactphaselengthisunpredictableingeneral.
sumesthateachphase,onceinexecution,maintainsidenticallo-cality.Nextaretwosuchexamples.
3.1.2GccandVortex
TheprogramsGccandVortexaredifferentbecausetheirphaselengthisnotconsistenteveninthesameexecution.InGcc,thephaselengthisdeterminedbythefunctionbeingcompiled.Fig-ure5showsthedistance-basedsampletrace.Unlikeprevioustracegraphs,ituseshorizontalstepstolinksamplepoints.Thepeaksintheuppergraphroughlycorrespondtothe100functionsinthe6383-lineinputfile.Thesizeandlocationofthepeaksaredeter-minedbytheinputandarenotconstant.
Vortexisanobject-orienteddatabase.Thetestrunfirstconstructsadatabaseandthenperformsasetofqueries.ThelowerfigureofFigure5showsthesampletrace.Itshowsthetransitionfromdatainsertiontoqueryprocessing.However,inotherinputs,thecon-structionandqueriesmaycomeinanyorder.Theexactbehavior,likeGcc,isinputdependentandnotconstant.
OurrecentresultsshowthatanextensionofthephaseanalysiscanmarkthephasesinGcc,whicharethecompilationofinputfunctions.Still,thepredictionresultsareinputdependent.DingandZhongshowedthattheoverallreusepatternofthesetwopro-gramsarestableacrossmanyinputs[12].Itsuggestsapredictionstrategybasedonstatistics.Wedonotconsiderthisextensioninthispaperandwillnotdiscussthesetwoprogramsfurther.
3.2AdaptiveCacheResizing
Duringanexecution,cacheresizingreducesthephysicalcachesizewithoutincreasingthemissrate[2,21].Therefore,itcanre-ducetheaccesstimeandenergyconsumptionofthecachewithoutlosingperformance.Weuseasimplifiedmodelwherethecacheconsistsof64-byteblocksand512sets.Itcanchangefromdirectmappedto8-waysetassociative,sothecachesizecanchangebe-
tween32KBand256KBin32KBunits.Intheadaptation,weneedtopredictthesmallestcachesizethatyieldsthesamemissrateasthe256KBcache.
AsseenintheexampleofTomcatv,programdatabehaviorchangesconstantly.Alocalityphaseisaunitofrepeatingbehaviorratherthanaunitofuniformbehavior.Tocapturethechangingbe-haviorinsidealargephase,wedivideitinto10Kintervals(calledphaseintervals).Theadaptationfindsthebestcachesizeforeachintervalduringthefirstfewexecutionsandreusesthemforlaterruns.Theschemeneedshardwaresupportbutneedsnomorethanthatofinterval-basedcacheresizing.
Interval-basedcacheresizingdividesanexecutionintofixed-lengthwindows.Basedonthehistory,intervalpredictionclassi-fiespastintervalsintoclassesandpredictsthebehaviorclassofthenextintervalusingmethodssuchaslast-valueandMarkovmod-els.Forcacheresizing,Balasubramonianetal.usedthecachemissrateandbranchpredictionratetoclassifypastintervals[2].Recentstudiesconsideredcodeinformationsuchascodeworkingset[9]andbasic-blockvector(BBV)[30].Wetestinterval-basedpre-dictionusingfivedifferentintervallengths:10K,1M,10M,40M,and100Mmemoryaccesses.Inaddition,wetestaBBVpredictorusing10Minstructionwindows,followingtheimplementationofSherwoodetal[30].
Theintervalmethodsdonothavephasemarkers,sotheycon-stantlymonitoreverypastintervaltodecidewhetheraphasechangehasoccurred.Intheexperiment,weassumeperfectdetection:thereisaphasechangeifthebestcachesizeofthenextintervaldiffersfromthecurrentone.BBVmethodusesarun-lengthencodingMarkovpredictortogivetheBBVclusterofthenextinterval(thebestpredictorreportedin[30]).However,asthelastsectionshows,theintervalsofaBBVclusterdonotalwayshaveidenticallocal-ity.WeuseperfectdetectionforBBVaswedoforotherintervalmethods.
Ataphasechange,thephasemethodreusesthebestcachesizestoredforthesamephase.TheBBVmethodreusesthecurrentbestcachesizeforeachBBVcluster.Forrun-timeexploration,wecounttheminimalcost—eachexplorationtakesexactlytwotrialruns,oneatthefullcachesizeandoneatthehalfcachesize.Thenweusethebestcachesizefromthethirdinterval.Intheexper-iment,weknowthebestcachesizeofeachphaseorintervalbyrunningitthroughCheetah,acachesimulatorthatmeasuresthemissrateofalleightcachesizesatthesametime[33].There-sultsforintervalandBBVmethodsareidealisticbecausetheyuseperfectphase-changedetection.Theresultofthephase-intervalmethodisreal.Becauseitknowstheexactbehaviorrepetition,thephase-intervalmethodcanamortizetheexplorationcostovermanyexecutions.Withtherighthardwaresupport,itcangaugetheexactlosscomparedtothefullsizecacheandguaranteeaboundontheabsoluteperformanceloss.
Figure6showstheaveragecachesizefromphase,interval,andBBVmethods.Thefirstgraphshowstheresultsofadaptationwithnomiss-rateincrease.Theresultsarenormalizedtothephasemethod.Thelargestcachesize,256KB,isshownasthelastbarineachgroup.Differentintervalsfinddifferentcachesizes,butallreductionsarelessthan10%.Theaverageis6%.BBVgivesconsistentlygoodreductionwithasingleintervalsize.Theim-provementisatmost15%andonaverage10%.Incontrast,thephaseadaptationreducesthecachesizeby50%formostprogramsandover35%onaverage.
Figure6(b)showstheresultsofadaptationwitha5%boundonthemiss-rateincrease.Theeffectofintervalmethodsvariesgreatly.The10Mintervalwas20%betterthanthelocalityphaseforFFTbutafactorofthreeworseforTomcatvandSwim.The100Min-
2.5e2ziS PhaseehcIntvl-10kaC1.5Intvl-1M gvIntvl-10MA dIntvl-40Mez1Intvl-100MilamBBVrolargest-sizeN0.50MoldynatvTomcSwimFFTprssMeshppluAgeComeAveraBenchmarks4.54ez3.5iS Phaseeh3cIntvl-10kaCIntvl-1M g2.5vIntvl-10MA d2Intvl-40MezIntvl-100Milam1.5BBVrolargest-sizeN10.50vsMoldynmcatSwimFFTToresMeshppluAgeCompveraABenchmarksFigure6:Averagecache-sizereductionbylocalityphase,in-terval,andBBVpredictionmethods,assumingperfectphase-changedetectionandminimal-explorationcostforintervalandBBVmethods.Uppergraph:noincreaseincachemisses.Lowergraph:atmost5%increase.
tervalhasthebestaveragereductionofnearly50%.BBVagainshowsconsistentlygoodreductionwithasingleintervalsize.Onaverageitisslightlybetterthanthebestintervalmethod.ThephasemethodreducesthecachesizemorethanothermethodsdoforallprogramsexceptforFFT.FFThasvariedbehavior,whichcausesthelowcoverageandconsequentlynotaslargecache-sizereduc-tionbylocalityphaseprediction.MolDyndoesnothaveidenticallocality,sophase-basedresizingcausesa0.6%increaseinthenum-berofcachemisses.Acrossallprograms,theaveragereductionusinglocalityphasesisover60%.
Earlierstudiesusedmoreaccuratemodelsofcacheandmea-suredtheeffectontimeandenergythroughcycle-accuratesimula-tion.Sincesimulatingthefullexecutiontakesalongtime,paststudieseitherusedapartialtraceorreducedtheprograminputsize[2,21].Wechoosetomeasurethemissrateoffullexecu-tions.Whileitdoesnotgivethetimeorenergy,themissrateisaccurateandreproduciblebyotherswithoutsignificanteffortsincalibrationofsimulationparameters.
3.3Phase-BasedMemoryRemapping
Weuselocalityphasesinrun-timememoryremapping.Tosup-portdataremappingatthephaseboundary,weassumethesupportoftheImpulsememorycontroller,developedbyCarterandhiscol-leaguesatUniversityofUtah[34,35].ImpulsereorganizesdatawithoutactuallycopyingthemtoCPUorinmemory.Forexam-ple,itmaycreateacolumn-majorversionofarow-majorarrayviaremapping.AkeyrequirementforexploitingImpulseistoidentifythetimewhenremappingisprofitable.
Weconsideraffinity-basedarrayremapping,wherearraysthattendtobeaccessedconcurrentlyareinterleavedbyremapping[36].Todemonstratethevalueoflocalityphaseprediction,weevaluatetheperformancebenefitsofredoingtheremappingforeachphaseratherthanonceforthewholeprogramduringcompilation.Weapplyaffinityanalysisforeachphaseandinsertremappingcodeatthelocationofthephasemarker.Thefollowingtableshowstheexecutiontimeinsecondson2GHzIntelPentiumIVmachinewiththegcccompilerusing-O3.
Forthetwoprograms,weobtainspeedupsof35.5%and2.8%comparedtotheoriginalprogramand13%and2.5%comparedtothebeststaticdatalayout[36],asshowninTable5.IntheabsenceofanImpulseimplementation,weprogramtheremappingandcal-culatetherunningtimeexcludingtheremappingcost.Table7.3ofZhang’sdissertationshowstheoverheadofsettingupremappingsforawiderangeofprograms.Theoverheadincludessettingupshadowregion,creatingmemorycontrollerpagetable,dataflush-ing,andpossibledatamovement.Thelargestoverheadshownis1.6%ofexecutiontimeforstaticindexvectorremapping[34].Forexampleforthe14majorarraysinSwim,whole-programanalysisshowscloseaffinitybetweenarrayuandv,uoldandpold,andunewandpnew.Phase-basedanalysisshowsaffinitygroup{u,v,p}forthefirstphase,{u,v,p,unew,vnew,pnew}forthesecondphase,andthreeothergroups,{u,uold,unew},{v,vold,vnew},and{p,pold,pnew},forthethirdphase.Comparedtowhole-programreorganization,thephase-basedoptimizationreducescachemissesbyonethird(duetoarrayp)forthefirstphase,bytwothirdsforthesecondphase,andbyhalfforthethirdphase.
Usingthetwoexampleprograms,wehaveshownthatphasepre-dictionfindsopportunitiesofdynamicdataremapping.Theaddi-tionalissuesofaffinityanalysisandcodetransformationaredis-cussedbyZhongetal[36].TheexactinteractionwithImpulseliketoolsisasubjectoffuturestudy.
3.4ComparisonwithManualPhaseMarking
Wehand-analyzedeachprogramandinsertedphasemarkers(man-ualmarkers)basedonourreadingofthecodeanditsdocumenta-tionaswellasresultsfromgprof(tofindimportantfunctions).We
comparemanualmarkingwithautomaticmarkingasfollows.Asaprogramruns,allmarkersoutputthelogicaltime(thenumberofmemoryaccessesfromthebeginning).Giventhesetoflogi-caltimesfrommanualmarkersandthesetfromauto-markers,wemeasuretheoverlapbetweenthetwosets.Twologicaltimesareconsideredthesameiftheydifferbynomorethan400,whichis0.02%oftheaveragephaselength.Weusetherecallandprecisiontomeasuretheircloseness.Theyaredefinedbytheformulasbelow.Therecallshowsthepercentageofthemanuallymarkedtimesthataremarkedbyauto-markers.Theprecisionshowsthepercentageoftheautomaticallymarkedtimesthataremarkedmanually.
Recall
=
|M∩A|
|A|
(2)
Table5:Theeffectofphase-basedarrayregrouping,excludingthecostofrun-timedatareorganizationBenchmarkPhase(speedup)
4.294.27(0.4%)52.8438.52(27.1%)
Table6:TheoverlapwithmanualphasemarkersBenchmarkDetectionPrediction
Prec.Recall11
Applu0.9930.948
0.9620.987
Tomcatv0.9520.571
0.3411
Mesh10.834
0.2710.987
Average0.9640.692
whereMisthesetoftimesfromthemanualmarkers,andAisthe
setoftimesfromauto-markers.
Table6showsacomparisonwithmanuallyinsertedmarkersfordetectionandpredictionruns.Thecolumnsforeachrungivetherecallandtheprecision.Therecallisover95%inallcasesex-ceptforMolDyninthedetectionrun.Theaveragerecallincreasesfrom96%inthedetectionrunto99%inthepredictionrunbe-causethephaseswithabetterrecalloccurmoreofteninlongerruns.Hence,theauto-markerscapturetheprogrammer’sunder-standingoftheprogrambecausetheycatchnearlyallmanuallymarkedphasechangingpoints.
Theprecisionisover95%forAppluandCompress,showingthatautomaticmarkersareeffectivelythesameasthemanualmarkers.MolDynhasthelowestrecallof27%.Wecheckedthecodeandfoundthedifference.Whentheprogramisconstructingtheneigh-borlist,theanalysismarkstheneighborsearchforeachparticleasaphasewhiletheprogrammermarksthesearchesforallparti-clesasaphase.Inthiscase,theanalysisiscorrect.Theneighborsearchrepeatsforeachparticle.ThisalsoexplainswhyMoldyncannotbepredictedwithbothhighaccuracyandhighcoverage—theneighborsearchhasvaryingbehaviorsinceaparticlemayhaveadifferentnumberofneighbors.Thelowrecallinotherprogramshasthesamereason:theautomaticanalysisismorethoroughthanthemanualanalysis.
Fourofthetestprogramsarethesimulationofgrid,meshandN-bodysystemsintimesteps.DingandKennedyshowedthattheybenefitedfromdynamicdatapacking,whichmonitoredtherun-timeaccesspatternandreorganizedthedatalayoutmultipletimesduringanexecution[11].Theirtechniquewasautomaticexceptforaprogrammer-inserteddirective,whichmustbeexecutedonceineachtimestep.Thisworkwasstartedinparttoautomaticallyinsertthedirective.Ithasachievedthisgoal:thelargestcompositephaseinthesefourprogramsisthetimesteploop.Therefore,thephasepredictionshouldhelptofullyautomatedynamicdatapack-ing,whichisshownbyseveralrecentstudiestoimproveperfor-mancebyintegerfactorsforphysical,engineering,andbiologicalsimulationandsparsematrixsolvers[11,15,25,32].
SummaryForprogramswithconsistentphasebehavior,thenewmethodgivesaccuratelocalitypredictionandconsequentlyyieldssignificantbenefitsforcacheresizingandmemoryremapping.Itismoreeffectiveatfindinglong,recurringphasesthanpreviousmeth-odsbasedonprogramcode,executionintervals,theircombination,andevenmanualanalysis.Forprogramswithvaryingphasebehav-ior,theprofilingstepcanoftenrevealtheinconsistency.Thenthemethodavoidsbehaviorpredictionofinconsistentphasesthroughaflag(asshownbytheexperimentsreportedinTable2).Usingasmallinputinaprofilingrunisenoughforlocalityphasepredic-tion.Therefore,thetechniquecanhandlelargeprogramsandlongexecutions.ForprogramssuchasGCCandVortex,wherelittlecon-
sistencyexistsduringthesameexecution,thelocalityanalysiscanstillrecognizephaseboundariesbutcannotyetmakepredictions.Predictionsbasedonstatisticsmaybehelpfulfortheseprograms,whichremainstobeourfuturework.Inaddition,thecurrentanaly-sisconsidersonlytemporallocality.Thefutureworkwillconsiderspatiallocalityinconjunctionwithtemporallocality.
4.RELATEDWORK
Thisworkisauniquecombinationofprogramcodeanddataanalysis.Itbuildsonpastworkinthesetwoareasandcomplementsinterval-basedmethods.
LocalityphasesEarlyphaseanalysis,owingtoitsrootinvirtual-memorymanagement,wasintertwinedwithlocalityanalysis.In1976,BatsonandMadisondefinedaphaseasaperiodofexecu-tionaccessingasubsetofprogramdata[4].Theyshowedexperi-mentallythatasetofAlgol-60programsspent90%timeinmajorphases.However,theydidnotpredictlocalityphases.LaterstudiesusedtimeorreusedistanceaswellaspredictorssuchasMarkovmodelstoimprovevirtualmemorymanagement.Recently,DingandZhongfoundpredictablepatternsintheoveralllocalitybutdidnotconsiderthephasebehavior[12].Wearenotawareofanytrace-basedtechniquethatidentifiesstaticphasesusinglocalityanalysis.ProgramphasesAllenandCockepioneeredintervalanalysistoconvertprogramcontrolflowintoahierarchyofregions[1].Forscientificprograms,mostcomputationanddataaccessareinloopnests.Anumberofstudiesshowedthattheinter-proceduralarray-sectionanalysisaccuratelysummarizestheprogramdatabehavior.RecentworkbyHsuandKremerusedprogramregionstocontrolprocessorvoltagestosaveenergy.Theirregionmayspanloopsandfunctionsandisguaranteedtobeanatomicunitofexecutionunderallprograminputs[17].Forgeneralpurposeprograms,Balasub-ramonianetal.[2],Huangetal.[21],andMagklisetal.[22]se-lectedasphasesprocedureswhosenumberofinstructionsexceedsathresholdinaprofilingrun.Thethreestudiesfoundthebestvoltageforprogramregionsonatraininginputandthentestedtheprogramonanotherinput.Theyobservedthatdifferentinputsdidnotaffectthevoltagesetting.Thefirsttwostudiesalsomeasuredtheenergysavingofphase-basedcacheresizing[2,21].Incompar-ison,thenewtechniquedoesnotrelyonstaticprogramstructure.Itusestrace-basedlocalityanalysistofindthephaseboundaries,whichmayoccuranywhereandnotjustatregion,looporproce-dureboundaries.
IntervalphasesIntervalmethodsdivideanexecutionintofixed-sizewindows,classifypastintervalsusingmachineorcode-basedmetrics,andpredictfutureintervalsusinglastvalue,Markov,ortable-drivenpredictors[9,10,13,30].Thepastworkusedintervalsoflengthfrom100thousand[2]to10millioninstructions[30]andexecutionsfrom10millisecondsto10seconds[13].Intervalpre-dictionworkswelliftheintervallengthdoesnotmatter,forexam-ple,whenanexecutionconsistsoflongsteadyphases.Otherwiseitisdifficulttofindthebestintervallengthforagivenprogramonagiveninput.Theexperimentaldatainthispapershowtheinher-entlimitationofintervalsforprogramswithconstantlychangingdatabehavior.Balasubramonianetal.searchesforthebestintervalsizeatruntime[3].Theirmethoddoublestheintervallengthun-tilthebehaviorisstable.LetNbetheexecutionlength,thisnewschemesearchesO(logN)choicesinthespaceofNcandidates.Inthiswork,welocatephasesanddeterminetheirexactlengthsthroughoff-linelocalityanalysis.Weshowthatimportantclassesofprogramshaveconsistentphasebehaviorandthehighaccuracyandlargegranularityofphasepredictionallowadaptationwithatightworst-performanceguarantee.However,notallprogramsareamenabletotheoff-lineanalysis.Interval-basedmethodsdonot
havethislimitationandcanexploitthegeneralclassofrun-timepatterns.
5.CONCLUSIONS
Thepaperpresentsageneralmethodforpredictinghierarchicalmemoryphasesinprogramswithinput-dependentbutconsistentphasebehavior.Basedonprofilingruns,itpredictsprogramexe-cutionshundredsoftimeslargerandpredictsthelengthandlocal-itywithnearperfectaccuracy.Whenusedforcacheadaptation,itreducesthecachesizeby40%withoutincreasingthenumberofcachemisses.Whenusedformemoryremapping,itimprovesprogramperformancebyupto35%.Itismoreeffectiveatidenti-fyinglong,recurringphasesthanpreviousmethodsbasedonpro-gramcode,executionintervals,andmanualanalysis.Itrecognizesprogramswithinconsistentphasebehaviorandavoidsfalsepredic-tions.Theseresultssuggestthatlocalityphasepredictionshouldbenefitmodernadaptationtechniquesforincreasingperformance,reducingenergy,andotherimprovementstothecomputersystemdesign.
Scientificallyspeaking,thisworkisanotherattempttounder-standthedichotomybetweenprogramcodeanddataaccessandtobridgethedivisionbetweenoff-lineanalysisandon-lineprediction.Theresultembodiesandextendsthedecades-oldideathatlocalitycouldbepartofthemissinglink.
Acknowledgement
Thisworkwasmotivatedinpartbytheearlierworkonphaseanaly-sisbyourcolleaguesatRochester.ThepresentationwasimprovedbydiscussionswithSandhyaDwarkadas,MichaelHuang,MichaelScott,DaveAlbonesi,andKaiShenatRochester,JohnCarteratUtah,UliKremeratRutgers,BradCalderatUCSD,andtheanony-mousreviewersfromthisandanotherconferencecommittee.LixinZhangatIBMandTimSherwoodatUCSBansweredourquestionsaboutImpulseandBBV.ThefinancialsupportcomesinpartfromtheDepartmentofEnergy(ContractNo.DE-FG02-02ER25525)andtheNationalScienceFoundation(ContractNo.CCR-0238176,CCR-0219848,andEIA-0080124).
6.[1]F.REFERENCES
AllenandJ.Cocke.Aproramdataflowanalysis
procedure.CommunicationsoftheACM,19:137–147,1976.[2]R.Balasubramonian,D.Albonesi,A.Buyuktosunoglu,and
S.Dwarkadas.Memoryhierarchyreconfigurationforenergyandperformanceingeneral-purposeprocessorarchitectures.InProceedingsofthe33rdInternationalSymposiumonMicroarchitecture,Monterey,California,December2000.[3]R.Balasubramonian,S.Dwarkadas,andD.H.Albonesi.
Dynamicallymanagingthecommunication-parallelismtrade-offinfutureclusteredprocessors.InProceedingsofInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.
[4]A.P.BatsonandA.W.Madison.Measurementsofmajor
localityphasesinsymbolicreferencestrings.InProceedingsoftheACMSIGMETRICSConferenceonMeasurement&ModelingComputerSystems,Cambridge,MA,March1976.[5]T.M.Chilimbi.Efficientrepresentationsandabstractionsfor
quantifyingandexploitingdatareferencelocality.In
ProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,Snowbird,Utah,June2001.
[6]R.Das,M.Uysal,J.Saltz,andY.-S.Hwang.Communication
optimizationsforirregularscientificcomputationson
distributedmemoryarchitectures.JournalofParallelandDistributedComputing,22(3):462–479,September1994.[7]I.Daubechies.TenLecturesonWavelets.CapitalCityPress,Montpelier,Vermont,1992.
[8]
P.J.Denning.Workingsetspastandpresent.IEEE
TransactionsonSoftwareEngineering,SE-6(1),January1980.
[9]
A.S.DhodapkarandJ.E.Smith.Managing
multi-configurationhardwareviadynamicworking-setanalysis.InProceedingsofInternationalSymposiumonComputerArchitecture,Anchorage,Alaska,June2002.
[10]
A.S.DhodapkarandJ.E.Smith.Comparingprogramphasedetectiontechniques.InProceedingsofInternationalSymposiumonMicroarchitecture,December2003.
[11]
C.DingandK.Kennedy.Improvingcacheperformanceindynamicapplicationsthroughdataandcomputation
reorganizationatruntime.InProceedingsoftheSIGPLAN’99ConferenceonProgrammingLanguageDesignandImplementation,Atlanta,GA,May1999.
[12]
C.DingandY.Zhong.Predictingwhole-programlocalitywithreusedistanceanalysis.InProceedingsofACM
SIGPLANConferenceonProgrammingLanguageDesignandImplementation,SanDiego,CA,June2003.[13]
E.Duesterwald,C.Cascaval,andS.Dwarkadas.
Characterizingandpredictingprogrambehavioranditsvariability.InProceedingsofInternationalConferenceonParallelArchitecturesandCompilationTechniques,NewOrleans,Louisiana,September2003.[14]
C.Fang,S.Carr,S.Onder,andZ.Wang.
Reuse-distance-basedmiss-ratepredictiononaper
instructionbasis.InProceedingsofthefirstACMSIGPLANWorkshoponMemorySystemPerformance,WashingtonDC,June2004.
[15]
H.HanandC.W.Tseng.Improvinglocalityforadaptiveirregularscientificcodes.InProceedingsofWorkshoponLanguagesandCompilersforHigh-PerformanceComputing(LCPC’00),WhitePlains,NY,August2000.
[16]J.E.HopcroftandJ.D.Ullman.Introductiontoautomatatheory,languages,andcomputation.Addison-Wesley,1979.[17]
C.-H.HsuandU.Kermer.Thedesign,implementationandevaluationofacompileralgorithmforCPUenergy
reduction.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,SanDiego,CA,June2003.
[18]
R.Joseph,Z.Hu,andM.Martonosi.Waveletanalysisformicroprocessordesign:Experienceswithwavelet-baseddi/dtcharacterization.InProceedingsofInternationalSymposiumonHighPerformanceComputerArchitecture,February2004.
[19]
J.R.Larus.Wholeprogrampaths.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,Atlanta,Georgia,May1999.
[20]
C.LukandT.C.Mowry.Memoryforwarding:enabling
aggressivelayoutoptimizationsbyguaranteeingthesafetyofdatarelocation.InProceedingsofInternationalSymposiumonComputerArchitecture,Atlanta,GA,May1999.[21]
M.HuangandJ.RenauandJ.Torrellas.Positional
adaptationofprocessors:applicationtoenergyreduction.InProceedingsoftheInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.
[22]
G.Magklis,M.L.Scott,G.Semeraro,D.H.Albonesi,,and
S.Dropsho.Profile-baseddynamicvoltageandfrequencyscalingforamultipleclockdomainmicroprocessor.InProceedingsoftheInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.
[23]
G.MarinandJ.Mellor-Crummey.Crossarchitectureperformancepredictionsforscientificapplicationsusingparameterizedmodels.InProceedingsofJointInternationalConferenceonMeasurementandModelingofComputerSystems,NewYorkCity,NY,June2004.
[24]
R.L.Mattson,J.Gecsei,D.Slutz,andI.L.Traiger.
Evaluationtechniquesforstoragehierarchies.IBMSystemJournal,9(2):78–117,1970.
[25]
J.Mellor-Crummey,D.Whalley,andK.Kennedy.Improvingmemoryhierarchyperformanceforirregularapplications.InternationalJournalofParallelProgramming,29(3),June2001.
[26]
C.G.Nevill-ManningandI.H.Witten.Identifying
hierarchicalstructureinsequences:alinear-timealgorithm.JournalofArtificialIntelligenceResearch,7:67–82,1997.[27]
V.S.Pingali,S.A.McKee,W.C.Hsieh,andJ.B.Carter.Restructuringcomputationsfortemporaldatacachelocality.InternationalJournalofParallelProgramming,31(4),August2003.
[28]
X.Shen,Y.Zhong,andC.Ding.Regression-basedmulti-modelpredictionofdatareusesignature.In
Proceedingsofthe4thAnnualSymposiumoftheLasAlamosComputerScienceInstitute,SanteFe,NewMexico,November2003.
[29]
T.Sherwood,E.Perelman,andB.Calder.Basicblock
distributionanalysistofindperiodicbehaviorandsimulationpointsinapplications.InProceedingsofInternationalConferenceonParallelArchitecturesandCompilationTechniques,Barcelona,Spain,September2001.
[30]
T.Sherwood,S.Sair,andB.Calder.Phasetrackingandprediction.InProceedingsofInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.
[31]
A.SrivastavaandA.Eustace.ATOM:Asystemforbuildingcustomizedprogramanalysistools.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,Orlando,Florida,June1994.[32]
M.M.Strout,L.Carter,andJ.Ferrante.Compile-timecompositionofrun-timedataanditerationreorderings.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,SanDiego,CA,June2003.
[33]
R.A.SugumarandS.G.Abraham.Efficientsimulationofcachesunderoptimalreplacementwithapplicationstomisscharacterization.InProceedingsoftheACMSIGMETRICSConferenceonMeasurement&ModelingComputerSystems,SantaClara,CA,May1993.
[34]
L.Zhang.EfficientRemappingMechanismforanAdaptiveMemorySystem.PhDthesis,DepartmentofComputerScience,UniversityofUtah,August2000.
[35]
L.Zhang,Z.Fang,M.Parker,B.K.Mathew,L.Schaelicke,J.B.Carter,W.C.Hsieh,andS.A.McKee.TheImpulsememorycontroller.IEEETransactionsonComputers,50(11),November2001.
[36]
Y.Zhong,M.Orlovich,X.Shen,andC.Ding.Arrayregroupingandstructuresplittingusingwhole-programreferenceaffinity.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,June2004.