您的当前位置：首页 compilers

compilers

来源：乌哈旅游

LocalityPhasePrediction

XipengShen

YutaoZhong

ChenDing

ComputerScienceDepartment,UniversityofRochester

{xshen,ytzhong,cding}@cs.rochester.edu

ABSTRACT

Ascomputermemoryhierarchybecomesadaptive,itsperformanceincreasinglydependsonforecastingthedynamicprogramlocality.Thispaperpresentsamethodthatpredictsthelocalityphasesofaprogrambyacombinationoflocalityproﬁlingandrun-timepre-diction.Byproﬁlingatraininginput,itidentiﬁeslocalityphasesbysiftingthroughallaccessestoalldataelementsusingvariable-distancesampling,waveletﬁltering,andoptimalphasepartition-ing.Itthenconstructsaphasehierarchythroughgrammarcom-pression.Finally,itinsertsphasemarkersintotheprogramusingbinaryrewriting.Whentheinstrumentedprogramruns,itusestheﬁrstfewexecutionsofaphasetopredictallitslaterexecutions.Comparedwithexistingmethodsbasedonprogramcodeandex-ecutionintervals,localityphasepredictionisuniquebecauseituseslocalityproﬁles,anditmarksphaseboundariesinprogramcode.Thesecondhalfofthepaperpresentsacomprehensiveevaluation.Itmeasurestheaccuracyandthecoverageofthenewtechniqueandcomparesitwithbestknownrun-timemethods.Itmeasuresitsben-eﬁtinadaptivecacheresizingandmemoryremapping.Finally,itcomparestheautomaticanalysiswithmanualphasemarking.Theresultsshowthatlocalityphasepredictioniswellsuitedforidenti-fyinglarge,recurringphasesincomplexprograms.

CategoriesandSubjectDescriptors

D.3.4[ProgrammingLanguages]:Processors—optimization,

compilers

GeneralTerms

Measurement,Performance,Algorithms

Keywords

programphaseanalysisandprediction,phasehierarchy,localityanalysisandoptimization,reconﬁgurablearchitecture,dynamicop-timization

Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforproﬁtorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationontheﬁrstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspeciﬁcpermissionand/orafee.

1.INTRODUCTION

Memoryadaptationisincreasinglyimportantasthememoryhi-erarchybecomesdeeperandmoreadaptive,andprogramsexhibitdynamiclocality.Toadapt,aprogrammayreorganizeitsdatalay-outmultipletimesduringanexecution.Severalstudieshaveexam-ineddynamicdatareorganizationattheprogramlevel[11,15,25,27,32]andatthehardwarelevel[20,35].Theyshowedimpressiveimprovementsincachelocalityandprefetchingefﬁciency.Unfor-tunately,thesetechniquesarenotyetwidelyusedpartlybecausetheyneedmanualanalysistoﬁndprogramphasesthatbeneﬁtfrommemoryadaptation.Inthispaper,weshowthatthisproblemcanbeaddressedbylocality-basedphaseprediction.

FollowingearlystudiesinvirtualmemorymanagementbyBat-sonandMadison[4]andbyDenning[8],wedeﬁnealocalityphaseasaperiodofaprogramexecutionthathasstableorslowchang-ingdatalocalityinsidethephasebutdisruptivetransitionperiodsbetweenphases.Foroptimizationpurpose,weareinterestedinphasesthatarerepeatedlyexecutedwithsimilarlocality.Whiledatalocalityisnoteasytodeﬁne,weuseaprecisemeasureinthispaper.Foranexecutionofaphase,wemeasurethelocalitybyitsmissrateacrossallcachesizesanditsnumberofdynamicinstruc-tions.Atruntime,phasepredictionmeansknowingaphaseanditslocalitywhenevertheexecutionentersthephase.Accuratepre-dictionisnecessarytoenablelarge-scalememorychangeswhileavoidinganyadverseeffects.

Manyprogramshaverecurringlocalityphases.Forexample,asimulationprogrammaytesttheagingofanairplanemodel.Thecomputationsweepsthroughthemeshstructureoftheairplanere-peatedlyinmanytimesteps.Thecachebehaviorofeachtimestepshouldbesimilarbecausethemajorityofthedataaccessisthesamedespitelocalvariationsincontrolﬂow.Givenadifferentinput,forexampleanotherairplanemodelorsomesubparts,thelocalityofthenewsimulationmaychangeradicallybutitwillbeconsistentwithinthesameexecution.Similarphasebehaviorarecommoninstructural,mechanical,molecular,andotherscientiﬁcandcommer-cialsimulations.Theseprogramshavegreatdemandforcomput-ingresources.Becauseoftheirdynamicbutstablephases,theyaregoodcandidatesforadaptation,ifwecanpredictlocalityphases.Wedescribeanewpredictionmethodthatoperatesinthreesteps.Theﬁrstanalyzesthedatalocalityinproﬁlingruns.Byexamin-ingthedistanceofdatareusesinvaryinglengths,theanalysiscan“zoomin”and“zoomout”overlongexecutiontracesanddetectslocalityphasesusingvariable-distancesampling,waveletﬁltering,andoptimalphasepartitioning.Thesecondstepthenanalyzestheinstructiontraceandidentiﬁesthephaseboundariesinthecode.Thethirdstepusesgrammarcompressiontoidentifyphasehier-archiesandtheninsertsprogrammarkersthroughbinaryrewrit-ing.Duringexecution,theprogramusestheﬁrstfewinstancesofa

phasetopredictallitslaterexecutions.Thenewanalysisconsidersbothprogramcodeanddataaccess.Itinsertsstaticmarkersintotheprogrambinarywithoutaccessingthesourcecode.

Phasepredictionhasbecomeafocusofmuchrecentresearch.Mosttechniquescanbedividedintotwocategories.Theﬁrstisintervalbased.Itdividesaprogramexecutionintoﬁxed-lengthin-tervalsandpredictsthebehavioroffutureintervalsfrompastob-servations.Interval-basedpredictioncanbeimplementedentirelyandefﬁcientlyatruntime[2,3,9,10,13,30].Ithandlesarbi-trarilycomplexprogramsanddetectsdynamicallychangingpat-terns.However,run-timesystemscannotafforddetaileddataanal-ysismuchbeyondcountingthecachemisses.Inaddition,itisun-clearhowtopicktheintervallengthfordifferentprogramsandfordifferentinputsofthesameprogram.Thesecondcategoryiscodebased.Itmarksasubsetofloopsandfunctionsasphasesandes-timatestheirbehaviorthroughproﬁling[17,21,22].Pro-activeratherthanreactive,itusesphasemarkerstocontrolthehardwareandreducetheneedforrun-timemonitoring.However,thepro-gramstructuremaynotrevealitslocalitypattern.Aphasemayhavemanyproceduresandloops.Thesameprocedureorloopmaybelongtodifferentlocalityphaseswhenaccessingdifferentdataatdifferentinvocations.Forexample,asimulationstepinaprogrammayspanthousandsoflinesofcodewithintertwinedfunctioncallsandindirectdataaccess.

Incomparison,thenewtechniquecombineslocalityanalysisandphasemarking.Theformeravoidstheuseofﬁxed-sizewindowsinanalysisorprediction.Thelatterenablespro-activephaseadap-tation.Inaddition,thephasemarkingconsidersallinstructionsintheprogrambinaryincasetheloopandprocedurestructuresareobfuscatedbyanoptimizingcompiler.

Inevaluation,weshowthatthenewanalysisﬁndsrecurringphasesofwidelyvaryingsizesandnearlyidenticallocality.Thephaselengthchangesintunewithprograminputsandrangesfromtwohundredthousandtothreebillioninstructions—thislengthispre-dictedwith99.5%accuracy.Wecompareitwithotherphasepre-dictionmethods,andweshowitsuseinadaptivecacheresizingandphase-basedmemoryremapping.

Localityphasepredictionisnoteffectiveonallprograms.Someprogramsmaynothavepredictablephases.Somephasesmaynotbepredictablefromitsdatalocality.Welimitouranalysistopro-gramsthathavelargepredictablephases,whichneverthelessin-cludeimportantclassesofdynamicprograms.Forsomeprogramssuchasacompileroradatabase,theanalysiscanstillidentifyphasesbutcannotpredicttheexactlocality.

2.HIERARCHICALPHASEANALYSIS

Thissectionﬁrstmotivatestheuseoflocalityanalysisandthendescribesthestepsoflocality-basedphaseprediction.

2.1LocalityAnalysisUsingReuseDistance

In1970,Mattsonetal.deﬁnedtheLRU-stackdistanceasthenumberofdistinctdataelementsaccessedbetweentwoconsecu-tivereferencestothesameelement[24].Theysummarizedthelocalityofanexecutionbythedistancehistogram,whichdeter-minesthemissrateoffully-associativeLRUcacheofallsizes.Buildingondecadesofdevelopmentbyothers,ourearlierworkreducedtheanalysiscosttonearlineartime.Anumberofrecentstudiesfoundthatreuse-distancehistogramschangeinpredictablepatternsinmanyprograms[12,14,23,28].Inthisworkwegoonestepfurthertoseewhetherpredictablepatternsexistforsubpartsofaprogram.ForbrevitywecalltheLRUstackdistancebetweentwoaccessesofthesamedatathereusedistanceofthesecondaccess.Reusedistancerevealspatternsinprogramlocality.Weusethe

x 1062ecn1.5atsiD es1ueR0.5023Logical Time45x 107Figure1:Thereuse-distancetraceofTomcatv

exampleofTomcatv,avectorizedmeshgenerationprogramfromSPEC95knownforitshighlymemory-sensitiveperformance.Fig-ure1showsthereusedistancetrace.Eachdataaccessisapointinthegraph—thex-axisgivesthelogicaltime(i.e.thenumberofdataaccesses),andthey-axisgivesthereusedistance1.Thepointsaresonumerousthattheyemergeassolidblocksandlines.

Thereusedistanceofdataaccesschangescontinuouslythrough-outthetrace.Wedeﬁneaphasechangeasanabruptchangeindatareusepattern.Inthisexample,theabruptchangesdividethetraceintoclearlyseparatedphases.Thesamephasesrepeatinaﬁxedsequence.Readingthecodedocumentation,weseeindeedthattheprogramhasasequenceoftimesteps,eachhasﬁvesub-steps—preparationofdata,residualvalues,solvingtwotridiagnoalsystems,andaddingcorrections.Whatisremarkableisthatwecouldseethesamepatternfromthereusedistancetracewithoutlookingattheprogram.

Theexampleconﬁrmsfourcommonlyheldassumptionsaboutprogramlocality.First,thedatalocalitymaychangeconstantlyinanexecution;however,majorshiftsinprogramlocalityaremarkedbyradicalratherthangradualchanges.Second,localityphaseshavedifferentlengths.Thesizeofonephasehaslittlerelationwiththesizeofothers.Third,thesizechangesgreatlywithprograminputs.Forexample,thephasesofTomcatvcontainafewhundredmillionmemoryaccessesinatrainingrunbutovertwenty-ﬁvebil-lionmemoryaccessesinatestrun.Finally,aphaseoftenrecurswithsimilarlocality.Aphaseisaunitofrepeatingbehaviorratherthanaunitofuniformbehavior.Toexploittheseproperties,local-ityphasepredictionusesreusedistancetotrackﬁne-grainchangesandﬁndprecisephaseboundaries.Itusessmalltrainingrunstopredictlargerexecutions.

Reusedistancemeasureslocalitybetterthanpureprogramorhardwaremeasures.Compileranalysiscannotfullyanalyzelocal-ityinprogramsthathavedynamicdatastructuresandindirectdataaccess.Thecommonhardwaremeasure,themissrate,isdeﬁnedoverawindow.Evenregularprogramsmayhaveirregularcachemissratedistributionswhenwecutthemintowindows,asshownlaterinFigure3.Itisdifﬁculttoﬁndaﬁxedwindowsizethatmatchesthephasesofunequallengths.Wemayusethemisstrace,butacachemissisabinaryevent—hitormissforagivencacheconﬁguration.Incomparison,reusedistanceisaprecisescale.Itispurelyaprogramproperty,independentofhardwareconﬁgura-tions.

Tospeculateistosee.Reusedistanceshowsaninterestingpic-tureofprogramlocality.Nextwepresentasystemthatautomati-callyuncoversthehierarchyoflocalityphasesfromthispicture.

2.2Off-linePhaseDetection

Giventheexecutiontraceoftrainingruns,phasedetectionop-eratesinthreesteps:variable-distancesamplingcollectsthereusedistancetrace,waveletﬁlteringﬁndsabruptchanges,andﬁnally,optimalphasepartitioninglocatesthephaseboundary.

2.2.1Variable-DistanceSampling

Insteadofanalyzingallaccessestoalldata,wesampleasmallnumberofrepresentativedata.Inaddition,foreachdata,werecordonlylong-distancereusesbecausetheyrevealglobalpatterns.Variable-distancesamplingisbasedonthedistance-basedsamplingdescribedbyDingandZhong[12].TheirsamplerusesATOMtogeneratethedataaccesstraceandmonitorsthereusedistanceofeveryaccess.Whenthereusedistanceisaboveathreshold(thequaliﬁcationthreshold),theaccessedmemorylocationistakenasadatasample.Alateraccesstoadatasampleisrecordedasanaccesssampleifthereusedistanceisoverasecondthreshold(thetemporalthreshold).Toavoidpickingtoomanydatasamples,itre-quiresthatanewdatasampletobeatleastacertainspacedistanceaway(thespatialthreshold)inmemoryfromexistingdatasamples.ThethreethresholdsinDingandZhong’smethodaredifﬁculttocontrol.Variable-distancesamplingsolvesthisproblembyus-ingdynamicfeedbacktoﬁndsuitablethresholds.Givenanarbi-traryexecutiontrace,itslength,andthetargetnumberofsamples,itstartswithaninitialsetofthresholds.Itperiodicallycheckswhethertherateofsamplecollectionistoohighortoolowconsid-eringthetargetsamplesize.Itchangesthethresholdsaccordinglytoensurethattheactualsamplesizeisnotfargreaterthanthetarget.Sincesamplinghappensoff-line,itcanusemoretimetoﬁndap-propriatethresholds.Inpractice,variable-distancesamplingﬁnds15thousandto30thousandsamplesinlessthan20adjustmentsofthresholds.Ittakesseveralhoursforthelaterstepsofwaveletﬁlteringandoptimalphasepartitioningtoanalyzethesesamples,althoughthelongtimeisacceptableforouroff-lineanalysisandcanbeimprovedbyamoreefﬁcientimplementation(currentlyus-ingMatlabandJava).

Thevariable-distancesamplingmaycollectsamplesatanun-evenrate.Evenatasteadyrate,itmayincludepartialresultsforexecutionsthathaveunevenreusedensity.However,thetargetsamplesizeislarge.Theredundancyensuresthatthesesamplestogethercontainelementsinallphaseexecutions.Ifadatasamplehastoofewaccesssamplestobeuseful,thenextanalysisstepwillremovethemasnoise.

2.2.2WaveletFiltering

Viewingthesampletraceasasignal,weusetheDiscreteWaveletTransform(DWT)asaﬁltertoexposeabruptchangesinthereusepattern.TheDWTisacommontechniqueinsignalandimageprocessing[7].Itshowsthechangeoffrequencyovertime.Asamutli-resolutionanalysis,theDWTappliestwofunctionstodata:thescalefunctionandthewaveletfunction.Theﬁrstsmoothsthesignalbyaveragingitsvaluesinawindow.Thesecondcalculatesthemagnitudeofarangeoffrequenciesinthewindow.Thewin-dowthenshiftsthroughthewholesignal.Afterﬁnishingthecal-culationsonthewholesignal,itrepeatsthesameprocessatthenextlevelonthescaledresultsfromthelastlevelinsteadofontheoriginalsignal.Thisprocessmaycontinueformanylevelsasamulti-resolutionprocess.Foreachpointoneachlevel,ascalingandawaveletcoefﬁcientarecalculatedusingthevariationsofthe

followingbasicformulas:

cj(k)=wj(k)=

where,isthescalerproductofaandb,f(x)istheinputsignal,jistheanalysislevel,φandψarethescalingandwaveletfunctionrespectively.Manydifferentwaveletfamiliesex-istintheliterature,suchasHaar,Daubechies,andMexican-hat.WeuseDaubechies-6inourexperiments.Otherfamilieswehavetestedproduceasimilarresult.Onhigh-resolutionlevels,thepointswithhighwaveletcoefﬁcientvaluessignalabruptchanges;there-foretheyarelikelyphasechangingpoints.

Thewaveletﬁlteringtakesthereuse-distancetraceofeachdatasampleasasignal,thencomputesthelevel-1coefﬁcientforeachaccessandremovesfromthetracetheaccesseswithalowwaveletcoefﬁcientvalue.Anaccessiskeptonlyifitscoefﬁcientω>m+3δ,wheremisthemeanandδisthestandarddeviation.Thedifferencebetweenthiscoefﬁcientandothersisstatisticallysigniﬁ-cant.Wehaveexperimentedwithcoefﬁcientsofthenextfourlevelsandfoundthelevel-1coefﬁcientadequate.

Figure2showsthewaveletﬁlteringfortheaccesstraceofadatasampleinMolDyn,amoleculardynamicssimulationprogram.Theﬁlteringremovesaccessesduringthegradualchangesbecausetheyhavelowcoefﬁcients.Notethatitcorrectlyremovesaccessesthatcorrespondtolocalpeaks.Theremainingfouraccessesindicateglobalphasechanges.

Sherwoodetal.usedtheFouriertransformtoﬁndperiodicpat-ternsinexecutiontrace[29].TheFouriertransformshowsthefre-quenciesappearedduringthewholesignal.Incomparison,waveletsgivesthetime-frequencyorthefrequenciesappearedovertime.Josephetal.usedwaveletstoanalyzethechangeofprocessorvoltageovertimeandtomakeon-linepredictionsusinganefﬁ-cientHaar-waveletimplementation[18].Weusewaveletssimilartotheiroff-lineanalysisbutatmuchﬁnergranularity(becauseofthenatureofourproblem).Insteadofﬁlteringtheaccesstraceofalldata,weanalyzethesub-traceforeachdataelement.Thisiscriticalbecauseagradualchangeinthesubtracemaybeseenasanabruptchangeinthewholetraceandcausefalsepositivesinthewaveletanalysis.WewillshowanexamplelaterinFigure3(b),wheremostabruptchangesseenfromthewholetracearenotphasechanges.

Afteritﬁltersthesub-traceofeachdatasample,theﬁlteringsteprecompilestheremainingaccessesofalldatasamplesintheorderoflogicaltime.Thenewtraceiscalledaﬁlteredtrace.Sincetheremainingaccessesofdifferentdataelementsmaysignalthesamephaseboundary,weuseoptimalphasepartitioningtofurtherremovetheseredundantindicators.

Figure2:Awavelettransformexample,wheregradualchangesareﬁlteredout

2.2.3OptimalPhasePartitioning

Ataphaseboundary,manydatachangetheiraccesspatterns.Sincethewaveletﬁlteringremovesreusesofthesamedatawithinaphase,theremainingismainlyaccessestodifferentdatasamplesclusteredatphaseboundaries.Thesetwopropertiessuggesttwoconditionsforagoodphasepartition.First,aphaseshouldincludeaccessestoasmanydatasamplesaspossible.Thisensuresthatwedonotartiﬁciallycutaphaseintosmallerpieces.Second,aphaseshouldnotincludemultipleaccessesofthesamedatasample,sincedatareusesindicatephasechangesintheﬁlteredtrace.Thecomplication,however,comesfromtheimperfectﬁlteringbythewavelettransform.Notallreusesrepresentaphasechange.

Weconverttheﬁlteredtraceintoadirectedacyclicgraphwhereeachnodeisanaccessinthetrace.Eachnodehasadirectededgetoallsucceedingnodes.Eachedge(fromaccessatob)hasaweightdeﬁnedasw=αr+1,where1≥α≥0,andristhenum-berofnoderecurrencesbetweenaandb.Forexample,thetraceaceefgefbdhastworecurrencesofeandonerecurrenceoffbe-tweencandb,sotheedgeweightbetweenthetwonodesis3α+1.Intuitively,theweightmeasureshowﬁtthesegmentfromatobisasaphase.Thetwofactorsintheweightpenalizetwotendencies.Theﬁrstistheinclusionofreuses,andthesecondisthecreationofnewphases.Theoptimalcaseisaminimalnumberofphaseswithleastreusesineachphase.Sincethetraceisnotperfect,theweightandthefactorαcontroltherelativepenaltyfortoolargeortoosmallphases.Ifαis1,weprohibitanyreusesinaphase.Wemayhaveasmanyphasesasthelengthoftheﬁlteredtrace.Theresultwhenα≥1isthesameasα=1.Ifαis0,wegetonephase.Inexperiments,wefoundthatthephasepartitionsweresimilarwhenαisbetween0.2and0.8,suggestingthatthenoiseintheﬁlteredtracewasacceptable.Weusedα=0.5intheevaluation.

Onceαisdetermined,shortest-pathanalysisﬁndsaphaseparti-tionthatminimizesthetotalpenalty.Itaddstwonodes:asourcenodethathasdirectededgesﬂowingtoalloriginalnodes,andasinknodethathasdirectededgescomingfromalloriginalnodes.Anydirectedpathfromthesourcetothesinkgivesaphaseparti-tion.Theweightofthepathisitspenalty.Therefore,thebestphasepartitiongivestheleastpenalty,anditisgivenbytheshortestpathbetweenthesourceandthesink.

Summaryofoff-linephasedetectionTheprogramlocalityisaproductofallaccessestoallprogramdata.Thephasedetectionﬁrstpicksenoughsamplesintimeandspacetocapturethehigh-levelpattern.Thenituseswaveletstoremovethetemporalredun-dancyandphasepartitioningtoremovethespatialredundancy.Thenextchallengeismarkingthephasesinprogramcode.Thewaveletﬁlteringlosesaccuratetimeinformationbecausesamplesarecon-sideredapairatatime(tomeasurethedifference).Inaddition,thelocalitymaychangethroughatransitionperiodinsteadofatran-sitionpoint.Hencetheexacttimeofaphasechangeisdifﬁculttoattain.Weaddressthisprobleminthenextstep.

2.3PhaseMarkerSelection

Theinstructiontraceofanexecutionisrecordedatthegranular-ityofbasicblocks.Theresultisablocktrace,whereeachelementisthelabelofabasicblock.Thisstepﬁndsthebasicblocksinthecodethatuniquelymarkdetectedphases.Previousprogramanalysisconsideredonlyasubsetofcodelocations,forexamplefunctionandloopboundaries[17,21,22].Ouranalysisexaminesallinstructionblocks,whichisequivalenttoexaminingallprograminstructions.Thisisespeciallyimportantatthebinarylevel,wherethehighlevelprogramstructuremaybelostduetoaggressivecom-pilertransformationssuchasprocedurein-lining,softwarepipelin-ing,loopfusion,andcodecompression.

Asexplainedearlier,phasedetectionﬁndsthenumberofphasesbutcannotlocatetheprecisetimeofphasetransitions.Thepreci-sionisintheorderofhundredsofmemoryaccesseswhileatypicalbasicblockhasfewerthantenmemoryreferences.Moreover,thetransitionmaybegradual,anditisimpossibletolocateasinglepoint.Wesolvethisproblembyusingthefrequencyofthephasesinsteadofthetimeoftheirtransition.

Wedeﬁnethefrequencyofaphasebythenumberofitsexecu-tionsinthetrainingrun.Giventhefrequencyfoundbythelaststep,wewanttoidentifyabasicblockthatisalwaysexecutedatthebe-ginningofaphase.Wecallitthemarkerblockforthisphase.Ifthefrequencyofaphaseisf,themarkerblockshouldappearnomorethanftimesintheblocktrace.Theﬁrststepofthemarkerselec-tionﬁlterstheblocktraceandkeepsonlyblockswhosefrequencyisnomorethanf.Ifaloopisaphase,theﬁlteringwillremovetheoccurrencesoftheloopbodyblockandkeeponlytheheaderandtheexitblocks.Ifasetofmutualrecursivefunctionsformsaphase,theﬁlteringwillremovethecodeofthefunctionsandkeeponlytheonesbeforeandaftertherootinvocation.Afterﬁltering,theremainingblocksarecandidatemarkers.

Afterfrequency-basedﬁltering,theremovedblocksleavelargeblankregionsbetweentheremainingblocks.Ifablankregionislargerthanathreshold,itisconsideredasaphaseexecution.Thethresholdisdeterminedbythelengthdistributionoftheblankre-gions,thefrequencyofphases,andtheexecutionlength.Sincethetrainingrunshadatleast3.5millionmemoryaccesses,wesimplyused10thousandinstructionsasthethreshold.Inotherwords,aphaseexecutionmustconsumeatleast0.3%ofthetotalexecutiontobeconsideredsigniﬁcant.Wecanuseasmallerthresholdtoﬁndsub-phasesafterweﬁndlargephases.

Oncethephaseexecutionsareidentiﬁed,theanalysisconsiderstheblockthatcomesafteraregionasmarkersmarkingthebound-arybetweenthetwophases.Tworegionsareexecutionsofthesamephaseiftheyfollowthesamecodeblock.Theanalysispicksmarkersthatmarkmostifnotallexecutionsofthephasesinthetrainingrun.Wehaveconsideredseveralimprovementsthatcon-siderthelengthoftheregion,usemultiplemarkersforthesamephase,andcorrelatemarkerselectionacrossmultipleruns.How-ever,thisbasicschemesufﬁcesforprogramswetested.

Requiringthemarkerfrequencytobenomorethanthephasefrequencyisnecessarybutnotsufﬁcientforphasemarking.Aphasemaybefragmentedbyinfrequentlyexecutedcodeblocks.However,afalsemarkercannotdivideaphasemorethanftimes.Inaddition,thepartialphaseswillberegroupedinthenextstep,phase-hierarchyconstruction.

2.4MarkingThePhaseHierarchy

HierarchicalconstructionGiventhedetectedphases,wecon-structaphasehierarchyusinggrammarcompression.Thepur-poseistoidentifycompositephasesandincreasethegranularityofphaseprediction.Forexample,fortheTomcatvprogramshowedinFigure1,everyﬁvephaseexecutionsformatimestepthatre-peatsasacompositephase.Byconstructingthephasehierarchy,weﬁndphasesofthelargestgranularity.

WeuseSEQUITUR,alinear-timeandlinear-spacecompressionmethoddevelopedbyNevill-ManningandWitten[26].Itcom-pressesastringofsymbolsintoaContextFreeGrammar.Tobuildthephasehierarchy,wehavedevelopedanovelalgorithmthatex-tractsphaserepetitionsfromacompressedgrammarandrepresentsthemexplicitlyasaregularexpression.Thealgorithmrecursivelyconvertsnon-terminalsymbolsintoregularexpressions.Itremem-berspreviousresultssothatitconvertsthesamenon-terminalsym-bolonlyonce.Amergestepoccursforanon-terminalonceits

right-handsideisfullyconverted.Twoadjacentregularexpressionsaremergediftheyareequivalent(usingforexampletheequivalenttestdescribedbyHopcroftandUllman[16]).

SEQUITURwasusedbyLarustoﬁndfrequentcodepaths[19]andbyChilimbitoﬁndfrequentdata-accesssequences[5].TheirmethodsmodelthegrammarasaDAGandﬁndsfrequentsub-sequencesofagivenlength.Ourmethodtraversesthenon-terminalsymbolsinthesameorder,butinsteadofﬁndingsub-sequences,itproducesaregularexpression.

PhasemarkerinsertionThelaststepusesbinaryrewritingtoinsertmarkersintoaprogram.Thebasicphases(theleavesofthephasehierarchy)haveuniquemarkersintheprogram,sotheirpre-dictionistrivial.Topredictthecompositephases,weinsertapre-dictorintotheprogram.Basedonthephasehierarchy,thepredictormonitorstheprogramexecutionandmakespredictionsbasedontheon-linephasehistory.Sincethehierarchyisaregularexpres-sion,thepredictorusesaﬁniteautomatontorecognizethecurrentphaseinthephasehierarchy.Intheprogramswetestedsofar,thissimplemethodsufﬁces.Thecostofthemarkersandthepredictorisnegligiblebecausetheyareinvokedonceperphaseexecution,whichconsistsofonaveragemillionsofinstructionsasshownintheevaluation.

Table1:Benchmarks

Benchmark

Source

fastFouriertransformation

Spec2KFp

commonUNIXcompressionutility

Spec95Int

vectorizedmeshgeneration

Spec95Fp

anobject-orienteddatabase

AppluGccSwim

Mesh

moleculardynamicssimulation

CHAOS

3.EVALUATION

Weconductfourexperiments.Weﬁrstmeasurethegranularityandaccuracyofphaseprediction.Wethenuseitincacheresizingandmemoryremapping.Finally,wetestitagainstmanualphasemarking.Wecomparewithotherpredictiontechniquesintheﬁrsttwoexperiments.

OurtestsuiteisgiveninTable1.Wepickprogramsfromdiffer-entsetsofcommonlyusedbenchmarkstogetaninterestingmix.Theyrepresentcommoncomputationtasksinsignalprocessing,combinatorialoptimization,structuredandunstructuredmeshandN-bodysimulations,acompiler,andadatabase.FFTisaba-sicimplementationfromatextbook.ThenextsixprogramsarefromSPEC:threeareﬂoating-pointandthreeareintegerprograms.ThreearefromSPEC95suite,onefromSPEC2K,andtwo(withsmallvariation)arefromboth.OriginallyfromtheCHAOSgroupatUniversityofMaryland,MolDynandMesharetwodynamicpro-gramswhosedataaccesspatterndependsonprograminputsandchangesduringexecution[6].Theyarecommonlystudiedindy-namicprogramoptimization[11,15,25,32].Theﬂoating-pointprogramsfromSPECarewritteninFortran,andtheintegerpro-gramsareinC.Ofthetwodynamicprograms,MolDynisinFor-tran,andMeshisinC.Wenotethatthechoiceofsource-levellan-guagesdoesnotmatterbecauseweanalyzeandtransformprogramsatthebinarylevel.

ForprogramsfromSPEC,weusethetestorthetraininputforphasedetectionandtherefinputforphaseprediction.Forthepre-dictionofMesh,weusedthesamemeshasthatinthetrainingrunbutwithsortededges.Forallotherprograms,thepredictionistestedonexecutionshundredstimeslongerthanthoseusedinphasedetection.

WeuseATOMtoinstrumentprogramstocollectthedataandinstructiontraceonaDigitalAlphamachine[31].AllprogramsarecompiledbytheAlphacompilerusing“-O5”ﬂag.Afterphaseanalysis,weagainuseATOMtoinsertmarkersintoprograms.

3.1PhasePrediction

WepresentresultsforallprogramsexceptforGccandVortex,whichwediscussattheendofthissection.Weﬁrstmeasurethephaselengthandthenlookatthephaselocalityindetail.

Table2showstwosetsofresults.Theupperhalfshowsthe

Table2:Theaccuracyandcoverageofphaseprediction

BenchmarksAppluTomcatvMeshStrict100100100accuracy96.4192.3972.75

Accuracy(%)99.9699.9100Coverage(%)99.7099.7699.58

Average

96.47

13.49

86.1498.48

FFTCompressSwimMolDyn

14529159

Table3:Thenumberandthesizeofphasesindetectionandpredictionruns

DetectionPrediction

exe.len.avg.largestphaseleafavg.leafsize(Minst.)size(Minst.)phases(Minst.)

2.55730.4

254.33.29443775.5

0.66762418.4

175.034.952504.7

4.133334.9

5151.998.246911.1

0.20250988.1

863.6627.393317146.5

232.22712.037.031699.6

Tomcatv

Phase boundaries in sample trace

phaseboundaryCompress

Phase boundaries in sample trace

markerblock ID

Logical TimeLogical Time

25 billion inst., 5250 executions (crosses) of 7 phases

62 billion inst., 52 executions (crosses)

of 4 phases

phase IDshortest & longest exe. length in M inst.frequency (%)1225 billion inst., 2493 intervals (dots)9 of 38 BBV clusters (boxes) shown(12256KB cache miss rate (%)62 billion inst., 6242 intervals (dots)7 of 21 BBV clusters (boxes) shown256KB cache miss rate (%)108642002.972.9710812.596.73(e)cluster bounding boxfrequency (%)2.972.972.972.972.972.4526.08(f)6410.8623.54204.6336.412.96510152032KB cache miss rate (%)255101532KB cache miss rate (%)20Figure3:PredictionAccuracyforTomcatvandCompress.Part(a)and(b)showthephaseboundariesfoundbyoff-linephasedetection.Part(c)and(d)showthelocalityofthephasesfoundbyrun-timeprediction.Asacomparison,Part(e)and(f)showthelocalityoftenmillion-instructionintervalsandBBV(basic-blockvector)clusters.

16 1L14 4r)ew%12( o)10Py anwPhase 1o28 ePhase 2tBaK6R2 3s(4siM20051015202530Phase Occurance NumberFigure4:ThemissratesofCompressphasesonIBMPower4

variation.Infact,inthemajorityofcasesintheseprograms,BBVproducestightclusters.However,eveninbestcases,BBVclustersdonohaveperfectlystackedpointsaslocalityphasesdo.

Table4showstheinitialandnormalizedstandarddeviation.Thelocalityisan8-elementvectorthatcontainsthemissrateforcachesizesfrom32KBto256KBin32KBincrements.Thestandardde-viationiscalculatedforallexecutionsofthesamephaseandtheintervalsofeachBBVcluster.Thenthestandarddeviationofallphasesorclustersareaveraged(weightedbythephaseorclustersize)toproducethenumberfortheprogram.ThenumbersofBBVclusteringandprediction,shownbythelasttwocolumns,aresim-ilarlysmallasreportedbySherwoodetal.forIPC[30].Still,thenumbersforlocalityphasesaremuchsmaller—onetoﬁveordersofmagnitudesmallerthanthatofBBV-basedprediction.

Table4:StandarddeviationoflocalityphasesandBBVphases

standarddeviations

localityphaseBBVRLEMarkov

clustering

6.87E-80.00615.06E-70.000133.14E-60.000614.53E-70.00162.66E-80.000186.00E-60.000637.60E-50.00067

Sofarwemeasurethecachemissratethroughsimulation,whichdoesnotincludeallfactorsonrealmachinessuchasthatoftheoperatingsystem.WenowexaminetheL1missrateonanIBMPower4processorfortheﬁrsttwophasesofCompress(theothertwophasesaretooinfrequenttobeinteresting).Figure4showsthemeasuredmissrateforeachexecutionofthetwophases.AllbuttheﬁrstexecutionofPhase1havenearlyidenticalmissratesonthe32KB2-waydatacache.TheexecutionsofPhase2showmorevariation.TheeffectfromtheenvironmentismorevisibleinPhase2likelybecauseitsexecutionsareshorterandthemissratelowerthanthoseoftheﬁrstphase.

Thecomparisonwithinterval-basedmethodsispartialbecauseweuseonlyprogramsthatareamenabletolocality-phasepredic-tion.Manydynamicprogramsdonothaveconsistentlocality.Fortheminterval-basedmethodscanstillexploitrun-timepatterns,whileourcurrentphasepredictionschemewouldnotworkbecauseitas-

Spec95/Gcc, compiling cccp.i, sampled 331 times1.0E5ec8.0E4natsid 6.0E4esuer d4.0E4elpmas2.0E40.0E00.0E02.0E74.0E76.0E78.0E71.0E8logical time of data accessSpec95/Vortex, test run, sampled 272 times8.0E5ecna6.0E5tsid esuer4.0E5 delpmas2.0E50.0E00.0E05.0E81.0E9 1.5E92.0E9 2.5E9logical time of data accessFigure5:SampledreusedistancetraceofGccandVortex.The

exactphaselengthisunpredictableingeneral.

sumesthateachphase,onceinexecution,maintainsidenticallo-cality.Nextaretwosuchexamples.

3.1.2GccandVortex

TheprogramsGccandVortexaredifferentbecausetheirphaselengthisnotconsistenteveninthesameexecution.InGcc,thephaselengthisdeterminedbythefunctionbeingcompiled.Fig-ure5showsthedistance-basedsampletrace.Unlikeprevioustracegraphs,ituseshorizontalstepstolinksamplepoints.Thepeaksintheuppergraphroughlycorrespondtothe100functionsinthe6383-lineinputﬁle.Thesizeandlocationofthepeaksaredeter-minedbytheinputandarenotconstant.

Vortexisanobject-orienteddatabase.Thetestrunﬁrstconstructsadatabaseandthenperformsasetofqueries.ThelowerﬁgureofFigure5showsthesampletrace.Itshowsthetransitionfromdatainsertiontoqueryprocessing.However,inotherinputs,thecon-structionandqueriesmaycomeinanyorder.Theexactbehavior,likeGcc,isinputdependentandnotconstant.

OurrecentresultsshowthatanextensionofthephaseanalysiscanmarkthephasesinGcc,whicharethecompilationofinputfunctions.Still,thepredictionresultsareinputdependent.DingandZhongshowedthattheoverallreusepatternofthesetwopro-gramsarestableacrossmanyinputs[12].Itsuggestsapredictionstrategybasedonstatistics.Wedonotconsiderthisextensioninthispaperandwillnotdiscussthesetwoprogramsfurther.

3.2AdaptiveCacheResizing

Duringanexecution,cacheresizingreducesthephysicalcachesizewithoutincreasingthemissrate[2,21].Therefore,itcanre-ducetheaccesstimeandenergyconsumptionofthecachewithoutlosingperformance.Weuseasimpliﬁedmodelwherethecacheconsistsof64-byteblocksand512sets.Itcanchangefromdirectmappedto8-waysetassociative,sothecachesizecanchangebe-

tween32KBand256KBin32KBunits.Intheadaptation,weneedtopredictthesmallestcachesizethatyieldsthesamemissrateasthe256KBcache.

AsseenintheexampleofTomcatv,programdatabehaviorchangesconstantly.Alocalityphaseisaunitofrepeatingbehaviorratherthanaunitofuniformbehavior.Tocapturethechangingbe-haviorinsidealargephase,wedivideitinto10Kintervals(calledphaseintervals).Theadaptationﬁndsthebestcachesizeforeachintervalduringtheﬁrstfewexecutionsandreusesthemforlaterruns.Theschemeneedshardwaresupportbutneedsnomorethanthatofinterval-basedcacheresizing.

Interval-basedcacheresizingdividesanexecutionintoﬁxed-lengthwindows.Basedonthehistory,intervalpredictionclassi-ﬁespastintervalsintoclassesandpredictsthebehaviorclassofthenextintervalusingmethodssuchaslast-valueandMarkovmod-els.Forcacheresizing,Balasubramonianetal.usedthecachemissrateandbranchpredictionratetoclassifypastintervals[2].Recentstudiesconsideredcodeinformationsuchascodeworkingset[9]andbasic-blockvector(BBV)[30].Wetestinterval-basedpre-dictionusingﬁvedifferentintervallengths:10K,1M,10M,40M,and100Mmemoryaccesses.Inaddition,wetestaBBVpredictorusing10Minstructionwindows,followingtheimplementationofSherwoodetal[30].

Theintervalmethodsdonothavephasemarkers,sotheycon-stantlymonitoreverypastintervaltodecidewhetheraphasechangehasoccurred.Intheexperiment,weassumeperfectdetection:thereisaphasechangeifthebestcachesizeofthenextintervaldiffersfromthecurrentone.BBVmethodusesarun-lengthencodingMarkovpredictortogivetheBBVclusterofthenextinterval(thebestpredictorreportedin[30]).However,asthelastsectionshows,theintervalsofaBBVclusterdonotalwayshaveidenticallocal-ity.WeuseperfectdetectionforBBVaswedoforotherintervalmethods.

Ataphasechange,thephasemethodreusesthebestcachesizestoredforthesamephase.TheBBVmethodreusesthecurrentbestcachesizeforeachBBVcluster.Forrun-timeexploration,wecounttheminimalcost—eachexplorationtakesexactlytwotrialruns,oneatthefullcachesizeandoneatthehalfcachesize.Thenweusethebestcachesizefromthethirdinterval.Intheexper-iment,weknowthebestcachesizeofeachphaseorintervalbyrunningitthroughCheetah,acachesimulatorthatmeasuresthemissrateofalleightcachesizesatthesametime[33].There-sultsforintervalandBBVmethodsareidealisticbecausetheyuseperfectphase-changedetection.Theresultofthephase-intervalmethodisreal.Becauseitknowstheexactbehaviorrepetition,thephase-intervalmethodcanamortizetheexplorationcostovermanyexecutions.Withtherighthardwaresupport,itcangaugetheexactlosscomparedtothefullsizecacheandguaranteeaboundontheabsoluteperformanceloss.

Figure6showstheaveragecachesizefromphase,interval,andBBVmethods.Theﬁrstgraphshowstheresultsofadaptationwithnomiss-rateincrease.Theresultsarenormalizedtothephasemethod.Thelargestcachesize,256KB,isshownasthelastbarineachgroup.Differentintervalsﬁnddifferentcachesizes,butallreductionsarelessthan10%.Theaverageis6%.BBVgivesconsistentlygoodreductionwithasingleintervalsize.Theim-provementisatmost15%andonaverage10%.Incontrast,thephaseadaptationreducesthecachesizeby50%formostprogramsandover35%onaverage.

Figure6(b)showstheresultsofadaptationwitha5%boundonthemiss-rateincrease.Theeffectofintervalmethodsvariesgreatly.The10Mintervalwas20%betterthanthelocalityphaseforFFTbutafactorofthreeworseforTomcatvandSwim.The100Min-

2.5e2ziS PhaseehcIntvl-10kaC1.5Intvl-1M gvIntvl-10MA dIntvl-40Mez1Intvl-100MilamBBVrolargest-sizeN0.50MoldynatvTomcSwimFFTprssMeshppluAgeComeAveraBenchmarks4.54ez3.5iS Phaseeh3cIntvl-10kaCIntvl-1M g2.5vIntvl-10MA d2Intvl-40MezIntvl-100Milam1.5BBVrolargest-sizeN10.50vsMoldynmcatSwimFFTToresMeshppluAgeCompveraABenchmarksFigure6:Averagecache-sizereductionbylocalityphase,in-terval,andBBVpredictionmethods,assumingperfectphase-changedetectionandminimal-explorationcostforintervalandBBVmethods.Uppergraph:noincreaseincachemisses.Lowergraph:atmost5%increase.

tervalhasthebestaveragereductionofnearly50%.BBVagainshowsconsistentlygoodreductionwithasingleintervalsize.Onaverageitisslightlybetterthanthebestintervalmethod.ThephasemethodreducesthecachesizemorethanothermethodsdoforallprogramsexceptforFFT.FFThasvariedbehavior,whichcausesthelowcoverageandconsequentlynotaslargecache-sizereduc-tionbylocalityphaseprediction.MolDyndoesnothaveidenticallocality,sophase-basedresizingcausesa0.6%increaseinthenum-berofcachemisses.Acrossallprograms,theaveragereductionusinglocalityphasesisover60%.

Earlierstudiesusedmoreaccuratemodelsofcacheandmea-suredtheeffectontimeandenergythroughcycle-accuratesimula-tion.Sincesimulatingthefullexecutiontakesalongtime,paststudieseitherusedapartialtraceorreducedtheprograminputsize[2,21].Wechoosetomeasurethemissrateoffullexecu-tions.Whileitdoesnotgivethetimeorenergy,themissrateisaccurateandreproduciblebyotherswithoutsigniﬁcanteffortsincalibrationofsimulationparameters.

3.3Phase-BasedMemoryRemapping

Weuselocalityphasesinrun-timememoryremapping.Tosup-portdataremappingatthephaseboundary,weassumethesupportoftheImpulsememorycontroller,developedbyCarterandhiscol-leaguesatUniversityofUtah[34,35].ImpulsereorganizesdatawithoutactuallycopyingthemtoCPUorinmemory.Forexam-ple,itmaycreateacolumn-majorversionofarow-majorarrayviaremapping.AkeyrequirementforexploitingImpulseistoidentifythetimewhenremappingisproﬁtable.

Weconsiderafﬁnity-basedarrayremapping,wherearraysthattendtobeaccessedconcurrentlyareinterleavedbyremapping[36].Todemonstratethevalueoflocalityphaseprediction,weevaluatetheperformancebeneﬁtsofredoingtheremappingforeachphaseratherthanonceforthewholeprogramduringcompilation.Weapplyafﬁnityanalysisforeachphaseandinsertremappingcodeatthelocationofthephasemarker.Thefollowingtableshowstheexecutiontimeinsecondson2GHzIntelPentiumIVmachinewiththegcccompilerusing-O3.

Forthetwoprograms,weobtainspeedupsof35.5%and2.8%comparedtotheoriginalprogramand13%and2.5%comparedtothebeststaticdatalayout[36],asshowninTable5.IntheabsenceofanImpulseimplementation,weprogramtheremappingandcal-culatetherunningtimeexcludingtheremappingcost.Table7.3ofZhang’sdissertationshowstheoverheadofsettingupremappingsforawiderangeofprograms.Theoverheadincludessettingupshadowregion,creatingmemorycontrollerpagetable,dataﬂush-ing,andpossibledatamovement.Thelargestoverheadshownis1.6%ofexecutiontimeforstaticindexvectorremapping[34].Forexampleforthe14majorarraysinSwim,whole-programanalysisshowscloseafﬁnitybetweenarrayuandv,uoldandpold,andunewandpnew.Phase-basedanalysisshowsafﬁnitygroup{u,v,p}fortheﬁrstphase,{u,v,p,unew,vnew,pnew}forthesecondphase,andthreeothergroups,{u,uold,unew},{v,vold,vnew},and{p,pold,pnew},forthethirdphase.Comparedtowhole-programreorganization,thephase-basedoptimizationreducescachemissesbyonethird(duetoarrayp)fortheﬁrstphase,bytwothirdsforthesecondphase,andbyhalfforthethirdphase.

Usingthetwoexampleprograms,wehaveshownthatphasepre-dictionﬁndsopportunitiesofdynamicdataremapping.Theaddi-tionalissuesofafﬁnityanalysisandcodetransformationaredis-cussedbyZhongetal[36].TheexactinteractionwithImpulseliketoolsisasubjectoffuturestudy.

3.4ComparisonwithManualPhaseMarking

Wehand-analyzedeachprogramandinsertedphasemarkers(man-ualmarkers)basedonourreadingofthecodeanditsdocumenta-tionaswellasresultsfromgprof(toﬁndimportantfunctions).We

comparemanualmarkingwithautomaticmarkingasfollows.Asaprogramruns,allmarkersoutputthelogicaltime(thenumberofmemoryaccessesfromthebeginning).Giventhesetoflogi-caltimesfrommanualmarkersandthesetfromauto-markers,wemeasuretheoverlapbetweenthetwosets.Twologicaltimesareconsideredthesameiftheydifferbynomorethan400,whichis0.02%oftheaveragephaselength.Weusetherecallandprecisiontomeasuretheircloseness.Theyaredeﬁnedbytheformulasbelow.Therecallshowsthepercentageofthemanuallymarkedtimesthataremarkedbyauto-markers.Theprecisionshowsthepercentageoftheautomaticallymarkedtimesthataremarkedmanually.

Recall

|M∩A|

|A|

(2)

Table5:Theeffectofphase-basedarrayregrouping,excludingthecostofrun-timedatareorganizationBenchmarkPhase(speedup)

4.294.27(0.4%)52.8438.52(27.1%)

Table6:TheoverlapwithmanualphasemarkersBenchmarkDetectionPrediction

Prec.Recall11

Applu0.9930.948

0.9620.987

Tomcatv0.9520.571

0.3411

Mesh10.834

0.2710.987

Average0.9640.692

whereMisthesetoftimesfromthemanualmarkers,andAisthe

setoftimesfromauto-markers.

Table6showsacomparisonwithmanuallyinsertedmarkersfordetectionandpredictionruns.Thecolumnsforeachrungivetherecallandtheprecision.Therecallisover95%inallcasesex-ceptforMolDyninthedetectionrun.Theaveragerecallincreasesfrom96%inthedetectionrunto99%inthepredictionrunbe-causethephaseswithabetterrecalloccurmoreofteninlongerruns.Hence,theauto-markerscapturetheprogrammer’sunder-standingoftheprogrambecausetheycatchnearlyallmanuallymarkedphasechangingpoints.

Theprecisionisover95%forAppluandCompress,showingthatautomaticmarkersareeffectivelythesameasthemanualmarkers.MolDynhasthelowestrecallof27%.Wecheckedthecodeandfoundthedifference.Whentheprogramisconstructingtheneigh-borlist,theanalysismarkstheneighborsearchforeachparticleasaphasewhiletheprogrammermarksthesearchesforallparti-clesasaphase.Inthiscase,theanalysisiscorrect.Theneighborsearchrepeatsforeachparticle.ThisalsoexplainswhyMoldyncannotbepredictedwithbothhighaccuracyandhighcoverage—theneighborsearchhasvaryingbehaviorsinceaparticlemayhaveadifferentnumberofneighbors.Thelowrecallinotherprogramshasthesamereason:theautomaticanalysisismorethoroughthanthemanualanalysis.

Fourofthetestprogramsarethesimulationofgrid,meshandN-bodysystemsintimesteps.DingandKennedyshowedthattheybeneﬁtedfromdynamicdatapacking,whichmonitoredtherun-timeaccesspatternandreorganizedthedatalayoutmultipletimesduringanexecution[11].Theirtechniquewasautomaticexceptforaprogrammer-inserteddirective,whichmustbeexecutedonceineachtimestep.Thisworkwasstartedinparttoautomaticallyinsertthedirective.Ithasachievedthisgoal:thelargestcompositephaseinthesefourprogramsisthetimesteploop.Therefore,thephasepredictionshouldhelptofullyautomatedynamicdatapack-ing,whichisshownbyseveralrecentstudiestoimproveperfor-mancebyintegerfactorsforphysical,engineering,andbiologicalsimulationandsparsematrixsolvers[11,15,25,32].

SummaryForprogramswithconsistentphasebehavior,thenewmethodgivesaccuratelocalitypredictionandconsequentlyyieldssigniﬁcantbeneﬁtsforcacheresizingandmemoryremapping.Itismoreeffectiveatﬁndinglong,recurringphasesthanpreviousmeth-odsbasedonprogramcode,executionintervals,theircombination,andevenmanualanalysis.Forprogramswithvaryingphasebehav-ior,theproﬁlingstepcanoftenrevealtheinconsistency.Thenthemethodavoidsbehaviorpredictionofinconsistentphasesthroughaﬂag(asshownbytheexperimentsreportedinTable2).Usingasmallinputinaproﬁlingrunisenoughforlocalityphasepredic-tion.Therefore,thetechniquecanhandlelargeprogramsandlongexecutions.ForprogramssuchasGCCandVortex,wherelittlecon-

sistencyexistsduringthesameexecution,thelocalityanalysiscanstillrecognizephaseboundariesbutcannotyetmakepredictions.Predictionsbasedonstatisticsmaybehelpfulfortheseprograms,whichremainstobeourfuturework.Inaddition,thecurrentanaly-sisconsidersonlytemporallocality.Thefutureworkwillconsiderspatiallocalityinconjunctionwithtemporallocality.

4.RELATEDWORK

Thisworkisauniquecombinationofprogramcodeanddataanalysis.Itbuildsonpastworkinthesetwoareasandcomplementsinterval-basedmethods.

LocalityphasesEarlyphaseanalysis,owingtoitsrootinvirtual-memorymanagement,wasintertwinedwithlocalityanalysis.In1976,BatsonandMadisondeﬁnedaphaseasaperiodofexecu-tionaccessingasubsetofprogramdata[4].Theyshowedexperi-mentallythatasetofAlgol-60programsspent90%timeinmajorphases.However,theydidnotpredictlocalityphases.LaterstudiesusedtimeorreusedistanceaswellaspredictorssuchasMarkovmodelstoimprovevirtualmemorymanagement.Recently,DingandZhongfoundpredictablepatternsintheoveralllocalitybutdidnotconsiderthephasebehavior[12].Wearenotawareofanytrace-basedtechniquethatidentiﬁesstaticphasesusinglocalityanalysis.ProgramphasesAllenandCockepioneeredintervalanalysistoconvertprogramcontrolﬂowintoahierarchyofregions[1].Forscientiﬁcprograms,mostcomputationanddataaccessareinloopnests.Anumberofstudiesshowedthattheinter-proceduralarray-sectionanalysisaccuratelysummarizestheprogramdatabehavior.RecentworkbyHsuandKremerusedprogramregionstocontrolprocessorvoltagestosaveenergy.Theirregionmayspanloopsandfunctionsandisguaranteedtobeanatomicunitofexecutionunderallprograminputs[17].Forgeneralpurposeprograms,Balasub-ramonianetal.[2],Huangetal.[21],andMagklisetal.[22]se-lectedasphasesprocedureswhosenumberofinstructionsexceedsathresholdinaproﬁlingrun.Thethreestudiesfoundthebestvoltageforprogramregionsonatraininginputandthentestedtheprogramonanotherinput.Theyobservedthatdifferentinputsdidnotaffectthevoltagesetting.Theﬁrsttwostudiesalsomeasuredtheenergysavingofphase-basedcacheresizing[2,21].Incompar-ison,thenewtechniquedoesnotrelyonstaticprogramstructure.Itusestrace-basedlocalityanalysistoﬁndthephaseboundaries,whichmayoccuranywhereandnotjustatregion,looporproce-dureboundaries.

IntervalphasesIntervalmethodsdivideanexecutionintoﬁxed-sizewindows,classifypastintervalsusingmachineorcode-basedmetrics,andpredictfutureintervalsusinglastvalue,Markov,ortable-drivenpredictors[9,10,13,30].Thepastworkusedintervalsoflengthfrom100thousand[2]to10millioninstructions[30]andexecutionsfrom10millisecondsto10seconds[13].Intervalpre-dictionworkswelliftheintervallengthdoesnotmatter,forexam-ple,whenanexecutionconsistsoflongsteadyphases.Otherwiseitisdifﬁculttoﬁndthebestintervallengthforagivenprogramonagiveninput.Theexperimentaldatainthispapershowtheinher-entlimitationofintervalsforprogramswithconstantlychangingdatabehavior.Balasubramonianetal.searchesforthebestintervalsizeatruntime[3].Theirmethoddoublestheintervallengthun-tilthebehaviorisstable.LetNbetheexecutionlength,thisnewschemesearchesO(logN)choicesinthespaceofNcandidates.Inthiswork,welocatephasesanddeterminetheirexactlengthsthroughoff-linelocalityanalysis.Weshowthatimportantclassesofprogramshaveconsistentphasebehaviorandthehighaccuracyandlargegranularityofphasepredictionallowadaptationwithatightworst-performanceguarantee.However,notallprogramsareamenabletotheoff-lineanalysis.Interval-basedmethodsdonot

havethislimitationandcanexploitthegeneralclassofrun-timepatterns.

5.CONCLUSIONS

Thepaperpresentsageneralmethodforpredictinghierarchicalmemoryphasesinprogramswithinput-dependentbutconsistentphasebehavior.Basedonproﬁlingruns,itpredictsprogramexe-cutionshundredsoftimeslargerandpredictsthelengthandlocal-itywithnearperfectaccuracy.Whenusedforcacheadaptation,itreducesthecachesizeby40%withoutincreasingthenumberofcachemisses.Whenusedformemoryremapping,itimprovesprogramperformancebyupto35%.Itismoreeffectiveatidenti-fyinglong,recurringphasesthanpreviousmethodsbasedonpro-gramcode,executionintervals,andmanualanalysis.Itrecognizesprogramswithinconsistentphasebehaviorandavoidsfalsepredic-tions.Theseresultssuggestthatlocalityphasepredictionshouldbeneﬁtmodernadaptationtechniquesforincreasingperformance,reducingenergy,andotherimprovementstothecomputersystemdesign.

Scientiﬁcallyspeaking,thisworkisanotherattempttounder-standthedichotomybetweenprogramcodeanddataaccessandtobridgethedivisionbetweenoff-lineanalysisandon-lineprediction.Theresultembodiesandextendsthedecades-oldideathatlocalitycouldbepartofthemissinglink.

Acknowledgement

Thisworkwasmotivatedinpartbytheearlierworkonphaseanaly-sisbyourcolleaguesatRochester.ThepresentationwasimprovedbydiscussionswithSandhyaDwarkadas,MichaelHuang,MichaelScott,DaveAlbonesi,andKaiShenatRochester,JohnCarteratUtah,UliKremeratRutgers,BradCalderatUCSD,andtheanony-mousreviewersfromthisandanotherconferencecommittee.LixinZhangatIBMandTimSherwoodatUCSBansweredourquestionsaboutImpulseandBBV.TheﬁnancialsupportcomesinpartfromtheDepartmentofEnergy(ContractNo.DE-FG02-02ER25525)andtheNationalScienceFoundation(ContractNo.CCR-0238176,CCR-0219848,andEIA-0080124).

6.[1]F.REFERENCES

AllenandJ.Cocke.Aproramdataﬂowanalysis

procedure.CommunicationsoftheACM,19:137–147,1976.[2]R.Balasubramonian,D.Albonesi,A.Buyuktosunoglu,and

S.Dwarkadas.Memoryhierarchyreconﬁgurationforenergyandperformanceingeneral-purposeprocessorarchitectures.InProceedingsofthe33rdInternationalSymposiumonMicroarchitecture,Monterey,California,December2000.[3]R.Balasubramonian,S.Dwarkadas,andD.H.Albonesi.

Dynamicallymanagingthecommunication-parallelismtrade-offinfutureclusteredprocessors.InProceedingsofInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.

[4]A.P.BatsonandA.W.Madison.Measurementsofmajor

localityphasesinsymbolicreferencestrings.InProceedingsoftheACMSIGMETRICSConferenceonMeasurement&ModelingComputerSystems,Cambridge,MA,March1976.[5]T.M.Chilimbi.Efﬁcientrepresentationsandabstractionsfor

quantifyingandexploitingdatareferencelocality.In

ProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,Snowbird,Utah,June2001.

[6]R.Das,M.Uysal,J.Saltz,andY.-S.Hwang.Communication

optimizationsforirregularscientiﬁccomputationson

distributedmemoryarchitectures.JournalofParallelandDistributedComputing,22(3):462–479,September1994.[7]I.Daubechies.TenLecturesonWavelets.CapitalCityPress,Montpelier,Vermont,1992.

[8]

P.J.Denning.Workingsetspastandpresent.IEEE

TransactionsonSoftwareEngineering,SE-6(1),January1980.

[9]

A.S.DhodapkarandJ.E.Smith.Managing

multi-conﬁgurationhardwareviadynamicworking-setanalysis.InProceedingsofInternationalSymposiumonComputerArchitecture,Anchorage,Alaska,June2002.

[10]

A.S.DhodapkarandJ.E.Smith.Comparingprogramphasedetectiontechniques.InProceedingsofInternationalSymposiumonMicroarchitecture,December2003.

[11]

C.DingandK.Kennedy.Improvingcacheperformanceindynamicapplicationsthroughdataandcomputation

reorganizationatruntime.InProceedingsoftheSIGPLAN’99ConferenceonProgrammingLanguageDesignandImplementation,Atlanta,GA,May1999.

[12]

C.DingandY.Zhong.Predictingwhole-programlocalitywithreusedistanceanalysis.InProceedingsofACM

SIGPLANConferenceonProgrammingLanguageDesignandImplementation,SanDiego,CA,June2003.[13]

E.Duesterwald,C.Cascaval,andS.Dwarkadas.

Characterizingandpredictingprogrambehavioranditsvariability.InProceedingsofInternationalConferenceonParallelArchitecturesandCompilationTechniques,NewOrleans,Louisiana,September2003.[14]

C.Fang,S.Carr,S.Onder,andZ.Wang.

Reuse-distance-basedmiss-ratepredictiononaper

instructionbasis.InProceedingsoftheﬁrstACMSIGPLANWorkshoponMemorySystemPerformance,WashingtonDC,June2004.

[15]

H.HanandC.W.Tseng.Improvinglocalityforadaptiveirregularscientiﬁccodes.InProceedingsofWorkshoponLanguagesandCompilersforHigh-PerformanceComputing(LCPC’00),WhitePlains,NY,August2000.

[16]J.E.HopcroftandJ.D.Ullman.Introductiontoautomatatheory,languages,andcomputation.Addison-Wesley,1979.[17]

C.-H.HsuandU.Kermer.Thedesign,implementationandevaluationofacompileralgorithmforCPUenergy

reduction.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,SanDiego,CA,June2003.

[18]

R.Joseph,Z.Hu,andM.Martonosi.Waveletanalysisformicroprocessordesign:Experienceswithwavelet-baseddi/dtcharacterization.InProceedingsofInternationalSymposiumonHighPerformanceComputerArchitecture,February2004.

[19]

J.R.Larus.Wholeprogrampaths.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,Atlanta,Georgia,May1999.

[20]

C.LukandT.C.Mowry.Memoryforwarding:enabling

aggressivelayoutoptimizationsbyguaranteeingthesafetyofdatarelocation.InProceedingsofInternationalSymposiumonComputerArchitecture,Atlanta,GA,May1999.[21]

M.HuangandJ.RenauandJ.Torrellas.Positional

adaptationofprocessors:applicationtoenergyreduction.InProceedingsoftheInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.

[22]

G.Magklis,M.L.Scott,G.Semeraro,D.H.Albonesi,,and

S.Dropsho.Proﬁle-baseddynamicvoltageandfrequencyscalingforamultipleclockdomainmicroprocessor.InProceedingsoftheInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.

[23]

G.MarinandJ.Mellor-Crummey.Crossarchitectureperformancepredictionsforscientiﬁcapplicationsusingparameterizedmodels.InProceedingsofJointInternationalConferenceonMeasurementandModelingofComputerSystems,NewYorkCity,NY,June2004.

[24]

R.L.Mattson,J.Gecsei,D.Slutz,andI.L.Traiger.

Evaluationtechniquesforstoragehierarchies.IBMSystemJournal,9(2):78–117,1970.

[25]

J.Mellor-Crummey,D.Whalley,andK.Kennedy.Improvingmemoryhierarchyperformanceforirregularapplications.InternationalJournalofParallelProgramming,29(3),June2001.

[26]

C.G.Nevill-ManningandI.H.Witten.Identifying

hierarchicalstructureinsequences:alinear-timealgorithm.JournalofArtiﬁcialIntelligenceResearch,7:67–82,1997.[27]

V.S.Pingali,S.A.McKee,W.C.Hsieh,andJ.B.Carter.Restructuringcomputationsfortemporaldatacachelocality.InternationalJournalofParallelProgramming,31(4),August2003.

[28]

X.Shen,Y.Zhong,andC.Ding.Regression-basedmulti-modelpredictionofdatareusesignature.In

Proceedingsofthe4thAnnualSymposiumoftheLasAlamosComputerScienceInstitute,SanteFe,NewMexico,November2003.

[29]

T.Sherwood,E.Perelman,andB.Calder.Basicblock

distributionanalysistoﬁndperiodicbehaviorandsimulationpointsinapplications.InProceedingsofInternationalConferenceonParallelArchitecturesandCompilationTechniques,Barcelona,Spain,September2001.

[30]

T.Sherwood,S.Sair,andB.Calder.Phasetrackingandprediction.InProceedingsofInternationalSymposiumonComputerArchitecture,SanDiego,CA,June2003.

[31]

A.SrivastavaandA.Eustace.ATOM:Asystemforbuildingcustomizedprogramanalysistools.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,Orlando,Florida,June1994.[32]

M.M.Strout,L.Carter,andJ.Ferrante.Compile-timecompositionofrun-timedataanditerationreorderings.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,SanDiego,CA,June2003.

[33]

R.A.SugumarandS.G.Abraham.Efﬁcientsimulationofcachesunderoptimalreplacementwithapplicationstomisscharacterization.InProceedingsoftheACMSIGMETRICSConferenceonMeasurement&ModelingComputerSystems,SantaClara,CA,May1993.

[34]

L.Zhang.EfﬁcientRemappingMechanismforanAdaptiveMemorySystem.PhDthesis,DepartmentofComputerScience,UniversityofUtah,August2000.

[35]

L.Zhang,Z.Fang,M.Parker,B.K.Mathew,L.Schaelicke,J.B.Carter,W.C.Hsieh,andS.A.McKee.TheImpulsememorycontroller.IEEETransactionsonComputers,50(11),November2001.

[36]

Y.Zhong,M.Orlovich,X.Shen,andC.Ding.Arrayregroupingandstructuresplittingusingwhole-programreferenceafﬁnity.InProceedingsofACMSIGPLANConferenceonProgrammingLanguageDesignandImplementation,June2004.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文