Yuan Lin1, Hyunseok Lee1, Yoav Harel1, Mark Woh1, Scott Mahlke1, Trevor Mudge1 and Krisztián Flautner2
1 Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI
{linyz, leehzz, yoavh, mwoh, mahlke, tnm}@umich.edu
2 ARM, Ltd., Cambridge, United Kingdom
{krisztian.flautner}@arm.com
Abstract
One central challenge in the realization of Software Defined Radio (SDR) is to provide a programmable solution that meets the challenging high-performance, low-power requirements, while providing an efficient software development interface. In this paper, we present an overview of a fully programmable multi-core SIMD architecture for SDR. Our solution can support 2Mbps W-CDMA at about 270mW, and 24Mbps 802.11a at about 370mW in 90nm technology. This high computational efficiency is achieved by exploiting the vector characteristics of the algorithms, through a unique multi-core architecture that consists of tightly coupled scalar and wide SIMD pipelines. In addition, we present a software design flow that supports efficient DSP programming and implementation through a set of signal processing extensions to C, referred to as SPEX.
1 Introduction
Software Defined Radio (SDR) promises to revolutionize the communication industry by delivering low-cost, flexible software solutions for wireless mobile communication protocols. Wireless protocols are systems consisting of a collection of distinct DSP algorithms. The difficulties of implementing a complete system in software include challenges for both DSP hardware and software designers. In this paper, we present a system solution for SDR that includes a novel DSP processor architecture that is designed specifically for SDR, and a programming model that allows efficient DSP software development. We have developed the complete W-CDMA and 802.11a protocols’ physical layers, programmed them onto our system, and shown that they achieve the required bandwidth and the power efficiency for mobile terminals.
The two major challenges in SDR are the design of efficient hardware systems and software development environments.
• Hardware requirements for current and next generation wireless protocols are extremely high. SDR processors must achieve supercomputer-like computational throughput, maintain ASIC-like power consumption, meet the protocols’ latency requirements, and support real-time systems with dynamically changing control states [4]. Existing and next generation DSP processors, such as the TI TMS320C64x [1] and the Freescale StarCore [2], are not designed specifically for SDR. They either consume too much power or do not meet the performance requirements. In addition, because wireless protocols are complex systems of many DSP kernels, it is also desirable for the hardware designers to provide an easy design interface for DSP software programmers. Some of the emerging SDR processor solutions meet the performance requirements, but are very difficult to program or debug. Morpho Technology’s RC Array [3] consists of a 2D array of processing elements, which cannot be expressed efficiently in traditional C-like programming languages. Similarly, the PicoArray [5] also consists of an array of processing elements on which it is difficult to map applications.
• DSP programming support needs to provide an easy system development flow for the software developers. At the system-level, the software needs to provide a development flow similar to existing SoC development flows, where the inter-algorithm communication and protocol state control are developed and debugged. At the algorithm-level, programming support must generate efficient machine code for individual DSP kernels written in a high-level programming language. It also needs to be flexible enough to allow DSP programmers to develop hand-written assembly code for extra optimizations. C is perhaps the most popular programming language in the DSP community even though it lacks some necessary features for properly describing the application domain. For example, it does not support first-class SIMD or concurrent function objects. Matlab is another language which is very popular among DSP programmers. It provides SIMD-centric first class data structures and pipeline-level concurrency that can be expressed using Simulink. However, Matlab does not support explicit object definitions, including SIMD variable types, concurrent threads, and communication channels. The lack of this information makes it very hard for compilers to produce efficient assembly code.
In order to verify the efficiency of our processor architecture, we first implemented the complete physical layer of both a transmitter and a receiver for W-CDMA and 802.11a in C. We then compiled both protocols onto our processor architecture, and showed that our architecture was able to meet the performance and power requirements: 2Mbps W-CDMA at 270mW, 24Mbps 802.11a at 370mW.
Our DSP system is a modular, multi-core DSP architecture where each DSP algorithm can be designed and verified individually and separately from the system-level development. Unlike other proposed multi-core SDR architectures (e.g. [6] and [7]), each hardware component in our system has a standardized interface. DSP algorithms are mapped onto individual processors, not across multiple processors. Thus, the DSP kernel implementations can be developed and verified individually. System-level development can view these kernel codes as software ASICs, and control different kernels through a predefined standardized SoC interface.
We present our software development flow, which includes both the system-level and algorithm-level development flows. The central element of our software design is SPEX (Signal Processing EXtension), a set of language extensions for C, which narrow the semantic gap between DSP algorithm descriptions and their implementations. Like Matlab, SPEX provides built-in support for vectors (through the use of SIMD data variables), as well as SIMD data operations, such as vector permutation and vector predication. However, SIMD data variables carry additional attributes, such as data bit width for efficient implementation. SPEX also provides thread and communication objects, called kernels and channels. Kernel objects support dynamic thread spawning and deletion to account for dynamically changing workloads. SPEX channel objects are generalized FIFO (first-in, first-out) structures that support random read access and SIMD objects as queue entries. In addition, global variables are disallowed, as all communication must be performed through channels. We believe these extensions provide an intuitive programming model for expressing high-throughput DSP applications, as well as an efficient interface for compiling to multi-core DSP processors.
Figure 1: System Overview. (a) Multi-core system architecture for SDR. (b) Effect of SIMD width on computational efficiency for W-CDMA 2Mbps (x-axis: number of processing elements x SIMD width; y-axis: MOPS/mW; 51.2 GOP/sec, 90nm, 1V @ 400MHz).
2 Architecture Overview
2.1 Multi-core System
Our system is a heterogeneous multiprocessor architecture, shown in Figure 1(a). The system consists of multiple high throughput SIMD-based processing elements (PEs), a low throughput scalar controller, and global scratchpad memories (MEM). These components are all connected through a shared bus. PEs consist of tightly-coupled scalar and SIMD pipelines. The SIMD pipelines are generally used for computationally heavy DSP algorithms, such as filter, FFT, and channel decoders. The scalar pipelines are used for the sequential portions of algorithms and address generation for the SIMD pipelines. The controller is used for overall system management, such as power control. MEM is mainly used to buffer intermediate data transfers between DSP algorithms.
PEs are the main computation units in this system. They take the most area and consume the most power. The number of PEs and the architectural organization of these PEs are among the main design considerations. Figure 1(b) shows an approximate efficiency trade-off for running W-CDMA protocols with multiple PE configurations, from left to right with increasingly bigger, but fewer processing elements. All of the configurations have constant computation throughput and meet the real-time W-CDMA requirements. As shown in the graph, configurations with a small number of wide SIMD units (4x32 to 1x128) appear to be the most efficient. However, a wider SIMD architecture poses greater programming challenges. In most programs, it is very hard to find 128 independent data elements to compute in parallel. Signal processing algorithms have much inherent parallelism; for example, the taps of a filter can be calculated in parallel. But there are also many signal processing algorithms that do not have wide parallelism. In our case study with W-CDMA and 802.11a, we choose a design point near the inflection point of the graph: 4 PEs, each with 32-wide SIMD units.
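The constant-throughput property of the configurations in Figure 1(b) can be checked with simple arithmetic. The sketch below is illustrative only: it assumes one operation per SIMD lane per cycle, takes the 400MHz clock and the 51.2 GOP/sec figure from the labels of Figure 1(b), and uses a hypothetical function name.

```python
CLOCK_HZ = 400_000_000        # clock frequency quoted for the 90nm design point
TARGET_OPS = 51_200_000_000   # 51.2 GOP/sec label for W-CDMA 2Mbps in Figure 1(b)

def peak_ops_per_sec(num_pes, simd_width, utilization=1.0, clock_hz=CLOCK_HZ):
    """Peak throughput, assuming one operation per SIMD lane per cycle."""
    return num_pes * simd_width * clock_hz * utilization

# (number of PEs, SIMD width) pairs that appear on the x-axis of Figure 1(b)
CONFIGS = [(128, 1), (64, 2), (32, 4), (16, 8), (8, 16), (4, 32), (2, 64), (1, 128)]

for num_pes, width in CONFIGS:
    gops = peak_ops_per_sec(num_pes, width) / 1e9
    print(f"{num_pes:3d} PEs x {width:3d} lanes: {gops:.1f} GOP/sec")
```

Under these assumptions every configuration supplies 128 lanes in total, which is why the comparison in the figure isolates efficiency rather than raw throughput. The 1x256 point labelled "50% Utilization" matches the same target when `utilization=0.5` is passed.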
In typical commercial wireless system solutions, low computation algorithms are handled by DSPs, high computation algorithms are designed with ASICs, and the whole system is an integrated SoC with a simple controller, such as an ARM processor. Given the complexity of these real-time systems, we want to separate the design of individual DSP algorithms from the design of the protocol system. In our SDR solution, each DSP algorithm is designed independently as a “software ASIC”, with internal states and variables, and a communication interface to the outside world. System-level development consists of linking these DSP algorithms together, mapping algorithms onto PEs, and defining real-time deadline requirements. Low computation algorithms, like filters and FFTs, may be combined together onto one PE. High computation algorithms, such as searchers, Viterbi decoders and Turbo decoders, generally require their own PE.
In order to support such a design methodology efficiently, each hardware component is designed with a standardized system interface. This interface includes both hardware requirements and software programming specifications. Any hardware unit that is connected to the system has to support this interface. This is shown in Figure 1(a) as the “SoC Interface”. The software specification is defined as a set of assembly instructions, including communication, synchronization, and memory access instructions. All processing elements must support these instructions with pre-defined timing requirements. The hardware implementation of the interface consists of a DMA (Direct Memory Access) unit, a real-time clock, and hardware synchronization registers.
Figure 2: PE Architectural Diagram
2.2 Processing Element
Figure 2 shows the architectural detail of a PE. The PE consists of two coupled parts: a scalar pipeline and a SIMD pipeline. The scalar pipeline contains the address generation unit (AGU) and is a single issue, in-order, 16-bit RISC architecture. Its main purpose is to generate memory addresses for the SIMD pipeline, handle the kernel’s control flow, and process scalar DSP algorithms (such as the interleaver). In most DSP algorithms, the core kernels are made up of shallow nested loops (one or two levels). Because of this, we choose not to implement a branch predictor, but add loop counter-based branch instructions instead. In addition, DSP kernels process data in stream buffers, thus most memory accesses are from data queues, which are directly supported by the AGU.
The SIMD pipeline consists of 32 8-bit clusters. Through the implementation of W-CDMA and 802.11a protocols, we found that most DSP algorithms have a high degree of SIMD parallelism. The core operations of filter, FFT, Viterbi/Turbo decoder, and rake receiver are all based on wide vector variables of narrow data-width. Therefore, with the support of conditional operations on the clusters, we can efficiently utilize 32 clusters of 8-bit ALU computations. The PE’s local scratchpad memory is divided into two clusters: one for the SIMD unit and the other for the scalar unit. Both memories have two read/write ports and there is a DMA engine that serves both memories.
Many DSPs have support for 8- and 16-bit operations. However, their clock cycle time is optimized for 32-bit arithmetic operations. This leads to lower power efficiency for 8- and 16-bit operations. In wireless protocols, the majority of the algorithms operate on 1- to 8-bit data, some algorithms operate on 16-bit data, and few operate on 32-bit data. Therefore, our system is optimized for 8-bit operations in the SIMD unit and 16-bit operations in the scalar unit. 16-bit support is provided in the SIMD unit by treating two register entries as one and using two cycles for 16-bit ALU operations (along with special hardware support for the carry in/out bits). The AGU registers are 12-bit, but only support 8-bit addition and subtraction. This is because the AGU is mainly used for software management of data buffers, for which 8 bits are sufficient. The higher 4 bits are used to address different PEs, as well as different memory buffers within PEs.
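The two-cycle 16-bit operation described above can be modelled in software. The following is a minimal sketch, assuming each 16-bit operand is held as a (low byte, high byte) pair of 8-bit register entries and that the carry out of the first cycle feeds the second. The function names are hypothetical and the real register pairing and carry hardware are not specified in this paper.

```python
MASK8 = 0xFF

def split16(value):
    """Split a 16-bit value into (low byte, high byte) register entries."""
    return value & MASK8, (value >> 8) & MASK8

def join16(lo, hi):
    """Recombine two 8-bit register entries into one 16-bit value."""
    return (hi << 8) | lo

def add16_on_8bit_alu(a, b):
    """Add two 16-bit values using two passes through an 8-bit adder."""
    a_lo, a_hi = split16(a)
    b_lo, b_hi = split16(b)
    # Cycle 1: add the low bytes and capture the carry-out bit.
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 8
    # Cycle 2: add the high bytes together with the carry-in bit.
    hi_sum = a_hi + b_hi + carry
    # The result wraps modulo 2**16, as a plain 16-bit adder would.
    return join16(lo_sum & MASK8, hi_sum & MASK8)

def simd_add16(vec_a, vec_b):
    """Apply the two-cycle add lane by lane across two vectors."""
    return [add16_on_8bit_alu(a, b) for a, b in zip(vec_a, vec_b)]
```

For example, `simd_add16([0x00FF], [0x0001])` exercises the carry path: the low bytes overflow in the first cycle and the carry is absorbed by the high bytes in the second.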
3 System Evaluation
3.1 Wireless Protocol Mapping
Figure 3 shows the mapping of W-CDMA and 802.11a onto our 4 PE system. As W-CDMA is a full duplex protocol, the receiver and transmitter are running at the same time. Because of this, the transmitter and receiver are mapped onto their own PEs for W-CDMA. This contrasts with 802.11, where the transmission and reception phases are disjoint in time and thus the kernels for these modes can share PEs. This provides for a more balanced task allocation.
W-CDMA Mapping. In W-CDMA, the receiver requires much more computation than the transmitter. As shown in Figure 3(a), the receiver is assigned to 3 PEs, and the transmitter is assigned to 1 PE. Global memory contains three buffers. The FIFO buffer is used to buffer results between the receiver FIR filter and the searcher. The other 2 buffers are used to store intermediate results between the Rake receiver and the interleaver, and the interleaver and the Turbo decoder.
802.11a Mapping. In 802.11a, both receivers and transmitters are mapped onto the same set of hardware. Similar to the W-CDMA case, global memory is mainly used to buffer the intermediate data traffic of the interleaver. Unlike most other algorithms, the interleaver is a highly sequential algorithm. It requires a whole frame to be buffered before it can output its results.
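The buffering requirement of the interleaver can be illustrated with a generic block interleaver. This is a sketch under stated assumptions, not the interleaver defined by either protocol: it assumes a simple rows x cols matrix that is written row by row and read column by column, and the function names are hypothetical.

```python
def block_interleave(frame, rows, cols):
    """Write the frame row by row into a rows x cols matrix, read it column by column."""
    if len(frame) != rows * cols:
        raise ValueError("frame length must equal rows * cols")
    return [frame[r * cols + c] for c in range(cols) for r in range(rows)]

def block_deinterleave(frame, rows, cols):
    """Invert block_interleave by interleaving with the dimensions swapped."""
    return block_interleave(frame, cols, rows)

def last_input_needed(rows, cols, num_outputs):
    """Highest input index that must have arrived to emit the first num_outputs outputs."""
    order = block_interleave(list(range(rows * cols)), rows, cols)
    return max(order[:num_outputs])
```

With `rows=4` and `cols=8`, producing only the first four outputs already needs input index 24 out of 32, so nearly the whole frame must be resident before output can start. This is consistent with the global memory buffers being sized around interleaver frames.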
Figure 3: Mapping of W-CDMA and 802.11a onto the processing elements. (a) W-CDMA Protocol Mapping. (b) 802.11a Protocol Mapping.
3.2 Area and Power Results
Table 1 shows the power consumption and area breakdown for a 2Mbps throughput W-CDMA and a 24Mbps throughput 802.11a. The overall power results were 1,381mW and 1,909mW for W-CDMA and 802.11a, respectively. This assumed 180nm technology at 1.8V and 400MHz. Scaling these results to 90nm technology at 1V and 400MHz results in power values of 268mW and 370mW. Three components consume the majority of the power: 1) the register file, which consumes 34% for W-CDMA and 30% for 802.11a; 2) the local memory, which consumes 16% for W-CDMA and 14% for 802.11a; and 3) the scalar pipeline, which consists of the scalar memory, instruction queue and miscellaneous logic, which consumes 10%, 9%, and 10%, respectively, for both W-CDMA and 802.11a.
Table 1: System area and power summary for W-CDMA and 802.11a (per-component units, area in mm2 and percent of total area, and power in mW and percent of total power for W-CDMA 2Mbps and 802.11a 24Mbps; reported totals include 1381.7mW for W-CDMA and 370.4mW for 802.11a at 90nm, 1V @ 400MHz)
This table also shows the area breakdown. Unlike traditional processor architectures, the biggest component is the arithmetic units, not the memory units. This indicates that this processor architecture has a very high computation efficiency. By having small local memories, we are able to reduce the power consumption as well as decrease the die area. The global memory is shared between processors. It must be large enough to buffer frames during interleaving. Unless a more efficient interleaver algorithm is found, this global memory space is unavoidable.
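The technology scaling reported in this section can be sanity-checked numerically. The paper does not state its scaling model, so the sketch below only derives the scaling factor implied by the reported totals and compares it with the supply-voltage term of the classical dynamic-power relation P ∝ C V² f at an unchanged 400MHz clock. The function names are hypothetical, and the remainder is simply what is left after the voltage term, not a claim about how the figures were obtained.

```python
def implied_scale(power_180nm_mw, power_90nm_mw):
    """Ratio of the scaled 90nm power to the 180nm power."""
    return power_90nm_mw / power_180nm_mw

def voltage_squared_factor(v_old, v_new):
    """Supply-voltage term of dynamic power, which scales with V squared."""
    return (v_new / v_old) ** 2

# (180nm total in mW, 90nm total in mW) as reported in the text
REPORTED = {"W-CDMA": (1381.0, 268.0), "802.11a": (1909.0, 370.0)}

for name, (p180, p90) in REPORTED.items():
    total = implied_scale(p180, p90)
    v2 = voltage_squared_factor(1.8, 1.0)
    print(f"{name}: total factor {total:.3f}, V^2 part {v2:.3f}, remainder {total / v2:.3f}")
```

Both protocols yield nearly the same overall factor, which suggests that a single multiplicative scaling was applied to the 180nm totals.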
4 Software Development Flow
4.1 Overall Design Flow
Figure 4 shows our software design flow. Algorithms are first debugged and verified functionally through either Matlab/Simulink or floating-point C implementations. In a manner similar to traditional SoC design, the development flow then separates into system-level and kernel-level design. They are both implemented in fixed-point format in SPEX (Signal Processing EXtensions for C). SPEX is a Matlab-like programming extension which offers first-class vector and matrix variables and operations. Unlike Matlab, SPEX also allows explicit variable declarations, including variable bit width and saturation mode operations. SPEX is explained in further detail in the next section. From this point on, the compiler can automatically generate assembly code, but programmers can also choose to hand code the DSP kernel assembly files for further optimizations.
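To illustrate why bit width and saturation mode matter for fixed-point kernels, the sketch below contrasts saturating and wrap-around addition on signed values of a declared width. It is not SPEX syntax, which is not shown in this paper; the function names and the default 8-bit width are assumptions chosen to match the 8-bit SIMD lanes described earlier.

```python
def saturate(value, bits):
    """Clamp value to the range of a signed two's-complement integer of the given width."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def sat_add(a, b, bits=8):
    """Saturating add: results beyond the range stick at the nearest limit."""
    return saturate(a + b, bits)

def wrap_add(a, b, bits=8):
    """Wrap-around add: results beyond the range overflow modulo 2**bits."""
    mask = (1 << bits) - 1
    total = (a + b) & mask
    # Reinterpret the masked bits as a signed value.
    return total - (1 << bits) if total >= (1 << (bits - 1)) else total

def sat_add_vector(vec_a, vec_b, bits=8):
    """Element-wise saturating add over two equal-length vectors."""
    return [sat_add(a, b, bits) for a, b in zip(vec_a, vec_b)]
```

Adding 100 and 100 in 8 bits gives 127 with saturation but -56 with wrap-around. A sign flip of that kind is far more damaging to a signal than clipping, which is one reason a per-variable saturation attribute is useful to declare at the source level.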
Machine code is generated in three steps from SPEX descriptions. First, machine- and timing-independent assembly code is generated. At this level, kernels are not assigned to processors, and system protocol descriptions are not incorporated with kernel assembly code. Second, machine-dependent assembly code is generated, where kernels are mapped onto processors, and system protocol descriptions are translated into real DMA and control instructions. Finally, real machine code is generated by merging the system-level and kernel-level assembly code. Programmers are given the flexibility to develop and debug code during any stage of the compilation. The efficiency of SPEX means our compiler does not need complex code-transformation techniques, making the assembly code easily accessible to DSP developers.
4.2 High-Level Programming Model
We proposed a multi-core, wide SIMD processor architecture. Given the difficulty in programming traditional DSPs, this new processor architecture provides even greater challenges for the programmers and compilers. In this section, we briefly describe our C language extension called SPEX (Signal Processing EXtension), which is aimed at narrowing the semantic gap between the description of high-end signal processing algorithms and their implementation. SPEX contains two major components: SIMD variables and concurrent kernel support. The former is suitable for expressing data parallelism within algorithms by providing SIMD data structures and explicit SIMD operations. The latter is suitable for expressing thread-level parallelism within algorithms through the use of concurrent kernel objects that communicate through channel objects. The intent is to separate coarse grain communication from fine grain communication. Coarse grain communication is best represented using the kernel extensions, and fine grain communication is best represented using the SIMD variable extension. We have not explicitly supported instruction-level parallelism in SPEX because modern parallelizing compilers are good at discovering it automatically.
Variable Extensions. SPEX contains a first class SIMD data type and an attribute mechanism for specifying implementation details. The SIMD extensions consist of two major data structures: one for describing scalars, another for describing SIMD data. The SIMD data structure is constructed internally as an array of scalar variables, which supports both vector and matrix objects. Both data structures can be further elaborated through attributes, which allow the programmer to specify implementation details at the point of variable declaration. These variable attributes are then treated internally as compiler directives, which are interpreted by the compiler based on the specifics of the target DSP architecture.
Kernel Extensions. SPEX kernel extensions consist of two types of data objects: kernel objects and channel objects. Conceptually, kernel objects represent functions that can be executed concurrently, and channel objects are communication interfaces between kernels. With traditional C, DSP programmers have to manually implement these communication memory structures. This results in difficult-to-read code for humans and compilers alike. In SPEX, kernels are written independently and communicate through virtual channel objects. This separation removes memory management tasks from the programmers.
Figure 4: Software Development Flow
5 Conclusion
In this paper, we have presented a hardware and software solution for SDR. The hardware system is composed of a set of dual-issue asymmetric processing elements that each contain a scalar and wide SIMD pipeline. A 4 processor version of this system is shown to meet the performance requirements of W-CDMA and 802.11a physical layer processing, and have the power characteristics needed for mobile terminals. To support software development on this system, we provide a modular programming environment that includes separate system and kernel level specification and debugging. Programming is carried out using SPEX, a set of extensions to C for specifying vector/matrix objects and operators, along with virtualized inter-kernel communication. Our future work includes the implementation of a larger variety of protocols and a deeper exploration of efficiency trade-offs of programmable signal processing architectures.
6 Acknowledgement
Yuan Lin is supported by a Motorola University Partnership in Research Grant. This research is also supported by ARM Ltd., the National Science Foundation grant CCR-0325898, and equipment donated by Intel Corporation.
References
[1] DSP developers’ village, Texas Instruments, http://dspvillage.ti.com.
[2] Freescale StarCore, http://www.freescale.com.
[3] Morpho Technologies: http://www.morphotech.com/.
[4] T. Austin, D. Blaauw, S. Mahlke, T. Mudge, C. Chakrabati, and W. Wolf. Mobile supercomputers. Communications of the ACM, May 2004.
[5] R. Baines and D. Pulley. The picoArray and reconfigurable baseband processing for wireless basestations. In Software Defined Radio, 2004.
[6] J. Glossner, E. Hokenek, and M. Moudgill. The Sandbridge Sandblaster Communications Processor. 2004.
[7] B. Mohebbi and F. Kurdahi. Reconfigurable parallel DSP - rDSP. In Software Defined Radio, 2004.