
A System Solution for High-Performance, Low Power SDR


Yuan Lin¹, Hyunseok Lee¹, Yoav Harel¹, Mark Woh¹, Scott Mahlke¹, Trevor Mudge¹ and Krisztián Flautner²

¹ Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI
{linyz, leehzz, yoavh, mwoh, mahlke, tnm}@umich.edu

² ARM, Ltd., Cambridge, United Kingdom
{krisztian.flautner}@arm.com

Abstract

One central challenge in the realization of Software Defined Radio (SDR) is to provide a programmable solution that meets the challenging high-performance, low-power requirements, while providing an efficient software development interface. In this paper, we present an overview of a fully programmable multi-core SIMD architecture for SDR. Our solution can support 2 Mbps W-CDMA at about 270 mW, and 24 Mbps 802.11a at about 370 mW in 90nm technology. This high computational efficiency is achieved by exploiting the vector characteristics of the algorithms, through a unique multi-core architecture that consists of tightly coupled scalar and wide SIMD pipelines. In addition, we present a software design flow that supports efficient DSP programming and implementation through a set of signal processing extensions to C, referred to as SPEX.

1 Introduction

Software Defined Radio (SDR) promises to revolutionize the communication industry by delivering low-cost, flexible software solutions for wireless mobile communication protocols. Wireless protocols are systems consisting of a collection of distinct DSP algorithms. The difficulties of implementing a complete system in software include challenges for both DSP hardware and software designers. In this paper, we present a system solution for SDR that includes a novel DSP processor architecture that is designed specifically for SDR, and a programming model that allows efficient DSP software development. We have developed the complete W-CDMA and 802.11a protocols' physical layers, programmed them onto our system, and shown that they achieve the required bandwidth and the power efficiency for mobile terminals.

The two major challenges in SDR are the design of efficient hardware systems and software development environments.

• Hardware requirements for current and next generation wireless protocols are extremely high. SDR processors must achieve supercomputer-like computational throughput, maintain ASIC-like power consumption, meet the protocols' latency requirements, and support real-time systems with dynamically changing control states [4]. Existing and next generation DSP processors, such as the TI TMS320C64x [1] and the Freescale StarCore [2], are not designed specifically for SDR. They either consume too much power or do not meet the performance requirements. In addition, because wireless protocols are complex systems of many DSP kernels, it is also desirable for the hardware designers to provide an easy design interface for DSP software programmers. Some of the emerging SDR processor solutions meet the performance requirements, but are very difficult to program or debug. Morpho Technology's RC Array [3] consists of a 2D array of processing elements, which cannot be expressed efficiently in traditional C-like programming languages. Similarly, the PicoArray [5] also consists of an array of processing elements onto which it is difficult to map applications.

• DSP programming support needs to provide an easy system development flow for the software developers. At the system level, the software needs to provide a development flow similar to existing SoC development flows, where the inter-algorithm communication and protocol state control are developed and debugged. At the algorithm level, programming support must generate efficient machine code for individual DSP kernels written in a high-level programming language. It also needs to be flexible enough to allow DSP programmers to develop hand-written assembly code for extra optimizations. C is perhaps the most popular programming language in the DSP community even though it lacks some necessary features for properly describing the application domain. For example, it does not support first-class SIMD or concurrent function objects. Matlab is another language which is very popular among DSP programmers. It provides SIMD-centric first-class data structures and pipeline-level concurrency that can be expressed using Simulink. However, Matlab does not support explicit object definitions, including SIMD variable types, concurrent threads, and communication channels. The lack of this information makes it very hard for compilers to produce efficient assembly code.

In order to verify the efficiency of our processor architecture, we first implemented both the complete physical layer of a transmitter and receiver for W-CDMA and 802.11a in C. We then compiled both of these protocols onto our processor architecture, and showed that our architecture was able to meet the performance and power requirements: 2 Mbps W-CDMA at 270 mW, 24 Mbps 802.11a at 370 mW.

Our DSP system is a modular, multi-core DSP architecture where each DSP algorithm can be designed and verified individually and separately from the system-level development. Unlike other proposed multi-core SDR architectures (e.g. [6] and [7]), each hardware component in our system has a standardized interface. DSP algorithms are mapped onto individual processors, not across multiple processors. Thus, the DSP kernel implementations can be developed and verified individually. System-level development can view these kernel codes as software ASICs, and control different kernels through a predefined standardized SoC interface.

We present our software development flow, which includes both the system-level and algorithm-level development flows. The central element of our software design is SPEX (Signal Processing EXtension), a set of language extensions for C, which narrow the semantic gap between DSP algorithm descriptions and their implementations. Like Matlab, SPEX provides built-in support for vectors (through the use of SIMD data variables), as well as SIMD data operations, such as vector permutation and vector predication. However, SIMD data variables carry additional attributes, such as data bit width for efficient implementation. SPEX also provides thread and communication objects, called kernels and channels. Kernel objects support dynamic thread spawning and deletion to account for dynamically changing workloads. SPEX channel objects are generalized FIFO (first-in, first-out) structures that support random read access and SIMD objects as queue entries. In addition, global variables are disallowed, as all communication must be performed through channels. We believe these extensions provide an intuitive programming model for expressing high-throughput DSP applications, as well as an efficient interface for compiling to multi-core DSP processors.

2 Architecture Overview

2.1 Multi-core System

Our system is a heterogeneous multiprocessor architecture, shown in Figure 1(a). The system consists of multiple high throughput SIMD-based processing elements (PEs), a low throughput scalar controller, and global scratchpad memories (MEM). These components are all connected through a shared bus. PEs consist of tightly-coupled scalar and SIMD pipelines. The SIMD pipelines are generally used for computationally heavy DSP algorithms, such as filter, FFT, and channel decoders. The scalar pipelines are used for the sequential portions of algorithms and address generation for the SIMD pipelines. The controller is used for overall system management, such as power control. MEM is mainly used to buffer intermediate data transfers between DSP algorithms.

Figure 1: System Overview. (a) Multi-core system architecture for SDR. (b) Effect of SIMD width on computational efficiency for W-CDMA 2 Mbps (51.2 GOP/sec, 90nm, 1V @ 400MHz), plotted as MOPS/mW against SIMD width (arithmetic units) for configurations given as number of processing elements x SIMD width.

PEs are the main computation units in this system. They take the most area and consume the most power. The number of PEs and the architectural organization of these PEs are among the main design considerations. Figure 1(b) shows an approximate efficiency trade-off for running W-CDMA protocols with multiple PE configurations, from left to right with increasingly bigger, but fewer, processing elements. All of the configurations have constant computation throughput and meet the real-time W-CDMA requirements. As shown in the graph, configurations with a small number of wide SIMD units (4x32 to 1x128) appear to be the most efficient. However, a wider SIMD architecture poses greater programming challenges. In most programs, it is very hard to find 128 independent data elements to compute in parallel. Signal processing algorithms have much inherent parallelism, e.g., the taps of a filter, that can be calculated in parallel. But there are also many signal processing algorithms that do not have wide parallelism. In our case study with W-CDMA and 802.11a, we choose a design point near the inflection point of the graph: 4 PEs, each with 32-wide SIMD units.
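The constant-throughput constraint behind Figure 1(b) can be checked with a little arithmetic. The sketch below uses the operating point printed on the figure (51.2 GOP/sec at 400 MHz) and assumes that each arithmetic unit completes one operation per cycle, which the paper does not state explicitly. The function names are illustrative and do not come from the paper.

```c
#include <stdint.h>

/* Operating point taken from the Figure 1(b) labels. */
#define THROUGHPUT_MOPS 51200u  /* 51.2 GOP/sec, in millions of ops per second */
#define CLOCK_MHZ       400u    /* 400 MHz */

/* Total arithmetic units needed, ASSUMING each unit retires one
 * operation per cycle (an assumption of this sketch). */
static uint32_t total_lanes(void)
{
    return THROUGHPUT_MOPS / CLOCK_MHZ;
}

/* SIMD width each PE needs when the lanes are split evenly over
 * num_pes PEs. Returns 0 if num_pes is 0 or does not divide the
 * total lane count evenly. */
static uint32_t simd_width_per_pe(uint32_t num_pes)
{
    uint32_t lanes = total_lanes();
    if (num_pes == 0 || lanes % num_pes != 0)
        return 0;
    return lanes / num_pes;
}
```

Under these assumptions the total is 128 lanes, so `simd_width_per_pe(4)` gives 32, which matches the chosen 4x32 design point. The figure labels its 1x256 point as 50% utilization, which is also consistent with 128 useful lanes.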

In typical commercial wireless system solutions, low computation algorithms are handled by DSPs, high computation algorithms are designed with ASICs, and the whole system is an integrated SoC with a simple controller, such as an ARM processor. Given the complexity of these real-time systems, we want to separate the design of individual DSP algorithms from the design of the protocol system. In our SDR solution, each DSP algorithm is designed independently as a "software ASIC", with internal states and variables, and a communication interface to the outside world. System-level development consists of linking these DSP algorithms together, mapping algorithms onto PEs, and defining real-time deadline requirements. Low computation algorithms, like filters and FFTs, may be combined together onto one PE. High computation algorithms, such as searchers, Viterbi decoders and Turbo decoders, generally require their own PE.

In order to support such a design methodology efficiently, each hardware component is designed with a standardized system interface. This interface includes both hardware requirements and software programming specifications. Any hardware unit connected to the system has to support this interface. This is shown in Figure 1(a) as the "SoC Interface". The software specification is defined as a set of assembly instructions, including communication, synchronization, and memory access instructions. All processing elements must support these instructions with pre-defined timing requirements. The hardware implementation of the interface consists of a DMA (Direct Memory Access) unit, a real-time clock, and hardware synchronization registers.

Figure 2: PE Architectural Diagram, showing the SIMD pipeline (8-bit ALUs, 16x8-bit register files, data shuffle network, 4 KB SIMD memory), the scalar pipeline (16-bit ALU, 16x16-bit register file, 4 KB scalar memory), and the SoC interface.

2.2 Processing Element

Figure 2 shows the architectural detail of a PE. The PE consists of two coupled parts: a scalar pipeline and a SIMD pipeline. The scalar pipeline contains the address generation unit (AGU) and is a single-issue, in-order, 16-bit RISC architecture. Its main purpose is to generate memory addresses for the SIMD pipeline, handle the kernel's control flow, and process scalar DSP algorithms (such as the interleaver). In most DSP algorithms, the core kernels are made up of shallow nested loops (one or two levels). Because of this, we choose not to implement a branch predictor, but add loop counter-based branch instructions instead. In addition, DSP kernels process data in stream buffers, thus most of the memory accesses are from data queues, which are directly supported by the AGU.

The SIMD pipeline consists of 32 8-bit clusters. Through the implementation of W-CDMA and 802.11a protocols, we found that most DSP algorithms have a high degree of SIMD parallelism. The core operations of filter, FFT, Viterbi/Turbo decoder, and rake receiver are all based on wide vector variables of narrow data width. Therefore, with the support of conditional operations on the clusters, we can efficiently utilize 32 clusters of 8-bit ALU computations. The PE's local scratchpad memory is divided into two clusters: one for the SIMD unit and the other for the scalar unit. Both memories have two read/write ports and there is a DMA engine that serves both memories.

Many DSPs have support for 8- and 16-bit operations. However, their clock cycle time is optimized for 32-bit arithmetic operations. This leads to lower power efficiency for 8- and 16-bit operations. In wireless protocols, the majority of the algorithms operate on 1- to 8-bit data, some algorithms operate on 16-bit data, and few operate on 32-bit data. Therefore, our system is optimized for 8-bit operations in the SIMD unit and 16-bit operations in the scalar unit. 16-bit support is provided in the SIMD unit by treating two register entries as one and using two cycles for 16-bit ALU operations (along with special hardware support for the carry in/out bits). The AGU registers are 12-bit, but only support 8-bit addition and subtraction. This is because the AGU is mainly used for software management of data buffers, for which 8 bits are sufficient. The higher 4 bits are used to address different PEs, as well as different memory buffers within PEs.
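As an illustration of the two mechanisms just described, the following C sketch emulates a 16-bit addition carried out as two 8-bit ALU steps linked by a carry, and an AGU-style pointer update in which only the low 8 bits take part in the arithmetic. It is a software model written for this explanation, not the hardware design: the function names and the exact placement of the 4 selector bits within the 12-bit register are assumptions.

```c
#include <stdint.h>

/* One 8-bit ALU step: adds a, b and carry_in, and reports the carry-out. */
static uint8_t add8(uint8_t a, uint8_t b, uint8_t carry_in, uint8_t *carry_out)
{
    uint16_t sum = (uint16_t)((uint16_t)a + b + carry_in);
    *carry_out = (uint8_t)(sum >> 8);
    return (uint8_t)(sum & 0xFFu);
}

/* 16-bit add performed as two 8-bit steps: low bytes first, then high
 * bytes plus the carry from the first step. The final carry-out is
 * dropped, so the result wraps modulo 2^16. */
static uint16_t add16_two_step(uint16_t x, uint16_t y)
{
    uint8_t carry = 0;
    uint8_t lo = add8((uint8_t)(x & 0xFFu), (uint8_t)(y & 0xFFu), 0, &carry);
    uint8_t hi = add8((uint8_t)(x >> 8), (uint8_t)(y >> 8), carry, &carry);
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

/* AGU-style pointer update on a 12-bit address. Only the low 8 bits are
 * added, so the offset wraps within a 256-entry buffer while the upper
 * 4 bits (PE/buffer selector) are preserved. The field layout is an
 * ASSUMPTION made for this sketch. */
static uint16_t agu_advance(uint16_t addr12, uint8_t step)
{
    uint16_t select = addr12 & 0x0F00u;
    uint16_t offset = (uint16_t)((addr12 + step) & 0x00FFu);
    return (uint16_t)(select | offset);
}
```

A side effect of restricting the AGU arithmetic to 8 bits is that buffer pointers wrap around naturally, which suits the queue-style data buffers mentioned in the text.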

3 System Evaluation

3.1 Wireless Protocol Mapping

Figure 3 shows the mapping of W-CDMA and 802.11a onto our 4 PE system. As W-CDMA is a full duplex protocol, the receiver and transmitter are running at the same time. Because of this, the transmitter and receiver are mapped onto their own PEs for W-CDMA. This contrasts with 802.11, where the transmission and reception phases are disjoint in time and thus the kernels for these modes can share PEs. This provides for a more balanced task allocation.

W-CDMA Mapping. In W-CDMA, the receiver requires much more computation than the transmitter. As shown in Figure 3(a), the receiver is assigned to 3 PEs, and the transmitter is assigned to 1 PE. Global memory contains three buffers. The FIFO buffer is used to buffer results between the receiver FIR filter and the searcher. The other 2 buffers are used to store intermediate results between the Rake receiver and the interleaver, and the interleaver and the Turbo decoder.

802.11a Mapping. In 802.11a, both receivers and transmitters are mapped onto the same set of hardware. Similar to the W-CDMA case, global memory is mainly used to buffer the intermediate data traffic of the interleaver. Unlike most other algorithms, the interleaver is a highly sequential algorithm. It requires a whole frame to be buffered before it can output its results.
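To see why an interleaver forces frame-sized buffering, consider the generic row/column block interleaver sketched below. It is not the 802.11a or W-CDMA permutation, which the paper does not spell out; it is a textbook interleaver chosen as an assumption for illustration, and the helper `input_needed_for_output` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic row/column block interleaver: symbols are laid out row by row
 * in a rows x cols grid and read out column by column.
 * in and out must not overlap and must each hold rows * cols symbols. */
static void block_interleave(const uint8_t *in, uint8_t *out,
                             size_t rows, size_t cols)
{
    size_t k = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            out[k++] = in[r * cols + c];
        }
    }
}

/* Input index that output position out_pos is read from. */
static size_t input_needed_for_output(size_t out_pos, size_t rows, size_t cols)
{
    size_t c = out_pos / rows;  /* column being read out */
    size_t r = out_pos % rows;  /* row within that column */
    return r * cols + c;
}
```

For a 4x4 frame, the fourth output symbol already depends on input index 12 of 16, so almost the whole frame has to arrive before the first column can be emitted. This is the property that makes the global buffers necessary.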

Figure 3: Mapping of W-CDMA and 802.11a onto the processing elements. (a) W-CDMA Protocol Mapping. (b) 802.11a Protocol Mapping.

3.2 Area and Power Results

Components
Memory (12KB)
ALU, Shifter & Mult.
Units
44
Area
Total Area
mm² 3.6
2%
4.7
11%
3%
Main Mem (64KB)
System
DMA
Total
11
2.9
1%
0.1
12%
100%
Power: W-CDMA 2 Mbps
Power
%
37.3%
562.4
7.5%
183.2
4.8
0.7%
2.5
0.0%
0.0
1381.7
90nm (1V @ 400MHz)
3.8
Power: 802.11a 24 Mbps
Total Power
mW
618.6
35.4%
297.6
10.6%
0.5%
80.0
1.3%
1.0
0.0%
100%
370.4

Table 1: System area and power summary for W-CDMA and 802.11a

Table 1 shows the power consumption and area breakdown for a 2 Mbps throughput W-CDMA and a 24 Mbps throughput 802.11a. The overall power results were 1,381 mW and 1,909 mW for W-CDMA and 802.11a, respectively. This assumed 180nm technology at 1.8V and 400MHz. Scaling these results to 90nm technology at 1V and 400MHz results in power values of 268 mW and 370 mW. Three components consume the majority of the power: 1) the register file, which consumes 34% for W-CDMA and 30% for 802.11a; 2) the local memory, which consumes 16% for W-CDMA and 14% for 802.11a; and 3) the scalar pipeline, which consists of the scalar memory, instruction queue, and miscellaneous logic, which consume 10%, 9%, and 10%, respectively, for both W-CDMA and 802.11a.
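The paper does not say how the 180nm figures were scaled to 90nm. As a rough consistency check only, the sketch below compares the reported totals against the standard dynamic-power model P ~ C·V²·f at a fixed 400 MHz clock. Using that model here is an assumption of this sketch, and the macro and function names are illustrative.

```c
/* Totals reported in the text, in mW. */
#define WCDMA_180NM_MW 1381.0
#define WCDMA_90NM_MW   268.0
#define WLAN_180NM_MW  1909.0  /* 802.11a */
#define WLAN_90NM_MW    370.0  /* 802.11a */

/* Factor by which dynamic power changes from supply voltage alone,
 * using P ~ C * V^2 * f with the clock frequency held constant. */
static double voltage_scale(double v_old, double v_new)
{
    return (v_new * v_new) / (v_old * v_old);
}

/* Factor left over once the V^2 term is removed from a reported
 * scaling. In the C * V^2 * f model this residual would correspond to
 * reduced switched capacitance; that attribution is an ASSUMPTION. */
static double residual_scale(double p_old_mw, double p_new_mw,
                             double v_old, double v_new)
{
    return (p_new_mw / p_old_mw) / voltage_scale(v_old, v_new);
}
```

Going from 1.8V to 1V accounts for a factor of about 0.31 on its own. For both protocols the remaining factor comes out at roughly 0.63, so the two reported 90nm values are at least consistent with a single common scaling rule.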

This table also shows the area breakdown. Unlike traditional processor architectures, the biggest component is the arithmetic units, not the memory units. This indicates that this processor architecture has a very high computational efficiency. By having small local memories, we are able to reduce the power consumption as well as decrease the die area. The global memory is shared between processors. Its size is dictated by the need to buffer whole frames during interleaving. Unless a more efficient interleaver algorithm is found, this global memory space is unavoidable.

4 Software Development Flow

4.1 Overall Design Flow

Figure 4 shows our software design flow. Algorithms are first debugged and verified functionally through either Matlab/Simulink or floating-point C implementations. In a manner similar to traditional SoC design, the development flow then separates into system-level and kernel-level design. They are both implemented in fixed-point format in SPEX (Signal Processing EXtensions for C). SPEX is a Matlab-like programming extension which offers first-class vector and matrix variables and operations. Unlike Matlab, SPEX also allows explicit variable declarations, including variable bit width and saturation mode operations. SPEX is explained in further detail in the next section. From this point on, the compiler can automatically generate assembly code, but programmers can also choose to hand code the DSP kernel assembly files for further optimizations.
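The paper does not show SPEX syntax, so the following is plain C rather than SPEX. It only illustrates the kind of information that an explicit bit width and saturation mode convey: an element-wise add over a vector of signed 8-bit values that clamps instead of wrapping. The function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturating signed 8-bit add: clamps the result to [-128, 127]
 * instead of wrapping on overflow. */
static int8_t sat_add8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX)
        return INT8_MAX;
    if (sum < INT8_MIN)
        return INT8_MIN;
    return (int8_t)sum;
}

/* Element-wise saturating add over n lanes. This scalar loop is a
 * reference model; a SIMD target could process many lanes per
 * instruction. */
static void vec_sat_add8(const int8_t *a, const int8_t *b,
                         int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = sat_add8(a[i], b[i]);
}
```

In plain C the lane width and overflow behavior are buried in helper code like this. Declaring them on the variable, as the text says SPEX allows, gives the compiler the same information directly.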

Machine code is generated in three steps from SPEX descriptions. First, machine- and timing-independent assembly code is generated. At this level, kernels are not assigned to processors, and system protocol descriptions are not incorporated with kernel assembly code. Second, machine-dependent assembly code is generated, where kernels are mapped onto processors, and system protocol descriptions are translated into real DMA and control instructions. Finally, real machine code is generated by merging the system-level and kernel-level assembly code. Programmers are given the flexibility to develop and debug code during any stage of the compilation. The efficiency of SPEX means our compiler does not need complex code-transformation techniques, making the assembly code easily accessible to DSP developers.

4.2 High-Level Programming Model

We proposed a multi-core, wide SIMD processor architecture. Given the difficulty in programming traditional DSPs, this new processor architecture provides even greater challenges for the programmers and compilers. In this section, we briefly describe our C language extension called SPEX (Signal Processing EXtension), which is aimed at narrowing the semantic gap between the description of high-end signal processing algorithms and their implementation. SPEX contains two major components: SIMD variables and concurrent kernel support. The former is suitable for expressing data parallelism within algorithms by providing SIMD data structures and explicit SIMD operations. The latter is suitable for expressing thread-level parallelism within algorithms through the use of concurrent kernel objects that communicate through channel objects. The intent is to separate coarse grain communication from fine grain communication. Coarse grain communication is best represented using the kernel extensions, and fine grain communication is best represented using the SIMD variable extension. We have not explicitly supported instruction-level parallelism in SPEX because modern parallelizing compilers are good at discovering it automatically.

Variable Extensions. SPEX contains a first-class SIMD data type and an attribute mechanism for specifying implementation details. The SIMD extensions consist of two major data structures: one for describing scalars, another for describing SIMD data. The SIMD data structure is constructed internally as an array of scalar variables, which supports both vector and matrix objects. Both data structures can be further elaborated through attributes, which allow the programmer to specify implementation details at the point of variable declaration. These variable attributes are then treated internally as compiler directives, which are interpreted by the compiler based on the specifics of the target DSP architecture.

Kernel Extensions. SPEX kernel extensions consist of two types of data objects: kernel objects and channel objects. Conceptually, kernel objects represent functions that can be executed concurrently, and channel objects are communication interfaces between kernels. With traditional C, DSP programmers have to manually implement these communication memory structures. This results in code that is difficult to read for humans and compilers alike. In SPEX, kernels are written independently and communicate through virtual channel objects. This separation removes memory management tasks from the programmers.

Figure 4: Software Development Flow

5 Conclusion

In this paper, we have presented a hardware and software solution for SDR. The hardware system is composed of a set of dual-issue asymmetric processing elements that each contain a scalar and a wide SIMD pipeline. A 4 processor version of this system is shown to meet the performance requirements of W-CDMA and 802.11a physical layer processing, and to have the power characteristics needed for mobile terminals. To support software development on this system, we provide a modular programming environment that includes separate system and kernel level specification and debugging. Programming is carried out using SPEX, a set of extensions to C for specifying vector/matrix objects and operators, along with virtualized inter-kernel communication. Our future work includes the implementation of a larger variety of protocols and a deeper exploration of efficiency trade-offs of programmable signal processing architectures.

6 Acknowledgement

Yuan Lin is supported by a Motorola University Partnership in Research Grant. This research is also supported by ARM Ltd., the National Science Foundation grant CCR-0325898, and equipment donated by Intel Corporation.

References

[1] DSP Developers' Village, Texas Instruments, http://dspvillage.ti.com.

[2] Freescale StarCore, http://www.freescale.com.

[3] Morpho Technologies, http://www.morphotech.com/.

[4] T. Austin, D. Blaauw, S. Mahlke, T. Mudge, C. Chakrabarti, and W. Wolf. Mobile supercomputers. Communications of the ACM, May 2004.

[5] R. Baines and D. Pulley. The picoArray and reconfigurable baseband processing for wireless basestations. In Software Defined Radio, 2004.

[6] J. Glossner, E. Hokenek, and M. Moudgill. The Sandbridge Sandblaster Communications Processor. 2004.

[7] B. Mohebbi and F. Kurdahi. Reconfigurable parallel DSP - rDSP. In Software Defined Radio, 2004.
