Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach*
Stability AI
*Equal contribution. stability.ai

Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

1. Introduction

Diffusion models create data from noise (Song et al., 2020). They are trained to invert forward paths of data towards random noise and, thus, in conjunction with approximation and generalization properties of neural networks, can be used to generate new data points that are not present in the training data but follow the distribution of the training data (Sohl-Dickstein et al., 2015; Song & Ermon, 2020). This generative modeling technique has proven to be very effective for modeling high-dimensional, perceptual data such as images (Ho et al., 2020). In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2023; Dai et al., 2023; Esser et al., 2023; Blattmann et al., 2023b; Betker et al., 2023; Blattmann et al., 2023a; Singer et al., 2022). Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (Karras et al., 2023; Liu et al., 2022).

While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy between training and test distributions and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency. While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.

A particular choice for the forward path is the so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Although this model class has better theoretical properties, it has not yet become decisively established in practice. So far, some advantages have been empirically demonstrated in small and medium-sized experiments (Ma et al., 2024), but these are mostly limited to class-conditional models. In this work, we change this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020). Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits.

We show that the widely used approach for text-to-image synthesis, where a fixed text representation is fed directly into the model (e.g., via cross-attention (Vaswani et al., 2017; Rombach et al., 2022)), is not ideal, and present a new architecture that incorporates learnable streams for both image and text tokens, which enables a two-way flow of information between them. We combine this with our improved rectified flow formulation and investigate its scalability. We demonstrate a predictable scaling trend in the validation loss and show that a lower validation loss correlates strongly with improved automatic and human evaluations.

Our largest models outperform state-of-the-art open models such as SDXL (Podell et al., 2023), SDXL-Turbo (Sauer et al., 2023), PixArt-α (Chen et al., 2023), and closed-source models such as DALL-E 3 (Betker et al., 2023) both in quantitative evaluation
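The straight-line property discussed above can be made concrete with a small numerical sketch. This is not the paper's implementation, only an illustration of the standard rectified flow formulation from Liu et al. (2022): a point on the path is z_t = (1 − t)·x0 + t·ε, the ground-truth velocity ε − x0 is constant in t, and so one Euler step with that velocity maps pure noise back to the data exactly. The model v_theta mentioned in the comments is a hypothetical placeholder for a learned velocity network.

```python
import numpy as np

# Rectified flow (Liu et al., 2022) connects a data sample x0 and a noise
# sample eps on a straight line:  z_t = (1 - t) * x0 + t * eps,  t in [0, 1].
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))   # toy "data" sample
eps = rng.normal(size=(4,))  # toy noise sample

def z(t):
    """Point on the straight forward path: data at t=0, noise at t=1."""
    return (1.0 - t) * x0 + t * eps

# The velocity dz_t/dt = eps - x0 is constant along the path, which is
# what makes the path straight.
velocity = eps - x0

# Because the path is straight, a single Euler step from t=1 to t=0 with
# the true velocity recovers the data exactly from pure noise:
x_reconstructed = z(1.0) - 1.0 * velocity
assert np.allclose(x_reconstructed, x0)

# A learned model v_theta(z_t, t) (hypothetical here) would be trained to
# regress this velocity, e.g. with the conditional flow matching loss
# E[ || v_theta(z_t, t) - (eps - x0) ||^2 ]; curved forward paths instead
# have t-dependent velocities and need many integration steps at sampling.
```

This is why the paper emphasizes straightness: with a perfectly straight learned path, sampling cost collapses to very few network evaluations, whereas curved paths accumulate integration error at each of many steps.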