GraphGen4Code | A Toolkit for Generating Code Knowledge ...


Knowledge graphs have been proven extremely useful in powering diverse applications in semantic search and natural language understanding. In this work, we present GraphGen4Code, a toolkit to build code knowledge graphs that can similarly power various applications such as program search, code understanding, bug detection, and code automation. GraphGen4Code uses generic techniques to capture code semantics, with the key nodes in the graph representing classes, functions and methods. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code) and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). Our toolkit uses named graphs in RDF to model graphs per program, or can output graphs as JSON. We show the scalability of the toolkit by applying it to 1.3 million Python files drawn from GitHub, 2,300 Python modules, and 47 million forum posts. This results in an integrated code graph with over 2 billion triples. We make the toolkit to build such graphs, as well as the sample extraction of the 2 billion triples graph, publicly available to the community for use.

Paper: https://arxiv.org/abs/2002.09440

Download the GraphGen4Code dataset as N-Quads from here.
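To make the per-program named-graph idea concrete, here is a minimal sketch. It is not the toolkit's implementation: it uses Python's standard `ast` module to find call sites in one script and writes one N-Quads line per call, using the script's own IRI as the graph name. The `http://example.org/` namespace, the `calls` predicate, and the helper names are hypothetical and do not reflect GraphGen4Code's schema. The sketch records only which functions are called, whereas the toolkit derives data flow from program analysis.

```python
import ast

BASE = "http://example.org/"  # hypothetical namespace, not the toolkit's


def dotted_name(node):
    """Return a dotted name such as 'pd.read_csv' for a Name/Attribute chain, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on the result of another call


def call_quads(source, program_id):
    """Yield one N-Quads line per call whose callee is a plain or dotted name."""
    graph = f"<{BASE}program/{program_id}>"  # one named graph per program
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                # subject, predicate, object, graph label
                yield f"{graph} <{BASE}calls> <{BASE}function/{name}> {graph} ."


snippet = "import pandas as pd\ndf = pd.read_csv('a.csv')\nprint(df.head())\n"
quads = sorted(call_quads(snippet, "demo.py"))

for line in quads:
    print(line)
```

For this snippet the sketch emits one line for each of the three call sites (`pd.read_csv`, `print`, and `df.head`). Because every line carries the program's IRI in the fourth position, quads from many scripts can be concatenated into one file and still be queried per program.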
Table of Contents

Sample Graphs Generated by GraphGen4Code
Applications
    Automated Machine Learning (AutoML)
    Recommendation engine for developers
    Building Language Models for Code Understanding
    Enforcing best practices
    Learning from big code
Sample Schema
Create your own graph
Example Queries
Publications

Sample Graphs Generated by GraphGen4Code

1.3 Million Python Programs from GitHub

To demonstrate GraphGen4Code's scalability, we build graphs for 1.3 million Python programs on GitHub (where a program refers to a single Python script), each analyzed into its own separate graph. We also use the toolkit to link library calls to documentation and forum discussions, by identifying the most commonly used modules in code and trying to connect their classes, methods or functions to relevant documentation or posts. For forum posts, we used information retrieval techniques to connect each post to its relevant methods or classes. We performed this linking for 257K classes, 5.8M methods, and 278K functions, and processed 47M posts from StackOverflow and StackExchange. This shows the feasibility of using the GraphGen4Code toolkit for building large-scale knowledge graphs for code that capture code semantics as well as natural language artifacts about code.

All graph files are available here. To load and query this data, please follow the instructions here. We also provide scripts for creating a Docker image with the graph database ready to use.

ETH 150k Python Dataset

We also used GraphGen4Code to produce graphs for the ETH 150k Python Dataset collected from GitHub. The ETH-150K dataset has been used to train models for code recommendation, type inference, program repair, etc. We provide graph data for this dataset in both JSON and RDF N-Quads formats.

Schema

The following shows a code snippet example as well as a high-level overview of the information generated by GraphGen4Code from code analysis, StackOverflow, and docstrings. We provide a random sample of each data source in RDF format here.
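The text above says only that "information retrieval techniques" connect forum posts to methods or classes, and does not name the technique. As a rough illustration of what such linking could look like, the sketch below scores a post against each candidate function's documentation with a plain TF-IDF dot product. The scoring scheme, the function names `tokens` and `link_post`, and the two-entry candidate table are assumptions made for this example, not the toolkit's method or data.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens; names like 'read_csv' split into 'read', 'csv'."""
    return re.findall(r"[a-z0-9]+", text.lower())


def link_post(post, candidates):
    """Return the candidate name whose name and documentation best match `post`.

    `candidates` maps a qualified function or class name to its documentation text.
    """
    docs = {name: Counter(tokens(name + " " + doc)) for name, doc in candidates.items()}
    n = len(docs)
    # Document frequency: in how many candidates each token occurs.
    df = Counter(t for bag in docs.values() for t in bag)
    query = Counter(tokens(post))

    def score(bag):
        # Smoothed IDF: tokens shared by every candidate contribute nothing.
        return sum(
            query[t] * bag[t] * math.log((1 + n) / (1 + df[t]))
            for t in query
            if t in bag
        )

    return max(docs, key=lambda name: score(docs[name]))


# Illustrative candidates only; a real run would use extracted documentation.
candidates = {
    "pandas.read_csv": "Read a comma-separated values (csv) file into DataFrame.",
    "sklearn.svm.SVC": "C-Support Vector Classification.",
}
best = link_post("How do I read a csv file with a custom separator?", candidates)
print(best)
```

At the scale reported above (47M posts against millions of methods), scoring every pair this way would be impractical, so a real pipeline would presumably rely on an inverted index or a search engine. The sketch only shows the matching idea.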
Figure: Code Snippet Example
Figure: Data flow graph for the running example
Figure: StackOverflow Graph Example
Figure: Docstrings Graph Example

Publications

If you use GraphGen4Code in your research, please cite our work:

@article{abdelaziz2020codebreaker,
  title={A Demonstration of CodeBreaker: A Machine Interpretable Knowledge Graph for Code},
  author={Abdelaziz, Ibrahim and Srinivas, Kavitha and Dolby, Julian and McCusker, James P},
  journal={International Semantic Web Conference (ISWC) (Demonstration Track)},
  year={2020}
}

@article{abdelaziz2021graph4code,
  title={A Toolkit for Generating Code Knowledge Graphs},
  author={Abdelaziz, Ibrahim and Dolby, Julian and McCusker, James P and Srinivas, Kavitha},
  journal={The Eleventh International Conference on Knowledge Capture (K-CAP)},
  year={2021}
}

@inproceedings{abdelaziz2022blanca,
  title={Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better Language Models for Code Understanding},
  author={Ibrahim Abdelaziz and Julian Dolby and Jamie McCusker and Kavitha Srinivas},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2022)},
  year={2022}
}


