GraphGen4Code | A Toolkit for Generating Code Knowledge ...
Knowledge graphs have been proven extremely useful in powering diverse applications in semantic search and natural language understanding. In this work, we present GraphGen4Code, a toolkit to build code knowledge graphs that can similarly power various applications such as program search, code understanding, bug detection, and code automation. GraphGen4Code uses generic techniques to capture code semantics, with the key nodes in the graph representing classes, functions and methods. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). Our toolkit uses named graphs in RDF to model graphs per program, or can output graphs as JSON. We show the scalability of the toolkit by applying it to 1.3 million Python files drawn from GitHub, 2,300 Python modules, and 47 million forum posts. This results in an integrated code graph with over 2 billion triples. We make the toolkit to build such graphs, as well as the sample extraction of the 2 billion triples graph, publicly available to the community for use.

Paper: https://arxiv.org/abs/2002.09440

The GraphGen4Code dataset is available for download as N-Quads.
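Because the dataset is distributed as N-Quads with one named graph per program, a first step when exploring a sample is often to group statements by their graph label. The following is a minimal sketch of that idea using only the Python standard library. The IRIs and predicates in `SAMPLE` are hypothetical placeholders, not the actual GraphGen4Code schema, and the regular expression covers only simple single-line statements rather than the full N-Quads grammar. For real data, a dedicated RDF library or a graph database is likely the better choice.

```python
import re
from collections import defaultdict

# One RDF term: an IRI, a blank node, or a literal with an optional
# datatype or language tag. Simplified; not a full N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'

# subject predicate object [graph] .
QUAD = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}(?:\s+{TERM})?\s*\.\s*$')


def group_by_graph(lines):
    """Group (subject, predicate, object) triples by named-graph label."""
    graphs = defaultdict(list)
    for line in lines:
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = QUAD.match(line)
        if match is None:
            raise ValueError(f"not a recognized N-Quads statement: {line!r}")
        subj, pred, obj, graph = match.groups()
        # Statements without a graph label go to a "default" bucket.
        graphs[graph or "default"].append((subj, pred, obj))
    return dict(graphs)


# Hypothetical sample: two programs, each in its own named graph.
SAMPLE = """\
<http://example.org/prog1/call/1> <http://example.org/schema/flowsTo> <http://example.org/prog1/call/2> <http://example.org/graph/prog1> .
<http://example.org/prog1/call/2> <http://example.org/schema/label> "fit" <http://example.org/graph/prog1> .
<http://example.org/prog2/call/1> <http://example.org/schema/label> "read_csv" <http://example.org/graph/prog2> .
"""

graphs = group_by_graph(SAMPLE.splitlines())
for name, triples in graphs.items():
    print(name, len(triples))
```

Keying on the graph label mirrors the per-program modelling described above: all statements about one analyzed script can be retrieved together without scanning the rest of the data.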
Table of Contents
- Sample Graphs Generated by GraphGen4Code
- Applications
- Automated Machine Learning (AutoML)
- Recommendation engine for developers
- Building Language Models for Code Understanding
- Enforcing best practices
- Learning from big code
- Sample Schema
- Create your own graph
- Example Queries
- Publications

Sample Graphs Generated by GraphGen4Code

1.3 Million Python Programs from GitHub

To demonstrate GraphGen4Code's scalability, we build graphs for 1.3 million Python programs on GitHub (where a program refers to a single Python script), each analyzed into its own separate graph. We also use the toolkit to link library calls to documentation and forum discussions, by identifying the most commonly used modules in code and trying to connect their classes, methods or functions to relevant documentation or posts. For forum posts, we used information retrieval techniques to connect each post to its relevant methods or classes. We performed this linking for 257K classes, 5.8M methods, and 278K functions, and processed 47M posts from StackOverflow and StackExchange. This shows the feasibility of using the GraphGen4Code toolkit for building large-scale knowledge graphs for code that capture code semantics as well as natural language artifacts about code.

All graph files are available for download. Instructions for loading and querying this data are provided, along with scripts for creating a Docker image with the graph database ready to use.

ETH 150k Python Dataset

We also used GraphGen4Code to produce graphs for the ETH 150k Python Dataset collected from GitHub. The ETH-150K dataset has been used to train models for code recommendation, type inference, program repair, etc. We provide graph data for this dataset in both JSON and RDF N-Quads formats.

Schema

The following shows a code snippet example as well as a high-level overview of the information generated by GraphGen4Code from code analysis, StackOverflow, and docstrings. We provide a random sample of each data source in RDF format.
[Figure: Code snippet example]
[Figure: Data flow graph for the running example]
[Figure: StackOverflow graph example]
[Figure: Docstrings graph example]

Publications

If you use GraphGen4Code in your research, please cite our work:

@article{abdelaziz2020codebreaker,
  title={A Demonstration of CodeBreaker: A Machine Interpretable Knowledge Graph for Code},
  author={Abdelaziz, Ibrahim and Srinivas, Kavitha and Dolby, Julian and McCusker, James P},
  journal={International Semantic Web Conference (ISWC) (Demonstration Track)},
  year={2020}
}

@article{abdelaziz2021graph4code,
  title={A Toolkit for Generating Code Knowledge Graphs},
  author={Abdelaziz, Ibrahim and Dolby, Julian and McCusker, James P and Srinivas, Kavitha},
  journal={The Eleventh International Conference on Knowledge Capture (K-CAP)},
  year={2021}
}

@inproceedings{abdelaziz2022blanca,
  title={Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better Language Models for Code Understanding},
  author={Ibrahim Abdelaziz and Julian Dolby and Jamie McCusker and Kavitha Srinivas},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2022)},
  year={2022}
}