Thursday, February 7, 2013

ArcGIS Dissolve challenge

Ragnvald Larsen has a solution for Dissolve hanging when there are a lot of features. He also proposes that there is a threshold for efficiency when running Dissolve. Take a look at his article, Solving the ESRI arcpy dissolve challenge.

Note that his solution would work outside of Python. For instance, you may use ModelBuilder to construct this logical model.

I suspect, though, that if he were to convert the shapefile to a file geodatabase, the results would be different. When you have 'big' data, shapefiles tend not to work well.

I've just received a great explanation of a possible cause for this and a solution that could shed some light on this. It's from Charles Convis of ESRI. I've highlighted some statements that I think are valuable. Thank you, Charles.

Hi, I'm working with datasets in the many millions of features with lots of vector processes including dissolves. A possible source of your problem is "godzilla polygons", i.e. single polygons with a large number of vertices. I would suspect this is very likely with the Norwegian coastline. Godzillas will often hang and crash without informative errors. Godzillas are also common when working with data from different scales, and data that was originally hand-digitized by someone who didn't know the difference between streaming and point modes, i.e. they are more common than you think.
Here is a systematic way to deal with them:

1. Add a vertexcount field to your attribute table and calc it to !shape.pointcount!, as in:
arcpy.AddField_management(gpoly, "VERTEXCOUNT", "LONG")
arcpy.CalculateField_management(gpoly, "VERTEXCOUNT", "!shape.pointcount!", "PYTHON", "")

2. Open up your attribute table and sort descending on VERTEXCOUNT to get a quick summary look at your possible godzilla polygons. Depending upon your hardware, anything over 10,000 vertices can cause problems. Geodatabases on a higher end machine can handle 50,000 for most processes.

3. You get rid of vertices with the Dice command, using the limit you determine from the exercise above and some old-fashioned trial and error on your machine, as in:
arcpy.Dice_management(gpoly, gpolydice, 50000)
Dice is analogous to the script you wrote, but rather than lowering feature counts by splitting files, it cuts large polygons up so they'll behave. (If your script split up your files along arbitrary boundaries, you would have been achieving the same effect of cutting up large polygons at the same time as you were lowering your feature counts in each file.)

4. Now your polygons should be much more amenable to all of the rest of your processes. Also you are more likely to be able to successfully run any of the other more standard polygon simplify commands that thin or generalize your linework so as to have fewer vertices.

5. In the end, a simple dissolve will get rid of your dice lines, but it's worth re-calculating your vertexcount just to make sure you didn't inadvertently create godzillas with your dissolve operations.  Godzillas are a common side-effect of dissolves.
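Steps 1 to 3 can also be scripted end to end. What follows is only a minimal sketch and not part of the original message: the function names find_godzillas and dice_godzillas are made up for illustration, the 10,000 default simply echoes the figure quoted in step 2, and it assumes ArcGIS 10.1 or later (for arcpy.da) with a license level that includes Dice.

```python
def find_godzillas(vertex_counts, limit=10000):
    """Return (oid, count) pairs whose count exceeds limit, largest first.

    vertex_counts is any iterable of (oid, count) pairs; rows with a
    null count (for example, features with empty geometry) are skipped.
    """
    big = [(oid, n) for oid, n in vertex_counts
           if n is not None and n > limit]
    return sorted(big, key=lambda pair: pair[1], reverse=True)


def dice_godzillas(gpoly, gpolydice, limit=10000):
    """Count vertices, report the worst offenders, then dice (steps 1-3)."""
    # Imported here so find_godzillas() can be used without ArcGIS installed.
    import arcpy

    # Step 1: store each feature's vertex count in a field.
    arcpy.AddField_management(gpoly, "VERTEXCOUNT", "LONG")
    arcpy.CalculateField_management(
        gpoly, "VERTEXCOUNT", "!shape.pointcount!", "PYTHON")

    # Step 2: scripted stand-in for sorting the attribute table by hand.
    with arcpy.da.SearchCursor(gpoly, ["OID@", "VERTEXCOUNT"]) as rows:
        godzillas = find_godzillas(rows, limit)

    # Step 3: only dice when something actually exceeds the limit.
    if godzillas:
        arcpy.Dice_management(gpoly, gpolydice, limit)
    return godzillas
```

Calling dice_godzillas on the output of a later Dissolve is one way to do the re-check suggested in step 5, since the returned list shows whether the dissolve created new godzillas.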

general tips for handling problems and crashes:

6. If possible, move to a file geodatabase; stability and capability are orders of magnitude greater than with shapefiles. 7,000 polygons may stress a shapefile, but it won't make a geodatabase even break a sweat. I run geodatabases with 5 million features on an average PC often.

7. If possible, fire up Task Manager and watch your processes while they are running. %CPU use is less informative than physical memory usage. A normal process will run along at, say, 50% RAM usage with plenty of fluctuations up and down, sometimes strong fluctuations. That's normal. The behavior of a runaway process is often to ramp up linearly and steadily with no fluctuations. If it hits 100% and stays there you likely have a crash. Try watching it sometime when it's running a job you are having problems with and you may find other early warning signs.

8. As I've said several times before, problems with ArcGIS processing can more often be traced to these kinds of data issues than to faults in the software. Sure there are bugs, but in my experience problems in the datasets themselves are a lot more common. Also, as a general observation, software issues seem to me to manifest as soon as I enter the command. Data issues I uncover tend to show up later on during processing.

Charles Convis
Esri Conservation Program
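The memory pattern described in tip 7 can be checked roughly in code as well as by eye. The sketch below is an assumption-laden illustration, not something from Charles's message: looks_like_runaway is a hypothetical helper, and the five-sample window is an arbitrary choice. It takes memory-use percentages sampled at regular intervals and flags a series that rises at every step, or sits pinned at 100%.

```python
def looks_like_runaway(samples, window=5):
    """Flag a steady climb, or a flat line at 100%, in recent memory samples.

    samples is a list of memory-use percentages taken at regular
    intervals; only the last `window` values are examined.
    """
    if len(samples) < window:
        return False  # not enough history to judge
    recent = samples[-window:]
    steps = [later - earlier for earlier, later in zip(recent, recent[1:])]
    steadily_rising = all(step > 0 for step in steps)
    pinned = all(value >= 100 for value in recent)
    return steadily_rising or pinned
```

The samples could be collected with something like the third-party psutil package (psutil.virtual_memory().percent) in a loop while the geoprocessing job runs. A healthy job with normal ups and downs will return False most of the time.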

1 comment:

  1. Thanks for the article, everyone doing Geoprocessing should keep this information in mind.