Open Grid Forum

OGF Public Comments

Public comments are a very important part of the OGF document approval process. Through public comments, documents are scrutinized by people with a wide range of expertise and interests. Ideally, an OGF document will be self-contained, relying only on the other documents and standards it cites to be clear and useful. Public comments of any type are welcome, from small editorial comments to broader comments about the scope or merit of the proposed document. Simply reading a document and commenting publicly that you read it and found it suitable for publication is very useful, and provides valuable feedback to the document authors.

Thank you for making public comments on this document!


Comments for Document: GLUE Specification v. 2.0
Author(s): S. Andreozzi, S. Burke, F. Ehm, L. Field, G. Galang, B. Konya, M. Litmaath, P. Millar, J. P. Navarro
Type: P-REC
Area: Management
Group: GLUE
Public Comment End: 13 Aug, 2008

To make anonymous comments, please use 'anonymous' as the username and 'guest' as the password.


Comments:


Posted by: Baur 2008-06-19 08:06:17
Improve and consolidate specification of DNs
In the public comment version of GLUE 2.0, two kinds of DNs
(Distinguished Names) with different delimiters are specified.

Section 16.3.8 defines as DNs: "X509 uses a X500 namespace represented
as several Relative Domain-Names (RDNs)
concatenated by forward-slashes". A slash-separated DN notation is also
used in the examples throughout the document.
I was not able to find such a definition in the X509 spec. As X509 stays
rather general, are you sure it defines a forward-slash
notation?

Section 17.4., in contrast, defines a DataType DN_T as a RFC 4515
Distinguished name.
RFC 4515 says "There is zero or more relative distinguished names,
separated by , for a distinguished name."

I propose to either

- specify both delimiters, fix the X509 citation and state clearly in
which cases which notation is to be used, or

- settle on the RFC 4515 notation (comma-separated), which seems to be
better standardized, and rewrite the examples.
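
To illustrate what rewriting the examples would involve, here is a minimal sketch of converting a slash-separated DN into a comma-separated one. It is not taken from the GLUE draft. The function name and the example DN are hypothetical, and the sketch assumes that the slash form lists the most general RDN first while the comma form lists the most specific RDN first. It escapes only commas and cannot handle values that themselves contain a forward slash.

```python
def slash_dn_to_comma_dn(slash_dn: str) -> str:
    """Convert a slash-separated DN to a comma-separated one (illustrative only)."""
    if not slash_dn.startswith("/"):
        raise ValueError("expected a DN that starts with '/'")
    # Split into RDNs; the leading '/' yields an empty first element, which is dropped.
    rdns = [part for part in slash_dn.split("/") if part]
    for rdn in rdns:
        if "=" not in rdn:
            # Typically caused by a '/' inside a value, which this sketch cannot handle.
            raise ValueError(f"not an attribute=value pair: {rdn!r}")
    # Assumption: the comma form lists the most specific RDN first, so reverse the order.
    # Only commas are escaped here; a complete converter needs more escaping rules.
    return ",".join(rdn.replace(",", "\\,") for rdn in reversed(rdns))


# Hypothetical example DN, used only to show the change in delimiter and order.
print(slash_dn_to_comma_dn("/C=IT/O=INFN/CN=Some User"))
```

The reversal of RDN order is the part most easily overlooked, which is one more reason to state clearly which notation applies where.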

Also at the beginning of section 16.3.8, the sentence "It must start
[...]" (state ?) should be improved.

ciao,

Timo


Posted by: wallom 2008-06-24 09:58:02
Heterogeneous systems
The current schema doesn't seem to include any way in which a system can be heterogeneous, i.e. with different types of worker node. We have two clusters which are arranged like this. The other point is that at the moment it appears rather messy to cover multiple different queues within a single compute resource.



Posted by: lfield 2008-06-25 04:18:18
Relationship between User Domain and Admin Domain
In the main entities diagram, I would like to suggest a relationship between the User Domain and Admin Domain. This relationship takes the form of a Service Level Agreement.



Posted by: mviljoen 2008-06-25 11:31:57
Miscellaneous observations
Section 7
I think the first sentence referring to "specializations" could be reworded - it's unclear to me what the word means in this context.

Section 17.24 AppEnvState_t
For the pending removal description, isn't "is due to be removed" more appropriate than "as soon as possible"? I think perhaps giving indications of time is out of context here.

Matt


Posted by: gevorg 2008-06-26 09:24:20
Extending the Entities with OtherInfo
OtherInfo, in the form of a String, lets particular implementations easily extend the information of an Entity. Unfortunately, not all Entities have that field.

I would suggest including the field OtherInfo in all Entities, as it already exists in some of them, as a placeholder to publish info that does not fit in any other attribute. Free-form strings, comma-separated tags, and (name, value) pairs are all examples of valid syntax.
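
A minimal sketch of how a consumer might tell these three forms apart is shown below. The function name is hypothetical, and the `name=value` encoding of a pair is an assumption made for illustration; the comment above does not say how a pair would be written.

```python
from typing import Dict, List, Union


def interpret_other_info(value: str) -> Union[Dict[str, str], List[str], str]:
    """Classify an OtherInfo string as a pair, a tag list, or free-form text."""
    text = value.strip()
    if "=" in text and "," not in text:
        # Assumed encoding of a (name, value) pair: "name=value".
        name, _, val = text.partition("=")
        return {name.strip(): val.strip()}
    if "," in text:
        # Comma-separated tags; empty entries are ignored.
        return [tag.strip() for tag in text.split(",") if tag.strip()]
    # Anything else is treated as a free-form string.
    return text


# Hypothetical values, one per syntax form.
print(interpret_other_info("scheduler=maui"))
print(interpret_other_info("gpu, infiniband"))
print(interpret_other_info("Maintained by site staff"))
```

The heuristic is ambiguous for free-form text that happens to contain a comma or an equals sign, which suggests that a binding would need to say how the forms are distinguished.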


Posted by: urbah 2008-07-01 11:31:26
Miscellaneous observations
First, thank you very much for your work:

Our EDGeS project will have to publish information on entities of 'Desktop Grids' (Grids of Computer Scavenging), where resources are volatile, and GLUE schema 2.0 seems flexible enough and has adequate place-holders for unknown data.


Chapter 5.3 Contact
-------------------
I suggest to add a 'Name' (String, 0..1) property :
Even if this property will not always contain the name of a real person, it can contain the detailed name of a responsibility.

Chapter 5.6 Endpoint
--------------------
Attribute 'TrustedCA' :
Typo in the description : 'issues' --> 'issued'

Chapter 5.11 Policy
-------------------
- Attribute 'Rule' :
Typo in the description : 'is provide' --> 'is provided'
- Last paragraph : 'then these policy instances SHOULD be expected to be consumed independently' : Could you be more explicit ?

Chapter 6.1 ComputingService
----------------------------
End of the second paragraph : 'as part of the computing service' :
I suggest to add 'same' before 'computing service'.

Chapter 6.6 ExecutionEnvironment
--------------------------------
Entity 'ExecutionEnvironment' :
Typo in the description : 'envonrment' --> 'environment'

Chapter 6.10 ToStorageService
-----------------------------
Entity 'ToStorageService' :
This entity is not at all symmetrical to the 'ToComputingService' entity.
In order to avoid confusion, I therefore suggest to rename 'ToStorageService' as 'ToPosixStorageService'.

Chapter 7. Conceptual Model of the Storage Service
--------------------------------------------------
Throughout this chapter, you use both words 'capacity' and 'extent' as if they are synonyms.
- If yes, please state it clearly at the beginning of the chapter.
- If not, please explain the difference.

Chapter 8. Relationship to OGF Reference Model
----------------------------------------------
- Can you provide more explanations :
For example, what is the meaning of the arrow between 'Entity' and 'GridComponent' ?
- Is it possible for you to show examples ?

Chapter 9. Security Considerations
----------------------------------
I suggest to write, at least, that concrete data models must ensure availability and reliability of the published data. Therefore :
- Resiliency to DoS attacks is mandatory,
- Resiliency to intrusion and counterfeits is mandatory,
- Dynamic redundancy can help.

Chapter 17. Appendix B: Data Types
----------------------------------
Is it possible to sort the enumeration types alphabetically?
That would let a reader in a hurry find an enumeration type more quickly.

Chapter 17.5 Capability_t
-------------------------
Value 'executionmanagement.candidatesetgenerator' :
Typos in the description : 'a nit of workcan' --> 'a unit of work can'

Chapter 17.9 EndpointHealthState_t
----------------------------------
I would suggest adding the following value:
'compromised' : It was possible to check that there are security issues

Chapter 17.19 Platform_t
------------------------
I suggest to be consistent with 'draft-ggf-jsdl-spec-28.doc' describing JSDL, and to use the values listed in Table 5-2 'Processor Architectures' of Chapter 5.2.1 of the JSDL document.

Chapter 17.21 OSFamily_t
------------------------
I suggest to be consistent with 'draft-ggf-jsdl-spec-28.doc' describing JSDL, and to use the values listed in Table 5-4 'Operating System Types' of Chapter 5.2.3 of the JSDL document.

Etienne URBAH


Posted by: urbah 2008-07-02 04:50:58
Chapter 5.10 Activity - There are 2 association ends named 'Activity.Id'
In chapter 5.10 'Activity', there are 2 association ends named 'Activity.Id'.
In order to avoid ambiguity, I suggest to rename them :
- One as 'ReferredActivity.Id'
- One as 'ReferredByActivity.Id'

Etienne URBAH


Posted by: jamesc 2008-07-18 04:37:31
ipService/ipHost/ipPort instead of endpoint?
The document assumes that all services are accessible via a URI - this leads to us 'making up' URI schemes for services just to express their IP hostname/port/service, e.g. lfc://host:port

RFC 2307 (http://www.ietf.org/rfc/rfc2307.txt) gives a standard mapping for entities related to TCP/IP networking and services (e.g. hostname/port/service combinations) to LDAP. Perhaps a similar model could be used in GLUE 2.0?
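
As a rough sketch of the idea, the snippet below splits such a made-up URI into host, port and service name. The function name, the host name and the port number are hypothetical, the transport protocol is assumed to be TCP, and the attribute names `cn`, `ipServicePort` and `ipServiceProtocol` follow the ipService object class of RFC 2307 as I understand it.

```python
from urllib.parse import urlsplit


def endpoint_to_ip_service(url: str) -> dict:
    """Split a scheme://host:port URI into host and service information."""
    parts = urlsplit(url)
    if not parts.scheme or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {url!r}")
    return {
        "host": parts.hostname,
        "service": {
            "cn": parts.scheme,            # service name, taken from the URI scheme
            "ipServicePort": parts.port,   # integer port number
            "ipServiceProtocol": "tcp",    # assumption: not derivable from the URI
        },
    }


# Hypothetical endpoint; the host and port are placeholders.
print(endpoint_to_ip_service("lfc://lfc.example.org:5010"))
```

Note that the URI carries no transport protocol, so a host/port/service model would publish slightly more information than the endpoint URL does today.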

This mapping is already used by some common services for dynamic lookup, e.g. Apache ActiveMQ :


Posted by: jamesc 2008-07-18 04:45:10
ComputingActivity and Usage Records
The information in a ComputingActivity seems to overlap with that in the Usage Records. Perhaps an effort should be made to standardize between these two efforts.


Posted by: loomis 2008-07-24 03:54:04
typographical errors
Table of contents: All of the page numbers are incorrect.

Sec. 6.3, p21, MinWallTime entry: "than" -> "then".

Sec. 6.4, p24, CacheTotal and CacheFree entries: "consequent" -> "subsequent".




Posted by: loomis 2008-07-24 05:11:52
disk space types and sizes
There is a general need for users to be able to specify the minimum amount of free disk space for a job. For more complicated jobs, this requirement also entails specifying the amount of free disk space "locally" (e.g. on the worker node itself) and the amount of free disk space on a "shared" area, particularly for parallel jobs. This draft moves closer to satisfying these requirements and is appreciated. However, I do think that some changes are needed to satisfy fully those requirements.

In Sec. 6.4 for the ComputingManager entity, I have a couple of comments related to the WorkingArea* attributes:

1) On my site, we typically give normal and MPI jobs different current working directories (WorkingAreas). For the normal jobs, we create a temporary area in /var for the job. This area is removed at the end of the job. For MPI jobs, we set the working directory to the shared home directory so that all of the processes in the MPI job see the same files. For this case, are we expected to publish multiple ComputingManager entities? Probably this is really a question on whether this information is placed correctly in the overall model.

2) Actually for all jobs both the temporary area in /var and the shared home are always visible. We set the current working directory as appropriate; however, a job technically has access to both of these areas. I imagine that there may be jobs (esp. parallel ones) that would like to take advantage of both. Perhaps you should consider being able to publish multiple "WorkingAreas" for a ComputingManager with a mechanism to identify the default for a particular job.

The "Cache*" attributes imply that some caching mechanism will be made available to users. However, this raises lots of questions about who the cache is open to, who manages contention for the cache between multiple jobs, how the existence of files in the cache is published to users, etc. These two attributes provide too little information for end-users to know what to do and I question whether including them is useful.

For the "ScratchDir" attribute, the description says this is a shared area. However, the extent of that sharing isn't clear. Is it shared between all grid jobs (from the same user) or all processes in a job? For a normal job, would a dedicated area on the local disk count as a "ScratchDir"? It is important to clearly distinguish the "TmpDir", "ScratchDir", and "ApplicationDir" properties so that system administrators publish something consistent and these values are meaningful to users.

For the three attributes "TmpDir", "ScratchDir", and "ApplicationDir", the values are described as absolute paths or paths. However, all of these are very likely to be different depending on the user running a job. Are environment variables allowed to be published for these values? If not, in cases where a unique value cannot be given, what should be published?

The "*Dir" attributes indicate that at least three different types of storage can potentially be made available to users. There are certainly applications that will want to take advantage of the differences between those storage resources. However, this also means that users will want to specify how much space they need in those various areas. Currently, there are no attributes to actually indicate how much space is available (to a job) in those areas. Having attributes that indicate the existence of these areas without giving parameters such as the size is probably not terribly helpful to users.

The last comment raises a similar problem with Sec. 6.3 "Computing Share". There is only one attribute ("MaxDiskSpace") to specify the maximum disk space policy. If multiple types of storage areas are advertised (as in the ComputingManager), then the policy should also contain attributes corresponding to each type.

In addition, the description of MaxDiskSpace implies that the value corresponds to the "WorkingArea" of the ComputingManager. If that is the case, then explicitly saying "WorkingArea" would make it clear what the limit applies to.

Cal



Posted by: loomis 2008-07-24 05:49:24
WallTime and CPUTime specifications
Consistently throughout the specification WallTime and CPUTime attributes are specified per slot. However, this is likely to make it very difficult to publish reliable values for queue limits when parallel jobs are permitted on a site.

Many batch systems enforce overall wall times and total CPU times; hence, those are the values that a system administrator will set when configuring her batch system. These will correspond to the "MaxTotal*Time" attributes (Sec. 6.3 Computing Share). However, what will she publish for the "Max*Time" attributes? If only normal jobs are accepted there is no problem; they are the same as "MaxTotal*Time". But if the site accepts parallel jobs with up to 100 slots, what is the correct value to publish for "Max*Time"? The actual limit depends on the number of slots requested; a single correct number cannot be published.
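
The dependence on the slot count can be made concrete with a small sketch. It assumes that the batch system enforces one total CPU-time budget per job and that the budget is spread evenly over the requested slots. The function name and the numbers are hypothetical, and the unit is taken to be seconds.

```python
def per_slot_limit(max_total_time: int, slots: int) -> float:
    """Per-slot time limit implied by a total limit shared evenly across slots."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return max_total_time / slots


# Hypothetical total CPU-time limit configured in the batch system, in seconds.
MAX_TOTAL_CPU_TIME = 360_000

for slots in (1, 10, 100):
    # The implied per-slot value changes with every slot count, so no single
    # "Max*Time" value describes the queue correctly.
    print(slots, per_slot_limit(MAX_TOTAL_CPU_TIME, slots))
```

Under these assumptions the per-slot figure is a function of the request rather than a property of the queue, which is the difficulty described above.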

No doubt one can configure the batch system for per slot limits, but this is certainly not the usual case nor the most straightforward. I suspect that there will be many sites that publish incorrect values for the "per slot" attributes, diminishing the utility of these values for scheduling.

The "per slot" values are also likely to be much less interesting for users. The typical case is that one has a parallelized application and one increases the number of CPUs to find the most efficient scale of the application. With the "per slot" values, the user must recalculate the CPU and Wall limits every time she resubmits the job with a different number of slots. However, the total CPU consumed is approximately the same and the wall clock time diminishes. This means that it would be much more convenient to specify the wall and total CPU limits once and only have to change one parameter in the job description.

Overall, I would suggest revisiting the decision to use "per slot" values. For users and for system administrators the "total" values are likely to be more consistent and useful.

Cal



Posted by: Baur 2008-07-28 11:02:51
serviceType_t
The second elements of the proposed serviceTypes in chapter 17.6 are not consistent with the definition of the second element in the same chapter (which was defined as middleware name).

There should be a clearly defined distinction in that type tree between middleware and grid organisations to avoid the usage of different strings for the same service by different grids.

For Globus, for example, org.globus.ws-gram could be better than org.teragrid.ws-gram, because the former notation would be consistent with the definition of org.glite.wms

It could also be an improvement to define the second element as the name of the service implementation provider (like globus-alliance, EGEE, etc.)

Timo


Posted by: romberg 2008-08-01 09:36:12
ComputingService / serviceType_t
Given that the computing service represents the abstract functionality of a system, the serviceType_t as defined in the document doesn't make sense. It must be compute service or data service or something similar.


Posted by: romberg 2008-08-01 10:00:41
ExecutionEnvironment
It is not obvious from the definition/description of the ExecutionEnvironment what the definition of Instances is. That is, it is the number of homogeneous nodes in a cluster which can be requested by a ComputeManager. The concept allows modelling heterogeneous clusters - it needed some discussion before it was understood.


Posted by: romberg 2008-08-01 10:03:20
number of jobs per middleware
We were looking for a way to include the number of total/running/waiting jobs submitted by a specific middleware. Could the ComputingEndpoint be extended by these values?


Posted by: fisher 2008-08-05 11:59:12
Various comments
Firstly I must say that I think this is a useful and important document.

There are quite a lot of typos to eliminate - and even words that a spell checker should pick up.

In the introduction you refer to a conceptual model but then in section 3 you refer to an information provider - this does not sound conceptual.

The last paragraph of section does not add anything. Are you allowed to add and delete from GLUE and still call it GLUE?

In chapter 5 the terminology gets confused. You use the term entity (as in ER I presume) where you have previously used class and you also have an entity called "entity". I would change, for example, at the top of page 7 to read "This entity is the root class from which all the GLUE classes inherit ...". Your UML diagram then has classes and one class is called Entity.

Entity - creation time - is this the creation time of this description of the entity or of its real world counterpart? In either case it cannot be optional - if we ignore theological debates - everything was created at some time. This is supposed to be a conceptual model as it says in the introduction. I see, reading ahead a bit, in Appendix A that you explain about place holders for unknown data. Again this is not conceptual unless some information is inherently unknowable. All of appendix A should be in the particular bindings. For example if it is relational and you don't know the value then you can use NULL. It might be the responsibility of a specific profile to define what is actually meant by missing information - e.g. "Don't know", "not applicable", "I am not going to tell you" etc.

Location - latitude and longitude - this cannot be optional for a geographical location. On the contrary there are many geographical locations that have no human readable name.

AdminDomain - Distributed - I don't like the definition very much. More of a problem is implementing the optional status. Many information systems only have 2 values for booleans.

I suggest that you are very clear about what this document is, how it relates to the bindings and whether or not you are going to use profiles to define what information should actually be published and how to interpret missing/null information.

Steve Fisher


Posted by: rodwalker 2008-08-11 18:12:59
RAM per job slot
MainMemorySize/CPUMultiplicity could give RAM per job but it still doesn't allow for the Max RAM of an execution queue to be set. A soft and hard limit is often configured by prudent admins.
As I watch ATLAS reco jobs being butchered by various Batch Systems, this is the quantity I'm most interested in.



OGF(SM), Open Grid Forum(SM), Grid Forum(SM), and the OGF Logo are trademarks of OGF